In the last year and a half, I read about 30 books. Here are five great books that I read more than once and kept going back to for practical applications.
Back in November, I started an internal training for my colleagues, walking through basic automation in Python for data analysis. This post is a guide to using the training material.
If you have ever pulled your hair out fixing the performance of a model built by someone else, all while wondering “why is this happening to me?”, then this post may be for you.
In this last post of the series, I will continue automating Excel with Python and show you how to use a few commands outside the Home Tab.
There has been a lot of talk about Agile around me recently. So I want to summarize my understanding of Agile in data science in a series of posts. This post covers where the discussion usually starts: Waterfall versus Agile.
In this post, I will show you how to automate Excel away using the Python package openpyxl, starting with every button on the Home Tab. No more drag and drop!
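As a taste of what openpyxl can do, here is a minimal sketch, assuming openpyxl is installed; the file name and cell contents are made up for illustration:

```python
from openpyxl import Workbook, load_workbook
from openpyxl.styles import Font, PatternFill

wb = Workbook()
ws = wb.active
ws["A1"] = "Revenue"
ws["A1"].font = Font(bold=True)                         # Home Tab: Bold
ws["A1"].fill = PatternFill("solid", fgColor="FFFF00")  # Home Tab: Fill Color
ws["B1"] = 1234.5
ws["B1"].number_format = "#,##0.00"                     # Home Tab: Number format
wb.save("report.xlsx")

# Reopen the file to confirm the write round-trips
wb2 = load_workbook("report.xlsx")
print(wb2.active["A1"].value)  # Revenue
```

Every step here replaces a click in Excel, which is exactly what makes the workbook reproducible.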
This post follows up on the last post to introduce another convenient tool for writing maintainable code—the configuration file. In particular, I will show you a specific config file format, YAML, and how it works in Python.
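To preview the idea, here is a minimal sketch, assuming the PyYAML package (`import yaml`) is installed; the config keys are made up for illustration:

```python
import yaml

# A config that would normally live in its own .yaml file
config_text = """
database:
  host: localhost
  port: 5432
report:
  recipients:
    - alice@example.com
    - bob@example.com
"""

# safe_load parses plain data only, without executing arbitrary YAML tags
config = yaml.safe_load(config_text)
print(config["database"]["port"])        # 5432 (parsed as an int)
print(config["report"]["recipients"])
```

Moving values like hosts and recipient lists out of the code and into a file like this is the whole point: the script stays the same while the config changes.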
Writing and maintaining complex SQL queries sucks. How can I use Python to make my life easier?
In the previous posts, I covered how to set up scheduled jobs for SQL queries. How can I get a note when things go wrong? How can I send the result to myself? Email, of course! So, in this post, I will show you how to send emails with Python.
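The gist fits in a few lines of the standard library's `email` and `smtplib` modules; the addresses, SMTP host, and attachment contents below are placeholders, not the post's actual setup:

```python
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Query finished"
msg["From"] = "me@example.com"
msg["To"] = "me@example.com"
msg.set_content("The scheduled query completed. Results attached.")

# Attach a result file -- faked inline here for illustration
msg.add_attachment(b"col1,col2\n1,2\n", maintype="text",
                   subtype="csv", filename="result.csv")

# Sending requires a reachable SMTP server; host and port are placeholders:
# with smtplib.SMTP("smtp.example.com", 587) as server:
#     server.starttls()
#     server.send_message(msg)
```

Pair this with a scheduled job and the morning report sends itself.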
The purpose of automation is to let machines do things while we humans rest. In this post, I will show you how to schedule a job with a Python module.
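Without giving away which module the post covers, the standard library's `sched` is one option for running a function after a delay; the job below is a stand-in for real work:

```python
import sched
import time

scheduler = sched.scheduler(time.time, time.sleep)
log = []

def job(name):
    # A stand-in for the real work, e.g. running a query
    log.append(f"{name} ran")

# Run `job` 0.1 seconds from now; the priority (1) breaks ties
# between events scheduled for the same moment
scheduler.enter(0.1, 1, job, argument=("nightly_report",))
scheduler.run()  # blocks until all scheduled events have fired
print(log)
```

In practice you would pair something like this (or the OS scheduler) with a longer delay and a job that actually touches the database.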
In PyderPuffGirl Episode 1, I showed you how to open a SQL query in Python. How can I submit said query through Python to a database? Moreover, how can I get the result of the query as a file in Python?
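The shape of the answer looks roughly like this sketch; I am using the standard library's `sqlite3` as a stand-in database (a real warehouse would need its own driver, such as pyodbc) and `csv` to write the result file. The table and query are made up:

```python
import csv
import sqlite3

# Stand-in database with a toy table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0)])

# In the tutorial's workflow, the query text comes from a .sql file
query = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
cursor = conn.execute(query)
header = [col[0] for col in cursor.description]  # column names
rows = cursor.fetchall()

# Write the result as a CSV file
with open("result.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)
```

Swap the connection line for your database's driver and the rest of the pattern carries over unchanged.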
This is Episode 1 of the PyderPuffGirls†—a tutorial on automating the boring parts of data analysis that we are going through in the next 8 weeks. I’m writing this tutorial for people who have had at least one false start in learning Python, just like me two years ago.
This story has happened to me too many times: First, I started a project. Then I explored the relevant data and wrote some code that fulfilled the business needs. I was proud of my code and the initial results.
In this post, I am going to provide my views on the steps of train-validation-test in building a machine learning model. In particular, I’ll share some thoughts on the test set. This sounds like beating a dead horse – and in some ways it is – but there is more to dig into when working on real problems.
How to run tests in data science was a question I had for quite some time. As far as I know, testing is a standard recipe among software developers, but there seems to be no such recipe for data scientists.
In this post, I will use toy data to show some basic dataframe operations that are helpful when working with dataframes in PySpark or tuning the performance of Spark jobs. The list is by no means exhaustive, but these are the most common ones I use.
Last November, I went to my first hackathon, the Queen City Hackathon, in Charlotte. Here’s my review of the event, with a lesson I learned from it.
When I build a machine learning model for a classification problem, one of the questions I ask myself is: why is my model not crap? Sometimes I feel that developing a model is like holding a grenade, and calibration is one of my safety pins. In this post, I will walk through the concept of calibration, then show how to make diagnostic calibration plots in Python.
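The core of a calibration plot is simple enough to sketch in a few lines, without claiming this is the post's exact recipe: bin the predicted probabilities, then compare the mean prediction to the observed event rate in each bin. The predictions and labels below are made up:

```python
from collections import defaultdict

# Made-up predicted probabilities and true 0/1 labels
preds  = [0.1, 0.2, 0.15, 0.8, 0.9, 0.85, 0.55, 0.4, 0.6, 0.95]
labels = [0,   0,   1,    1,   1,   1,    0,    0,   1,   1]

def calibration_points(preds, labels, n_bins=5):
    """Mean predicted probability vs. observed event rate, per bin."""
    bins = defaultdict(list)
    for p, y in zip(preds, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    points = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_pred = sum(p for p, _ in pairs) / len(pairs)
        event_rate = sum(y for _, y in pairs) / len(pairs)
        points.append((mean_pred, event_rate))
    return points

# A well-calibrated model puts these points near the diagonal y = x
for mean_pred, event_rate in calibration_points(preds, labels):
    print(f"predicted {mean_pred:.2f} -> observed {event_rate:.2f}")
```

Plot those pairs against the diagonal and you have the diagnostic: points far off the line mean the predicted probabilities cannot be taken at face value.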
Here’s the problem: I have a Python function that iterates over my data, but going through each row in the dataframe takes several days. If I have a computing cluster with many nodes, how can I distribute this Python function in PySpark to speed up this process — maybe cut the total time down to less than a few hours — with the least amount of work?
This post is a review of my data science job search last year. When I started my job search last spring, I tried to read up on others’ experiences so I would know what to expect, but they were few and far between, and even fewer were told from a PhD student’s perspective. It is scary to tread down an unknown path, so I want to share my own experience.
When I write PySpark code, I use Jupyter notebook to test my code before submitting a job on the cluster. In this post, I will show you how to install and run PySpark locally in Jupyter Notebook on Windows. I’ve tested this guide on a dozen Windows 7 and 10 PCs in different languages.
Recently, I had to write a Hive query that contains multiple window functions. As the number of windows grew, the query became an unreadable mess.
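One trick that helps with exactly this mess (though the post's own fix may differ) is the named `WINDOW` clause, which Hive supports: define the window once and reuse it by name in every window function. Here is a sketch run through `sqlite3`, which also supports the syntax in recent versions; the table is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, day INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("a", 1, 10.0), ("a", 2, 20.0), ("b", 1, 5.0), ("b", 2, 7.0),
])

# Instead of repeating PARTITION BY ... ORDER BY ... inside every
# OVER (...), define the window once and refer to it by name.
query = """
SELECT customer, day, amount,
       SUM(amount)  OVER w AS running_total,
       ROW_NUMBER() OVER w AS order_rank
FROM orders
WINDOW w AS (PARTITION BY customer ORDER BY day)
ORDER BY customer, day
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With a dozen window functions sharing two or three named windows, the query stops being a wall of repeated `PARTITION BY` clauses.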
This post follows my previous post How I Learned Data Science Skills as a Math PhD. I said that meetups are awesome but didn’t give my reasons. Here, I am going to share my experience with a particular meetup, the Developer Launchpad Nashville, as an example of why meetups were awesome to me and the roles they played when I was changing careers.
This post is a review of how I picked up data science skills when I was a PhD student in mathematics. The time span was from January 2016 to May 2017, before I got my job. I break it into three periods: before, during, and after my summer internship. Here, I will talk about what I did, not what I think I should have done — that’s for another post.