5 Great Books I'm Learning from in 2019

March 05, 2019

In the last year and a half, I read about 30 books. Here are five great books that I read more than once and found myself going back to again and again for practical applications.

Agile in Data Science: What is Waterfall and Agile?

January 30, 2019

There has been a lot of talk about Agile around me recently, so I want to summarize my understanding of Agile in data science in a series of posts. This post covers where the discussion usually starts: Waterfall versus Agile.

A Python Tutorial for the Bored Me—PyderPuffGirls Episode 1

November 26, 2018

This is Episode 1 of PyderPuffGirls†, a tutorial on automating the boring parts of data analysis that we will go through over the next 8 weeks. I’m writing this tutorial for people who have had at least one false start in learning Python, just like I had two years ago.

Improving the Validation and Test Split

August 25, 2018

In this post, I am going to share my views on the train-validation-test steps in building a machine learning model. In particular, I’ll share some thoughts on the test set. This sounds like beating a dead horse – and in some ways it is – but there is more to dig into when working on real problems.

PySpark Dataframe Basics

March 04, 2018

In this post, I will use a toy dataset to show some basic dataframe operations that are helpful when working with dataframes in PySpark or tuning the performance of Spark jobs. The list is by no means exhaustive, but these are the operations I use most often.

My First Hackathon

February 20, 2018

Last November, I went to my first hackathon, the Queen City Hackathon, in Charlotte. Here’s my review of the event, along with a lesson I learned from it.

A Guide to Calibration Plots in Python

February 07, 2018

When I build a machine learning model for a classification problem, one of the questions I ask myself is: why is my model not crap? Sometimes I feel that developing a model is like holding a grenade, and calibration is one of my safety pins. In this post, I will walk through the concept of calibration, then show how to make diagnostic calibration plots in Python.

How to Turn Python Functions into PySpark Functions (UDF)

January 29, 2018

Here’s the problem: I have a Python function that iterates over my data, but going through each row in the dataframe takes several days. If I have a computing cluster with many nodes, how can I distribute this Python function in PySpark to speed up this process — maybe cut the total time down to less than a few hours — with the least amount of work?

My Job Search as a PhD Student

January 23, 2018

This post is a review of my data science job search last year. When I started my job search last spring, I tried to read up on others’ experiences so I could know what to expect, but such accounts were few and far between, and even fewer were told from a PhD student’s perspective. It is scary to tread down an unknown path, so I want to share my own experience.

How to Install and Run PySpark in Jupyter Notebook on Windows

December 30, 2017

When I write PySpark code, I use Jupyter Notebook to test my code before submitting a job on the cluster. In this post, I will show you how to install and run PySpark locally in Jupyter Notebook on Windows. I’ve tested this guide on a dozen Windows 7 and 10 PCs in different languages.

How I Learned Data Science Skills as a Math PhD

October 14, 2017

This post is a review of how I picked up data science skills when I was a PhD student in mathematics. It covers January 2016 to May 2017, before I got my job. I break it into three periods: before, during, and after my summer internship. Here, I will talk about what I did, not what I think I should have done; that’s for another post.