This story has happened to me too many times: First, I started a project. Then I explored relevant data and wrote some code that fulfills business needs. I was proud of my code and the initial results.
In this post, I am going to provide my views on the steps of train-validation-test in building a machine learning model. In particular, I’ll share some thoughts on the test set. This sounds like beating a dead horse – and in some ways it is – but there is more to dig into when working on real problems.
How to run tests in data science was a question I had for quite some time. As far as I know, testing is a standard recipe among software developers, but there seems to be no such recipe for data scientists.
In this post, I will use a toy dataset to show some basic dataframe operations that are helpful when working in PySpark or tuning the performance of Spark jobs. The list is by no means exhaustive, but these are the operations I use most often.
Last November, I went to my first hackathon, the Queen City Hackathon, in Charlotte. Here’s my review of the event with a lesson I learned from it.
When I build a machine learning model for classification problems, one of the questions that I ask myself is why is my model not crap? Sometimes I feel that developing a model is like holding a grenade, and calibration is one of my safety pins. In this post, I will walk through the concept of calibration, then show in Python how to make diagnostic calibration plots.
Here’s the problem: I have a Python function that iterates over my data, but going through each row in the dataframe takes several days. If I have a computing cluster with many nodes, how can I distribute this Python function in PySpark to speed up this process — maybe cut the total time down to less than a few hours — with the least amount of work?
This post is a review of my data science job search last year. When I started my job search last spring, I tried to read up on others’ experiences so I could know what to expect, but they were few and far between, and even fewer were told from a PhD student’s perspective. It is scary to tread down an unknown path, so I want to share my own experience.
When I write PySpark code, I use Jupyter Notebook to test my code before submitting a job on the cluster. In this post, I will show you how to install and run PySpark locally in Jupyter Notebook on Windows. I’ve tested this guide on a dozen Windows 7 and 10 PCs in different languages.
A Reddit comment
Recently, I had to write a Hive query that contains multiple window functions. As the number of windows grew, the query became an unreadable mess.
This post follows my previous post How I Learned Data Science Skills as a Math PhD. I said that meetups are awesome but didn’t give my reasons. Here, I am going to share my experience with a particular meetup, the Developer Launchpad Nashville, as an example of why meetups were awesome to me and the roles they played when I was changing careers.
This post is a review of how I picked up data science skills when I was a PhD student in mathematics. The time span was from January 2016 to May 2017, before I got my job. I break it into three periods: before, during, and after my summer internship. Here, I will talk about what I did, not what I think I should have done; that’s for another post.