In the last year and a half, I read about 30 books. Here are five great books that I read more than once and kept going back to for practical applications.
Back in November, I started an internal training for my colleagues, walking through basic automation in Python for data analysis. This post is a guide to using the training material.
If you have ever pulled your hair out fixing the performance of a model built by someone else, all while wondering “why is this happening to me?”, then this post may be for you.
In this last post of the series, I will continue automating Excel with Python and show you how to use a few commands outside the Home Tab.
There has been a lot of talk about Agile around me recently. So I want to summarize my understanding of Agile in data science in a series of posts. This post covers where the discussion usually starts: Waterfall versus Agile.
In this post, I will show you how to automate Excel away using the Python package openpyxl, starting with every button on the Home Tab. No more drag and drop!
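As a taste of what openpyxl can do, here is a minimal sketch, assuming openpyxl is installed; the file name and cell contents are made up for illustration:

```python
from openpyxl import Workbook, load_workbook
from openpyxl.styles import Font, PatternFill

wb = Workbook()
ws = wb.active
ws["A1"] = "Revenue"
ws["A1"].font = Font(bold=True)                         # Home Tab: Bold
ws["A1"].fill = PatternFill("solid", fgColor="FFFF00")  # Home Tab: Fill Color
ws["B1"] = 1234.5
ws["B1"].number_format = "#,##0.00"                     # Home Tab: Number format
wb.save("report.xlsx")

# Reopen the file to confirm the write round-trips
wb2 = load_workbook("report.xlsx")
print(wb2.active["A1"].value)  # Revenue
```

Every step here replaces a click in Excel, which is exactly what makes the workbook reproducible.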
This post follows up on the last post to introduce another convenient tool for writing maintainable code—the configuration file. In particular, I will show you a specific config file format, YAML, and how it works in Python.
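To preview the idea, here is a minimal sketch, assuming the PyYAML package (`import yaml`) is installed; the config keys are made up for illustration:

```python
import yaml

# A config that would normally live in its own .yaml file
config_text = """
database:
  host: localhost
  port: 5432
report:
  recipients:
    - alice@example.com
    - bob@example.com
"""

# safe_load parses plain data only, without executing arbitrary YAML tags
config = yaml.safe_load(config_text)
print(config["database"]["port"])        # 5432 (parsed as an int)
print(config["report"]["recipients"])
```

Moving values like hosts and recipient lists out of the code and into a file like this is the whole point: the script stays the same while the config changes.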
Writing and maintaining complex SQL queries sucks. How can I use Python to make my life easier?
In the previous posts, I covered how to set up scheduled jobs for SQL queries. How can I get a note when things go wrong? How can I send the result to myself? Email, of course! So, in this post, I will show you how to send emails with Python.
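The gist fits in a few lines of the standard library's `email` and `smtplib` modules; the addresses, SMTP host, and attachment contents below are placeholders, not the post's actual setup:

```python
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Query finished"
msg["From"] = "me@example.com"
msg["To"] = "me@example.com"
msg.set_content("The scheduled query completed. Results attached.")

# Attach a result file -- faked inline here for illustration
msg.add_attachment(b"col1,col2\n1,2\n", maintype="text",
                   subtype="csv", filename="result.csv")

# Sending requires a reachable SMTP server; host and port are placeholders:
# with smtplib.SMTP("smtp.example.com", 587) as server:
#     server.starttls()
#     server.send_message(msg)
```

Pair this with a scheduled job and the morning report sends itself.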
The purpose of automation is to let machines do things while we humans rest. In this post, I will show you how to schedule a job with a Python module.
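Without giving away which module the post covers, the standard library's `sched` is one option for running a function after a delay; the job below is a stand-in for real work:

```python
import sched
import time

scheduler = sched.scheduler(time.time, time.sleep)
log = []

def job(name):
    # A stand-in for the real work, e.g. running a query
    log.append(f"{name} ran")

# Run `job` 0.1 seconds from now; the priority (1) breaks ties
# between events scheduled for the same moment
scheduler.enter(0.1, 1, job, argument=("nightly_report",))
scheduler.run()  # blocks until all scheduled events have fired
print(log)
```

In practice you would pair something like this (or the OS scheduler) with a longer delay and a job that actually touches the database.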
In PyderPuffGirl Episode 1, I showed you how to open a SQL query in Python. How can I submit said query through Python to a database? Moreover, how can I get the result of the query as a file in Python?
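The shape of the answer looks roughly like this sketch; I am using the standard library's `sqlite3` as a stand-in database (a real warehouse would need its own driver, such as pyodbc) and `csv` to write the result file. The table and query are made up:

```python
import csv
import sqlite3

# Stand-in database with a toy table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0)])

# In the tutorial's workflow, the query text comes from a .sql file
query = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
cursor = conn.execute(query)
header = [col[0] for col in cursor.description]  # column names
rows = cursor.fetchall()

# Write the result as a CSV file
with open("result.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)
```

Swap the connection line for your database's driver and the rest of the pattern carries over unchanged.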
This is Episode 1 of the PyderPuffGirls†—a tutorial on automating the boring parts of data analysis that we are going through in the next 8 weeks. I’m writing this tutorial for people who have had at least one false start in learning Python, just like me two years ago.
This story has happened to me too many times: First, I started a project. Then I explored the relevant data and wrote some code that fulfilled the business needs. I was proud of my code and the initial results.
In this post, I am going to provide my views on the steps of train-validation-test in building a machine learning model. In particular, I’ll share some thoughts on the test set. This sounds like beating a dead horse – and in some ways it is – but there is more to dig into when working on real problems.
How to run tests in data science was a question I had for quite some time. As far as I know, testing is a standard recipe among software developers, but there seems to be no such recipe for data scientists.
In this post, I will use toy data to show some basic dataframe operations that are helpful when working with dataframes in PySpark or tuning the performance of Spark jobs. The list is by no means exhaustive, but these are the most common ones I use.
Last November, I went to my first hackathon, the Queen City Hackathon, in Charlotte. Here’s my review of the event, with a lesson I learned from it.
When I build a machine learning model for a classification problem, one of the questions I ask myself is: why is my model not crap? Sometimes I feel that developing a model is like holding a grenade, and calibration is one of my safety pins. In this post, I will walk through the concept of calibration, then show how to make diagnostic calibration plots in Python.
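The core of a calibration plot is simple enough to sketch in a few lines, without claiming this is the post's exact recipe: bin the predicted probabilities, then compare the mean prediction to the observed event rate in each bin. The predictions and labels below are made up:

```python
from collections import defaultdict

# Made-up predicted probabilities and true 0/1 labels
preds  = [0.1, 0.2, 0.15, 0.8, 0.9, 0.85, 0.55, 0.4, 0.6, 0.95]
labels = [0,   0,   1,    1,   1,   1,    0,    0,   1,   1]

def calibration_points(preds, labels, n_bins=5):
    """Mean predicted probability vs. observed event rate, per bin."""
    bins = defaultdict(list)
    for p, y in zip(preds, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    points = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_pred = sum(p for p, _ in pairs) / len(pairs)
        event_rate = sum(y for _, y in pairs) / len(pairs)
        points.append((mean_pred, event_rate))
    return points

# A well-calibrated model puts these points near the diagonal y = x
for mean_pred, event_rate in calibration_points(preds, labels):
    print(f"predicted {mean_pred:.2f} -> observed {event_rate:.2f}")
```

Plot those pairs against the diagonal and you have the diagnostic: points far off the line mean the predicted probabilities cannot be taken at face value.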
Here’s the problem: I have a Python function that iterates over my data, but going through each row in the dataframe takes several days. If I have a computing cluster with many nodes, how can I distribute this Python function in PySpark to speed up this process — maybe cut the total time down to less than a few hours — with the least amount of work?
This post is a review of my data science job search last year. When I started my job search last spring, I tried to read up on others’ experiences so I would know what to expect, but they were few and far between, and even fewer were told from a PhD student’s perspective. It is scary to tread down an unknown path, so I want to share my own experience.
When I write PySpark code, I use Jupyter notebook to test my code before submitting a job on the cluster. In this post, I will show you how to install and run PySpark locally in Jupyter Notebook on Windows. I’ve tested this guide on a dozen Windows 7 and 10 PCs in different languages.
Recently, I had to write a Hive query that contains multiple window functions. As the number of windows grew, the query became an unreadable mess.
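One trick that helps with exactly this mess (though the post's own fix may differ) is the named `WINDOW` clause, which Hive supports: define the window once and reuse it by name in every window function. Here is a sketch run through `sqlite3`, which also supports the syntax in recent versions; the table is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, day INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("a", 1, 10.0), ("a", 2, 20.0), ("b", 1, 5.0), ("b", 2, 7.0),
])

# Instead of repeating PARTITION BY ... ORDER BY ... inside every
# OVER (...), define the window once and refer to it by name.
query = """
SELECT customer, day, amount,
       SUM(amount)  OVER w AS running_total,
       ROW_NUMBER() OVER w AS order_rank
FROM orders
WINDOW w AS (PARTITION BY customer ORDER BY day)
ORDER BY customer, day
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With a dozen window functions sharing two or three named windows, the query stops being a wall of repeated `PARTITION BY` clauses.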
This post follows my previous post How I Learned Data Science Skills as a Math PhD. I said that meetups are awesome but didn’t give my reasons. Here, I am going to share my experience with a particular meetup, the Developer Launchpad Nashville, as an example of why meetups were awesome to me and the roles they played when I was changing careers.
This post is a review of how I picked up data science skills when I was a PhD student in mathematics. The time span was from January 2016 to May 2017, before I got my job. I break it into three periods: before, during, and after my summer internship. Here, I will talk about what I did, not what I think I should have done — that’s for another post.