Let's Read Together - Testing Rubrics for Machine Learning
How to run tests in data science was a question I'd had for quite some time. As far as I know, testing is standard practice among software developers, but there seems to be no comparable recipe for data scientists.
Two months ago I came across a paper from Google Research, the first paper on testing in data science that I could actually relate to, and I want to invite you to read it with me; it's only 4 pages long!
You can find the paper here: What’s your ML test score? A rubric for ML production systems
Tests for data, not just code
When I think of tests, I think of checklists in any development or production environment. For example, when I bring my car to a dealership for major maintenance, I'd prefer that the mechanics follow a checklist and put all the screws back in before I drive my car out of the shop.
Similarly, in software development, there are basic requirements that a piece of code should meet before being put into production. The only difference is that, with code, I can and should let the computer check those boxes for me. As long as I know what the specs are, I can come up with a good checklist and automate as many tests as possible so I don't miss an important screw. From what I have seen, good testing practices are prevalent in software development: unit tests, integration tests, property-based tests, and so on.
Test with a robot whenever I can
There is, however, an additional challenge in data science. There are solid testing frameworks for code, but as a data scientist, I spend most of my time dealing with data rather than code. How should I go about testing when the emphasis is on the data and not on the code?
Test ideas from the paper
The authors of this paper break a machine learning pipeline into different stages and provide several ideas on how to test each one. They also provide a 12-point rubric so I can score my testing practices.
I think the basic questions to ask are the following: how much can I cover with tests, and how many of those tests can I automate? I'm going to pick a few of the ideas I liked most from the paper.
2.2 Test the relationship between each feature and the target, and the pairwise correlations between individual signals.
It is important to have a thorough understanding of the individual features used in a given model; this is a minimal set of tests, more exploration may be needed to develop a full understanding.
These tests may be run by computing correlation coefficients, by training models with one or two features, or by training a set of models that each have one of k features individually removed.
Most of my projects are built on data from multiple sources. I am not sure a thorough understanding of each individual feature is possible, especially when the data come from a vendor. But I think I should know which groups of features are in my model and how they interact with each other: what if I remove all geographical data from my model? What about adding behavioral data? How are those two groups of features related?
I have not tried setting up automated tests on this one yet, but I can see that early automation of such tests can reduce future exploration time.
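Here is a minimal sketch of what an automated version could look like. I'm assuming a pandas DataFrame df with numeric features and a binary column named target, and the correlation threshold is just a placeholder; none of these names come from the paper.

```python
# A minimal sketch, not from the paper: `df`, the "target" column name, and the
# thresholds are assumptions for illustration.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def test_feature_target_correlations(df: pd.DataFrame, target: str = "target",
                                     min_abs_corr: float = 0.01) -> None:
    """Flag features with almost no linear relationship to the target."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    weak = corr[corr.abs() < min_abs_corr]
    assert weak.empty, f"Near-zero correlation with target: {list(weak.index)}"


def leave_one_out_importance(df: pd.DataFrame, target: str = "target") -> pd.Series:
    """Refit a simple model with each feature removed in turn; a score that
    barely moves hints that the feature may not be pulling its weight."""
    X, y = df.drop(columns=[target]), df[target]
    model = LogisticRegression(max_iter=1000)
    baseline = cross_val_score(model, X, y, cv=5).mean()
    deltas = {col: baseline - cross_val_score(model, X.drop(columns=[col]), y, cv=5).mean()
              for col in X.columns}  # positive delta = the feature helps
    return pd.Series(deltas).sort_values(ascending=False)
```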
2.4 Test that a model does not contain any features that have been manually determined as unsuitable for use.
A feature might be unsuitable when it’s been discovered to be unreliable, overly expensive, etc. Tests are needed to ensure that such features are not accidentally included (e.g. via copy-paste) into new models.
This one reminds me of a conversation I had with @RobHarrigan, where he once discovered a magic feature while building a model. It turned out the feature was "magic" because it had been discontinued after a company acquisition; it was all NULL after a certain point in time, and hence useless in production.
I've seen in my own projects that the business changed the data logic at some point and I was totally unaware of the change. How to avoid this pitfall depends on the company, but for any data science team, I think there is a lot of value in maintaining a knowledge repo where all known data logic changes in the company are documented.
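On top of documentation, a check like this is easy to automate, for example in CI. This is only a sketch: BLOCKED_FEATURES, the feature list, and the date column are hypothetical names I made up, not anything from the paper or a real pipeline.

```python
# A minimal sketch with hypothetical names; the denylist entries are made up.
import pandas as pd

BLOCKED_FEATURES = {
    "discontinued_vendor_score",  # went all-NULL after a data source was dropped
    "raw_email_address",          # deemed too unreliable / sensitive to use
}


def test_no_blocked_features(model_features) -> None:
    """Fail fast if a known-bad feature sneaks into a new model, e.g. via copy-paste."""
    leaked = BLOCKED_FEATURES.intersection(model_features)
    assert not leaked, f"Blocked features found in model: {sorted(leaked)}"


def test_feature_not_all_null_recently(df: pd.DataFrame, feature: str,
                                       date_col: str = "date",
                                       window_days: int = 30) -> None:
    """Catch the 'magic feature that silently went NULL' failure mode."""
    cutoff = df[date_col].max() - pd.Timedelta(days=window_days)
    recent = df.loc[df[date_col] >= cutoff, feature]
    assert recent.notna().any(), f"{feature} is entirely NULL in the last {window_days} days"
```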
3.4 Test the effect of model staleness.
If predictions are based on a model trained yesterday versus last week versus last year, what is the impact on the live metrics of interest?
All models need to be updated eventually to account for changes in the external world; a careful assessment is important to guide such decisions.
How often should I update my model? One thing I learned in my projects is that real data change over time, and the production model suffers. I've been testing the decay of my model's performance by stretching the date gap between the training and test sets in the dev environment. This gives me some idea of how often I should update my model.
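A rough sketch of that decay test is below. I'm assuming a DataFrame df with a datetime column date, a binary target column, and a gradient boosting model as a stand-in for whatever is actually in production; all of those are my own assumptions for illustration.

```python
# A minimal sketch of stretching the train/test date gap; `df`, the column
# names, the model, and the metric are all stand-ins, not from the paper.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score


def staleness_curve(df: pd.DataFrame, gaps_in_days=(7, 30, 90, 180)) -> pd.Series:
    """Hold out the most recent month, train on data that ends `gap` days
    before it, and watch how the metric decays as the gap grows."""
    df = df.sort_values("date")
    test = df[df["date"] >= df["date"].max() - pd.Timedelta(days=30)]
    X_te, y_te = test.drop(columns=["date", "target"]), test["target"]
    scores = {}
    for gap in gaps_in_days:
        cutoff = test["date"].min() - pd.Timedelta(days=gap)
        train = df[df["date"] < cutoff]
        X_tr, y_tr = train.drop(columns=["date", "target"]), train["target"]
        model = GradientBoostingClassifier().fit(X_tr, y_tr)
        scores[gap] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    return pd.Series(scores, name="auc_by_train_test_gap_days")
```

Plotting that series gives a rough sense of how fast the model goes stale and how often a retrain is worth the effort.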
Some data become stale by nature, and in that case retraining on new data solves the problem. But the trickiest question for me is what to do when the data go through concept drift, meaning the underlying distribution of my target variable has changed. Examples would be housing prices when the government passes a new law, or the number of home runs when Major League Baseball decides to juice the ball.
Everyone digs the long ball - just ask Greg Maddux
I am not sure how to deal with concept drift other than active monitoring in production. Maybe I could keep a few simple models ready to roll back to, so I can minimize short-term losses when I see the quality of predictions drop, while going back to the exploration stage to see whether it's worthwhile to build a new model.
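For the monitoring piece, one simple check I can imagine (my own sketch, not something the paper prescribes) is comparing the recent distribution of predicted scores against a reference window and alerting when they diverge:

```python
# A hypothetical drift alarm: compare recent prediction scores against a
# reference window with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp


def drift_alert(reference_scores: np.ndarray, recent_scores: np.ndarray,
                p_threshold: float = 0.01) -> bool:
    """Return True when the recent score distribution no longer looks like the reference."""
    _, p_value = ks_2samp(reference_scores, recent_scores)
    return p_value < p_threshold
```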
3.5 Test against a simpler model as a baseline.
Regularly testing against a very simple baseline model, such as a linear model with very few features, is an effective strategy both for confirming the functionality of the larger pipeline and for helping to assess the cost to benefit tradeoffs of more sophisticated techniques.
I had not thought of using outputs from simple models as a way to test the data pipeline! I know that I can test individual features with summary statistics and their correlations, but maybe a model's performance metrics can also be viewed as summary statistics for the data?
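As a sketch, the baseline comparison could even be written as a test that fails when the production model stops clearly beating a dummy or a small linear model. The margin and the names here are my own placeholders, not anything from the paper.

```python
# A minimal sketch: the production model should beat trivial baselines run
# through the same cross-validation. `production_model` and `margin` are mine.
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def test_beats_simple_baselines(X, y, production_model, margin: float = 0.02) -> None:
    """Fail if the big model does not clear a majority-class dummy and a small
    logistic regression by at least `margin` in cross-validated accuracy."""
    prod = cross_val_score(production_model, X, y, cv=5).mean()
    dummy = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5).mean()
    linear = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    assert prod >= dummy + margin, f"Only {prod:.3f} vs. dummy baseline {dummy:.3f}"
    assert prod >= linear + margin, f"Only {prod:.3f} vs. linear baseline {linear:.3f}"
```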
So…what’s Chang’s score?
I scored my projects with this rubric, and I'm horrible at testing. All of my projects score between 0 and 2 points, which makes them research projects rather than production systems. Most of my tests are manual rather than automated. In other words, my tests have low coverage, and I've wasted a lot of time checking the boxes myself instead of having a script handle the tests. Oh well, more things to strive for.
I'd love to hear what you got out of this paper. Is there an idea you like from the paper or from your own work? I think data people don't talk enough about testing online, so feel free to leave a comment so we can start a discussion.