Document Analysis as Python Code with Great Expectations

If you have ever pulled your hair out fixing the performance of a model built by someone else, all the while wondering “why is this happening to me?”, then this post may be for you.

The nightmare

The following story is one of the worst nightmares in data science:

  • A model that had been working for months started to lose performance.
  • Management saw the drop in a dashboard and started asking why.
  • But the model was built by someone who had already left the company.

You spent weeks debugging, only to find out that the cause was a small change in business logic that you were not aware of. The performance dropped because the data was in a different state than before.

You couldn’t go back to what you had been doing because there was time pressure. As a result, you lost momentum on your own project.

Even worse, no one thanked you for fixing the model because it was supposed to work in the first place.

It was a lose-lose deal from the outset and everyone was upset.

[Image: Is my problem in the data or somewhere else?]

What is the problem?

The problem comes down to one thing:

Building a new model is easier than maintaining a model

and the knowledge of the builder does not pass down to the maintainer.

When I build a new model, I probably spend more than 60% of my time understanding all the edge cases in my data to curate a dataset for the model. I gain a lot of insight about the business in this process.

Where did those insights go? Jupyter Notebooks, of course. Even if I remember to write down all my findings, which is questionable, this process still creates a problem.

No one is going to read through all my notebooks when the model is given to somebody else

Automate the knowledge transfer

This is one of the hardest problems in a data science team–how do I transfer knowledge?

If I spend 80% of my time on exploratory analysis, then I should be able to capture and automate my insights as code, so the next person who takes on my project can

  • read through the code to gain the same insights, and
  • see a warning when bad things happen and check against what I already know about the data.

But there was no good tool for this problem. Most of the teams I know monitor their data pipelines with their own tools, and there is no consensus on how to transfer this kind of knowledge. Unit tests do not help–the problem is not in the code but in the data.

Enter Great Expectations

great_expectations was the first Python package I saw that was a perfect fit for this task. I first found out about it from a tweet.

After playing with it, I can say that I loved this package and the idea behind it. Therefore, in this post, I will cover the absolute basics of getting started.

Goal

My goal for this post is to show you how to

  • set basic expectations and save the config, and
  • load expectations and validate on a new dataset.

Installation

  • Project page: https://github.com/great-expectations/great_expectations

pip install --user great_expectations

Throughout the post, I will use great_expectations as ge:

import great_expectations as ge

Data

The dataset I’m using is a Kaggle dataset called San Francisco Street-Use Permits. I picked it because I wanted a real-life dataset that could emulate the situation I have in mind. In this case, let’s say I have a model that uses Permit Type as a feature. What if the city government starts issuing a new type of permit? How can I catch that change in my data pipeline?

[Screenshot: the San Francisco Street-Use Permits dataset on Kaggle]

Using Great Expectations

The idea behind great_expectations is that each dataset should be paired with a set of expectations—they are tests for data during runtime.

The library does not throw an error if a new batch of data violates the expectations, because how I want to use this information depends on the context. Depending on the situation, I might

  • Reject the new batch of data, and only use the old and validated ones
  • Raise a warning in the logs
  • Send an email at 3 AM to the person on call for the pipeline to take a look

But it is unlikely that I want to throw an error and stop the whole pipeline.
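
For example, here is a minimal sketch of how I might act on a validation result in a pipeline. The file names are hypothetical, and I am assuming the result is a dictionary with a success field, which is what the version I used returns (exporting a config and calling .validate() are both covered later in this post):

import json
import logging

import great_expectations as ge

# Hypothetical file names; exporting a config is covered later in this post
with open('saved_config.json') as file:
    saved_config = json.load(file)

df_new_batch = ge.read_csv('new_batch.csv')
result = df_new_batch.validate(expectations_config=saved_config)

if not result['success']:
    # Raise a warning in the logs instead of stopping the whole pipeline;
    # this is also where I could reject the batch or alert the person on call
    logging.warning('New batch violates expectations: %s', result)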

Steps

  1. Loading data
  2. Setting an expectation on a pandas dataframe
  3. Exporting and importing expectations

1. Loading data

For now, great_expectations sits on top of pandas and pairs expectations with pandas dataframes. So the first step is to convert a pandas dataframe into a great_expectations dataframe (i.e., a subclass of the pandas dataframe).

[Diagram: a great_expectations dataframe is a subclass of a pandas dataframe]

Therefore, I can still use all the usual pandas methods like .head() and .groupby() on my dataframe.

There are two ways to load a dataframe into great_expectations:

Method 1: Read from a csv

df_ge = ge.read_csv('sf-street-use-permits/street-use-permits.csv')

Method 2: Convert from pandas dataframe

This is undocumented, but it is what ge.read_csv is doing under the hood.

import pandas as pd
df = pd.read_csv('sf-street-use-permits/street-use-permits.csv')
df_ge = ge.dataset.PandasDataset(df)

Both of them give me df_ge, a great_expectations dataframe on which I can set expectations. I use method 2 more often because most of my data is in .parquet.
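
For example, here is a minimal sketch of the parquet case (the file name is hypothetical). Going through pandas first means method 2 works regardless of the source format, and the wrapped dataframe still behaves like a regular pandas dataframe:

import pandas as pd
import great_expectations as ge

# Hypothetical parquet file, read with pandas and then wrapped
df = pd.read_parquet('sf-street-use-permits/street-use-permits.parquet')
df_ge = ge.dataset.PandasDataset(df)

# The usual pandas methods still work on the wrapped dataframe
df_ge.head()
df_ge.groupby('Permit Type').size()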

2. Setting and getting expectations

An expectation is a runtime test that has no meaning unless paired with a dataframe. There are many built-in expectations, and I can also write my own.

Here, I will only demonstrate a simple expectation,

  • expect_column_values_to_be_in_set

by taking a subset of my dataframe df_ge with only 2 permit types, then setting expectations to validate the street-use permit type by:

  1. Setting the expectation to cover only 1 out of 2 permit types to see how it fails.
  2. Setting the expectation to cover both permit types to see how it passes.

permit_subset = ['Excavation', 'Wireless']
df_excavation_and_wireless = df_ge[df_ge['Permit Type'].isin(permit_subset)]

The dataframe df_excavation_and_wireless has 4505 rows and two permit types:

  • Excavation
  • Wireless
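
A quick way to confirm that, for example:

# Count rows per permit type in the subset
df_excavation_and_wireless['Permit Type'].value_counts()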

Failed

When I say that this dataframe should contain only 1 permit type, Excavation, the expectation fails:

fail_type = ['Excavation']
df_excavation_and_wireless.expect_column_values_to_be_in_set('Permit Type', fail_type)

[Screenshot: the failed expectation result]

When I try to get the expectation config from the dataframe:

df_excavation_and_wireless.get_expectations_config()

[Screenshot: the expectation config, with an empty list of expectations]

Pay attention to the empty list of expectations in the config: the expectation I added did not pass, so great_expectations discarded it. If I really want it in my config, I can take a subset of the dataframe and force the expectation to pass on that subset instead.
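
For example, a minimal sketch of that workaround, continuing with the names above:

# Keep only the rows that satisfy the stricter expectation...
df_excavation_only = df_excavation_and_wireless[
    df_excavation_and_wireless['Permit Type'] == 'Excavation'
]

# ...so the same expectation now passes and is stored in this subset's config
df_excavation_only.expect_column_values_to_be_in_set('Permit Type', ['Excavation'])
df_excavation_only.get_expectations_config()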

Succeeded

And if I set the expectation to cover both permit types instead, it succeeds:

success_type = ['Excavation', 'Wireless']
df_excavation_and_wireless.expect_column_values_to_be_in_set('Permit Type', success_type)
df_excavation_and_wireless.get_expectations_config()

[Screenshot: the expectation result and the config containing the passed expectation]

This time, the passed expectation is stored in the expectation config for me to reuse.

3. Exporting and importing expectations

After I set the expectations, I want to save them so I can reuse them on new data, which means I need to export and import my configs.

Exporting

All passed expectations are stored in the config that is paired with the dataframe. I can either

  • export it into a Python dictionary, or
  • save it as a json file.

# as dictionary
config = df_excavation_and_wireless.get_expectations_config()
# as json file
df_excavation_and_wireless.save_expectations_config('ew_config.json')
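
Since the config is a plain dictionary, I can also write it out by hand, for example if I want it to go through the same code path as my other configs. A small sketch of doing the same thing manually:

import json

# Roughly what save_expectations_config does, but done by hand
config = df_excavation_and_wireless.get_expectations_config()
with open('ew_config.json', 'w') as file:
    json.dump(config, file, indent=2)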

I can also wipe my saved expectations:

df_excavation_and_wireless.remove_expectation()

Importing

To import and use the expectation config, I load the json file as a dictionary and pass it into the .validate() method. For example:

import json

with open('ew_config.json') as file:
  validate_config = json.load(file)

df_excavation_and_wireless.validate(expectations_config=validate_config)

A workflow example

Let’s go through another example to demonstrate all the points that I said I would cover:

  1. Loading data
  2. Setting an expectation on a pandas dataframe
  3. Exporting and importing expectations

In this example, I break my data into two pieces:

  • the “before” data: df_pre, and
  • the “after” data: df_post,

to simulate the situation of having new data (post) coming into the pipeline built on old knowledge (pre).

import pandas as pd

df = pd.read_csv('sf-street-use-permits/street-use-permits.csv')
df_ge = ge.dataset.PandasDataset(df)

df_pre = df_ge.sample(3500, random_state=1234) # 3500 rows for pre-data
df_post = df_ge[~df_ge.index.isin(df_pre.index)]
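
As a quick sanity check, the two pieces should partition the original dataframe:

# The pre and post pieces together cover the whole dataframe with no overlap
assert len(df_pre) + len(df_post) == len(df_ge)
assert not set(df_pre.index) & set(df_post.index)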

I know that the set of Permit Types in those two dataframes is different:

pre_types = set(df_pre['Permit Type'])
post_types = set(df_post['Permit Type'])
try:
    assert pre_types == post_types
except AssertionError:
    # The difference in permit types
    print(set.symmetric_difference(pre_types, post_types))

[Screenshot: the symmetric difference of the permit type sets]

but let’s see how great_expectations catches this change.

Setting and saving the expectation

First, I will set the expectation on the “before” data, then export the expectation to a file:

possible_permits = df_pre['Permit Type'].unique()
df_pre.expect_column_values_to_be_in_set('Permit Type', possible_permits)

[Screenshot: the expectation result on df_pre]

Then save the expectation to saved_config.json:

df_pre.save_expectations_config('saved_config.json')

Load expectation and validate

Now I can load the expectation config and validate the post data df_post against it.

import json

with open("my_config.json") as file:
    saved_config = json.load(file)

Note that this gives me the same config as the one I saved:

[Screenshot: the loaded config, identical to the saved one]

Finally, I can validate with this config on df_post:

df_post.validate(expectations_config=saved_config, result_format='BOOLEAN_ONLY')

[Screenshot: the validation result on df_post]

I am using the BOOLEAN_ONLY result format so it doesn’t print out everything, but there are other result formats with more detail. You can find the possible formats in the documentation.
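
For example, assuming the SUMMARY format is available in the version you are using (it is listed alongside BOOLEAN_ONLY in the docs I read), a more detailed run would look like this:

# SUMMARY includes per-expectation details such as counts of unexpected values
df_post.validate(expectations_config=saved_config, result_format='SUMMARY')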

Follow-up questions

These are the questions that I am working on:

  • What are “high-value expectations”? In writing software tests, there are low-value tests and high-value tests. Can we generalize the idea to expectations?
  • What’s the best way to integrate this into a workflow that relies mostly on SQL/Hadoop instead of Python?
  • How do I use it with Airflow or any other workflow engine?

If you have experience in any of the above, please let me know.

Other thoughts

  • In my discussion with Eugene Mandel, who works with Abe Gong and James Campbell, the main authors of the library, he said that each expectation is a falsifiable statement—now I feel like I’m doing science.

  • I see this as an extension of Design by Contract and programming with assertions from software engineering but in the context of data.

  • The package’s name comes from Charles Dickens’s novel of the same name. The protagonist in the book is Pip. So, if you google pip great expectations you will not see a single entry about Python. Search for something else.


I’d love to hear your thoughts and stories on this. What do you want to know? Please leave a comment below ⬇️ or tweet at @ChangLeeTW.

Written on February 16, 2019