Why use method chaining? Because data cleaning is essentially a graph of transformations, with each step feeding the next.

Rather than jumping back and forth, it is easier to keep all the cleaning for one dataset in one place. In plain pandas, however, this is cumbersome because easy-to-use custom methods are lacking.

## Basic cleaning

In [1]:
import pandas as pd

In [2]:
from janitor import then

In [3]:
from functools import partial

## The dataset

In [4]:
raw_avocados = pd.read_csv('avocado-prices.zip', index_col=0)

In [5]:
raw_avocados.sample(5)

| | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36 | 2017-04-23 | 2.08 | 3223.81 | 0.0 | 109.87 | 0.0 | 3113.94 | 958.29 | 2155.65 | 0.0 | organic | 2017 | Syracuse |
| 42 | 2017-03-12 | 1.19 | 34526.66 | 3661.47 | 3665.38 | 4.91 | 27194.9 | 162.44 | 27032.46 | 0.0 | organic | 2017 | Portland |
| 26 | 2015-06-28 | 1.16 | 135739.94 | 78066.95 | 19522.69 | 866.73 | 37283.57 | 29903.31 | 7380.26 | 0.0 | conventional | 2015 | Jacksonville |
| 38 | 2017-04-09 | 1.24 | 26121.58 | 11723.26 | 530.28 | 0.0 | 13868.04 | 13741.99 | 126.05 | 0.0 | organic | 2017 | DallasFtWorth |
| 17 | 2017-09-03 | 1.5 | 7148.98 | 222.83 | 202.3 | 0.0 | 6723.85 | 6707.19 | 16.66 | 0.0 | organic | 2017 | NewOrleansMobile |


## Using `janitor.then()`

In [6]:
def get_yearly_sum_by_PID(df, PID):
    """Sum the column named str(PID) for each year, sorted by year."""
    output = (
        df
        [['year', str(PID)]]
        .groupby(['year'], as_index=False)
        .agg({
            str(PID): 'sum'
        })
        .sort_values('year')
    )
    return output

In [7]:
from janitor import then
from functools import partial

df_by_pid = (
    raw_avocados
    .then(partial(get_yearly_sum_by_PID, PID=4770))
)

In [8]:
df_by_pid

| | year | 4770 |
|---|---|---|
| 0 | 2015 | 142772400.0 |
| 1 | 2016 | 159879800.0 |
| 2 | 2017 | 91217510.0 |
| 3 | 2018 | 22932590.0 |


## Comparison with normal pandas

What would I do without pyjanitor? Each example below pairs a pyjanitor method with its plain-pandas equivalent.
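
For the `then` step used earlier, plain pandas offers `DataFrame.pipe`, which likewise hands the DataFrame to a function and returns whatever that function returns. The sketch below is only an illustration: it uses a small made-up frame (`toy`) instead of the avocado data, and repeats `get_yearly_sum_by_PID` so the block runs on its own.

```python
import pandas as pd


def get_yearly_sum_by_PID(df, PID):
    """Sum the column named str(PID) for each year, sorted by year."""
    return (
        df[['year', str(PID)]]
        .groupby(['year'], as_index=False)
        .agg({str(PID): 'sum'})
        .sort_values('year')
    )


# Made-up stand-in for the avocado data (not real values).
toy = pd.DataFrame({
    'year': [2016, 2015, 2015, 2016],
    '4770': [1.0, 2.0, 3.0, 5.0],
})

# pandas style: pipe forwards extra keyword arguments to the function,
# so functools.partial is not needed here.
via_pipe = toy.pipe(get_yearly_sum_by_PID, PID=4770)

# Calling the function directly gives the same result, just outside a chain.
direct = get_yearly_sum_by_PID(toy, PID=4770)
```

Because `pipe` accepts the function's arguments itself, it can replace the `then(partial(...))` pattern wherever a keyword argument is all that needs binding.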

[janitor.remove_columns](https://pyjanitor.readthedocs.io/reference/janitor.functions/janitor.remove_columns.html#janitor.remove_columns)

In [9]:
drop_cols = ['Small Bags', 'Large Bags', 'XLarge Bags']
# pandas style
df_no_bags_pd = raw_avocados.drop(drop_cols, axis=1)
# pyjanitor style
df_no_bags = raw_avocados.remove_columns(drop_cols)
df_no_bags.equals(df_no_bags_pd)

True

[janitor.to_datetime](https://pyjanitor.readthedocs.io/reference/janitor.functions/janitor.to_datetime.html#janitor.to_datetime)

In [10]:
# pandas style
df_dt = raw_avocados.assign(Date=lambda _df: pd.to_datetime(_df['Date']))
# pyjanitor style
df_dt2 = raw_avocados.to_datetime('Date')
df_dt.equals(df_dt2)

True

In [11]:
df_dt[['Date']].dtypes

Date    datetime64[ns]
dtype: object

[janitor.dropnotnull](https://pyjanitor.readthedocs.io/reference/janitor.functions/janitor.dropnotnull.html#janitor.dropnotnull)

In [12]:
import numpy as np
nan = np.nan

In [13]:
test_df = pd.DataFrame({
    'a': [1, nan, 3],
    'b': ['x', 'y', 'z']
})

In [14]:
test_df

| | a | b |
|---|---|---|
| 0 | 1.0 | x |
| 1 | NaN | y |
| 2 | 3.0 | z |


In [15]:
# pyjanitor style: keep only the rows where 'a' is null
test_out1 = test_df.dropnotnull('a')
# pandas style
test_out2 = test_df[lambda _df: pd.isnull(_df['a'])]
test_out1.equals(test_out2)

True

## Custom chaining function

In [16]:
import pandas_flavor as pf

## Use with other packages, e.g. great_expectations

In [17]:
import great_expectations as ge

In [18]:
ge_avocados = ge.read_csv('avocado.csv')

In [19]:
df_ge = (
    ge_avocados
    .then(partial(get_yearly_sum_by_PID, PID=4046))
)
df_ge

| | year | 4046 |
|---|---|---|
| 0 | 2015 | 1709450000.0 |
| 1 | 2016 | 1525123000.0 |
| 2 | 2017 | 1652038000.0 |
| 3 | 2018 | 460499700.0 |


In [20]:
type(df_ge)

pandas.core.frame.DataFrame