Don't Wait, Schedule and Relax Instead—PyderPuffGirls Episode 3

The purpose of automation is to let machine do things while us humans rest. In this post, I will show you how to schedule a job with a Python module schedule.

Requirements

The tool I’m using here is a Python module called schedule.

To install, run

pip install --user schedule

Using schedule

Here is a working example. Suppose that I have a sql_job that I want to run on every Sunday at 5:00 PM, then I can write

import schedule
import time

def sql_job():
    print("I'm pretending to be a SQL job!")

schedule.every().sunday.at('17:00').do(sql_job)

while True:
    schedule.run_pending()
    time.sleep(1)

Let’s break it down.

Defining jobs, or Python functions

In schedule, what I meant by a job in the do() is, in fact, a Python function.

A Python function is define with the def keyword.

A Python function is defined by its input and output. I may go into more detail about Python functions later. For now, let’s keep it simple.

Warning — scope of variables

In Python, there is a concept called variable scope. I think it is easier to understand by examples—what happens when I have 2 functions that share the same variable name in their definitions?

Case 1

In a clean Jupyter notebook where test_value has not been defined, if I define

def func1():
    test_value = 1
    print(test_value)

def func2():
    print(test_value)

then func1 finds test_value inside itself and prints it out. But when I execute func2, Python doesn’t know what to do with it—it doesn’t know what the test_value means in func2. The variable is not in the function’s scope.

case 1

Case 2

Let’s try again. This time, I will define test_value outside the functions:

test_value = 2

def func1():
    test_value = 1
    print(test_value)

def func2():
    print(test_value)

Here’s what I get when I execute the two functions.

case 2

As I can see,

  • func1 ignores what I defined outside and gets test_value from itself.
  • func2 does not have test_value inside so it gets the value from outside.

In writing Python code, scope issues (some people call it namespace) can lead to annoying bugs.

However, this is why people use an IDE like PyCharm. In a few weeks, I will make a short introduction to IDE and demonstrate why coding in an IDE helps avoiding scope issues. It is 2018, and there is no reason to write Python in a text editor unless necessary.

One input, multiple inputs, default values

In the case of func1 and func2 above, if a function has no input argument, then Python assumes the function can find all the variables—either within its definition or from outside—to get the job done.

Most of the time, I’d like to have an argument in my function. For example, I can calculate the area of a circle from its radius:

import math

def area_circle(radius):
    area = math.pi * (radius ** 2)  # ** stands for square. math.pi is the value of pi
    return area

area_circle(2)

Running the code gives me

Note: The naming convention for functions in Python is using all lowercase and underscores.

I can also define a function that takes multiple arguments as input and set default values.

def volume_box(length, width, height=2):
    return length * width * height

In this case, the default value for height is 2. If I put in volume_box(2, 3), it works the same as putting in volume_box(2, 3, 2) because Python supplies the missing argument with the default value.

Multiple arguments

Another thing that I can do is passing the arguments by their names (keywords). If I do it this way, then I can change the order of arguments:

Change order of arguments

Not returning anything

Every function has to return something in Python. But what happens when I skip the return statement at the bottom?

For example, I can print out what I want to eat today:

def eat(food):
    print(f'I want some {food}')

When I check what this function returns

x = eat('cake')
x

it shows nothing. In fact, it returns a NoneType, or a None object (the NULL object in Python) by default.

Instead of leaving it out, I would say it is better to

Use pass instead

The effect of leaving out return is the same as adding a pass at the bottom. I think this is a better approach when I don’t want to return anything because making the intent clear means more readable code:

def eat(food):
    print(f'I want some {food}')
    pass

Putting code into a script

So far, I have been running everything in a Jupyter notebook. To schedule jobs, it is easier to put things inside a Python script that ends with the file extension .py.

Let’s say if I have this piece of code:

import schedule
import time
import datetime

def sql_job():
    current_time = datetime.datetime.now()
    print(f"I'm pretending to be a SQL job! Ran at {current_time}")

schedule.every(3).seconds.do(sql_job)

while True:
    schedule.run_pending()
    time.sleep(1)

Then I can save the whole things as example.py and execute it.

What is that time.sleep thing in the code?

It is used to prevent the infinite loop while True to use up all the resources of the CPU. See the comments below this StackOverflow answer.

Running a script

To run the Python script, open a

then run

python example.py

and move the window somewhere and leave it running—don’t close it!

If you want to interrupt the execution, press Ctrl+C on Windows (sometimes Ctrl+D or Ctrl+Z, just try everything.)

Note: to run the script on a remote Unix server that does not have a graphical user interface, one way to keep the it running without putting the process to the background is using the command line tool screen. I will show you how to use screen in another post.

Note (12/12/18): I realized I still haven’t get to a master post on solving the “python command not found issue” yet, so bear with me here. I’ll come back and fix it!

Trick for later

There is a small trick that you will see in a lot of Python modules. I can modify my code to this and it will still run the same.

import schedule
import time
import datetime

def sql_job():
    current_time = datetime.date.now()
    print(f"I'm pretending to be a SQL job! Ran at {current_time}")

if __name__ == '__main__':

    schedule.every(5).seconds.do(sql_job)

    while True:
        schedule.run_pending()
        time.sleep(1)

The benefit of adding __name__ == '__main__' is that I can import sql_job into another piece of Python code.

If I have time, I want to talk about how to package Python code, so I think it is reasonable to put it out ahead of time. This is a great StackOverflow answer that explains what this line does.

An example for better Mondays

Let’s say I have a SQL query that generates a business report for delivery on Monday mornings. I know the data is usually in the database on Monday by 2:00 AM, so I should be able to get the report while I’m sleeping. Therefore, I want to schedule my job on every Monday at 2:00 AM to

  • execute a SQL command and get the result back, then
  • save the result to a file with the Monday date in the filename.

It makes my Monday better because I don’t have to run the query first thing on a Monday morning, hoping that my query runs and will not fail after two hours. I can just check the result and see if I need to make changes or investigate before sending it out.

The code

This is an example that builds on top of what I showed in the last post, How to Query a Database in Python.

I am going to call this scheduling script weather-report.py, which selects the daily high and low temperature from every city in the last 7 days.

import datetime
import schedule
import time
import os

def get_weather_report():
    """
    Makes a 7-day weather report before the Monday report date.
    """
    report_date = datetime.date.today()
    start_date = report_date - datetime.timedelta(days=7)
    end_date = report_date - datetime.timedelta(days=1)

    weather_query = f'''
    select city, temp_lo, temp_hi, date
    from weather
    where date between '{start_date}' and '{end_date}'
    ;
    '''

    username = os.environ['PSQL_USERNAME']
    password = os.environ['PSQL_PASSWORD']

    engine = create_engine(f'''postgresql+psycopg2://{username}:{password}@localhost:5432/mydb''')
    df = pd.read_sql_query(weather_query, engine)
    df.to_csv(f'{report_date}_weather_report.csv', index=False)
    print(f'Finished report for {report_date}'.)


if __name__ == '__main__':

    schedule.every().monday.at('02:00').do(get_weather_report)

    while True:
        schedule.run_pending()
        time.sleep(1)

Now I just need to open up command line and run

python weather-report.py

on Friday and come back to my report on Monday.

Questions

What about logging?

In this post, I only used print statements to print to screen and skipped saving the logs to a file. Most of the time, I only care about the logs if my job fails. Therefore, I am going to introduce logging when I talk about how to send emails with attachments in Python. Not today.

There is a logging module in Python, but tuning the logger levels will be out of scope for this tutorial.

What about other scheduling tools?

There are a plenty of tools out there for scheduling. For example,

  • in Unix, there is Cron
  • in Windows, there is Task Scheduler

Another option is using a workflow engine, such as Apache Airflow. In my opinion, however, the workflow engines out there are more or less designed for production batch data jobs that run for many hours.

I chose schedule because my goal for this series of tutorials was to provide the easiest configurable tools to automate the boring data stuff away. I do plan to write a supplemental post on how to use Cron and Task Scheduler and the tradeoffs later, but for now, let’s stick to the simple option.

What if my machine crashes?

This section is a work in progress

Cron solution

If it is a remote Unix machine that is maintained by IT and you already know how to use Cron, try using the @reboot special string in crontab to make sure the Python script is executed after the system reboots. For example, see this question. I will cover Cron in another post.

What’s next?

In the next post, I will show you how to email the results and close the boring report cycle.


The PyderPuffGirls

Please feel free to post in the comments section or tweet at @ChangLeeTW for questions or comments.

Written on December 12, 2018