A Python Tutorial on Automating Boring Data Workflow—The PyderPuffGirls
Back in November, I started an internal training for my colleagues on walking through basic automation in Python for data analysis. This post is a guideline on how to use the training material.
I started this tutorial because I believe that most Python tutorials are about writing applications and not scripts. But I write more scripts than applications at my job—how can I automate away my data analysis workflow with scripts?
This post has two parts:
- A guide to the PyderPuffGirls posts
- How to solve the most common Windows Python setup issues
I might make another post on how to use the command line on Windows if we decide to start a round two training.
Disclaimer Too many people have asked me this question so let me clarify where this name came from: when we started, we wanted to give this series a name. After a heated discussion, my colleague Keith Finch came up with the winner. I took and ran with it 😂. As a prize, Keith gets a permanent spot in the post:
Keith with the Python Staff of Enlightenment†
PyderPuffGirls has a total of 8 episodes that form 3 groups.
Group 1: Automating Excel reports
This group will take you from getting data in a database to an Excel spreadsheet for your users. If this looks like part of your job:
- Run SQL queries
- Download data to PC
- Turn the data into an Excel spreadsheet
then you might find the following posts helpful.
- Episode 1: A Python Tutorial for the Bored Me - why I wrote this series
- Episode 2: How to Query a Database in Python - how to run queries
- Episode 7: Replace the Home Tab in Excel with Python - how to make Excel reports
- Episode 8: Making Excel Charts, Formulas, and Tables with Python
Note: I am planning to update Episode 7 later with a clip in Excel on formatting strings based on Jared’s suggestion.
Group 2: Getting rid of two clicks
These two posts aim to remove two mouse clicks: when to start, and how to give the report to others via email.
In other words, this two posts turn your workflow into an asset you can run it while you sleep:
- Episode 3: Don’t Wait, Schedule and Relax Instead - how to use
- Episode 4: Filling Up Your Inbox with Goodies - how to send email via SMTP
Group 3: Reducing SQL headaches
These two posts deal with the problem of simplifying a long and repetitive SQL query that leads to mistakes and errors.
- Episode 5: Untangle the SQL Mess with Jinja - templating the queries
- Episode 6: Make a Workflow Config with YAML - metaprogramming
Appendix: Troubleshooting Windows
There are a few things for Python users on Windows that will take a while to figure out for the first time. I am putting the most common issues here.
Where do I install Python?
When installing Python, I recommend installing in the
AppData folder. For example, on my laptop, I put in
The reason is that
C:/Program Files, the default in Python installation, requires local admin rights—which can be a big headache if you work a corporate job. Putting it in
AppData\Local means you don’t have to worry about this.
Python/Pip is not recognized
A common problem on Windows is that running
pip in a command line gives
python is not recognized as an internal or external command
pip is not recognized as an internal or external command
The problem is that, when you type a command into Windows command line (cmd, PowerShell, Git Bash, etc), it will look scan all the folders defined in the “PATH variable” to find the corresponding executable (
.exe files). But
pip is not added to PATH variable by default. Therefore, what I need to do is to add them into my PATH variable.
Steps for Windows 10:
- Type “path” in the search bar
- Choose “Edit environment variable for your account”
- In user variable, select PATH and click Edit
- Select “New” and put the Python base folder and the
Scriptsfolder in there.
For example, if I installed Python in
C:\Users\chang\AppData\Local\Programs\Python\Python37\, then I can check if I do have a
python.exe in there:
In this case, the two folders I need to add into my PATH are
Steps for Windows 7:
The steps for Windows 7 is almost identical to Windows 10, except that paths in PATH variable is concatenated into a single string separated by a semicolon
;. I usually copy the string to Notepad++, edit it, then paste it back in.
Scripts is the equivalent of
bin folder for Unix-based systems.
File path in Windows
Windows and Unix use different delimiter for the file paths:
- Windows uses backslashes e.g.
- Unix uses forward slashes e.g.
In Python, backslashes are reserved for escaping characters, that is, to tell Python that anything that follows a backslash is used with a special purpose.
This creates a problem for Windows file paths and many
FileNotFound errors. Let’s say I want to read in a file called
C:\test stuff\words.txt, what do I put in
file_path = ??? with open(file_path) as file: text = file.readlines(file_path)
There are several options:
1. Use two backslashes
file_path = 'C:\\test stuff\\words.txt'
2. Use raw string
A string with an
r prefix is a raw string. This means anything is read as-is and nothing is escaped—so backslash does nothing.
file_path = r'C:\test stuff\words.txt'
3. Use forward slash
Most of the time, I can simply change all the backslashes to forward slash. Python will understand what is going on and pick the correct file
file_path = 'C:/test stuff/words.txt'
Use more pathlib
In using Python, I prefer using the
Path object from
pathlib for better cross-platform compatibility. This is, however, out of scope. Therefore, I’ll give you a post if you want to learn more about it:
- Trey Hunner: Why you should be using pathlib
Additional tools in the toolkit
I found that there are many tools that can help you automate the workflow, especially if you have a Windows machine. In particular,
To install Git, go to the official website.
When you install, make sure you choose “Add Path” so you get the full benefit of the Unix tools such as
nano, etc. Otherwise, you can add the
bin folder into the PATH variable for the same effect.
The native Windows command line tool
cmd is cumbersome, and PoweShell uses its own syntaxes that can be confusing to first-time users. I prefer using third-party command line tools that imitates an Unix command line like Git Bash or Cmder.
Git Bash is a good alternative to cmd but unfortunately Git bash can’t run python. On most Windows computers, running
python some_script.py in Git Bash leads to a frozen command line.
So I prefer
which a Terminal simulator is much easier to navigate than
Thanks to Brennan Hodge, Jared Cline, Caroline Collins, Rebecca Johnson, David Crane, Chris Ruesch, David Pirie and Keith Finch—the PyderPuffGirls—for joining me on this amazing trip!
Back in Episode 1, a redditor asked me to provide email subscription, so I finally set it up. If you like this series and want to keep up with other series and posts, then you might want to consider subscribing to my mailing list.
Please leave a comment and help me make it better!