The Alt-Ac Job Beat Newsletter Post 13

Hi Everyone,

In this post I am going to discuss in more detail a concept in software engineering often referred to as CI/CD -- continuous integration and continuous deployment. (It is not just police departments that come up with acronyms!) To me, CI/CD is the single most important software engineering concept for improving the quality of your work.

The general idea behind CI/CD is that when you edit your code, you have automated tests that tell you whether the code is still correct. If your code passes the tests, it can be deployed into the production environment. (I realize this is likely a lot of new terminology, hence the entire post.)

For many people in our field who do analytics, this is all foreign -- they get a request, do some analysis, and then hand off the results. To me this is the biggest difference between groups that do ad-hoc data analytics and more serious software engineering groups that deploy predictive models. Consider two scenarios, in which a police department asks me to build a predictive model to help them identify chronic offenders on a monthly basis.

Scenario 1: I write computer code to create the predictions and give back a ranked list in month 1. Next month, I need to update my code to return a new list.

Scenario 2: I write computer code that automates the process of generating predictions at a given point in time. I then set up the code to run once a month.

Scenario 1 is how the majority of data analysts approach this, but Scenario 2 is much better from a software engineering perspective. Why? Automating the process is itself an auditing step -- if you need to come back and ask "why are the results between month 1 and month 2 so discrepant?", in the first scenario you might not have a trail of the changes that were made. In Scenario 2, as part of creating the code you often also create tests to check that the code works as expected given known inputs. Finally, Scenario 2 should be entirely self-contained, so if you leave the police department, the code should just keep running as usual.

I believe there are many errors in data analysis code, but because we do not check against known inputs (the standard is more like "I ran the numbers and they look generally plausible"), no one notices them. The approach in Scenario 2 catches these errors more often than ad-hoc analytics does, and it is a goal to strive for. Scenario 1 will eventually result in errors (putzing with new code every month), and writing tests with known inputs/outputs is a way to catch when you make them.

I know the ideal of fully automated code is not always possible, and it definitely takes more long term planning to accomplish. But again, it is in my opinion the most important software engineering concept for improving the quality of your work.

Writing tests for code

What does it mean to write a test for code? Imagine you have a simple function, f, that takes two inputs and adds them together:

def f(a, b):
    return a + b

To test this function, you take known inputs and check that the answer is correct. For example, you might then do the following (just think of assert here as checking that the statement is true, and raising an error if it is not):

assert 5 == f(3, 2)
assert 0 == f(-1, 1)
assert 1 == f(0, 1.0)

This may seem trivial; about the only special thing I show here is that the final example mixes an integer and a float. (When testing floating point numbers, you often want to test the relative error, see the sketch a bit further down.) Another thing you can test for is graceful errors when something is passed into your program that you did not expect.

assert f(0, None) is None
assert f('a', 'b') is None

These tests will fail with the function above, because it just assumes you will give it two numbers. You would need to rewrite the function to return None when it is given non-numeric inputs.
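
One of several reasonable ways to do that rewrite, here a minimal sketch using an isinstance check:

def f(a, b):
    # return None for anything that is not an int or a float
    # (booleans are technically ints in python, ignore that wrinkle here)
    if not isinstance(a, (int, float)) or not isinstance(b, (int, float)):
        return None
    return a + b

assert f(0, None) is None
assert f('a', 'b') is None
assert f(3, 2) == 5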

This may seem trivial, but it is less trivial when, say, you write a new computational method. It is a good idea to test your code on an input where you know the correct answer. For data analysis with complicated queries, you can likewise test the queries on known inputs to make sure the SQL behaves as you expect. These inputs should often include missing data or other edge cases, to make sure your code behaves correctly even when the inputs don't.
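
As a quick aside on the floating point comment above, exact equality checks can fail due to rounding, so you typically check that the numbers are close within a tolerance. A minimal sketch using python's built-in math.isclose (pytest also provides pytest.approx for the same purpose):

import math

# 0.1 + 0.2 is not exactly 0.3 in floating point, so an exact check fails
print(0.1 + 0.2 == 0.3)  # prints False

# instead, check the relative error is small
assert math.isclose(0.1 + 0.2, 0.3)

# with pytest you could equivalently write:
# assert (0.1 + 0.2) == pytest.approx(0.3)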

In terms of writing unit tests for actual code, for R check out Jacob Kaplan's book. For python I use pytest. My favorite intro to pytest is the LinkedIn Learning course Unit Testing in Python with Jasmine Omeke. But just as easy, I think, is to check out actual projects. I have created a simple python package, retenmod, to showcase python projects (and keep it from being uber complicated).
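
To give a flavor of what that looks like, here is a minimal sketch of a pytest test file (the file name and contents are just an example, not taken from retenmod). Pytest collects files named test_*.py and runs any function whose name starts with test_:

# test_addition.py -- a hypothetical example test file
# run from the command line in the project folder with:  pytest

def f(a, b):
    # the toy function from earlier, copied here so the file is self-contained
    return a + b

def test_add_integers():
    assert f(3, 2) == 5

def test_add_negative():
    assert f(-1, 1) == 0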

The final part of this -- to test code you need to write functions -- is itself an additional part of professional software development. Many data analysis projects have code that is just a long series of transformations on data followed by the subsequent analysis. Professional software development does not look like that: you write your code as modularized functions, and the final script is often just running those functions in sequence.

Writing functions is itself a way to make your code reusable, as well as to improve its quality.
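
To make that concrete, here is a hypothetical skeleton of the monthly chronic offender example from earlier; the file and function names are made up for illustration:

# predict_monthly.py -- hypothetical skeleton, not a real implementation

def get_data(month):
    # pull the raw incident data for the given month
    ...

def prep_features(raw_data):
    # clean the data and construct the model inputs
    ...

def make_predictions(features):
    # score the model and return a ranked list of chronic offenders
    ...

def main(month):
    # the final script is just the functions run in sequence
    raw = get_data(month)
    feats = prep_features(raw)
    return make_predictions(feats)

if __name__ == "__main__":
    ranked_list = main("2024-01")

Each of those smaller functions is then something you can write unit tests for.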

Using Github actions

Github actions is a tool where, when you change your code, github runs tests like those I talked about above (on its own virtual machines). Github actions will then send error messages if your code fails. This is the "continuous" part of CI/CD: if your change passes the tests, it gets pushed to where it needs to go.

To see some github actions in situ, again you can check out my example python package, retenmod. This is a fairly simple python package I created as a way to showcase different aspects of python package creation and management. Here I will walk through a simple github actions script and describe what it does.

To first create an action, you can create it on the github website (click the Actions button next to the Pull requests button). This will create a .github/workflows folder in the repository, and within that folder you can create different workflow yml files. YAML files are just plain text scripts that have instructions formatted in a particular way.

Here is a basic workflow example, and I will discuss each part in turn:

# Part 1 setting up when the action is triggered
# here, when code in the src folder is edited
on:
  push:
    paths:
      - 'src/**'

# Part 2
# Setting up the job
jobs:
  tests:  # the job id, tests here, can be any name
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - name: Getting github contents
        uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
          cache: 'pip'
      - name: Install python packages
        run: pip install -r requirements.txt
      - name: Python tests
        run: |
          # run multiple commands
          flake8 --ignore=E203 ./src
          pytest --disable-warnings ./tests

For part 1, you tell the yaml file when to run. A few different examples: you can run the workflow when certain files are edited -- here the workflow only runs when a file inside the src folder is edited. Another example is running the workflow when pushing any changes to a particular branch, e.g. run a workflow when pushing to main. A final example is running workflows on a regular schedule (I use github actions to do some simple webscraping once a night or once a week, for example).

Part 2 is where most of the work happens. First, we have runs-on: ubuntu-latest. Github spins up what are called virtual machines to run your code, and these machines can be different operating systems. For example, you could use macOS-latest for macs, or windows-latest for windows. For a somewhat more complicated example, see my R package ptools, where the tests run on multiple operating systems (e.g. a matrix that loops over each operating system).

The big thing to note here, which I see trips up some people, is that github runs these tests on a virtual machine. So if your tests query a local database, that won't work on github actions, since the virtual machine does not have access to that local data. Tests need to be self-contained. (These types of tests are specifically called unit tests; there are other types of tests that work with your data, such as integration tests, but just focus on unit tests as a beginner!)
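
For example, instead of a test that queries your records management system, you can hard code a tiny table inside the test itself. A minimal sketch, assuming pandas is installed; count_incidents is a made-up stand-in for whatever logic you actually want to check:

import pandas as pd

def count_incidents(df):
    # made-up example logic: count rows per offender, dropping missing ids
    return df.dropna(subset=["offender_id"]).groupby("offender_id").size()

def test_count_incidents():
    # a tiny, hard coded input the test fully controls -- no database needed
    df = pd.DataFrame({"offender_id": ["A", "A", "B", None]})
    counts = count_incidents(df)
    assert counts["A"] == 2
    assert counts["B"] == 1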

Second, we have timeout-minutes. Github actions is very generous -- you can run quite a bit of work for free. But you want to make sure you do not make a mistake in your code and have the workflow run indefinitely, so it is good practice to set a timeout limit. (For my examples of scraping data, you would not want tasks that take more than a few minutes in github actions.)

Then we have steps. The first two steps set up the environment and run special actions created by github. The first one, checkout, is in pretty much every github action; it copies the files from the github repository onto the virtual machine. The second, setup-python, installs a version of python on the virtual machine. (There are also user contributed actions, e.g. to install R you can use uses: r-lib/actions/setup-r@v1.)

The python action has a second part to it, with. The with block is just arguments to the specific action, and it will change for each action. For the python action, you can set up a cache for the installed packages, as well as specify the python version. Caching packages is an easy win -- every time the action runs, you do not need to redownload a big package like pandas, it is already saved. (Here is an example of creating a manual cache for large downloads, for a large language model, but it can really be anything.)

The third step installs the python packages. The file requirements.txt, listing the packages to install, should be available in the root of your project. The run command just runs whatever shell commands you want.

The fourth step runs the python tests. You can have multiple commands on different lines; here it runs the python linting tool flake8 first (you might also want to check out the black library), and then runs pytest.

There are many more things you can do with github actions. You can push artifacts once tests pass (such as building a python wheel file and pushing it to pypi). Another thing I have not covered is the idea of secrets -- you can have environment variables that hold tokens or passwords to access special things (such as reading/writing from another location, like an AWS S3 bucket).
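
On the python side, a secret usually shows up in your script as an environment variable that the workflow passes in. A minimal sketch, where the variable name MY_API_TOKEN and the workflow wiring are hypothetical:

import os

# in the workflow yml, a step would map a repository secret to this variable,
# e.g.  env:  MY_API_TOKEN: ${{ secrets.MY_API_TOKEN }}
token = os.environ.get("MY_API_TOKEN")
if token is None:
    raise RuntimeError("MY_API_TOKEN environment variable is not set")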

Wrapping Up

I know this was quite a bit of info for a single post. In learning to code, to me the hard part is not writing code, it is understanding the broader aspects of the system you are building. How your code fits into the other parts of the system is the hard part, not writing the actual text file with the commands. Understanding CI/CD, unit tests, and creating github actions to run those tests in an automated fashion is, to me, a big step up in maturity -- from running ad-hoc scripts to really understanding software engineering and making sure you are producing reliable code.

So spending some time learning to write unit tests and create github actions is, to me, worth the effort.

Best, Andy Wheeler