Data science is no longer a niche topic at companies. Everyone from the CEO to the intern knows how valuable it is to take a scientific approach to dealing with data. Consequently, many people outside of traditional software engineering roles are starting to write more code, often in the form of interactive notebooks such as Jupyter. Software engineers have typically been huge advocates of build systems, static analysis of code, and repeatable processes that enforce quality. What about business people who are writing code in Jupyter notebooks? What processes can they use to make their data science, machine learning, and AI code more reliable?
Data science project quality
One way to improve software quality for data science is to create a project structure that ensures quality and repeatability. To do this, some ideas can be borrowed from the traditional software engineering world. Brian Kernighan, co-author of the AWK programming language and “K and R C”, summarized the true nature of software development in the book Software Tools when he stated, “Controlling complexity is the essence of computer programming.”
In a previous article I wrote about software engineering project quality in Python, I said the following:
“The first step in the process of writing high quality code is to re-examine the entire thought process of how an individual or team develops software. Often in failed, or troubled, software development projects, the software was developed in a reactionary stream of consciousness where the focus of the software development was on getting a problem solved in any manner possible. In a successful software project, the developer is thinking not only about how to solve the problem at hand, but additionally about the process involved in solving the problem.
A successful software developer will devise a way to run tests in an easily automated fashion, so they can continuously prove the software works. They are aware of the dangers of needless complexity. They are humble in their approach, seek critical review, and expect refactoring at every step of the way. They continuously think about how they can ensure their software is testable, readable, and maintainable.”
The same statement holds true for data science projects: there needs to be an automated way to ensure quality is enforced. Fortunately, with a service like CircleCI and a few open source libraries, this is easily achievable. The sections below demonstrate this step by step.
Data science project automated testing setup
One of the best ways to get automated testing for a data science project is to set it up properly from the start. What does that look like?
- Create a GitHub project. Create a new project like this example repo.
- Create a `.circleci` directory with a `config.yml` file in it. This is an example `config.yml` you could refer to (a rough sketch of one also appears just after this list).
- Create a `.gitignore` file. It is important to ignore non-essential files.
- Create a `README.md` file. A good `README.md` should show how a user builds the project and what the project does. Including a badge that shows the status of the CircleCI build is very helpful as well, like this example.
- Create a `Makefile`. A `Makefile` is a common way to run steps in a build process and has been around for decades… for a reason… they just work. We will cover how to set this up for a data science project.
- Other important files and directories that are optional: a library directory, command-line tools, a `requirements.txt` file, and a tests directory.
A good place to start is to look at a `Makefile` as a template. The contents of `myrepo/Master/Makefile` are shown below and can be found here:
```makefile
setup:
	python3 -m venv ~/.myrepo

install:
	pip install -r requirements.txt

test:
	python -m pytest -vv --cov=myrepolib tests/*.py
	python -m pytest --nbval notebook.ipynb

lint:
	pylint --disable=R,C myrepolib cli web

all: install lint test
```
The key steps are: `setup`, `install`, `test`, `lint`, and `all` (runs everything). The `setup` step creates an optional virtual environment, which can later be activated by running the command:

```bash
source ~/.myrepo/bin/activate
```
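Putting those targets together, a typical local session (assuming the Makefile above) might look like this:

```bash
# One possible local workflow built from the Makefile targets above
make setup                      # create the virtual environment at ~/.myrepo
source ~/.myrepo/bin/activate   # activate it in the current shell
make all                        # run install, lint, and test in one shot
```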
The `install` step, which can be run as `make install`, installs the packages listed in the `requirements.txt` file. An example is found here.
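For this project, a `requirements.txt` along these lines would cover the tools the Makefile invokes (versions omitted here; the linked example is authoritative):

```
pylint
pytest
pytest-cov
nbval
```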
The `lint` step, which makes sense if libraries, command-line tools, or web apps are created, can be run with `make lint`. It wouldn’t make sense to run it on just Jupyter notebooks, but it does help to maintain the quality of the code associated with the project. Below is an example output from lint, which can also be found here:
```
(.myrepo) ➜  myrepo git:(master) ✗ make lint
pylint --disable=R,C myrepolib cli web
No config file found, using default configuration

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)
```
The final and most important step is running `make test`. This uses `pytest` along with the `nbval` plugin. The output is shown below and can also be found here.
```
(.myrepo) ➜  myrepo git:(master) ✗ make test
python -m pytest -vv --cov=myrepolib tests/*.py
============================= test session starts =============================
platform darwin -- Python 3.6.4, pytest-3.3.0, py-1.5.2, pluggy-0.6.0 -- /Users/noahgift/.myrepo/bin/python
cachedir: .cache
rootdir: /Users/noahgift/src/myrepo, inifile:
plugins: cov-2.5.1, nbval-0.7
collected 1 item

tests/test_myrepo.py::test_func PASSED                                  [100%]

---------- coverage: platform darwin, python 3.6.4-final-0 -----------
Name                    Stmts   Miss  Cover
-------------------------------------------
myrepolib/__init__.py       1      0   100%
myrepolib/repomod.py       11      4    64%
-------------------------------------------
TOTAL                      12      4    67%

============================== 1 passed in 0.02 seconds =======================
python -m pytest --nbval notebook.ipynb
============================= test session starts =============================
platform darwin -- Python 3.6.4, pytest-3.3.0, py-1.5.2, pluggy-0.6.0
rootdir: /Users/noahgift/src/myrepo, inifile:
plugins: cov-2.5.1, nbval-0.7
collected 4 items

notebook.ipynb ....                                                     [100%]

============================== warnings summary ===============================
notebook.ipynb::Cell 0
  /Users/noahgift/.myrepo/lib/python3.6/site-packages/jupyter_client/connect.py:157: RuntimeWarning: Failed to set sticky bit on '/var/folders/vl/sskrtrf17nz4nww5zr1b64980000gn/T': [Errno 1] Operation not permitted: '/var/folders/vl/sskrtrf17nz4nww5zr1b64980000gn/T'
  RuntimeWarning,

-- Docs: http://doc.pytest.org/en/latest/warnings.html
===================== 4 passed, 1 warnings in 2.08 seconds ====================
```
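For reference, the single collected test above could be as simple as the following sketch. The imported module exists in the example repo, but the function name and its behavior here are assumptions for illustration, not taken from the repo:

```python
# Hypothetical sketch of tests/test_myrepo.py; the function under test is
# assumed for illustration, not copied from the example repo.
from myrepolib import repomod


def test_func():
    # pytest collects any function named test_*; pytest-cov records which
    # lines of myrepolib this call exercises for the coverage report.
    result = repomod.func()  # assumed function name
    assert result is not None
```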
It is worth explaining how the `nbval` plugin works. In a nutshell, it runs the Jupyter notebook for you and ensures that all of the cells execute. There are two modes that can be used: one that checks the output of each cell against the output saved in the notebook, and one that only checks that the cells run. The output-checking mode can be a bit tricky to get working because cells often contain random images or other nondeterministic output, which causes the tests to fail on each subsequent run.
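Concretely, the two modes correspond to two pytest flags (flag names per the nbval docs; only the first is used in the Makefile above):

```bash
# Compare each cell's fresh output against the output stored in the notebook:
python -m pytest --nbval notebook.ipynb

# Only check that every cell executes without an error, ignoring outputs:
python -m pytest --nbval-lax notebook.ipynb
```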
With all of that out of the way, there is very little left to do to get CircleCI running. That information is covered in the CircleCI docs. A final cherry on the sundae would be to get a badge working, which is also covered in the official documentation here, and there is an example in the repo shared for this article.
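As an illustration, a CircleCI status badge in a `README.md` generally follows this pattern; the placeholders are yours to fill in, and the exact URL for your project is given in the CircleCI docs:

```markdown
[![CircleCI](https://circleci.com/gh/<username>/<repo>.svg?style=svg)](https://circleci.com/gh/<username>/<repo>)
```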
Summary
This article showed how to bootstrap a data science project, set up the GitHub structure, run tests, and then send it off to CircleCI to do the build. There is a video I created on YouTube that shows exactly how to set up and test this project here. Another great resource would be to read about how I use CircleCI throughout the book Pragmatic AI: An Introduction to Cloud-Based Machine Learning. Links to that are in the references.
Noah Gift is a lecturer and consultant at both the UC Davis Graduate School of Management MSBA program and the Graduate Data Science program, MSDS, at Northwestern, where he teaches and designs graduate machine learning, AI, and data science courses and consults on machine learning and cloud architecture for students and faculty. He has published close to 100 technical publications, including two books, on subjects ranging from cloud machine learning to DevOps. His most recent book is Pragmatic AI: An Introduction to Cloud-Based Machine Learning (Pearson, 2018).
References
- Example CircleCI repo discussed in this article
- CircleCI Project Setup Video
- Pragmatic AI: An Introduction to Cloud-based Machine Learning Source Code and Book
- Writing clean, testable, high quality code in Python