Data science is no longer a niche topic at companies. Everyone from the CEO to the intern knows how valuable it is to take a scientific approach to dealing with data. Consequently, many people not directly in software engineering fields are starting to write more code, often in the form of interactive notebooks, such as Jupyter. Software engineers have typically been huge advocates of build systems, static analysis of code, and repeatable processes to enforce quality. What about business people who are writing code in Jupyter Notebooks? What processes can they use to make their data science, machine learning, and AI code more reliable?
Data science project quality
One way to improve software quality for Data Science is to create a project structure that ensures quality and repeatability. To do this, some ideas can be taken from the traditional software engineering world. Brian Kernighan, co-author of the AWK programming language and “K and R C”, summarized the true nature of software development in the book, Software Tools, when he stated, “Controlling complexity is the essence of software development.”
In a previous article I wrote about software engineering project quality in Python, I said the following:
“The first step in the process of writing high quality code is to re-examine the entire thought process of how an individual or team develops software. Often in failed, or troubled, software development projects, the software was developed in a reactionary stream of consciousness where the focus of the software development was on getting a problem solved in any manner possible. In a successful software project, the developer is thinking not only about how to solve the problem at hand, but additionally about the process involved in solving the problem.
A successful software developer will devise a way to run tests in an easily automated fashion, so they can continuously prove the software works. They are aware of the dangers of needless complexity. They are humble in their approach, seek critical review, and expect refactoring at every step of the way. They continuously think about how they can ensure their software is testable, readable, and maintainable.”
The same statement is true of data science projects; there needs to be an automated way to ensure quality is enforced. Fortunately, with a service like CircleCI and open source libraries this is easily achievable. In the sections below, this will be demonstrated step by step.
Data science project automated testing setup
One of the best ways to have a proper automated testing setup for a Data Science project is to set it up properly from the start. What does that look like?
- Create a GitHub project. Create a new project like this example repo.
- Create a .circleci directory with a config.yml file in it. This is an example config.yml you could refer to.
- Create a .gitignore file. It is important to ignore non-essential files.
- Create a README.md file. A good README.md should be able to show how a user builds the project and what the project does. Including a badge that shows the status of the CircleCI build is very helpful as well, like this example.
- Create a Makefile. A Makefile is a common way to run steps in a build process and has been around for decades for a reason: it just works. We will cover how to set this up for a data science project.
- Other important files and directories that are optional are: library directory, command-line tools, requirements.txt and tests directory.
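The checklist above can be sketched as a small Python script. This is only an illustration, not part of the example repo: the scaffold function and the FILES list are hypothetical names, and the file paths are taken from the checklist and from the test output shown later in this article. It creates empty placeholder files and leaves their contents to you.

```python
import tempfile
from pathlib import Path

# Hypothetical layout mirroring the checklist; adjust names to your project.
FILES = [
    ".circleci/config.yml",
    ".gitignore",
    "README.md",
    "Makefile",
    "requirements.txt",
    "myrepolib/__init__.py",
    "tests/test_myrepo.py",
]


def scaffold(root):
    """Create empty placeholder files under root, skipping any that exist.

    Returns the list of relative paths that were newly created.
    """
    root = Path(root)
    created = []
    for relative in FILES:
        path = root / relative
        # Make sure directories such as .circleci/ and tests/ exist first.
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():
            path.touch()
            created.append(relative)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for name in scaffold(tmp):
            print("created", name)
```

Running the function a second time against the same directory creates nothing, so it is safe to use on a project that already has some of these files.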
A good place to start is to look at a Makefile as a template. The contents of myrepo/Master/Makefile are shown below and can be found here:
    setup:
    	python3 -m venv ~/.myrepo

    install:
    	pip install -r requirements.txt

    test:
    	python -m pytest -vv --cov=myrepolib tests/*.py
    	python -m pytest --nbval notebook.ipynb

    lint:
    	pylint --disable=R,C myrepolib cli web

    all: install lint test
The key steps are setup, install, test, lint, and all (runs everything):
- The setup step creates an optional virtual environment, which could later be sourced by running the command: source ~/.myrepo/bin/activate
- The install step, which can be run as make install, installs the packages listed in the requirements.txt file. An example is found here.
- The lint step, which makes sense if libraries, command-line tools, or web apps are created, can be run with make lint. It wouldn’t make sense to run on just Jupyter notebooks, but it does help to maintain the quality of the code associated with the project.
Below is an example output from lint that can also be found here:
    (.myrepo) ➜ myrepo git:(master) ✗ make lint
    pylint --disable=R,C myrepolib cli web
    No config file found, using default configuration

    --------------------------------------------------------------------
    Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)
The test step, run with make test, runs the unit tests under pytest with coverage reporting and then validates the notebook with the nbval plugin:

    (.myrepo) ➜ myrepo git:(master) ✗ make test
    python -m pytest -vv --cov=myrepolib tests/*.py
    ============================================================ test session starts ============================================================
    platform darwin -- Python 3.6.4, pytest-3.3.0, py-1.5.2, pluggy-0.6.0 -- /Users/noahgift/.myrepo/bin/python
    cachedir: .cache
    rootdir: /Users/noahgift/src/myrepo, inifile:
    plugins: cov-2.5.1, nbval-0.7
    collected 1 item

    tests/test_myrepo.py::test_func PASSED [100%]

    ---------- coverage: platform darwin, python 3.6.4-final-0 -----------
    Name                    Stmts   Miss  Cover
    -------------------------------------------
    myrepolib/__init__.py       1      0   100%
    myrepolib/repomod.py       11      4    64%
    -------------------------------------------
    TOTAL                      12      4    67%

    ========================================================= 1 passed in 0.02 seconds ==========================================================
    python -m pytest --nbval notebook.ipynb
    ============================================================ test session starts ============================================================
    platform darwin -- Python 3.6.4, pytest-3.3.0, py-1.5.2, pluggy-0.6.0
    rootdir: /Users/noahgift/src/myrepo, inifile:
    plugins: cov-2.5.1, nbval-0.7
    collected 4 items

    notebook.ipynb .... [100%]

    ===================================================================== warnings summary ======================================================================
    notebook.ipynb::Cell 0
      /Users/noahgift/.myrepo/lib/python3.6/site-packages/jupyter_client/connect.py:157: RuntimeWarning: Failed to set sticky bit on '/var/folders/vl/sskrtrf17nz4nww5zr1b64980000gn/T': [Errno 1] Operation not permitted: '/var/folders/vl/sskrtrf17nz4nww5zr1b64980000gn/T'
        RuntimeWarning,

    -- Docs: http://doc.pytest.org/en/latest/warnings.html
    =========================================================== 4 passed, 1 warnings in 2.08 seconds ============================================================
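To make the coverage numbers in the test output more concrete, here is a minimal sketch of how a library function and its test could produce a partially covered module. The summarize function and its test are hypothetical and are not the actual contents of myrepolib/repomod.py or tests/test_myrepo.py; in a real project the two would live in separate files and the test would import the function from the library.

```python
# Hypothetical library code (would live in a module such as myrepolib/repomod.py).
def summarize(values):
    """Return the mean of values, or None for an empty input."""
    if not values:
        # No test below exercises this branch, so a coverage report
        # would count this line as missed.
        return None
    return sum(values) / len(values)


# Hypothetical test (would live in a file such as tests/test_myrepo.py).
# pytest collects functions whose names start with "test_".
def test_summarize():
    assert summarize([1, 2, 3]) == 2.0
```

Adding a second test that calls summarize with an empty list would cover the remaining branch and raise the reported percentage.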
It is worth explaining how the nbval plugin works. In a nutshell, it runs the Jupyter notebook for you and ensures that all of the cells execute. There are two modes that can be used: one that actually checks the output of each cell, and one that doesn’t. The method that checks the output of each cell can be a bit tricky to get to work because many times random images or output are in cells, and the tests will fail on each subsequent run.
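The difference between the two modes can be illustrated with a simplified, standard-library-only sketch. This is not how nbval is implemented: the run_notebook function and the NOTEBOOK dictionary are hypothetical, the notebook is a hand-written stand-in for a parsed .ipynb file, and only plain printed text is compared. It does show why a cell with random output passes when outputs are ignored and fails when they are checked.

```python
import contextlib
import io


def run_notebook(notebook, check_output=False):
    """Execute every code cell in order and return a list of problems.

    Each problem is a (cell_index, description) tuple. With
    check_output=True, the text printed by a cell is also compared
    with the stream output stored in the notebook.
    """
    namespace = {}  # shared across cells, like a notebook kernel session
    problems = []
    for index, cell in enumerate(notebook["cells"]):
        if cell["cell_type"] != "code":
            continue
        source = "".join(cell["source"])
        captured = io.StringIO()
        try:
            with contextlib.redirect_stdout(captured):
                exec(source, namespace)
        except Exception as error:
            problems.append((index, "raised " + type(error).__name__))
            continue
        if check_output:
            stored = "".join(
                "".join(output.get("text", ""))
                for output in cell.get("outputs", [])
                if output.get("output_type") == "stream"
            )
            if captured.getvalue() != stored:
                problems.append((index, "output differs from stored output"))
    return problems


# Hypothetical notebook: one deterministic cell and one random cell.
NOTEBOOK = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Demo"]},
        {"cell_type": "code",
         "source": ["x = 2 + 2\n", "print(x)"],
         "outputs": [{"output_type": "stream", "name": "stdout", "text": ["4\n"]}]},
        {"cell_type": "code",
         "source": ["import random\n", "print(random.random())"],
         "outputs": [{"output_type": "stream", "name": "stdout",
                      "text": ["0.8444218515250481\n"]}]},
    ]
}

if __name__ == "__main__":
    print("ignoring outputs:", run_notebook(NOTEBOOK))
    print("checking outputs:", run_notebook(NOTEBOOK, check_output=True))
```

In nbval itself, the output-checking mode is selected with the --nbval option used in the Makefile, and the mode that only checks that cells run is selected with --nbval-lax. If the stricter mode fails on cells with random output, switching the Makefile to --nbval-lax is one option to consider.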
With all of that out of the way, there is very little left to do to get CircleCI running. That information is covered in the CircleCI docs. A final cherry on the sundae would be to get a badge working, which is also covered in the official documentation here, and there is an example in the repo shared for this article.
This article showed how to bootstrap a data science project, set up the GitHub structure, run tests, and then send it off to CircleCI to do the build. There is a video I created on YouTube that shows exactly how to set up and test this project here. Another great resource is the book Pragmatic AI: An Introduction to Cloud-based Machine Learning, in which I use CircleCI throughout. Links to that are in the references.
Noah Gift is a lecturer and consultant at both the UC Davis Graduate School of Management MSBA program and the Graduate Data Science program (MSDS) at Northwestern, where he teaches and designs graduate machine learning, AI, and data science courses and consults on machine learning and cloud architecture for students and faculty. He has published close to 100 technical publications, including two books, on subjects ranging from cloud machine learning to DevOps. His most recent book is Pragmatic AI: An Introduction to Cloud-Based Machine Learning (Pearson, 2018).
- Example CircleCI Repo Discussed in Article
- CircleCI Project Setup Video
- Pragmatic AI: An Introduction to Cloud-based Machine Learning Source Code and Book
- Writing clean, testable, high quality code in Python