Lumigo software engineer Idan Sofer outlines how he used CircleCI to proactively root out flaky integration tests in a fully serverless environment.

Integration tests are critical when you’re developing a serverless application. More than ever, the smooth running of your product depends on your code playing nicely with a web of third-party services outside your control.

The problem is that writing integration tests in a serverless environment can get complicated very quickly. The number of integration points grows with every resource you add, and the complexity lies in the configuration of those resources.

Integration tests can fail for several reasons. Sometimes a test runs long and times out. Other times, a resource is affected by the invalid state of another resource. For example, here at Lumigo, we ran into an issue where some resources relied on a DynamoDB table that needed to be empty before the start of each test. The problem was that items would sometimes be rejected from the delete batch and never deleted, causing the test to fail.
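In hindsight, that particular failure mode is avoidable: DynamoDB’s BatchWriteItem call can hand back rejected items under UnprocessedItems, and those must be retried explicitly. A minimal sketch of such a cleanup helper (hypothetical names, not Lumigo’s actual code; the client interface is simplified from the AWS SDK’s DocumentClient) might look like this:

```javascript
// Sketch only: `emptyTable` is a hypothetical helper. `client` stands in for
// a DynamoDB DocumentClient-like object exposing
// batchWrite(params) -> Promise<{ UnprocessedItems }>.
async function emptyTable(client, tableName, keys, maxRetries = 5) {
  // BatchWriteItem accepts at most 25 requests per call, so chunk the keys.
  const batches = [];
  for (let i = 0; i < keys.length; i += 25) {
    batches.push(keys.slice(i, i + 25));
  }
  for (const batch of batches) {
    let requests = batch.map((key) => ({ DeleteRequest: { Key: key } }));
    for (let attempt = 0; requests.length > 0; attempt++) {
      if (attempt > maxRetries) {
        throw new Error(`${requests.length} items still undeleted in ${tableName}`);
      }
      const res = await client.batchWrite({ RequestItems: { [tableName]: requests } });
      // Retry only the rejected deletes instead of silently leaving them behind.
      requests = (res.UnprocessedItems && res.UnprocessedItems[tableName]) || [];
    }
  }
}
```

Retrying rejected deletes until the table is verifiably empty removes this whole class of flaky failure before any test ever sees it.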

This is an example of a flaky test, where failure is not directly related to the developer’s newly added feature. Problematic tests like these can really slow the team down, and so we wanted to find a better testing strategy to identify flaky tests before the team was blocked by them.

Goals of the integration testing strategy

Let’s begin by defining the goals of this project:

  • We wanted to generate a report of the tests that failed most often so we would know which tests were flaky and needed addressing before the R&D team was blocked when they tried to merge features.
  • Secondly, we wanted to identify our slowest tests by recording the average running time of each failed test.
  • Finally, we wanted to use this strategy to identify any tests that interfere with other tests and cause them to fail.

Our integration testing stack

Since we are working with CircleCI, we’ll be using CircleCI Insights to generate the report.

Let’s get started

We have a shared environment dedicated to integration testing. When committing code for a new feature, every developer deploys their feature stack to it and runs the integration tests against it. We used this shared environment to run our proactive integration testing strategy, simulating the work of our developers. To do this, we replicated the full development cycle each engineer follows when testing a new feature in the integration environment: deploying all AWS resources in the stack, configuring them, and running the Mocha integration tests.

CircleCI jobs

We wanted to simulate several developers’ cycles at the same time. To do this, we used CircleCI jobs running in parallel, with each job representing an individual developer.

Each of these jobs ran the same steps (a step is a collection of executable commands) with only one difference - each one used a different variable to define the individual stacks: developer 1, developer 2, etc.

For example:

    deploy-and-test-developer1:
      executor: my-executor
      steps:
        - checkout_utils
        - checkout_code
        - prepare_deploy
        - deploy_and_test:
            to: "developer1"

    deploy-and-test-developer2:
      executor: my-executor
      steps:
        - checkout_utils
        - checkout_code
        - prepare_deploy
        - deploy_and_test:
            to: "developer2"

    deploy-and-test-developer3:
      executor: my-executor
      steps:
        - checkout_utils
        - checkout_code
        - prepare_deploy
        - deploy_and_test:
            to: "developer3"

    deploy-and-test-developer4:
      executor: my-executor
      steps:
        - checkout_utils
        - checkout_code
        - prepare_deploy
        - deploy_and_test:
            to: "developer4"

CircleCI commands

Since the steps were the same in each job (except for the variable), we used CircleCI commands to define a sequence of steps to be executed in the job, which let us reuse a single command definition across multiple jobs.

To handle the variable that identifies each job individually, we used command parameters, which let us pass in the string value for the variable by key.

Here are our command definitions, including deploy_and_test with its parameters:

    checkout_utils:
      description: "Checkout various utilities"
      steps:
        # checkout git utils

    checkout_code:
      description: "Checkout code and test it"
      steps:
        - checkout
        - run:
            # Avoid annoying double runs after deploy.
            name: Check if tagged
            command: |
              tags=$(git tag -l --points-at HEAD)
              echo "Tags $tags"
              if [[ ! -z "$tags" ]]; then
                echo "A tagged commit, skip..."
                circleci step halt
              fi
        - run: sudo chown -R circleci:circleci /usr/local/bin
        - run: sudo chown -R circleci:circleci /usr/local/lib/python3.7/site-packages

        # Download and cache dependencies
        - restore_cache:
            keys:
              - v1-dependencies-{{ checksum "requirements.txt" }}
              # fallback to using the latest cache if no exact match is found
              - v1-dependencies-

        - run:
            name: install dependencies
            command: |
              python3 -m venv venv
              . venv/bin/activate
              pip install -r requirements.txt --upgrade
        - run: echo "source venv/bin/activate" >> $BASH_ENV
        - run: pip install pytest-cov
        - run: pre-commit install

        - save_cache:
            paths:
              - ./venv
            key: v1-dependencies-{{ checksum "requirements.txt" }}

    prepare_deploy:
      description: "Install and configure what is needed in order to run deployment scripts"
      steps:
        # integration-test setup

    deploy_and_test:
      description: "Deploy code and test it"
      parameters:
        to:
          type: string
          default: "developer1"
      steps:
        # run the deploy script for the given developer
        - run: |
            set +Eeo pipefail
            cd ../utils/deployment/sls_deploy && python3 --env << parameters.to >> --branch ${CIRCLE_BRANCH}
        # deploy integration-tests
        - run: cd ../integration-tests && export USER=<< parameters.to >> && ./scripts/
        # run integration tests
        - run: cd ../integration-tests && npm run test-proactive
        - store_test_results:
            path: ~/integration-tests/src/test/test-results
        - store_artifacts:
            path: ~/integration-tests/src/test/test-results

CircleCI workflows

To accurately simulate our day-to-day deployment and testing process, we needed to run these jobs in parallel.

Using CircleCI workflows - a set of rules for defining a collection of jobs and their run order - we wrote a workflow that runs five developer jobs in parallel, and another that runs them in sequence.
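The sequential variant can be expressed with CircleCI’s requires key, which makes each job wait for the one before it. A sketch (the workflow name here is illustrative; the job names match the parallel example):

```yaml
workflows:
  # Illustrative sequential workflow: each developer job starts only after
  # the previous one finishes, so the jobs run one at a time.
  deploy-and-test-sequential:
    jobs:
      - deploy-and-test-developer1
      - deploy-and-test-developer2:
          requires:
            - deploy-and-test-developer1
      - deploy-and-test-developer3:
          requires:
            - deploy-and-test-developer2
      - deploy-and-test-developer4:
          requires:
            - deploy-and-test-developer3
```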

Scheduling the tests

We didn’t want to interfere with the team during working hours, so we scheduled workflows to run at night using Cron syntax in UTC time.

Once we set these workflows to run on the master branch, we were able to successfully simulate our daily feature delivery process.

At this stage, we had something like this:

    workflows:
      version: 2
      nightly: # illustrative workflow name
        triggers:
          - schedule:
              cron: "30 21 * * *"
              filters:
                branches:
                  only:
                    - master
        jobs:
          - deploy-and-test-developer1
          - deploy-and-test-developer2
          - deploy-and-test-developer3
          - deploy-and-test-developer4

Generating the test report

Next, we needed to make our report available to CircleCI Insights.

First, to generate JUnit XML, we used the mocha-junit-reporter plugin and set the test script (in package.json) as follows:

    "test-proactive": "mkdir -p test-results && MOCHA_FILE=test-results/junit.xml mocha --reporter mocha-junit-reporter --timeout 300000 --recursive *.js || (cat test-results/junit.xml && exit 1)",

To upload and store test results for a build, we used a CircleCI step called store_test_results. This collects the test metadata from XML files and uses it to provide insights into your job.


With this pipeline, we can replicate our entire development process each night using CircleCI workflows to simulate a full deploy of our serverless environment (with the Serverless Framework) and run all of our integration tests.

We can then collect the test result metadata from an XML file (with mocha-junit-reporter), store it, and view it in CircleCI Insights. This allows us to see which tests are failing and why. Here’s an example of one such report:


It shows three failed tests that were flaky. For example, one of these tests timed out:


Also of note, for successful job runs, we can see the slowest test under the Test Summary tab:


Having this information available before a real integration test fails and blocks us from merging a completely unrelated feature is incredibly valuable. It not only saves us the time we would spend fixing it, but also the time it would take to identify the cause of the failure.

Happy testing! If you have any questions about implementing this testing strategy in your development workflow, you can find me on Twitter @AiSofer. I’d also love to hear about the tools and techniques you use to identify flaky tests.

Idan Sofer is a software engineer at Lumigo, a SaaS platform for monitoring and debugging serverless applications.
