In my recent interactions at AWS re:Invent and KubeCon, I discovered that too many existing CircleCI users were unaware of, or not using, the very useful and powerful debugging with SSH feature. This feature enables a user to troubleshoot and debug builds directly on the resources where the build failed.

I’ll start this post with a mock dialogue that most developers have more than likely had at one time or another with their sysadmins, SREs, or DevOps colleagues regarding failed CI/CD builds. It goes a little something like this…

dev: Hey, SRE team. My build is failing and I don’t know what’s happening with the app in the build node. It’s failing on the CI/CD platform, but the build scripts are working fine in my developer environment. Can I have SSH access to the build node on the platform so that I can debug in real-time?

SRE: How about in one hour?

dev: One hour? I was hoping to get this figured out sooner so that I can put these changes to bed and start hacking on new features for the next release. Is there any way you can grant me access to the build node so that I can debug on the actual resource where my build fails?

SRE: Our CI/CD platform doesn’t have SSH access capabilities, and giving you my admin credentials is a security violation.

dev: I thought we installed an SSH plugin that allows us to access the build nodes, or at the very least, to send console commands to the build node and capture the responses in system logs?

SRE: We did have an SSH plugin installed and security flagged it as vulnerable under CVE-2017-2648 which enables Man-in-the-Middle attacks. Security directed us to uninstall the plugin and banned its usage in our CI/CD platform. We currently don’t have SSH capabilities into the nodes that could help you debug in real-time. Sorry.

dev: It’s going to take me forever to debug this build if I only use the unit test and stacktrace logs. I might as well be flying blind.

Although the above dialogue is a generalization, it is based on genuine interactions and situations that I’ve experienced in my career, and the scenario is very common among teams.

Accessing pipeline jobs

Debugging code on resources and in environments outside of a developer’s normal development environment can present challenges that consume valuable time. In the previous scenario, the CI/CD platform itself was a huge blocker for the developer and the SRE team because it lacked features that enable a developer to access and troubleshoot failed builds. Without SSH access to the build nodes, a developer has to resort to debugging failed builds outside of the CI/CD environment where the build is actually failing. They have to try to replicate the CI/CD environment in their dev environment in order to accurately identify the issue, then attempt to resolve it using only application, stack trace, and system logs. These situations are a huge waste of time for developers and SRE teams.

CircleCI’s debugging with SSH feature allows developers to securely and easily debug failed builds in real time, on the very resources where the failures occur. The feature not only enables real-time debugging, it also provides a self-serve mechanism that lets developers securely access build environments without depending on other teams for access to build resources. This is a huge time saver for developers and ops teams alike.

Troubleshooting a failed pipeline job

Accessing build jobs via SSH can be accomplished very easily within CircleCI. Let’s run through a common build failure scenario to demonstrate how to troubleshoot a failed build job. Below is a sample config.yml that we’ll use in this scenario:

version: 2.1
workflows:
  build_test_deploy:
    jobs:
      - build_test
      - deploy:
          requires:
            - build_test
jobs:
  build_test:
    docker:
      - image: circleci/python:2.7.14
    steps:
      - checkout
      - run:
          name: Install Python Dependencies
          command: |
            pip install --user --no-cache-dir -r requirements.txt
      - run:
          name: Run Tests
          command: |
            python test_hello_world.py
  deploy:
    docker:
      - image: circleci/python:2.7.14
    steps:
      - checkout
      - setup_remote_docker:
          docker_layer_caching: false
      - run:
          name: Build and push Docker image
          command: |
            pip install --user --no-cache-dir -r requirements.txt
            ~/.local/bin/pyinstaller -F hello_world.py
            echo 'export TAG=0.1.${CIRCLE_BUILD_NUM}' >> $BASH_ENV
            echo 'export IMAGE_NAME=python-cicd-workshop' >> $BASH_ENV
            source $BASH_ENV
            docker build -t $DOCKER_LOGIN/$IMAGE_NAME -t $DOCKER_LOGIN/$IMAGE_NAME:$TAG .
            echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
            docker push $DOCKER_LOGIN/$IMAGE_NAME

The sample config.yml specifies a two-job workflow pipeline that tests the code, builds a Docker image of the app, and finally publishes that image to Docker Hub. In order to publish the new image to Docker Hub, the job requires Docker Hub credentials. These credentials are very sensitive, so they are represented by the secure environment variables $DOCKER_LOGIN and $DOCKER_PWD, which are set as project-level environment variables. If the $DOCKER_LOGIN and $DOCKER_PWD environment variables don’t exist at the project level, the build will fail at the Build and push Docker image step.
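
This failure mode can also be caught earlier. As a sketch (this step is not part of the sample config above), a short verification step at the top of the deploy job would fail fast with a clear message whenever the credentials are missing:

- run:
    name: Verify Docker Hub credentials
    command: |
      # Fail early if either project-level variable is unset or empty
      if [ -z "${DOCKER_LOGIN:-}" ] || [ -z "${DOCKER_PWD:-}" ]; then
        echo "DOCKER_LOGIN and/or DOCKER_PWD is not set as a project environment variable" >&2
        exit 1
      fi

A clearly named step like this makes the root cause obvious without any SSH debugging at all.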

Identifying our failed job

[Screenshot: a failed job in the CircleCI dashboard]


The failed job logs indicate that the failure occurs at the start of the Docker build process, as shown in the log entry below.

invalid argument "/python-cicd-workshop" for "-t, --tag" flag: invalid reference format
See 'docker build --help'.
Exited with code 125
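
This class of error is easy to reproduce locally, assuming Docker is installed (the exact wording may vary by Docker version). When the username variable expands to an empty string, the tag begins with a bare slash, which Docker rejects:

$ DOCKER_LOGIN=""
$ docker build -t "$DOCKER_LOGIN/python-cicd-workshop" .
invalid argument "/python-cicd-workshop" for "-t, --tag" flag: invalid reference format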

Now that we have an indication of where things are failing in the pipeline, we can easily rerun the failed build with SSH access. When a job is rerun with SSH, the pipeline runs again and fails just as it did previously, but this time the runtime node stays alive and provides the user with details on how to access the failed build resource.

Rerun the job with SSH

To gain SSH access to the failed build, rerun the failed job from within the CircleCI dashboard:

  1. Log into the CircleCI dashboard (if you haven’t already)
  2. Click the FAILED job
  3. Click the down arrow in the top right portion of the dashboard, then select Rerun job with SSH to kick off the build.

[Screenshot: the Rerun job with SSH option]

The image below shows an example of the connection details a developer uses to SSH into the failed build resource.

[Screenshot: SSH access details]

The SSH access details are provided at the end of the build:

ssh -p 64535 100.27.19.200
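
CircleCI authenticates this SSH session against the SSH keys attached to your VCS account, so if the connection is refused, it’s worth confirming that your local ssh-agent is holding one of those keys before retrying:

$ ssh-add -l   # lists the fingerprints of the keys your agent will offer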

Accessing the build environment

Now that we have SSH access to the build environment, let’s troubleshoot our build. As mentioned before, the log entries indicate that the issue has something to do with the Docker build portion of our deploy steps. The invalid argument "/python-cicd-workshop" portion of the error message shows us that the Docker username is missing from "/python-cicd-workshop".

The Docker image name is defined in this line of the config.yml file and is composed of environment variables:

docker build -t $DOCKER_LOGIN/$IMAGE_NAME -t $DOCKER_LOGIN/$IMAGE_NAME:$TAG .

We know that the Docker image build fails because of an invalid image name, and we know that the name is composed of environment variables. This suggests that the failure is related to incorrect or nonexistent environment variables. Let’s open a terminal and SSH into the build environment, then run a command to test whether our assumption about the environment variables is true:

$ printenv | grep DOCKER_LOGIN

The printenv | grep DOCKER_LOGIN command prints every environment variable set in the build environment and filters the output for DOCKER_LOGIN, showing the variable and its value if it exists. If the command returns nothing, then the system never set the $DOCKER_LOGIN variable when the build was first executed. In this case, no value was returned: that is the cause of our failure.
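
While still inside the SSH session, we can also evaluate the image tag exactly as the failing docker build command would. Because $DOCKER_LOGIN is unset, the shell expands it to an empty string, reproducing the malformed leading-slash tag from the error message:

$ echo "$DOCKER_LOGIN/python-cicd-workshop"
/python-cicd-workshop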

Fixing the build

We have now verified that we’re missing the $DOCKER_LOGIN environment variable. We can fix the build by adding both the missing $DOCKER_LOGIN and $DOCKER_PWD variables to the project using the CircleCI dashboard. Since the values for these variables are very sensitive, they must be defined and securely stored on the CircleCI platform. You can set the variables using these instructions:

  1. Click Add Project on the CircleCI dashboard in the left menu
  2. Find and click the project’s name in the projects list and click Set Up Project on the right side
  3. Click the Project cog in the top right area of the CircleCI dashboard
  4. In the Build Settings section, click Environment Variables
  5. Click Add Variable

In the Add an Environment Variable dialog box, define the environment variables needed for this build:

  • Name: DOCKER_LOGIN, Value: your Docker Hub username
  • Name: DOCKER_PWD, Value: your Docker Hub password

Setting these environment variables correctly is critical to the build completing successfully.
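
As an aside, if several projects need the same Docker Hub credentials, CircleCI contexts offer an organization-level alternative to per-project variables. Below is a sketch of how the workflow could reference one; the context name docker-hub-creds is illustrative:

workflows:
  build_test_deploy:
    jobs:
      - build_test
      - deploy:
          context: docker-hub-creds  # holds DOCKER_LOGIN and DOCKER_PWD
          requires:
            - build_test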

Rerunning the build

Now that we’ve set the required environment variables for our build, we can rerun the failed build to test that our changes work. Go to the CircleCI dashboard and click Rerun from Beginning to kick off a rebuild of the project and wait for your build to successfully complete.

[Screenshot: the rebuilt project completing successfully]

Wrapping up

In this post, I highlighted the power of CircleCI’s debugging with SSH feature and how any user can securely and easily access their build environments to debug failed builds in real time. This feature empowers developers to quickly identify and fix broken builds so that they can focus their time and attention on building new and innovative features.