Troubleshoot self-hosted runner

Cloud

Container runner

The following are errors you could encounter using container runner.

Container fails to start due to disk space

The task remains in the Preparing Environment step while the pod has a warning attached, noting that volume mounting fails due to a lack of disk space.

Events:
  Type     Reason       Age   From               Message
  ----     ------       ----  ----               -------
  Normal   Scheduled    67s   default-scheduler  Successfully assigned default/ccita-62e94fd3faccc34751f72803-0-7hrpk8xv to node3
  Warning  FailedMount  68s   kubelet            MountVolume.SetUp failed for volume "kube-api-access-52lfn" : write /var/snap/microk8s/common/var/lib/kubelet/pods/4cd5057f-df97-41c4-b5ef-b632ce74bf45/volumes/kubernetes.io~projected/kube-api-access-52lfn/..2022_08_02_16_24_55.1533247998/ca.crt: no space left on device

You should ensure there is sufficient disk space.

Pod host node runs out of memory

If the node a pod is hosted on runs out of memory, the task will fail with a failure step named Runner Instance Failure, and a message:

could not run task: launch circleci-agent on "container-0" failed: command terminated with exit code 137.

The pod will have a status of OOMKilled when viewed in Kubernetes with kubectl. You can use task pod configuration to control memory allocation for the job itself.

Pod host node is out of disk space

If the node is full it will have a node.kuberenetes.io/disk-pressure taint, which will prevent new task pods from being scheduled. If all valid nodes for the pod have the same taint, or other conditions that prevent scheduling, the task pod will sit in a pending state until an untainted valid node becomes available. This will show the job as stuck in the Preparing Environment step in the UI.

You need to scale your cluster more effectively to avoid this state.

The node a task is running on abruptly dies

When container runner is hosted on a separate node, the task will still look like it is running in the CircleCI UI until there is a timeout for it. kubectl will also still show the pod as running until the cluster’s liveness probe timeout is hit. The pod will then enter a terminating state that it will become wedged in. At this point the pod will need to be forcefully removed. If force is not used it may cause kubectl to hang:

kubectl delete pod $POD_NAME --force

Image has a bad entrypoint

If the entrypoint specified for the image is invalid, the task will fail with an error:

could not run task: launch circleci-agent on "container-0" failed: command terminated with exit code 139.

Container runner and CircleCI cloud set the entrypoint of the primary container in different ways. On cloud, the entrypoint of the primary container is ignored unless it is preserved using the com.circleci.preserve-entrypoint=true LABEL instruction (see: Adding an entrypoint). In contrast, container runner will always default to a shell (/bin/sh), or the entrypoint specified in the job configuration, if set.

Entrypoints should be commands that run forever without failing. If the entrypoint fails or terminates in the middle of a build, the build will also terminate. If you need to access logs or build status, consider using a background step instead of an entrypoint.

Specify an entrypoint using the Adding an entrypoint documentation to mitigate this error. You can set the entrypoint explicitly as described in Using custom built Docker images.

Image is for a different architecture

If an image for a job uses a different architecture than the node it is deployed on, container runner will give an error:

19:30:12 eb1a4 11412.984ms service-work error=1 error occurred:
        * could not start task containers: pod failed to start: :

The task pod will also show an error status. This will show as a failed job in the CircleCI UI with the error:

could not start task containers: pod failed to start: :

You should correct the underlying architecture used for nodes with jobs to match the architecture for images being used by jobs.

Bad task pod configuration

If the task pod for a resource class is misconfigured, the task will fail once claimed. In the UI the error will be in a Runner Instance Failure step with a message resembling:

could not start task containers: error creating task pod: Pod "ccita-62ea7dff36e977580a329a9d-0-uzz1y8xi" is invalid: [spec.containers[0].resources.limits[eppemeral-storage]: Invalid value: "eppemeral-storage": must be a standard resource type or fully qualified, spec.containers[0].resources.limits[eppemeral-storage]: Invalid value: "eppemeral-storage": must be a standard resource for containers, spec.containers[0].resources.requests[eppemeral-storage]: Invalid value: "eppemeral-storage": must be a standard resource type or fully qualified, spec.containers[0].resources.requests[eppemeral-storage]: Invalid value: "eppemeral-storage": must be a standard resource for containers]

No pod has been created in the Kubernetes cluster. You will need to correct the task pod configuration as described on the Container runner page.

Bash missing

"could not start task containers: exec into build container "container-0" failed: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "bb04485b9ef2386dee5e44a92bfe512ed786675611b6a518c3d94c1176f9a8aa": OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: exec: "/bin/bash": stat /bin/bash: no such file or directory: unknown"

Bash is required for custom images used in jobs executed with a container runner.

Oops, there was an issue with your infrastructure

If you see the message "Oops, there was an issue with your infrastructure. Verify your self-hosted runner infrastructure is operating and try re-running the job. If the issue persist, see our troubleshooting guide" on the job’s page, or if there is no content in the task lifecycle step (as shown), you should consider the potential causes described below:

Image showing the task lifecycle with no content message

Figure 1. Task lifecycle with no content

Pod restart: Check if there were any container agent pod restarts around the time the workflow ran. If the pod was restarted around that time it would have resulted in the job not being processed. In such a case, we recommend rerunning the job once again.

You can check the logs for any of the previous runs using the command kubectl logs -n <namespace> <full pod name> --previous `.

Network connectivity issue: Check the network connectivity of the container agent, especially if the issue is intermittent. The issue can be seen when the container agent has lost network connectivity after claiming the tasks.

We suggest connecting to the pod using the command kubectl exec --stdin --tty -n circleci < full pod name > — /bin/sh and then running a ping test for an extended period of time. We also recommend checking the connection to the links on our FAQ includes a section about the connectivity required for CircleCI’s self-hosted runners.

Resources exhaustion: Check if your pods are reaching their resources limits in the cluster, as the pod could end the job to free up resources. We recommend setting resource limits either within your values.yaml or within your config.yaml.

Refer to the Kubernetes documentation for details of external resource monitoring tools.

Task agent appears not running

The task pod may fail with an Task agent appears not running: /bin/sh: 1: kill: No such process error due a failing liveness probe.

The liveness probe on the task pod checks to see if the PID provided by task agent is running using kill -0 $PID. Task agent will output its PID to a file used by the liveness probe to confirm task agent is running. This probe may fail if the task agent process fails to start, no longer exists or takes longer to initiate than the liveness probe’s default timeouts. You may adjust the liveness probe defaults in the values.yaml for container runners’s Helm chart.

In the example below we have set the liveness probe to allow a 2.5 minute startup time before probing(initialDelaySecond) and 2.5 minutes of failures before the liveness probe fails (the probe will wait 30 seconds between each probe and allow for 5 failure responses before failing the probe) and the task pod is terminated.

agent:
  resourceClasses:
    <namespace>/<resource-class>:
      spec:
        containers:
          - livenessProbe:
              initialDelaySeconds: 150
              periodSeconds: 30
              timeoutSeconds: 15
              successThreshold: 1
              failureThreshold: 5

In the event that the liveness probe fails or the task pod terminates, there is a prestop hook, which attempts to kill any existing task agent by its provided PID. This is to ensure there are no orphaned task agent processes. However, if the PID on file does not map to an existing process, this will not throw and error and will instead log PreStop hook: task agent appears never started or already stopped.

View the logs of a failed task pod

When a task pod fails, it is cleaned up by the container-agent almost immediately. However, sometimes you may want the pod to stick around longer so that you may review the logs and diagnose the failure. You can disable task pod deletion by adding the environment variable KUBE_DISABLE_POD_DELETION_ON_TASK_CLEANUP=true to your container-agent values.yaml.

example:

agent:
  environment:
    KUBE_DISABLE_POD_DELETION_ON_TASK_CLEANUP: true

When KUBE_DISABLE_POD_DELETION_ON_TASK_CLEANUP is set to true, then task pods may dangle until they are manually cleaned or until garbage collection deletes these pods. By default, garbage collection removes resources after they have lived for 5 hours and 5 mins. You may tweak these settings in your container-agent values.yaml.

Machine runner

The following are errors you could encounter using machine runner.

I installed my first self-hosted runner on macOS and the job is stuck in "Preparing Environment", but there are no errors, what should I do?

In some cases, you may need to update the execution permission for the launch-agent so it is executable by root. Try running the following two commands:

sudo chmod +x /opt/circleci/circleci-launch-agent
sudo /opt/circleci/circleci-launch-agent --config=/Library/Preferences/com.circleci.runner/launch-agent-config.yaml

Cancel the job and rerun it. If your job is still not running, file a support ticket.

Debugging with SSH

CircleCI’s machine runners support rerunning a job with SSH for debugging purposes. Instructions on using this feature can be found at Debugging with SSH.

The Rerun job with SSH feature is disabled by default. To enable this feature, see the machine runner configuration reference or the container runner installation guide.

OIDC tokens missing from jobs

If you experience that OIDC token environment variables ($CIRCLE_OIDC_TOKEN, $CIRCLE_OIDC_TOKEN_V2) are missing from jobs, a common cause can be that the default temporary directory (for example, /tmp) is not writable or is mounted as noexec. The system’s temporary directory needs to be writable and allow execution for the OIDC plugin to be downloaded and executed from it.