
Container runner performance benchmarks

Cloud
Server v4.3+

Runner benchmarks show the performance trade-offs of CircleCI self-hosted runners up to an unacceptable error threshold. There is a trade-off between the following:

  • ReplicaSet

  • Concurrency

  • Tasks

  • Queue

  • Run time

Depending on your team’s workload types (for example, high parallelism or fan-in/fan-out), you may need to adjust your cluster for high concurrency and task counts, potentially impacting queuing, run time, and other factors.

By publishing our benchmarks, we can make measurable improvements to the performance and scale of the CircleCI self-hosted runner, and show the impact of those improvements.

The table below aggregates the results of testing the 3.1 self-hosted runner (GOAT) against the same tests run with the 3.0 self-hosted runner. Version 3.1 introduced a major re-architecture of container runner to address performance, stability, and reliability. For more technical background on 3.1, refer to the runner-init project’s README.

Distilling the differences between GOAT and the 3.0 self-hosted runner, there is a net improvement across all four categories, the most notable being the reduction in queue times. GOAT also demonstrated a small but meaningful decrease in failed runs, showing improved reliability under stress.

In summary, the average improvements of GOAT are as follows:

Average Run Time | Max Run Time | Average Queue Time | Max Queue Time
5%               | 1%           | 56%                | 49%

In some instances, GOAT showed lower performance than the 3.0 self-hosted runner. In these cases, the differences are on the order of milliseconds and can often be attributed to cluster, network, and compute conditions. While some differences may appear extreme, they are often outliers in the 95th (or higher) percentile. The table above is the result of repeating the experiment four times for each row. When these extremes are considered in the context of the rest of the experiments, the net result is still positive for run times.

In queuing, where the most dramatic performance increase is observed, the results are much more consistent and are less influenced by external factors such as remote API calls.

Runner configuration recommendations

Based on the reference architecture of GKE 1.29.4, using a node pool of five e2-medium nodes, together with the benchmarks above, we can make several recommendations for container runner cluster configuration, covering the following:

  • Replica count of the container agent

  • Maximum concurrent task configuration

High performance cluster

  • 3 replicas of container agent

  • 80 concurrent tasks per replica

This configuration makes a slight trade-off in stability (a slightly higher rate of infrastructure failures) to achieve much higher task throughput and reduced queuing times.
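As a sketch, this high performance profile can be expressed as overrides to the container agent Helm chart’s values file. The agent.replicaCount and agent.maxConcurrentTasks keys below are based on the chart’s documented options; confirm them against the chart version you deploy:

    # values.yaml overrides for a high performance cluster (sketch).
    # Key names assume the container-agent Helm chart; verify them
    # against the chart version in use.
    agent:
      replicaCount: 3        # three replicas of container agent
      maxConcurrentTasks: 80 # 80 concurrent tasks per replica

    # Applied with, for example:
    # helm upgrade --install container-agent container-agent/container-agent \
    #   --namespace circleci --values values.yaml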

High stability cluster

  • 1 replica of container agent

  • 20 concurrent tasks per replica

This configuration trades off throughput for higher stability, with minimal infrastructure failures. Note this is the default configuration for the container agent Helm chart.
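Because these values match the chart defaults, no overrides are strictly required. Stated explicitly, the equivalent values file would look like the following sketch (same assumed key names as above):

    # values.yaml for a high stability cluster (sketch). These match the
    # container-agent Helm chart defaults, so they can also be omitted.
    agent:
      replicaCount: 1        # single container agent replica
      maxConcurrentTasks: 20 # default maximum concurrency per replica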

When tuning a cluster for performance there are three main variables to consider: container agent replica count, maximum concurrent tasks per replica, and node pool configuration.

Container agent replica count

The more replicas of container agent you run, the faster tasks get claimed, as each replica runs its own collection of claiming loops. Having more replicas is beneficial if you have sudden large backlogs of tasks to run, as tasks can be claimed more quickly and have their pod specs submitted to the Kubernetes cluster for scheduling. It is worth considering that the more replicas you use (and the more tasks that can launch concurrently), the greater the strain on the Kubernetes control plane, and the more prone you will be to task start failures. CircleCI container runners will attempt to reschedule a task up to three times before declaring an infrastructure failure.

Maximum concurrent tasks per replica

This number is particularly sensitive to node types and counts. The more tasks that attempt to launch in a short window, the higher the strain on the Kubernetes cluster’s control plane, as well as on the individual kubelets, which are responsible for the pods and containers on a specific node. As node power and count increase, the impact of concurrent tasks on a cluster decreases. The lower the maximum number of concurrent tasks, the greater the reliability of tasks successfully starting without experiencing an infrastructure failure.

The likelihood of an infrastructure failure for a task decreases as node count and resources are increased, particularly CPU.

Node types and count

The recommendations already presented are based on the reference cluster configuration. As a node pool grows, or is set to an instance type with greater resources, task execution becomes more reliable. When sizing a cluster, you should add headroom beyond what an individual task is expected to need. The kubelet and container driver share the same resources as the pods on the node, and the more resource-starved they become, the more prone tasks are to long queue times and infrastructure failures. The more widely pods can be distributed across nodes, the less pressure and backlog on individual kubelets and container engines, resulting in shorter queuing times.
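One way to leave that headroom is to set explicit resource requests on task pods, so the scheduler spreads them across nodes rather than packing a single kubelet. The sketch below follows the agent.resourceClasses pattern documented for the container agent Helm chart; the resource class name, token placeholder, and sizes are illustrative assumptions:

    # values.yaml fragment (sketch): request resources per task pod so
    # the scheduler spreads work and leaves headroom for the kubelet and
    # container driver. Class name and sizes are illustrative.
    agent:
      resourceClasses:
        your-namespace/small:            # hypothetical resource class
          token: <resource-class-token>  # placeholder token
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "500m"
                    memory: 512Mi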

Troubleshooting

Refer to the Troubleshoot Container Runner section of the Troubleshoot Self-hosted Runner guide if you encounter issues installing or using container runner.

