Runner provisioner Preview
Runner Provisioner is currently in preview. The product, its configuration schema, and its APIs are subject to change before general availability. It is not recommended for production workloads. If you encounter issues or have feedback, see Feedback and Support.
Runner Provisioner is a Kubernetes controller that automatically scales CircleCI runner VMs using KubeVirt. Runner Provisioner polls the CircleCI API for pending and running tasks, then adjusts a VirtualMachinePool replica count to match demand.
The current preview version is 0.1.0.
Getting access
Runner Provisioner is available to invited preview participants only. To request access, fill out the Runner Provisioner preview access request form.
Once access is granted, you will receive credentials for the image registry referenced in the Quickstart, and access to the Runner Provisioner preview Slack channel.
Feedback and support
As a preview customer, expect bugs and missing features. This is early-stage software and sharp edges are normal.
In exchange, detailed feedback is expected. Your input directly shapes what gets built before general availability. You have direct access to the CircleCI product and engineering team throughout the preview.
Escalate directly via the #runner-provisioner-preview Slack channel for:
- Troubleshooting issues
- Bugs and feature requests
- General questions
Do not open a support ticket for issues with Runner Provisioner. Issues raised in the Slack channel are routed directly to the product team, with a 24-hour internal response target.
Prerequisites
- A Kubernetes cluster with KubeVirt installed. Refer to the KubeVirt compatibility matrix for the appropriate version for your cluster. Runner Provisioner has been tested with v1.8.
- kubectl configured against your cluster.
- helm v3+.
- A CircleCI runner resource class and its associated resource class token. Create one in the CircleCI web app under Self-Hosted Runners. This token is used by the agent running on the VM to authenticate with CircleCI and claim and execute jobs for that resource class.
- A CircleCI API token with permission to query runner tasks. This may be a personal API token or a project API token with read-only access.
- An image pull secret named regcred in the target namespace. The Helm chart references this by default.
Cluster requirements
Nested virtualization
KubeVirt runs VMs inside Kubernetes pods. Each node that will host runner VMs must expose /dev/kvm — the node itself must support hardware-accelerated virtualization (either bare metal, or a cloud VM with nested virtualization enabled).
Verify KVM is available on a node by checking the virt-handler pod on that node.
Get a list of virt-handler pods:
$ kubectl get pods -n kubevirt -l kubevirt.io=virt-handler
Select any of the pods listed in the output to run the following command:
$ kubectl exec -n kubevirt <virt-handler-pod> -- ls /proc/1/root/dev/kvm
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
/proc/1/root/dev/kvm
If the file is absent, VMs cannot be scheduled on that node regardless of how KubeVirt is configured. On cloud providers, nested virtualization is typically disabled by default and must be explicitly enabled on the node pool or instance group before the nodes are created. Nested virtualization cannot be patched onto existing nodes.
Dedicated node pool for VM workloads (optional)
Running runner VMs on a dedicated node pool, separate from the nodes that run KubeVirt’s own control plane components (virt-operator, virt-api, virt-controller), is recommended. This prevents VM workloads from competing with cluster infrastructure for resources.
Nodes in this pool must have nested virtualization enabled. Nested virtualization must be configured at node or instance creation time and cannot be patched onto existing nodes. Details on how to enable nested virtualization for GKE, AKS, and EKS node pools are covered in the following sections.
Tainted nodes (optional)
Taint the nodes to prevent arbitrary workloads from landing on them while still allowing virt-launcher pods through. For information on Taints and Tolerations, see the Kubernetes Documentation.
Then patch the virt-handler so it can run on the tainted nodes. The KubeVirt operator manages the DaemonSet, so this must go through the KubeVirt CR rather than a direct patch. Replace the toleration key with the taint key you applied to your nodes:
$ kubectl patch kubevirt kubevirt -n kubevirt --type=merge -p='{
"spec": {
"customizeComponents": {
"patches": [
{
"resourceName": "virt-handler",
"resourceType": "DaemonSet",
"patch": "{\"spec\":{\"template\":{\"spec\":{\"tolerations\":[{\"key\":\"CriticalAddonsOnly\",\"operator\":\"Exists\"},{\"key\":\"<your-taint-key>\",\"operator\":\"Exists\",\"effect\":\"NoSchedule\"}]}}}}",
"type": "merge"
}
]
}
}
}'
Use this patch command in the cloud provider examples below.
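The patch field in the command above is itself a JSON document serialized as a string, which is why its inner quotes are escaped. Hand-escaping is easy to get wrong, so one option is to generate the payload programmatically. The following is a minimal sketch, assuming Python 3 is available on your workstation. The function name build_virt_handler_patch is hypothetical, and the tolerations simply mirror the command above with your taint key substituted.

```python
import json


def build_virt_handler_patch(taint_key: str) -> str:
    """Return the merge-patch payload for the KubeVirt CR as a JSON string."""
    # Inner patch: the tolerations to apply to the virt-handler DaemonSet.
    daemonset_patch = {
        "spec": {
            "template": {
                "spec": {
                    "tolerations": [
                        {"key": "CriticalAddonsOnly", "operator": "Exists"},
                        {"key": taint_key, "operator": "Exists", "effect": "NoSchedule"},
                    ]
                }
            }
        }
    }
    # Outer patch: the inner patch is embedded as a string, so it is
    # serialized first and json.dumps handles the escaping.
    kubevirt_patch = {
        "spec": {
            "customizeComponents": {
                "patches": [
                    {
                        "resourceName": "virt-handler",
                        "resourceType": "DaemonSet",
                        "patch": json.dumps(daemonset_patch, separators=(",", ":")),
                        "type": "merge",
                    }
                ]
            }
        }
    }
    return json.dumps(kubevirt_patch)


if __name__ == "__main__":
    # "kubevirt" is the taint key used in the cloud provider examples.
    print(build_virt_handler_patch("kubevirt"))
```

The printed string can be passed as the -p argument of the kubectl patch command shown above.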
Example: GKE
On GKE, use gcloud to create the node pool with nested virtualization and the taint applied in one step. GKE requires an n2, n2d, c2, or c2d series machine type. e2 instances do not support nested virtualization. In the command below, the node pool creates nodes with a taint applied using kubevirt as the taint key.
$ gcloud container node-pools create kubevirt-pool \
--cluster=<your-cluster-name> \
--zone=<your-zone> \
--project=<your-project> \
--machine-type=n2-standard-4 \
--num-nodes=3 \
--enable-autoscaling \
--min-nodes=3 \
--max-nodes=10 \
--enable-nested-virtualization \
--node-labels=kubevirt.io/schedulable=true \
--node-taints=kubevirt=true:NoSchedule \
--image-type=cos_containerd \
--disk-size=100
Then install KubeVirt and apply the virt-handler patch from Tainted Nodes using kubevirt as the taint key.
Example: Azure Kubernetes Service (AKS)
On AKS, nested virtualization is determined by the VM SKU, not a flag. Use a Standard_D*s_v3 or newer (v4, v5) series VM, which supports nested virtualization. Standard_B series and older Standard_A series do not. In the command below, the node pool creates nodes with a taint applied using kubevirt as the taint key.
$ az aks nodepool add \
--cluster-name <your-cluster-name> \
--resource-group <your-resource-group> \
--name kubevirtpool \
--node-count 3 \
--enable-cluster-autoscaler \
--min-count 3 \
--max-count 10 \
--node-vm-size Standard_D4s_v3 \
--node-taints kubevirt=true:NoSchedule \
--labels kubevirt.io/schedulable=true \
--os-type Linux
Then install KubeVirt and apply the virt-handler patch from Tainted Nodes using kubevirt as the taint key.
Example: AWS EKS
On EKS, KVM support requires bare metal instances — regular EC2 instances, even Nitro-based ones, do not expose /dev/kvm to pods. Use a .metal instance type (for example, m5.metal or c5.metal). An active request for nested virtualization support on EC2 instances remains open.
eksctl does not support taints as CLI flags for clusters it did not create. Use a config file instead:
kubevirt-nodegroup.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: <your-cluster-name>
  region: <your-region>
vpc:
  id: <vpc-id>
  securityGroup: <cluster-security-group-id>
  subnets:
    private:
      <az-1>:
        id: <subnet-id-1>
      <az-2>:
        id: <subnet-id-2>
managedNodeGroups:
  - name: kubevirt-pool
    privateNetworking: true
    instanceType: m5.metal
    minSize: 3
    maxSize: 10
    desiredCapacity: 3
    labels:
      kubevirt.io/schedulable: "true"
    taints:
      - key: kubevirt
        value: "true"
        effect: NoSchedule
Fetch the required VPC values from your existing cluster:
$ aws eks describe-cluster --name <your-cluster-name> \
--query 'cluster.resourcesVpcConfig.{vpcId:vpcId,securityGroupId:clusterSecurityGroupId,subnetIds:subnetIds}'
Then apply the node group config:
$ eksctl create nodegroup -f kubevirt-nodegroup.yaml
Then install KubeVirt and apply the virt-handler patch from Tainted Nodes using kubevirt as the taint key.
Configure KubeVirt operator scheduling
By default, KubeVirt’s operator requires nodes with a node-role.kubernetes.io/control-plane label and uses a requiredDuringSchedulingIgnoredDuringExecution affinity. In clusters where this label is not present or the affinity is too restrictive, apply these two fixes after installing KubeVirt.
Remove the hard affinity requirement so the operator can schedule on any node:
$ kubectl patch deployment virt-operator -n kubevirt --type=json \
-p='[{"op":"remove","path":"/spec/template/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution"}]'
Label all nodes so KubeVirt install jobs (generated by the operator) can schedule:
$ kubectl label nodes --all node-role.kubernetes.io/control-plane=
The command above labels all existing nodes. If you have a dedicated VM worker node pool, apply this label to those nodes once they join the cluster.
To apply the label to nodes in a specific node pool, use the appropriate selector for your cloud provider:
# AWS EKS
$ kubectl label nodes -l eks.amazonaws.com/nodegroup=<nodegroup-name> node-role.kubernetes.io/control-plane=
# GKE
$ kubectl label nodes -l cloud.google.com/gke-nodepool=<pool-name> node-role.kubernetes.io/control-plane=
# AKS
$ kubectl label nodes -l agentpool=<nodepool-name> node-role.kubernetes.io/control-plane=
Quickstart
2. Create the image pull secret
Once access is granted, you will receive credentials for the image registry as described in Getting Access. Use those credentials below.
$ kubectl create secret docker-registry regcred \
--namespace runner-provisioner \
--docker-server=<registry> \
--docker-username=<user> \
--docker-password=<password>
3. Configure values
Create a my-values.yaml file:
my-values.yaml
provisioner:
  # CircleCI API token for querying unclaimed/running tasks
  circleToken: "your-circle-api-token"
  resourceClass:
    # Resource class in the format "namespace/name"
    name: "my-org/my-runner"
    # Runner token for this resource class
    token: "your-runner-token"
    # Scaling bounds
    minReplicas: 3
    maxReplicas: 10
    # Optional: idle timeout before a waiting VM shuts itself down (e.g. "10m")
    # idleTimeout: ""
    # KubeVirt VirtualMachineInstanceSpec for each runner VM
    spec:
      domain:
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
        devices:
          disks:
            - name: disk
              disk:
                bus: virtio
      volumes:
        - name: disk
          containerDisk:
            image: "quay.io/containerdisks/ubuntu:22.04"
The image quay.io/containerdisks/ubuntu:22.04 is an official container disk maintained by the KubeVirt project, providing a pre-built Ubuntu 22.04 OS image for running virtual machines on Kubernetes.
Connecting to a CircleCI Server instance
By default, Runner Provisioner connects to the CircleCI Cloud API at https://runner.circleci.com/. If you are running a self-hosted CircleCI Server instance, set provisioner.circleciAPIAddr to your server’s URL in my-values.yaml:
my-values.yaml
provisioner:
  circleciAPIAddr: "https://your-server-hostname"
  circleToken: "your-circle-api-token"
  resourceClass:
    name: "my-org/my-runner"
    token: "your-runner-token"
This value is injected into each VM’s cloud-init script so the runner agent connects to your server instance rather than CircleCI Cloud. Without it, runners will fail to register.
Configuration reference
Configuration field names and defaults may change before general availability. Pin your my-values.yaml to a specific chart version and review the changelog before upgrading.
Top-level values
| Key | Default | Description |
|---|---|---|
| | | Container image |
| | | Image tag (overridden by the digest when set) |
| | | SHA digest; takes precedence over tag when set |
| | | Secrets for pulling the provisioner image |
provisioner.* values
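| Key | Default | Description |
|---|---|---|
| provisioner.circleciAPIAddr | https://runner.circleci.com/ | CircleCI API base URL |
| | | Namespace where runner VMs are created |
| provisioner.circleToken | | CircleCI API token for task polling |
| provisioner.existingSecret | | Name of a pre-existing Secret (see Using an Existing Secret) |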
provisioner.resourceClass.* values
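| Key | Default | Description |
|---|---|---|
| provisioner.resourceClass.name | | Resource class in namespace/name format |
| provisioner.resourceClass.token | | Runner authentication token (required) |
| provisioner.resourceClass.idleTimeout | | Duration a VM waits for a job before shutting down (for example, 10m) |
| provisioner.resourceClass.minReplicas | | Minimum number of VMs always running |
| provisioner.resourceClass.maxReplicas | | Maximum number of VMs allowed |
| provisioner.resourceClass.spec | Ubuntu 22.04, 2Gi RAM, 1 CPU | KubeVirt VirtualMachineInstanceSpec |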
Using an existing secret
If you manage secrets externally (for example, via Vault or Sealed Secrets), set provisioner.existingSecret to the name of a pre-existing Kubernetes Secret. When set, resourceClass.token and circleToken in values are ignored.
The Secret must have two keys:
- circle-token. The CircleCI API token for task polling.
- config.yaml. The resource class configuration.
config.yaml
resourceClass:
  "my-org/my-runner":
    token: "your-runner-token"
    idleTimeout: "10m" # optional
    spec:
      domain:
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
        devices:
          disks:
            - name: disk
              disk:
                bus: virtio
      volumes:
        - name: disk
          containerDisk:
            image: "quay.io/containerdisks/ubuntu:22.04"
Create the secret with:
$ kubectl create secret generic my-secret \
--namespace runner-provisioner \
--from-literal=circle-token="your-circleci-api-token" \
--from-file=config.yaml=./config.yaml
Then reference it in values:
my-values.yaml
provisioner:
  existingSecret: "my-secret"
VM specification notes
The spec field is a KubeVirt VirtualMachineInstanceSpec. The provisioner always appends a cloud-init disk and volume automatically — do not add one yourself.
VM OS support is limited to Debian/Ubuntu and RHEL/CentOS based images. Other Linux distributions are not supported.
The startup script performs the following steps on each VM:
- Detects the OS and installs circleci-runner from packagecloud.io.
- Injects the runner auth token into /etc/circleci-runner/circleci-runner-config.yaml.
- Configures the runner in single-task mode (one job per VM lifetime).
- Optionally sets idle_timeout in the runner config.
- Configures systemd to power off the VM after the runner process exits.
- Starts the runner service.
Scaling behavior
Desired replicas are calculated as unclaimed tasks plus running tasks, clamped to [minReplicas, maxReplicas].
- The scaler polls CircleCI every one second.
- minReplicas VMs are always kept running as a pre-warmed pool.
- When demand drops, excess VMs drain naturally. That is, they pick up no new jobs and shut down after completing their current job (or after idleTimeout if set).
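The replica calculation described above can be sketched in a few lines. This is an illustration of the stated formula only, not the provisioner's actual source; the function name desired_replicas is hypothetical.

```python
def desired_replicas(unclaimed: int, running: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Clamp (unclaimed + running) to [min_replicas, max_replicas]."""
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    demand = unclaimed + running
    return max(min_replicas, min(demand, max_replicas))


if __name__ == "__main__":
    # Bounds taken from the example values: minReplicas=3, maxReplicas=10.
    print(desired_replicas(0, 0, 3, 10))  # → 3 (idle pool stays at the minimum)
    print(desired_replicas(4, 2, 3, 10))  # → 6 (demand is within bounds)
    print(desired_replicas(9, 5, 3, 10))  # → 10 (demand of 14 is capped)
```

With these bounds, the pool never drops below 3 VMs and never grows past 10, regardless of how many tasks are queued.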
idleTimeout
Without idleTimeout, a pre-warmed VM that never receives a job waits indefinitely. Setting idleTimeout (for example, "10m") causes VMs to shut down after that period of inactivity. An idle timeout is useful for:
- Draining excess pre-scaled VMs when demand drops.
- Cycling VMs after a spec or config update (old VMs will eventually time out and be replaced).
Role-based access control
The Helm chart creates a ServiceAccount, Role, and RoleBinding scoped to the target namespace. The provisioner requires the following permissions:
| Resource | Verbs |
|---|---|
|
|
|
|
|
|
|
|
Observability
| Endpoint | Port | Purpose |
|---|---|---|
| | | Readiness probe |
| | | Liveness probe |
Logs are written to stderr in JSON format.
Confirming the scaler is polling
The scaler emits a log entry on every poll cycle (every one second) as part of a span named worker loop scaler. Each entry includes the following fields:
| Field | Description |
|---|---|
| unclaimed_tasks | Number of queued jobs waiting to be claimed |
| running_tasks | Number of jobs currently running on runner VMs |
| desired_vms | Replica count the scaler calculated (unclaimed + running, clamped to [minReplicas, maxReplicas]) |
| loop_name | Always scaler |
A healthy idle state (no jobs queued, pool at minReplicas) looks like:
{"loop_name":"scaler","unclaimed_tasks":0,"running_tasks":0,"desired_vms":3}
A healthy active state (jobs queued, scaler responding):
{"loop_name":"scaler","unclaimed_tasks":4,"running_tasks":2,"desired_vms":6}
If desired_vms is not changing in response to queued jobs, check the following:
- If unclaimed_tasks is always 0, the CIRCLE_TOKEN may be invalid or pointing at the wrong resource class.
- If desired_vms is not increasing past a fixed number, the scaler is hitting maxReplicas.
Scaler errors appear as log entries with messages like failed to get unclaimed tasks or failed to get running tasks, indicating the provisioner cannot reach the CircleCI API.
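The checks above can be automated against saved log output. The following is a rough sketch, assuming each scaler entry is a single-line JSON object containing the fields from the examples above. The function name diagnose and the hint wording are hypothetical and not part of the provisioner.

```python
import json


def diagnose(log_lines, max_replicas):
    """Return a list of hints derived from scaler log entries."""
    entries = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not JSON
        # Keep only entries emitted by the scaler loop.
        if isinstance(record, dict) and record.get("loop_name") == "scaler":
            entries.append(record)

    if not entries:
        return ["no scaler entries found: the scaler may not be polling"]

    hints = []
    if all(e.get("unclaimed_tasks", 0) == 0 for e in entries):
        hints.append("unclaimed_tasks is always 0: if jobs are queued, "
                     "check the API token and resource class")
    if all(e.get("desired_vms") == max_replicas for e in entries):
        hints.append("desired_vms is pinned at maxReplicas")
    return hints


if __name__ == "__main__":
    sample = [
        '{"loop_name":"scaler","unclaimed_tasks":4,"running_tasks":2,"desired_vms":6}',
        '{"loop_name":"scaler","unclaimed_tasks":0,"running_tasks":0,"desired_vms":3}',
    ]
    print(diagnose(sample, max_replicas=10))  # → []
```

An empty list means neither symptom was observed in the supplied lines. It does not prove the scaler is healthy.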
Upgrading
Update your my-values.yaml and run:
$ helm upgrade runner-provisioner ./chart \
--namespace runner-provisioner \
--values my-values.yaml
The deployment pod annotation checksum/config is derived from the Secret contents, so a config-only change (for example, a new token or VM spec) triggers a pod deployment automatically.
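A checksum annotation of this kind is a common Helm pattern: the config is hashed, the hash is stored on the pod template, and any change to the config therefore changes the pod template and triggers a rollout. The chart's actual template is not shown in this document, so the following is only a sketch of the idea. The function name config_checksum is hypothetical, and the hashing scheme is an assumption.

```python
import hashlib


def config_checksum(secret_data: dict) -> str:
    """Hash Secret contents deterministically, independent of key order."""
    digest = hashlib.sha256()
    for key in sorted(secret_data):
        # A NUL separator keeps key/value boundaries unambiguous.
        digest.update(key.encode("utf-8") + b"\0")
        digest.update(secret_data[key].encode("utf-8") + b"\0")
    return digest.hexdigest()


if __name__ == "__main__":
    before = config_checksum({"circle-token": "old-token", "config.yaml": "spec-v1"})
    after = config_checksum({"circle-token": "new-token", "config.yaml": "spec-v1"})
    # A different checksum means a different pod template, hence a rollout.
    print(before != after)  # → True
```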
Configuration changes (tokens, API address, VM spec) are injected into VMs at first boot via cloud-init and are not re-applied to running VMs. After a helm upgrade, existing VMs continue using their original config until they are recreated. Two deployment options are available:
- Graceful deployment (no job interruption): Set idleTimeout in your values before upgrading. VMs will shut down on their own once they finish their current job and go idle. The pool recreates the VMs with the updated config. Graceful deployment is the right choice when:
  - You cannot interrupt in-progress jobs.
  - You can accept a slow rollout, which completes only once every existing VM has either run a job to completion or timed out.
- Immediate deployment (jobs will be interrupted): Delete all VMs after upgrading. The pool recreates them immediately with the updated config. Any jobs running on deleted VMs will fail and must be rerun.
$ kubectl delete vm -n runner-provisioner --all
Troubleshooting
Provisioner pod is not starting
Check the deployment status and pod logs:
$ kubectl get pods -n runner-provisioner
$ kubectl describe pod -n runner-provisioner <pod-name>
$ kubectl logs -n runner-provisioner deployment/runner-provisioner
Common causes:
- Image pull failure: Verify the regcred secret exists in the namespace and credentials are valid.
- Missing secret keys: If using existingSecret, confirm the secret contains both circle-token and config.yaml keys.
- Invalid config: A malformed config.yaml or missing required fields (resourceClass.name, resourceClass.token) will cause the provisioner to exit on startup.
VMs are not being created
If the provisioner is running but no VMs appear:
$ kubectl get virtualmachinepool -n runner-provisioner
$ kubectl describe virtualmachinepool -n runner-provisioner <pool-name>
$ kubectl get vm -n runner-provisioner
Common causes:
- minReplicas is 0: The pool will have 0 VMs unless there are pending tasks. Set minReplicas to at least 1 to confirm the pool is functional.
- KubeVirt not installed or not ready: Check that KubeVirt components are running: kubectl get pods -n kubevirt.
- Role-based access control misconfiguration: The provisioner ServiceAccount may lack permission to create or update VirtualMachinePool resources. Check events on the provisioner pod.
VMs are stuck in pending or never reach running
$ kubectl get vmi -n runner-provisioner
$ kubectl describe vmi -n runner-provisioner <vmi-name>
Common causes:
- No schedulable nodes: Confirm nodes in the VM worker pool have the label kubevirt.io/schedulable=true and that virt-handler is running on those nodes: kubectl get pods -n kubevirt -o wide.
- /dev/kvm not available: Run the KVM check described in Nested Virtualization. If the file is absent, nested virtualization is not enabled on that node.
- Insufficient resources: The VM spec requests more CPU or memory than any single node can provide. Check node capacity: kubectl describe nodes.
- Taint or toleration mismatch: If nodes are tainted, verify virt-launcher pods have the matching toleration (configured via the virt-handler patch in Tainted Nodes).
Runner VMs boot but do not claim jobs
Open a console on a VM, or check its cloud-init output, to confirm the runner agent started successfully:
$ kubectl get vmi -n runner-provisioner
$ virtctl console -n runner-provisioner <vmi-name>
Then, inside the VM:
$ sudo systemctl status circleci-runner
$ sudo journalctl -u circleci-runner -n 50
Common causes:
- Wrong runner token: The resource class token in your values does not match the token in CircleCI. Regenerate the token in the CircleCI web app under Self-Hosted Runners and update your Helm values.
- Wrong resource class name: The resourceClass.name in values must match the resource class your jobs target, in namespace/name format.
- CircleCI Server not reachable: If using a self-hosted server, confirm circleciAPIAddr is set and that the VM can reach that address. Check runner agent logs for connection errors.
- Cloud-init did not run: If the VM booted from a cached image state, cloud-init may have been skipped. Delete the VM and let the pool recreate it: kubectl delete vm -n runner-provisioner <vm-name>.
Scaling is not responding to job demand
Check what the provisioner sees from the CircleCI API:
$ kubectl logs -n runner-provisioner deployment/runner-provisioner -f
The provisioner logs the unclaimed and running task counts each poll cycle. If counts are always 0 when jobs are queued:
- Wrong CIRCLE_TOKEN: The API token does not have permission to query runner tasks for the configured resource class, or it belongs to the wrong org.
- Wrong circleciAPIAddr: For CircleCI Server, confirm the API address points to your instance.
- Resource class name mismatch: The provisioner queries tasks for resourceClass.name. Confirm this matches the resource class your jobs target exactly.
Config changes are not reflected in running VMs
Cloud-init runs only once at first boot. After a helm upgrade that changes tokens, API address, or VM spec, existing VMs will not pick up the new config. Delete them so the pool recreates them:
$ kubectl delete vm -n runner-provisioner --all
New VMs created by the pool will use the updated cloud-init script.
KubeVirt operator pods are not scheduling
If virt-operator, virt-api, or virt-controller pods are stuck in Pending, see Configure KubeVirt operator scheduling. The most common fix is removing the hard node affinity requirement and labeling nodes:
$ kubectl patch deployment virt-operator -n kubevirt --type=json \
-p='[{"op":"remove","path":"/spec/template/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution"}]'
$ kubectl label nodes --all node-role.kubernetes.io/control-plane=
Limitations
Current architectural limits
- Only one resource class is supported per provisioner deployment. Run multiple deployments for multiple resource classes.
- VM OS must be Debian/Ubuntu or RHEL/CentOS based.
- The provisioner requires KubeVirt’s VirtualMachinePool API (pool.kubevirt.io).
Preview-stage gaps
The following capabilities are not yet available and are planned before general availability:
- Multi-resource-class support in a single deployment.
- Metrics endpoint (Prometheus-compatible).
- Windows guest OS support for runner VMs (the cloud-init startup script is Linux-only).
If any of these are blocking your use case, post in the #runner-provisioner-preview Slack channel.
VM startup latency
When a new VM needs to be provisioned from scratch, expect two to five minutes before a runner is ready to claim a job. This includes scheduling the VM, booting the OS, and running the cloud-init script that downloads and installs the runner agent.
The primary mitigation is minReplicas. Pre-warmed VMs have already completed startup and can claim jobs in seconds. Startup latency only affects jobs that arrive when demand exceeds the pre-warmed pool.
Two factors can push latency toward the higher end or cause provisioning to fail silently:
- Package downloads: The cloud-init script installs circleci-runner from packagecloud.io at boot time. Slow or unavailable package repositories will delay or prevent the runner from starting. Baking the runner binary into a custom disk image removes this dependency.
- Cold image pulls: The first time a VM is scheduled on a node, KubeVirt must pull the full container disk image. Subsequent VMs on the same node use the cached image and are significantly faster.