Introduction to Nomad Cluster Operation
CircleCI uses Nomad as the primary job scheduler in CircleCI 2.0. This document provides a basic introduction to Nomad for understanding how to operate the Nomad Cluster in your CircleCI 2.0 installation in the following sections:
- Basic Terminology and Architecture
- Basic Operations
- Scaling the Nomad Cluster
Basic Terminology and Architecture
Nomad Server: The Nomad server is the brain of the cluster. It receives jobs and allocates them to Nomad clients. In CircleCI, the Nomad server runs in your service box as a Docker container.
Nomad Client: Nomad clients execute the jobs allocated by the Nomad server. A Nomad client usually runs on a dedicated machine (often a VM) so that it can take full advantage of the machine's resources. Multiple Nomad clients form a cluster, and the Nomad server allocates jobs across the cluster using its scheduling algorithm.
Nomad Jobs: A Nomad job is a specification provided by users that declares a workload for Nomad. In CircleCI 2.0, a Nomad job corresponds to one execution of a CircleCI job/build. If the job/build uses parallelism, for example a parallelism of 10, Nomad runs 10 jobs.
Build Agent: Build Agent is a Go program written by CircleCI that executes the steps in a job and reports the results. It runs as the main process inside a Nomad job.
Basic Operations
This section is a basic guide to operating the Nomad cluster in your installation.
The nomad CLI is installed on the Service instance and is pre-configured to talk to the Nomad cluster, so you can run the nomad commands in this section from there.
Checking the Jobs Status
The nomad status command lists the status of the jobs in your cluster. Status is the most important field in the output, and it takes one of the following values:
- running: The status becomes running when Nomad has started executing the job. This typically means your job in CircleCI has started.
- pending: The status becomes pending when there are not enough resources available in the cluster to execute the job.
- dead: The status becomes dead when Nomad finishes executing the job, regardless of whether the corresponding CircleCI job/build succeeds or fails.
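As a rough illustration of reading the Status field, the following sketch summarizes how many jobs are in each state. The function name count_jobs_by_status is hypothetical, and the sketch assumes that the first line of the nomad status output is a header row containing a Status column and that no column to its left contains spaces. Check the output of your Nomad version before relying on it.

```shell
# Summarize job-list output (read from stdin) as "<status> <count>" lines.
# Assumption: line 1 is a header row with a "Status" column, and the
# columns to the left of it never contain spaces.
count_jobs_by_status() {
  awk '
    NR == 1 {
      # Locate the Status column from the header row.
      for (i = 1; i <= NF; i++) if ($i == "Status") col = i
      next
    }
    col && NF >= col { counts[$col]++ }
    END { for (s in counts) print s, counts[s] }
  ' | sort
}
```

It could be used as nomad status | count_jobs_by_status. A growing pending count suggests that the cluster is short of capacity.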
Checking the Cluster Status
The nomad node-status command lists the Nomad clients. Note that it reports both clients that are currently serving (status active) and clients that were taken out of the cluster (status down). Therefore, count the clients with status active to find the current capacity of your cluster.
The nomad node-status -self command gives more information about the client on which you run it, including how many jobs are running on that client and its resource utilization.
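Counting serving clients by hand becomes tedious on larger clusters. The sketch below counts the rows of the client list whose Status column matches a given value. The function name count_nodes_with_status is hypothetical, and the sketch assumes that the first output line is a header row containing a Status column and that no column to its left contains spaces.

```shell
# Count rows of the client list (read from stdin) whose Status column
# equals the value given as the first argument.
# Assumption: line 1 is a header row with a "Status" column.
count_nodes_with_status() {
  wanted=$1
  awk -v wanted="$wanted" '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "Status") col = i
      next
    }
    col && $col == wanted { n++ }
    END { print n + 0 }
  '
}
```

For example, nomad node-status | count_nodes_with_status active. Pass whichever label your Nomad version prints for serving clients, since some versions print ready instead.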
Checking Logs
As noted in the Nomad Jobs section above, a Nomad job corresponds to one execution of a CircleCI job/build. Checking the logs of a Nomad job can therefore help you understand the status of a CircleCI job/build when there is a problem.
The nomad logs -job -stderr <nomad-job-id> command prints the logs of the job.
Note: Be sure to specify the -stderr flag, because most Build Agent logs are written to stderr.
Although the nomad logs -job command is useful, it is not always accurate because the -job flag picks a random allocation of the specified job. An allocation is a smaller unit within a Nomad job and is out of scope for this document. To learn more, see the official Nomad documentation.
Complete the following steps to get the logs from a specific allocation of a job:
- Get the job ID with the nomad status command.
- Get the allocation ID of the job with the nomad status <job-id> command.
- Get the logs from the allocation with the nomad logs -stderr <allocation-id> command.
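The steps above can be sketched as a small helper that prints the stderr logs of every allocation of a job. The function names alloc_ids and logs_for_job are hypothetical. The parsing assumes that nomad status <job-id> lists allocations under a line reading Allocations, followed by a header row, with the allocation ID in the first column. Verify this against your Nomad version.

```shell
# Print allocation IDs found in per-job status output read from stdin.
# Assumption: allocations follow a line reading "Allocations", then a
# header row, with the allocation ID in the first column.
alloc_ids() {
  awk '
    $0 == "Allocations" { in_allocs = 1; skip_header = 1; next }
    in_allocs && skip_header { skip_header = 0; next }
    in_allocs && NF == 0 { in_allocs = 0 }
    in_allocs { print $1 }
  '
}

# Fetch the stderr logs of each allocation of the given job.
logs_for_job() {
  job_id=$1
  nomad status "$job_id" | alloc_ids | while read -r alloc; do
    echo "==> allocation $alloc"
    # Keep the log command from consuming the list of allocation IDs.
    nomad logs -stderr "$alloc" < /dev/null
  done
}
```

Calling logs_for_job <job-id> then prints one log section per allocation, which avoids depending on the random allocation that the -job flag selects.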
Scaling the Nomad Cluster
Nomad itself does not provide a way to scale the cluster, so you must implement one. This section covers the basic operations involved in scaling a cluster.
Scaling Up the Client Cluster
Scaling up a Nomad cluster is straightforward: register new Nomad clients with the cluster. A Nomad client that knows the IP addresses of the Nomad servers registers with the cluster automatically.
HashiCorp recommends using Consul or another service discovery mechanism to make this more robust in production. For more information, see the Clustering, Service Discovery, and Consul Integration pages in the official documentation.
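To illustrate what "knows the IP addresses of the Nomad servers" means in a plain Nomad setup, the sketch below prints a minimal client configuration with a static server list. The function name render_client_config is hypothetical, and the sketch assumes that the servers listen on Nomad's default RPC port, 4647. How your CircleCI installation provisions its clients may differ, so treat this as an illustration rather than the installer's actual configuration.

```shell
# Print a minimal Nomad client configuration for the given server IPs.
# Assumption: every server listens on the default RPC port, 4647.
render_client_config() {
  servers=""
  for ip in "$@"; do
    # Append each address as a quoted, comma-separated list element.
    servers="${servers:+$servers, }\"$ip:4647\""
  done
  cat <<EOF
client {
  enabled = true
  servers = [$servers]
}
EOF
}
```

For example, render_client_config 10.0.0.5 10.0.0.6 prints a client block that lists both servers. The IP addresses here are placeholders.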
Shutting Down a Nomad Client
Before you shut down a Nomad client, you must first set the client to drain mode. In drain mode, the client finishes the jobs already allocated to it but is not allocated new jobs.
- To drain a client, log in to the client and set it to drain mode with the node-drain command as follows:
nomad node-drain -self -enable
- Then, confirm that the client is in drain mode with:
nomad node-status -self
Alternatively, you can drain a remote node with:
nomad node-drain -enable -yes <node-id>
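The two local steps, draining and then confirming, can be combined in a script. The sketch below is one possible approach. The function names is_draining and drain_self are hypothetical, and the check assumes that the nomad node-status -self output contains a line of the form Drain = true once drain mode is enabled.

```shell
# Succeed only if this client reports that drain mode is enabled.
# Assumption: the status output contains a line like "Drain = true".
is_draining() {
  nomad node-status -self | awk '
    $1 == "Drain" && $2 == "=" { found = ($3 == "true") }
    END { exit !found }
  '
}

# Enable drain mode on this client, then confirm that it took effect.
drain_self() {
  nomad node-drain -self -enable || return 1
  is_draining
}
```

A caller can then write drain_self || echo "drain not confirmed" and stop before shutting the machine down.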
Scaling Down the Client Cluster
To scale down your Nomad cluster properly, you need a mechanism that puts a client into drain mode and then waits for all of its jobs to finish before terminating the client.
There are many ways to achieve this. Here is one example that uses an AWS Auto Scaling group (ASG):
- Configure an ASG lifecycle hook that triggers a script when instances are scaled down.
- The script puts the instance into drain mode.
- The script monitors the running jobs on the instance and waits for them to finish.
- Terminate the instance.
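The script in this example might look like the following sketch. It is not CircleCI's implementation. The function names, and the HOOK_NAME, ASG_NAME, INSTANCE_ID and POLL_SECONDS variables, are hypothetical and would be supplied by whatever invokes the script. The job count assumes that nomad node-status -self lists allocations under a line reading Allocations, with running as a status value. Termination is left to the ASG, which proceeds once the lifecycle action is completed with the AWS CLI.

```shell
# Count allocations on this client that are still running.
# Assumption: allocations are listed one per row under a line reading
# "Allocations", and running ones carry the status value "running".
running_allocs() {
  nomad node-status -self | awk '
    $0 == "Allocations" { in_allocs = 1; next }
    in_allocs && NF == 0 { in_allocs = 0 }
    in_allocs { for (i = 1; i <= NF; i++) if ($i == "running") { n++; break } }
    END { print n + 0 }
  '
}

# Drain this client, wait for its jobs, then let the ASG terminate it.
# HOOK_NAME, ASG_NAME, INSTANCE_ID and POLL_SECONDS are hypothetical
# variables set by the caller.
scale_down_self() {
  nomad node-drain -self -enable || return 1
  while [ "$(running_allocs)" -gt 0 ]; do
    sleep "${POLL_SECONDS:-30}"
  done
  # Tell the ASG that the lifecycle hook is done so it can terminate.
  aws autoscaling complete-lifecycle-action \
    --lifecycle-hook-name "$HOOK_NAME" \
    --auto-scaling-group-name "$ASG_NAME" \
    --instance-id "$INSTANCE_ID" \
    --lifecycle-action-result CONTINUE
}
```

In practice you would also want a timeout on the wait loop, so that a stuck job cannot hold the instance until the lifecycle hook's own timeout expires.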