Hyperparameter tuning for LLMs using CircleCI matrix workflows
Senior NLP Researcher

Hyperparameter tuning is a critical step in optimizing large language models (LLMs). Parameters such as learning rate, batch size, weight decay, and number of training epochs can significantly affect convergence behavior and final model performance. While approaches like grid search and random search are widely used, executing them manually is inefficient, especially when each training run is compute-intensive. A typical tuning process involves running the same training script multiple times with varying configurations and then comparing results. Doing this by hand not only introduces friction but also makes experiment tracking and reproducibility harder to manage at scale.
With CircleCI matrix jobs, you can automate this process. Instead of writing separate scripts or rerunning commands manually, you define a single configuration that spawns parallel jobs, each running your training code with a different hyperparameter combination. This turns hyperparameter tuning into a reproducible, hands-off workflow triggered by a single commit or scheduled job.
To track results across runs, you'll integrate Weights & Biases (wandb). It captures metrics like loss, learning curves, and evaluation scores, giving you a central dashboard to compare all your experiments in real time. By combining CircleCI matrix workflows with wandb and Hugging Face's Trainer, you'll have a scalable, automated system for tuning LLMs directly from your CI/CD pipelines.
Prerequisites
For this tutorial, you need a CircleCI account and a wandb account to automate the evaluation and logging process. Refer to the list below to set up everything required:
- A GitHub account
- Knowledge of Python
- Download and install Python (if you want to test the script locally)
- Create a CircleCI account
- A HuggingFace account
- A wandb account and an API key for access authentication
Training a sentiment analysis model using Hugging Face and wandb
Before you build the full hyperparameter tuning pipeline, you need two things in place:
- A reproducible training script that accepts external hyperparameters.
- A CI workflow that can automatically invoke this script across multiple configurations.
The first step is to create a small but effective sentiment classification model using Hugging Face Transformers. You'll define a flexible train.py script that accepts training arguments via the command line, logs metrics to Weights & Biases (wandb), and can be called repeatedly with different values. You'll also set up a virtual environment and dependency files to ensure the project is portable and version-controlled.
To begin, create a new Python project, set up a requirements.txt file, and add the contents below, which include essential packages like:
- transformers for model and training APIs
- datasets for loading standard benchmarks
- wandb for experiment tracking
# File Name: requirements.txt
wandb
transformers
datasets
torch
accelerate>=0.26.0
The train.py script will load a small dataset such as the emotion dataset, tokenize the data, and fine-tune a model like distilgpt2. For demo purposes, you will be using a small LLM and only a portion of the dataset to reduce training time and computational requirements.
# File Name: train.py
import os
import sys
import argparse
import wandb
import time
import uuid
from transformers import (
    GPT2Tokenizer,
    GPT2ForSequenceClassification,
    Trainer,
    TrainingArguments
)
from datasets import load_dataset


def parse_arguments():
    """
    Parses command-line arguments for model training hyperparameters.

    Returns:
        argparse.Namespace: Parsed arguments.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--weight_decay", type=float, default=0.0)
    parser.add_argument("--lr_scheduler_type", type=str, default="linear", choices=["linear", "cosine", "constant"])
    parser.add_argument("--adam_beta1", type=float, default=0.9)
    parser.add_argument("--adam_beta2", type=float, default=0.99)
    parser.add_argument("--max_length", type=int, default=64)
    parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
    return parser.parse_args()


def main(args):
    """
    Main training pipeline: handles WandB integration, dataset loading, tokenization,
    model setup, training, evaluation, and logging.
    """
    # Login to Weights & Biases using API key from environment variable
    wandb_api_key = os.getenv("WANDB_API_KEY")
    assert wandb_api_key and wandb_api_key != "", "WANDB API key is required"
    try:
        wandb.login(key=wandb_api_key)
    except Exception as e:
        print("Error logging into wandb. Wrong API key.")
        print(e)
        sys.exit(1)

    # Generate unique run name and group by hourly timestamp
    timestamp = time.strftime("%Y%m%d-%H")
    wandb.init(
        project="llm-hyperparam-tuning",
        group=timestamp,
        name=str(uuid.uuid4())  # Ensure uniqueness across matrix jobs
    )
    wandb.log({"hyperparameters": vars(args)})

    # Load a small public classification dataset and create train/test splits
    dataset = load_dataset("emotion")
    dataset = dataset["train"].train_test_split(test_size=0.3, seed=42)

    # Only using a small subset of the data to reduce training time and computation
    train_data = dataset["train"].select(range(140))
    test_data = dataset["test"].select(range(60))

    # Load tokenizer and set padding token
    tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
    tokenizer.pad_token = tokenizer.eos_token

    # Tokenization function for dataset
    def tokenize(example):
        return tokenizer(example["text"], truncation=True, padding="max_length", max_length=args.max_length)

    # Preprocess and format datasets for PyTorch
    train_tokenized = train_data.map(tokenize).rename_column("label", "labels")
    test_tokenized = test_data.map(tokenize).rename_column("label", "labels")
    train_tokenized.set_format("torch")
    test_tokenized.set_format("torch")

    model = GPT2ForSequenceClassification.from_pretrained("distilgpt2", num_labels=6)
    model.config.pad_token_id = model.config.eos_token_id

    # Define training configuration - We set hyperparameters here from CLI arguments
    training_args = TrainingArguments(
        output_dir="./results",
        per_device_train_batch_size=args.batch_size,
        per_device_eval_batch_size=args.batch_size,
        num_train_epochs=args.epochs,
        learning_rate=args.learning_rate,
        weight_decay=args.weight_decay,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        adam_beta1=args.adam_beta1,
        adam_beta2=args.adam_beta2,
        lr_scheduler_type=args.lr_scheduler_type,
        eval_strategy="epoch",
        save_strategy="no",
        logging_strategy="epoch",
        disable_tqdm=True
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_tokenized,
        eval_dataset=test_tokenized,
        tokenizer=tokenizer
    )

    trainer.train()
    metrics = trainer.evaluate()
    wandb.log(metrics)

    print(f"Training Arguments: {args}")
    print(f"Training complete! Eval loss: {metrics.get('eval_loss')}")
    wandb.finish()


if __name__ == "__main__":
    args = parse_arguments()
    main(args)
To enable flexible experimentation, your training script exposes a series of hyperparameters through command-line arguments using the argparse module. This design choice makes the script easily reusable in CI/CD pipelines, allowing you to sweep across a range of values without editing the code. This CLI-first architecture is what makes the script CI-ready: an external tool like CircleCI can control the experiment lifecycle just by adjusting argument values in matrix jobs, with no need to touch the script logic at all.
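Because everything is driven by flags, any runner can sweep the script. As a minimal illustration, here is a hypothetical local_sweep.py (not part of the tutorial's repository) that calls train.py as a subprocess with a few example values, which is essentially what the CircleCI matrix will do for you in parallel:
# File Name: local_sweep.py (hypothetical local helper, values are illustrative)
import subprocess

# Loop over a couple of example hyperparameter values and launch train.py for each
for lr in ["1e-4", "5e-5"]:
    for bs in ["4", "8"]:
        subprocess.run(
            [
                "python", "train.py",
                "--learning_rate", lr,
                "--batch_size", bs,
                "--epochs", "2",
            ],
            check=True,  # stop the sweep if any run fails
        )
In CI, this loop disappears entirely: CircleCI expands the matrix and runs each combination as its own job.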
Here's what you're able to configure when running train.py:
- learning_rate: Controls how quickly the model adapts to the loss gradient. Common tuning values range from 5e-5 to 2e-4. Lower values converge slowly but stably; higher values train faster but risk overshooting.
- batch_size: Defines how many samples are processed per training step. Affects memory usage and convergence stability. Set per device, useful for scaling on GPU instances.
- epochs: Sets how many full passes are made through the dataset. Higher values give the model more chances to learn, but also risk overfitting if the dataset is small.
- weight_decay: Regularization term that penalizes large weights to help generalization. Values between 0.0 and 0.1 are commonly used.
- lr_scheduler_type: Specifies the learning rate schedule. Supported values include linear, cosine, and constant.
- adam_beta1 and adam_beta2: Beta values used by the Adam optimizer to control the moving averages of gradients and squared gradients. These parameters can subtly influence how responsive the optimizer is to noisy updates.
- max_length: Token length cutoff for inputs. Inputs longer than this will be truncated, shorter ones padded. Useful for controlling memory usage and consistency across examples.
- gradient_accumulation_steps: Enables gradient accumulation to simulate larger batch sizes than your memory allows. For example, if your GPU can't fit a batch size of 16, you can use a batch size of 4 and accumulate gradients over 4 steps, as shown in the short sketch after this list.
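To make the gradient accumulation trade-off concrete, here is a small illustrative calculation (the numbers are examples, not values from the tutorial's matrix):
# Effective batch size with gradient accumulation (illustrative numbers)
per_device_batch_size = 4          # --batch_size
gradient_accumulation_steps = 4    # --gradient_accumulation_steps

# The optimizer only steps after gradients from several forward/backward
# passes have been accumulated, so each update behaves like a larger batch.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 16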
All arguments are parsed via argparse, passed directly to the Hugging Face TrainingArguments, and logged to wandb via wandb.log().
Additionally, the training script initializes a project on Weights & Biases (wandb) and logs both the hyperparameters and training metrics in real time. You also use uuid.uuid4() to generate a unique name for each run and group all jobs by timestamp (at hourly precision) for easy comparison in the wandb project. This allows you to easily track and compare key metrics like training and evaluation loss across different runs. By examining these logged values, you can identify which hyperparameters contribute most effectively to model performance, enabling more informed decisions during hyperparameter tuning.
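Once a few runs have finished, you can also pull the grouped results programmatically instead of reading the dashboard. The optional sketch below uses the wandb public API; your-entity and the group value are placeholders you would replace with your own:
# File Name: compare_runs.py (optional helper; entity and group value are placeholders)
import wandb

api = wandb.Api()

# Fetch every run in the project that belongs to the same hourly group
runs = api.runs(
    "your-entity/llm-hyperparam-tuning",
    filters={"group": "20250101-12"},  # replace with the group shown in your wandb project
)

# Print configurations sorted by evaluation loss, best first
for run in sorted(runs, key=lambda r: r.summary.get("eval_loss", float("inf"))):
    print(run.name, run.summary.get("eval_loss"), run.config)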
Note: The training script reads the wandb access token from the environment for account authentication. Refer to this wandb documentation page to learn how to get your access token. Create a new file called .env and save the access token under the name WANDB_API_KEY as shown below:
WANDB_API_KEY=<replace-your-key-here>
To run the training script locally, execute the commands below:
set -a
source .env
python train.py
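If you prefer not to export variables in your shell (for example, on Windows), one optional alternative is a small wrapper that loads the .env file with the python-dotenv package. Note that python-dotenv is not listed in requirements.txt, so treat this as a convenience sketch rather than part of the tutorial's setup:
# File Name: run_local.py (optional; requires `pip install python-dotenv`)
from dotenv import load_dotenv

from train import main, parse_arguments

load_dotenv()            # loads WANDB_API_KEY from .env into the environment
main(parse_arguments())  # runs the same entry point as `python train.py`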
Creating the CircleCI config: Matrix jobs and CI pipeline
Now that you have a functional training script, the next step is to define a CircleCI pipeline that runs it across different hyperparameter combinations.
Start by creating a .circleci/config.yml file at the root of your project, which defines a job that:
- Uses a Python Docker image (e.g. cimg/python:3.10)
- Installs project dependencies
- Runs the train.py script with CLI parameters passed in via CircleCI job parameters
# File Name: .circleci/config.yml
version: 2.1

executors:
  python-executor:
    docker:
      - image: cimg/python:3.10 # Official CircleCI Python 3.10 image
    working_directory: ~/project

jobs:
  train:
    executor: python-executor
    # Hyperparameters to sweep via matrix
    parameters:
      learning_rate:
        type: string
      batch_size:
        type: string
      epochs:
        type: string
      lr_scheduler_type:
        type: string
    steps:
      - checkout
      - run:
          name: Set up Python
          command: |
            python -m venv venv
            . venv/bin/activate
            pip install --upgrade pip
            pip install -r requirements.txt
      - run:
          name: Train with current hyperparameters
          command: |
            . venv/bin/activate
            python train.py \
              --learning_rate << parameters.learning_rate >> \
              --batch_size << parameters.batch_size >> \
              --epochs << parameters.epochs >> \
              --lr_scheduler_type << parameters.lr_scheduler_type >>

# Define workflow to run all matrix combinations
workflows:
  train-matrix:
    jobs:
      - train:
          # Set possible values of hyperparameters. Spawns different jobs based on all combinations
          matrix:
            parameters:
              learning_rate: ["1e-4", "1e-5"]
              batch_size: ["4", "8"]
              epochs: ["2", "4"]
              lr_scheduler_type: ["linear", "cosine"]
CircleCI's matrix jobs allow you to define a parameterized job and specify a matrix of values for key variables like learning rate, batch size, and learning rate scheduler type. CircleCI then automatically generates one job for each combination and runs them in parallel, assuming enough concurrency is available. For the purposes of this tutorial, you are using only four hyperparameters, each with two sample values, which limits the total number of experiments to 16.
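If you want to sanity-check the grid before pushing, the combinations CircleCI will expand can be reproduced locally with a few lines of Python; the values below mirror the matrix in config.yml:
# Quick sanity check: enumerate the combinations the CircleCI matrix will expand to
from itertools import product

learning_rates = ["1e-4", "1e-5"]
batch_sizes = ["4", "8"]
epochs = ["2", "4"]
schedulers = ["linear", "cosine"]

grid = list(product(learning_rates, batch_sizes, epochs, schedulers))
print(f"Total jobs: {len(grid)}")  # 2 x 2 x 2 x 2 = 16

for lr, bs, ep, sched in grid:
    print(
        f"python train.py --learning_rate {lr} --batch_size {bs} "
        f"--epochs {ep} --lr_scheduler_type {sched}"
    )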
Setting up the project on CircleCI
The complete code for this project is available on GitHub. To execute the defined workflow, you can connect the code repository to your CircleCI account. Start by heading to the Projects tab on your CircleCI dashboard and creating a new project. This will redirect you to a page where you can set up your workflow.
If you haven’t already connected your GitHub account to CircleCI, you’ll need to do that first. Once connected, select the relevant repository for this project.
CircleCI will detect the config.yml file that you defined earlier. You can continue with the configuration and set up triggers to control when your pipeline will execute. For this example, configure the pipeline to run on pull requests with the "run ci" label, which gives you the flexibility to run a hyperparameter search whenever required.
Once the project is created, you can access it to review the workflow details. Before triggering the pipeline, you need to set the required environment variables for execution. To do this, go to Project Settings > Environment Variables, and add the WANDB_API_KEY. The API key is used in the train.py file to authorize your wandb access.
Use the correct key name; it is hard-coded in the Python file. After adding the environment variable, review your settings page.
The pipeline will execute based on the triggers you set, or you can manually trigger the pipeline with modified parameters. Once it has been triggered, you can check the pipeline’s progress and confirm its successful execution. If everything is set up correctly, the process should complete as expected, and the result will be available in your CircleCI dashboard.
In this setup, you launch 16 parallel jobs, each corresponding to a unique combination of hyperparameters defined in your CircleCI matrix. Each job runs independently, allowing you to monitor progress in isolation.
All runs are automatically synced to your wandb project, where you can later review detailed logs, training curves, and evaluation metrics. This makes it easy to compare the impact of each hyperparameter set and identify the optimal configuration for your model. Project graphs and logs will be available.
Conclusion
By combining CircleCI matrix workflows with Hugging Face and wandb logging, you can fully automate hyperparameter tuning for your LLM training pipelines. Instead of manually rerunning scripts for each configuration, you define a matrix once and let CircleCI spawn independent jobs for every parameter combination on demand. You can even set triggers such as new commits, branch updates, or scheduled intervals to launch automated grid searches after major changes in your training code.
This approach not only removes repetitive manual effort but also makes it easy to scale hyperparameter sweeps across multiple environments. CI/CD pipelines manage the orchestration, execution, and tracking seamlessly, while wandb gives you a unified dashboard to compare and analyze results.
Going forward, you can extend this setup by introducing multi-node or multi-GPU training for larger models, exposing additional hyperparameters like dropout rates or hidden layer sizes, or automatically saving the best-performing model checkpoints during training. With minor adjustments, this matrix-based workflow can evolve into a fully fledged, production-ready hyperparameter optimization system integrated directly into your machine learning development lifecycle.