Finetuning Gemma 3 on private data with Unsloth and CircleCI
Fine-tuning Large Language Models (LLMs) on private, domain-specific data can unlock significant value for your specific use case. When done correctly, you can create AI apps that understand your organization’s unique context. These apps can speak your brand’s voice and deliver remarkably accurate results that general models cannot match.
However, finetuning is not always the right solution. Many teams rush into this complex technique without exploring simpler alternatives first. Before you invest time and resources into finetuning, you should consider techniques like using mega prompts or few-shot learning. These approaches often deliver excellent results with far less complexity.
There are compelling scenarios where finetuning is not just beneficial, but essential. For example, you may need your model’s output to perfectly match your organization’s unique writing style. Maybe you want to optimize costs by making smaller, more efficient models perform like their larger, more expensive counterparts. Or you may be working with highly specialized domain knowledge that does not exist in pre-trained models.
Open-source models present an exceptional opportunity for finetuning. Some open-source models, like Gemma 3, are highly capable. They come in various sizes to match your resource constraints and do not have the licensing restrictions that limit many commercial models.
The traditional finetuning process presents significant challenges:
- Is often manual and error-prone.
- Requires substantial GPU resources.
- Lacks the reproducibility and version control that modern development practices demand.
The approach described in this tutorial is particularly powerful. Every time you push updates to your code or data, your pipeline will:
- Fine-tune your model,
- Validate the results, and
- Deploy the improved version to Hugging Face Hub.
This tutorial will lead you through building an automated Gemma 3-4B finetuning pipeline. To do this, you will use:
- Kaggle Notebooks using private Kaggle Datasets for GPU resources and secure data handling.
- GitHub for robust version control.
- CircleCI for orchestrating the CI/CD workflow.
- Hugging Face Hub for hosting and sharing your fine-tuned model.
Prerequisites
Before building your automated Gemma 3 finetuning pipeline, you will need to set up several accounts and acquire credentials for them. The good news is that you will only use services offering generous free tiers. These are perfect for getting started without making any financial commitment.
- GitHub account for version control.
- Hugging Face account with a write-access token for automated uploading of your fine-tuned models.
- Kaggle account with API access enabled and your kaggle.json credentials file downloaded. Kaggle provides up to 30 hours of free GPU time per week. This is generous enough for most finetuning experiments.
- Weights & Biases account for experiment tracking and monitoring. You will need to generate an API key.
- CircleCI account connected to your GitHub repository. CircleCI’s free tier provides 6,000 build minutes monthly, more than enough for orchestrating your pipeline.
- You should be comfortable with Python basics, including working with libraries and file operations.
For LLM concepts, you’ll need some familiarity with transformer architectures and tokenization. An understanding of the difference between pre-training and finetuning will be beneficial. If these concepts are new to you, the Hugging Face NLP Course provides excellent background.
You don’t need prior experience with Unsloth, LoRA, or parameter-efficient finetuning techniques; these will be covered as you build the pipeline.
Understanding Gemma 3 models
Before you start finetuning, you need to understand the Gemma 3 model variant that best suits your needs. Google’s Gemma 3 family offers different sizes optimized for various use cases, from mobile deployment to high-performance applications.
The key advantages of Gemma 3 models are their efficient architecture and strong instruction-following capabilities. They’re built on Google’s latest transformer innovations, making them particularly effective for finetuning with parameter-efficient methods like LoRA.
| Model Size | Parameters | VRAM Required | Best Use Cases | Training Time (Est.) |
|---|---|---|---|---|
| Gemma 3-1B | 1 billion | 4-6 GB | Mobile apps, edge deployment | 2-4 hours |
| Gemma 3-4B | 4 billion | 8-12 GB | General applications, cost-effective | 4-8 hours |
| Gemma 3-12B | 12 billion | 16-24 GB | High-quality responses, complex tasks | 8-16 hours |
| Gemma 3-27B | 27 billion | 32-48 GB | Production systems, maximum performance | 16-32 hours |
For this tutorial, you’ll use Gemma 3-4B because it strikes the perfect balance between performance and resource requirements. It fits comfortably within Kaggle’s GPU limits while delivering excellent results for most finetuning scenarios.
License and usage
Gemma 3 models are released under Google’s Gemma Terms of Use, which allows commercial use with minimal restrictions. Unlike some open-source models, you can deploy Gemma 3 in production environments without complex licensing concerns. However, you should review the terms if you’re planning large-scale commercial deployment.
The models excel at instruction following, making them ideal for chat applications, Q&A systems, and specialized domain tasks where you need consistent, reliable responses.
Finetuning approaches for Gemma 3
When finetuning Gemma 3, you have several approaches to choose from. Understanding these methods will help you make informed decisions about efficiency, cost, and performance trade-offs. This tutorial covers the following methods:
- Parameter-efficient finetuning methods
- LoRA and QLoRA techniques
- Unsloth optimization for Gemma 3
- Recommended hyperparameters and training configurations
Parameter-efficient finetuning methods
Traditional finetuning updates all model parameters, which is computationally expensive and memory-intensive. Parameter-efficient finetuning (PEFT) methods solve this by updating only a small subset of parameters while keeping the original model frozen.
This approach offers several advantages: dramatically reduced memory requirements, faster training times, and the ability to fine-tune large models on consumer hardware. You’ll also avoid catastrophic forgetting, where the model loses its general capabilities during finetuning.
LoRA and QLoRA Techniques
Low-Rank Adaptation (LoRA) is the most popular PEFT method for Gemma 3. Instead of updating the full weight matrices, LoRA adds small, trainable matrices that approximate the changes needed for your specific task.
QLoRA extends this concept by quantizing the base model to 4-bit precision while keeping the LoRA adapters in full precision. This combination allows you to fine-tune large models like Gemma 3-12B on GPUs with limited VRAM.
Here’s why this matters for your pipeline: QLoRA can reduce memory usage by up to 75% compared to full finetuning, making it perfect for Kaggle’s free GPU environment.
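To make the idea concrete, here is a minimal, illustrative sketch of a QLoRA-style configuration using the Hugging Face transformers and peft libraries directly. The tutorial itself relies on Unsloth’s load_in_4bit shortcut later, so the exact values and wiring here are assumptions for illustration only:

# Illustrative QLoRA-style configuration (not the exact code used later in this tutorial).
# The quantized base model stays frozen in 4-bit; only the LoRA adapters are trained in full precision.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize frozen base weights to 4-bit
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in float16
)

lora_config = LoraConfig(
    r=16,               # rank of the low-rank update matrices
    lora_alpha=32,      # scaling factor applied to the LoRA update
    lora_dropout=0.1,
    target_modules="all-linear",
)

# These configs would be passed to from_pretrained(..., quantization_config=bnb_config)
# and get_peft_model(model, lora_config) respectively.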
Unsloth Optimization for Gemma 3
Unsloth is a specialized optimization library that accelerates Gemma 3 finetuning by 2-5x while reducing memory usage. It implements custom kernels and memory-efficient attention mechanisms specifically designed for modern transformer architectures.
What makes Unsloth particularly valuable is its seamless integration with popular libraries like Transformers and TRL (Transformer Reinforcement Learning), while automatically handling the complex optimizations under the hood. You’ll get significant speed improvements without sacrificing model quality.
Recommended hyperparameters and training configurations
For Gemma 3-4B finetuning with QLoRA, these hyperparameters provide a solid starting point:
# LoRA Configuration
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.1
# Training Parameters
learning_rate: 2e-4
batch_size: 4
gradient_accumulation_steps: 4
max_steps: 500
warmup_steps: 100
# Quantization
load_in_4bit: true
bnb_4bit_compute_dtype: torch.float16
These settings balance training speed with model quality. You can increase lora_rank for more complex tasks or reduce batch_size if you encounter memory issues. The key is to start with these proven defaults and adjust based on your specific dataset and performance requirements.
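As a bridge to the training code later in this tutorial, here is a rough sketch of how the recommended values above map onto TRL’s SFTConfig. Treat the arguments as a starting point rather than a definitive recipe; the notebook you build later uses slightly different values tuned for a short demonstration run:

# Sketch: the recommended hyperparameters expressed as a TRL SFTConfig.
# The LoRA-specific values (lora_rank, lora_alpha, lora_dropout) are passed to the PEFT setup,
# and load_in_4bit is set when loading the model.
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="outputs",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_steps=500,
    warmup_steps=100,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    report_to="wandb",
)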
Finetuning architecture
Your automated Gemma 3 finetuning pipeline connects four essential components to create a seamless workflow from code changes to model deployment. Understanding this architecture will help you troubleshoot issues and extend the system for your specific needs.
High-level pipeline overview
The pipeline follows an event-driven architecture where each component has a specific responsibility. When you push code changes to GitHub, CircleCI orchestrates the entire process: executing your finetuning notebook on Kaggle’s GPUs, monitoring progress through Weights & Biases, and automatically publishing the trained model to Hugging Face Hub.
This design separates concerns effectively: GitHub handles version control, Kaggle provides computational resources, CircleCI manages orchestration, and Hugging Face serves as your model registry. Each service excels at its specific role, creating a robust and scalable system.
Key components and responsibilities
- GitHub Repository serves as your source of truth, containing your finetuning notebook, configuration files, and training data. Every change triggers the pipeline, ensuring reproducibility and version control.
- Kaggle Environment executes your computationally intensive finetuning jobs using free GPU resources. The platform’s notebook environment provides the perfect isolated space for your training experiments.
- CircleCI Orchestration acts as the central coordinator, managing API calls to Kaggle, monitoring job status, and handling deployment workflows. It ensures your pipeline runs reliably and handles failures gracefully.
- Weights & Biases Integration provides real-time monitoring and experiment tracking, allowing you to compare training runs and optimize hyperparameters based on historical data.
- Hugging Face Hub serves as your model registry, automatically hosting your fine-tuned models with proper documentation and version control.
Data flow and security considerations
Your training data flows securely through the pipeline without exposing sensitive information. The data remains within Kaggle’s environment during training, while only model weights and metadata are transferred to Hugging Face Hub.
API tokens are managed through CircleCI’s secure environment variables, ensuring credentials never appear in your code or logs. Each service communicates through authenticated APIs, maintaining security throughout the process.
Architecture diagram
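A simplified sketch of the flow described above:

git push → GitHub repository → CircleCI pipeline (orchestration) → Kaggle notebook (GPU finetuning, metrics streamed to Weights & Biases) → Hugging Face Hub (fine-tuned model published)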
This architecture creates a fully automated, secure, and scalable finetuning pipeline. Each component communicates through well-defined APIs, ensuring your pipeline remains maintainable and extensible as your requirements evolve.
The separation of concerns means you can modify individual components without affecting the entire system. For example, you could easily switch from Kaggle to Google Colab or add additional validation steps without restructuring the core pipeline.
Developing the Gemma 3 finetuning notebook
Now that you understand the architecture and theory behind finetuning Gemma 3, it is time to do the actual implementation. In this section, you will build a complete Kaggle notebook that handles the entire finetuning process for Gemma 3-4B.
The focus on the 4B variant is because it strikes the perfect balance between performance and resource efficiency. Unlike the larger 27B model, which requires extensive memory, the 4B model delivers strong performance and fits comfortably within Kaggle’s GPU limitations. On the other hand, the smaller 1B model may lack sufficient capacity for complex tasks (such as multimodal tasks).
Create a Kaggle notebook structure for Gemma 3
You will structure your Kaggle notebook to separate concerns and make debugging easier. The first step is to install the required dependencies. Before installing dependencies, add your tokens in Kaggle:
- For HF_TOKEN: Go to your Kaggle account settings → API → Secrets → Add New Secret. Name it “HF_TOKEN” and paste your Hugging Face token (get it from huggingface.co/settings/tokens).
- For wandb token: Similarly, add another secret named “wb_token” and paste your Weights & Biases API key (find it at wandb.ai/authorize).
Make sure your notebook has internet enabled and these secrets will be accessible through the UserSecretsClient() as shown in your code. The tokens remain private and will not be visible in your public notebook.
You will also need to activate a GPU instance for your notebook. To activate T4 GPUs in your Kaggle notebook:
- Click Settings in the right panel of your notebook.
- Under Accelerator, click GPU T4 x2.
Your notebook will restart and you’ll have GPU access. You can verify it is working by running !nvidia-smi in a cell.
Cell 1: Install prerequisite packages
Before you can start finetuning Gemma 3, you need to set up your Kaggle environment with the right dependencies. The %%capture magic command at the top suppresses the verbose installation output.
%%capture
import os
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
!pip install transformers==4.52.4
!pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
!pip install --no-deps unsloth
Note that we pin transformers==4.52.4 to avoid compatibility issues with Unsloth. The core packages include:
- unsloth for accelerated training
- peft for parameter-efficient finetuning
- bitsandbytes for quantization
- trl for reinforcement learning from human feedback capabilities
- hf_transfer for speeding up your model uploads to Hugging Face
Log into Hugging Face and WandB
# --- Hugging Face Login (for pushing the model) ---
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("HF_TOKEN")
if hf_token:
    login(token=hf_token)
    print("Logged into Hugging Face.")
else:
    print("Warning: HF_TOKEN environment variable not set. Model pushing to Hugging Face might fail.")
This code securely retrieves your Hugging Face token from Kaggle’s secrets management system. It also authenticates you with the Hub. You need this authentication to push your fine-tuned Gemma 3 model to Hugging Face later.
You will also log into WandB using:
import wandb
wb_token = user_secrets.get_secret("wb_token")
wandb.login(key=wb_token)
run = wandb.init(
    project='Fine-tuning Gemma-3-4B on Latex-OCR Dataset',
    job_type="training",
    anonymous="allow",
)
This authenticates with Weights & Biases using your stored token and initializes a new experiment run to track your training metrics. The anonymous="allow" parameter ensures access to experiment logs even if there are authentication issues.
Load and inspect the dataset
For this exercise, you will use a sampled dataset of images of handwritten math formulas. The aim is to finetune a Gemma 3-4B model to convert the images into LaTeX (a computer-readable markup format) for rendering. This is useful for LLM-powered math apps that let students photograph math problems for an LLM to solve. You can access the sampled dataset here. The full dataset is also available.
Run this code to download both train and test splits:
from datasets import load_dataset
print("--- Loading LaTeX_OCR dataset ---")
train_dataset = load_dataset("unsloth/LaTeX_OCR", split="train")
test_dataset = load_dataset("unsloth/LaTeX_OCR", split="test")
Inspect the train dataset:
train_dataset
There are more than 68,000 rows in the dataset, a reasonable size for finetuning.
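If you want faster iteration while you are still wiring up the pipeline, you can optionally work with a smaller sample first. This is an optional shortcut using the standard datasets API, not part of the original workflow:

# Optional: take a smaller random sample for quick experiments.
# You would then pass this subset (instead of the full split) through the conversion step below.
small_train_dataset = train_dataset.shuffle(seed=42).select(range(5000))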
You can also inspect the test dataset:
test_dataset
The test dataset contains more than 7,000 rows:
You can now inspect sample images and their corresponding LaTeX code as follows:
train_dataset[12]["image"]
Check the corresponding LaTeX code:
train_dataset[12]["text"]
You will also use IPython’s built-in display tools to render the LaTeX code as a math equation:
from IPython.display import display, Math, Latex
latex = train_dataset[12]["text"]
display(Math(latex))
Format data for effective finetuning
You need to transform your raw dataset into the conversation format that Gemma 3 expects during finetuning. Each sample is paired with an instruction and its LaTeX answer in a chat-style structure, as shown here:
# Prepare the training data in conversation format
instruction = "Write the LaTeX representation for this image."
def convert_to_conversation(sample):
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image", "image": sample["image"]},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
    ]
    return {"messages": conversation}
converted_train_dataset = [convert_to_conversation(sample) for sample in train_dataset]
In this code, you’re creating a structured dialogue where each sample becomes a user-assistant exchange. The user provides an instruction (“Write the LaTeX representation for this image”) along with an image, and the assistant responds with the corresponding LaTeX code. This conversational structure helps Gemma 3 understand the context and expected behavior.
Inspect the converted dataset:
converted_train_dataset[50]
Here is the dataset after conversion:
{'messages': [{'role': 'user',
'content': [{'type': 'text',
'text': 'Write the LaTeX representation for this image.'},
{'type': 'image',
'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=160x40>}]},
{'role': 'assistant',
'content': [{'type': 'text',
'text': '\\frac { d \\varphi _ { s p h } } { d r } ( r \\rightarrow \\infty ) \\rightarrow 0'}]}]}
Loading Gemma 3 models with appropriate quantization
This code loads the pre-trained Gemma-3-4B model using Unsloth’s optimized FastVisionModel wrapper, which is specifically designed for vision-language tasks like your LaTeX OCR project.
from IPython.display import display, Math, Latex
import torch
from unsloth import FastVisionModel, get_chat_template # Import necessary unsloth components
from datasets import load_dataset # Import load_dataset to get the dataset
from trl import SFTTrainer, SFTConfig # Import for fine-tuning
import difflib # Import difflib for sequence comparison
# This is the initial, pre-trained Gemma-3-4B model.
print("--- Loading base Gemma-3-4B model and processor ---")
model, processor = FastVisionModel.from_pretrained(
    "unsloth/gemma-3-4b-pt",
    load_in_4bit = True,  # Use 4bit to reduce memory use. False for 16bit LoRA.
    use_gradient_checkpointing = "unsloth",  # True or "unsloth" for long context
)
You’re enabling 4-bit quantization (load_in_4bit = True) to significantly reduce memory usage – essential when working with large models on Kaggle’s GPU limits – and activating Unsloth’s gradient checkpointing for efficient training. The model and processor are loaded together since you’ll need both components to handle the multimodal input (images and text) during finetuning.
Here is the expected output:
Apply the chat template to the processor:
# Apply chat template to the processor for conversational formatting
processor = get_chat_template(
    processor,
    "gemma-3",
)
The chat template formats the converted dataset according to Gemma 3’s expected input structure for dialogue-based interactions.
You also need to create a helper function to generate LaTeX from an image:
# --- Helper Function: Generate LaTeX from an image ---
def generate_latex_from_image(model_to_use, processor_to_use, image, instruction):
    """
    Generates LaTeX representation for a given image using the provided model and processor.
    """
    messages = [
        {
            "role": "user",
            "content": [{"type": "image"}, {"type": "text", "text": instruction}],
        }
    ]
    input_text = processor_to_use.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor_to_use(
        image,
        input_text,
        add_special_tokens=False,
        return_tensors="pt",
    ).to("cuda")
    output_ids = model_to_use.generate(
        **inputs,
        max_new_tokens=128,
        use_cache=True,
        temperature=1.0,
        top_p=0.95,
        top_k=64,
    )
    generated_text_ids = output_ids[0, inputs["input_ids"].shape[1]:]
    generated_latex = processor_to_use.decode(generated_text_ids, skip_special_tokens=True)
    return generated_latex.replace("<end_of_turn>", "").strip()
Create a helper function to evaluate the model before and after finetuning:
def calculate_latex_similarity(latex1, latex2):
    """
    Calculates a similarity ratio between two LaTeX strings using SequenceMatcher.
    Returns a float between 0.0 and 1.0.
    """
    return difflib.SequenceMatcher(None, latex1, latex2).ratio()
Finetuning is only worth the effort if the model actually performs better afterwards. Run this code to evaluate the model before finetuning so you have a baseline:
### Test the model before fine-tuning
# --- 3. Test the model BEFORE Fine-tuning ---
print("\n" + "="*50)
print("Testing Model Performance BEFORE Fine-tuning")
print("="*50)
# Set the base model to inference mode
FastVisionModel.for_inference(model)
# Choose a specific example from the test set to evaluate initial performance
test_example_index = 24 # You can change this index to test different examples
image_test_case = test_dataset[test_example_index]["image"]
original_latex_test_case = test_dataset[test_example_index]["text"]
print(f"\n--- Test Example (Index {test_example_index}) BEFORE Fine-tuning ---")
generated_latex_before_finetune = generate_latex_from_image(model, processor, image_test_case, instruction)
print("Original LaTeX:")
print(original_latex_test_case)
display(Math(original_latex_test_case))
print("Generated LaTeX (Before Fine-tuning):")
print(generated_latex_before_finetune)
display(Math(generated_latex_before_finetune))
similarity_before = calculate_latex_similarity(original_latex_test_case, generated_latex_before_finetune)
print(f"Similarity Score (Before Fine-tuning): {similarity_before:.4f}")
if generated_latex_before_finetune == original_latex_test_case:
    print("Exact Match: YES (Unlikely before fine-tuning)")
else:
    print("Exact Match: NO (Expected before fine-tuning)")
print("\nOriginal Image:")
display(image_test_case)
The output shows that the model is hallucinating. The base model does not understand the task at hand.
Finetuning with Unsloth and LoRA
The code sets up and executes the core finetuning process by first adding LoRA (Low-Rank Adaptation) adapters to your Gemma 3 model. This allows efficient parameter updates without modifying the entire model:
# --- 4. Fine-tuning with Unsloth and LoRA ---
print("\n\n" + "="*50)
print("Starting Fine-tuning Process with Unsloth and LoRA")
print("="*50)
# Add LoRA adapters to the model
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers = True,      # apply LoRA adapters to the vision encoder layers
    finetune_language_layers = True,    # apply LoRA adapters to the language model layers
    finetune_attention_modules = True,
    finetune_mlp_modules = True,
    r = 16,             # LoRA rank
    lora_alpha = 16,    # LoRA scaling factor
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
    target_modules = "all-linear",
    modules_to_save = [
        "lm_head",
        "embed_tokens",
    ],
)
# Enable model for training
FastVisionModel.for_training(model)
The SFTTrainer then handles the actual training with carefully configured hyperparameters like a low learning rate (2e-4), small batch size with gradient accumulation, and cosine learning rate scheduling – all optimized for stable finetuning on vision-language tasks while working within Kaggle’s memory constraints.
# Configure and setup the SFTTrainer
from unsloth.trainer import UnslothVisionDataCollator # Import data collator
trainer = SFTTrainer(
    model=model,
    train_dataset=converted_train_dataset,
    processing_class=processor.tokenizer,
    data_collator=UnslothVisionDataCollator(model, processor),
    args = SFTConfig(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        gradient_checkpointing = True,
        gradient_checkpointing_kwargs = {"use_reentrant": False},
        max_grad_norm = 0.3,
        warmup_ratio = 0.03,
        max_steps = 30,  # Keep steps low for demonstration, increase for full training
        learning_rate = 2e-4,
        logging_steps = 1,
        save_strategy = "steps",
        optim = "adamw_torch_fused",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs",
        report_to = "wandb",
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        dataset_num_proc = 2,
        max_seq_length = 2048,
    ),
)
# Start training
trainer_stats = trainer.train()
print("\nFine-tuning complete!")
The process takes about 10 minutes.
Monitoring training progress and results with wandb
You can monitor your training experiment with wandb using:
# --- Finish WandB run ---
if 'run' in locals() and run is not None:
    wandb.finish()
This code safely terminates your Weights & Biases experiment run, ensuring all logged metrics and training data are properly saved and uploaded to your WandB dashboard. The conditional check prevents errors if the WandB run wasn’t initialized or already finished, making your notebook more robust when run multiple times.
You can also review the run summary.
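If you prefer to check the headline numbers directly inside the notebook, the object returned by trainer.train() also carries summary metrics. A minimal sketch, assuming the trainer_stats variable from the training cell:

# Inspect summary metrics from the training run; trainer_stats was returned by trainer.train().
print(f"Training runtime (seconds): {trainer_stats.metrics.get('train_runtime')}")
print(f"Final training loss: {trainer_stats.metrics.get('train_loss')}")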
Evaluating the finetuned model
You can now evaluate the finetuned model.
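Here is a minimal sketch of that evaluation, reusing the helper functions and test example defined earlier; your exact evaluation cell may differ slightly:

# Switch the (now LoRA-adapted) model back to inference mode and re-run the same test case.
FastVisionModel.for_inference(model)

generated_latex_after_finetune = generate_latex_from_image(
    model, processor, image_test_case, instruction
)

print("Generated LaTeX (After Fine-tuning):")
print(generated_latex_after_finetune)
display(Math(generated_latex_after_finetune))

similarity_after = calculate_latex_similarity(original_latex_test_case, generated_latex_after_finetune)
print(f"Similarity Score (After Fine-tuning): {similarity_after:.4f}")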
The generated LaTeX code is 95% similar to the reference LaTeX from the test set, which is a satisfactory result.
You can also test the model on 20 cases in the test dataset.
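A hedged sketch of that broader check, looping over the first 20 test examples and averaging the similarity scores (the original notebook’s evaluation cell may differ):

# Evaluate the fine-tuned model on 20 test examples and report the average similarity.
num_eval_cases = 20
scores = []
for i in range(num_eval_cases):
    image = test_dataset[i]["image"]
    reference_latex = test_dataset[i]["text"]
    predicted_latex = generate_latex_from_image(model, processor, image, instruction)
    score = calculate_latex_similarity(reference_latex, predicted_latex)
    scores.append(score)
    print(f"Example {i}: similarity = {score:.4f}")

print(f"\nAverage similarity over {num_eval_cases} examples: {sum(scores) / len(scores):.4f}")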
Manually pushing the model to Hugging Face Hub
Now that the model is ready, push it to Hugging Face.
First, create a repository:
from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError
# Create API connection
api = HfApi()
# Repository details
## Please set your hugging face username and repo name
hf_username = "zattuAI"
repo_name = "gemma3-4b-latex-ocr-finetuned-merged"
HF_REPO_ID = f"{hf_username}/{repo_name}"
# Check if repo exists, if not create it
try:
    # Try to access the repo with full username/repo format
    api.repo_info(repo_id=HF_REPO_ID, repo_type="model")
    print("Repository already exists!")
except RepositoryNotFoundError:
    # Repo doesn't exist, so create it
    print("Creating new repository...")
    api.create_repo(repo_id=repo_name, repo_type="model")
    print("Repository created successfully!")
print(f"Using repository: {HF_REPO_ID}")
Now push the model to Hugging Face:
# --- Push the fine-tuned model to Hugging Face Hub (Merged Float16) ---
print("\n\n" + "="*50)
print(f"Pushing fine-tuned MERGED model to Hugging Face Hub: {HF_REPO_ID}")
print("="*50)
try:
    # Save the merged model (base model + LoRA weights) in float16 precision
    # This is ideal for deployment and general use by others.
    model.push_to_hub_merged(HF_REPO_ID, processor, token=hf_token)  # Pass the token explicitly
    print("Merged model and processor successfully pushed to Hugging Face Hub!")
except Exception as e:
    print(f"Error pushing merged model to Hugging Face Hub: {e}")
    print("Please ensure your HF_TOKEN environment variable is set correctly and has write access to the repository.")
    print(f"Also, verify that the repository '{HF_REPO_ID}' exists or that you have permission to create it.")
Automating Gemma 3 finetuning workflows with CircleCI
In this section of the tutorial, you’ll learn how to automate the entire process. You will use CircleCI to enable continuous model improvement as your dataset grows and evolves.
Manually finetuning and deploying LLMs, especially with private data, can be error-prone, time-consuming, and risky. A CI/CD pipeline offers significant advantages such as automation, reproducibility, and collaboration.
Setting up the secure CI/CD pipeline: Step-by-step
If you want to precisely control how kaggle kernels push behaves, create and commit a kernel-metadata.json file.
Configure CircleCI:
- Link GitHub to CircleCI: Add your GitHub repository as a new project in CircleCI.
- Create config.yml: In your repository, create .circleci/config.yml:

# CircleCI 2.1 configuration file
version: 2.1

# Define a Docker executor with a basic Python image
executors:
  python_executor:
    docker:
      - image: cimg/python:3.9.16 # A standard Python image is sufficient

jobs:
  trigger_kaggle_finetune:
    executor: python_executor
    steps:
      - checkout # Checkout the code from your repository
      - run:
          name: Install Kaggle CLI
          command: |
            pip install kaggle
      - run:
          name: Configure Kaggle API Credentials
          # These commands create the ~/.kaggle directory and kaggle.json file
          # using environment variables set in CircleCI.
          # KAGGLE_USERNAME and KAGGLE_KEY must be set in your CircleCI project settings.
          command: |
            mkdir -p ~/.kaggle
            echo '{"username":"'${KAGGLE_USERNAME}'","key":"'${KAGGLE_KEY}'"}' > ~/.kaggle/kaggle.json
            chmod 600 ~/.kaggle/kaggle.json # Set appropriate permissions
      - run:
          name: Push Notebook to Kaggle and Trigger Run
          # This command pushes your notebook to Kaggle.
          # Kaggle will then automatically run the kernel based on its metadata.
          # The -p . means push the kernel in the current directory.
          # Ensure your kernel-metadata.json is correctly configured.
          command: |
            kaggle kernels push -p .

workflows:
  main_branch_trigger:
    jobs:
      - trigger_kaggle_finetune:
          filters:
            branches:
              only: main # This job will only run when changes are pushed to the 'main' branch

Set up secrets in CircleCI
Go to Project Settings and click Environment Variables. Add:
- KAGGLE_USERNAME
- KAGGLE_KEY
Handle private data securely
This process is significantly streamlined when using private Kaggle datasets.
Data storage:
- Your private training data is securely stored as a private Kaggle Dataset.
- Only Kaggle users you grant permission to (or your own account) can access it.
Data access in CI/CD and Kaggle Notebook:
- The Kaggle finetuning notebook is linked to this private dataset through Kaggle’s interface (or using kernel-metadata.json).
- The Kaggle account associated with the KAGGLE_USERNAME and KAGGLE_KEY (used by CircleCI) must have owner/collaborator access to the private dataset and the kernel.
- When CircleCI triggers the Kaggle kernel execution via the API, Kaggle’s internal systems manage access control. The notebook script reads data from the standard input path (e.g., ../input/your-private-dataset-slug/) as if it were any other linked dataset (see the sketch after this list).
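As a quick illustration of that last point, here is a sketch of how the notebook might read a private CSV file from the Kaggle input path; the dataset slug and file name below are hypothetical placeholders for your own private dataset:

# Sketch: reading private training data from the Kaggle input path inside the notebook.
import pandas as pd

private_data_path = "../input/your-private-dataset-slug/train.csv"  # hypothetical slug and file name
private_df = pd.read_csv(private_data_path)
print(private_df.head())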
Note: Your raw private training data does not flow through CircleCI’s environment. CircleCI only orchestrates the run and handles the resulting model artifacts.
A kernel-metadata.json file is recommended for kaggle kernels push. Because you will use kaggle kernels push -p, create a kernel-metadata.json file in the same directory where you store your notebook. This file explicitly tells Kaggle how to set up the kernel:
// kaggle_scripts_dir/kernel-metadata.json
{
  "id": "your-kaggle-username/finetuning-gemma3-on-a-latex-ocr-dataset",
  "title": "finetuning-gemma3-on-a-latex-ocr-dataset",
  "code_file": "finetuning-gemma3-on-a-latex-ocr-dataset.ipynb",
  "language": "python",
  "kernel_type": "notebook",
  "is_private": true,
  "enable_gpu": true,
  "enable_internet": true,
  "dataset_sources": [
    "unsloth/LaTeX_OCR"
  ],
  "competition_sources": [],
  "model_sources": [],
  "kernel_sources": []
}
Make sure the id (kernel slug) is correctly formatted (lowercase, hyphens for spaces). The KAGGLE_USERNAME in CircleCI must have rights to both this kernel (if it exists) and to the specified dataset_sources.
Automating Kaggle Notebook execution using the CircleCI API
The config.yml runs the kaggle kernels push -p . command from within kaggle_scripts_dir/. This directory should contain your notebook (.ipynb or .py) and the kernel-metadata.json file. The pipeline is triggered when changes are pushed to the main branch.
You can go back to Kaggle to confirm that your notebook is running. You can even review the logs as the process runs.
Conclusion
Building a secure CI/CD pipeline for finetuning LLMs on private Kaggle Datasets, orchestrated by CircleCI and deploying to Hugging Face Hub, establishes a robust, automated, and reproducible MLOps workflow. This approach simplifies secure data handling for training by keeping sensitive data within Kaggle’s ecosystem, while leveraging CircleCI for secure automation of the overall process. You can find the complete source code used in this tutorial on GitHub.
By automating these steps, your team can enhance productivity, maintain high security standards, and ensure consistency in their LLM development lifecycle. We encourage you to adapt this framework, explore advanced CircleCI features, and integrate further MLOps best practices to build powerful and secure custom AI solutions.