Tutorials · Sep 29, 2025 · 18 min read

Deploying a multimodal RAG application with Gemma 3 and CircleCI on GKE

Armstrong Asenavi

Full Stack Engineer

Retrieval-Augmented Generation (RAG) has transformed how applications interact with Large Language Models (LLMs). RAG grounds LLM responses in external knowledge, improving accuracy and reducing hallucinations. But traditional RAG systems have a significant limitation: they only process text.

Multimodal RAG addresses this limitation by processing and understanding multiple data types (text, images, and potentially audio). Instead of relying solely on text, it integrates information from various formats—much like humans combine inputs from different senses to form a comprehensive understanding.

In this tutorial, you’ll build a multimodal RAG application powered by Google’s Gemma 3 model served via Ollama. Your application will:

  1. Process PDF documents containing both text and images
  2. Use Qdrant as a vector store
  3. Create an interactive UI with Streamlit
  4. Deploy to Google Kubernetes Engine (GKE) using CircleCI

By the end, you’ll have a scalable application that understands both visual and textual content within PDFs. This allows users to ask questions about document content regardless of format.

Multimodal RAG Streamlit app on GKE

Prerequisites

Before you begin, ensure you have the necessary tools, accounts, and foundational knowledge.

  • Install Python 3.9+ from the official Python website.
  • Install Docker Desktop for your operating system.
  • Set up Ollama by following the instructions on the Ollama website.
  • Sign up for a Google Cloud Account at Google Cloud. Be sure to enable billing.
  • Create a free account at CircleCI signup and connect it to your version control system.
  • Basic understanding of RAG systems and containerization concepts.

Understanding the architecture

Here’s a diagram showing how the different components of your multimodal RAG system work together.

Multimodal RAG architecture diagram

The system’s core components include:

PDF processing pipeline

  • Extraction: PyMuPDF (fitz) extracts both text and images from PDFs
  • Chunking: Breaks text into meaningful segments that fit within context limits
  • Embedding generation: You will use openai/clip-vit-base-patch32, a CLIP model that embeds text and images into a shared vector space

Vector database

Qdrant efficiently stores and indexes multimodal embeddings. It supports multiple named vectors (text_vector, image_vector) within a single data point and enables filtering on metadata such as page numbers.

LLM integration

Ollama serves Gemma 3 locally via an API. This decouples the application from resource-intensive model inference and allows model updates without rebuilding the application.

Application flow

  1. User uploads a PDF and asks a question via the Streamlit interface
  2. PyMuPDF extracts text and images from the document
  3. Embedding models convert text chunks and images into vector embeddings
  4. Qdrant stores these embeddings with metadata
  5. When a user submits a query, it’s converted to an embedding
  6. The system searches Qdrant for relevant text and images
  7. Gemma 3 generates a response using the query and retrieved context
  8. Streamlit displays the answer with relevant image snippets

Setting up your development environment

Create a new folder for your project (for example multimodal-rag-gemma). Clone the repository:

git clone https://github.com/CIRCLECI-GWP/multimodal-rag-gemma3.git
cd multimodal-rag-gemma3

Now create a Python virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Create the requirements.txt:

streamlit
qdrant-client # Ensure a recent version for features like named vectors
PyMuPDF # For PDF text/image extraction
pdf2image # Converts PDF pages to PIL images (requires Poppler)
pillow # For image manipulation (needed by PyMuPDF, Streamlit, embedding models)
torch # Backend for the CLIP embedding model
sentence-transformers # For text and image embeddings
transformers # For embedding models
accelerate # Often needed by transformers for efficient model loading
tqdm # Progress bars during extraction and embedding
ollama # Client library for Ollama API
python-dotenv # For managing environment variables

Install dependencies:

pip install -r requirements.txt

Installing and running Ollama

To run Gemma 3 with Ollama, first install Ollama from ollama.com.

After installation, run ollama pull gemma3 to download a model. This command downloads the gemma3:4b model by default. The Gemma 3 family includes 4B, 12B, and 27B multimodal versions, while the smallest 1B model is text-only.

You can test the model by running:

ollama run gemma3:latest "List three benefits of Kubernetes"

Ollama running gemma3:4b example

You can also try a multimodal query where you ask gemma3:4b to describe an image.

Example multimodal query

US stock market chart

The model picks up the trend correctly, describing how the stock market behaves.

Gemma 3 also does well on OCR tasks.

Gemma3 multimodal query example
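If you prefer to script these checks, you can send the same kind of multimodal prompt from Python using the ollama client library from requirements.txt. This is a minimal sketch; chart.png stands in for any local image you want the model to describe.

# Minimal sketch: a multimodal prompt through the ollama Python client.
# "chart.png" is a placeholder for any local image file.
import ollama

response = ollama.chat(
    model="gemma3:4b",
    messages=[{
        "role": "user",
        "content": "Describe the trend shown in this chart.",
        "images": ["chart.png"],  # local path; the client encodes it for you
    }],
)
print(response["message"]["content"])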

Now that you have Ollama up and running, you will create a RAG engine for your app.

Creating the RAG engine

The core of your application is rag_engine.py, which processes PDFs, generates embeddings, and interfaces with Ollama. The following subsections walk through its key code snippets.

Configuring the QueryEngine

Start by setting up the QueryEngine class that handles the entire RAG pipeline:

def __init__(self, uploaded_file, session_id, progress_callback=None, poppler_path=None):
    self.uploaded_file = uploaded_file
    self.processed_data = []
    self.embedded_data_clip = []
    self.clip_model = None
    self.clip_processor = None
    self.qdrant_client = None
    self.collection_name = f"clip_multimodal_pdf_rag_{session_id}"
    self.embedding_dimension_clip = None
    self.ollama_model_name = 'gemma3:latest'
    self.ollama_api_base = "http://localhost:11434"
    self.DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    self.progress_callback = progress_callback or (lambda x: None)
    self.POPPLER_PATH = poppler_path

    # Cache directory for HuggingFace models
    self.CLIP_MODEL_NAME = "openai/clip-vit-base-patch32"
    self.CACHE_DIR = "./hf_cache"

    # Initialize the RAG pipeline
    self._process_pdf()
    self._load_embedding_model()
    self._generate_embeddings()
    self._setup_qdrant()
    self._ingest_data()

This setup creates the pipeline state and initializes key components: model selection (for embedding and generation), device usage, and the database configuration (Qdrant).
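Before diving into the individual steps, here is a hypothetical sketch of how the engine is driven end to end (st_file stands in for the Streamlit UploadedFile object the app passes in; the method names match the class above):

engine = QueryEngine(
    uploaded_file=st_file,          # a Streamlit UploadedFile
    session_id="demo-session",
    progress_callback=print,        # print progress messages to the console
    poppler_path=None,              # or the path to your Poppler binaries
)

# query() streams the answer chunk by chunk (covered later in this section)
for chunk in engine.query("What does the revenue table show?"):
    print(chunk, end="", flush=True)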

Processing PDF documents

This method processes the PDF and extracts both text and images:

def _process_pdf(self):
    self.progress_callback("Processing PDF...")
    try:
        with tempfile.TemporaryDirectory() as temp_dir:
            file_path = os.path.join(temp_dir, self.uploaded_file.name)
            with open(file_path, "wb") as f:
                f.write(self.uploaded_file.getvalue())

            # Convert PDF pages to images
            pil_images = convert_from_path(file_path, poppler_path=self.POPPLER_PATH)
            # Open PDF to extract text
            doc = fitz.open(file_path)

            for i, page_image in enumerate(tqdm(pil_images, desc="Extracting pages")):
                # Extract text with PyMuPDF
                page_text = doc[i].get_text("text") if i < len(doc) else ""
                page_text = ' '.join(page_text.split())

                # Save image to buffer for later use
                buffered = io.BytesIO()
                if page_image.mode == 'RGBA':
                    page_image = page_image.convert('RGB')
                page_image.save(buffered, format="JPEG")
                img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')

                # Store both text and image data
                self.processed_data.append({
                    "id": str(uuid.uuid4()),
                    "page_num": i + 1,
                    "text": page_text,
                    "image_pil": page_image,
                    "image_b64": img_base64
                })

The function writes the uploaded PDF to a temporary file. It uses pdf2image to convert the PDF pages to images and fitz (PyMuPDF) to extract the text of each page. It then stores the text and image for each page under a unique ID.

Loading the CLIP embedding model

Next, load the CLIP model used to generate embeddings:

def _load_embedding_model(self):
    self.progress_callback(f"Loading CLIP model: {self.CLIP_MODEL_NAME}")
    try:
        if not os.path.exists(self.CACHE_DIR):
            os.makedirs(self.CACHE_DIR, exist_ok=True)

        # Load CLIP model
        self.clip_model = CLIPModel.from_pretrained(
            self.CLIP_MODEL_NAME,
            cache_dir=self.CACHE_DIR
        ).to(self.DEVICE).eval()

        # Load CLIP processor
        self.clip_processor = AutoProcessor.from_pretrained(
            self.CLIP_MODEL_NAME,
            cache_dir=self.CACHE_DIR
        )

CLIP is a powerful multimodal embedding model. It generates embeddings for both images and text in the same vector space. This simplifies finding semantic similarity between text queries and document images.
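To see this shared embedding space in action, here is a small standalone sketch (page.jpg is an assumed local image) that scores a text query against an image with cosine similarity:

import torch
from PIL import Image
from transformers import AutoProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("page.jpg").convert("RGB")
with torch.no_grad():
    # Image and text land in the same vector space
    img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    txt_emb = model.get_text_features(
        **processor(text=["quarterly revenue chart"], return_tensors="pt", padding=True)
    )

similarity = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
print(f"Cosine similarity between query and page: {similarity:.3f}")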

Generating CLIP embeddings for images

Now you can generate embeddings for all page images:

def _generate_embeddings(self):
    self.progress_callback(f"Generating CLIP IMAGE embeddings for {len(self.processed_data)} items...")
    for chunk in tqdm(self.processed_data, desc="Generating Image Embeddings"):
        try:
            image_pil = chunk['image_pil']
            if image_pil.mode != 'RGB':
                image_pil = image_pil.convert('RGB')

            # Process the image with CLIP
            inputs = self.clip_processor(images=image_pil, return_tensors="pt", padding=True).to(self.DEVICE)
            with torch.no_grad():
                image_features = self.clip_model.get_image_features(**inputs)
            # Convert to numpy and store
            image_embedding_vector = image_features[0].cpu().float().numpy().tolist()

            if image_embedding_vector:
                chunk['embedding'] = image_embedding_vector
                self.embedded_data_clip.append(chunk)

This code runs each page image through the CLIP processor and passes it to the CLIP model to generate embeddings. The resulting tensor is converted to a plain Python list for storage.

Setting up Qdrant vector database

Set up Qdrant to store your embeddings:

def _setup_qdrant(self):
    try:
        try:
            self.qdrant_client = QdrantClient(host="qdrant", port=6333, timeout=5)
            self.qdrant_client.get_collections()
        except Exception:
            self.qdrant_client = QdrantClient(host="localhost", port=6333, timeout=5)
            self.qdrant_client.get_collections()

        # Recreate the collection for this session if it already exists
        collection_names = [c.name for c in self.qdrant_client.get_collections().collections]
        if self.collection_name in collection_names:
            self.qdrant_client.delete_collection(collection_name=self.collection_name)

        self.qdrant_client.create_collection(
            collection_name=self.collection_name,
            vectors_config=models.VectorParams(
                size=self.embedding_dimension_clip,
                distance=models.Distance.COSINE
            )
        )

Qdrant is a vector database that enables efficient similarity search over vectors. The code creates a new collection for each session with the appropriate vector dimensions.

Ingesting data into Qdrant

Add your embeddings to Qdrant:

def _ingest_data(self):
    BATCH_SIZE = 64

    # Upsert the embedded pages in batches
    for start in range(0, len(self.embedded_data_clip), BATCH_SIZE):
        batch = self.embedded_data_clip[start:start + BATCH_SIZE]
        points_to_upsert = []

        for item in batch:
            if 'embedding' in item and isinstance(item['embedding'], list):
                points_to_upsert.append(
                    models.PointStruct(
                        id=item['id'],
                        vector=item['embedding'],
                        payload={
                            "text": item['text'],
                            "page_num": item['page_num'],
                            "image_b64": item['image_b64']
                        }
                    )
                )

        if points_to_upsert:
            self.qdrant_client.upsert(
                collection_name=self.collection_name,
                points=points_to_upsert,
                wait=True
            )

For each page, you store the embedding vector and a payload. The payload contains extracted text, page number, and the base64-encoded image.
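As a quick sanity check (a sketch, not part of the repository code), you can ask Qdrant how many points the session's collection now holds:

count = self.qdrant_client.count(
    collection_name=self.collection_name,
    exact=True,
)
self.progress_callback(f"Ingested {count.count} pages into {self.collection_name}")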

Processing user queries

When a user submits a question, the code converts it to a CLIP text embedding:

def _get_clip_text_embedding(self, text_query):
    try:
        inputs = self.clip_processor(text=[text_query], return_tensors="pt", padding=True).to(self.DEVICE)
        with torch.no_grad():
            text_features = self.clip_model.get_text_features(**inputs)
        text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)
        return text_features[0].cpu().float().numpy().tolist()
    except Exception as e:
        self.progress_callback(f"Error generating text query embedding: {e}")
        return None

This function:

  • Processes the text query using the CLIP processor
  • Passes it through the text encoder to get the text embedding
  • Normalizes the embedding vector
  • Returns it as a list for Qdrant

Retrieving relevant context

Next, use the query embedding to retrieve the most relevant pages from Qdrant:

def _retrieve_context(self, query_embedding, top_k=2):
    try:
        search_result = self.qdrant_client.search(
            collection_name=self.collection_name,
            query_vector=query_embedding,
            limit=top_k,
            with_payload=True
        )
        return search_result
    except Exception as e:
        self.progress_callback(f"Error during Qdrant search: {e}")
        return []

This performs a vector similarity search to find the top_k pages most relevant to the query.
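When debugging relevance, it can help to print what came back before wiring up generation. A small sketch:

results = self._retrieve_context(query_embedding, top_k=2)
for hit in results:
    # Each hit is a scored point carrying the stored payload
    print(f"Page {hit.payload.get('page_num')} | score {hit.score:.4f}")
    print(hit.payload.get("text", "")[:120], "...")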

Generating responses with Gemma3

Finally, use Gemma 3 to generate a response:

def _prepare_and_generate(self, query, retrieved_results):
    prompt = f"I have a question about a document: {query}\n\nHere are relevant parts of the document to help you answer:\n\n"
    base64_images = []

    # Format context from retrieved results
    for i, result in enumerate(retrieved_results):
        context_payload = result.payload
        context_text_content = context_payload.get('text', '')
        context_page = context_payload.get('page_num', 'N/A')
        relevance_score = result.score

        prompt += f"--- Document Page {context_page} (Relevance Score: {relevance_score:.4f}) ---\n"
        if context_text_content:
            prompt += f"Text: {context_text_content}\n\n"

        if 'image_b64' in context_payload and context_payload['image_b64']:
            base64_images.append(context_payload['image_b64'])

    # Send to Ollama for generation
    generate_payload = {
        "model": self.ollama_model_name,
        "prompt": prompt,
        "stream": True
    }

    if base64_images:
        generate_payload["images"] = base64_images

This code:

  • Creates a prompt with the user’s question
  • Adds context from the most relevant pages (text and images)
  • Sends context and query to Gemma 3 via Ollama
  • Streams back the generated response (see the streaming sketch below)
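The snippet above stops at building generate_payload. Here is one way the streaming call might be finished, using the ollama client library already listed in requirements.txt and pointed at self.ollama_api_base (a sketch; the repository may call the HTTP endpoint directly instead):

import ollama

def _stream_response(self, generate_payload):
    client = ollama.Client(host=self.ollama_api_base)
    stream = client.generate(
        model=generate_payload["model"],
        prompt=generate_payload["prompt"],
        images=generate_payload.get("images"),
        stream=True,
    )
    for part in stream:
        # Each part carries an incremental piece of the answer
        yield part["response"]

Yielding chunks like this is what lets Streamlit render the answer progressively in the chat interface.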

Putting it all together

The full query process ties all these components together:

def query(self, query_text):
    # Clear any temporary files
    if os.path.exists("temp_images"):
        try:
            for file in os.listdir("temp_images"):
                file_path = os.path.join("temp_images", file)
                if os.path.isfile(file_path):
                    os.unlink(file_path)
        except Exception as e:
            print(f"Error clearing temp_images: {e}")

    os.makedirs("temp_images", exist_ok=True)

    # Generate query embedding and retrieve context
    query_embedding = self._get_clip_text_embedding(query_text)
    retrieved_results = self._retrieve_context(query_embedding)

    # Generate response
    return self._prepare_and_generate(query_text, retrieved_results)

With the rag_engine.py complete, you can tie the entire system together using a Streamlit app.

Building the UI with Streamlit

In this part, you will build a user-friendly interface with Streamlit to upload PDFs and interact with your RAG system. The app.py file creates a chat interface where users can ask questions about their uploaded documents.

Set up the Streamlit application

In the app.py file, begin by importing dependencies and setting key parameters for your Streamlit app. You will also import the custom QueryEngine class from the previous section. Ensure that POPPLER_PATH points to the Poppler binaries, which are needed for PDF-to-image conversion.
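For orientation, here is a sketch of what the top of app.py might look like; names such as POPPLER_PATH, session_id, and reset_chat follow the snippets in this tutorial, and the exact values are assumptions:

import os
import uuid
import base64
import streamlit as st
from rag_engine import QueryEngine

st.set_page_config(page_title="Multimodal RAG with Gemma 3", layout="wide")

# Path to the Poppler binaries used for PDF-to-image conversion;
# None lets pdf2image look for Poppler on the system PATH.
POPPLER_PATH = os.getenv("POPPLER_PATH") or None

# One session id, chat history, and document cache per browser session
if "session_id" not in st.session_state:
    st.session_state.session_id = str(uuid.uuid4())
if "messages" not in st.session_state:
    st.session_state.messages = []
if "file_cache" not in st.session_state:
    st.session_state.file_cache = {}
session_id = st.session_state.session_id

def reset_chat():
    """Clear the conversation and any loaded documents."""
    st.session_state.messages = []
    st.session_state.file_cache = {}
    st.session_state.pop("query_engine", None)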

The app has a simple layout with a title, reset button, sidebar for document upload, and a chat interface:

# App Header
col1, col2 = st.columns([6, 1])

with col1:
    st.markdown("""
    # Multimodal RAG powered by Gemma 3 and Ollama
    """)

with col2:
    st.button("Clear ↺", on_click=reset_chat)

The reset_chat() function (sketched above) clears the conversation and any loaded documents. The sidebar contains a file uploader for PDF documents:

# Sidebar
with st.sidebar:
    st.header("Add your documents!")
    uploaded_file = st.file_uploader("Choose your `.pdf` file", type="pdf")

When a document is uploaded, check whether it has already been processed:

if uploaded_file:
    # Check if we need to process a new file or if it's already loaded
    file_key = uploaded_file.name + str(hash(uploaded_file.getvalue()))

    if file_key not in st.session_state.file_cache:
        # Process the new file
        with st.spinner("Processing your document. This may take a moment..."):
            try:
                st.session_state.query_engine = QueryEngine(
                    uploaded_file, 
                    session_id,
                    progress_callback=streamlit_progress_callback,
                    poppler_path=POPPLER_PATH
                )
                st.session_state.file_cache[file_key] = True
                st.success("Document processed and ready to chat!")
            except Exception as e:
                st.error(f"Error processing document: {str(e)}")

The app creates a unique key for each document based on its name and content hash. This prevents reprocessing the same document if it’s uploaded again.

Once a document is uploaded, you can display a preview of it:

def display_pdf(file):
    st.markdown("### PDF Preview")
    base64_pdf = base64.b64encode(file.getvalue()).decode("utf-8")
    pdf_display = f"""<iframe src="data:application/pdf;base64,{base64_pdf}" width="100%" height="400" type="application/pdf"></iframe>"""
    st.markdown(pdf_display, unsafe_allow_html=True)

The function embeds the PDF directly in the Streamlit app using a base64-encoded iframe.

Chat interface and message handling

The main chat interface displays previous messages and accepts new queries:

# Display chat messages from history on app rerun
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Accept user input
if prompt := st.chat_input("What's up?"):
    # Add user message to chat history
    st.session_state.messages.append({"role": "user", "content": prompt})
    # Display user message
    with st.chat_message("user"):
        st.markdown(prompt)

When a user submits a query, the RAG engine processes it and streams the response:

# Display assistant response
with st.chat_message("assistant"):
    message_placeholder = st.empty()
    full_response = ""

    if "query_engine" in st.session_state and st.session_state.query_engine:
        try:
            for chunk in st.session_state.query_engine.query(prompt):
                full_response += chunk
                message_placeholder.markdown(full_response + "▌")
            message_placeholder.markdown(full_response)
        except Exception as e:
            error_message = f"Error processing query: {str(e)}"
            message_placeholder.markdown(error_message)
            full_response = error_message
    else:
        full_response = "Please upload a PDF document in the sidebar to begin."
        message_placeholder.markdown(full_response)

The key aspects here are:

  • st.chat_message() creates the message bubbles
  • message_placeholder.markdown() streams the response, updating it as new chunks arrive
  • The blinking cursor (▌) indicates the system is still generating

Once generation is complete, you store the full response in the message history.

Progress feedback during processing

The callback function shows users what is happening:

def streamlit_progress_callback(msg):
    """Callback function to display progress in Streamlit"""
    st.write(msg)

This callback is passed to the QueryEngine to display status messages during PDF processing, embedding generation, and Qdrant setup.

Running your Streamlit application

Now it is time to run your application. But, before launching Streamlit, you should start Qdrant and Ollama.

First, make sure Docker Desktop is running, then launch the Qdrant container:

docker run -p 6333:6333 -p 6334:6334 -v qdrant_storage:/qdrant/storage qdrant/qdrant

This command:

  • Runs the Qdrant container in Docker
  • Maps the required ports to your local machine (-p 6333:6333 -p 6334:6334)
  • Creates a persistent volume for your data (-v qdrant_storage:/qdrant/storage)

You can verify that Qdrant is running by visiting http://localhost:6333/dashboard in your browser.

Qdrant running at local host
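You can also confirm the connection from Python with the same client the RAG engine uses (a quick sketch):

from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333, timeout=5)
print(client.get_collections())  # an empty collection list on a fresh instance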

Next, ensure Ollama is running to serve the Gemma 3 model:

ollama serve

By default, Ollama will be serving at http://localhost:11434.

Ollama-running

You can verify it by running:

ollama list

Finally, run your Streamlit application:

streamlit run app.py

Your browser should automatically open to http://localhost:8501, displaying the multimodal RAG interface you’ve built. Upload a PDF and ask your questions.

Running Streamlit app

The model correctly reads the images of financial tables and provides the correct answer.

Setting up Google Kubernetes Engine

In this section of the tutorial, you will learn to deploy your multimodal RAG application to Google Kubernetes Engine (GKE). The strategy is to create a GKE cluster with mixed node pools:

  • A CPU node pool for the Streamlit app
  • A GPU node pool for Ollama
  • Qdrant on a CPU node pool, with persistent storage

Creating a GCP project and required services

You will need to have a Google Cloud project. If you don’t already have a project, create one:

  1. Go to the Google Cloud Console and create a new project.
  2. Enable the following APIs (if not already enabled):
    • Kubernetes Engine API
    • Container Registry API
    • Cloud Build API

Creating a deployment service account and permissions

For this exercise, download and install the Google Cloud CLI by following the installation guide.

After downloading and installing the Google Cloud SDK, authenticate with GCP. If you are working on Windows, you may want to run the SDK from Git Bash, which lets you execute Unix-style commands without a fuss:

gcloud auth login

Install kubectl as follows:

gcloud components install kubectl

You can also download kubectl, the Kubernetes command-line tool, by following the official documentation, or enable it through the Kubernetes feature in Docker Desktop.

To create your GKE cluster, start by setting an environment variable PROJECT_ID:

export PROJECT_ID="enter_your-actual-project-id"

Set your project ID:

gcloud config set project $PROJECT_ID

Create a service account for deployment:

gcloud iam service-accounts create gke-deploy --display-name="GKE Deployment Service Account"

Assign roles:

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:gke-deploy@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/container.developer"

Enable permissions for artifacts registry writer and service account user:

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:gke-deploy@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/artifactregistry.writer" && \
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:gke-deploy@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/iam.serviceAccountUser"

Create and download a base64 key for this service account:

# Create and download key
gcloud iam service-accounts keys create gke-deploy.json \
    --iam-account=gke-deploy@$PROJECT_ID.iam.gserviceaccount.com

Convert the key to base64 encoding for safe transport and storage of your key:

base64 gke-deploy.json > gke-deploy.base64

If you ran these commands in Cloud Shell, download the key file to your local machine:

  • Navigate to the GCP Console
  • Click the Terminal/Cloud Shell icon at the top right (it looks like a >_ symbol); this opens a terminal inside your browser, which is Cloud Shell
  • Set your project with gcloud config set project $PROJECT_ID and run cloudshell download gke-deploy.base64

When you run this command:

  • Your browser will pop up a download for the file.
  • Save it wherever you want on your local computer. You will need it for CircleCI configuration.

Create and configure your GKE cluster

For this exercise, you will follow Google’s best practice by creating separate node pools for your containers.

Google recommends creating separate node pools in GKE for efficiency and cost optimization. This gives you dedicated hardware that you can scale independently, and you save money by avoiding running general-purpose workloads on expensive GPUs.

Therefore, you will first set up a cluster with a default CPU pool, then add a high-CPU pool and a dedicated GPU pool. Your Streamlit app and vector database (Qdrant) will run on CPU nodes, while Ollama will run on the GPU node.

Run the following command to create your cluster:

# Create the cluster first with a default CPU pool
gcloud container clusters create multimodal-rag-cluster \
    --zone us-central1-a \
    --release-channel=regular \
    --machine-type=e2-standard-2 \
    --num-nodes=1 \
    --enable-ip-alias \
    --scopes=https://www.googleapis.com/auth/cloud-platform

Add a high-CPU node pool for Qdrant:

gcloud container node-pools create cpu-high \
  --cluster multimodal-rag-cluster \
  --zone us-central1-a \
  --machine-type=e2-standard-4 \
  --num-nodes=1 \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=3 \
  --scopes=https://www.googleapis.com/auth/cloud-platform

Allow time for GCP to create the node pool. Then, add a dedicated GPU node pool:

gcloud container node-pools create gpu-pool \
    --cluster multimodal-rag-cluster \
    --zone us-central1-a \
    --machine-type=g2-standard-8 \
    --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \
    --num-nodes=1 \
    --enable-autoscaling --min-nodes=0 --max-nodes=3 \
    --node-locations=us-central1-a \
    --scopes=https://www.googleapis.com/auth/cloud-platform 

Allow some time for GCP to create the node pools and add the appropriate nodes. If you run into problems creating a GPU node pool, check that you have sufficient quota; if not, you may need to request a GPU quota increase.

After creating your cluster, you will configure kubectl to use it:

gcloud container clusters get-credentials multimodal-rag-cluster --zone us-central1-a

Verify that kubectl is now pointing to your GKE cluster:

kubectl config current-context

The result will be similar to: gke_YOUR-PROJECT-ID_us-central1-a_multimodal-rag-cluster.

Containerizing the multimodal RAG application with Docker

At this point, you will containerize your application for deployment. First, create a Dockerfile in the root directory of your app:

FROM python:3.12-slim

WORKDIR /app

RUN apt-get update && apt-get install -y \
    build-essential \
    poppler-utils \
    libpoppler-dev \
    libpoppler-cpp-dev \
    libgl1-mesa-glx \
    libglib2.0-0 \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8501

CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]

Now, build and test your Docker image locally:

docker build -t us-central1-docker.pkg.dev/$PROJECT_ID/images/streamlit-rag:latest .

# Run the container locally
docker run -p 8501:8501 us-central1-docker.pkg.dev/$PROJECT_ID/images/streamlit-rag:latest

You should now be able to access your application at http://localhost:8501.

After confirming that your container runs properly, you will need to push it to Google Artifact Registry.

Before you push the image to Artifact Registry, modify your code to use Kubernetes service discovery. In your QueryEngine class, change the Ollama API base URL from localhost to the name of the Ollama Kubernetes service:

# Change Ollama API base URL
#self.ollama_api_base = "http://localhost:11434"
self.ollama_api_base = "http://ollama:11434"

After this change, you will need to rebuild the image before pushing it.
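A more flexible alternative (not from the repository) is to read the base URL from an environment variable inside __init__, so the same image works locally and on Kubernetes without code edits:

import os

# Falls back to localhost for local development; on Kubernetes you set
# OLLAMA_API_BASE to the service URL (for example http://ollama:11434).
self.ollama_api_base = os.getenv("OLLAMA_API_BASE", "http://localhost:11434")

If you take this route, add OLLAMA_API_BASE to the env section of the Streamlit deployment alongside POPPLER_PATH.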

Next, create a repository in Google Artifact Registry. First, configure Docker:

gcloud auth configure-docker us-central1-docker.pkg.dev # Replace with your Artifact Registry region

Then, create an Artifact Registry repository (if one does not exist) in the Console or using the gcloud command:

if ! gcloud artifacts repositories describe images --location=us-central1 --project=$PROJECT_ID > /dev/null 2>&1; then
    gcloud artifacts repositories create images \
    --repository-format=docker \
    --location=us-central1 \
    --project=$PROJECT_ID
fi

This creates a repository named images.

Now push the image to Google Artifact Registry:

docker push us-central1-docker.pkg.dev/$PROJECT_ID/images/streamlit-rag:latest

Note: In the next section of this tutorial, you will create .yaml files for deployment and service resources.

Creating deployment and service resources

Up to this point, you have successfully set up a Google Cloud environment, and created a service account and a GKE cluster with a dedicated GPU node pool.

You have also containerized your app. Now you are ready to deploy it to GKE.

Your application consists of three main services that will work together in Kubernetes:

  • Ollama Deployment: Runs on GPU nodes (depending on resource availability)
  • Qdrant Deployment: Runs on CPU nodes and requires persistent storage (PVC)
  • Streamlit Deployment: Runs on CPU nodes and handles user interactions

This approach follows Kubernetes best practices of separating concerns while enabling seamless communication between application components.

You will start by creating a directory named k8s-manifests.

Prepare storage configuration

Create a file inside the k8s-manifests named qdrant-pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qdrant-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

In this code you define a PersistentVolumeClaim named qdrant-pvc requesting 10Gi of storage. You can request more storage as necessary.

Create the qdrant-deployment.yaml file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qdrant
  labels:
    app: qdrant
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-nodepool
                operator: In
                values:
                - cpu-high
      containers:
      - name: qdrant
        image: qdrant/qdrant:latest
        ports:
        - containerPort: 6333
        - containerPort: 6334
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "2"
            memory: "4Gi"
        volumeMounts:
        - name: qdrant-storage
          mountPath: /qdrant/storage
      volumes:
      - name: qdrant-storage
        persistentVolumeClaim:
          claimName: qdrant-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: qdrant
spec:
  selector:
    app: qdrant
  ports:
  - port: 6333
    targetPort: 6333
    name: http
  - port: 6334
    targetPort: 6334
    name: grpc

The resources section of the deployment sets two kinds of constraints:

  • requests: the resources that Kubernetes guarantees to the container. The scheduler uses these values to decide which node to place the pod on.
  • limits: the maximum resources the container can use. If the container tries to exceed these limits, it might be throttled or, for memory, terminated.

Create the streamlit-deployment.yaml file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: streamlit-app
  labels:
    app: streamlit-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: streamlit-app
  template:
    metadata:
      labels:
        app: streamlit-app
    spec:
      containers:
      - name: streamlit-app
        image: us-central1-docker.pkg.dev/$PROJECT_ID/images/streamlit-rag:latest
        ports:
        - containerPort: 8501
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "2Gi"
        env:
        - name: POPPLER_PATH
          value: "/usr/bin"
---
apiVersion: v1
kind: Service
metadata:
  name: streamlit-app
spec:
  selector:
    app: streamlit-app
  ports:
  - port: 80
    targetPort: 8501
  type: LoadBalancer

This configuration requests 0.5 CPU cores (500 millicores) and 1GB of memory, sets limits at 1 CPU core and 2GB of memory, and sets the POPPLER_PATH environment variable. Note that kubectl does not expand environment variables in manifests, so replace $PROJECT_ID in the image reference with your actual project ID before deploying manually (the CircleCI pipeline later handles this substitution through an IMAGE_PLACEHOLDER token).

These values are appropriate for a Streamlit app with moderate traffic. The app itself is far less resource-intensive than the model-serving component (Ollama).

Create ollama-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  labels:
    app: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: gpu-pool
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          limits:
            nvidia.com/gpu: 1
        command: ["/bin/sh", "-c"]
        args:
          - |
            ollama serve &
            sleep 10
            ollama pull gemma3:latest
            wait
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434

This file defines a deployment named ollama with one replica. It uses a nodeSelector to target the gpu-pool node pool, pulls the official Ollama image, and requests one GPU through the nvidia.com/gpu resource limit.

At startup, the service runs a script that:

  • Starts the Ollama server
  • Waits 10 seconds for it to initialize, and
  • Downloads the gemma3 model

The configuration ensures that Ollama has sufficient resources to run efficiently on your GPU node.

Now you are set to deploy everything using Kubernetes.

Deploy to Google Kubernetes Engine (GKE)

Once you have pushed your Streamlit app container to Google Artifact Registry, you need to apply your Kubernetes resources to GKE.

Configure kubectl to communicate with your GKE cluster:

gcloud container clusters get-credentials multimodal-rag-cluster --zone us-central1-a

Verify that kubectl is now pointing to your GKE cluster:

kubectl config current-context

The output should be something like: gke_YOUR-PROJECT-ID_us-central1-a_multimodal-rag-cluster

The next step is to create the PersistentVolumeClaim:

kubectl apply -f qdrant-pvc.yaml

Then, deploy Qdrant, Ollama, and the Streamlit app:

kubectl apply -f qdrant-deployment.yaml
kubectl apply -f ollama-deployment.yaml
kubectl apply -f streamlit-deployment.yaml

Access your application

Get the external IP address of your Streamlit app:

kubectl get service streamlit-app

Output will include an EXTERNAL-IP you can use to access your app.

Automating deployment with CircleCI

You have seen that deploying applications to Google Kubernetes Engine (GKE) involves several manual steps. The process can be tedious and error-prone. This is where the CircleCI CI/CD platform can automate the entire workflow for you.

Here is the CircleCI configuration, saved as .circleci/config.yml in your repository, that automates the GKE deployment process:

version: 2.1

jobs:
  build-and-deploy:
    docker:
      - image: cimg/python:3.12.5
    steps:
      - checkout
      # Enable Docker support
      - setup_remote_docker:
          docker_layer_caching: true
      - run:
          name: Install Google Cloud SDK
          command: |
            curl https://sdk.cloud.google.com | bash > /dev/null 2>&1
            source $HOME/google-cloud-sdk/path.bash.inc
      # Install dependencies and run tests
      - restore_cache:
          keys:
            - v1-dependencies-{{ checksum "requirements.txt" }}
      - run:
          name: Install Dependencies
          command: |
            python -m venv venv
            . venv/bin/activate
            pip install --no-cache-dir -r requirements.txt
      - save_cache:
          paths:
            - ./venv
          key: v1-dependencies-{{ checksum "requirements.txt" }}
      # - run:
      #     name: Run Tests
      #     command: |
      #       . venv/bin/activate
      #       pytest tests/ -v
      # Create repo and build container in Artifacts Registry
      - run:
          name: Authenticate Google Cloud
          command: |
            export PATH=$HOME/google-cloud-sdk/bin:$PATH
            echo $GCP_KEY | base64 -d > ${HOME}/gcloud-service-key.json
            gcloud auth activate-service-account --key-file=${HOME}/gcloud-service-key.json
            gcloud config set project $PROJECT_ID
            gcloud auth configure-docker us-central1-docker.pkg.dev
      - run:
          name: Create Artifact Registry Repository
          command: |
            export PATH=$HOME/google-cloud-sdk/bin:$PATH
            if ! gcloud artifacts repositories describe images --location=us-central1 --project=$PROJECT_ID > /dev/null 2>&1; then
              gcloud artifacts repositories create images \
                --repository-format=docker \
                --location=us-central1 \
                --project=$PROJECT_ID
            fi
      - run:
          name: Build Docker Image
          command: |
            docker build -t us-central1-docker.pkg.dev/$PROJECT_ID/images/streamlit-rag:latest .
      - run:
          name: Docker Login
          command: |
            export PATH=$HOME/google-cloud-sdk/bin:$PATH
            docker login -u _json_key -p "$(cat ${HOME}/gcloud-service-key.json)" us-central1-docker.pkg.dev
      - run:
          name: Push Docker Image
          command: |
            export PATH=$HOME/google-cloud-sdk/bin:$PATH
            docker push us-central1-docker.pkg.dev/$PROJECT_ID/images/streamlit-rag:latest

      # Connect to GKE and deploy
      - run:
          name: Install kubectl
          command: |
            export PATH=$HOME/google-cloud-sdk/bin:$PATH
            gcloud components install kubectl
      - run:
          name: Connect to GKE Cluster
          command: |
            export PATH=$HOME/google-cloud-sdk/bin:$PATH
            gcloud container clusters get-credentials $GKE_CLUSTER_NAME --zone $GKE_ZONE --project $PROJECT_ID
      - run:
          name: Update Kubernetes Manifests
          command: |
            # Replace placeholders in deployment files with actual values
            sed -i "s|IMAGE_PLACEHOLDER|us-central1-docker.pkg.dev/$PROJECT_ID/images/streamlit-rag:latest|g" k8s-manifests/streamlit-deployment.yaml
      - run:
          name: Apply Kubernetes Manifests
          command: |
            export PATH=$HOME/google-cloud-sdk/bin:$PATH
            # Create PVCs first
            kubectl apply -f k8s-manifests/qdrant-pvc.yaml
            # Wait for PVC to be bound
            echo "Waiting for PVC to be bound..."
            sleep 10
            # Create deployments
            kubectl apply -f k8s-manifests/qdrant-deployment.yaml
            kubectl apply -f k8s-manifests/ollama-deployment.yaml
            kubectl apply -f k8s-manifests/streamlit-deployment.yaml
      - run:
          name: Verify Deployment
          command: |
            export PATH=$HOME/google-cloud-sdk/bin:$PATH
            echo "Waiting for deployments to be ready..."
            kubectl rollout status deployment/qdrant
            kubectl rollout status deployment/ollama
            kubectl rollout status deployment/streamlit-app
            echo "Deployment successful! Here are the services:"
            kubectl get services

workflows:
  build-deploy:
    jobs:
      - build-and-deploy:
          context:
            - gke_deploy

The pipeline explained step-by-step

Here is a breakdown of what this configuration does:

  • Environment setup: The pipeline uses a Python 3.12.5 Docker container as the working environment. CircleCI checks out your code from the repository and sets up Docker support with layer caching for faster future builds.
  • Installing tools and dependencies: The pipeline installs the Google Cloud SDK to interact with Google Cloud. The next step installs the dependencies from your requirements.txt file in a virtual environment.
  • Building and pushing your Docker image: The pipeline authenticates with Google Cloud using credentials stored in CircleCI and creates an Artifact Registry repository if it doesn’t exist. It then builds the Docker image and pushes it to Artifact Registry, where it is stored safely.
  • Deploying to GKE: The pipeline installs kubectl and connects to your specific GKE cluster. Then it updates your Kubernetes manifest files with the correct image reference.

It then applies these manifests to deploy several components:

  • A persistent volume claim (PVC) for Qdrant (vector database)
  • Qdrant deployment for vector search
  • Ollama deployment for AI model serving
  • Streamlit web application deployment

Finally, it verifies everything deployed correctly by checking the status.

  • Workflow definition: The workflow specifies that your build-and-deploy job should use the gke_deploy context.

For this pipeline to work, you will need to configure a CircleCI context named gke_deploy with these environment variables: GCP_KEY (the base64-encoded service account key), PROJECT_ID, GKE_CLUSTER_NAME, and GKE_ZONE.

CircleCI context variables

You can monitor your build process in CircleCI.

CircleCI build process-part-1

CircleCI build process-part-2

You can confirm that the build process was indeed successful by inspecting the deployment verification step.

Successful deployment to GKE

You can now review your app on GKE. You will use the external IP address of your running Streamlit app, which is exposed through a LoadBalancer service.

To get the IP, run:

kubectl get service streamlit-app

There will be an external IP in the EXTERNAL-IP column.

Checking external IP via get services

When you go to this IP address, you will be able to run your app on GKE.

Running your Streamlit app on GKE via external IP

Now that your app is running, you can monitor its health and performance. You will learn some basic monitoring and logging commands in the next part.

Monitoring and logging

After deployment, you can check pod status:

kubectl get pods

All the pods are running.

Check pod status

View logs for specific pods:

kubectl logs deployment/streamlit-app
kubectl logs deployment/ollama
kubectl logs deployment/qdrant

You can also use the kubectl describe pod command to investigate a problematic pod:

kubectl describe pod POD_NAME

This command shows scheduling information, events, and potential issues that are preventing the pod from starting. To stream a pod's logs as new lines arrive, add the -f flag to kubectl logs (for example, kubectl logs -f deployment/ollama).

For example, if a pod is stuck in the Pending state, check the messages in the Events section at the bottom of the output.

Common reasons for pods being stuck in the pending state include:

  • Insufficient resources: GPU node pool lacks enough capacity
  • Node selector issues: Mismatch in the specified label
  • Taint/Toleration issues: GPU nodes might have taints that the pod does not tolerate

Cleaning Up GKE Resources to prevent unexpected costs

When working with cloud resources like GKE, you need to clean up when you are done to avoid unexpected charges. Google Cloud bills for running clusters, persistent volumes, and other resources even when you are not actively using them.

You can manually delete resources using these commands:

# Delete all deployments
kubectl delete deployment streamlit-app ollama qdrant

# Delete services (they share the same names as their deployments)
kubectl delete service streamlit-app ollama qdrant

# Delete PVC (this will also delete the underlying PV)
kubectl delete pvc qdrant-pvc

# Finally, if you don't need the cluster anymore
gcloud container clusters delete $GKE_CLUSTER_NAME --zone $GKE_ZONE --project $PROJECT_ID

While GKE simplifies deployment, it is designed for production workloads and priced accordingly. When using GKE for educational purposes, you might consider Minikube or Docker Desktop.

Conclusion

You have successfully navigated the process of building and deploying a sophisticated multimodal RAG application. You now have a fully deployed, automated, and scalable multimodal RAG system capable of analyzing complex PDF documents.

You can sign up for a free CircleCI account and start automating your own projects today.