Sep 2, 2025 · 9 min read

Build and deploy a Pinecone question answering RAG application

Vivek Maskara

Software Engineer

Vector databases allow you to store, manage, and efficiently query high-dimensional vectors, which are numerical representations of data like text, images, or audio. Pinecone is a fully managed vector database optimized for fast, scalable similarity search, which makes it a good fit for powering a Retrieval-Augmented Generation (RAG) system. RAG enhances language model responses by grounding them in relevant context retrieved from your own documents.

Pinecone provides SDKs for multiple languages, including Python and TypeScript. In this tutorial, you will learn how to build a RAG-powered question-answering (QA) system using the Pinecone Python SDK and expose its functionality through a Flask REST API. The application will use LangChain to interface with OpenAI and Pinecone’s vector database. Finally, you will automate testing and deployment using CircleCI.

Prerequisites

For this tutorial, you need to set up a Python development environment on your machine. You also need a CircleCI account to automate the testing and deployment of the Python application. Refer to this list to set up everything required for this tutorial.

Create a Pinecone index

For this tutorial, you need a new serverless index. Head over to your Pinecone account dashboard and click Create index. Make sure to choose text-embedding-3-small as the embedding model while creating the index.

Create a new serverless index
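
If you prefer to create the index from code instead of the dashboard, the Pinecone Python SDK supports that as well. Here is a minimal sketch; the API key, cloud, and region are placeholders, and the dimension of 1536 matches the text-embedding-3-small model:

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")  # placeholder key
pc.create_index(
    name="pinecone-example-index",   # placeholder; must match PINECONE_INDEX_NAME used later
    dimension=1536,                  # text-embedding-3-small produces 1536-dimensional vectors
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)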

Next, create a Pinecone API key by navigating to the API keys tab.

Create a new Pinecone API key

Create a new Python project

First, create a new directory for your Python project and navigate into it.

mkdir rag-pinecone-app-circleci
cd rag-pinecone-app-circleci

Install the dependencies

In this tutorial, you will use the pinecone Python package for the Pinecone vector database and Flask for exposing the model’s ingestion and question-answering functionality as a REST API. You will also need a few LangChain packages to chunk text data and to interface with OpenAI and Pinecone. Create a requirements.txt file in the root of the project and add the following dependencies to it:

openai
pinecone
langchain
tiktoken
python-dotenv
flask
pytest
werkzeug
langchain_openai
langchain_pinecone
langchain_community
gunicorn

To install the dependencies, first create a virtual Python environment. This lets you install packages in an isolated environment instead of your system-wide Python installation. Execute the following commands to create and activate a new virtual environment:

python -m venv .venv
source .venv/bin/activate

Now install the dependencies for the project by issuing the pip install command (in your terminal):

pip install -r requirements.txt

Define the ingestion script

Before you can query documents using an LLM, you need to ingest and index them in a vector database. This process converts the raw text into embeddings and stores them in Pinecone, which enables efficient semantic retrieval later. In this section, you’ll define a utility function that handles this by loading a file, splitting it into smaller chunks, and storing the resulting embeddings in Pinecone. Create an app/ingest.py file and add the following code snippet to it:

# app/ingest.py
import os
from dotenv import load_dotenv
from langchain_pinecone import PineconeVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def ingest_document(file_path: str) -> dict:
    """
    Ingest a document into Pinecone vector store.

    Args:
        file_path (str): Path to the document to be ingested

    Returns:
        dict: Status of the ingestion
    """
    try:
        load_dotenv()
        index_name = os.getenv("PINECONE_INDEX_NAME")
        if not index_name:
            raise ValueError("PINECONE_INDEX_NAME must be set in your .env file.")

        embedding = OpenAIEmbeddings()
        loader = TextLoader(file_path)
        documents = loader.load()

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=100
        )
        docs = text_splitter.split_documents(documents)

        PineconeVectorStore.from_documents(
            docs,
            index_name=index_name,
            embedding=embedding,
            namespace="default"
        )

        return {
            "status": "success",
            "message": f"Document '{file_path}' successfully ingested into index '{index_name}'",
            "index_name": index_name
        }

    except Exception as e:
        print(f"Error during ingestion: {str(e)}")
        return {
            "status": "error",
            "message": str(e)
        }

Here’s how the ingest_document function works:

  • It loads environment variables from your .env file and reads the target index name from PINECONE_INDEX_NAME, failing fast if it is missing.
  • TextLoader reads the file, and RecursiveCharacterTextSplitter breaks it into chunks of roughly 1,000 characters with 100 characters of overlap so that context is preserved across chunk boundaries.
  • PineconeVectorStore.from_documents embeds each chunk using OpenAIEmbeddings and upserts the resulting vectors into the default namespace of your Pinecone index.
  • The function returns a status dictionary; any exception raised during ingestion is caught and returned as an error response instead of crashing the caller.
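
Once the environment variables described later in this tutorial are set, you can sanity-check the function from a Python shell before wiring it into an API. The sample.txt path below is a placeholder for any local text file:

from app.ingest import ingest_document

# Ingest a local text file and inspect the status dictionary it returns
result = ingest_document("sample.txt")
print(result["status"], "-", result["message"])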

Define the query script

Once documents are ingested into Pinecone, the next step is to query them using an LLM. This involves retrieving the most similar chunks from the vector index and passing them as context to OpenAI’s model for answer generation. In this section, you’ll define a function that takes a user question, performs a semantic search over your indexed data, and returns an AI-generated response.

Next, create an app/query.py file and add the following code snippet to it:

# app/query.py
import os
from dotenv import load_dotenv
from langchain_pinecone import PineconeVectorStore
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA

def ask(question: str) -> str:
    load_dotenv()
    embedding = OpenAIEmbeddings()
    index_name = os.getenv("PINECONE_INDEX_NAME")

    if not index_name:
        raise ValueError("PINECONE_INDEX_NAME must be set in your .env file.")

    docsearch = PineconeVectorStore.from_existing_index(
        index_name=index_name,
        embedding=embedding,
        namespace="default"
    )

    qa = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(),
        retriever=docsearch.as_retriever()
    )

    result = qa.invoke({"query": question})
    return result["result"]

Here’s how the ask function works:

  • It connects to the existing Pinecone index and loads the stored embeddings using PineconeVectorStore.from_existing_index.
  • A RetrievalQA chain is created using LangChain, which wraps a ChatOpenAI model with a retriever that fetches context from Pinecone.
  • The user’s question is passed to the chain via the invoke method, which returns a context-aware answer generated by the LLM.
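
Once the environment variables described in the next section are set and a document has been ingested, you can try the function from a Python shell; the question below is just an example:

from app.query import ask

# Ask a question against the previously ingested documents
print(ask("What is the document about?"))

By default the retriever returns the top few most similar chunks; you can control how many chunks are retrieved by passing search options when building it, for example docsearch.as_retriever(search_kwargs={"k": 2}).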

Set up the environment variables

You’ll notice that the ingest.py and query.py scripts rely on certain environment variables that need to be set up before running the application. Create a .env file in the project’s root and add the following environment variables to it:

OPENAI_API_KEY=
PINECONE_API_KEY=
PINECONE_INDEX_NAME=
PINECONE_INDEX_HOST=

Set the following values for the environment variables:

  • OPENAI_API_KEY: Add the OpenAI API key you created in the prerequisites section.
  • PINECONE_API_KEY: Create a Pinecone API key and add it here.
  • PINECONE_INDEX_NAME: Add the name of the Pinecone index that you created earlier.
  • PINECONE_INDEX_HOST: Add the host URL of the Pinecone index that you created earlier. You can find it on the index’s page in the Pinecone dashboard.
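
If you want to confirm that the variables load correctly before starting the app, a small throwaway script (a hypothetical check_env.py, not required by the application) can print which ones are set:

# check_env.py (hypothetical helper; not part of the application)
import os
from dotenv import load_dotenv

load_dotenv()
for var in ("OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_INDEX_NAME", "PINECONE_INDEX_HOST"):
    # Report whether each variable is set without printing its value
    print(f"{var}: {'set' if os.getenv(var) else 'MISSING'}")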

Adding unit tests for ingestion

Now that you have defined the app logic, add some unit tests to verify that everything works as expected. Start with the ingestion logic: the tests below check that ingest_document behaves correctly in both successful and failure scenarios, without making real API calls or reading real files. Create a tests/test_ingest.py file and add this code snippet to it:

# tests/test_ingest.py

import sys
import os
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

import pytest
from unittest.mock import patch, MagicMock
from app.ingest import ingest_document

@patch("app.ingest.load_dotenv")
@patch("app.ingest.PineconeVectorStore")
@patch("app.ingest.OpenAIEmbeddings")
@patch("app.ingest.TextLoader")
@patch("app.ingest.RecursiveCharacterTextSplitter")
def test_ingest_document_success(mock_text_splitter, mock_text_loader, mock_embeddings, mock_vectorstore, mock_load_dotenv):
    mock_load_dotenv.return_value = None
    mock_loader_instance = MagicMock()
    mock_loader_instance.load.return_value = ["mock_document"]
    mock_text_loader.return_value = mock_loader_instance

    mock_splitter_instance = MagicMock()
    mock_splitter_instance.split_documents.return_value = ["mock_chunk1", "mock_chunk2"]
    mock_text_splitter.return_value = mock_splitter_instance

    mock_embeddings.return_value = MagicMock()
    mock_vectorstore.from_documents.return_value = None

    with patch.dict("os.environ", {"PINECONE_INDEX_NAME": "mock-index"}):
        result = ingest_document("mock_file.txt")

    assert result["status"] == "success"
    assert result["index_name"] == "mock-index"
    assert "successfully ingested" in result["message"]

    mock_text_loader.assert_called_once_with("mock_file.txt")
    mock_loader_instance.load.assert_called_once()
    mock_text_splitter.assert_called_once()
    mock_splitter_instance.split_documents.assert_called_once_with(["mock_document"])
    mock_embeddings.assert_called_once()
    mock_vectorstore.from_documents.assert_called_once_with(
        ["mock_chunk1", "mock_chunk2"],
        index_name="mock-index",
        embedding=mock_embeddings.return_value,
        namespace="default"
    )

@patch("app.ingest.load_dotenv")
def test_ingest_document_missing_env(mock_load_dotenv):
    mock_load_dotenv.return_value = None
    with patch.dict("os.environ", {}, clear=True):
        result = ingest_document("mock_file.txt")

    assert result["status"] == "error"
    assert "PINECONE_INDEX_NAME must be set" in result["message"]

Adding unit tests for query

Similar to the ingestion script, let’s define tests for the query script. Create a tests/test_query.py file and add this code snippet to it:

# tests/test_query.py

import sys
import os
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

import pytest
from unittest.mock import patch, MagicMock
from app.query import ask

@patch("app.query.load_dotenv")
@patch("app.query.PineconeVectorStore")
@patch("app.query.OpenAIEmbeddings")
@patch("app.query.ChatOpenAI")
@patch("app.query.RetrievalQA")
def test_ask_success(mock_retrieval_qa, mock_chat_openai, mock_embeddings, mock_vectorstore, mock_load_dotenv):
    mock_load_dotenv.return_value = None
    mock_docsearch = MagicMock()
    mock_vectorstore.from_existing_index.return_value = mock_docsearch

    mock_llm = MagicMock()
    mock_chat_openai.return_value = mock_llm

    mock_qa_chain = MagicMock()
    mock_qa_chain.invoke.return_value = {"result": "mocked answer"}
    mock_retrieval_qa.from_chain_type.return_value = mock_qa_chain

    with patch.dict("os.environ", {"PINECONE_INDEX_NAME": "mock-index"}):
        response = ask("What is RAG?")

    assert response == "mocked answer"

    mock_vectorstore.from_existing_index.assert_called_once_with(
        index_name="mock-index",
        embedding=mock_embeddings.return_value,
        namespace="default"
    )
    mock_chat_openai.assert_called_once()
    mock_retrieval_qa.from_chain_type.assert_called_once_with(
        llm=mock_chat_openai.return_value,
        retriever=mock_docsearch.as_retriever.return_value
    )
    mock_qa_chain.invoke.assert_called_once_with({"query": "What is RAG?"})

@patch("app.query.load_dotenv")
@patch("app.query.OpenAIEmbeddings")
def test_ask_missing_env(mock_embeddings, mock_load_dotenv):
    mock_load_dotenv.return_value = None
    mock_embeddings.return_value = None

    with patch.dict("os.environ", {}, clear=True):
        with pytest.raises(ValueError) as excinfo:
            ask("Test question")

    assert "PINECONE_INDEX_NAME must be set" in str(excinfo.value)

You can run the tests by executing the following command:

pytest ./

Once you execute the tests, the output should show that all of them passed:

================================================ test session starts ================================================
platform darwin -- Python 3.9.12, pytest-8.3.4, pluggy-1.5.0
...
collected 4 items

tests/test_ingest.py ..                                                                                       [ 50%]
tests/test_query.py ..                                                                                        [100%]

================================================= 4 passed in 0.93s =================================================

Define the Flask server

To expose the ingestion and question-answering functionality as HTTP endpoints, you’ll create a Flask server. Flask is an open-source Python framework for developing web applications and is a popular choice for building the API layer of lightweight services. Exposing these endpoints allows external clients to interact with your RAG system via REST API calls. Create a server.py file in the project’s root and add the following code snippet to it:

# server.py
import os
from flask import Flask, request, jsonify
from dotenv import load_dotenv
from app.query import ask
from app.ingest import ingest_document
from werkzeug.utils import secure_filename

load_dotenv()
app = Flask(__name__)

@app.route('/ask', methods=['POST'])
def query_rag():
    data = request.get_json()
    question = data.get("question")
    if not question:
        return jsonify({"error": "Missing 'question' in request"}), 400

    try:
        answer = ask(question)
        return jsonify({"question": question, "answer": answer})
    except Exception as e:
        return jsonify({"error": str(e)}), 500

@app.route('/ingest', methods=['POST'])
def ingest_file():
    try:
        if 'file' not in request.files:
            return jsonify({"error": "No file provided"}), 400

        file = request.files['file']
        if file.filename == '':
            return jsonify({"error": "No file selected"}), 400

        temp_file_path = f"temp_{secure_filename(file.filename)}"
        file.save(temp_file_path)

        result = ingest_document(temp_file_path)
        os.remove(temp_file_path)

        return jsonify(result)

    except Exception as e:
        return jsonify({"status": "error", "message": str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Here’s an overview of the endpoints defined in the Flask server:

  • POST /ask accepts a JSON payload with a question field, runs it through the retrieval and LLM chain (i.e. by calling the ask function defined earlier), and returns the generated answer.
  • POST /ingest accepts a file upload, temporarily saves the document, ingests its contents into Pinecone by invoking ingest_document, and returns a status message.

To test the API endpoint, start the Flask web server by executing this command:

flask --app server run

It will start a web server at http://localhost:5000, and you can test the ingestion API using curl:

curl --location 'http://127.0.0.1:5000/ingest' \
--form 'file=@"/Users/vivekmaskara/Downloads/paul_graham_essay.txt"'

Note: Make sure that the input text file exists at the specified file location. You can download the sample paul_graham_essay.txt used in the above code snippet from GitHub.

Your output should be similar to this:

{"index_name":"pinecone-example-index","message":"Document 'temp_paul_graham_essay.txt' successfully ingested into index 'pinecone-example-index'","status":"success"}

Similarly, you can test the query API using curl:

curl --location 'http://127.0.0.1:5000/ask' \
--header 'Content-Type: application/json' \
--data '{
    "question": "What is the essay about?"
}'

You should see an output similar to this:

{"answer":"The essay is about the experience and process of writing and publishing essays online. It discusses the shift from traditional publishing channels to online platforms, the author's motivation and inspiration for writing essays, the relationship between giving talks and writing essays, and the challenges faced in receiving feedback and criticism online. It also touches upon the significance of choosing what to work on, how ideas are developed, and the decision-making process behind selecting topics to write about.","question":"What is the essay about?"}

Setting up the Heroku application

Now that you have tested the application locally, you can deploy the app on a remote server such as Heroku. This will enable more people to use the QA system.

Configure Heroku

Before deploying to Heroku, you need to create a Heroku app from the Heroku dashboard. Provide a name and choose a region for the new app.

Create a new Heroku app

Also, for your app to work correctly, you need to subscribe to a Heroku Dyno. Go to Account > Billing. Click Subscribe under Eco Dynos Plan to subscribe to a Dyno.

Subscribe to Heroku Dyno

Finally, you need to configure the application’s config variables on Heroku so that the Flask application works as expected. Go to your app’s Settings > Config vars section and add the same variables you defined in your local .env file.

Set up Heroku config variables

Configure the Flask app

In the codebase, create a Procfile in the project’s root and add this content:

web: gunicorn server:app

Heroku uses the Procfile to determine which command to run when the application starts.
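
Before deploying, you can verify that the gunicorn entry point declared in the Procfile works locally; gunicorn is already included in requirements.txt, and the port below is arbitrary:

gunicorn server:app --bind 0.0.0.0:5000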

Automate deployment using CircleCI

Now that you have tested the APIs locally, automate the workflow so that the unit tests run and the application is deployed to Heroku every time you push code to the main branch.

Add the configuration script

First, add a .circleci/config.yml file in the project’s root containing the configuration for the CI pipeline. Add this code to it:

version: 2.1

orbs:
  heroku: circleci/heroku@2.0

executors:
  python-executor:
    docker:
      - image: cimg/python:3.11

jobs:
  test:
    executor: python-executor
    steps:
      - checkout
      - run:
          name: Install dependencies and run tests
          command: |
            python -m venv venv
            . venv/bin/activate
            pip install --upgrade pip
            pip install -r requirements.txt
            pytest tests/

workflows:
  test-and-deploy:
    jobs:
      - test
      - heroku/deploy-via-git:
          requires:
            - test
          filters:
            branches:
              only: main
          app-name: "${HEROKU_APP_NAME}"

Take a moment to review the CircleCI configuration:

  • The test job uses the cimg/python:3.11 Docker image as its environment, sets up a virtual environment, installs dependencies from requirements.txt, and runs the unit tests using pytest.
  • The heroku/deploy-via-git job uses the CircleCI Heroku orb to deploy the app to Heroku. This job is triggered only if the test job is successful, ensuring that only tested code is deployed to production from the main branch.
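
Optionally, if you have the CircleCI CLI installed locally, you can validate the configuration before pushing it:

circleci config validate .circleci/config.yml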

Now that the configuration file has been properly set up, create a repository for the project on GitHub and push all the code to it. Review Pushing a project to GitHub for instructions.

Set up the project on CircleCI

Next, log into your CircleCI account. On the CircleCI dashboard, click the Projects tab, search for the GitHub repo name and click Set Up Project for your project.

Setting up the project on CircleCI

You will be prompted to add a new configuration file manually or use an existing one. Since you have already pushed the required configuration file to the codebase, select the Fastest option and enter the name of the branch hosting your configuration file. Click Set Up Project to continue.

Setting up project configuration

Completing the setup will trigger the pipeline, but it will fail because the environment variables are not yet set.

Set environment variables

On the project page, click Project settings and go to the Environment variables tab. On the screen that appears, click the Add environment variable button and add the following environment variables:

  • HEROKU_APP_NAME: the name you chose when creating the Heroku application.
  • HEROKU_API_KEY: your Heroku account’s API key. You can retrieve it from your Heroku account settings page.

Once you add the environment variables, the keys will be listed on the dashboard.

Set environment variables on CircleCI

Now that the environment variables are configured, trigger the pipeline again. This time the build should succeed.

Successful build on CircleCI

Conclusion

In this tutorial, you learned how to build a Retrieval-Augmented Generation (RAG) application using Pinecone and OpenAI, and expose it via a Flask API. You also saw how to automate testing and deployment using CircleCI and Heroku. Pinecone simplifies working with high-dimensional vector data, making it easy to store and retrieve semantically relevant context for your language model queries.

With CircleCI, you can automate the build and testing pipeline for continuous integration. The pipeline can be used to execute unit tests for the application using pytest to boost development speed.

You can check out the complete source code used in this tutorial on GitHub.


Vivek Kumar Maskara is a Software Engineer at JP Morgan. He loves writing code, developing apps, creating websites, and writing technical blogs about his experiences. His profile and contact information can be found at maskaravivek.com.