Build and deploy a Pinecone question answering RAG application


Vector databases allow you to store, manage, and efficiently query high-dimensional vectors, which are numerical representations of data like text, images, or audio. Pinecone is a fully managed vector database optimized for fast, scalable similarity search, which makes it a good fit for powering a Retrieval-Augmented Generation (RAG) system. RAG enhances language model responses by grounding them in relevant context retrieved from your own documents.
Pinecone provides SDKs for Python and TypeScript. In this tutorial, you will learn how to build a RAG-powered question-answering (QA) system using the Pinecone Python SDK and expose the functionality through Flask REST APIs. The application will use LangChain to interface with OpenAI and Pinecone's vector database. Finally, you will automate testing using CircleCI.
Prerequisites
For this tutorial, you need to set up a Python development environment on your machine. You also need a CircleCI account to automate the testing and deployment of the Python application. Refer to this list to set up everything required for this tutorial.
- Download and install Python
- Sign up for a Pinecone account
- Create a CircleCI account
- Sign up for a Heroku account
- Sign up for an OpenAI account and create a new OpenAI API key
Create a Pinecone index
For this tutorial, you need to create a new serverless index. Head over to your Pinecone account dashboard and click Create index to create a new serverless index. Make sure to choose text-embedding-3-small as the embedding model when creating the index.
Next, create a Pinecone API key by navigating to the API keys tab.
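If you prefer managing infrastructure from code, you can also create the serverless index with the Pinecone Python SDK (installed later in this tutorial) instead of the dashboard. The following is a minimal sketch, not part of the tutorial's required steps; the index name, cloud, and region are assumptions, so adjust them to match your setup. text-embedding-3-small produces 1536-dimensional vectors.

# create_index.py (optional): create the serverless index from code instead of the dashboard.
# Assumes PINECONE_API_KEY is set in your environment; the name, cloud, and region are examples.
import os

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

index_name = "pinecone-example-index"  # hypothetical name; use your own
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # dimension of text-embedding-3-small vectors
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )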
Create a new Python project
First, create a new directory for your Python project and navigate into it.
mkdir rag-pinecone-app-circleci
cd rag-pinecone-app-circleci
Install the dependencies
In this tutorial, you will use the pinecone Python package for the Pinecone vector database and Flask for exposing the model's ingestion and question-answering functionality as a REST API. You will also need a few LangChain packages to chunk text data and interface with OpenAI and Pinecone. Create a requirements.txt file in the root of the project and add the following dependencies to it:
openai
pinecone
langchain
tiktoken
python-dotenv
flask
pytest
werkzeug
langchain_openai
langchain_pinecone
langchain_community
gunicorn
To install the dependencies, first create a virtual Python environment. This lets you install Python packages in an isolated environment instead of into your entire local machine. Execute the following commands to create and activate a new virtual environment.
python -m venv .venv
source .venv/bin/activate
Now install the dependencies for the project by issuing the pip install command (in your terminal):
pip install -r requirements.txt
Define the ingestion script
Before you can query documents using an LLM, you need to ingest and index them in a vector database. This process involves converting the raw text into embeddings and storing them in Pinecone, which enables efficient semantic retrieval later. In this section, you'll define a utility function that handles this by loading a file, splitting it into smaller chunks, and storing the chunks in Pinecone. Create an app/ingest.py file and add the following code snippet to it:
import os
from dotenv import load_dotenv
from langchain_pinecone import PineconeVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter


def ingest_document(file_path: str) -> dict:
    """
    Ingest a document into Pinecone vector store.

    Args:
        file_path (str): Path to the document to be ingested

    Returns:
        dict: Status of the ingestion
    """
    try:
        load_dotenv()

        index_name = os.getenv("PINECONE_INDEX_NAME")
        if not index_name:
            raise ValueError("PINECONE_INDEX_NAME must be set in your .env file.")

        embedding = OpenAIEmbeddings()

        loader = TextLoader(file_path)
        documents = loader.load()

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=100
        )
        docs = text_splitter.split_documents(documents)

        PineconeVectorStore.from_documents(
            docs,
            index_name=index_name,
            embedding=embedding,
            namespace="default"
        )

        return {
            "status": "success",
            "message": f"Document '{file_path}' successfully ingested into index '{index_name}'",
            "index_name": index_name
        }
    except Exception as e:
        print(f"Error during ingestion: {str(e)}")
        return {
            "status": "error",
            "message": str(e)
        }
Here's how the ingest_document function works:
- It initializes an embedding model using OpenAI (OpenAIEmbeddings) and loads the input document using TextLoader.
- The text is split into smaller chunks using RecursiveCharacterTextSplitter to ensure they fit within token limits.
- Finally, the chunked documents are embedded and stored in Pinecone using the PineconeVectorStore.from_documents() method.
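To sanity-check the ingestion locally before wiring it into an API, you can call the function directly from a small script. This is a minimal sketch; the sample.txt file name is an assumption, so point it at any text file you have on disk, and it assumes the environment variables described later in this tutorial are already configured.

# quick local check of the ingestion helper (assumes sample.txt exists and
# the .env variables described in a later section are set)
from app.ingest import ingest_document

result = ingest_document("sample.txt")  # hypothetical file name
print(result["status"], "-", result["message"])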
Define the query script
Once documents are ingested into Pinecone, the next step is to query them using an LLM. This involves retrieving the most similar chunks from the vector index and passing them as context to OpenAI's model for answer generation. In this section, you'll define a function that takes a user question, performs a semantic search over your indexed data, and returns an AI-generated response.
Next, create an app/query.py file and add the following code snippet to it:
# app/query.py
import os
from dotenv import load_dotenv
from langchain_pinecone import PineconeVectorStore
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA


def ask(question: str) -> str:
    load_dotenv()

    embedding = OpenAIEmbeddings()

    index_name = os.getenv("PINECONE_INDEX_NAME")
    if not index_name:
        raise ValueError("PINECONE_INDEX_NAME must be set in your .env file.")

    docsearch = PineconeVectorStore.from_existing_index(
        index_name=index_name,
        embedding=embedding,
        namespace="default"
    )

    qa = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(),
        retriever=docsearch.as_retriever()
    )

    result = qa.invoke({"query": question})
    return result["result"]
Here’s how the ask function works:
- It connects to the existing Pinecone index and loads the stored embeddings using PineconeVectorStore.from_existing_index.
- A RetrievalQA chain is created using LangChain, which wraps a ChatOpenAI model with a retriever that fetches context from Pinecone.
- The user’s question is passed to the chain via the invoke method, which returns a context-aware answer generated by the LLM.
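You can try the query helper from a Python shell once some documents have been ingested. A minimal sketch, assuming the environment variables covered in the next section are set and the question is only an example:

# quick local check of the query helper (assumes documents were already ingested
# and the environment variables described below are configured)
from app.query import ask

answer = ask("What is the document about?")  # example question
print(answer)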
Set up the environment variables
You'll notice that the ingest.py and query.py scripts rely on certain environment variables that need to be set up before running the application. Create a .env file in the project's root and add the following environment variables to it:
OPENAI_API_KEY=
PINECONE_API_KEY=
PINECONE_INDEX_NAME=
PINECONE_INDEX_HOST=
Set the following values for the environment variables:
- OPENAI_API_KEY: Add the OpenAI API key you created in the prerequisites section.
- PINECONE_API_KEY: Add the Pinecone API key you created earlier.
- PINECONE_INDEX_NAME: Add the name of the Pinecone index that you created earlier.
- PINECONE_INDEX_HOST: Add the Pinecone host URL for the index that you created in the previous section.
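As a quick guard against misconfiguration, you can verify that all required variables are present before starting the application. This is an optional sketch using python-dotenv; the check_env.py file name is an assumption:

# check_env.py (optional): fail fast if any required environment variable is missing
import os

from dotenv import load_dotenv

load_dotenv()

REQUIRED_VARS = ["OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_INDEX_NAME", "PINECONE_INDEX_HOST"]
missing = [name for name in REQUIRED_VARS if not os.getenv(name)]

if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All required environment variables are set.")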
Adding unit tests for ingestion
Now that you have defined the app logic, add some unit tests to verify that everything works as expected. Add tests for the ingestion logic to verify that the ingest_document function behaves as expected under both successful and failure scenarios, without making real API calls or file reads. Create a tests/test_ingest.py file and add this code snippet to it:
# tests/test_ingest.py
import sys
import os

sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

import pytest
from unittest.mock import patch, MagicMock

from app.ingest import ingest_document


@patch("app.ingest.load_dotenv")
@patch("app.ingest.PineconeVectorStore")
@patch("app.ingest.OpenAIEmbeddings")
@patch("app.ingest.TextLoader")
@patch("app.ingest.RecursiveCharacterTextSplitter")
def test_ingest_document_success(mock_text_splitter, mock_text_loader, mock_embeddings, mock_vectorstore, mock_load_dotenv):
    mock_load_dotenv.return_value = None

    mock_loader_instance = MagicMock()
    mock_loader_instance.load.return_value = ["mock_document"]
    mock_text_loader.return_value = mock_loader_instance

    mock_splitter_instance = MagicMock()
    mock_splitter_instance.split_documents.return_value = ["mock_chunk1", "mock_chunk2"]
    mock_text_splitter.return_value = mock_splitter_instance

    mock_embeddings.return_value = MagicMock()
    mock_vectorstore.from_documents.return_value = None

    with patch.dict("os.environ", {"PINECONE_INDEX_NAME": "mock-index"}):
        result = ingest_document("mock_file.txt")

    assert result["status"] == "success"
    assert result["index_name"] == "mock-index"
    assert "successfully ingested" in result["message"]

    mock_text_loader.assert_called_once_with("mock_file.txt")
    mock_loader_instance.load.assert_called_once()
    mock_text_splitter.assert_called_once()
    mock_splitter_instance.split_documents.assert_called_once_with(["mock_document"])
    mock_embeddings.assert_called_once()
    mock_vectorstore.from_documents.assert_called_once_with(
        ["mock_chunk1", "mock_chunk2"],
        index_name="mock-index",
        embedding=mock_embeddings.return_value,
        namespace="default"
    )


@patch("app.ingest.load_dotenv")
def test_ingest_document_missing_env(mock_load_dotenv):
    mock_load_dotenv.return_value = None

    with patch.dict("os.environ", {}, clear=True):
        result = ingest_document("mock_file.txt")

    assert result["status"] == "error"
    assert "PINECONE_INDEX_NAME must be set" in result["message"]
Adding unit tests for query
Similar to the ingestion script, let's define tests for the query script by creating a tests/test_query.py file and adding this code snippet to it:
# tests/test_query.py
import sys
import os

sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

import pytest
from unittest.mock import patch, MagicMock

from app.query import ask


@patch("app.query.load_dotenv")
@patch("app.query.PineconeVectorStore")
@patch("app.query.OpenAIEmbeddings")
@patch("app.query.ChatOpenAI")
@patch("app.query.RetrievalQA")
def test_ask_success(mock_retrieval_qa, mock_chat_openai, mock_embeddings, mock_vectorstore, mock_load_dotenv):
    mock_load_dotenv.return_value = None

    mock_docsearch = MagicMock()
    mock_vectorstore.from_existing_index.return_value = mock_docsearch

    mock_llm = MagicMock()
    mock_chat_openai.return_value = mock_llm

    mock_qa_chain = MagicMock()
    mock_qa_chain.invoke.return_value = {"result": "mocked answer"}
    mock_retrieval_qa.from_chain_type.return_value = mock_qa_chain

    with patch.dict("os.environ", {"PINECONE_INDEX_NAME": "mock-index"}):
        response = ask("What is RAG?")

    assert response == "mocked answer"

    mock_vectorstore.from_existing_index.assert_called_once_with(
        index_name="mock-index",
        embedding=mock_embeddings.return_value,
        namespace="default"
    )
    mock_chat_openai.assert_called_once()
    mock_retrieval_qa.from_chain_type.assert_called_once_with(
        llm=mock_chat_openai.return_value,
        retriever=mock_docsearch.as_retriever.return_value
    )
    mock_qa_chain.invoke.assert_called_once_with({"query": "What is RAG?"})


@patch("app.query.load_dotenv")
@patch("app.query.OpenAIEmbeddings")
def test_ask_missing_env(mock_embeddings, mock_load_dotenv):
    mock_load_dotenv.return_value = None
    mock_embeddings.return_value = None

    with patch.dict("os.environ", {}, clear=True):
        with pytest.raises(ValueError) as excinfo:
            ask("Test question")

    assert "PINECONE_INDEX_NAME must be set" in str(excinfo.value)
You can run the tests by executing the following command:
pytest ./
Once you execute the tests, the output should show that they passed:
================================================ test session starts ================================================
platform darwin -- Python 3.9.12, pytest-8.3.4, pluggy-1.5.0
...
collected 4 items
tests/test_ingest.py .. [ 50%]
tests/test_query.py .. [100%]
================================================= 4 passed in 0.93s =================================================
Define the Flask server
To expose the ingestion and question-answering functionality as HTTP endpoints, you'll create a Flask server. Flask is an open-source Python framework for developing web applications. It is a popular choice for building the API service layer for lightweight applications. This allows external clients to interact with your RAG system via REST API calls. Create a server.py file in the project's root and add the following code snippet to it:
# server.py
import os

from flask import Flask, request, jsonify
from dotenv import load_dotenv

from app.query import ask
from app.ingest import ingest_document

import werkzeug

load_dotenv()

app = Flask(__name__)


@app.route('/ask', methods=['POST'])
def query_rag():
    data = request.get_json()
    question = data.get("question")

    if not question:
        return jsonify({"error": "Missing 'question' in request"}), 400

    try:
        answer = ask(question)
        return jsonify({"question": question, "answer": answer})
    except Exception as e:
        return jsonify({"error": str(e)}), 500


@app.route('/ingest', methods=['POST'])
def ingest_file():
    try:
        if 'file' not in request.files:
            return jsonify({"error": "No file provided"}), 400

        file = request.files['file']
        if file.filename == '':
            return jsonify({"error": "No file selected"}), 400

        temp_file_path = f"temp_{werkzeug.utils.secure_filename(file.filename)}"
        file.save(temp_file_path)

        result = ingest_document(temp_file_path)

        os.remove(temp_file_path)
        return jsonify(result)
    except Exception as e:
        return jsonify({"status": "error", "message": str(e)}), 500


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
Here's an overview of the endpoints defined in the Flask server:
- POST /ask accepts a JSON payload with a question field, runs it through the retrieval and LLM chain (i.e. by calling the ask function defined earlier), and returns the generated answer.
- POST /ingest accepts a file upload, temporarily saves the document, ingests its contents into Pinecone by invoking ingest_document, and returns a status message.
To test the API endpoint, start the Flask web server by executing this command:
flask --app server run
It will start a web server at http://localhost:5000, and you can test the ingestion API using curl.
curl --location 'http://127.0.0.1:5000/ingest' \
--form 'file=@"/Users/vivekmaskara/Downloads/paul_graham_essay.txt"'
Note: Make sure that the input text file exists at the specified file location. You can download the sample paul_graham_essay.txt used in the above code snippet from GitHub.
Your output should be similar to this:
{"index_name":"pinecone-example-index","message":"Document 'temp_paul_graham_essay.txt' successfully ingested into index 'pinecone-example-index'","status":"success"}
Similarly, you can test the query API using curl:
curl --location 'http://127.0.0.1:5000/ask' \
--header 'Content-Type: application/json' \
--data '{
"question": "What is the essay about?"
}'
You should see an output similar to this:
{"answer":"The essay is about the experience and process of writing and publishing essays online. It discusses the shift from traditional publishing channels to online platforms, the author's motivation and inspiration for writing essays, the relationship between giving talks and writing essays, and the challenges faced in receiving feedback and criticism online. It also touches upon the significance of choosing what to work on, how ideas are developed, and the decision-making process behind selecting topics to write about.","question":"What is the essay about?"}
Setting up the Heroku application
Now that you have tested the application locally, you can deploy the app on a remote server such as Heroku. This will enable more people to use the QA system.
Configure Heroku
Before deploying to Heroku, you need to create a Heroku app using its dashboard. Provide a name and choose a region for the new app.
Also, for your app to work correctly, you need to subscribe to a Heroku Dyno. Go to Account > Billing. Click Subscribe under Eco Dynos Plan to subscribe to a Dyno.
Finally, you need to configure the application config variables on Heroku so that the Flask application works as expected. Go to your project’s Settings > Config vars section and add the variables.
Configure the Flask app
In the codebase, create a Procfile in the project's root and add this content:
web: gunicorn server:app
Heroku uses the Procfile to determine the command to execute when the app starts up.
Automate deployment using CircleCI
Now that you have tested the APIs locally, automate the workflow so that the unit tests are executed and the application is deployed to Heroku every time you push code to the main branch.
Add the configuration script
First, add a .circleci/config.yml file in the project's root containing the configuration for the CI pipeline. Add this code to it:
version: 2.1

orbs:
  heroku: circleci/heroku@2.0

executors:
  python-executor:
    docker:
      - image: cimg/python:3.11

jobs:
  test:
    executor: python-executor
    steps:
      - checkout
      - run:
          name: Install dependencies and run tests
          command: |
            python -m venv venv
            . venv/bin/activate
            pip install --upgrade pip
            pip install -r requirements.txt
            pytest tests/

workflows:
  test-and-deploy:
    jobs:
      - test
      - heroku/deploy-via-git:
          requires:
            - test
          filters:
            branches:
              only: main
          app-name: "${HEROKU_APP_NAME}"
Take a moment to review the CircleCI configuration:
- The test job uses the cimg/python:3.11 Docker image as its environment, sets up a virtual environment, installs dependencies from requirements.txt, and runs the unit tests using pytest.
- The heroku/deploy-via-git job uses the CircleCI Heroku orb to deploy the app to Heroku. This job is triggered only if the test job is successful, ensuring that only tested code is deployed to production from the main branch.
Now that the configuration file has been properly set up, create a repository for the project on GitHub and push all the code to it. Review Pushing a project to GitHub for instructions.
Set up the project on CircleCI
Next, log into your CircleCI account. On the CircleCI dashboard, click the Projects tab, search for the GitHub repo name and click Set Up Project for your project.
You will be prompted to add a new configuration file manually or use an existing one. Since you have already pushed the required configuration file to the codebase, select the Fastest option and enter the name of the branch hosting your configuration file. Click Set Up Project to continue.
Completing the setup will trigger the pipeline, but it will fail because the environment variables are not yet set.
Set environment variables
On the project page, click Project settings and go to the Environment variables tab. On the screen that appears, click the Add environment variable button and add the following environment variables:
- Set HEROKU_APP_NAME to the app name that you used while creating the Heroku application.
- Set HEROKU_API_KEY to your Heroku account's API key. You can retrieve your API key from your Heroku account's settings page.
Once you add the environment variables, they will be listed on the dashboard.
Now that the environment variables are configured, trigger the pipeline again. This time the build should succeed.
Conclusion
In this tutorial, you learned how to build a Retrieval-Augmented Generation (RAG) application using Pinecone and OpenAI, and expose it via a Flask API. You also saw how to automate testing and deployment using CircleCI and Heroku. Pinecone simplifies working with high-dimensional vector data, making it easy to store and retrieve semantically relevant context for your language model queries.
With CircleCI, you can automate the build and testing pipeline for continuous integration. The pipeline executes the application's unit tests with pytest on every push, which boosts development speed.
You can check out the complete source code used in this tutorial on GitHub.