Tutorials · Oct 14, 2025 · 14 min read

Building LLM agents to validate LangGraph tool use and structured API responses

Muhammad Arham

Senior NLP Researcher

Transitioning LLM agents from intriguing prototypes to reliable, production-grade solutions introduces a unique and significant challenge: the inherent stochasticity of LLMs. Unlike conventional software, where inputs predictably yield precise outputs, an LLM’s response can exhibit variability even when presented with identical prompts.

To ensure the dependability of your LLM agent, you need a rigorous validation strategy. For any agent intended for production, you must verify more than whether the agent can call tools at all.

You need to confirm several vital aspects at each stage of the tool-calling process:

  • Correct tool invocation: Did the LLM accurately select the appropriate tool for the task, and did it maintain a logical chain-of-thought through multi-step operations?
  • Accurate parameter generation: Were the arguments generated by the LLM for the chosen tool precise and valid for its intended execution?
  • Tool output parsing: Did your agent correctly interpret the response from the tool, especially when dealing with complex or varied data returned from external APIs?
  • Output schema validation: Was the tool’s output consistently structured according to a defined schema?

By integrating automated testing into your CI/CD pipeline, you can systematically validate your LLM agent’s behavior. This automation ensures that any modifications to your prompts, adjustments to your agent’s core logic, or updates to your tools do not introduce regressions. This approach significantly reduces the occurrence of failed workflows, providing the robustness essential for production-grade applications.

CircleCI pipelines can automate validation of your agentic workflows. In the following sections, you will build this kind of pipeline using:

  • LangGraph for structured agent development
  • Pydantic for rigorous data validation
  • PyTest for testing dynamic workflows
  • CircleCI for continuous, automated quality assurance

Prerequisites

To follow this guide and build your LLM agent CI/CD pipeline, here’s what you’ll need:

  • Python 3.10 or later installed on your system
  • An OpenAI API key
  • A GitHub account for hosting your code
  • A CircleCI account connected to your GitHub account

Setting up your Python project

First, create a new project directory and go to it using your terminal:

mkdir AgentToolValidation
cd AgentToolValidation

It is a recommended best practice to create a Python virtual environment. For Unix-based systems, create a new virtual environment with these commands:

python3 -m venv venv
source venv/bin/activate

Next, define and install your project’s dependencies. Create a requirements.txt file in your project’s root directory and add this content:

langgraph==0.4.7          # For building robust, stateful LLM agents with clear execution flow
pytest==8.3.5             # Our primary framework for writing and running tests
pytest-asyncio==1.0.0     # Enables pytest to discover and execute asynchronous test functions
langchain-openai==0.3.18  # Integrates LangGraph's agent with OpenAI's LLMs
requests                  # HTTP client used directly by the tools in agent/tools.py

Install these dependencies using pip:

pip install -r requirements.txt

Create a new file named .env, and add this content:

OPENAI_API_KEY="YOUR_OPENAI_API_KEY_HERE"

Once these files are set up, create these directories. It’s important to add an empty __init__.py file inside each of them. This tells Python that these directories should be treated as packages, allowing for proper imports between your agent’s modules and your test files.

mkdir agent tests
touch agent/__init__.py tests/__init__.py
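For reference, this is the layout you will end up with by the end of the tutorial (the test files and CircleCI configuration are created in later sections):

AgentToolValidation/
├── agent/
│   ├── __init__.py
│   ├── tools.py          # Tool implementations and Pydantic schemas
│   └── agent.py          # LangGraph agent
├── tests/
│   ├── __init__.py
│   ├── conftest.py       # Shared fixtures and helpers
│   ├── test_weather_workflow.py
│   ├── test_wikipedia_workflow.py
│   └── test_calculator_workflow.py
├── .circleci/
│   └── config.yml        # CircleCI pipeline
├── .env                  # OPENAI_API_KEY (not committed)
├── .gitignore
├── main.py               # Local smoke test
└── requirements.txt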

With your environment ready and dependencies installed, you are all set to start building the modular components of your LLM agent.

Building structured tools with Pydantic

Your LLM agent’s true power lies in its ability to interact with the world through tools. For this project, you will implement a few practical examples that demonstrate real-time interactions with external services:

  • Weather Tool: Using Open-Meteo’s Geocoding and Weather APIs, this two-step tool first performs a geocoding lookup to convert a human-readable city name (like “London”) into precise latitude and longitude coordinates, then calls the weather API to fetch real-time current conditions for those coordinates.
  • Wikipedia Tool: Designed to search for and summarize information from the Wikipedia API, allowing the agent to answer factual questions by retrieving up-to-date knowledge.
  • Calculator Tool: A simple utility for performing basic mathematical operations, extending the agent’s numerical reasoning capabilities beyond what the LLM can reliably compute internally.

Create a new file agent/tools.py and add this code:

# File Name: agent/tools.py

import typing as T
import json
import requests
from langchain_core.tools import tool
from pydantic import BaseModel, Field, ValidationError

class WikipediaToolInputSchema(BaseModel):
    """Input for the search_wikipedia tool"""
    query: str = Field(description="Query to search Wikipedia for")

class WikipediaArticle(BaseModel):
    """Represents a summary of a Wikipedia article"""
    title: str = Field(description="Title of the Wikipedia article")
    summary: str = Field(description="Summary of the Wikipedia article")
    url: str = Field(description="URL of the Wikipedia article")

class CoordinatesToolInputSchema(BaseModel):
    """Input for the get_coordinates_from_city tool"""
    city_name: str = Field(description="Name of the city to get coordinates for")

class Coordinates(BaseModel):
    """Represents geographical coordinates for a location"""
    latitude: float = Field(description="Latitude of the location")
    longitude: float = Field(description="Longitude of the location")

class WeatherToolInputSchema(BaseModel):
    """Input for the get_current_weather tool"""
    latitude: float = Field(description="Latitude of the location")
    longitude: float = Field(description="Longitude of the location")

class CurrentWeather(BaseModel):
    """Represents the current weather conditions for a location"""
    latitude: float = Field(description="Latitude of the location")
    longitude: float = Field(description="Longitude of the location")
    temperature: float = Field(description="Current temperature in degrees Celsius")
    wind_speed: float = Field(description="Current wind speed in km/h")
    relative_humidity_2m: float = Field(description="Current relative humidity at 2 meters above ground level in percentage")
    is_day: int = Field(description="Indicates if it is currently day (1) or night (0) at the location")
    weather_code: int = Field(description="WMO Weather interpretation code")
    time: str = Field(description="Current time of the weather observation in ISO format")

@tool("get_coordinates_from_city", args_schema=CoordinatesToolInputSchema)
def get_coordinates_from_city(city_name: str) -> str:
    """
    Converts a city name into geographical latitude and longitude using the Open-Meteo Geocoding API.
    Returns a JSON string of a coordinate object for the most relevant result.
    """
    GEOCODING_API_URL = "https://geocoding-api.open-meteo.com/v1/search"
    params = {
        "name": city_name,
        "count": 1,
        "language": "en",
        "format": "json"
    }

    try:
        response = requests.get(GEOCODING_API_URL, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()

        results = data.get("results")
        if not results:
            return json.dumps({"error": f"Could not find coordinates for city: {city_name}"})

        top_result = results[0]
        results_data = {
            "name": top_result.get("name"),
            "latitude": top_result.get("latitude"),
            "longitude": top_result.get("longitude"),
            "country": top_result.get("country"),
        }
        validated_result = Coordinates(**results_data)
        return validated_result.model_dump_json()

    except requests.exceptions.Timeout:
        return json.dumps({"error": "Geocoding API request timed out."})
    except requests.exceptions.RequestException as e:
        return json.dumps({"error": f"Error connecting to Geocoding API: {e}"})
    except json.JSONDecodeError:
        return json.dumps({"error": "Failed to decode JSON response from Geocoding API."})
    except ValidationError as e:
        return json.dumps({"error": f"Failed to validate coordinates schema: {e.errors()}"})
    except Exception as e:
        return json.dumps({"error": f"An unexpected error occurred during coordinate retrieval: {e}"})

@tool("get_current_weather", args_schema=WeatherToolInputSchema)
def get_current_weather(latitude: float, longitude: float) -> str:
    """
    Retrieves the current weather conditions for a given latitude and longitude using Open-Meteo Weather API.
    Returns a JSON string of a weather object.
    """
    WEATHER_API_URL = "https://api.open-meteo.com/v1/forecast"
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "current": "temperature_2m,relative_humidity_2m,is_day,wind_speed_10m,weather_code",
        "timezone": "auto",
        "forecast_days": 1
    }

    try:
        response = requests.get(WEATHER_API_URL, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()

        if "current" not in data:
            return json.dumps({"error": "No current weather data available for this location."})

        current_data = data["current"]
        weather_data = {
            "latitude": latitude,
            "longitude": longitude,
            "temperature": current_data.get("temperature_2m"),
            "wind_speed": current_data.get("wind_speed_10m"),
            "relative_humidity_2m": current_data.get("relative_humidity_2m"),
            "is_day": current_data.get("is_day"),
            "weather_code": current_data.get("weather_code"),
            "time": current_data.get("time")
        }
        validated_weather = CurrentWeather(**weather_data) 
        return validated_weather.model_dump_json()

    except requests.exceptions.Timeout:
        return json.dumps({"error": "Weather API request timed out."})
    except requests.exceptions.RequestException as e:
        return json.dumps({"error": f"Error connecting to Weather API: {e}"})
    except json.JSONDecodeError:
        return json.dumps({"error": "Failed to decode JSON response from Weather API."})
    except ValidationError as e:
        return json.dumps({"error": f"Failed to validate weather data schema: {e.errors()}"})
    except Exception as e:
        return json.dumps({"error": f"An unexpected error occurred during weather retrieval: {e}"})

@tool("search_wikipedia", args_schema=WikipediaToolInputSchema)
def search_wikipedia(query: str) -> str:
    """
    Searches Wikipedia for a given query and returns a summary and URL.
    Returns a JSON string of a Wikipedia object.
    """

    API_URL = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "format": "json",
        "titles": query,
        "prop": "extracts|info",
        "exintro": True, # Get only introductory section
        "explaintext": True,
        "inprop": "url",
        "redirects": 1
    }

    try:
        response = requests.get(API_URL, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()
        pages = data.get("query", {}).get("pages", {})
        if not pages:
            return json.dumps({"error": "No wikipedia article found for the query"})

        page_id = next(iter(pages))
        page_data = pages[page_id]

        if "missing" in page_data:
            return json.dumps({"error": "No wikipedia article found the query"})

        results_data = {
            "title": page_data.get("title"),
            "summary": page_data.get("summary") if "summary" in page_data else page_data.get("extract"),
            "url": page_data.get("fullurl")
        }
        validated_result = WikipediaArticle(**results_data)
        return validated_result.model_dump_json()

    except requests.exceptions.Timeout:
        return json.dumps({"error": "Wikipedia API request timed out."})
    except requests.exceptions.RequestException as e:
        return json.dumps({"error": f"Error connecting to Wikipedia API: {e}"})
    except json.JSONDecodeError:
        return json.dumps({"error": "Failed to decode JSON response from Wikipedia API."})
    except ValidationError as e:
        return json.dumps({"error": f"Failed to validate Wikipedia article schema: {e.errors()}"})
    except Exception as e:
        return json.dumps({"error": f"An unexpected error occurred during Wikipedia search: {e}"})

@tool("calculate")
def calculate(expression: str) -> str:
    """
    Evaluates a mathematical expression (e.g '2 + 2 * 3').
    Supports basic arithmetic operations
    """
    try:
        result = eval(expression)
        return str(result)

    except SyntaxError:
        return "Error: Invalid mathematical expression."
    except NameError:
        return "Error: Invalid input in expression (e.g., non-numeric characters)."
    except Exception as e:
        return f"Error during calculation: {e}"

Each tool’s functionality is defined in this Python file, but defining the functions is only the beginning. The critical aspect for reliability, especially given the stochastic behavior of LLMs, is ensuring that the LLM interacts with these tools predictably. This is where Pydantic becomes an indispensable guardian of data integrity, both for inputs to your tools and outputs from them.

When your LLM determines it needs to use a tool, it generates the arguments for that tool based on the user’s query and its internal reasoning. Because of the LLM’s non-deterministic nature, it might occasionally generate parameters that do not perfectly match the tool’s expected input signature. It could be a wrong data type, a misspelled parameter name, or a missing required field.

Pydantic models, used as args_schema for your tools, act as a strict contract. They enforce that the LLM’s generated input conforms precisely to the tool’s requirements. If the LLM produces something that does not fit the schema, it fails early and predictably. This robust input parsing ensures your tool always receives valid, expected data, giving your agent a reliable foundation for execution.

For example, the CoordinatesToolInputSchema for the geocoding tool guarantees that the city_name passed to get_coordinates_from_city will always be a string. This prevents unexpected types from causing runtime errors or misinterpretation by the tool itself.
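You can see this contract in action without involving the LLM at all. Here is a minimal illustration, not part of the tutorial’s files, using the WeatherToolInputSchema defined above:

# Minimal illustration of the input contract (not part of the project files).
from pydantic import ValidationError
from agent.tools import WeatherToolInputSchema

try:
    # "longitude" is missing and "latitude" is not a number: both violate the schema.
    WeatherToolInputSchema(latitude="not a number")
except ValidationError as e:
    print(e.errors())  # Explicit, machine-readable errors instead of a silent failure downstream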

After a tool executes, especially if it interacts with an external API, it returns a result. These raw outputs can be unstructured, contain irrelevant data, or even be inconsistent. Your agent needs to reliably parse this output to continue its chain-of-thought or formulate a final response.

Pydantic models used to validate your tools’ outputs enforce that each tool’s return value is always transformed into a predictable, structured format before it is handed back to the agent. This acts as a powerful guardrail. If the tool’s raw output does not conform to this expected structure, it is flagged immediately.

For example, the Coordinates output model enforces an output schema, and guarantees that the agent always receives consistent latitude and longitude values for subsequent API calls.
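Because each tool returns its result as the JSON serialization of a Pydantic model, any consumer of that output can validate it by parsing it straight back into the model. A minimal sketch, with an illustrative payload:

# Minimal sketch: a tool's JSON output maps straight back onto its Pydantic model.
from agent.tools import Coordinates

raw_output = '{"latitude": 51.5074, "longitude": -0.1278}'  # shape returned by get_coordinates_from_city
coords = Coordinates.model_validate_json(raw_output)        # raises ValidationError if the structure drifts
print(coords.latitude, coords.longitude)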

Building your agent with LangGraph

With your structured tools ready, set up your agent using LangGraph. Create a new file called agent/agent.py and add this code:

# File Name: agent/agent.py 

import typing as T
from langchain_core.messages import BaseMessage, AIMessage, HumanMessage, ToolMessage
from langchain_openai import ChatOpenAI
from langgraph.graph.message import add_messages
from langgraph.graph import END, StateGraph
from langgraph.prebuilt import ToolNode

class AgentState(T.TypedDict):
    messages: T.Annotated[T.List[BaseMessage], add_messages]

class LLMAgent:
    def __init__(
        self,
        model_name: str = "gpt-4o-mini",
        tools: T.List[T.Callable[..., T.Any]] = None,
        max_tokens: int = 1000,
        temperature: float = 0.1,
    ) -> None:
        """
        Initializes the LLM agent with tools and an OpenAI LLM.
        Args:
            model_name (str): The name of the OpenAI model to use
            tools (list): A list of LangChain tools the agent can use.
            max_tokens (int): The maximum number of tokens to generate in the response.
            temperature (float): The sampling temperature for the LLM.
        """
        self.tools = tools or []
        self.llm = ChatOpenAI(model_name=model_name, temperature=temperature, max_tokens=max_tokens)
        self.llm_with_tools = self.llm.bind_tools(self.tools)
        self.app = self._build_workflow()

    def _call_model(self, state: AgentState):
        """
        Internal node method that calls the LLM with the current message history.
        The LLM decides whether to respond directly or call a tool.
        """
        messages = state["messages"]
        response = self.llm_with_tools.invoke(messages)
        return {"messages": [response]}

    def _should_continue(self, state: AgentState):
        """
        Internal method that determines whether the agent should continue by calling a tool
        or end the conversation. This logic inspects the last message from the LLM.
        """
        last_message = state["messages"][-1]
        if last_message.tool_calls:
            return "tools"
        return END

    def _build_workflow(self):
        """Builds and compiles the LangGraph workflow for the agent."""
        workflow = StateGraph(AgentState)
        workflow.add_node("agent", self._call_model)
        workflow.add_node("tools", ToolNode(self.tools))
        workflow.set_entry_point("agent")

        workflow.add_conditional_edges(
            "agent", 
            self._should_continue,
            {"tools": "tools", END: END}
        )
        workflow.add_edge("tools", "agent")
        return workflow.compile()

    async def run_query(self, query: str) -> str:
        """
        Runs a query through the agent's workflow.
        Args:
            query (str): The input query for the agent.
        Returns:
            str: The final response from the agent.
        """
        inputs = {"messages": [HumanMessage(content=query)]}
        full_response = []
        async for s in self.app.astream(inputs):
            if "__end__" not in s:
                full_response.append(s)

        final_state = full_response[-1]
        final_message = final_state["agent"]["messages"][-1]
        return final_message.content

    async def stream_query(self, query: str):
        """
        Streams the intermediate steps and final response of a query.
        Useful for debugging and observing agent behavior.

        Args:
            query (str): The user's input query.

        Yields:
            Dict: The intermediate state or final response.
        """
        inputs = {"messages": [HumanMessage(content=query)]}
        async for s in self.app.astream(inputs):
            yield s

Here is a walk-through of the integral parts of the LLMAgent class. At the foundation of your agent’s operation is the AgentState class, a TypedDict that defines the shared memory flowing through your LangGraph workflow. It is designed to consistently manage messages, automatically appending new interactions from the user, the agent itself, or tool outputs to maintain conversation context for the LLM.

When you initialize your LLMAgent class, you are setting up the core intelligence. The __init__ method initializes your chosen OpenAI LLM using LangChain’s ChatOpenAI class. The truly pivotal step here is self.llm.bind_tools(self.tools). This is what informs your LLM about the specific tools it has access to and their schemas, empowering it with the essential function-calling capabilities needed to interact with the world beyond its training data. This binding is how the LLM “sees” and understands the tools you have provided.

Central to the agent’s decision-making and workflow construction are three internal methods:

  • The _call_model method acts as the agent’s “brain.” When this node is active, it takes the current message history and invokes the LLM. At this point the LLM analyzes the conversation and decides its next action: either generate a direct textual response, indicating the end of a thought cycle, or specify a tool call, signaling its intent to interact with an external system.
  • The _should_continue method acts as a conditional router. It inspects the last message generated by the LLM. If that message contains tool_calls, meaning the LLM decided to use a tool, the workflow is directed to execute those tools. Conversely, if the LLM has produced a final, direct answer, this function signals the graph to END the current interaction. This dynamic logic is fundamental to how your agent intelligently orchestrates tool usage.
  • The _build_workflow method is where the entire LangGraph StateGraph is constructed and compiled. This method serves as the definitive blueprint for your agent’s behavior. Within it, you define the "agent" node for the LLM’s reasoning and the "tools" node for executing external tools. You set an entry_point("agent") to ensure every interaction starts with the LLM’s initial thought process. add_conditional_edges implements the decision logic from _should_continue, allowing your agent to dynamically switch between thinking and acting. A crucial element for creating conversational loops is add_edge("tools", "agent"); after a tool successfully executes, the workflow automatically returns to the "agent" node. This enables the LLM to process the tool’s output and decide on the next step, forming a continuous cycle of reasoning and action. Finally, compile() optimizes this graph for efficient execution.

To interact with your fully assembled agent, you have two asynchronous methods:

  • run_query is designed to provide you with the final, complete response from the agent.
  • stream_query is particularly invaluable during development and debugging. It allows you to observe every intermediate step of your agent’s chain-of-thought and tool interactions as they happen, giving you deep insights into its internal workings.
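Here is a minimal sketch of how you might use stream_query during development to watch each node fire. It is not part of the tutorial’s files, and the query is just an example:

# Minimal sketch: observing the agent's intermediate steps (example query).
import asyncio
from agent.agent import LLMAgent
from agent.tools import get_coordinates_from_city, get_current_weather, search_wikipedia, calculate

async def inspect():
    agent = LLMAgent(tools=[get_coordinates_from_city, get_current_weather, search_wikipedia, calculate])
    async for step in agent.stream_query("What's the weather like in Tokyo?"):
        node_name = next(iter(step))  # each step is keyed by the node that just ran: "agent" or "tools"
        print(f"[{node_name}] {step[node_name]['messages'][-1]}")

asyncio.run(inspect())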

Running your agent locally

With your LLM agent structured and its tools defined, you can now run a quick test to initialize your agent with the tools you have built and execute a simple query.

Create the main.py file in your project root directory, and add this code:

# File Name: main.py

import asyncio
from agent.tools import get_coordinates_from_city, get_current_weather, search_wikipedia, calculate
from agent.agent import LLMAgent

async def main():
    tools = [get_coordinates_from_city, get_current_weather, search_wikipedia, calculate]
    agent = LLMAgent(tools=tools, model_name="gpt-3.5-turbo")
    print("--- Testing LLM Agent ---")
    response = await agent.run_query("What's the current weather like in Karachi?")
    print(f"--- Agent Final Response ---\n{response}")

if __name__ == "__main__":
    asyncio.run(main())    

To execute the code locally, ensure your environment variables, including OPENAI_API_KEY, are loaded. On a Unix-like system, open your terminal in the project’s root directory and run these commands to export the variables defined in .env:

set -a
source .env

Now, execute the main.py script:

python main.py

The sample response should be similar to this:

--- Testing LLM Agent --- 
--- Agent Final Response --- 
The current weather in Karachi is clear sky with a temperature of 32.2°C, and a humidity of 46%. 

The imperative for rigorous testing in LLM agents

So far, you have built a simple LLM agent that can reason and use tools. But how do you confirm it consistently behaves as expected, especially given that LLMs are stochastic? Unlike traditional code, where input X always yields output Y, an LLM’s response can vary. The LLM’s decision to call a tool, or the parameters it generates for it, might not always be perfectly consistent.

Traditional unit tests often fall short because they check isolated functions. An LLM agent involves a dynamic chain-of-thought and interaction with external systems.

You need to validate the entire journey: from the LLM’s initial understanding, its decision to call a tool, the correctness of the generated parameters, the successful execution of the tool, the proper parsing of its output, and the LLM’s final response.

To achieve this rigorous validation, you will employ a powerful combination of tools:

  • Pytest: This is the test framework, known for its flexibility and ease of use in writing clear, concise tests.
  • Mocking: Since the agent interacts with external APIs (like weather services or Wikipedia), making live calls during tests would be slow, costly, and unreliable. You will use mocking to simulate these external responses, so your tests run quickly and deterministically, regardless of network conditions, response times, or API rate limits (see the short sketch after this list).
  • Pydantic: Pydantic’s role in defining structured inputs and outputs for your tools is vital. In testing, Pydantic acts as a strong assertion mechanism, allowing you to programmatically verify that tool calls and their results adhere to expected data schemas, catching subtle data integrity issues.
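The core mocking idea is worth seeing in isolation before building the full test suite. This minimal sketch (not part of the project files, with illustrative coordinate values) replaces the real HTTP call inside agent/tools.py with a canned response:

# Minimal sketch of the mocking idea (not part of the project files).
from unittest.mock import MagicMock, patch
from agent.tools import get_coordinates_from_city

fake_response = MagicMock(status_code=200)
fake_response.json.return_value = {
    "results": [{"name": "London", "latitude": 51.5074, "longitude": -0.1278, "country": "United Kingdom"}]
}
fake_response.raise_for_status.return_value = None

# Patch requests.get where the tool looks it up, so no real network call is made.
with patch("agent.tools.requests.get", return_value=fake_response):
    print(get_coordinates_from_city.invoke({"city_name": "London"}))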

Centralizing your test helpers

To keep your tests clean and avoid repetition, you will centralize common fixtures and helper functions in tests/conftest.py. Pytest discovers this file automatically, and the fixtures and helpers defined in it become available to all your test files.

Create a new file called tests/conftest.py and add this code:

# File Name: tests/conftest.py

from typing import Dict
import pytest
from unittest.mock import MagicMock
from langchain_core.messages import AIMessage, ToolMessage

from agent.agent import LLMAgent
from agent.tools import calculate, get_current_weather, get_coordinates_from_city, search_wikipedia

@pytest.fixture(scope="module")
def llm_agent():
    """
    Provides an instance of LLMAgent for testing
    We use a module scope to initialize it once for all tests in this file
    """
    tools = [calculate, get_coordinates_from_city, get_current_weather, search_wikipedia]
    return LLMAgent(tools=tools, model_name="gpt-3.5-turbo", temperature=0.1)

async def get_agent_trajectory(agent: LLMAgent, query: str):
    """
    Runs the agent in streaming mode and extracts relevant steps for validation.
    Returns a list of dictionaries, each representing a significant event.
    """
    trajectory = []
    async for step in agent.stream_query(query):
        # LangGraph stream yields dictionaries representing state changes or events
        print(step)
        if "agent" not in step and "tools" not in step: continue
        last_message = step["agent"]["messages"][-1] if "agent" in step else step["tools"]["messages"][-1]
        if isinstance(last_message, AIMessage) and last_message.tool_calls:
            # LLM decided to call a tool
            for tool_call in last_message.tool_calls:
                trajectory.append({
                    "type": "tool_call",
                    "tool_name": tool_call['name'],
                    "tool_args": tool_call['args']
                })
        elif isinstance(last_message, ToolMessage):
            # Tool execution result
            trajectory.append({
                "type": "tool_output",
                "tool_name": last_message.name,
                "tool_output": last_message.content # This is the JSON string
            })
        elif isinstance(last_message, AIMessage) and not last_message.tool_calls:
            # Final LLM response
            trajectory.append({
                "type": "final_response",
                "content": last_message.content
            })
    return trajectory

def create_mock_response(status_code: int, json_data: Dict = None, text_data: str = None):
    """
    Helper to create a MagicMock object that simulates a requests.Response.
    """
    mock_response = MagicMock(status_code=status_code)
    if json_data is not None:
        mock_response.json.return_value = json_data
    elif text_data is not None:
        mock_response.text = text_data # For APIs that might return plain text
    mock_response.raise_for_status.return_value = None # Assume success unless status_code indicates error
    return mock_response

Key helpers you will find in conftest.py:

  • The llm_agent fixture provides a ready-to-use instance of your LLMAgent for every test, ensuring a consistent starting point.
  • create_mock_response simulates external API calls. Instead of making actual network requests, you use this to craft a fake HTTP response, allowing you to dictate precisely what data your tool receives. This is fundamental for deterministic testing.
  • get_agent_trajectory is crucial for inspecting the agent’s internal thought process. By streaming the agent’s execution, this helper captures every step—every LLM decision, every tool call, every tool output—allowing you to make assertions not just on the final answer, but on the entire chain-of-thought.
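For a weather query, the trajectory that get_agent_trajectory returns looks roughly like this (values illustrative, outputs abbreviated):

# Roughly what get_agent_trajectory returns for a weather query (illustrative values, abbreviated).
[
    {"type": "tool_call", "tool_name": "get_coordinates_from_city", "tool_args": {"city_name": "London"}},
    {"type": "tool_output", "tool_name": "get_coordinates_from_city",
     "tool_output": '{"latitude": 51.5074, "longitude": -0.1278}'},
    {"type": "tool_call", "tool_name": "get_current_weather", "tool_args": {"latitude": 51.5074, "longitude": -0.1278}},
    {"type": "tool_output", "tool_name": "get_current_weather", "tool_output": '{"temperature": 15.5, ...}'},
    {"type": "final_response", "content": "The current weather in London is ..."},
]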

End-to-end workflow testing with the weather tool

To keep your tests organized and focused, you will create separate test files for each major agent workflow:

  • tests/test_weather_workflow.py
  • tests/test_wikipedia_workflow.py
  • tests/test_calculator_workflow.py

This modularity makes tests easier to manage and debug.

Here is a description of the testing process using the weather example. This is a two-step process for the agent: first it gets the city’s coordinates, then it fetches the weather. Your test needs to validate both steps.

Create a new file called tests/test_weather_workflow.py and add these tests:

# File Name: tests/test_weather_workflow.py

import pytest
import json
from unittest.mock import patch

from agent.tools import Coordinates, CurrentWeather
from .conftest import get_agent_trajectory, create_mock_response

@pytest.mark.asyncio
async def test_weather_query_success_workflow(llm_agent):
    """
    Tests the complete workflow for a weather query:
    1. Agent calls get_coordinates_from_city.
    2. Agent calls get_current_weather with coordinates from step 1.
    3. Agent provides a final response using weather data.
    4. Validates structured JSON outputs from tools.
    """
    # Mock responses for the external APIs
    # Mock 1: Geocoding API response for "London"
    mock_response_geo = create_mock_response(
        status_code=200,
        json_data={
            "results": [{
                "name": "London",
                "latitude": 51.5074,
                "longitude": -0.1278,
                "country": "United Kingdom",
                "admin1": "England"
            }]
        }
    )

    # Mock 2: Open-Meteo Weather API response
    mock_response_weather = create_mock_response(
        status_code=200,
        json_data={
            "latitude": 51.5074,
            "longitude": -0.1278,
            "current": {
                "temperature_2m": 15.5,
                "wind_speed_10m": 12.3,
                "relative_humidity_2m": 75,
                "is_day": 1,
                "weather_code": 3, # Example WMO code for 'cloudy'
                "time": "2025-06-01T10:00Z"
            }
        }
    )

    with patch('agent.tools.requests.get', side_effect=[mock_response_geo, mock_response_weather]):
        query = "What's the weather like in London?"
        trajectory = await get_agent_trajectory(llm_agent, query)

        assert len(trajectory) >= 3, "Expected at least 3 steps: geocoding call, weather call, final response"

        # Step 1: Validate get_coordinates_from_city call
        coord_call = next((s for s in trajectory if s["type"] == "tool_call" and s["tool_name"] == "get_coordinates_from_city"), None)
        assert coord_call is not None, "Agent did not call get_coordinates_from_city"
        assert coord_call["tool_args"]["city_name"] == "London", "get_coordinates_from_city called with incorrect city name"

        # Step 2: Validate Coordinates tool output and its structure
        coord_output = next((s for s in trajectory if s["type"] == "tool_output" and s["tool_name"] == "get_coordinates_from_city"), None)
        assert coord_output is not None, "Missing output from get_coordinates_from_city"
        parsed_coords = json.loads(coord_output["tool_output"])
        validated_coords = Coordinates(**parsed_coords) # Validate against Pydantic model
        assert validated_coords.latitude == pytest.approx(51.5074)
        assert validated_coords.longitude == pytest.approx(-0.1278)

        # Step 3: Validate get_current_weather call (using output from geocoding)
        weather_call = next((s for s in trajectory if s["type"] == "tool_call" and s["tool_name"] == "get_current_weather"), None)
        assert weather_call is not None, "Agent did not call get_current_weather"
        assert weather_call["tool_args"]["latitude"] == pytest.approx(51.5074) # Use pytest.approx for floats
        assert weather_call["tool_args"]["longitude"] == pytest.approx(-0.1278)

        # Step 4: Validate CurrentWeather tool output and its structure
        weather_output = next((s for s in trajectory if s["type"] == "tool_output" and s["tool_name"] == "get_current_weather"), None)
        assert weather_output is not None, "Missing output from get_current_weather"
        parsed_weather = json.loads(weather_output["tool_output"])
        validated_weather = CurrentWeather(**parsed_weather) # Validate against Pydantic model
        assert validated_weather.temperature == 15.5
        assert validated_weather.wind_speed == 12.3
        assert validated_weather.relative_humidity_2m == 75

        # Step 5: Validate final agent response
        final_response = next((s for s in trajectory if s["type"] == "final_response"), None)
        assert final_response is not None, "Missing final response from agent"
        assert "15.5" in final_response["content"] # Check for temperature in response
        assert "London" in final_response["content"] # Check for city name in response
        # The LLM's interpretation of weather code 3 (cloudy) might vary, so a flexible check
        assert any(keyword in final_response["content"].lower() for keyword in ["cloudy", "partly cloudy", "overcast"])

@pytest.mark.asyncio
async def test_weather_city_not_found_error_handling(llm_agent):
    """
    Tests agent's error handling when geocoding API cannot find a city.
    """
    # Mock 1: Geocoding API response for "imaginary_city_123" (no results)
    mock_response_no_results = create_mock_response(
        status_code=200,
        json_data={"results": []} # No results found
    )
    with patch('agent.tools.requests.get', return_value=mock_response_no_results):
        query = "What's the weather in imaginary_city_123?"
        trajectory = await get_agent_trajectory(llm_agent, query)

        # Validate get_coordinates_from_city call
        coord_call = next((s for s in trajectory if s["type"] == "tool_call" and s["tool_name"] == "get_coordinates_from_city"), None)
        assert coord_call is not None, "Agent did not call get_coordinates_from_city"
        assert coord_call["tool_args"]["city_name"] == "imaginary_city_123"

        # Validate tool output indicates error
        coord_output = next((s for s in trajectory if s["type"] == "tool_output" and s["tool_name"] == "get_coordinates_from_city"), None)
        assert coord_output is not None, "Missing output from get_coordinates_from_city"
        parsed_output = json.loads(coord_output["tool_output"])
        assert "error" in parsed_output
        assert "Could not find coordinates" in parsed_output["error"]

        # Validate final agent response reflects the error
        final_response = next((s for s in trajectory if s["type"] == "final_response"), None)
        assert final_response is not None
        assert "could not find" in final_response["content"].lower() or "unable to determine" in final_response["content"].lower() or "couldn't find" in final_response["content"].lower() 
        assert "imaginary_city_123" in final_response["content"]

Tests for the Wikipedia and calculator tools

Similar to the weather tool tests, you will create test workflows for the other tools.

Create a new file called tests/test_calculator_workflow.py and add these tests:

# File Name: tests/test_calculator_workflow.py

import pytest
from .conftest import get_agent_trajectory

@pytest.mark.asyncio
async def test_calculator_success_workflow(llm_agent):
    """
    Tests the workflow for a successful calculation query.
    1. Agent calls calculate.
    2. Agent provides a final response with the calculation result.
    """
    query = "What is 10 + 5 * 2?"
    trajectory = await get_agent_trajectory(llm_agent, query)

    # Validate tool call
    calc_call = next((s for s in trajectory if s["type"] == "tool_call" and s["tool_name"] == "calculate"), None)
    assert calc_call is not None, "Agent did not call calculate"
    assert calc_call["tool_args"]["expression"] == "10 + 5 * 2", "calculate called with incorrect expression"

    # Validate Calculator tool output
    calc_output = next((s for s in trajectory if s["type"] == "tool_output" and s["tool_name"] == "calculate"), None)
    assert calc_output is not None, "Missing output from calculate"
    assert calc_output["tool_output"] == "20", "Calculator returned incorrect result" # 10 + (5*2) = 20

    # Validate final agent response
    final_response = next((s for s in trajectory if s["type"] == "final_response"), None)
    assert final_response is not None, "Missing final response from agent"
    assert "20" in final_response["content"], "Final response does not contain calculation result"

For the Wikipedia workflow, create a new file called tests/test_wikipedia_workflow.py. Add this code:

# File Name: tests/test_wikipedia_workflow.py

import pytest
import json
from unittest.mock import patch

from agent.tools import WikipediaArticle
from .conftest import get_agent_trajectory, create_mock_response

@pytest.mark.asyncio
async def test_wikipedia_search_success_workflow(llm_agent):
    """
    Tests the workflow for a successful Wikipedia search query.
    1. Agent calls search_wikipedia.
    2. Agent provides a final response using Wikipedia summary.
    3. Validates structured JSON output from the tool.
    """

    mock_response_wiki = create_mock_response(
        status_code=200,
        json_data={
            "query": {
                "pages": {
                    "12345": {
                        "pageid": 12345,
                        "ns": 0,
                        "title": "Artificial intelligence",
                        "extract": "Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information—demonstrated by machines...",
                        "fullurl": "https://en.wikipedia.org/wiki/Artificial_intelligence"
                    }
                }
            }
        }
    )

    with patch('agent.tools.requests.get', return_value=mock_response_wiki):
        query = "Tell me about artificial intelligence."
        trajectory = await get_agent_trajectory(llm_agent, query)

        # Validate tool call
        wiki_call = next((s for s in trajectory if s["type"] == "tool_call" and s["tool_name"] == "search_wikipedia"), None)
        assert wiki_call is not None, "Agent did not call search_wikipedia"
        assert wiki_call["tool_args"]["query"].lower() == "artificial intelligence", "search_wikipedia called with incorrect query"

        # Validate Wikipedia tool output and its structure
        wiki_output = next((s for s in trajectory if s["type"] == "tool_output" and s["tool_name"] == "search_wikipedia"), None)
        assert wiki_output is not None, "Missing output from search_wikipedia"
        parsed_wiki = json.loads(wiki_output["tool_output"])
        validated_wiki = WikipediaArticle(**parsed_wiki) # Validate against Pydantic model
        assert validated_wiki.title.lower() == "artificial intelligence".lower()
        assert "intelligence—perceiving" in validated_wiki.summary
        assert "https://en.wikipedia.org/wiki/Artificial_intelligence" == validated_wiki.url

        # Validate final agent response
        final_response = next((s for s in trajectory if s["type"] == "final_response"), None)
        assert final_response is not None, "Missing final response from agent"
        assert "Artificial intelligence" in final_response["content"]
        assert "machines" in final_response["content"]

@pytest.mark.asyncio
async def test_wikipedia_search_not_found_error_handling(llm_agent):
    """
    Tests agent's error handling when Wikipedia API finds no article.
    """

    mock_response_no_results = create_mock_response(
        status_code=200,
        json_data={"query": {"pages": {"-1": {"missing": ""}}}} # Wikipedia API response for no article found
    )

    with patch('agent.tools.requests.get', return_value=mock_response_no_results):
        query = "Summarize the life of Zorp the Destroyer."
        trajectory = await get_agent_trajectory(llm_agent, query)

        # Validate search_wikipedia call
        wiki_call = next((s for s in trajectory if s["type"] == "tool_call" and s["tool_name"] == "search_wikipedia"), None)
        assert wiki_call is not None, "Agent did not call search_wikipedia"
        assert wiki_call["tool_args"]["query"] == "Zorp the Destroyer"

        # Validate tool output indicates error
        wiki_output = next((s for s in trajectory if s["type"] == "tool_output" and s["tool_name"] == "search_wikipedia"), None)
        assert wiki_output is not None, "Missing output from search_wikipedia"
        parsed_output = json.loads(wiki_output["tool_output"])
        assert "error" in parsed_output
        assert "No Wikipedia article found".lower() in parsed_output["error"].lower()

        # Validate final agent response reflects the error
        final_response = next((s for s in trajectory if s["type"] == "final_response"), None)
        assert final_response is not None
        assert "could not find" in final_response["content"].lower() or "no information" in final_response["content"].lower() or "couldn't find" in final_response["content"].lower()
        assert "zorp the destroyer" in final_response["content"].lower()

Running your tests

With your comprehensive tests in place, executing them is straightforward. From the root of your project directory, simply use the pytest command, pointing it to your tests/ folder:

pytest tests/

Pytest will discover and run all your defined test functions across the test_weather_workflow.py, test_wikipedia_workflow.py, and test_calculator_workflow.py files.

When execution is successful, your output will confirm that all your agent’s functionalities, including complex multi-step tool calls and error handling, are performing precisely as expected:

========================================================= test session starts ==========================================================
platform darwin -- Python 3.10.15, pytest-8.3.5, pluggy-1.6.0
rootdir: /Users/muhammadarham/Drive/CircleCIBlogs/AgentToolValidation
plugins: anyio-4.9.0, langsmith-0.3.43, asyncio-1.0.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 5 items

tests/test_calculator_workflow.py .                                                                                              [ 20%]
tests/test_weather_workflow.py ..                                                                                                [ 60%]
tests/test_wikipedia_workflow.py ..                                                                                              [100%]

Automating regression tests with CircleCI

You have built a robust LLM agent and a comprehensive test suite. The next step is automating this validation with a Continuous Integration (CI) pipeline. This ensures that every code change is instantly checked for regressions.

CI is indispensable because of the inherently stochastic nature of LLMs. A CI pipeline continuously runs your tests, immediately flagging if a prompt tweak or LLM update causes unexpected behavior or tool misuse. It is your safety net against introducing regressions in complex, multi-step agentic workflows and ensures reliability despite external API dependencies or potential vulnerabilities. Integrated with your version control system (like GitHub), CircleCI triggers automatically on every code push, providing instant quality feedback.

This simple .circleci/config.yml file guides CircleCI on when and how to execute your CI pipeline:

# File Name: .circleci/config.yml

version: 2.1

jobs:
  test_llm_agent:
    docker:
      - image: cimg/python:3.10 # Use a stable Python image. Adjust version if needed.
    steps:
      - checkout

      - run:
          name: Install Python dependencies
          command: |
            pip install --upgrade pip
            pip install -r requirements.txt

      - run:
          name: Run LLM Agent Tests
          command: |
            pytest tests/

workflows:
  version: 2
  build_and_test_llm_agent:
    jobs:
      - test_llm_agent

The test_llm_agent job runs in a Python Docker container. It first checks out your code, then installs dependencies, and finally executes pytest tests/. The workflows section defines that this job runs automatically on every code push.

By implementing this CircleCI pipeline, you are establishing a continuous quality gate for your LLM agent. This quality gate ensures agent reliability and production readiness with every code change.

Setting up your project on CircleCI

Before you can set up a project on CircleCI, you first need to upload your code to GitHub. Create a new file named .gitignore in the project root directory. The .gitignore file defines files and folders that should not be pushed to GitHub. Open .gitignore and add this content:

__pycache__
.pytest_cache
*.py[cod]
.DS_Store
venv
.env

You can now push your code to a GitHub repository.

Before you can trigger the pipeline, you need to configure environment variables. Start by logging into your CircleCI account and creating a new project. From the CircleCI sidebar, select Projects. Click the ellipsis in your project’s row and select Project Settings.

Opening project settings

On the project settings page, select Environment Variables on the sidebar. Add an environment variable with the key OPENAI_API_KEY and assign your OpenAI API key as the value.

Adding an environment variable

You can now trigger the pipeline manually. It should execute successfully.

Successful execution

You can access the full code for this project on GitHub.

Conclusion

You have completed the tutorial! You now know how to move beyond basic LLM agent prototypes to build truly robust, production-grade systems. The journey involves a deliberate architecture using LangGraph for precise control over agentic workflows, Pydantic for ensuring structured and reliable data exchange with tools, and a comprehensive CI/CD pipeline orchestrated by CircleCI.


Muhammad Arham is a Deep Learning Engineer specializing in Computer Vision and Natural Language Processing, with experience developing globally top-ranking generative AI applications for the Google Play Store. He is passionate about constructing and optimizing machine learning models for intelligent systems, and firmly believes in continuous improvement within this rapidly evolving field.