---
title: "Integration Testing for AI Agent Connections: Mocking External Services and Verifying Flows"
description: "Learn how to write robust integration tests for AI agent integrations using mock servers, VCR-style recording, fixture-based testing patterns, and CI pipeline configuration to verify external service connections without hitting live APIs."
canonical: https://callsphere.ai/blog/integration-testing-ai-agent-connections-mocking-external-services-verifying-flows
category: "Learn Agentic AI"
tags: ["Integration Testing", "Mocking", "CI/CD", "AI Agents", "Test Automation"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-07T03:45:55.303Z
---

# Integration Testing for AI Agent Connections: Mocking External Services and Verifying Flows

> Learn how to write robust integration tests for AI agent integrations using mock servers, VCR-style recording, fixture-based testing patterns, and CI pipeline configuration to verify external service connections without hitting live APIs.

## Why Integration Testing Matters for AI Agents

AI agents that connect to external services — Slack, GitHub, Stripe, Notion — have integration surfaces that unit tests cannot cover. A unit test might verify that your agent formats a Jira ticket correctly, but it cannot verify that the Jira API accepts that format, that your authentication works, or that webhook signatures validate properly. Integration tests close this gap by testing the full request-response cycle against realistic service behavior.

The challenge is testing against external APIs without making real API calls in CI, which would be slow, flaky, and expensive. The solution: mock servers and recorded interactions.

## Setting Up Mock Servers with Respx

Respx is a library that intercepts httpx requests and returns predefined responses. It is ideal for testing agents that use httpx-based API clients.

```mermaid
flowchart LR
    TEST(["pytest test"])
    CLIENT["Agent API client
httpx"]
    RESPX["respx router
intercepts request"]
    RESP["Predefined response
status, headers, JSON"]
    ASSERT(["Assert route.called
and request body"])
    TEST --> CLIENT --> RESPX --> RESP --> ASSERT
    style RESPX fill:#4f46e5,stroke:#4338ca,color:#fff
    style ASSERT fill:#059669,stroke:#047857,color:#fff
```

```python
import pytest
import respx
import httpx
from your_agent.github_client import GitHubClient

@pytest.fixture
def github_client():
    return GitHubClient(token="test-token-fake")

@respx.mock
@pytest.mark.asyncio
async def test_create_issue_comment(github_client):
    # Mock the GitHub API endpoint
    route = respx.post(
        "https://api.github.com/repos/owner/repo/issues/42/comments"
    ).mock(return_value=httpx.Response(
        201,
        json={
            "id": 123456,
            "body": "AI Triage: This is a bug",
            "created_at": "2026-03-17T10:00:00Z",
        },
    ))

    result = await github_client.create_issue_comment(
        owner="owner",
        repo="repo",
        issue_number=42,
        body="AI Triage: This is a bug",
    )

    assert result["id"] == 123456
    assert route.called
    # Verify the request body
    sent_body = route.calls[0].request.content
    assert b"AI Triage" in sent_body

@respx.mock
@pytest.mark.asyncio
async def test_handles_github_rate_limit(github_client):
    respx.post(
        "https://api.github.com/repos/owner/repo/issues/1/comments"
    ).mock(return_value=httpx.Response(
        429,
        headers={"Retry-After": "60"},
        json={"message": "API rate limit exceeded"},
    ))

    with pytest.raises(httpx.HTTPStatusError) as exc_info:
        await github_client.create_issue_comment(
            "owner", "repo", 1, "test"
        )
    assert exc_info.value.response.status_code == 429
```

## VCR-Style Recording with pytest-recording

pytest-recording, built on vcrpy, records real API responses the first time a test runs and replays the recording on every subsequent run. This gives you realistic test data without hand-writing fixtures.

```python
# Install: pip install pytest-recording vcrpy
import pytest

@pytest.mark.vcr()
@pytest.mark.asyncio
async def test_fetch_pull_request_diff(github_client):
    """First run makes a real API call and records the response.
    Subsequent runs replay the recorded response."""
    diff = await github_client.get_pull_request_diff(
        owner="your-org",
        repo="your-repo",
        pr_number=100,
    )

    assert "diff --git" in diff
    assert len(diff) > 0

# Configure VCR in conftest.py
@pytest.fixture(scope="module")
def vcr_config():
    return {
        "filter_headers": [
            "authorization",  # Strip auth tokens from recordings
            "x-api-key",
        ],
        "filter_query_parameters": ["api_key"],
        "record_mode": "once",  # Record once, replay forever
        "cassette_library_dir": "tests/cassettes",
        "decode_compressed_response": True,
    }
```

Cassette files (YAML recordings) are committed to your repository so CI can replay them without API access.
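
Because cassettes live in the repository, it is worth guarding against secrets slipping into them before they are committed. A hypothetical pre-commit check (the `find_leaky_cassettes` helper is an illustration, not part of vcrpy) could scan for headers that `filter_headers` should have stripped:

```python
from pathlib import Path

def find_leaky_cassettes(cassette_dir: str) -> list[str]:
    """Return cassette files that still contain an authorization header.

    filter_headers removes the header entirely, so any occurrence of the
    string "authorization" in a committed cassette indicates a leak.
    """
    return [
        str(path)
        for path in Path(cassette_dir).glob("**/*.yaml")
        if "authorization" in path.read_text().lower()
    ]
```

Run this in a pre-commit hook or as a standalone unit test over `tests/cassettes` so a leaked token fails the build rather than landing in git history.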

## Testing Webhook Signature Verification

Webhook handlers must verify signatures. Test both valid and invalid signatures to ensure security.

```python
import hmac
import hashlib
import json
from fastapi.testclient import TestClient
from your_agent.webhook_hub import app

client = TestClient(app)

def generate_github_signature(payload: bytes, secret: str) -> str:
    return "sha256=" + hmac.new(
        secret.encode(), payload, hashlib.sha256
    ).hexdigest()

def test_valid_github_webhook():
    payload = json.dumps({
        "action": "opened",
        "issue": {"number": 1, "title": "Test", "body": "Bug report"},
        "sender": {"login": "testuser"},
        "repository": {"name": "repo", "owner": {"login": "owner"}},
    }).encode()

    signature = generate_github_signature(payload, "gh-secret")

    response = client.post(
        "/webhooks/github",
        content=payload,
        headers={
            "Content-Type": "application/json",
            "X-Hub-Signature-256": signature,
            "X-GitHub-Event": "issues",
        },
    )
    assert response.status_code == 200
    assert response.json()["status"] == "accepted"

def test_invalid_signature_rejected():
    payload = b'{"test": true}'
    response = client.post(
        "/webhooks/github",
        content=payload,
        headers={
            "Content-Type": "application/json",
            "X-Hub-Signature-256": "sha256=invalid",
            "X-GitHub-Event": "ping",
        },
    )
    assert response.status_code == 401
```
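
For reference, the verification these tests exercise can be this small (a sketch; the actual route wiring inside `webhook_hub` will differ):

```python
import hashlib
import hmac

def verify_github_signature(payload: bytes, secret: str, signature_header: str) -> bool:
    """Check an X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(
        secret.encode(), payload, hashlib.sha256
    ).hexdigest()
    # compare_digest runs in constant time, so an attacker cannot
    # recover the signature byte by byte from timing differences
    return hmac.compare_digest(expected, signature_header)
```

Note the comparison uses `hmac.compare_digest` rather than `==`; a plain string comparison short-circuits on the first mismatched byte and leaks timing information.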

## Testing the Full Agent Flow

End-to-end tests verify the complete chain: webhook received, event normalized, agent processes, action taken.

```python
import json

import httpx
import pytest
import respx

# handle_issue_event is the webhook entry point in your agent package;
# the import path below is illustrative
from your_agent.webhook_hub import handle_issue_event

@respx.mock
@pytest.mark.asyncio
async def test_issue_triage_full_flow():
    # Mock the AI agent's LLM call
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        return_value=httpx.Response(200, json={
            "choices": [{
                "message": {
                    "content": json.dumps({
                        "labels": ["bug", "high-priority"],
                        "priority": "P1",
                        "comment": "This appears to be a critical bug.",
                    })
                }
            }]
        })
    )

    # Mock the GitHub label and comment APIs
    label_route = respx.post(
        "https://api.github.com/repos/owner/repo/issues/5/labels"
    ).mock(return_value=httpx.Response(200, json=[]))

    comment_route = respx.post(
        "https://api.github.com/repos/owner/repo/issues/5/comments"
    ).mock(return_value=httpx.Response(201, json={"id": 999}))

    # Simulate the webhook
    payload = {
        "action": "opened",
        "issue": {
            "number": 5,
            "title": "App crashes on login",
            "body": "After the latest update the app crashes.",
        },
        "sender": {"login": "reporter"},
        "repository": {
            "name": "repo",
            "owner": {"login": "owner"},
        },
    }

    await handle_issue_event(payload)

    assert label_route.called
    assert comment_route.called
    comment_body = json.loads(comment_route.calls[0].request.content)
    assert "P1" in comment_body["body"]
```

## CI Pipeline Configuration

Configure your CI to run integration tests with proper environment setup.

```yaml
# .github/workflows/integration-tests.yml
name: Integration Tests

on:
  push:
    branches: [main]
  pull_request:

jobs:
  integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e '.[test]'
      - run: pytest tests/integration/ -v --tb=short
        env:
          TESTING: "true"
          WEBHOOK_SECRET: test-secret
```

The key principles: never use real API keys in CI, commit VCR cassettes alongside tests, and separate integration tests from unit tests so they can run on different schedules.
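
One way to keep the suites separate (an assumed convention, not the only option) is a registered pytest marker, so each schedule selects its own slice with `-m`:

```python
# conftest.py: register an "integration" marker so the suites can run on
# different schedules: `pytest -m "not integration"` on every push,
# `pytest -m integration` nightly or on merge to main.
import pytest

def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "integration: tests that exercise mocked external-service flows",
    )

@pytest.mark.integration
def test_marked_as_integration():
    assert True
```

Registering the marker also silences pytest's unknown-marker warnings and lets `--strict-markers` catch typos.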

## FAQ

### When should I use mock servers versus VCR recordings?

Use mock servers (respx, responses) when you need precise control over edge cases — rate limits, timeouts, malformed responses, and error codes. Use VCR recordings when you want to capture realistic API behavior including complex response structures and headers. Many teams use both: VCR for happy-path tests and mocks for error-case tests.

### How do I keep VCR cassettes from becoming stale?

Set up a scheduled CI job (weekly or monthly) that runs tests in "record" mode against the real APIs using a test account. This refreshes the cassettes and catches API changes early. Also configure cassette expiration so tests fail loudly if a recording is older than a set threshold, prompting a re-record.
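
A minimal freshness guard might look like the following (the threshold and helper name are illustrative, not a vcrpy feature):

```python
import os
import time

MAX_AGE_DAYS = 90

def assert_cassette_fresh(path: str, max_age_days: int = MAX_AGE_DAYS) -> None:
    """Fail loudly when a recorded cassette is older than the threshold."""
    age_days = (time.time() - os.path.getmtime(path)) / 86400
    if age_days > max_age_days:
        raise AssertionError(
            f"Cassette {path} is {age_days:.0f} days old; re-record it"
        )
```

Call this from a fixture or a dedicated test that iterates over `tests/cassettes`, so a stale recording surfaces as a normal test failure.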

### Should I test the actual LLM responses or mock them?

Mock LLM responses for deterministic integration tests. Real LLM calls are non-deterministic, slow, and expensive — they make tests flaky. Mock the LLM with fixed responses that represent the structured output your agent expects, then test that your code correctly processes those outputs into API calls. Test the LLM integration separately with a small set of evaluation tests that run on a less frequent schedule.

---

#IntegrationTesting #Mocking #CICD #AIAgents #TestAutomation #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/integration-testing-ai-agent-connections-mocking-external-services-verifying-flows
