Multiprocessing vs Asyncio for AI Workloads: When to Use Each Approach

Understand when to use multiprocessing versus asyncio for AI agent workloads. Learn CPU-bound vs I/O-bound trade-offs, ProcessPoolExecutor, and hybrid patterns.

The Fundamental Decision

In standard CPython builds, the GIL (Global Interpreter Lock) means that only one thread executes Python bytecode at a time within a single process. This creates a clear decision tree for AI workloads:

  • I/O-bound work (LLM API calls, database queries, file reads): use asyncio. While a request is waiting on the network, no Python bytecode needs to run, so asyncio's single-threaded event loop can multiplex thousands of concurrent I/O operations by switching tasks at each await.
  • CPU-bound work (embedding computation, text preprocessing, local model inference): use multiprocessing. Each process has its own interpreter and its own GIL, so CPU work truly runs in parallel across cores.

Most AI agent systems involve both. The key is choosing the right tool for each part of the pipeline.

I/O-Bound: asyncio Dominates

API calls to LLM providers are pure I/O. The agent sends a request and waits for the response. asyncio handles this efficiently because the event loop switches to other tasks during the wait.

import asyncio
import httpx
import time

async def benchmark_io_bound():
    """Benchmark concurrent LLM API calls with asyncio."""
    prompts = [f"Question {i}: Explain concept {i}" for i in range(20)]

    async with httpx.AsyncClient(timeout=30.0) as client:
        start = time.monotonic()
        tasks = [
            simulate_llm_call(client, prompt)
            for prompt in prompts
        ]
        results = await asyncio.gather(*tasks)
        elapsed = time.monotonic() - start

    print(f"{len(results)} I/O-bound calls: {elapsed:.2f}s with asyncio")
    # ~1.5s: all 20 simulated calls sleep concurrently, so the total
    # tracks the slowest call rather than the sum (20 x 1.5s = 30s)

async def simulate_llm_call(client, prompt):
    await asyncio.sleep(1.5)  # Simulate API latency
    return f"Response to {prompt}"

asyncio.run(benchmark_io_bound())

CPU-Bound: Multiprocessing Is Required

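Embedding generation, text chunking, and local model inference are CPU-intensive. asyncio provides no speedup for CPU-bound work: the event loop runs on a single thread, so a coroutine that computes without awaiting blocks every other task, and the GIL prevents threads in the same process from running Python bytecode in parallel.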
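To make the point concrete, here is a minimal sketch showing that two CPU-bound coroutines scheduled with asyncio.gather do not interleave. The names busy_work, cpu_task, and demo are illustrative and not part of any library.

```python
import asyncio


def busy_work(n: int) -> int:
    """Pure-Python loop that never yields to the event loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


async def cpu_task(name: str, log: list[str]) -> None:
    log.append(f"{name} start")
    busy_work(200_000)  # no await here, so the loop cannot switch tasks
    log.append(f"{name} end")


async def demo() -> list[str]:
    log: list[str] = []
    await asyncio.gather(cpu_task("a", log), cpu_task("b", log))
    return log


event_log = asyncio.run(demo())
print(event_log)
# ['a start', 'a end', 'b start', 'b end']
```

Task "b" only starts after task "a" has finished, because nothing in cpu_task ever hands control back to the event loop. The work runs sequentially even though it was submitted concurrently.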

import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor
import time

def compute_embeddings_batch(texts: list[str]) -> list[list[float]]:
    """CPU-intensive embedding computation (runs in worker process)."""
    # Simulating CPU-heavy work
    embeddings = []
    for text in texts:
        # In reality, this would be a local model inference
        embedding = [hash(text + str(i)) % 1000 / 1000.0
                     for i in range(384)]
        embeddings.append(embedding)
    return embeddings

def benchmark_cpu_bound():
    """Benchmark CPU-bound work with multiprocessing."""
    all_texts = [f"Document {i} content..." for i in range(1000)]
    chunk_size = 100
    chunks = [
        all_texts[i:i + chunk_size]
        for i in range(0, len(all_texts), chunk_size)
    ]

    # Sequential
    start = time.monotonic()
    for chunk in chunks:
        compute_embeddings_batch(chunk)
    seq_time = time.monotonic() - start

    # Parallel with multiprocessing
    start = time.monotonic()
    with ProcessPoolExecutor(max_workers=mp.cpu_count()) as executor:
        results = list(executor.map(compute_embeddings_batch, chunks))
    par_time = time.monotonic() - start

    print(f"Sequential: {seq_time:.2f}s")
    print(f"Parallel ({mp.cpu_count()} workers): {par_time:.2f}s")
    print(f"Speedup: {seq_time / par_time:.1f}x")

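if __name__ == "__main__":
    # Guard is required where workers start with "spawn" (Windows, macOS):
    # each worker re-imports this module and must not re-run the benchmark.
    benchmark_cpu_bound()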

The Hybrid Pattern: asyncio + ProcessPoolExecutor

Real AI agents combine I/O-bound and CPU-bound work. The hybrid pattern uses asyncio for the main event loop and offloads CPU-heavy work to a process pool.

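import asyncio
from concurrent.futures import ProcessPoolExecutor

import httpx

# call_llm (an async function) and merge_summaries (a synchronous,
# module-level function) are assumed to be defined elsewhere.

# Module-level process pool (shared across requests)
_process_pool = ProcessPoolExecutor(max_workers=4)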

def cpu_heavy_preprocess(text: str) -> dict:
    """CPU-bound text preprocessing (runs in separate process)."""
    # Tokenization, NER, chunking — CPU intensive
    tokens = text.split()
    chunks = [
        " ".join(tokens[i:i+256])
        for i in range(0, len(tokens), 256)
    ]
    return {"chunks": chunks, "token_count": len(tokens)}

async def agent_pipeline(document: str) -> dict:
    """Agent pipeline mixing I/O and CPU work."""
    loop = asyncio.get_running_loop()

    # Step 1: CPU-bound preprocessing (offload to process pool)
    preprocessed = await loop.run_in_executor(
        _process_pool,
        cpu_heavy_preprocess,
        document,
    )

    # Step 2: I/O-bound LLM calls (run concurrently with asyncio)
    async with httpx.AsyncClient(timeout=60.0) as client:
        summaries = await asyncio.gather(*[
            call_llm(client, f"Summarize: {chunk}")
            for chunk in preprocessed["chunks"]
        ])

    # Step 3: CPU-bound post-processing
    final = await loop.run_in_executor(
        _process_pool,
        merge_summaries,
        summaries,
    )
    return final

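The key method is loop.run_in_executor(executor, func, *args). It runs a synchronous function in a thread pool (the default when executor is None) or in a process pool without blocking the event loop. With a process pool, the function and its arguments must be picklable, so the function has to be defined at module level.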
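As a small illustration, the sketch below passes the same synchronous function to run_in_executor with either kind of executor. count_tokens and count_all are hypothetical helper names chosen for this example; only the asyncio and concurrent.futures calls are standard library APIs.

```python
import asyncio
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor


def count_tokens(text: str) -> int:
    """Synchronous, module-level function, so a process pool can pickle it."""
    return len(text.split())


async def count_all(executor: Executor, texts: list[str]) -> list[int]:
    """Submit one executor job per text and await them all together."""
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(executor, count_tokens, t) for t in texts]
    return await asyncio.gather(*futures)


if __name__ == "__main__":
    docs = ["one two three", "four five", ""]
    with ThreadPoolExecutor(max_workers=2) as threads:
        print(asyncio.run(count_all(threads, docs)))
    with ProcessPoolExecutor(max_workers=2) as procs:
        print(asyncio.run(count_all(procs, docs)))
```

The calling code is identical in both cases. Only the executor changes, which makes it easy to start with threads and switch to processes once profiling shows the work is CPU-bound.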

When to Use asyncio.to_thread

For blocking library calls or light CPU work, asyncio.to_thread() (Python 3.9+) offloads to a thread instead of a process. This avoids the serialization overhead of multiprocessing, but pure-Python code in the thread still competes for the GIL.

import asyncio

async def process_with_blocking_library(data: str) -> dict:
    """Use asyncio.to_thread for blocking library calls."""
    # This runs in a thread — GIL limits parallelism but
    # it does not block the event loop
    result = await asyncio.to_thread(
        blocking_library_call, data
    )
    return result

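Use to_thread for: blocking file I/O, synchronous database drivers, third-party libraries without async support. Use run_in_executor with a process pool for: heavy computation, numpy operations, local model inference. Some native libraries, including many NumPy routines, release the GIL while they compute, so measure before assuming a process pool is needed.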
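The following sketch, under the assumption that time.sleep stands in for a blocking driver or file call, shows the event loop still servicing another task while the blocking call runs in a thread. blocking_read and heartbeat are illustrative names.

```python
import asyncio
import time


def blocking_read(delay: float) -> str:
    time.sleep(delay)  # stands in for a blocking driver or file call
    return "done"


async def heartbeat(ticks: list[int], interval: float, count: int) -> None:
    """Append a tick every `interval` seconds; needs a responsive loop."""
    for i in range(count):
        await asyncio.sleep(interval)
        ticks.append(i)


async def main() -> tuple[str, list[int]]:
    ticks: list[int] = []
    result, _ = await asyncio.gather(
        asyncio.to_thread(blocking_read, 0.3),
        heartbeat(ticks, 0.05, 3),
    )
    return result, ticks


result, ticks = asyncio.run(main())
print(result, ticks)
# done [0, 1, 2]
```

If blocking_read were called directly inside a coroutine instead of through to_thread, the heartbeat could not tick until the call returned.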

Decision Matrix

Workload Type       | Best Tool              | Example
--------------------+------------------------+-----------------------------
LLM API calls       | asyncio                | OpenAI, Anthropic API calls
Database queries    | asyncio (async driver) | asyncpg, motor
File I/O            | asyncio.to_thread      | Reading large documents
Text preprocessing  | ProcessPoolExecutor    | Tokenization, chunking
Local model infer.  | ProcessPoolExecutor    | sentence-transformers
Embedding compute   | ProcessPoolExecutor    | numpy-heavy operations
Mixed pipeline      | Hybrid (asyncio + PPE) | Full agent workflow

FAQ

Does the GIL affect LLM API calls?

No. While a request is in flight, your code is waiting on the network rather than executing Python bytecode. Blocking I/O calls in threads release the GIL while they wait, and asyncio tasks yield to the event loop at each await, so other threads or tasks can run. The GIL only matters for CPU-bound Python bytecode execution.

What is the overhead of ProcessPoolExecutor?

Each task submission serializes the function arguments with pickle, sends them to a worker process, and pickles the result on the way back. The cost depends on payload size and hardware: as a rough guide, small inputs add a few milliseconds or less, while large data (megabytes of text) can take tens of milliseconds to serialize. Batch your work to amortize this cost: send 100 documents per task rather than one.
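One way to get a feel for this cost is to time a pickle round trip on your own payloads. The sketch below is only an approximation: it measures serialization alone, not the inter-process transfer or per-task dispatch, and the helper names batched and pickle_roundtrip_seconds are made up for this example.

```python
import pickle
import time


def batched(items: list[str], size: int) -> list[list[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def pickle_roundtrip_seconds(payload: object) -> float:
    """Time one dumps/loads cycle, roughly what a pool does per argument."""
    start = time.perf_counter()
    pickle.loads(pickle.dumps(payload))
    return time.perf_counter() - start


docs = [f"Document {i} content " * 50 for i in range(1000)]
per_doc = sum(pickle_roundtrip_seconds(d) for d in docs)
per_batch = sum(pickle_roundtrip_seconds(b) for b in batched(docs, 100))
print(f"1000 single-document payloads: {per_doc * 1000:.2f} ms")
print(f"10 batches of 100 documents:   {per_batch * 1000:.2f} ms")
```

Batching mainly reduces the number of tasks, and therefore the number of round trips between processes. ProcessPoolExecutor.map also accepts a chunksize argument that groups items into batches for you.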

Can I use multiprocessing.Pool inside an asyncio event loop?

Not directly. The blocking methods of multiprocessing.Pool (such as map, apply, and get() on an async result) will freeze your event loop. Instead, create one ProcessPoolExecutor, reuse it, and submit work with loop.run_in_executor(pool, func, *args). The call returns an awaitable future, so the inter-process communication happens without blocking the event loop.

