Learn Agentic AI

Building a CSV Analysis Agent: Upload Data, Ask Questions, Get Answers

Create an AI agent that accepts CSV file uploads, uses pandas for statistical analysis, answers natural language questions about the data, and generates clear reports with charts — all without the user writing any code.

The Problem with Manual CSV Analysis

Every business analyst has been there: someone sends a CSV file and asks "What trends do you see?" What follows is 30 minutes of opening the file in a spreadsheet, eyeballing columns, writing formulas, creating pivot tables, and building charts. A CSV analysis agent automates this entire workflow. The user uploads a file, asks a question in plain English, and gets a statistical answer with supporting visualizations.

The key insight is that pandas is already the perfect execution engine for tabular data analysis. The agent's job is to translate natural language into pandas operations, run them safely, and present the results clearly.

Loading and Profiling the CSV

The first tool loads a CSV file and returns a structural profile — column names, types, row count, and sample values:

import pandas as pd
from agents import Agent, Runner, function_tool

_dataframes: dict[str, pd.DataFrame] = {}

@function_tool
def load_csv(file_path: str, name: str = "default") -> str:
    """Load a CSV file into memory and return a profile of its structure."""
    try:
        df = pd.read_csv(file_path)
    except Exception as e:
        return f"Error loading CSV: {e}"

    _dataframes[name] = df
    profile_parts = [
        f"Loaded: {len(df)} rows, {len(df.columns)} columns",
        f"Columns: {', '.join(df.columns.tolist())}",
        f"Data types:\n{df.dtypes.to_string()}",
        f"\nFirst 3 rows:\n{df.head(3).to_string()}",
        f"\nNull counts:\n{df.isnull().sum().to_string()}",
    ]
    return "\n".join(profile_parts)
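The profiling logic can be sanity-checked with plain pandas, outside the agent runtime. The tiny in-memory CSV below is an illustrative stand-in for an uploaded file:

```python
import io
import pandas as pd

# Hypothetical sample data standing in for an uploaded CSV file.
csv_text = "region,order_value\nWest,187.4\nEast,120.0\nWest,\n"
df = pd.read_csv(io.StringIO(csv_text))

# Same profile pieces the load_csv tool assembles.
profile = "\n".join([
    f"Loaded: {len(df)} rows, {len(df.columns)} columns",
    f"Columns: {', '.join(df.columns.tolist())}",
    f"Null counts:\n{df.isnull().sum().to_string()}",
])
print(profile)
```

The null counts matter most: they tell the agent up front which columns will need cleaning before aggregation.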

Statistical Analysis Tool

This tool runs descriptive statistics, correlations, group-by aggregations, and other pandas operations specified by the agent:

@function_tool
def analyze_data(name: str, operation: str) -> str:
    """Run a pandas operation on a loaded dataset.
    Operations: describe, corr, groupby, value_counts, query.
    For groupby: use format 'groupby:column:agg_col:agg_func'
    For value_counts: use format 'value_counts:column'
    For query: use format 'query:pandas_query_string'"""
    if name not in _dataframes:
        return f"Error: No dataset loaded with name '{name}'."

    df = _dataframes[name]

    try:
        if operation == "describe":
            return df.describe(include="all").to_string()
        elif operation == "corr":
            numeric_df = df.select_dtypes(include="number")
            return numeric_df.corr().to_string()
        elif operation.startswith("groupby:"):
            parts = operation.split(":")
            group_col, agg_col, agg_func = parts[1], parts[2], parts[3]
            result = df.groupby(group_col)[agg_col].agg(agg_func)
            return result.to_string()
        elif operation.startswith("value_counts:"):
            col = operation.split(":")[1]
            return df[col].value_counts().head(20).to_string()
        elif operation.startswith("query:"):
            query_str = operation[6:]
            filtered = df.query(query_str)
            return f"Rows matching: {len(filtered)}\n{filtered.head(10).to_string()}"
        else:
            return f"Unknown operation: '{operation}'. Supported: describe, corr, groupby, value_counts, query."
    except Exception as e:
        return f"Analysis error: {e}"
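The colon-delimited operation strings can be tested in isolation. Here the groupby branch is reproduced on a toy frame (data and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["West", "West", "East"],
    "order_value": [200.0, 174.8, 120.0],
})

# Same parsing the tool performs for a groupby operation string.
operation = "groupby:region:order_value:mean"
_, group_col, agg_col, agg_func = operation.split(":")
result = df.groupby(group_col)[agg_col].agg(agg_func)
print(result.to_string())
```

Keeping the format rigidly positional (`groupby:column:agg_col:agg_func`) is deliberate: LLMs produce structured strings like this far more reliably than free-form pandas snippets.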

Safe Code Execution for Custom Analysis

Sometimes the agent needs to run custom pandas code that does not fit predefined operations. A sandboxed execution tool handles this:


@function_tool
def run_pandas_code(name: str, code: str) -> str:
    """Execute custom pandas code on a loaded dataset.
    The dataframe is available as 'df'. Assign your output to a variable
    named 'result'; it will be returned as a string."""
    if name not in _dataframes:
        return f"Error: No dataset '{name}' loaded."

    BLOCKED_KEYWORDS = ["import os", "subprocess", "eval(", "exec(", "__", "open("]
    for kw in BLOCKED_KEYWORDS:
        if kw in code:
            return f"Blocked: '{kw}' is not allowed for security reasons."

    df = _dataframes[name].copy()
    local_vars = {"df": df, "pd": pd}

    try:
        exec(code, {"__builtins__": {}}, local_vars)
        if "result" in local_vars:
            r = local_vars["result"]
            return r.to_string() if hasattr(r, "to_string") else str(r)
        return "Code executed. Set 'result = ...' to return output."
    except Exception as e:
        return f"Execution error: {e}"
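The sandbox's two paths, the keyword screen and the restricted `exec`, can be exercised standalone; this is a minimal reproduction of that logic, not a substitute for real isolation:

```python
import pandas as pd

BLOCKED_KEYWORDS = ["import os", "subprocess", "eval(", "exec(", "__", "open("]

def run_sandboxed(df: pd.DataFrame, code: str) -> str:
    # Screen for dangerous keywords before executing anything.
    for kw in BLOCKED_KEYWORDS:
        if kw in code:
            return f"Blocked: '{kw}'"
    local_vars = {"df": df.copy(), "pd": pd}
    exec(code, {"__builtins__": {}}, local_vars)
    r = local_vars.get("result")
    return r.to_string() if hasattr(r, "to_string") else str(r)

df = pd.DataFrame({"x": [1, 2, 3]})
print(run_sandboxed(df, "result = df['x'].sum()"))  # allowed
print(run_sandboxed(df, "open('/etc/passwd')"))     # screened out
```

Keyword denylists are bypassable by a determined adversary; treat this as defense in depth, with true isolation handled at the infrastructure layer (see the FAQ below).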

Assembling the CSV Agent

csv_agent = Agent(
    name="CSV Analyst",
    instructions="""You are a CSV data analysis agent. Your workflow:
1. When the user provides a file, call load_csv to inspect its structure.
2. For questions about distributions or summaries, use analyze_data with 'describe'.
3. For relationship questions, use analyze_data with 'corr'.
4. For group comparisons, use analyze_data with 'groupby:col:agg_col:func'.
5. For complex analysis, use run_pandas_code with safe pandas operations.
6. Always explain results in plain language with specific numbers.
7. Suggest follow-up questions the user might want to ask.""",
    tools=[load_csv, analyze_data, run_pandas_code],
)

Example Interaction

result = Runner.run_sync(
    csv_agent,
    "Load the file sales_2025.csv and tell me which region had the "
    "highest average order value, and whether order value correlates with "
    "customer tenure.",
)
print(result.final_output)

The agent loads the CSV, runs describe to understand the data, performs a groupby:region:order_value:mean to rank regions, computes correlations between order_value and customer_tenure_months, and responds with something like: "The West region had the highest average order value at $187.40. There is a moderate positive correlation (r = 0.43) between customer tenure and order value."
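The two analyses the agent chains together reduce to a few lines of plain pandas. The data and column names below mirror the hypothetical sales_2025.csv; the correlation value is computed, not quoted from the example above:

```python
import pandas as pd

# Illustrative stand-in for sales_2025.csv.
df = pd.DataFrame({
    "region": ["West", "West", "East", "East"],
    "order_value": [200.0, 174.8, 120.0, 130.0],
    "customer_tenure_months": [24, 30, 6, 12],
})

# Rank regions by average order value.
by_region = df.groupby("region")["order_value"].mean().sort_values(ascending=False)
# Pearson correlation between order value and tenure.
r = df["order_value"].corr(df["customer_tenure_months"])
print(by_region.to_string())
print(f"r = {r:.2f}")
```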

Handling Messy Data

Real-world CSVs are messy. Add a cleaning tool that the agent can invoke before analysis:

@function_tool
def clean_data(name: str, operations: str) -> str:
    """Clean a loaded dataset. Operations (comma-separated):
    drop_nulls, fill_nulls_zero, strip_whitespace, lowercase_columns"""
    if name not in _dataframes:
        return f"No dataset '{name}'."
    df = _dataframes[name]
    ops = [o.strip() for o in operations.split(",")]
    for op in ops:
        if op == "drop_nulls":
            df = df.dropna()
        elif op == "fill_nulls_zero":
            df = df.fillna(0)
        elif op == "strip_whitespace":
            df = df.apply(lambda c: c.str.strip() if c.dtype == "object" else c)
        elif op == "lowercase_columns":
            df.columns = [c.lower().replace(" ", "_") for c in df.columns]
    _dataframes[name] = df
    return f"Cleaned. Shape: {df.shape[0]} rows, {df.shape[1]} columns."
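Chained together, the same cleaning operations look like this in plain pandas (column names and values are illustrative):

```python
import pandas as pd

# Messy input: inconsistent column names, stray whitespace, nulls.
df = pd.DataFrame({
    "Region Name": [" West ", "East", None],
    "Order Value": [187.4, None, 120.0],
})

# lowercase_columns, then strip_whitespace, then fill_nulls_zero.
df.columns = [c.lower().replace(" ", "_") for c in df.columns]
df = df.apply(lambda c: c.str.strip() if c.dtype == "object" else c)
df = df.fillna(0)
print(df.to_string())
```

Order matters: normalizing column names first means every later operation (and the agent's own groupby strings) can rely on the snake_case names.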

FAQ

Is it safe to let an LLM execute pandas code?

The sandboxed executor shown above blocks dangerous operations like file access and subprocess calls. For production systems, run the code in a Docker container or use a dedicated code sandbox like E2B to provide true isolation.
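One lightweight step short of a full container is running the generated code in a separate interpreter process with a timeout, so a crash or infinite loop cannot take down the agent. A minimal sketch (the timeout value is illustrative, and this does not restrict filesystem or network access):

```python
import subprocess
import sys
import tempfile

def run_isolated(code: str, timeout_s: int = 5) -> str:
    # Write the agent-generated code to a temp file and run it in a
    # fresh Python interpreter, capturing whatever it prints.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True,
                              timeout=timeout_s)
        return proc.stdout or proc.stderr
    except subprocess.TimeoutExpired:
        return "Execution timed out."

print(run_isolated("print(1 + 1)"))
```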

What is the maximum file size this agent can handle?

Pandas loads the entire CSV into memory, so the limit depends on your server's RAM. For files under 500 MB, standard pandas works well. For larger files, swap in polars or dask as the execution engine — the agent interface stays the same.

Can the agent handle Excel files or databases too?

Yes. Replace pd.read_csv with pd.read_excel for spreadsheets or pd.read_sql for databases. The rest of the analysis pipeline remains identical since pandas normalizes everything into DataFrames.
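As a sketch of the database path, `pd.read_sql` against an in-memory SQLite connection stands in for a production database; everything downstream of the load is unchanged:

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for a real database connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, order_value REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("West", 200.0), ("East", 120.0)])

# Same DataFrame the CSV loader would produce; same pipeline after this.
df = pd.read_sql("SELECT * FROM orders", conn)
print(df.groupby("region")["order_value"].mean().to_string())
```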


#CSV #Pandas #DataAnalysis #Statistics #AIAgents #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

