---
title: "LangSmith: Tracing, Debugging, and Evaluating LangChain Applications"
description: "Set up LangSmith for tracing LangChain runs, analyzing run trees, building evaluation datasets, running automated evaluations, and collecting feedback on LLM outputs."
canonical: https://callsphere.ai/blog/langsmith-tracing-debugging-evaluating-langchain-applications
category: "Learn Agentic AI"
tags: ["LangSmith", "Observability", "LLM Evaluation", "Debugging", "Python"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-06-05T05:49:13.552Z
---

# LangSmith: Tracing, Debugging, and Evaluating LangChain Applications

> Set up LangSmith for tracing LangChain runs, analyzing run trees, building evaluation datasets, running automated evaluations, and collecting feedback on LLM outputs.

## Why You Need Observability for LLM Applications

LLM applications are notoriously difficult to debug. A chain might call three models, two tools, and a retriever — and when the final answer is wrong, you need to trace exactly which step failed and why. Logging raw inputs and outputs is not enough when calls are nested and asynchronous.

LangSmith is an observability and evaluation platform from the LangChain team. It integrates most tightly with LangChain but works with any LLM application. It captures detailed traces of every run, lets you visualize the execution tree, and provides tools for systematic evaluation.

## Setting Up Tracing

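LangSmith tracing is configured through environment variables: a flag that turns tracing on, your API key, and optionally a project name to group runs under. Once these are set, all LangChain operations are traced automatically. Newer SDK releases also document `LANGSMITH_`-prefixed equivalents (for example `LANGSMITH_TRACING` and `LANGSMITH_API_KEY`), so check the docs for the version you have installed.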

```mermaid
flowchart LR
    APP(["Agent or API"])
    SDK["OTel SDK
GenAI conventions"]
    COL["OTel Collector"]
    subgraph BACKENDS["Backends"]
        TR[("Traces
Tempo or Honeycomb")]
        MET[("Metrics
Prometheus")]
        LOG[("Logs
Loki or ELK")]
    end
    DASH["Grafana plus alerts"]
    PAGE(["Pager"])
    APP --> SDK --> COL
    COL --> TR
    COL --> MET
    COL --> LOG
    TR --> DASH
    MET --> DASH
    LOG --> DASH
    DASH --> PAGE
    style SDK fill:#4f46e5,stroke:#4338ca,color:#fff
    style DASH fill:#f59e0b,stroke:#d97706,color:#1f2937
    style PAGE fill:#dc2626,stroke:#b91c1c,color:#fff
```

```python
import os

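os.environ["LANGCHAIN_TRACING_V2"] = "true"  # turn tracing on
os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_your_key_here"  # placeholder; keep real keys out of source control
os.environ["LANGCHAIN_PROJECT"] = "my-project"  # optional: groups runs under this project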

# That is it. All LangChain operations are now traced.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

chain = (
    ChatPromptTemplate.from_template("Explain {topic}")
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

# This call is automatically traced in LangSmith
result = chain.invoke({"topic": "quantum computing"})
```

Every invocation creates a run in the LangSmith dashboard. For each component in the chain you can see the input, output, and latency, and for model calls the token usage and estimated cost.
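A misspelled or missing variable is easy to overlook, so a startup check can be worth adding. The sketch below is a hypothetical helper (`missing_tracing_vars` is not part of the LangSmith SDK) that reports which of the variables from the setup snippet are absent. It assumes the tracing flag is expected to be the string `"true"`, as shown above.

```python
import os
from collections.abc import Mapping

# Variable names taken from the setup snippet above.
REQUIRED_VARS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY")


def missing_tracing_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of tracing variables that are unset or unusable."""
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name, "").strip():
            problems.append(name)
    # Assumption: the flag should be "true", as in the setup snippet.
    flag = env.get("LANGCHAIN_TRACING_V2", "").strip().lower()
    if flag and flag != "true":
        problems.append("LANGCHAIN_TRACING_V2")
    return problems


problems = missing_tracing_vars()
if problems:
    print("Tracing is not fully configured:", ", ".join(problems))
```

Passing a plain dict instead of `os.environ` makes the check easy to unit test.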

## Understanding Run Trees

A run tree is a hierarchical view of a single invocation. For a RAG chain, the tree might look like:

- **Chain Run** (root)
  - **Retriever Run**: query, returned documents, latency
  - **Prompt Run**: template variables, formatted prompt
  - **LLM Run**: model, temperature, prompt tokens, completion tokens, response
  - **Parser Run**: raw input, parsed output

Each node shows timing, inputs, outputs, and any errors. This lets you spot bottlenecks (a slow retrieval) and failures (a parsing error) at a glance, without adding your own logging.
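To make the tree structure concrete, here is a minimal sketch that models a run tree with a plain dataclass and answers the two questions above: which leaf is slowest, and which runs failed. `RunNode` is a hypothetical stand-in, not the SDK's run schema, and the latencies are made up.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunNode:
    """Hypothetical stand-in for one node of a run tree (not the SDK's schema)."""
    name: str
    latency_s: float
    error: Optional[str] = None
    children: list["RunNode"] = field(default_factory=list)


def walk(node: RunNode):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def slowest_leaf(root: RunNode) -> RunNode:
    """Return the leaf run with the highest latency."""
    leaves = [n for n in walk(root) if not n.children]
    return max(leaves, key=lambda n: n.latency_s)


def failed_runs(root: RunNode) -> list[str]:
    """Return the names of all runs that recorded an error."""
    return [n.name for n in walk(root) if n.error]


# Illustrative values only.
tree = RunNode("Chain", 2.4, children=[
    RunNode("Retriever", 1.5),
    RunNode("Prompt", 0.01),
    RunNode("LLM", 0.8),
    RunNode("Parser", 0.02, error="OutputParserException"),
])

print(slowest_leaf(tree).name)  # Retriever
print(failed_runs(tree))        # ['Parser']
```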

## Custom Tracing with the @traceable Decorator

Trace any Python function, not just LangChain components.

```python
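from langsmith import traceable

# Hypothetical stand-ins for your own business logic.
def calculate_total(items: list[str]) -> float:
    return 9.99 * len(items)

def submit_order(order_id: str, items: list[str], total: float) -> str:
    return "submitted"

@traceable(name="process_order")
def process_order(order_id: str, items: list[str]) -> dict:
    # The arguments and the returned dict are recorded on the traced run
    total = calculate_total(items)
    result = submit_order(order_id, items, total)
    return {"order_id": order_id, "total": total, "status": result}

# This function call appears as a traced run in LangSmith
process_order("ORD-123", ["widget", "gadget"])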
```

`@traceable` functions nest correctly inside LangChain traces. If a traced function calls a LangChain chain, both appear in the same run tree.

## Building Evaluation Datasets

LangSmith lets you create datasets of input-output examples for systematic evaluation.

```python
from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    "customer-support-qa",
    description="Questions and expected answers for customer support bot",
)

# Add examples
client.create_examples(
    inputs=[
        {"question": "How do I reset my password?"},
        {"question": "What is your refund policy?"},
        {"question": "How do I cancel my subscription?"},
    ],
    outputs=[
        {"answer": "Go to Settings > Security > Reset Password."},
        {"answer": "Full refund within 30 days of purchase."},
        {"answer": "Go to Settings > Subscription > Cancel."},
    ],
    dataset_id=dataset.id,
)
```

You can also create datasets from traced runs — select successful or failed runs from the dashboard and convert them into evaluation examples.
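If you export runs and want to do the same conversion in code, the core step is splitting run records into parallel input and output lists. The sketch below assumes each run is a plain dict with `inputs`, `outputs`, and `error` keys. That shape is a simplification for illustration, not the SDK's run object.

```python
def runs_to_examples(runs: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split successful run records into parallel inputs/outputs lists."""
    inputs, outputs = [], []
    for run in runs:
        # Skip failed runs and runs that never produced an output.
        if run.get("error") or not run.get("outputs"):
            continue
        inputs.append(run["inputs"])
        outputs.append(run["outputs"])
    return inputs, outputs


# Hypothetical run records; real ones would come from your LangSmith project.
runs = [
    {"inputs": {"question": "How do I reset my password?"},
     "outputs": {"answer": "Go to Settings > Security > Reset Password."},
     "error": None},
    {"inputs": {"question": "What is your refund policy?"},
     "outputs": None,
     "error": "TimeoutError"},
]

inputs, outputs = runs_to_examples(runs)
print(len(inputs))  # 1
```

The two lists have the same shape as the `inputs` and `outputs` arguments passed to `client.create_examples` above. Failed runs are skipped here because they have no output to use as a reference; to include them, write the expected answer by hand first.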

## Running Evaluations

Evaluators score your chain's outputs against expected results.

```python
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Define evaluators
correctness = LangChainStringEvaluator("qa")  # LLM-based QA evaluator
relevance = LangChainStringEvaluator("criteria", config={
    "criteria": "relevance",
})

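# Run evaluation
def predict(inputs: dict) -> dict:
    # The chain defined earlier expects a "topic" variable,
    # so map the dataset's "question" field onto it
    result = chain.invoke({"topic": inputs["question"]})
    return {"answer": result}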

results = evaluate(
    predict,
    data="customer-support-qa",
    evaluators=[correctness, relevance],
    experiment_prefix="v1-gpt4o-mini",
)

print(results.to_pandas())
```

Each evaluation run creates an experiment in LangSmith. You can compare experiments side by side to measure the impact of prompt changes, model upgrades, or chain modifications.
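The dashboard does this comparison for you, but the underlying arithmetic is simple: average each evaluator's scores per experiment, then subtract. The sketch below assumes each result row is a dict with `key` and `score` fields, a simplified shape chosen for illustration. The scores are made up.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(rows: list[dict]) -> dict[str, float]:
    """Average the score for each evaluator key across all rows."""
    by_key = defaultdict(list)
    for row in rows:
        by_key[row["key"]].append(row["score"])
    return {key: mean(scores) for key, scores in by_key.items()}


def score_deltas(baseline: list[dict], candidate: list[dict]) -> dict[str, float]:
    """Return candidate minus baseline for every key present in both."""
    base, cand = mean_scores(baseline), mean_scores(candidate)
    return {key: cand[key] - base[key] for key in base if key in cand}


# Made-up scores for two experiments, for illustration only.
v1 = [{"key": "correctness", "score": 1}, {"key": "correctness", "score": 0}]
v2 = [{"key": "correctness", "score": 1}, {"key": "correctness", "score": 1}]

print(score_deltas(v1, v2))  # {'correctness': 0.5}
```

With only a handful of examples, a delta like this is noisy. Treat small differences as a prompt to add more examples rather than as a result.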

## Custom Evaluators

Write evaluators for domain-specific quality criteria.

```python
from langsmith.evaluation import evaluate
from langsmith.schemas import Run, Example

def check_citation(run: Run, example: Example) -> dict:
    """Check if the response cites a source (a simple string heuristic)."""
    # run.outputs can be None, for example when the target function raised
    output = (run.outputs or {}).get("answer", "")
    has_citation = "source:" in output.lower() or "[" in output
    return {
        "key": "has_citation",
        "score": 1 if has_citation else 0,
    }

results = evaluate(
    predict,
    data="customer-support-qa",
    evaluators=[check_citation],
)
```
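Evaluators that compare against the dataset's expected answer follow the same `(run, example)` shape. Here is a minimal sketch of a normalized exact-match check, assuming both the prediction and the reference store their text under an `answer` key as in the examples above. The `SimpleNamespace` objects are stand-ins so the function can be tried locally without calling LangSmith.

```python
from types import SimpleNamespace


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def exact_match(run, example) -> dict:
    """Score 1 when the answer equals the reference after normalization."""
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    return {
        "key": "exact_match",
        "score": int(_normalize(predicted) == _normalize(expected)),
    }


# Stand-in objects that only mimic the .outputs attribute.
fake_run = SimpleNamespace(outputs={"answer": "full refund within 30 days of purchase. "})
fake_example = SimpleNamespace(outputs={"answer": "Full refund within 30 days of purchase."})

print(exact_match(fake_run, fake_example))  # {'key': 'exact_match', 'score': 1}
```

Exact match is strict for free-form text, so it fits best where answers are short and canonical. For longer answers, an LLM-based evaluator like the `"qa"` one shown earlier is usually a better fit.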

## Collecting Human Feedback

LangSmith supports feedback collection on individual runs via the API or dashboard.

```python
from langsmith import Client

client = Client()

# After a user rates a response
client.create_feedback(
    run_id="run-uuid-here",
    key="user_rating",
    score=1,  # thumbs up
    comment="Accurate and helpful",
)
```

Feedback data can seed fine-tuning and evaluation datasets, and low-scoring runs show where your application needs improvement.
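Once feedback accumulates, a per-key summary is a quick way to see how each signal is trending. The sketch below assumes feedback records are plain dicts with the same `key` and `score` fields used in the call above. The records are made up, and `summarize_feedback` is a hypothetical helper.

```python
from collections import defaultdict


def summarize_feedback(records: list[dict]) -> dict[str, dict]:
    """Return the count and mean score for each feedback key."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for record in records:
        if record.get("score") is None:
            continue  # comment-only feedback has no score to average
        bucket = totals[record["key"]]
        bucket["count"] += 1
        bucket["sum"] += record["score"]
    return {
        key: {"count": b["count"], "mean": b["sum"] / b["count"]}
        for key, b in totals.items()
    }


# Made-up feedback records, for illustration only.
records = [
    {"key": "user_rating", "score": 1},
    {"key": "user_rating", "score": 0},
    {"key": "user_rating", "score": 1},
    {"key": "user_rating", "score": None, "comment": "Not sure"},
]

summary = summarize_feedback(records)
print(summary["user_rating"]["count"])  # 3
```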

## FAQ

### Is LangSmith required to use LangChain?

No. LangSmith is optional. LangChain works without it. But for any production application, the observability and evaluation capabilities are essential for debugging issues, measuring quality, and iterating on prompts and chains.

### What does LangSmith cost?

LangSmith has a free tier with limited trace retention. Paid tiers offer longer retention, more traces, and team collaboration features. Check the current pricing at smith.langchain.com.

### Can I use LangSmith with non-LangChain code?

Yes. The `@traceable` decorator and the LangSmith SDK work with any Python code. You can trace raw OpenAI API calls, custom HTTP requests, or any function. LangSmith is not limited to LangChain applications.

---

Source: https://callsphere.ai/blog/langsmith-tracing-debugging-evaluating-langchain-applications