---
title: "Extracting Entities from Documents: Names, Dates, Addresses, and Custom Types"
description: "Build production-grade entity extraction with LLMs. Learn schema design for names, dates, addresses, and custom entity types, plus batch extraction techniques and accuracy optimization strategies."
canonical: https://callsphere.ai/blog/extracting-entities-documents-names-dates-addresses-custom-types
category: "Learn Agentic AI"
tags: ["Entity Extraction", "NER", "Structured Outputs", "Pydantic", "Python"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:42.806Z
---

# Extracting Entities from Documents: Names, Dates, Addresses, and Custom Types

> Build production-grade entity extraction with LLMs. Learn schema design for names, dates, addresses, and custom entity types, plus batch extraction techniques and accuracy optimization strategies.

## Entity Extraction with LLMs vs. Traditional NER

Traditional Named Entity Recognition (NER) models like spaCy's `en_core_web_lg` are fast and work well for standard entity types: person names, organizations, locations. But they struggle with domain-specific entities (medical codes, legal citations, product SKUs) and they cannot extract structured attributes for each entity.

LLM-based extraction handles arbitrary entity types, extracts attributes, and understands context that statistical models miss. The tradeoff is cost and latency: an LLM call takes 500ms-2s versus 5ms for spaCy. For most business applications, the accuracy gain justifies the cost.

## Designing Entity Schemas

Define a dedicated Pydantic model for each entity type you care about:

```mermaid
flowchart LR
    DOC(["Document text"])
    LLM["LLM extraction
gpt-4o + instructor"]
    SCHEMA{"Pydantic
validation"}
    ENT["Typed entities
people, dates, ..."]
    AUDIT[("raw_text
audit trail")]
    DOC --> LLM --> SCHEMA
    SCHEMA -->|Valid| ENT
    SCHEMA -->|Invalid, retry| LLM
    ENT --> AUDIT
    style LLM fill:#4f46e5,stroke:#4338ca,color:#fff
    style SCHEMA fill:#f59e0b,stroke:#d97706,color:#1f2937
    style AUDIT fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style ENT fill:#059669,stroke:#047857,color:#fff
```

```python
from pydantic import BaseModel, Field
from typing import List, Optional, Literal
from datetime import date

class PersonEntity(BaseModel):
    full_name: str
    first_name: Optional[str] = None
    last_name: Optional[str] = None
    title: Optional[str] = Field(default=None, description="Mr, Mrs, Dr, etc.")
    role: Optional[str] = Field(default=None, description="Job title or role")
    organization: Optional[str] = None

class DateEntity(BaseModel):
    raw_text: str = Field(description="Original date text from document")
    normalized: Optional[str] = Field(
        default=None,
        description="ISO format YYYY-MM-DD when possible"
    )
    date_type: Literal["exact", "relative", "range", "approximate"]

class AddressEntity(BaseModel):
    full_address: str
    street: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    postal_code: Optional[str] = None
    country: Optional[str] = Field(default="US")

class MoneyEntity(BaseModel):
    amount: float
    currency: str = Field(default="USD")
    raw_text: str = Field(description="Original text, e.g., '$1.2 million'")
```

Keeping the `raw_text` field alongside normalized values is essential for auditing. When a downstream process questions an extracted value, you can trace it back to the exact source text.
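The `raw_text` field also enables cheap post-extraction sanity checks. One example: confirm that each normalized date actually parses as ISO and roughly agrees with its source text. A minimal sketch (the helper name and the year heuristic are ours, not part of the schema above):

```python
from datetime import date
from typing import Optional

def check_normalized_date(raw_text: str, normalized: Optional[str]) -> bool:
    """Return True when the normalized value parses as an ISO date
    and its year appears verbatim in the raw source text."""
    if normalized is None:
        return True  # nothing to verify; the model declined to normalize
    try:
        parsed = date.fromisoformat(normalized)
    except ValueError:
        return False  # not a valid YYYY-MM-DD string
    # Cheap heuristic: a four-digit year usually survives verbatim
    return str(parsed.year) in raw_text

print(check_normalized_date("March 15, 2025", "2025-03-15"))  # True
print(check_normalized_date("March 15, 2025", "2025-13-15"))  # False: invalid month
```

Flagged entities can be routed to a retry with a stricter prompt or to human review.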

## The Extraction Prompt

A well-structured prompt dramatically improves extraction quality:

```python
from openai import OpenAI
import instructor

client = instructor.from_openai(OpenAI())

class DocumentEntities(BaseModel):
    people: List[PersonEntity]
    dates: List[DateEntity]
    addresses: List[AddressEntity]
    monetary_values: List[MoneyEntity]

def extract_entities(text: str) -> DocumentEntities:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=DocumentEntities,
        max_retries=2,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a precise document entity extractor. "
                    "Extract ALL entities of each type from the text. "
                    "If an entity attribute is not explicitly stated, use null. "
                    "Never infer or guess values not present in the text."
                )
            },
            {"role": "user", "content": text}
        ],
    )
```

The instruction "never infer or guess" is critical. Without it, the model tends to hallucinate plausible-sounding addresses or fill in missing first/last name splits incorrectly.

## Custom Entity Types

Define domain-specific entities for your use case. Here is an example for legal document extraction:

```python
class LegalCitation(BaseModel):
    case_name: str
    citation: str = Field(description="e.g., '123 F.3d 456'")
    court: Optional[str] = None
    year: Optional[int] = None

class ContractClause(BaseModel):
    clause_type: Literal[
        "termination", "liability", "indemnification",
        "confidentiality", "payment_terms", "warranty", "other"
    ]
    summary: str
    parties_involved: List[str]
    key_conditions: List[str]
```

The `Literal` type constrains the model to a fixed set of values, which prevents it from inventing clause types that your downstream system cannot handle.
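A related benefit: downstream code can recover that fixed set at runtime with `typing.get_args`, so routing logic stays in sync with the schema automatically. A sketch, assuming your handlers are keyed by clause type (the `route_clause` helper is illustrative):

```python
from typing import Literal, get_args

ClauseType = Literal[
    "termination", "liability", "indemnification",
    "confidentiality", "payment_terms", "warranty", "other",
]

# Derived from the Literal itself: adding a clause type to the
# schema automatically updates this set.
ALLOWED_CLAUSE_TYPES = set(get_args(ClauseType))

def route_clause(clause_type: str) -> str:
    if clause_type not in ALLOWED_CLAUSE_TYPES:
        raise ValueError(f"Unknown clause type: {clause_type!r}")
    return f"handler_{clause_type}"

print(route_clause("liability"))  # handler_liability
```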

## Batch Extraction for Multiple Documents

When processing many documents, use async calls for throughput:

```python
import asyncio
from openai import AsyncOpenAI

async_client = instructor.from_openai(AsyncOpenAI())

async def extract_batch(documents: List[str]) -> List[DocumentEntities]:
    tasks = [
        async_client.chat.completions.create(
            model="gpt-4o",
            response_model=DocumentEntities,
            max_retries=2,
            messages=[
                {"role": "system", "content": "Extract all entities from the text."},
                {"role": "user", "content": doc}
            ],
        )
        for doc in documents
    ]
    # Process in batches of 10 to respect rate limits
    results = []
    for i in range(0, len(tasks), 10):
        batch = tasks[i:i + 10]
        results.extend(await asyncio.gather(*batch))
    return results
```
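One drawback of the slicing approach: each batch of 10 must fully finish before the next starts, so a single slow call stalls the pipeline. An `asyncio.Semaphore` keeps exactly N requests in flight at all times instead. A sketch with a stand-in coroutine in place of the real API call:

```python
import asyncio
from typing import List

async def extract_one(doc: str) -> str:
    # Stand-in for the real instructor/OpenAI call above
    await asyncio.sleep(0.01)
    return f"entities:{doc}"

async def extract_bounded(documents: List[str], limit: int = 10) -> List[str]:
    sem = asyncio.Semaphore(limit)

    async def bounded(doc: str) -> str:
        async with sem:  # at most `limit` calls run concurrently
            return await extract_one(doc)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(d) for d in documents))

results = asyncio.run(extract_bounded([f"doc{i}" for i in range(25)]))
print(results[0])  # entities:doc0
```

As soon as one request completes, the semaphore admits the next, keeping the concurrency cap saturated.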

## Improving Accuracy with Few-Shot Examples

Include examples in your prompt to calibrate the model:

```python
FEW_SHOT_EXAMPLE = """
Text: "Dr. Sarah Chen, Chief Medical Officer at Valley Health (123 Oak St,
Portland, OR 97201), approved a $2.5M equipment purchase on March 15, 2025."

Expected extraction:
- Person: Dr. Sarah Chen, role=Chief Medical Officer, org=Valley Health
- Address: 123 Oak St, Portland, OR 97201
- Money: $2,500,000 USD (raw: "$2.5M")
- Date: 2025-03-15, type=exact (raw: "March 15, 2025")
"""

def extract_with_examples(text: str) -> DocumentEntities:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=DocumentEntities,
        messages=[
            {
                "role": "system",
                "content": f"Extract entities precisely. Example:\n{FEW_SHOT_EXAMPLE}"
            },
            {"role": "user", "content": text}
        ],
    )
```

Few-shot examples improve extraction accuracy by 10-20% on complex documents, especially for ambiguous cases like distinguishing between a person's location and a company's headquarters.

## FAQ

### How do I handle entities that span sentence boundaries?

Use overlapping chunking when splitting documents, with at least 1-2 sentences of overlap. After extraction, deduplicate entities by comparing normalized names. If an entity appears in the overlap region of two chunks, you will get it from both and can merge the attributes.
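The merge step can be as simple as keying entities on a normalized name and taking the union of non-null attributes. A sketch with plain dicts standing in for the Pydantic models (the normalization rule here is an illustrative assumption):

```python
from typing import Dict, List, Optional

def normalize_name(name: str) -> str:
    # Illustrative normalization: lowercase, collapse whitespace
    return " ".join(name.lower().split())

def merge_people(
    chunks: List[List[Dict[str, Optional[str]]]]
) -> List[Dict[str, Optional[str]]]:
    merged: Dict[str, Dict[str, Optional[str]]] = {}
    for chunk in chunks:
        for person in chunk:
            key = normalize_name(person["full_name"])
            existing = merged.setdefault(key, dict(person))
            # Fill attributes the other chunk left as None
            for attr, value in person.items():
                if existing.get(attr) is None and value is not None:
                    existing[attr] = value
    return list(merged.values())

chunk_a = [{"full_name": "Sarah Chen", "role": None}]
chunk_b = [{"full_name": "sarah  chen", "role": "Chief Medical Officer"}]
print(merge_people([chunk_a, chunk_b]))
# One entity, with the role filled in from the second chunk
```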

### When should I use spaCy instead of an LLM for entity extraction?

Use spaCy when you need sub-10ms latency, are extracting only standard entity types (person, org, location), and are processing millions of documents where LLM costs would be prohibitive. Use LLMs when you need custom entity types, attribute extraction, or when context-dependent interpretation is important.

### How do I measure extraction accuracy?

Create a gold-standard dataset of 100+ manually annotated documents. For each entity type, compute precision (the fraction of extracted entities that are correct), recall (the fraction of true entities that were found), and F1 score. Track accuracy separately per entity type, as some types are harder than others.
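Treating the gold annotations and the model's extractions as sets of normalized values, the metrics reduce to a few lines. A minimal sketch:

```python
from typing import Set, Tuple

def extraction_scores(gold: Set[str], predicted: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for one entity type."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {"sarah chen", "valley health", "james park"}
predicted = {"sarah chen", "valley health", "acme corp"}
print(extraction_scores(gold, predicted))  # precision = recall = f1 = 2/3 here
```

Matching on exact normalized strings is strict; in practice you may also want a fuzzy-match pass to avoid penalizing trivial spelling variants.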

---

#EntityExtraction #NER #StructuredOutputs #Pydantic #Python #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/extracting-entities-documents-names-dates-addresses-custom-types
