---
title: "Building a Resume Parser with Structured Outputs: End-to-End Tutorial"
description: "Build a complete resume parsing pipeline from PDF to structured data. Covers PDF text extraction, schema design for work experience and education, LLM extraction, validation, and output formatting."
canonical: https://callsphere.ai/blog/building-resume-parser-structured-outputs-end-to-end-tutorial
category: "Learn Agentic AI"
tags: ["Resume Parser", "Data Extraction", "PDF", "Structured Outputs", "Tutorial"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-07T12:50:41.601Z
---

# Building a Resume Parser with Structured Outputs: End-to-End Tutorial

> Build a complete resume parsing pipeline from PDF to structured data. Covers PDF text extraction, schema design for work experience and education, LLM extraction, validation, and output formatting.

## Why Build a Resume Parser?

Resume parsing is a classic structured extraction problem. Resumes contain predictable data types (names, dates, companies, skills) but wildly inconsistent formatting. Traditional regex-based parsers break on every new resume template. LLM-based parsers handle any format because they understand the content semantically, not syntactically.

In this tutorial, you will build a complete pipeline: PDF input, text extraction, LLM-powered structured extraction, validation, and clean JSON output.

## Step 1: Define the Schema

Start by modeling what a parsed resume looks like:

```python
from pydantic import BaseModel, Field, field_validator
from typing import List, Optional
from datetime import date

class ContactInfo(BaseModel):
    full_name: str
    email: Optional[str] = None
    phone: Optional[str] = None
    location: Optional[str] = Field(default=None, description="City, State or City, Country")
    linkedin_url: Optional[str] = None
    portfolio_url: Optional[str] = None

class WorkExperience(BaseModel):
    company: str
    title: str
    start_date: Optional[str] = Field(default=None, description="YYYY-MM format")
    end_date: Optional[str] = Field(default=None, description="YYYY-MM or 'Present'")
    location: Optional[str] = None
    description: Optional[str] = None
    achievements: List[str] = Field(default_factory=list)

class Education(BaseModel):
    institution: str
    degree: Optional[str] = None
    field_of_study: Optional[str] = None
    start_date: Optional[str] = None
    end_date: Optional[str] = None
    gpa: Optional[float] = Field(default=None, ge=0.0, le=4.0)

class ParsedResume(BaseModel):
    contact: ContactInfo
    summary: Optional[str] = Field(default=None, description="Professional summary or objective")
    work_experience: List[WorkExperience]
    education: List[Education]
    skills: List[str]
    certifications: List[str] = Field(default_factory=list)
    languages: List[str] = Field(default_factory=list)
```

Design choices matter here. Giving non-essential fields an `Optional` type with a `None` default gives the model a legitimate way to leave missing data empty instead of inventing values. The `YYYY-MM` format for dates handles the common resume pattern where exact days are not listed.
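
A quick illustration of that behavior (not part of the pipeline): validating a partial record against the schema shows missing optional fields staying `None` or empty rather than being filled in.

```python
# Fields the resume never mentions stay at their defaults.
partial = ParsedResume.model_validate({
    "contact": {"full_name": "Jane Doe"},
    "work_experience": [],
    "education": [],
    "skills": ["Python"],
})
print(partial.contact.email)    # None
print(partial.certifications)   # []
```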

## Step 2: Extract Text from PDF

Use PyMuPDF (fitz) for reliable text extraction:

```bash
pip install pymupdf
```

```python
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text from a PDF file, preserving basic structure."""
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        text = page.get_text("text")
        pages.append(text)
    doc.close()
    return "\n\n".join(pages)

# Usage
resume_text = extract_text_from_pdf("resume.pdf")
print(f"Extracted {len(resume_text)} characters")
```

PyMuPDF handles most PDF formats, including those with columns, tables, and embedded fonts. For scanned PDFs (images), you would need OCR — add `pytesseract` as a preprocessing step.
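
A minimal OCR fallback sketch, assuming `pytesseract` (plus Tesseract itself) and Pillow are installed; the 50-character threshold is an arbitrary heuristic for detecting scanned pages:

```python
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def extract_text_with_ocr_fallback(pdf_path: str) -> str:
    """Extract text normally; fall back to OCR for pages with little or no text."""
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        text = page.get_text("text")
        if len(text.strip()) < 50:  # likely a scanned page
            pix = page.get_pixmap(dpi=300)  # render the page to an image
            img = Image.open(io.BytesIO(pix.tobytes("png")))
            text = pytesseract.image_to_string(img)
        pages.append(text)
    doc.close()
    return "\n\n".join(pages)
```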

## Step 3: LLM Extraction

Send the extracted text to the LLM with your schema:

```python
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

def parse_resume(resume_text: str) -> ParsedResume:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=ParsedResume,
        max_retries=3,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert resume parser. Extract structured data "
                    "from the resume text. Rules:\n"
                    "- Only extract information explicitly stated in the resume\n"
                    "- Use null for fields not present in the text\n"
                    "- List achievements as separate bullet points\n"
                    "- Normalize dates to YYYY-MM format when possible\n"
                    "- List skills as individual items, not comma-separated strings"
                )
            },
            {"role": "user", "content": resume_text}
        ],
    )
```
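
With the schema and client in place, parsing is a single call on the text from Step 2:

```python
resume = parse_resume(resume_text)
print(resume.contact.full_name)
print(f"{len(resume.work_experience)} positions, {len(resume.skills)} skills")
```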

## Step 4: Add Validation

Add validators that catch common LLM extraction errors:

```python
from pydantic import model_validator
import re

class ValidatedResume(ParsedResume):

    @model_validator(mode="after")
    def validate_work_dates(self) -> "ValidatedResume":
        """Ensure work experience dates are chronologically valid."""
        date_pattern = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")

        for job in self.work_experience:
            if job.start_date and not date_pattern.match(job.start_date):
                if job.start_date.lower() != "present":
                    raise ValueError(
                        f"Invalid start_date format: '{job.start_date}' for {job.company}"
                    )
            if job.end_date and job.end_date.lower() != "present":
                if not date_pattern.match(job.end_date):
                    raise ValueError(
                        f"Invalid end_date format: '{job.end_date}' for {job.company}"
                    )
        return self

    @field_validator("skills")
    @classmethod
    def deduplicate_skills(cls, v: List[str]) -> List[str]:
        """Remove duplicate skills (case-insensitive)."""
        seen = set()
        unique = []
        for skill in v:
            normalized = skill.lower().strip()
            if normalized not in seen:
                seen.add(normalized)
                unique.append(skill.strip())
        return unique
```

When Instructor detects a validation error, it automatically retries the LLM call with the error message appended. The model sees "Invalid start_date format: 'March 2022'" and corrects it to "2022-03" on the next attempt. Note that the retries only cover validators on the class you pass as `response_model`, so use `ValidatedResume` in the extraction call.
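
A minimal sketch of wiring the validated model in; it mirrors `parse_resume` from Step 3, with the system prompt shortened here for brevity:

```python
def parse_resume_validated(resume_text: str) -> ValidatedResume:
    # Same call as parse_resume, but the response_model now carries the
    # date-format and skill-deduplication validators, so Instructor re-asks
    # the model (up to max_retries times) whenever validation fails.
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=ValidatedResume,
        max_retries=3,
        messages=[
            {"role": "system", "content": (
                "You are an expert resume parser. Extract structured data; "
                "use null for missing fields and normalize dates to YYYY-MM."
            )},
            {"role": "user", "content": resume_text},
        ],
    )
```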

## Step 5: Output Formatting

Convert the parsed resume to your target format:

```python
import json

def resume_to_json(parsed: ParsedResume) -> str:
    """Export parsed resume as formatted JSON."""
    return parsed.model_dump_json(indent=2, exclude_none=True)

def resume_to_csv_row(parsed: ParsedResume) -> dict:
    """Flatten resume for CSV/spreadsheet export."""
    return {
        "name": parsed.contact.full_name,
        "email": parsed.contact.email,
        "phone": parsed.contact.phone,
        "location": parsed.contact.location,
        "latest_company": parsed.work_experience[0].company if parsed.work_experience else None,
        "latest_title": parsed.work_experience[0].title if parsed.work_experience else None,
        "years_experience": len(parsed.work_experience),
        "highest_degree": parsed.education[0].degree if parsed.education else None,
        "skills": ", ".join(parsed.skills),
        "num_certifications": len(parsed.certifications),
    }
```
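
For batch workflows, the flattened rows drop straight into `csv.DictWriter`. A small sketch, where the output filename is just a placeholder:

```python
import csv

def export_resumes_to_csv(parsed_resumes: list[ParsedResume], out_path: str = "candidates.csv") -> None:
    """Write one flattened row per parsed resume."""
    rows = [resume_to_csv_row(p) for p in parsed_resumes]
    if not rows:
        return
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```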

## Complete Pipeline

```python
def process_resume(pdf_path: str) -> dict:
    """End-to-end resume processing pipeline."""
    # Extract text
    text = extract_text_from_pdf(pdf_path)

    if len(text.strip()) < 50:
        raise ValueError("PDF appears empty or unreadable. Try OCR.")

    # Parse with LLM
    parsed = parse_resume(text)

    # Return structured output
    return {
        "parsed": parsed.model_dump(exclude_none=True),
        "json": resume_to_json(parsed),
        "csv_row": resume_to_csv_row(parsed),
    }

result = process_resume("candidate_resume.pdf")
print(json.dumps(result["parsed"], indent=2))
```
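
To run the pipeline over a whole folder of PDFs, a simple batch wrapper works. A sketch, where the folder path is a placeholder and failures are collected rather than retried:

```python
from pathlib import Path

def process_resume_folder(folder: str) -> tuple[list[dict], list[tuple[str, str]]]:
    """Process every PDF in a folder, collecting results and failures."""
    results, failures = [], []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            results.append(process_resume(str(pdf)))
        except Exception as exc:  # unreadable PDF, validation failure after retries, etc.
            failures.append((pdf.name, str(exc)))
    return results, failures

results, failures = process_resume_folder("resumes/")
print(f"Parsed {len(results)} resumes, {len(failures)} failures")
```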

## FAQ

### How accurate is LLM-based resume parsing compared to commercial parsers?

In tests on diverse resume formats, GPT-4o achieves 90-95% field-level accuracy on standard fields like name, email, and company names. Commercial parsers like Sovren or Textkernel achieve similar accuracy on standard formats but struggle more with creative or non-standard layouts where LLMs excel.

### How do I handle multi-page resumes?

PyMuPDF concatenates all pages automatically. For resumes over four pages, the full text may exceed the length at which extraction quality stays reliable. In that case, extract contact info and the summary from page 1, work experience from the middle pages, and education and skills from the final section, then merge the results, as sketched below.
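
A rough sketch of that split-and-merge approach; the two-chunk split is a simplifying assumption, and required fields such as `contact.full_name` may be filled loosely on chunks that lack them, so the first chunk's contact block is kept:

```python
def parse_long_resume(pdf_path: str) -> ParsedResume:
    """Parse a long resume in page chunks, then merge the partial results."""
    doc = fitz.open(pdf_path)
    page_texts = [page.get_text("text") for page in doc]
    doc.close()

    # Page 1 usually carries contact info and the summary; later pages carry
    # work experience, education, and skills.
    chunks = ["\n\n".join(page_texts[:1]), "\n\n".join(page_texts[1:])]
    partials = [parse_resume(chunk) for chunk in chunks if chunk.strip()]

    merged = partials[0]  # keep page 1's contact info and summary
    for part in partials[1:]:
        merged.work_experience.extend(part.work_experience)
        merged.education.extend(part.education)
        merged.skills.extend(part.skills)
        merged.certifications.extend(part.certifications)
    return merged
```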

### What about data privacy when sending resumes to OpenAI?

Resumes contain sensitive personal information. Under OpenAI's API data usage policy, API data is not used for training by default. For strict privacy requirements, run a local model via Ollama or vLLM through Instructor's OpenAI-compatible mode; this keeps all data on your infrastructure.
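
A minimal local-model sketch, assuming Ollama is running on its default port with its OpenAI-compatible endpoint; the model name is only an example:

```python
import instructor
from openai import OpenAI

local_client = instructor.from_openai(
    OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
    mode=instructor.Mode.JSON,  # JSON mode, since most local models lack native tool calling
)

def parse_resume_locally(resume_text: str) -> ParsedResume:
    return local_client.chat.completions.create(
        model="llama3.1",  # example model pulled into Ollama
        response_model=ParsedResume,
        max_retries=3,
        messages=[
            {"role": "system", "content": "Extract structured resume data. Use null for missing fields."},
            {"role": "user", "content": resume_text},
        ],
    )
```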

---

#ResumeParser #DataExtraction #PDF #StructuredOutputs #Tutorial #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/building-resume-parser-structured-outputs-end-to-end-tutorial
