Learn Agentic AI

Building a Knowledge Graph Construction Agent: Extracting Entities and Relations from Documents

Build an AI agent that reads documents, extracts named entities and their relationships, constructs a knowledge graph stored in Neo4j, and provides a natural language query interface over the graph.

Why Knowledge Graphs for AI Agents

RAG retrieves document chunks. Knowledge graphs retrieve structured facts. When a user asks "which companies has Dr. Sarah Chen co-authored papers with in the last 3 years," a RAG system must search through dozens of paper chunks and hope the LLM connects the dots. A knowledge graph stores the relationship directly: (Dr. Sarah Chen)-[CO_AUTHORED]->(Paper X)<-[PUBLISHED_BY]-(Company Y) and returns precise answers in milliseconds.
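Against a graph with hypothetical Person, Paper, and Organization labels and a year property on papers, that question becomes a single Cypher pattern match (the labels and property names here are illustrative, not a fixed schema):

```cypher
MATCH (p:Person {name: "Dr. Sarah Chen"})-[:CO_AUTHORED]->(paper:Paper)
      <-[:PUBLISHED_BY]-(org:Organization)
WHERE paper.year >= date().year - 3
RETURN DISTINCT org.name
```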

A knowledge graph construction agent automates the labor-intensive process of reading documents, extracting entities, identifying relationships, and building the graph. Once built, the graph serves as a structured memory that any downstream agent can query.

Entity and Relation Extraction with Structured Output

The first step is extracting entities and relationships from text. Use the LLM with structured output to ensure consistent extraction.

from pydantic import BaseModel
from agents import Agent, Runner

class Entity(BaseModel):
    name: str
    type: str  # PERSON, ORGANIZATION, TECHNOLOGY, CONCEPT, LOCATION
    description: str

class Relation(BaseModel):
    source: str
    target: str
    relation_type: str  # WORKS_AT, FOUNDED, USES, COMPETES_WITH, etc.
    confidence: float
    evidence: str

class ExtractionResult(BaseModel):
    entities: list[Entity]
    relations: list[Relation]

extractor = Agent(
    name="Entity Extractor",
    instructions="""Extract all named entities and their relationships from the text.

Entity types: PERSON, ORGANIZATION, TECHNOLOGY, CONCEPT, LOCATION, EVENT, PRODUCT
Relation types: WORKS_AT, FOUNDED, ACQUIRED, PARTNERS_WITH, COMPETES_WITH,
                USES, DEVELOPED, LOCATED_IN, PART_OF, CAUSED

Rules:
- Only extract explicitly stated relationships, not inferred ones
- Set confidence between 0.0 and 1.0 based on how clearly the text states the relation
- Include the exact text evidence for each relation
- Normalize entity names (e.g., "Google" and "Google LLC" -> "Google")""",
    output_type=ExtractionResult,
)
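For a sentence like "Sarah Chen works at Acme Robotics, which was founded by Wei Zhang," a well-behaved extraction should produce output shaped like the following. This example is hand-written for illustration, not actual model output, and the names are invented:

```python
import json

# Hand-written example of the ExtractionResult shape the extractor
# is expected to return; all entity and relation values are illustrative.
example = {
    "entities": [
        {"name": "Sarah Chen", "type": "PERSON",
         "description": "Employee at Acme Robotics"},
        {"name": "Acme Robotics", "type": "ORGANIZATION",
         "description": "Robotics company founded by Wei Zhang"},
        {"name": "Wei Zhang", "type": "PERSON",
         "description": "Founder of Acme Robotics"},
    ],
    "relations": [
        {"source": "Sarah Chen", "target": "Acme Robotics",
         "relation_type": "WORKS_AT", "confidence": 0.95,
         "evidence": "Sarah Chen works at Acme Robotics"},
        {"source": "Wei Zhang", "target": "Acme Robotics",
         "relation_type": "FOUNDED", "confidence": 0.9,
         "evidence": "which was founded by Wei Zhang"},
    ],
}
print(json.dumps(example, indent=2))
```

Note that the evidence strings are verbatim spans from the source sentence; keeping them exact makes each edge auditable later.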

Chunking Documents for Extraction

Large documents need to be chunked before extraction, with overlap to catch cross-boundary entities.

def chunk_document(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split document into overlapping chunks for entity extraction."""
    words = text.split()
    chunks = []
    start = 0

    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start += chunk_size - overlap

    return chunks

async def extract_from_document(document_text: str) -> ExtractionResult:
    """Extract entities and relations from a full document."""
    chunks = chunk_document(document_text)
    all_entities: dict[str, Entity] = {}
    all_relations: list[Relation] = []

    for chunk in chunks:
        result = await Runner.run(extractor, chunk)
        extraction = result.final_output_as(ExtractionResult)

        # Deduplicate entities by name
        for entity in extraction.entities:
            key = entity.name.lower().strip()
            if key not in all_entities:
                all_entities[key] = entity

        all_relations.extend(extraction.relations)

    # Deduplicate relations
    unique_relations = deduplicate_relations(all_relations)

    return ExtractionResult(
        entities=list(all_entities.values()),
        relations=unique_relations,
    )

def deduplicate_relations(relations: list[Relation]) -> list[Relation]:
    """Merge duplicate relations, keeping the highest confidence."""
    seen: dict[str, Relation] = {}
    for rel in relations:
        key = f"{rel.source}|{rel.relation_type}|{rel.target}"
        if key not in seen or rel.confidence > seen[key].confidence:
            seen[key] = rel
    return list(seen.values())
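To see why the overlap matters, trace the window arithmetic: with chunk_size=1500 and overlap=200, the start index advances by 1300 words per step, so consecutive chunks share 200 words and an entity mention straddling a boundary appears whole in at least one chunk. A quick sketch of the index math:

```python
def chunk_bounds(n_words: int, chunk_size: int = 1500, overlap: int = 200):
    """Yield the (start, end) word indices produced by the sliding
    window in chunk_document above."""
    start = 0
    while start < n_words:
        end = min(start + chunk_size, n_words)
        yield (start, end)
        start += chunk_size - overlap

# A 3,000-word document yields three overlapping chunks:
print(list(chunk_bounds(3000)))
# [(0, 1500), (1300, 2800), (2600, 3000)]
```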

Storing in Neo4j

Neo4j is a natural storage layer for knowledge graphs, and its Cypher query language makes both insertion and querying intuitive. Note that the code below relies on two APOC procedures (apoc.create.addLabels and apoc.merge.relationship) to set labels and relationship types dynamically, so the APOC plugin must be installed on the Neo4j server.


from neo4j import AsyncGraphDatabase

class KnowledgeGraphStore:
    def __init__(self, uri: str, user: str, password: str):
        self.driver = AsyncGraphDatabase.driver(uri, auth=(user, password))

    async def store_extraction(self, extraction: ExtractionResult):
        async with self.driver.session() as session:
            # Create entity nodes
            for entity in extraction.entities:
                await session.run(
                    """
                    MERGE (e:Entity {name: $name})
                    SET e.type = $type, e.description = $description
                    WITH e
                    CALL apoc.create.addLabels(e, [$type]) YIELD node
                    RETURN node
                    """,
                    name=entity.name,
                    type=entity.type,
                    description=entity.description,
                )

            # Create relationship edges
            for rel in extraction.relations:
                await session.run(
                    """
                    MATCH (source:Entity {name: $source})
                    MATCH (target:Entity {name: $target})
                    CALL apoc.merge.relationship(
                        source, $rel_type, {},
                        {confidence: $confidence, evidence: $evidence},
                        target, {confidence: $confidence, evidence: $evidence}
                    ) YIELD rel
                    RETURN rel
                    """,
                    source=rel.source,
                    target=rel.target,
                    rel_type=rel.relation_type,
                    confidence=rel.confidence,
                    evidence=rel.evidence,
                )

    async def query(self, cypher: str, params: dict | None = None) -> list[dict]:
        async with self.driver.session() as session:
            result = await session.run(cypher, params or {})
            return [record.data() async for record in result]

    async def close(self):
        await self.driver.close()

Natural Language Query Interface

Let the agent translate natural language questions into Cypher queries.

import json

from agents import Agent, function_tool

graph_store = KnowledgeGraphStore(
    uri="bolt://localhost:7687", user="neo4j", password="password"
)

@function_tool
async def query_knowledge_graph(cypher_query: str) -> str:
    """Execute a Cypher query against the knowledge graph and return results."""
    try:
        results = await graph_store.query(cypher_query)
        return json.dumps(results, indent=2, default=str)
    except Exception as e:
        return f"Query error: {e}"

@function_tool
async def get_graph_schema() -> str:
    """Get the current schema of the knowledge graph."""
    results = await graph_store.query(
        "CALL db.schema.visualization() YIELD nodes, relationships RETURN *"
    )
    return json.dumps(results, default=str)

query_agent = Agent(
    name="Knowledge Graph Query Agent",
    instructions="""You answer questions using a Neo4j knowledge graph.

    First call get_graph_schema to understand the available entity types
    and relationships. Then construct a Cypher query to answer the question.

    Cypher tips:
    - Use MATCH patterns: (a:Entity)-[r:RELATION]->(b:Entity)
    - Use WHERE for filtering: WHERE a.type = 'PERSON'
    - Use RETURN to specify output columns
    - Use ORDER BY and LIMIT for ranking
    """,
    tools=[query_knowledge_graph, get_graph_schema],
)

Running the Full Pipeline

async def build_and_query_graph():
    # Step 1: Extract from documents
    # load_documents is a placeholder for your own loader; it should yield
    # objects with .text and .name attributes
    documents = load_documents("./research_papers/")
    for doc in documents:
        extraction = await extract_from_document(doc.text)
        await graph_store.store_extraction(extraction)
        print(f"Stored {len(extraction.entities)} entities, "
              f"{len(extraction.relations)} relations from {doc.name}")

    # Step 2: Query the graph
    result = await Runner.run(
        query_agent,
        "Which organizations are working on transformer architectures?"
    )
    print(result.final_output)

FAQ

How do you handle entity resolution when the same entity appears with different names?

Entity resolution (also called entity linking) requires a normalization step. After extraction, run a secondary LLM pass that compares entity names and descriptions to identify duplicates. Use Levenshtein distance for similar spellings and cosine similarity of entity descriptions for semantic matching. When a match is found, merge the entities in Neo4j using MERGE with a canonical name.
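A minimal sketch of the string-similarity half of that matching, using the standard library's difflib as a stand-in for a dedicated Levenshtein implementation (the threshold and entity names are illustrative):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Surface-form similarity in [0, 1]; higher means more alike."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def candidate_merges(names: list[str], threshold: float = 0.7) -> list[tuple[str, str]]:
    """Pairs of entity names similar enough to deserve a semantic check
    (description embedding comparison, or an LLM pass) before merging."""
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if name_similarity(a, b) >= threshold:
                pairs.append((a, b))
    return pairs

print(candidate_merges(["Google", "Google LLC", "Alphabet", "google"]))
```

String similarity alone would never merge "Google" with "Alphabet", which is why the semantic pass over descriptions is the second, necessary filter.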

How large can the knowledge graph get before query performance degrades?

Neo4j handles millions of nodes and relationships efficiently with proper indexing. Create indexes on Entity.name and Entity.type. For graphs with over 10 million edges, use Neo4j's query profiling (PROFILE prefix) to identify slow traversals and add targeted composite indexes. Most natural language queries translate to 2-3 hop traversals, which remain fast even on large graphs.
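The two single-property indexes mentioned above can be created once, up front (the index names are arbitrary):

```cypher
CREATE INDEX entity_name IF NOT EXISTS FOR (e:Entity) ON (e.name);
CREATE INDEX entity_type IF NOT EXISTS FOR (e:Entity) ON (e.type);
```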

Can you incrementally update the graph as new documents arrive?

Yes, and that is the primary advantage of MERGE over CREATE in the Cypher queries. MERGE creates the node or relationship only if it does not already exist. When a new document mentions an existing entity with new relationships, only the new edges are added. Track document provenance by adding PROCESSED_FROM relationships between entities and source document nodes.
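The upsert semantics MERGE provides can be illustrated with a small in-memory stand-in: re-processing the same relation leaves the edge set unchanged, while a genuinely new relation adds exactly one edge. This mirrors the Cypher behavior; it is not how Neo4j implements it.

```python
def merge_edge(edges: set[tuple[str, str, str]],
               source: str, rel_type: str, target: str) -> set:
    """MERGE-like upsert: add the edge only if it is not already present."""
    edges.add((source, rel_type, target))  # set semantics make this idempotent
    return edges

edges: set[tuple[str, str, str]] = set()
merge_edge(edges, "Sarah Chen", "WORKS_AT", "Acme Robotics")
merge_edge(edges, "Sarah Chen", "WORKS_AT", "Acme Robotics")  # no-op, already present
merge_edge(edges, "Wei Zhang", "FOUNDED", "Acme Robotics")    # adds one new edge
print(len(edges))  # 2
```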


#KnowledgeGraphs #EntityExtraction #Neo4j #NLP #GraphDatabases #AIAgents #StructuredData #InformationExtraction


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

