Keyword Extraction and Topic Modeling for Agent Knowledge Organization

Learn keyword extraction with TF-IDF and KeyBERT, topic modeling with BERTopic and LDA, and how to build agent knowledge organization systems that automatically categorize and cluster documents.

Why Agents Need Keyword and Topic Understanding

An AI agent managing a knowledge base of thousands of documents needs to organize, search, and retrieve information efficiently. Keyword extraction identifies the most representative terms in a document. Topic modeling discovers latent themes across a collection of documents. Together, they give agents the ability to automatically tag content, cluster related documents, and route queries to the most relevant knowledge source.

Keyword Extraction with TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a simple, well-understood baseline for keyword extraction. It scores a term highly when the term is frequent in a specific document but rare across the corpus. Those are exactly the terms that distinguish one document from another.
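
To make the scoring rule concrete, here is a minimal from-scratch sketch using the textbook form tf × log(N / df). The function name `tfidf_scores` and the toy corpus are illustrative assumptions. scikit-learn's TfidfVectorizer applies a smoothed IDF and L2-normalizes each document by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tfidf_scores(docs: list[list[str]], doc_index: int) -> dict[str, float]:
    """Score each term in one tokenized document with plain TF-IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    counts = Counter(docs[doc_index])
    total = len(docs[doc_index])
    return {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }

# Toy corpus (hypothetical), already tokenized by whitespace.
docs = [
    "agents retrieve documents with keywords".split(),
    "agents cluster documents by topic".split(),
    "agents route queries".split(),
]
scores = tfidf_scores(docs, doc_index=0)
ranked = sorted(scores, key=scores.get, reverse=True)
```

A term that appears in every document ("agents") scores zero because log(3/3) = 0, while a term unique to the document ("retrieve") outranks one shared with another document ("documents").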

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def extract_keywords_tfidf(
    documents: list[str],
    doc_index: int,
    top_n: int = 10,
) -> list[tuple[str, float]]:
    """Extract top keywords for a specific document using TF-IDF."""
    vectorizer = TfidfVectorizer(
        stop_words="english",
        ngram_range=(1, 2),
        max_features=10000,
    )
    tfidf_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()

    doc_vector = tfidf_matrix[doc_index].toarray().flatten()
    top_indices = np.argsort(doc_vector)[-top_n:][::-1]

    return [
        (feature_names[i], round(doc_vector[i], 4))
        for i in top_indices
        if doc_vector[i] > 0
    ]

documents = [
    "Neural networks use backpropagation for gradient-based optimization.",
    "Kubernetes orchestrates container deployments across clusters.",
    "BERT embeddings capture contextual word representations.",
]

keywords = extract_keywords_tfidf(documents, doc_index=0, top_n=5)
# Returns (term, score) pairs, highest score first. In a corpus this small,
# every term in document 0 appears in no other document, so all scores tie.

Keyword Extraction with KeyBERT

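KeyBERT uses sentence embeddings to find the words and phrases that are semantically closest to the document as a whole. Because it scores candidates against a single document's embedding rather than against corpus statistics, it often produces more contextually relevant keywords than TF-IDF, especially for short texts or when no reference corpus is available.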

from keybert import KeyBERT

kw_model = KeyBERT(model="all-MiniLM-L6-v2")

def extract_keywords_bert(
    text: str,
    top_n: int = 10,
    diversity: float = 0.5,
) -> list[tuple[str, float]]:
    """Extract keywords using semantic similarity."""
    keywords = kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(1, 2),
        stop_words="english",
        top_n=top_n,
        use_mmr=True,           # Maximal Marginal Relevance
        diversity=diversity,     # 0 = most similar, 1 = most diverse
    )
    return keywords

text = """Reinforcement learning agents learn optimal policies through
trial and error, maximizing cumulative reward in an environment.
Policy gradient methods like PPO and SAC are widely used for
continuous control tasks in robotics and game playing."""

keywords = extract_keywords_bert(text, top_n=5)
# Illustrative output; exact phrases and scores depend on the model version:
# [('reinforcement learning', 0.72), ('policy gradient', 0.65),
#  ('cumulative reward', 0.58), ('continuous control', 0.54),
#  ('optimal policies', 0.51)]

The diversity parameter controls the trade-off between relevance and variety. Set it higher when you want keywords that cover different aspects of the document rather than clustering around a single theme.
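
The trade-off behind the diversity parameter can be sketched in plain Python. This is a simplified illustration of Maximal Marginal Relevance, not KeyBERT's actual code. The names `cosine` and `mmr_select` and the 2-D toy vectors (standing in for real embeddings) are assumptions made for the example.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(doc_vec, cand_vecs, top_n: int, diversity: float) -> list[int]:
    """Return indices of candidates chosen by Maximal Marginal Relevance."""
    if not cand_vecs or top_n <= 0:
        return []
    relevance = [cosine(doc_vec, c) for c in cand_vecs]
    # Start with the candidate most similar to the document.
    selected = [max(range(len(cand_vecs)), key=relevance.__getitem__)]
    remaining = [i for i in range(len(cand_vecs)) if i != selected[0]]
    while remaining and len(selected) < top_n:
        def mmr_score(i: int) -> float:
            # Penalize similarity to anything already selected.
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            return (1 - diversity) * relevance[i] - diversity * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy vectors (hypothetical): candidates 0 and 1 are near-duplicates,
# candidate 2 is less relevant but covers a different direction.
doc = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]

low_diversity = mmr_select(doc, candidates, top_n=2, diversity=0.0)
high_diversity = mmr_select(doc, candidates, top_n=2, diversity=0.7)
```

With diversity at 0 the two near-duplicates are picked. At 0.7 the second pick switches to the candidate that differs from the first, even though it is less similar to the document.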

Topic Modeling with BERTopic

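BERTopic is a widely used embedding-based approach to topic modeling. It embeds documents with a sentence-embedding model, reduces dimensionality with UMAP, clusters the result with HDBSCAN, and describes each cluster by its most distinctive terms using a class-based TF-IDF. The number of topics is discovered from the data rather than fixed in advance.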

from bertopic import BERTopic

def discover_topics(
    documents: list[str],
    min_topic_size: int = 5,
) -> tuple[BERTopic, list[int]]:
    """Discover topics in a document collection."""
    topic_model = BERTopic(
        language="english",
        min_topic_size=min_topic_size,
        verbose=False,
    )
    topics, probabilities = topic_model.fit_transform(documents)
    return topic_model, topics

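# Example with a collection of support tickets.
# Note: nine tickets only illustrate the API. BERTopic's default UMAP and
# HDBSCAN settings generally need far more documents (hundreds or more)
# to fit reliably and produce meaningful topics.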
tickets = [
    "Cannot log in after password reset",
    "Login page shows 500 error",
    "Password reset email never arrived",
    "Invoice amount is incorrect",
    "Charged twice for same subscription",
    "Need a refund for duplicate charge",
    "App crashes on Android 14",
    "Mobile app freezes when uploading photos",
    "App not compatible with my phone",
]

model, topics = discover_topics(tickets, min_topic_size=2)

# View discovered topics
topic_info = model.get_topic_info()
print(topic_info[["Topic", "Count", "Name"]].head())

# Get the topic for a new document
new_topic, _ = model.transform(["My login credentials are not working"])
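
Once a query has a topic id, an agent still has to decide where to send it. The sketch below shows one possible routing step. The queue names, the `TOPIC_ROUTES` table, and the `route_for_topic` helper are hypothetical. The only BERTopic behavior it relies on is that -1 denotes the outlier topic. Topic ids can change when a model is refit, so a table like this would need to be rebuilt after retraining.

```python
# Hypothetical mapping from topic id to a support queue. In practice you
# would fill this in after inspecting model.get_topic_info().
TOPIC_ROUTES = {0: "auth-support", 1: "billing", 2: "mobile-app"}
FALLBACK_ROUTE = "human-triage"

def route_for_topic(topic_id: int) -> str:
    """Map a predicted topic id to a queue; -1 is BERTopic's outlier topic."""
    if topic_id == -1:
        return FALLBACK_ROUTE
    # Unknown ids also fall back, e.g. after the model gains a new topic.
    return TOPIC_ROUTES.get(topic_id, FALLBACK_ROUTE)

queue = route_for_topic(1)
```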

Classical Topic Modeling with LDA

Latent Dirichlet Allocation (LDA) is a probabilistic model that represents each document as a mixture of topics, where each topic is a distribution over words. It is lighter weight than BERTopic and works well when you need interpretable, fixed-size topic distributions.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def lda_topics(
    documents: list[str],
    n_topics: int = 5,
    top_words: int = 8,
) -> list[dict]:
    """Discover topics using LDA."""
    vectorizer = CountVectorizer(
        stop_words="english",
        max_features=5000,
    )
    doc_term_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()

    lda = LatentDirichletAllocation(
        n_components=n_topics,
        random_state=42,
        max_iter=20,
    )
    lda.fit(doc_term_matrix)

    topics = []
    for idx, topic_dist in enumerate(lda.components_):
        top_indices = topic_dist.argsort()[-top_words:][::-1]
        words = [feature_names[i] for i in top_indices]
        topics.append({"topic_id": idx, "keywords": words})

    return topics
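
The mixture view behind LDA can be shown with a few lines of arithmetic: the probability of a word in a document is the sum, over topics, of the document's weight on that topic times the topic's probability for that word. The distributions below are made-up numbers for illustration, and `word_probability` is a hypothetical helper. In scikit-learn, lda.transform() returns the per-document topic weights, and the rows of lda.components_ give the topic-word weights once each row is normalized to sum to 1.

```python
def word_probability(
    word: str,
    doc_topic_mix: dict[str, float],
    topic_word_dists: dict[str, dict[str, float]],
) -> float:
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topic_word_dists[topic].get(word, 0.0)
        for topic, weight in doc_topic_mix.items()
    )

# Made-up topic-word distributions (each sums to 1).
topic_word_dists = {
    "billing": {"invoice": 0.5, "refund": 0.4, "login": 0.1},
    "auth": {"login": 0.6, "password": 0.4},
}
# A document that is mostly about billing, with some authentication content.
doc_topic_mix = {"billing": 0.75, "auth": 0.25}

p_login = word_probability("login", doc_topic_mix, topic_word_dists)
# 0.75 * 0.1 + 0.25 * 0.6 = 0.225
```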

Building an Agent Knowledge Organizer

The class below combines both techniques: BERTopic assigns each document to a topic, KeyBERT tags it with keywords, and the agent retrieves documents by matching a query's predicted topic.

from dataclasses import dataclass
from keybert import KeyBERT
from bertopic import BERTopic

@dataclass
class TaggedDocument:
    text: str
    keywords: list[str]
    topic_id: int
    topic_label: str

class KnowledgeOrganizer:
    def __init__(self):
        self.keyword_model = KeyBERT(model="all-MiniLM-L6-v2")
        self.topic_model = None
        self.documents: list[TaggedDocument] = []

    def index_documents(self, texts: list[str]) -> list[TaggedDocument]:
        self.topic_model = BERTopic(min_topic_size=3, verbose=False)
        topics, _ = self.topic_model.fit_transform(texts)

        tagged = []
        for text, topic_id in zip(texts, topics):
            keywords = self.keyword_model.extract_keywords(
                text, top_n=5, stop_words="english"
            )
            label = self.topic_model.get_topic(topic_id)
            topic_label = label[0][0] if label and topic_id != -1 else "misc"

            doc = TaggedDocument(
                text=text,
                keywords=[kw for kw, _ in keywords],
                topic_id=topic_id,
                topic_label=topic_label,
            )
            tagged.append(doc)

        self.documents = tagged
        return tagged

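    def find_related(self, query: str, top_n: int = 5) -> list[TaggedDocument]:
        """Return up to top_n indexed documents that share the query's topic."""
        if self.topic_model is None:
            raise RuntimeError("Call index_documents() before find_related().")
        topic, _ = self.topic_model.transform([query])
        # Documents come back in indexing order; they are not ranked by relevance.
        return [
            doc for doc in self.documents
            if doc.topic_id == topic[0]
        ][:top_n]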
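
Matching on topic alone returns documents in indexing order. One way to order them, sketched here as an assumption rather than part of the organizer above, is to rank candidates by how many keywords they share with the query. The helper `rank_by_keyword_overlap` is hypothetical, and TaggedDocument is repeated with the same fields so the sketch runs on its own.

```python
from dataclasses import dataclass

@dataclass
class TaggedDocument:  # same fields as the organizer's class, repeated here
    text: str
    keywords: list[str]
    topic_id: int
    topic_label: str

def rank_by_keyword_overlap(
    query_keywords: list[str],
    candidates: list[TaggedDocument],
    top_n: int = 5,
) -> list[TaggedDocument]:
    """Order documents by Jaccard overlap between their keywords and the query's."""
    query_set = {kw.lower() for kw in query_keywords}

    def jaccard(doc: TaggedDocument) -> float:
        doc_set = {kw.lower() for kw in doc.keywords}
        union = query_set | doc_set
        return len(query_set & doc_set) / len(union) if union else 0.0

    return sorted(candidates, key=jaccard, reverse=True)[:top_n]

# Two documents from the same (hypothetical) billing topic.
same_topic = [
    TaggedDocument("Invoice amount is incorrect", ["invoice", "amount"], 1, "invoice"),
    TaggedDocument("Need a refund for duplicate charge", ["refund", "duplicate", "charge"], 1, "invoice"),
]
ranked = rank_by_keyword_overlap(["refund", "charge"], same_topic)
```

In a real system the query keywords could come from the same KeyBERT model used at indexing time, so that both sides use a comparable vocabulary.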

FAQ

What is the difference between keyword extraction and topic modeling?

Keyword extraction operates on individual documents, identifying the most important terms within a single text. Topic modeling operates on a collection of documents, discovering shared themes that span multiple documents. Keywords describe what a single document is about. Topics describe what groups of documents have in common.

How do I choose between BERTopic and LDA for my agent?

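Use BERTopic when you need semantically coherent topics, including on short texts such as tickets or chat messages, and can afford the extra compute and memory. A GPU speeds up embedding but is not required. Use LDA when you need a lightweight model with interpretable per-document topic distributions over longer documents, or when you want results that are simple to reproduce with a fixed random_state. LDA tends to struggle on very short texts because there is little word co-occurrence to learn from. BERTopic generally produces better topics but costs more to run.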

How do I handle new documents that arrive after the initial topic model is trained?

BERTopic's transform() method assigns topics to new documents without retraining: pass the new texts and read back their topic ids. For periodic updates, BERTopic's merge_models() can combine an existing model with one trained on new data. Schedule full retraining when topic drift becomes noticeable, for example when a growing share of new documents lands in the outlier topic (-1).
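
One simple, heuristic way to notice drift is to track how often new documents fall into the outlier topic. The sketch below assumes topic ids as returned by transform(), where -1 marks an outlier. The helper names, the baseline rate, and the tolerance value are illustrative assumptions, not recommended settings.

```python
def outlier_rate(topic_ids: list[int]) -> float:
    """Fraction of documents assigned to the outlier topic (-1)."""
    if not topic_ids:
        return 0.0
    return sum(1 for t in topic_ids if t == -1) / len(topic_ids)

def needs_retraining(
    topic_ids: list[int],
    baseline_rate: float,
    tolerance: float = 0.10,
) -> bool:
    """Flag drift when the outlier rate exceeds the baseline by more than tolerance."""
    return outlier_rate(topic_ids) > baseline_rate + tolerance

# Hypothetical batch: half of the new documents did not fit any known topic.
recent_topics = [0, -1, 1, -1]
drifted = needs_retraining(recent_topics, baseline_rate=0.2)
```

The baseline would typically be the outlier rate observed on the training data, measured right after the model was fit.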

