
Keyword Extraction and Topic Modeling for Agent Knowledge Organization

Learn keyword extraction with TF-IDF and KeyBERT, topic modeling with BERTopic and LDA, and how to build agent knowledge organization systems that automatically categorize and cluster documents.

Why Agents Need Keyword and Topic Understanding

An AI agent managing a knowledge base of thousands of documents needs to organize, search, and retrieve information efficiently. Keyword extraction identifies the most representative terms in a document. Topic modeling discovers latent themes across a collection of documents. Together, they give agents the ability to automatically tag content, cluster related documents, and route queries to the most relevant knowledge source.

Keyword Extraction with TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) remains one of the most reliable keyword extraction methods. It identifies terms that are frequent in a specific document but rare across the corpus — exactly the terms that distinguish one document from another.

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def extract_keywords_tfidf(
    documents: list[str],
    doc_index: int,
    top_n: int = 10,
) -> list[tuple[str, float]]:
    """Extract top keywords for a specific document using TF-IDF."""
    vectorizer = TfidfVectorizer(
        stop_words="english",
        ngram_range=(1, 2),
        max_features=10000,
    )
    tfidf_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()

    doc_vector = tfidf_matrix[doc_index].toarray().flatten()
    top_indices = np.argsort(doc_vector)[-top_n:][::-1]

    return [
        (feature_names[i], round(float(doc_vector[i]), 4))
        for i in top_indices
        if doc_vector[i] > 0
    ]

documents = [
    "Neural networks use backpropagation for gradient-based optimization.",
    "Kubernetes orchestrates container deployments across clusters.",
    "BERT embeddings capture contextual word representations.",
]

keywords = extract_keywords_tfidf(documents, doc_index=0, top_n=5)
# [('backpropagation', 0.4721), ('gradient', 0.3891), ...]

Keyword Extraction with KeyBERT

KeyBERT uses sentence embeddings to find keywords that are semantically closest to the overall document meaning. It produces more contextually relevant keywords than TF-IDF, especially for short texts.

from keybert import KeyBERT

kw_model = KeyBERT(model="all-MiniLM-L6-v2")

def extract_keywords_bert(
    text: str,
    top_n: int = 10,
    diversity: float = 0.5,
) -> list[tuple[str, float]]:
    """Extract keywords using semantic similarity."""
    keywords = kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(1, 2),
        stop_words="english",
        top_n=top_n,
        use_mmr=True,           # Maximal Marginal Relevance
        diversity=diversity,     # 0 = most similar, 1 = most diverse
    )
    return keywords

text = """Reinforcement learning agents learn optimal policies through
trial and error, maximizing cumulative reward in an environment.
Policy gradient methods like PPO and SAC are widely used for
continuous control tasks in robotics and game playing."""

keywords = extract_keywords_bert(text, top_n=5)
# [('reinforcement learning', 0.72), ('policy gradient', 0.65),
#  ('cumulative reward', 0.58), ('continuous control', 0.54),
#  ('optimal policies', 0.51)]

The diversity parameter controls the trade-off between relevance and variety. Set it higher when you want keywords that cover different aspects of the document rather than clustering around a single theme.
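Under the hood, use_mmr=True applies Maximal Marginal Relevance: each new keyword is chosen to be similar to the document but dissimilar to the keywords already picked. A simplified NumPy sketch of the idea (an illustration, not KeyBERT's actual implementation):

```python
import numpy as np

def mmr_select(
    doc_emb: np.ndarray,    # (d,) document embedding
    cand_embs: np.ndarray,  # (n, d) candidate keyword embeddings
    top_n: int = 5,
    diversity: float = 0.5,
) -> list[int]:
    """Simplified Maximal Marginal Relevance; returns candidate indices."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    relevance = np.array([cos(c, doc_emb) for c in cand_embs])
    selected = [int(np.argmax(relevance))]  # start with the most relevant
    while len(selected) < min(top_n, len(cand_embs)):
        best_score, best_idx = -np.inf, None
        for i in range(len(cand_embs)):
            if i in selected:
                continue
            # Redundancy: highest similarity to anything already picked
            redundancy = max(cos(cand_embs[i], cand_embs[j]) for j in selected)
            score = (1 - diversity) * relevance[i] - diversity * redundancy
            if score > best_score:
                best_score, best_idx = score, i
        selected.append(best_idx)
    return selected
```

With diversity=0 this reduces to ranking by relevance alone; as diversity approaches 1, each new keyword is pushed away from the ones already selected.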

Topic Modeling with BERTopic

BERTopic is a widely used modern approach to topic modeling. It embeds documents with a sentence transformer, reduces dimensionality with UMAP, clusters the reduced embeddings with HDBSCAN, and extracts topic keywords with a class-based TF-IDF (c-TF-IDF) step.

from bertopic import BERTopic

def discover_topics(
    documents: list[str],
    min_topic_size: int = 5,
) -> tuple[BERTopic, list[int]]:
    """Discover topics in a document collection."""
    topic_model = BERTopic(
        language="english",
        min_topic_size=min_topic_size,
        verbose=False,
    )
    topics, probabilities = topic_model.fit_transform(documents)
    return topic_model, topics

# Example with a collection of support tickets
tickets = [
    "Cannot log in after password reset",
    "Login page shows 500 error",
    "Password reset email never arrived",
    "Invoice amount is incorrect",
    "Charged twice for same subscription",
    "Need a refund for duplicate charge",
    "App crashes on Android 14",
    "Mobile app freezes when uploading photos",
    "App not compatible with my phone",
]

model, topics = discover_topics(tickets, min_topic_size=2)

# View discovered topics
topic_info = model.get_topic_info()
print(topic_info[["Topic", "Count", "Name"]].head())

# Get the topic for a new document
new_topic, _ = model.transform(["My login credentials are not working"])

Classical Topic Modeling with LDA

Latent Dirichlet Allocation (LDA) is a probabilistic model that represents each document as a mixture of topics, where each topic is a distribution over words. It is lighter weight than BERTopic and works well when you need interpretable, fixed-size topic distributions.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def lda_topics(
    documents: list[str],
    n_topics: int = 5,
    top_words: int = 8,
) -> list[dict]:
    """Discover topics using LDA."""
    vectorizer = CountVectorizer(
        stop_words="english",
        max_features=5000,
    )
    doc_term_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()

    lda = LatentDirichletAllocation(
        n_components=n_topics,
        random_state=42,
        max_iter=20,
    )
    lda.fit(doc_term_matrix)

    topics = []
    for idx, topic_dist in enumerate(lda.components_):
        top_indices = topic_dist.argsort()[-top_words:][::-1]
        words = [feature_names[i] for i in top_indices]
        topics.append({"topic_id": idx, "keywords": words})

    return topics
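The function above returns only each topic's keywords. For routing, an agent usually also wants each document's topic mixture, which LatentDirichletAllocation.fit_transform provides directly. A minimal sketch (the corpus here is illustrative):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Password reset email never arrived, cannot log in",
    "Login page shows an error after the password reset",
    "Invoice amount is wrong, charged twice for the subscription",
    "Need a refund for a duplicate subscription charge",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term_matrix = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=42, max_iter=20)
doc_topic = lda.fit_transform(doc_term_matrix)  # shape (n_docs, n_topics)

# Each row is a probability distribution over topics; its argmax is the
# dominant topic, which an agent can use as a routing key.
dominant = doc_topic.argmax(axis=1)
```

Fixing random_state keeps the assignments reproducible across runs, which matters when downstream routing rules depend on stable topic ids.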

Building an Agent Knowledge Organizer

Here is a complete system that an agent can use to automatically organize and retrieve documents by topic.

from dataclasses import dataclass
from keybert import KeyBERT
from bertopic import BERTopic

@dataclass
class TaggedDocument:
    text: str
    keywords: list[str]
    topic_id: int
    topic_label: str

class KnowledgeOrganizer:
    def __init__(self):
        self.keyword_model = KeyBERT(model="all-MiniLM-L6-v2")
        self.topic_model = None
        self.documents: list[TaggedDocument] = []

    def index_documents(self, texts: list[str]) -> list[TaggedDocument]:
        self.topic_model = BERTopic(min_topic_size=3, verbose=False)
        topics, _ = self.topic_model.fit_transform(texts)

        tagged = []
        for text, topic_id in zip(texts, topics):
            keywords = self.keyword_model.extract_keywords(
                text, top_n=5, stop_words="english"
            )
            label = self.topic_model.get_topic(topic_id)
            topic_label = label[0][0] if label and topic_id != -1 else "misc"

            doc = TaggedDocument(
                text=text,
                keywords=[kw for kw, _ in keywords],
                topic_id=topic_id,
                topic_label=topic_label,
            )
            tagged.append(doc)

        self.documents = tagged
        return tagged

    def find_related(self, query: str, top_n: int = 5) -> list[TaggedDocument]:
        topic, _ = self.topic_model.transform([query])
        return [
            doc for doc in self.documents
            if doc.topic_id == topic[0]
        ][:top_n]
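find_related requires a fitted topic model. As a hypothetical fallback when only keyword tags are available, an agent could rank documents by Jaccard overlap between keyword sets. related_by_keywords below is an illustrative helper, not part of the class above; TaggedDocument is redeclared so the sketch runs standalone:

```python
from dataclasses import dataclass

@dataclass
class TaggedDocument:  # redeclared from above so this sketch is self-contained
    text: str
    keywords: list[str]
    topic_id: int
    topic_label: str

def related_by_keywords(
    query_keywords: list[str],
    documents: list[TaggedDocument],
    top_n: int = 5,
) -> list[TaggedDocument]:
    """Rank tagged documents by Jaccard overlap with the query keywords."""
    q = {kw.lower() for kw in query_keywords}

    def overlap(doc: TaggedDocument) -> float:
        d = {kw.lower() for kw in doc.keywords}
        return len(q & d) / len(q | d) if q | d else 0.0

    ranked = sorted(documents, key=overlap, reverse=True)
    # Drop documents with no overlap at all, then keep the top matches
    return [doc for doc in ranked if overlap(doc) > 0][:top_n]
```

This is far cruder than topic- or embedding-based retrieval, but it needs no model at query time and degrades gracefully when the topic model has not been fitted yet.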

FAQ

What is the difference between keyword extraction and topic modeling?

Keyword extraction operates on individual documents, identifying the most important terms within a single text. Topic modeling operates on a collection of documents, discovering shared themes that span multiple documents. Keywords describe what a single document is about. Topics describe what groups of documents have in common.

How do I choose between BERTopic and LDA for my agent?

Use BERTopic when you need high-quality, semantically coherent topics and can afford the embedding step (a GPU speeds it up but is not required). Use LDA when you need lightweight, interpretable, fixed-size topic distributions, or when you need deterministic results for reproducibility (fix the random seed, as in the example above). BERTopic generally produces better topics, especially on short texts where LDA's word co-occurrence statistics are sparse, but requires more compute and memory.

How do I handle new documents that arrive after the initial topic model is trained?

BERTopic supports incremental topic assignment through its transform() method — pass new documents to get their topic assignments without retraining. For periodic retraining, use BERTopic's merge_models() to combine an existing model with a model trained on new data. Schedule full retraining when topic drift becomes noticeable.




Written by CallSphere Team
