---
title: "Keyword Extraction and Topic Modeling for Agent Knowledge Organization"
description: "Learn keyword extraction with TF-IDF and KeyBERT, topic modeling with BERTopic and LDA, and how to build agent knowledge organization systems that automatically categorize and cluster documents."
canonical: https://callsphere.ai/blog/keyword-extraction-topic-modeling-agent-knowledge-organization
category: "Learn Agentic AI"
tags: ["Keyword Extraction", "Topic Modeling", "BERTopic", "KeyBERT", "NLP", "AI Agents"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-31T19:32:29.609Z
---

# Keyword Extraction and Topic Modeling for Agent Knowledge Organization

> Learn keyword extraction with TF-IDF and KeyBERT, topic modeling with BERTopic and LDA, and how to build agent knowledge organization systems that automatically categorize and cluster documents.

## Why Agents Need Keyword and Topic Understanding

An AI agent managing a knowledge base of thousands of documents needs to organize, search, and retrieve information efficiently. Keyword extraction identifies the most representative terms in a document. Topic modeling discovers latent themes across a collection of documents. Together, they give agents the ability to automatically tag content, cluster related documents, and route queries to the most relevant knowledge source.

## Keyword Extraction with TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) remains a dependable baseline for keyword extraction. It scores a term highly when it is frequent in a specific document but rare across the corpus, which is exactly what distinguishes one document from another.

```mermaid
flowchart LR
    INPUT(["User intent"])
    PARSE["Parse plus
classify"]
    PLAN["Plan and tool
selection"]
    AGENT["Agent loop
LLM plus tools"]
    GUARD{"Guardrails
and policy"}
    EXEC["Execute and
verify result"]
    OBS[("Trace and metrics")]
    OUT(["Outcome plus
next action"])
    INPUT --> PARSE --> PLAN --> AGENT --> GUARD
    GUARD -->|Pass| EXEC --> OUT
    GUARD -->|Fail| AGENT
    AGENT --> OBS
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OBS fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff
```

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def extract_keywords_tfidf(
    documents: list[str],
    doc_index: int,
    top_n: int = 10,
) -> list[tuple[str, float]]:
    """Extract top keywords for a specific document using TF-IDF."""
    vectorizer = TfidfVectorizer(
        stop_words="english",
        ngram_range=(1, 2),
        max_features=10000,
    )
    tfidf_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()

    doc_vector = tfidf_matrix[doc_index].toarray().flatten()
    top_indices = np.argsort(doc_vector)[-top_n:][::-1]

    return [
        # Cast NumPy scalars to built-in types so the result prints and serializes cleanly.
        (str(feature_names[i]), round(float(doc_vector[i]), 4))
        for i in top_indices
        if doc_vector[i] > 0
    ]

documents = [
    "Neural networks use backpropagation for gradient-based optimization.",
    "Kubernetes orchestrates container deployments across clusters.",
    "BERT embeddings capture contextual word representations.",
]

keywords = extract_keywords_tfidf(documents, doc_index=0, top_n=5)
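# Returns (term, score) pairs, highest score first. In this tiny corpus every
# term in document 0 appears once and in no other document, so the scores tie;
# a larger corpus is needed for a meaningful ranking.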
```
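
To make the scoring concrete, here is a minimal sketch that computes TF-IDF by hand with only the standard library. It assumes pre-tokenized input and uses the smoothed IDF formula and L2 normalization that scikit-learn applies by default. The function name `tfidf_by_hand` and the toy documents are illustrative, not part of any library.

```python
import math
from collections import Counter


def tfidf_by_hand(docs: list[list[str]]) -> list[dict[str, float]]:
    """Compute smoothed, L2-normalized TF-IDF for pre-tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    weighted = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalize so documents of different lengths are comparable.
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        weighted.append({t: v / norm for t, v in raw.items()})
    return weighted


# Hypothetical toy corpus: "agent" occurs everywhere, the other terms do not.
toy_docs = [
    ["agent", "memory", "retrieval", "agent"],
    ["agent", "billing", "invoice"],
    ["agent", "login", "password"],
]
scores = tfidf_by_hand(toy_docs)
print(sorted(scores[1].items(), key=lambda kv: kv[1], reverse=True))
```

In the second document, `billing` and `invoice` outscore `agent` because `agent` appears in every document and therefore carries the minimum IDF.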

## Keyword Extraction with KeyBERT

KeyBERT embeds the document and its candidate phrases with a sentence-embedding model, then ranks candidates by cosine similarity to the document embedding. It often produces more contextually relevant keywords than TF-IDF, especially for short texts, and it needs no reference corpus.

```python
from keybert import KeyBERT

kw_model = KeyBERT(model="all-MiniLM-L6-v2")

def extract_keywords_bert(
    text: str,
    top_n: int = 10,
    diversity: float = 0.5,
) -> list[tuple[str, float]]:
    """Extract keywords using semantic similarity."""
    keywords = kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(1, 2),
        stop_words="english",
        top_n=top_n,
        use_mmr=True,           # Maximal Marginal Relevance
        diversity=diversity,     # 0 = most similar, 1 = most diverse
    )
    return keywords

text = """Reinforcement learning agents learn optimal policies through
trial and error, maximizing cumulative reward in an environment.
Policy gradient methods like PPO and SAC are widely used for
continuous control tasks in robotics and game playing."""

keywords = extract_keywords_bert(text, top_n=5)
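# Illustrative output; exact phrases and scores depend on the embedding
# model and library version:
# [('reinforcement learning', 0.72), ('policy gradient', 0.65),
#  ('cumulative reward', 0.58), ('continuous control', 0.54),
#  ('optimal policies', 0.51)]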
```

The `diversity` parameter controls the trade-off between relevance and variety. Set it higher when you want keywords that cover different aspects of the document rather than clustering around a single theme.
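
The trade-off behind `diversity` can be sketched as a small Maximal Marginal Relevance loop. This is a simplified illustration in plain Python, not KeyBERT's actual code, and its internals may differ in detail. The two-dimensional vectors are hypothetical stand-ins for real embeddings, and `mmr_select` is an illustrative name.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    doc_vec: list[float],
    candidates: dict[str, list[float]],
    top_n: int = 3,
    diversity: float = 0.5,
) -> list[str]:
    """Pick keywords that are relevant to the document yet dissimilar to each other."""
    relevance = {w: cosine(v, doc_vec) for w, v in candidates.items()}
    # Start with the single most relevant candidate.
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(top_n, len(candidates)):
        best, best_score = None, -math.inf
        for word, vec in candidates.items():
            if word in selected:
                continue
            # Redundancy: similarity to the closest already-selected keyword.
            redundancy = max(cosine(vec, candidates[s]) for s in selected)
            score = (1 - diversity) * relevance[word] - diversity * redundancy
            if score > best_score:
                best, best_score = word, score
        selected.append(best)
    return selected


# Hypothetical 2-D "embeddings": two near-duplicates and one distinct phrase.
doc = [1.0, 0.3]
cands = {
    "reinforcement learning": [1.0, 0.25],
    "rl agents": [1.0, 0.2],
    "robotics": [0.3, 1.0],
}
print(mmr_select(doc, cands, top_n=2, diversity=0.0))
print(mmr_select(doc, cands, top_n=2, diversity=0.7))
```

With `diversity=0.0` the second pick is the near-duplicate `rl agents`. With `diversity=0.7` the redundancy penalty pushes the selection to `robotics` instead.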

## Topic Modeling with BERTopic

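BERTopic is a widely used embedding-based approach to topic modeling. It combines sentence embeddings, dimensionality reduction (UMAP), and clustering (HDBSCAN) to discover topics without a preset topic count, then describes each cluster with class-based TF-IDF (c-TF-IDF) keywords. Documents that fit no cluster are assigned to the outlier topic `-1`.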

```python
from bertopic import BERTopic

def discover_topics(
    documents: list[str],
    min_topic_size: int = 5,
) -> tuple[BERTopic, list[int]]:
    """Discover topics in a document collection."""
    topic_model = BERTopic(
        language="english",
        min_topic_size=min_topic_size,
        verbose=False,
    )
    topics, probabilities = topic_model.fit_transform(documents)
    return topic_model, topics

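# Example with a handful of support tickets. This is illustrative only: UMAP and
# HDBSCAN usually need hundreds of documents, so a corpus this small may fail
# to fit or place most documents in the outlier topic (-1).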
tickets = [
    "Cannot log in after password reset",
    "Login page shows 500 error",
    "Password reset email never arrived",
    "Invoice amount is incorrect",
    "Charged twice for same subscription",
    "Need a refund for duplicate charge",
    "App crashes on Android 14",
    "Mobile app freezes when uploading photos",
    "App not compatible with my phone",
]

model, topics = discover_topics(tickets, min_topic_size=2)

# View discovered topics
topic_info = model.get_topic_info()
print(topic_info[["Topic", "Count", "Name"]].head())

# Get the topic for a new document
new_topic, _ = model.transform(["My login credentials are not working"])
```
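
Once `fit_transform` has returned one topic ID per document, an agent usually needs the documents grouped by topic, with outliers set aside. The following is a minimal sketch of that post-processing step. The helper name `group_by_topic` and the sample topic IDs are hypothetical and were not produced by an actual model run.

```python
from collections import defaultdict


def group_by_topic(
    documents: list[str],
    topics: list[int],
) -> tuple[dict[int, list[str]], list[str]]:
    """Split documents into per-topic clusters and a list of outliers (topic -1)."""
    clusters: dict[int, list[str]] = defaultdict(list)
    outliers: list[str] = []
    for doc, topic_id in zip(documents, topics):
        if topic_id == -1:
            # BERTopic uses -1 for documents that fit no cluster.
            outliers.append(doc)
        else:
            clusters[topic_id].append(doc)
    return dict(clusters), outliers


# Hypothetical assignments for four tickets (not real model output).
sample_docs = [
    "Cannot log in after password reset",
    "Invoice amount is incorrect",
    "Login page shows 500 error",
    "Where is your office located?",
]
sample_topics = [0, 1, 0, -1]

clusters, outliers = group_by_topic(sample_docs, sample_topics)
print(clusters)
print(outliers)
```

Keeping outliers in a separate list lets the agent route them to a fallback path, such as plain search or human review, instead of forcing them into a poorly matching topic.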

## Classical Topic Modeling with LDA

Latent Dirichlet Allocation (LDA) is a probabilistic model that represents each document as a mixture of topics, where each topic is a distribution over words. It is lighter weight than BERTopic and works well when you need interpretable, fixed-size topic distributions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def lda_topics(
    documents: list[str],
    n_topics: int = 5,
    top_words: int = 8,
) -> list[dict]:
    """Discover topics using LDA."""
    vectorizer = CountVectorizer(
        stop_words="english",
        max_features=5000,
    )
    doc_term_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()

    lda = LatentDirichletAllocation(
        n_components=n_topics,
        random_state=42,
        max_iter=20,
    )
    lda.fit(doc_term_matrix)

    topics = []
    for idx, topic_dist in enumerate(lda.components_):
        top_indices = topic_dist.argsort()[-top_words:][::-1]
        words = [feature_names[i] for i in top_indices]
        topics.append({"topic_id": idx, "keywords": words})

    return topics
```
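
Because LDA treats each document as a mixture of topics, a document can reasonably carry more than one topic tag. The sketch below assumes you already have one document's topic weights, for example a row from scikit-learn's `lda.transform(doc_term_matrix)`, and keeps every topic whose share passes a threshold. The function name and the `0.25` threshold are illustrative choices, not library defaults.

```python
def topics_above_threshold(
    mixture: list[float],
    threshold: float = 0.25,
) -> list[int]:
    """Return topic IDs whose share of the mixture is at least `threshold`,
    ordered from strongest to weakest."""
    total = sum(mixture)
    if total <= 0:
        return []
    # Normalize defensively in case the weights do not already sum to 1.
    shares = [w / total for w in mixture]
    ranked = sorted(range(len(shares)), key=lambda i: shares[i], reverse=True)
    return [i for i in ranked if shares[i] >= threshold]


# Hypothetical document-topic weights for a 4-topic model.
mixture = [0.05, 0.55, 0.10, 0.30]
tags = topics_above_threshold(mixture)
print(tags)  # [1, 3]
```

A lower threshold yields more tags per document. A higher one approximates single-label assignment.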

## Building an Agent Knowledge Organizer

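Here is a minimal organizer that an agent can use to tag documents with keywords and topics, then retrieve documents that share a query's topic. Treat it as a starting point: it keeps everything in memory and returns same-topic documents in index order rather than ranking them.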

```python
from dataclasses import dataclass
from keybert import KeyBERT
from bertopic import BERTopic

@dataclass
class TaggedDocument:
    text: str
    keywords: list[str]
    topic_id: int
    topic_label: str

class KnowledgeOrganizer:
    def __init__(self):
        self.keyword_model = KeyBERT(model="all-MiniLM-L6-v2")
        self.topic_model = None
        self.documents: list[TaggedDocument] = []

    def index_documents(self, texts: list[str]) -> list[TaggedDocument]:
        self.topic_model = BERTopic(min_topic_size=3, verbose=False)
        topics, _ = self.topic_model.fit_transform(texts)

        tagged = []
        for text, topic_id in zip(texts, topics):
            keywords = self.keyword_model.extract_keywords(
                text, top_n=5, stop_words="english"
            )
            label = self.topic_model.get_topic(topic_id)
            topic_label = label[0][0] if label and topic_id != -1 else "misc"

            doc = TaggedDocument(
                text=text,
                keywords=[kw for kw, _ in keywords],
                topic_id=topic_id,
                topic_label=topic_label,
            )
            tagged.append(doc)

        self.documents = tagged
        return tagged

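    def find_related(self, query: str, top_n: int = 5) -> list[TaggedDocument]:
        if self.topic_model is None:
            raise RuntimeError("Call index_documents() before find_related().")
        # Assign the query to a topic, then return documents from that topic.
        topic, _ = self.topic_model.transform([query])
        return [
            doc for doc in self.documents
            if doc.topic_id == topic[0]
        ][:top_n]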
```
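
Since `find_related` returns same-topic documents in their stored order, one possible refinement is to rank them by keyword overlap with the query. The sketch below uses Jaccard similarity over keyword sets. The `Doc` class is a simplified stand-in for `TaggedDocument`, and `rank_by_keyword_overlap` is a hypothetical helper, not part of the class above.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Simplified stand-in for TaggedDocument (text and keywords only)."""
    text: str
    keywords: list[str]


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_keyword_overlap(
    query_keywords: list[str],
    candidates: list[Doc],
    top_n: int = 5,
) -> list[Doc]:
    """Order candidate documents by keyword overlap with the query."""
    query_set = {k.lower() for k in query_keywords}
    scored = [
        (jaccard(query_set, {k.lower() for k in doc.keywords}), doc)
        for doc in candidates
    ]
    # Sort on the score only; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


docs = [
    Doc("Invoice amount is incorrect", ["invoice", "refund"]),
    Doc("Login page shows 500 error", ["login", "error"]),
    Doc("Cannot log in after password reset", ["login", "password", "reset"]),
]
ranked = rank_by_keyword_overlap(["login", "password"], docs, top_n=2)
print([d.text for d in ranked])
```

In practice the query keywords could come from the same KeyBERT model used at indexing time, so both sides are extracted consistently.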

## FAQ

### What is the difference between keyword extraction and topic modeling?

Keyword extraction operates on individual documents, identifying the most important terms within a single text. Topic modeling operates on a collection of documents, discovering shared themes that span multiple documents. Keywords describe what a single document is about. Topics describe what groups of documents have in common.

### How do I choose between BERTopic and LDA for my agent?

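Use BERTopic when you need semantically coherent topics, when your documents are short (embeddings cope with sparse text better than word counts do), or when you do not know the number of topics in advance. It runs on CPU, but a GPU speeds up embedding large corpora. Use LDA when you need lightweight, interpretable topic distributions over longer documents with a fixed topic count; setting `random_state` makes scikit-learn's LDA reproducible. BERTopic often produces more coherent topics, but its UMAP step is stochastic unless you seed it, and the pipeline requires more compute and memory.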

### How do I handle new documents that arrive after the initial topic model is trained?

BERTopic assigns topics to unseen documents through its `transform()` method: pass new documents to get their topic assignments without retraining. Documents that match no existing topic come back as `-1`. For periodic updates, use BERTopic's `merge_models()` to combine an existing model with a model trained on new data. Schedule full retraining when topic drift becomes noticeable, for example when a growing share of new documents lands in the outlier topic.
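
One rough way to decide when drift is "noticeable" is to watch how many newly assigned documents fall into the outlier topic `-1`. The sketch below is a heuristic under stated assumptions, not a BERTopic feature: the function names, the `0.15` tolerance, and the minimum batch size of 50 are all hypothetical values you would tune for your own data.

```python
def outlier_rate(topic_ids: list[int]) -> float:
    """Fraction of documents assigned to the outlier topic (-1)."""
    if not topic_ids:
        return 0.0
    return sum(1 for t in topic_ids if t == -1) / len(topic_ids)


def should_retrain(
    recent_topic_ids: list[int],
    baseline_rate: float,
    tolerance: float = 0.15,
    min_batch: int = 50,
) -> bool:
    """Flag retraining when the recent outlier rate exceeds the baseline
    (measured at training time) by more than `tolerance`."""
    if len(recent_topic_ids) < min_batch:
        # Too few documents to judge; avoid reacting to noise.
        return False
    return outlier_rate(recent_topic_ids) - baseline_rate > tolerance


# Hypothetical batches of topic IDs returned by transform() on new documents.
stable_batch = [-1] * 10 + [0] * 90
drifting_batch = [-1] * 30 + [0] * 70

print(should_retrain(stable_batch, baseline_rate=0.10))    # False
print(should_retrain(drifting_batch, baseline_rate=0.10))  # True
```

The outlier rate is only a proxy. New themes can also be absorbed into an existing topic, so periodic manual review of topic keywords remains useful.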

---

Source: https://callsphere.ai/blog/keyword-extraction-topic-modeling-agent-knowledge-organization
