Learn Agentic AI

Web Scraping Pipelines for Agent Knowledge: Crawling, Extracting, and Indexing Content

Build a production web scraping pipeline using Scrapy and Playwright that crawls websites, extracts structured content, deduplicates pages, and indexes knowledge for AI agent consumption.

Why Agents Need Web Scraping Pipelines

AI agents are only as useful as the knowledge they can access. Static document uploads cover internal knowledge, but many agent use cases demand fresh, continuously updated information from the open web — competitor pricing, regulatory updates, product documentation, forum discussions, and news.

A production scraping pipeline goes well beyond a simple requests.get() loop. It needs to handle JavaScript-rendered pages, respect rate limits and robots.txt, extract meaningful content from noisy HTML, deduplicate across crawls, and schedule recurring updates without manual intervention.

Architecture Overview

A robust scraping pipeline has four stages: crawling (fetching pages), extraction (pulling structured content from HTML), deduplication (avoiding redundant processing), and indexing (storing content for agent retrieval). Each stage runs independently so failures in one do not block the others.

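The four stages can be wired together as one loop that isolates per-page failures. A minimal sketch, assuming hypothetical stage callables (`extract`, `is_duplicate`, `index` are placeholders for the components built in the sections below, not library functions):

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class PipelineStats:
    crawled: int = 0
    extracted: int = 0
    deduped: int = 0
    indexed: int = 0

def run_pipeline(raw_pages: Iterable[dict], extract, is_duplicate, index) -> PipelineStats:
    """Push each crawled page through extract -> dedupe -> index.

    A failure on one page is logged and skipped so it does not
    block the rest of the batch.
    """
    stats = PipelineStats()
    for raw in raw_pages:
        stats.crawled += 1
        try:
            page = extract(raw)
            if page is None:  # thin or non-HTML page
                continue
            stats.extracted += 1
            if is_duplicate(page):
                stats.deduped += 1
                continue
            index(page)
            stats.indexed += 1
        except Exception as exc:  # isolate per-page failures
            print(f"pipeline error for {raw.get('url')}: {exc}")
    return stats
```

Because each stage is passed in as a callable, you can swap an extractor or index backend without touching the loop.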

Building the Crawler with Scrapy

Scrapy provides the crawling framework with built-in concurrency, politeness controls, and middleware support. For JavaScript-heavy sites, integrate Playwright as a download handler.

import scrapy
from scrapy import Request
from urllib.parse import urlparse
from datetime import datetime

class KnowledgeCrawler(scrapy.Spider):
    name = "knowledge_crawler"
    custom_settings = {
        "CONCURRENT_REQUESTS": 4,
        "DOWNLOAD_DELAY": 2,
        "ROBOTSTXT_OBEY": True,
        "DEPTH_LIMIT": 3,
        "CLOSESPIDER_PAGECOUNT": 500,
        "HTTPCACHE_ENABLED": True,
        "HTTPCACHE_EXPIRATION_SECS": 86400,
    }

    def __init__(self, start_urls: list, allowed_domains: list, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = start_urls
        self.allowed_domains = allowed_domains

    def parse(self, response):
        # Skip non-HTML responses
        content_type = response.headers.get(
            "Content-Type", b""
        ).decode()
        if "text/html" not in content_type:
            return

        yield {
            "url": response.url,
            "html": response.text,
            "status": response.status,
            "crawled_at": datetime.utcnow().isoformat(),
            "domain": urlparse(response.url).netloc,
        }

        # Follow internal links
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

The HTTPCACHE_ENABLED setting matters for recurring crawls. With Scrapy's default DummyPolicy, every response is cached to disk and served locally until HTTPCACHE_EXPIRATION_SECS elapses, which saves bandwidth and spares the target server. If you want validation-based caching that re-downloads only when a page has actually changed, set HTTPCACHE_POLICY to scrapy.extensions.httpcache.RFC2616Policy.


Content Extraction

Raw HTML is useless for agents. The extraction stage strips navigation, ads, and boilerplate to isolate the main content.

from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import List, Optional
import hashlib

@dataclass
class ExtractedPage:
    url: str
    title: str
    content: str
    headings: List[str]
    content_hash: str
    word_count: int
    crawled_at: str

class ContentExtractor:
    NOISE_TAGS = [
        "script", "style", "nav", "footer",
        "header", "aside", "iframe", "form",
    ]
    NOISE_CLASSES = [
        "sidebar", "menu", "nav", "footer",
        "advertisement", "cookie", "popup",
    ]

    def extract(self, raw: dict) -> Optional[ExtractedPage]:
        soup = BeautifulSoup(raw["html"], "html.parser")

        # Remove noise elements
        for tag in self.NOISE_TAGS:
            for el in soup.find_all(tag):
                el.decompose()
        for cls in self.NOISE_CLASSES:
            for el in soup.find_all(class_=lambda c: c and cls in c.lower()):
                el.decompose()

        # Extract main content
        main = (
            soup.find("main")
            or soup.find("article")
            or soup.find("div", role="main")
            or soup.find("body")
        )
        if not main:
            return None

        text = main.get_text(separator="\n", strip=True)
        if len(text.split()) < 50:
            return None  # skip thin pages

        # soup.title.string can be None for an empty <title> tag
        title = (soup.title.string or "") if soup.title else ""
        headings = [
            h.get_text(strip=True)
            for h in main.find_all(["h1", "h2", "h3"])
        ]
        content_hash = hashlib.sha256(text.encode()).hexdigest()

        return ExtractedPage(
            url=raw["url"],
            title=title.strip(),
            content=text,
            headings=headings,
            content_hash=content_hash,
            word_count=len(text.split()),
            crawled_at=raw["crawled_at"],
        )
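A quick check of the stripping logic on an inline page, using the same BeautifulSoup calls as the extractor above (the sample HTML is made up):

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Pricing Guide</title></head><body>
<nav>Home | About | Contact</nav>
<main><h1>Pricing</h1><p>The basic plan costs $10 per month.</p></main>
<footer>Copyright 2024</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Drop the same noise tags the extractor removes
for tag in ["script", "style", "nav", "footer", "header", "aside"]:
    for el in soup.find_all(tag):
        el.decompose()

main = soup.find("main") or soup.find("article") or soup.find("body")
text = main.get_text(separator="\n", strip=True)
print(text)  # navigation and footer text are gone; only the main content remains
```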

Deduplication Across Crawls

Agents should not have duplicate information in their knowledge base. Content hashing catches exact duplicates, but near-duplicates require SimHash or MinHash.

from datasketch import MinHash, MinHashLSH

class Deduplicator:
    def __init__(self, threshold: float = 0.85):
        self.lsh = MinHashLSH(threshold=threshold, num_perm=128)
        self.seen_hashes = set()

    def is_duplicate(self, page: ExtractedPage) -> bool:
        # Exact duplicate check
        if page.content_hash in self.seen_hashes:
            return True
        self.seen_hashes.add(page.content_hash)

        # Near-duplicate check with MinHash over word unigrams
        mh = MinHash(num_perm=128)
        for word in page.content.lower().split():
            mh.update(word.encode("utf-8"))

        if self.lsh.query(mh):
            return True
        # Key by content hash rather than URL: the same URL reappears
        # across crawls, and MinHashLSH raises on duplicate keys
        self.lsh.insert(page.content_hash, mh)
        return False
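Under the hood, MinHash approximates Jaccard similarity between word sets: the fraction of matching min-hash values across seeded hash functions estimates the true overlap. A stdlib-only sketch of the idea (illustrative only, not a replacement for datasketch):

```python
import hashlib

def minhash_signature(words: set, num_perm: int = 64) -> list:
    """One minimum hash value per seeded hash function."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{w}".encode()).digest()[:8], "big")
            for w in words
        ))
    return sig

def estimated_jaccard(a: set, b: set, num_perm: int = 64) -> float:
    sa = minhash_signature(a, num_perm)
    sb = minhash_signature(b, num_perm)
    return sum(x == y for x, y in zip(sa, sb)) / num_perm

doc1 = set("the quick brown fox jumps over the lazy dog".split())
doc2 = set("the quick brown fox leaps over the lazy dog".split())
est = estimated_jaccard(doc1, doc2)
print(est)  # close to the true Jaccard of 7/9
```

The LSH index in datasketch avoids comparing every pair of signatures, which is what makes the approach scale to large crawls.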

Scheduling Recurring Crawls

Use a simple scheduler to re-crawl sources on different frequencies based on how often they update.

import asyncio

from apscheduler.schedulers.asyncio import AsyncIOScheduler

scheduler = AsyncIOScheduler()

# News sites: crawl every 6 hours
scheduler.add_job(
    run_crawl,  # your coroutine that launches a crawl for the given URLs
    "interval", hours=6,
    args=[["https://news.example.com"]],
    id="news_crawl",
)

# Documentation: crawl daily
scheduler.add_job(
    run_crawl, "interval", hours=24,
    args=[["https://docs.example.com"]],
    id="docs_crawl",
)

async def main():
    scheduler.start()  # AsyncIOScheduler needs a running event loop
    await asyncio.Event().wait()  # keep the process alive

asyncio.run(main())

FAQ

How do I handle JavaScript-rendered pages that Scrapy cannot parse?

Install scrapy-playwright and set the DOWNLOAD_HANDLERS to use Playwright for specific domains. Add meta={"playwright": True} to requests targeting JS-heavy sites. This launches a headless browser for those pages while keeping standard HTTP requests for everything else, balancing speed and completeness.
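A minimal settings sketch along the lines of the scrapy-playwright documentation (the handler routes downloads through a headless browser, and the asyncio Twisted reactor is required):

```python
# settings.py -- route downloads through Playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In the spider, opt individual requests into browser rendering:
# yield scrapy.Request(url, meta={"playwright": True})
```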

How do I respect robots.txt and avoid getting blocked?

Scrapy respects robots.txt by default with ROBOTSTXT_OBEY: True. Beyond that, set a DOWNLOAD_DELAY of at least 2 seconds, rotate user agents, limit concurrent requests per domain, and add your contact info to the user agent string so site owners can reach you if needed.
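Those politeness knobs map onto Scrapy settings like the following (the contact address is a placeholder; AutoThrottle adapts the delay to observed server latency):

```python
# Politeness settings for a shared crawl profile
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,
    "DOWNLOAD_DELAY": 2,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 2,
    "AUTOTHROTTLE_MAX_DELAY": 30,
    # Identify yourself so site owners can reach you
    "USER_AGENT": "KnowledgeBot/1.0 (+mailto:crawler@example.com)",
}
```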

Should I store raw HTML or just extracted text?

Store both. Raw HTML goes into object storage (S3 or local disk) as an archive, while extracted text goes into your vector database for retrieval. Keeping raw HTML lets you re-extract content when your extraction logic improves without re-crawling everything.
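A minimal sketch of that split, assuming local disk for the raw archive and a JSONL file standing in for the retrieval index (a real setup would use S3 and a vector store):

```python
import hashlib
import json
import tempfile
from pathlib import Path

def archive_page(url: str, html: str, text: str, root: Path) -> Path:
    """Write raw HTML to a content-addressed path and append the
    extracted text to a JSONL index."""
    key = hashlib.sha256(url.encode()).hexdigest()
    raw_path = root / "raw" / f"{key}.html"
    raw_path.parent.mkdir(parents=True, exist_ok=True)
    raw_path.write_text(html)

    record = {"url": url, "raw_path": str(raw_path), "text": text}
    with open(root / "index.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return raw_path

# usage
root = Path(tempfile.mkdtemp())
p = archive_page("https://docs.example.com/a", "<html>sample</html>", "extracted text", root)
```

Because the raw path is derived from the URL hash, re-crawling the same page overwrites its archive copy in place rather than accumulating duplicates.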


#WebScraping #DataPipelines #KnowledgeBase #Scrapy #Playwright #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

