---
title: "Web Scraping Pipelines for Agent Knowledge: Crawling, Extracting, and Indexing Content"
description: "Build a production web scraping pipeline using Scrapy and Playwright that crawls websites, extracts structured content, deduplicates pages, and indexes knowledge for AI agent consumption."
canonical: https://callsphere.ai/blog/web-scraping-pipelines-agent-knowledge-crawling-indexing
category: "Learn Agentic AI"
tags: ["Web Scraping", "Data Pipelines", "Knowledge Base", "Scrapy", "Playwright"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:44.580Z
---

# Web Scraping Pipelines for Agent Knowledge: Crawling, Extracting, and Indexing Content

> Build a production web scraping pipeline using Scrapy and Playwright that crawls websites, extracts structured content, deduplicates pages, and indexes knowledge for AI agent consumption.

## Why Agents Need Web Scraping Pipelines

AI agents are only as useful as the knowledge they can access. Static document uploads cover internal knowledge, but many agent use cases demand fresh, continuously updated information from the open web — competitor pricing, regulatory updates, product documentation, forum discussions, and news.

A production scraping pipeline goes well beyond a simple `requests.get()` loop. It needs to handle JavaScript-rendered pages, respect rate limits and robots.txt, extract meaningful content from noisy HTML, deduplicate across crawls, and schedule recurring updates without manual intervention.

## Architecture Overview

A robust scraping pipeline has four stages: crawling (fetching pages), extraction (pulling structured content from HTML), deduplication (avoiding redundant processing), and indexing (storing content for agent retrieval). Each stage runs independently so failures in one do not block the others.

```mermaid
flowchart LR
    SEEDS[("Seed URLs")]
    CRAWL["Crawl
Scrapy + Playwright"]
    RAW[("Raw HTML
archive")]
    EXTRACT["Extract
strip boilerplate"]
    DEDUP["Dedupe
hash + MinHash"]
    INDEX["Index
chunk + embed"]
    KB[("Agent
knowledge base")]
    SCHED["Scheduler
recurring crawls"]
    SEEDS --> CRAWL --> RAW --> EXTRACT --> DEDUP --> INDEX --> KB
    SCHED --> CRAWL
    style CRAWL fill:#4f46e5,stroke:#4338ca,color:#fff
    style DEDUP fill:#f59e0b,stroke:#d97706,color:#1f2937
    style KB fill:#059669,stroke:#047857,color:#fff
```
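In code terms, the four stages compose into one run loop. A minimal sketch, with each stage injected as a callable so stages stay independent; the names and signatures here are illustrative, not from Scrapy or any specific library:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class PipelineRun:
    """Wires the four stages together. Each stage is an injected callable,
    so one stage can fail and be retried without rerunning the others."""
    crawl: Callable[[List[str]], List[dict]]   # seed URLs -> raw pages
    extract: Callable[[dict], Optional[dict]]  # raw page -> extracted page or None
    is_duplicate: Callable[[dict], bool]       # extracted page -> already seen?
    index: Callable[[dict], None]              # store page for agent retrieval
    stats: Dict[str, int] = field(
        default_factory=lambda: {"crawled": 0, "indexed": 0, "skipped": 0}
    )

    def run(self, seeds: List[str]) -> Dict[str, int]:
        for raw in self.crawl(seeds):
            self.stats["crawled"] += 1
            page = self.extract(raw)
            # Thin pages and duplicates are skipped, not indexed
            if page is None or self.is_duplicate(page):
                self.stats["skipped"] += 1
                continue
            self.index(page)
            self.stats["indexed"] += 1
        return self.stats
```

Injecting stages as callables also keeps each one independently testable, which matches the independence requirement above.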

## Building the Crawler with Scrapy

Scrapy provides the crawling framework with built-in concurrency, politeness controls, and middleware support. For JavaScript-heavy sites, integrate Playwright as a download handler.

```python
import scrapy
from scrapy import Request
from urllib.parse import urlparse
from datetime import datetime

class KnowledgeCrawler(scrapy.Spider):
    name = "knowledge_crawler"
    custom_settings = {
        "CONCURRENT_REQUESTS": 4,
        "DOWNLOAD_DELAY": 2,
        "ROBOTSTXT_OBEY": True,
        "DEPTH_LIMIT": 3,
        "CLOSESPIDER_PAGECOUNT": 500,
        "HTTPCACHE_ENABLED": True,
        "HTTPCACHE_EXPIRATION_SECS": 86400,
    }

    def __init__(self, start_urls: list, allowed_domains: list, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = start_urls
        self.allowed_domains = allowed_domains

    def parse(self, response):
        # Skip non-HTML responses
        content_type = response.headers.get(
            "Content-Type", b""
        ).decode()
        if "text/html" not in content_type:
            return

        yield {
            "url": response.url,
            "html": response.text,
            "status": response.status,
            "crawled_at": datetime.utcnow().isoformat(),
            "domain": urlparse(response.url).netloc,
        }

        # Follow internal links
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

The `HTTPCACHE_ENABLED` setting is easy to overlook but important: repeat runs within the 24-hour expiration window are served from the on-disk cache instead of re-downloaded, saving bandwidth and sparing the target server.
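Note that Scrapy's default cache policy (`DummyPolicy`) serves any cached response until it expires, regardless of whether the page changed. To have the cache honor the server's own validation headers (`ETag`, `Last-Modified`) instead of a fixed TTL, switch to the RFC 2616 policy — a settings sketch:

```python
custom_settings = {
    "HTTPCACHE_ENABLED": True,
    # Honor Cache-Control / ETag / Last-Modified instead of a fixed TTL
    "HTTPCACHE_POLICY": "scrapy.extensions.httpcache.RFC2616Policy",
    "HTTPCACHE_STORAGE": "scrapy.extensions.httpcache.FilesystemCacheStorage",
    "HTTPCACHE_DIR": "httpcache",
}
```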

## Content Extraction

Raw HTML is useless for agents. The extraction stage strips navigation, ads, and boilerplate to isolate the main content.

```python
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import List, Optional
import hashlib

@dataclass
class ExtractedPage:
    url: str
    title: str
    content: str
    headings: List[str]
    content_hash: str
    word_count: int
    crawled_at: str

class ContentExtractor:
    NOISE_TAGS = [
        "script", "style", "nav", "footer",
        "header", "aside", "iframe", "form",
    ]
    NOISE_CLASSES = [
        "sidebar", "menu", "nav", "footer",
        "advertisement", "cookie", "popup",
    ]

    def extract(self, raw: dict) -> Optional[ExtractedPage]:
        soup = BeautifulSoup(raw["html"], "html.parser")

        # Remove noise elements
        for tag in self.NOISE_TAGS:
            for el in soup.find_all(tag):
                el.decompose()
        for cls in self.NOISE_CLASSES:
            for el in soup.find_all(class_=lambda c: c and cls in c.lower()):
                el.decompose()

        # Extract main content
        main = (
            soup.find("main")
            or soup.find("article")
            or soup.find("div", role="main")
            or soup.find("body")
        )
        if not main:
            return None

        text = main.get_text(separator="\n", strip=True)
        if len(text.split()) < 50:
            return None  # Skip thin pages with no real content

        title_tag = soup.find("title")
        return ExtractedPage(
            url=raw["url"],
            title=title_tag.get_text(strip=True) if title_tag else "",
            content=text,
            headings=[
                h.get_text(strip=True)
                for h in main.find_all(["h1", "h2", "h3"])
            ],
            content_hash=hashlib.sha256(
                text.encode("utf-8")
            ).hexdigest(),
            word_count=len(text.split()),
            crawled_at=raw["crawled_at"],
        )
```

## Deduplication

Crawls inevitably refetch unchanged pages and find the same content under multiple URLs. Exact duplicates fall to a content hash; near-duplicates (the same article with a different timestamp or footer) need MinHash with locality-sensitive hashing.

```python
from datasketch import MinHash, MinHashLSH

class Deduplicator:
    def __init__(self, threshold: float = 0.9):
        self.seen_hashes: set = set()
        self.lsh = MinHashLSH(threshold=threshold, num_perm=128)

    def is_duplicate(self, page: ExtractedPage) -> bool:
        # Exact duplicate check
        if page.content_hash in self.seen_hashes:
            return True
        self.seen_hashes.add(page.content_hash)

        # Near-duplicate check with MinHash
        mh = MinHash(num_perm=128)
        for word in page.content.lower().split():
            mh.update(word.encode("utf-8"))

        if self.lsh.query(mh):
            return True
        self.lsh.insert(page.url, mh)
        return False
```
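The fourth stage, indexing, prepares deduplicated pages for agent retrieval. Most retrieval setups work better over overlapping chunks of a few hundred words than over whole pages; below is a store-agnostic chunking sketch (the window and overlap sizes are illustrative defaults, not from a specific library):

```python
from typing import List

def chunk_content(content: str, chunk_words: int = 200,
                  overlap: int = 40) -> List[str]:
    """Split page text into overlapping word windows for embedding.
    The overlap keeps sentences that straddle a chunk boundary
    retrievable from either side."""
    words = content.split()
    if len(words) <= chunk_words:
        return [content] if words else []
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_words]
        chunks.append(" ".join(window))
        if start + chunk_words >= len(words):
            break  # Final window already covers the tail
    return chunks
```

Each chunk would then be embedded and written to the vector store alongside its source URL and `content_hash`, so stale chunks can be replaced on the next crawl.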

## Scheduling Recurring Crawls

Use a simple scheduler to re-crawl sources on different frequencies based on how often they update.

```python
from apscheduler.schedulers.asyncio import AsyncIOScheduler

scheduler = AsyncIOScheduler()

# News sites: crawl every 6 hours
scheduler.add_job(
    run_crawl, "interval", hours=6,
    args=[["https://news.example.com"]],
    id="news_crawl",
)

# Documentation: crawl daily
scheduler.add_job(
    run_crawl, "interval", hours=24,
    args=[["https://docs.example.com"]],
    id="docs_crawl",
)

scheduler.start()
```
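Fixed intervals waste crawls on static sources and lag behind fast-moving ones. One common refinement, sketched here with illustrative factors and bounds, is to adapt each source's interval to whether the last crawl actually found changes (for example, by comparing content hashes between runs):

```python
def next_interval(current_hours: float, changed: bool,
                  min_hours: float = 1.0, max_hours: float = 168.0) -> float:
    """Shorten the recrawl interval when the last crawl found changes,
    lengthen it when nothing changed, clamped to [min_hours, max_hours]."""
    if changed:
        return max(min_hours, current_hours / 2)
    return min(max_hours, current_hours * 2)
```

After each run, reschedule the job with the new interval; over time each source converges toward its own update rhythm.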

## FAQ

### How do I handle JavaScript-rendered pages that Scrapy cannot parse?

Install `scrapy-playwright` and set the `DOWNLOAD_HANDLERS` to use Playwright for specific domains. Add `meta={"playwright": True}` to requests targeting JS-heavy sites. This launches a headless browser for those pages while keeping standard HTTP requests for everything else, balancing speed and completeness.
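Concretely, the wiring looks roughly like this (module paths per the scrapy-playwright project; treat it as a sketch to adapt):

```python
# settings.py: route downloads through Playwright's browser
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright requires the asyncio-based Twisted reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In the spider, opt in per request:
# yield scrapy.Request(url, meta={"playwright": True})
```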

### How do I respect robots.txt and avoid getting blocked?

Scrapy honors robots.txt whenever `ROBOTSTXT_OBEY` is `True` (projects generated by `scrapy startproject` enable it by default). Beyond that, set a `DOWNLOAD_DELAY` of at least 2 seconds, rotate user agents, limit concurrent requests per domain, and add contact info to the user agent string so site owners can reach you if needed.
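As a settings sketch (the values and contact address are illustrative):

```python
custom_settings = {
    "ROBOTSTXT_OBEY": True,
    "DOWNLOAD_DELAY": 2,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    # AutoThrottle backs off automatically when the server slows down
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
    "USER_AGENT": "KnowledgeCrawler/1.0 (+mailto:ops@example.com)",
}
```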

### Should I store raw HTML or just extracted text?

Store both. Raw HTML goes into object storage (S3 or local disk) as an archive, while extracted text goes into your vector database for retrieval. Keeping raw HTML lets you re-extract content when your extraction logic improves without re-crawling everything.

---

#WebScraping #DataPipelines #KnowledgeBase #Scrapy #Playwright #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/web-scraping-pipelines-agent-knowledge-crawling-indexing
