---
title: "Token Optimization: Reducing LLM Input Size Without Losing Quality"
description: "Master prompt compression, context pruning, conversation summarization, and selective history techniques to cut LLM costs and latency while preserving response quality in your AI agents."
canonical: https://callsphere.ai/blog/token-optimization-reducing-llm-input-size-without-losing-quality
category: "Learn Agentic AI"
tags: ["Token Optimization", "Prompt Engineering", "Cost Reduction", "Context Management", "Python"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-07T14:05:56.981Z
---

# Token Optimization: Reducing LLM Input Size Without Losing Quality

> Master prompt compression, context pruning, conversation summarization, and selective history techniques to cut LLM costs and latency while preserving response quality in your AI agents.

## Why Token Count Is Your Primary Cost and Latency Driver

Every token sent to an LLM costs money and adds latency. Input tokens are billed per unit (providers typically quote prices per thousand or per million tokens), and the time the model spends processing your prompt scales roughly linearly with token count. A 4,000-token prompt processes noticeably faster than a 16,000-token prompt — and costs 75% less.

For AI agents that maintain conversation history, tool outputs, and system instructions, token counts grow rapidly. A 20-turn conversation with tool results can easily reach 30,000+ input tokens per completion call. Optimizing this is not premature — it is essential for production viability.
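Before optimizing, it helps to see where the tokens go. The sketch below uses a rough heuristic of about 4 characters per English token for back-of-envelope accounting (use your provider's tokenizer, such as tiktoken, for exact counts); the component names and sample strings are illustrative, not from a real agent:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def budget_report(components: dict[str, str]) -> dict[str, int]:
    """Break a prompt into per-component token estimates."""
    return {name: estimate_tokens(text) for name, text in components.items()}

# Illustrative components of one completion call
report = budget_report({
    "system_prompt": "You are a customer service assistant. " * 10,
    "tool_output": '{"orders": [{"id": 1}]}' * 40,
    "history": "user: hi\nassistant: hello\n" * 100,
})
total = sum(report.values())
```

A report like this usually shows that tool outputs and history, not the system prompt, dominate the budget — which tells you which of the techniques below to reach for first.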

## Prompt Compression: Saying the Same Thing in Fewer Tokens

System prompts are sent with every request. Compressing them yields compounding savings. The key principle is to remove redundancy without removing information.

```mermaid
flowchart TD
    SPEC(["Task spec"])
    SYSTEM["System prompt
role plus rules"]
    SHOTS["Few shot examples
3 to 5"]
    VARS["Variable injection
Jinja or f-string"]
    COT["Chain of thought
or scratchpad"]
    CONSTR["Output constraint
JSON schema"]
    LLM["LLM call"]
    EVAL["Offline eval
LLM as judge plus regex"]
    GATE{"Score over
threshold?"}
    COMMIT(["Promote to prod
version pinned"])
    REVISE(["Revise prompt"])
    SPEC --> SYSTEM --> SHOTS --> VARS --> COT --> CONSTR --> LLM --> EVAL --> GATE
    GATE -->|Yes| COMMIT
    GATE -->|No| REVISE --> SYSTEM
    style LLM fill:#4f46e5,stroke:#4338ca,color:#fff
    style EVAL fill:#f59e0b,stroke:#d97706,color:#1f2937
    style COMMIT fill:#059669,stroke:#047857,color:#fff
```

```python
# BEFORE: 87 tokens
VERBOSE_PROMPT = """
You are a helpful customer service assistant for our company.
You should always be polite and professional in your responses.
When a customer asks a question, you should try to provide
a helpful and accurate answer. If you do not know the answer,
you should let the customer know that you will escalate their
question to a human agent who can help them.
"""

# AFTER: 34 tokens (61% reduction)
COMPRESSED_PROMPT = """You are a customer service assistant. Be polite and professional.
Answer accurately. If unsure, escalate to a human agent."""
```

Rules for prompt compression without quality loss: remove filler words ("try to", "should always"), eliminate repeated instructions, use imperative mood, and combine related sentences.
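The first of these rules can be partially mechanized. The sketch below strips a small, hand-picked list of filler phrases and collapses whitespace; the phrase list is illustrative, not exhaustive, and the output should be treated as a first pass before a human edit, not a replacement for one:

```python
import re

# Illustrative filler phrases -- extend with patterns from your own prompts
FILLER_PATTERNS = [
    r"\byou should (always )?\b",
    r"\btry to \b",
    r"\bin your responses\b",
    r"\bplease note that \b",
]

def compress_prompt(prompt: str) -> str:
    """First-pass compression: strip filler phrases, collapse whitespace."""
    out = prompt
    for pattern in FILLER_PATTERNS:
        out = re.sub(pattern, "", out, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", out).strip()

compress_prompt("You should always be polite and try to answer accurately.")
# -> "be polite and answer accurately."
```

Always re-run your evals after mechanical edits like this; a stripped phrase can occasionally carry meaning the pattern author did not anticipate.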

## Context Pruning: Keeping Only What Matters

Not every message in a conversation is relevant to the current turn. Context pruning removes or shortens messages that no longer contribute to the response.

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str
    content: str
    turn_number: int
    token_count: int

class ContextPruner:
    def __init__(self, max_tokens: int = 8000):
        self.max_tokens = max_tokens

    def prune(self, messages: list[Message], current_turn: int) -> list[Message]:
        """Keep system prompt, recent messages, and summarize old ones."""
        system_msgs = [m for m in messages if m.role == "system"]
        conversation = [m for m in messages if m.role != "system"]

        # Always keep the last 6 messages (3 turns)
        recent = conversation[-6:]
        older = conversation[:-6]

        # Calculate remaining token budget
        system_tokens = sum(m.token_count for m in system_msgs)
        recent_tokens = sum(m.token_count for m in recent)
        budget = self.max_tokens - system_tokens - recent_tokens

        # From older messages, keep only those within budget
        kept_older = []
        used = 0
        for msg in reversed(older):
            if used + msg.token_count <= budget:
                kept_older.append(msg)
                used += msg.token_count

        # Restore chronological order before reassembling the context
        kept_older.reverse()
        return system_msgs + kept_older + recent
```

Pruning is lossy by design: dropped messages are gone for good. When old context still matters in aggregate, summarize it instead of discarding it.

## Conversation Summarization: Folding Old Turns into a Running Summary

For long-running sessions, replace the oldest messages with a short summary generated by a cheap model, and keep only a recent window verbatim.

```python
from openai import AsyncOpenAI

class ConversationSummarizer:
    def __init__(self, client: AsyncOpenAI):
        self.client = client

    async def summarize_window(self, messages: list[dict]) -> str:
        """Compress a window of messages into a concise summary."""
        formatted = "\n".join(
            f"{m['role']}: {m['content']}" for m in messages
        )
        response = await self.client.chat.completions.create(
            model="gpt-4o-mini",  # Use a cheap model for summarization
            messages=[
                {
                    "role": "system",
                    "content": "Summarize this conversation in 2-3 sentences. "
                    "Preserve key facts, decisions, and user preferences.",
                },
                {"role": "user", "content": formatted},
            ],
            max_tokens=150,
        )
        return response.choices[0].message.content

class SlidingWindowManager:
    def __init__(self, summarizer: ConversationSummarizer, window_size: int = 10):
        self.summarizer = summarizer
        self.window_size = window_size
        self.summary: str = ""
        self.messages: list[dict] = []

    async def add_and_compact(self, message: dict) -> list[dict]:
        self.messages.append(message)

        if len(self.messages) > self.window_size:
            # Summarize the oldest half
            split = len(self.messages) // 2
            to_summarize = self.messages[:split]
            self.messages = self.messages[split:]

            new_summary = await self.summarizer.summarize_window(to_summarize)
            self.summary = (
                f"{self.summary} {new_summary}".strip() if self.summary else new_summary
            )

        # Build the context for the LLM
        context = []
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Conversation summary so far: {self.summary}",
            })
        context.extend(self.messages)
        return context
```

The cost of the summarization call (using a cheap model like gpt-4o-mini) is far less than sending the full history to an expensive model on every turn.
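To check that claim with your own numbers, the arithmetic is a one-liner. The per-million-token prices below are illustrative placeholders, not real quotes; the sketch compares a 30,000-token full-history call against a compacted 8,000-token call plus one 2,000-token summarization call on the cheap model:

```python
def cost_usd(input_tokens: int, price_per_million: float) -> float:
    """Input-token cost at a given per-million-token price."""
    return input_tokens / 1_000_000 * price_per_million

# Placeholder prices, NOT real quotes -- substitute your provider's rates
EXPENSIVE_PER_M = 5.00   # main model
CHEAP_PER_M = 0.15       # summarization model

full_history = cost_usd(30_000, EXPENSIVE_PER_M)
compacted = cost_usd(8_000, EXPENSIVE_PER_M) + cost_usd(2_000, CHEAP_PER_M)
```

At these placeholder rates the compacted call costs roughly a quarter of the full-history call, and the gap widens every turn because the full history keeps growing while the compacted context stays near the window size.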

## Selective History: Including Only Relevant Turns

Instead of sending the entire conversation, you can use embedding similarity to select only the turns that are relevant to the current query.

```python
import numpy as np

class SelectiveHistory:
    def __init__(self, embedder, top_k: int = 5):
        self.embedder = embedder
        self.top_k = top_k
        self.history: list[dict] = []
        self.embeddings: list[np.ndarray] = []

    async def add_turn(self, message: dict):
        self.history.append(message)
        embedding = await self.embedder.embed(message["content"])
        self.embeddings.append(embedding)

    async def get_relevant_context(self, query: str) -> list[dict]:
        if len(self.history) <= self.top_k:
            return list(self.history)

        query_emb = await self.embedder.embed(query)
        # Cosine similarity between the query and each stored turn
        sims = [
            float(np.dot(query_emb, e) / (np.linalg.norm(query_emb) * np.linalg.norm(e)))
            for e in self.embeddings
        ]
        # Top-k most similar turns, restored to chronological order
        top = sorted(sorted(range(len(sims)), key=lambda i: sims[i])[-self.top_k :])
        return [self.history[i] for i in top]
```

Selective history pairs well with a short recency window: always include the last few turns verbatim, and fill the rest of the budget with retrieved ones.

## Tool Output Truncation: Shrinking the Largest Messages

Tool results such as search hits, API responses, and database rows are often the largest messages in an agent's context. Truncate them before they enter the history.

```python
import json

def summarize_tool_output(output: str, max_tokens: int) -> str:
    """Reduce tool output size while preserving structure."""
    try:
        data = json.loads(output)
        if isinstance(data, list) and len(data) > 5:
            truncated = data[:5]
            return json.dumps(truncated) + f"\n... ({len(data) - 5} more items)"
        return json.dumps(data, indent=None, separators=(",", ":"))
    except json.JSONDecodeError:
        # Plain text: truncate by character count (rough token estimate)
        char_limit = max_tokens * 4
        if len(output) > char_limit:
            return output[:char_limit] + "... (truncated)"
        return output
```

## FAQ

### Does reducing tokens actually change the quality of LLM responses?

It depends on what you remove. Removing filler words, redundant instructions, and irrelevant old messages has minimal impact on quality. Removing recent context, key user preferences, or important facts will degrade responses. The techniques above specifically target low-information content.

### When should I use summarization vs. pruning vs. selective history?

Use pruning when conversations are short-to-medium (under 30 turns) and you just need to stay within the context window. Use summarization for long-running sessions where old context still matters broadly. Use selective history when conversations cover many topics and only specific past turns are relevant to the current query.

### How do I measure whether my token optimization is hurting quality?

Run A/B evaluations. Send the same set of test queries through both the full-context and optimized-context paths, then compare response quality using an LLM-as-judge or human reviewers. Track a metric like "answer correctness" alongside your token savings to find the optimal tradeoff.
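A minimal harness for that comparison only needs to tally judge verdicts and input-token counts per arm; the judge itself is whatever grader you already trust (LLM-as-judge or human labels), so it is stubbed out here and the verdict data below is hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ArmResult:
    """Tally for one evaluation arm (full-context or optimized-context)."""
    correct: int = 0
    total: int = 0
    input_tokens: int = 0

    def record(self, is_correct: bool, tokens: int) -> None:
        self.total += 1
        self.correct += int(is_correct)
        self.input_tokens += tokens

    @property
    def accuracy(self) -> float:
        return self.correct / self.total if self.total else 0.0

# Hypothetical judge verdicts for 100 queries run through both paths
full, optimized = ArmResult(), ArmResult()
verdicts = [(True, True)] * 95 + [(True, False)] * 3 + [(False, True)] * 2
for verdict_full, verdict_opt in verdicts:
    full.record(verdict_full, tokens=30_000)
    optimized.record(verdict_opt, tokens=9_000)

token_savings = 1 - optimized.input_tokens / full.input_tokens
```

With these hypothetical numbers the optimized path saves 70% of input tokens for a one-point accuracy drop; whether that trade is acceptable is a product decision, which is exactly what the side-by-side metric makes explicit.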


---

Source: https://callsphere.ai/blog/token-optimization-reducing-llm-input-size-without-losing-quality
