Learn Agentic AI

Token Healing and Output Recovery: Fixing Common LLM Generation Artifacts

Learn techniques for detecting and repairing common LLM output problems including truncated responses, malformed JSON, encoding artifacts, and broken code blocks through robust post-processing pipelines.

The Reality of LLM Outputs

LLM outputs are not always clean. Even the best models produce artifacts: truncated responses when hitting token limits, malformed JSON with trailing commas or missing brackets, code blocks that open but never close, and Unicode encoding errors from tokenizer edge cases. In agentic pipelines where outputs feed into downstream parsers, tools, and other models, these artifacts cause cascading failures.

Token healing and output recovery are the defensive techniques that make agent pipelines robust against these inevitable generation imperfections.

Token Healing: Fixing Tokenization Boundary Issues

Token healing addresses a specific problem at the boundary between a prompt and the model's completion. When a prompt ends mid-token (for example, ending with a partial URL or code string), the model may generate an unexpected continuation because the tokenizer splits the boundary differently than intended.
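To make the failure concrete, here is a self-contained toy: a greedy longest-match tokenizer over a tiny hand-written vocabulary (real BPE vocabularies are learned, not hand-written, so treat this purely as an illustration). Ending the prompt mid-token forces a different split than tokenizing the full string would produce:

```python
# Toy greedy longest-match tokenizer over a made-up vocabulary.
# Real BPE merges are learned, but the boundary effect is the same.
VOCAB = ["https://", "http", "s", "://", "example", ".com"]

def tokenize(text: str) -> list[str]:
    """Greedily take the longest vocabulary match at each position."""
    tokens, i = [], 0
    while i < len(text):
        match = max(
            (v for v in VOCAB if text.startswith(v, i)),
            key=len,
            default=text[i],  # unknown character: fall back to one char
        )
        tokens.append(match)
        i += len(match)
    return tokens

prompt, completion = "http", "s://example.com"
print(tokenize(prompt) + tokenize(completion))  # ['http', 's', '://', 'example', '.com']
print(tokenize(prompt + completion))            # ['https://', 'example', '.com']
```

Because "http" followed by "s://" never re-merges into the single "https://" token, the model is conditioned on a token sequence it rarely saw in training; backing up one token before generating avoids exactly this.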


The solution is to back up by one token from the prompt boundary and let the model regenerate from that point with a constrained prefix:

import tiktoken

def heal_token_boundary(prompt: str, model: str = "gpt-4") -> tuple[str, str]:
    """Back up one token so the model can regenerate across the boundary.

    Returns (healed_prompt, removed_text). Send healed_prompt to the model
    and constrain decoding so the completion starts with removed_text; the
    tokenizer is then free to merge tokens across the original boundary.
    """
    encoding = tiktoken.encoding_for_model(model)
    prompt_tokens = encoding.encode(prompt)
    if not prompt_tokens:
        return prompt, ""

    # Drop the last token and remember the text it covered
    healed_prompt = encoding.decode(prompt_tokens[:-1])
    removed_text = prompt[len(healed_prompt):]
    return healed_prompt, removed_text

Truncation Recovery

When responses hit the max_tokens limit, they are cut off mid-sentence or mid-structure. For structured outputs, this is catastrophic — a truncated JSON string is unparseable. Recovery strategies depend on the output format:

import json
import re

def recover_truncated_json(raw: str) -> dict | None:
    """Attempt to recover a valid JSON object from truncated output."""
    # Strip markdown fences if present
    raw = re.sub(r"```(?:json)?\s*", "", raw)  # handles bare and json-tagged fences
    raw = raw.strip()

    # Try parsing as-is first
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    # Strategy 1: Close unclosed brackets and braces
    open_braces = raw.count("{") - raw.count("}")
    open_brackets = raw.count("[") - raw.count("]")

    repaired = raw.rstrip(",\n ")  # remove trailing commas
    # Remove any incomplete key-value pair at the end
    repaired = re.sub(r',\s*"[^"]*"\s*:\s*$', "", repaired)
    repaired = re.sub(r',\s*"[^"]*$', "", repaired)
    repaired = re.sub(r',\s*$', "", repaired)

    repaired += "]" * max(0, open_brackets)
    repaired += "}" * max(0, open_braces)

    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        pass

    # Strategy 2: Find the last valid JSON prefix
    for end in range(len(raw), 0, -1):
        candidate = raw[:end]
        open_b = candidate.count("{") - candidate.count("}")
        open_k = candidate.count("[") - candidate.count("]")
        candidate += "]" * max(0, open_k) + "}" * max(0, open_b)
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue

    return None
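Walking Strategy 1 by hand on a made-up truncated payload (the payload is invented for illustration; the repair steps mirror the function above):

```python
import json
import re

# Output cut off mid-key by a max_tokens limit
truncated = '{"ok": true, "items": [1, 2, 3], "partial_ke'

# Drop the incomplete trailing key, then close whatever is still open
repaired = re.sub(r',\s*"[^"]*$', "", truncated)
repaired += "]" * (repaired.count("[") - repaired.count("]"))
repaired += "}" * (repaired.count("{") - repaired.count("}"))

print(repaired)              # {"ok": true, "items": [1, 2, 3]}
print(json.loads(repaired))  # {'ok': True, 'items': [1, 2, 3]}
```

The incomplete `"partial_ke` key is lost, which is the right trade-off: a parseable object with one missing field beats an unparseable string.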

Format Repair Pipeline

A robust format repair pipeline applies multiple repair strategies in sequence, from cheapest to most expensive:


from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class RepairResult:
    success: bool
    data: Any
    strategy_used: str

def build_repair_pipeline(
    strategies: list[tuple[str, Callable[[str], Any]]],
) -> Callable[[str], RepairResult]:
    """Build a repair pipeline that tries strategies in order."""
    def repair(raw_output: str) -> RepairResult:
        for name, strategy in strategies:
            try:
                result = strategy(raw_output)
                if result is not None:
                    return RepairResult(success=True, data=result, strategy_used=name)
            except Exception:
                continue
        return RepairResult(success=False, data=None, strategy_used="none")

    return repair

# Configure the pipeline
json_repair = build_repair_pipeline([
    ("direct_parse", lambda s: json.loads(s)),
    ("strip_fences", lambda s: json.loads(re.sub(r"```\w*\n?|\n?```", "", s).strip())),
    ("truncation_recovery", recover_truncated_json),
    ("extract_first_object", lambda s: json.loads(re.search(r"\{.*\}", s, re.DOTALL).group())),
])

# Usage
result = json_repair(llm_output)
if result.success:
    print(f"Parsed using: {result.strategy_used}")
    process(result.data)
else:
    trigger_retry_or_escalate()

Post-Processing Best Practices

Always validate structure before content. Check that JSON is valid before checking that it has the right keys. Check that code compiles before checking that it runs correctly. Structural validation is cheap and catches the most common artifacts.
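A sketch of structure-before-content ordering (the required keys here are a hypothetical tool-call schema, not from any real API):

```python
import json

REQUIRED_KEYS = {"action", "arguments"}  # hypothetical tool-call schema

def validate(raw: str) -> tuple[bool, str]:
    """Structural checks first, content checks second."""
    try:
        data = json.loads(raw)  # cheap structural check
    except json.JSONDecodeError:
        return False, "invalid JSON"
    if not isinstance(data, dict):
        return False, "expected a JSON object"
    missing = REQUIRED_KEYS - data.keys()  # content check, only after parsing
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, "ok"

print(validate('{"action": "search"'))                    # fails structurally
print(validate('{"action": "search"}'))                   # fails on content
print(validate('{"action": "search", "arguments": {}}'))  # passes both
```

The ordering matters for error reporting too: "invalid JSON" and "missing keys" point to different upstream causes (truncation versus a prompt that under-specifies the schema).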

Log repair actions. Every repair is a signal that something went wrong upstream. Track which repair strategies fire most often and use that data to improve your prompts, adjust token limits, or switch models.

Set repair budgets. A post-processing pipeline should not retry indefinitely. Define a maximum number of repair attempts and a fallback behavior (return a default, escalate to a human, return a graceful error).
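A repair budget can be enforced with a small wrapper. Everything below is illustrative: the generator simulates an LLM that produces two broken outputs before a good one, and the repair function simply rejects strings that do not end in a closing brace.

```python
import json
from typing import Any, Callable

def with_repair_budget(
    generate: Callable[[], str],
    repair: Callable[[str], Any],
    max_attempts: int = 3,
    fallback: Any = None,
) -> Any:
    """Try generate-then-repair up to max_attempts times, then fall back."""
    for attempt in range(1, max_attempts + 1):
        raw = generate()
        repaired = repair(raw)
        if repaired is not None:
            return repaired
        print(f"repair attempt {attempt}/{max_attempts} failed")
    return fallback  # graceful default instead of retrying forever

# Simulated model: two truncated outputs, then a valid one
outputs = iter(['{"a":', '{"a": 1,', '{"a": 1}'])
result = with_repair_budget(
    generate=lambda: next(outputs),
    repair=lambda s: json.loads(s) if s.endswith("}") else None,
    fallback={"error": "unrecoverable"},
)
print(result)  # {'a': 1}
```

In production the fallback branch is where escalation to a human or a default response lives; the key property is that the loop is bounded.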

Common Artifacts and Their Fixes

Trailing commas in JSON arrays and objects: strip with a regex before parsing.
Missing closing quotes: count quote parity and append a quote if needed.
Markdown code fences wrapping structured output: strip known fence patterns.
HTML entities in plain-text responses: decode with html.unescape().
Repeated tokens (model degeneration): detect consecutive duplicate n-grams and truncate.
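Of the fixes above, repeated-token degeneration is the least mechanical. A minimal detector (the window size and repeat threshold are arbitrary choices, not established defaults) might look like:

```python
def truncate_degeneration(text: str, max_repeats: int = 3, max_n: int = 10) -> str:
    """Cut the output where a short phrase starts repeating back-to-back."""
    words = text.split()
    for i in range(len(words)):
        for n in range(1, max_n + 1):
            gram = words[i:i + n]
            if len(gram) < n:
                break  # ran off the end of the text
            # Count consecutive back-to-back repetitions of this n-gram
            repeats = 1
            while words[i + repeats * n : i + (repeats + 1) * n] == gram:
                repeats += 1
            if repeats >= max_repeats:
                return " ".join(words[:i + n])  # keep one copy, drop the rest
    return text

looping = "Final answer: 42. I hope this helps. I hope this helps. I hope this helps."
print(truncate_degeneration(looping))  # Final answer: 42. I hope this helps.
```

Note that this truncates everything after the first repeated run, so it assumes degeneration happens at the tail of the output, which is where it usually appears.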

FAQ

When should I use output recovery versus retrying the LLM call?

Use output recovery first — it is faster and cheaper than an LLM retry. Retry only when recovery fails or when the content itself (not just the format) is inadequate. A good rule of thumb: if the semantic content is present but the format is broken, repair it. If the content is missing or wrong, retry.

How do I handle truncation proactively?

Monitor the finish_reason field in the API response. If it is length instead of stop, the output was truncated. For structured outputs, set max_tokens high enough to accommodate the expected output plus a 30% buffer. For variable-length outputs, implement continuation — send a follow-up request asking the model to continue from where it stopped.
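The continuation pattern, sketched against a stubbed chat call (call_model stands in for your provider's client here; real response shapes and message formats vary by provider):

```python
def call_model(messages: list[dict], max_tokens: int = 256) -> dict:
    """Stub standing in for a real chat-completion call."""
    # Simulate a truncated first call and a completing second call
    if len(messages) == 1:
        return {"text": '{"report": "Q3 revenue gr', "finish_reason": "length"}
    return {"text": 'ew 12% year over year."}', "finish_reason": "stop"}

def generate_with_continuation(prompt: str, max_rounds: int = 3) -> str:
    messages = [{"role": "user", "content": prompt}]
    parts = []
    for _ in range(max_rounds):
        resp = call_model(messages)
        parts.append(resp["text"])
        if resp["finish_reason"] != "length":
            break  # "stop": the model finished on its own
        # Truncated: feed the partial output back and ask for the rest
        messages.append({"role": "assistant", "content": resp["text"]})
        messages.append({"role": "user",
                         "content": "Continue exactly where you stopped."})
    return "".join(parts)

print(generate_with_continuation("Summarize the Q3 report as JSON."))
```

The max_rounds cap matters: a model that keeps hitting the length limit should eventually trigger the fallback path rather than loop on continuation requests.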

Does token healing apply to all models?

The boundary artifact that token healing addresses is specific to byte-pair encoding (BPE) tokenizers, which are used by GPT, Llama, Mistral, and most major models. Models using character-level or word-level tokenizers do not exhibit this specific artifact, but they have their own edge cases.


#TokenHealing #OutputRecovery #PostProcessing #ErrorHandling #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team
