Learn Agentic AI

Token Healing and Output Recovery: Fixing Common LLM Generation Artifacts

Learn techniques for detecting and repairing common LLM output problems including truncated responses, malformed JSON, encoding artifacts, and broken code blocks through robust post-processing pipelines.

The Reality of LLM Outputs

LLM outputs are not always clean. Even the best models produce artifacts: truncated responses when hitting token limits, malformed JSON with trailing commas or missing brackets, code blocks that open but never close, and Unicode encoding errors from tokenizer edge cases. In agentic pipelines where outputs feed into downstream parsers, tools, and other models, these artifacts cause cascading failures.

Token healing and output recovery are the defensive techniques that make agent pipelines robust against these inevitable generation imperfections.

Token Healing: Fixing Tokenization Boundary Issues

Token healing addresses a specific problem at the boundary between a prompt and the model's completion. When a prompt ends mid-token (for example, ending with a partial URL or code string), the model may generate an unexpected continuation because the tokenizer splits the boundary differently than intended.
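To make the failure concrete, here is a self-contained toy: a greedy longest-match tokenizer over a tiny hand-written vocabulary (real BPE vocabularies are learned, not hand-written, so treat this purely as an illustration). Ending the prompt mid-token forces a different split than tokenizing the full string would produce:

```python
# Toy greedy longest-match tokenizer over a made-up vocabulary.
# Real BPE merges are learned, but the boundary effect is the same.
VOCAB = ["https://", "http", "s", "://", "example", ".com"]

def tokenize(text: str) -> list[str]:
    """Greedily take the longest vocabulary match at each position."""
    tokens, i = [], 0
    while i < len(text):
        match = max(
            (v for v in VOCAB if text.startswith(v, i)),
            key=len,
            default=text[i],  # unknown character: fall back to one char
        )
        tokens.append(match)
        i += len(match)
    return tokens

prompt, completion = "http", "s://example.com"
print(tokenize(prompt) + tokenize(completion))  # ['http', 's', '://', 'example', '.com']
print(tokenize(prompt + completion))            # ['https://', 'example', '.com']
```

Because "http" followed by "s://" never re-merges into the single "https://" token, the model is conditioned on a token sequence it rarely saw in training; backing up one token before generating avoids exactly this.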


The solution is to back up by one token from the prompt boundary and let the model regenerate from that point with a constrained prefix:

import tiktoken

def heal_token_boundary(prompt: str, model: str = "gpt-4") -> tuple[str, str]:
    """Back up one token so the model can regenerate across the boundary.

    Returns (healed_prompt, removed_text). Send healed_prompt to the model
    and constrain decoding so the completion starts with removed_text; the
    tokenizer is then free to merge tokens across the original boundary.
    """
    encoding = tiktoken.encoding_for_model(model)
    prompt_tokens = encoding.encode(prompt)
    if not prompt_tokens:
        return prompt, ""

    # Drop the last token and remember the text it covered
    healed_prompt = encoding.decode(prompt_tokens[:-1])
    removed_text = prompt[len(healed_prompt):]
    return healed_prompt, removed_text

Truncation Recovery

When responses hit the max_tokens limit, they are cut off mid-sentence or mid-structure. For structured outputs, this is catastrophic — a truncated JSON string is unparseable. Recovery strategies depend on the output format:

import json
import re

def recover_truncated_json(raw: str) -> dict | None:
    """Attempt to recover a valid JSON object from truncated output."""
    # Strip markdown fences if present
    raw = re.sub(r"```(?:json)?\s*", "", raw)  # handles bare and json-tagged fences
    raw = raw.strip()

    # Try parsing as-is first
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    # Strategy 1: Close unclosed brackets and braces
    open_braces = raw.count("{") - raw.count("}")
    open_brackets = raw.count("[") - raw.count("]")

    repaired = raw.rstrip(",\n ")  # remove trailing commas
    # Remove any incomplete key-value pair at the end
    repaired = re.sub(r',\s*"[^"]*"\s*:\s*$', "", repaired)
    repaired = re.sub(r',\s*"[^"]*$', "", repaired)
    repaired = re.sub(r',\s*$', "", repaired)

    repaired += "]" * max(0, open_brackets)
    repaired += "}" * max(0, open_braces)

    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        pass

    # Strategy 2: Find the last valid JSON prefix
    for end in range(len(raw), 0, -1):
        candidate = raw[:end]
        open_b = candidate.count("{") - candidate.count("}")
        open_k = candidate.count("[") - candidate.count("]")
        candidate += "]" * max(0, open_k) + "}" * max(0, open_b)
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue

    return None
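Walking Strategy 1 by hand on a made-up truncated payload (the payload is invented for illustration; the repair steps mirror the function above):

```python
import json
import re

# Output cut off mid-key by a max_tokens limit
truncated = '{"ok": true, "items": [1, 2, 3], "partial_ke'

# Drop the incomplete trailing key, then close whatever is still open
repaired = re.sub(r',\s*"[^"]*$', "", truncated)
repaired += "]" * (repaired.count("[") - repaired.count("]"))
repaired += "}" * (repaired.count("{") - repaired.count("}"))

print(repaired)              # {"ok": true, "items": [1, 2, 3]}
print(json.loads(repaired))  # {'ok': True, 'items': [1, 2, 3]}
```

The incomplete `"partial_ke` key is lost, which is the right trade-off: a parseable object with one missing field beats an unparseable string.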

Format Repair Pipeline

A robust format repair pipeline applies multiple repair strategies in sequence, from cheapest to most expensive:


from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class RepairResult:
    success: bool
    data: Any
    strategy_used: str

def build_repair_pipeline(
    strategies: list[tuple[str, Callable[[str], Any]]],
) -> Callable[[str], RepairResult]:
    """Build a repair pipeline that tries strategies in order."""
    def repair(raw_output: str) -> RepairResult:
        for name, strategy in strategies:
            try:
                result = strategy(raw_output)
                if result is not None:
                    return RepairResult(success=True, data=result, strategy_used=name)
            except Exception:
                continue
        return RepairResult(success=False, data=None, strategy_used="none")

    return repair

# Configure the pipeline
json_repair = build_repair_pipeline([
    ("direct_parse", lambda s: json.loads(s)),
    ("strip_fences", lambda s: json.loads(re.sub(r"```\w*\n?|\n?```", "", s).strip())),
    ("truncation_recovery", recover_truncated_json),
    ("extract_first_object", lambda s: json.loads(re.search(r"\{.*\}", s, re.DOTALL).group())),
])

# Usage
result = json_repair(llm_output)
if result.success:
    print(f"Parsed using: {result.strategy_used}")
    process(result.data)
else:
    trigger_retry_or_escalate()

Post-Processing Best Practices

Always validate structure before content. Check that JSON is valid before checking that it has the right keys. Check that code compiles before checking that it runs correctly. Structural validation is cheap and catches the most common artifacts.
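A sketch of structure-before-content ordering (the required keys here are a hypothetical tool-call schema, not from any real API):

```python
import json

REQUIRED_KEYS = {"action", "arguments"}  # hypothetical tool-call schema

def validate(raw: str) -> tuple[bool, str]:
    """Structural checks first, content checks second."""
    try:
        data = json.loads(raw)  # cheap structural check
    except json.JSONDecodeError:
        return False, "invalid JSON"
    if not isinstance(data, dict):
        return False, "expected a JSON object"
    missing = REQUIRED_KEYS - data.keys()  # content check, only after parsing
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, "ok"

print(validate('{"action": "search"'))                    # fails structurally
print(validate('{"action": "search"}'))                   # fails on content
print(validate('{"action": "search", "arguments": {}}'))  # passes both
```

The ordering matters for error reporting too: "invalid JSON" and "missing keys" point to different upstream causes (truncation versus a prompt that under-specifies the schema).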

Log repair actions. Every repair is a signal that something went wrong upstream. Track which repair strategies fire most often and use that data to improve your prompts, adjust token limits, or switch models.

Set repair budgets. A post-processing pipeline should not retry indefinitely. Define a maximum number of repair attempts and a fallback behavior (return a default, escalate to a human, return a graceful error).
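A repair budget can be enforced with a small wrapper. Everything below is illustrative: the generator simulates an LLM that produces two broken outputs before a good one, and the repair function simply rejects strings that do not end in a closing brace.

```python
import json
from typing import Any, Callable

def with_repair_budget(
    generate: Callable[[], str],
    repair: Callable[[str], Any],
    max_attempts: int = 3,
    fallback: Any = None,
) -> Any:
    """Try generate-then-repair up to max_attempts times, then fall back."""
    for attempt in range(1, max_attempts + 1):
        raw = generate()
        repaired = repair(raw)
        if repaired is not None:
            return repaired
        print(f"repair attempt {attempt}/{max_attempts} failed")
    return fallback  # graceful default instead of retrying forever

# Simulated model: two truncated outputs, then a valid one
outputs = iter(['{"a":', '{"a": 1,', '{"a": 1}'])
result = with_repair_budget(
    generate=lambda: next(outputs),
    repair=lambda s: json.loads(s) if s.endswith("}") else None,
    fallback={"error": "unrecoverable"},
)
print(result)  # {'a': 1}
```

In production the fallback branch is where escalation to a human or a default response lives; the key property is that the loop is bounded.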

Common Artifacts and Their Fixes

Trailing commas in JSON arrays and objects: strip with a regex before parsing.
Missing closing quotes: count quote parity and append a quote if needed.
Markdown code fences wrapping structured output: strip known fence patterns.
HTML entities in plain-text responses: decode with html.unescape().
Repeated tokens (model degeneration): detect consecutive duplicate n-grams and truncate.
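Of the fixes above, repeated-token degeneration is the least mechanical. A minimal detector (the window size and repeat threshold are arbitrary choices, not established defaults) might look like:

```python
def truncate_degeneration(text: str, max_repeats: int = 3, max_n: int = 10) -> str:
    """Cut the output where a short phrase starts repeating back-to-back."""
    words = text.split()
    for i in range(len(words)):
        for n in range(1, max_n + 1):
            gram = words[i:i + n]
            if len(gram) < n:
                break  # ran off the end of the text
            # Count consecutive back-to-back repetitions of this n-gram
            repeats = 1
            while words[i + repeats * n : i + (repeats + 1) * n] == gram:
                repeats += 1
            if repeats >= max_repeats:
                return " ".join(words[:i + n])  # keep one copy, drop the rest
    return text

looping = "Final answer: 42. I hope this helps. I hope this helps. I hope this helps."
print(truncate_degeneration(looping))  # Final answer: 42. I hope this helps.
```

Note that this truncates everything after the first repeated run, so it assumes degeneration happens at the tail of the output, which is where it usually appears.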

FAQ

When should I use output recovery versus retrying the LLM call?

Use output recovery first — it is faster and cheaper than an LLM retry. Retry only when recovery fails or when the content itself (not just the format) is inadequate. A good rule of thumb: if the semantic content is present but the format is broken, repair it. If the content is missing or wrong, retry.

How do I handle truncation proactively?

Monitor the finish_reason field in the API response. If it is length instead of stop, the output was truncated. For structured outputs, set max_tokens high enough to accommodate the expected output plus a 30% buffer. For variable-length outputs, implement continuation — send a follow-up request asking the model to continue from where it stopped.
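The continuation pattern, sketched against a stubbed chat call (call_model stands in for your provider's client here; real response shapes and message formats vary by provider):

```python
def call_model(messages: list[dict], max_tokens: int = 256) -> dict:
    """Stub standing in for a real chat-completion call."""
    # Simulate a truncated first call and a completing second call
    if len(messages) == 1:
        return {"text": '{"report": "Q3 revenue gr', "finish_reason": "length"}
    return {"text": 'ew 12% year over year."}', "finish_reason": "stop"}

def generate_with_continuation(prompt: str, max_rounds: int = 3) -> str:
    messages = [{"role": "user", "content": prompt}]
    parts = []
    for _ in range(max_rounds):
        resp = call_model(messages)
        parts.append(resp["text"])
        if resp["finish_reason"] != "length":
            break  # "stop": the model finished on its own
        # Truncated: feed the partial output back and ask for the rest
        messages.append({"role": "assistant", "content": resp["text"]})
        messages.append({"role": "user",
                         "content": "Continue exactly where you stopped."})
    return "".join(parts)

print(generate_with_continuation("Summarize the Q3 report as JSON."))
```

The max_rounds cap matters: a model that keeps hitting the length limit should eventually trigger the fallback path rather than loop on continuation requests.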

Does token healing apply to all models?

The boundary artifact that token healing addresses is specific to byte-pair encoding (BPE) tokenizers, which are used by GPT, Llama, Mistral, and most major models. Models using character-level or word-level tokenizers do not exhibit this specific artifact, but they have their own edge cases.


#TokenHealing #OutputRecovery #PostProcessing #ErrorHandling #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team
