---
title: "Token Healing and Output Recovery: Fixing Common LLM Generation Artifacts"
description: "Learn techniques for detecting and repairing common LLM output problems including truncated responses, malformed JSON, encoding artifacts, and broken code blocks through robust post-processing pipelines."
canonical: https://callsphere.ai/blog/token-healing-output-recovery-fixing-llm-generation-artifacts
category: "Learn Agentic AI"
tags: ["Token Healing", "Output Recovery", "Post-Processing", "Error Handling", "Agentic AI"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-06-03T18:46:47.822Z
---

# Token Healing and Output Recovery: Fixing Common LLM Generation Artifacts

> Learn techniques for detecting and repairing common LLM output problems including truncated responses, malformed JSON, encoding artifacts, and broken code blocks through robust post-processing pipelines.

## The Reality of LLM Outputs

LLM outputs are not always clean. Even the best models produce artifacts: truncated responses when hitting token limits, malformed JSON with trailing commas or missing brackets, code blocks that open but never close, and Unicode encoding errors from tokenizer edge cases. In agentic pipelines where outputs feed into downstream parsers, tools, and other models, these artifacts cause cascading failures.

Token healing and output recovery are the defensive techniques that make agent pipelines robust against these inevitable generation imperfections.

## Token Healing: Fixing Tokenization Boundary Issues

Token healing addresses a specific problem at the boundary between a prompt and the model's completion. When a prompt ends mid-token (for example, ending with a partial URL or code string), the model may generate an unexpected continuation because the tokenizer splits the boundary differently than intended.

```mermaid
flowchart LR
    INPUT(["User intent"])
    PARSE["Parse plus
classify"]
    PLAN["Plan and tool
selection"]
    AGENT["Agent loop
LLM plus tools"]
    GUARD{"Guardrails
and policy"}
    EXEC["Execute and
verify result"]
    OBS[("Trace and metrics")]
    OUT(["Outcome plus
next action"])
    INPUT --> PARSE --> PLAN --> AGENT --> GUARD
    GUARD -->|Pass| EXEC --> OUT
    GUARD -->|Fail| AGENT
    AGENT --> OBS
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OBS fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff
```

The solution is to back up by one token from the prompt boundary and let the model regenerate from that point with a constrained prefix:

```python
import tiktoken

def heal_token_boundary(prompt: str, completion: str, model: str = "gpt-4") -> str:
    """Fix artifacts at the prompt-completion boundary."""
    encoding = tiktoken.encoding_for_model(model)

    # Encode the last few characters of the prompt
    prompt_tokens = encoding.encode(prompt)
    if not prompt_tokens:
        return completion

    # Decode the last token to see if it might be a partial match
    last_token_text = encoding.decode([prompt_tokens[-1]])
    prompt_suffix = prompt[-len(last_token_text):]

    # If the prompt's trailing text does not match the last token's
    # full decoded form, we have a boundary issue
    if prompt_suffix != last_token_text:
        # Re-encode the boundary region
        boundary = prompt_suffix + completion[:10]
        healed_tokens = encoding.encode(boundary)
        healed_text = encoding.decode(healed_tokens)
        # Replace the boundary region with the healed version
        completion = healed_text[len(prompt_suffix):] + completion[10:]

    return completion
```

## Truncation Recovery

When responses hit the `max_tokens` limit, they are cut off mid-sentence or mid-structure. For structured outputs, this is catastrophic — a truncated JSON string is unparseable. Recovery strategies depend on the output format:

```python
import json
import re

def recover_truncated_json(raw: str) -> dict | None:
    """Attempt to recover a valid JSON object from truncated output."""
    # Strip markdown fences if present
    raw = re.sub(r"```json\s*", "", raw)
    raw = re.sub(r"```\s*$", "", raw)
    raw = raw.strip()

    # Try parsing as-is first
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    # Strategy 1: Close unclosed brackets and braces
    open_braces = raw.count("{") - raw.count("}")
    open_brackets = raw.count("[") - raw.count("]")

    repaired = raw.rstrip(",\n ")  # remove trailing commas
    # Remove any incomplete key-value pair at the end
    repaired = re.sub(r',\s*"[^"]*"\s*:\s*$', "", repaired)
    repaired = re.sub(r',\s*"[^"]*$', "", repaired)
    repaired = re.sub(r',\s*$', "", repaired)

    repaired += "]" * max(0, open_brackets)
    repaired += "}" * max(0, open_braces)

    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        pass

    # Strategy 2: Find the last valid JSON prefix
    for end in range(len(raw), 0, -1):
        candidate = raw[:end]
        open_b = candidate.count("{") - candidate.count("}")
        open_k = candidate.count("[") - candidate.count("]")
        candidate += "]" * max(0, open_k) + "}" * max(0, open_b)
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue

    return None
```

## Format Repair Pipeline

A robust format repair pipeline applies multiple repair strategies in sequence, from cheapest to most expensive:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RepairResult:
    success: bool
    data: any
    strategy_used: str

def build_repair_pipeline(
    strategies: list[tuple[str, Callable[[str], any]]],
) -> Callable[[str], RepairResult]:
    """Build a repair pipeline that tries strategies in order."""
    def repair(raw_output: str) -> RepairResult:
        for name, strategy in strategies:
            try:
                result = strategy(raw_output)
                if result is not None:
                    return RepairResult(success=True, data=result, strategy_used=name)
            except Exception:
                continue
        return RepairResult(success=False, data=None, strategy_used="none")

    return repair

# Configure the pipeline
json_repair = build_repair_pipeline([
    ("direct_parse", lambda s: json.loads(s)),
    ("strip_fences", lambda s: json.loads(re.sub(r"```\w*\n?|\n?```", "", s).strip())),
    ("truncation_recovery", recover_truncated_json),
    ("extract_first_object", lambda s: json.loads(re.search(r"\{.*\}", s, re.DOTALL).group())),
])

# Usage
result = json_repair(llm_output)
if result.success:
    print(f"Parsed using: {result.strategy_used}")
    process(result.data)
else:
    trigger_retry_or_escalate()
```

## Post-Processing Best Practices

**Always validate structure before content.** Check that JSON is valid before checking that it has the right keys. Check that code compiles before checking that it runs correctly. Structural validation is cheap and catches the most common artifacts.

**Log repair actions.** Every repair is a signal that something went wrong upstream. Track which repair strategies fire most often and use that data to improve your prompts, adjust token limits, or switch models.

**Set repair budgets.** A post-processing pipeline should not retry indefinitely. Define a maximum number of repair attempts and a fallback behavior (return a default, escalate to a human, return a graceful error).

## Common Artifacts and Their Fixes

Trailing commas in JSON arrays and objects — strip with regex before parsing. Missing closing quotes — count quote parity and append if needed. Markdown code fences wrapping structured output — strip known fence patterns. HTML entities in plain text responses — decode with `html.unescape()`. Repeated tokens (model degeneration) — detect consecutive duplicate n-grams and truncate.

## FAQ

### When should I use output recovery versus retrying the LLM call?

Use output recovery first — it is faster and cheaper than an LLM retry. Retry only when recovery fails or when the content itself (not just the format) is inadequate. A good rule of thumb: if the semantic content is present but the format is broken, repair it. If the content is missing or wrong, retry.

### How do I handle truncation proactively?

Monitor the `finish_reason` field in the API response. If it is `length` instead of `stop`, the output was truncated. For structured outputs, set `max_tokens` high enough to accommodate the expected output plus a 30% buffer. For variable-length outputs, implement continuation — send a follow-up request asking the model to continue from where it stopped.

### Does token healing apply to all models?

The boundary artifact that token healing addresses is specific to byte-pair encoding (BPE) tokenizers, which are used by GPT, Llama, Mistral, and most major models. Models using character-level or word-level tokenizers do not exhibit this specific artifact, but they have their own edge cases.

---

#TokenHealing #OutputRecovery #PostProcessing #ErrorHandling #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/token-healing-output-recovery-fixing-llm-generation-artifacts
