---
title: "Temperature and Sampling: Controlling LLM Output Creativity"
description: "Master the sampling parameters that control LLM behavior — temperature, top-p, top-k, frequency penalty, and presence penalty — with practical examples showing when to use each."
canonical: https://callsphere.ai/blog/temperature-and-sampling-controlling-llm-output-creativity
category: "Learn Agentic AI"
tags: ["Temperature", "Sampling", "LLM", "Prompt Engineering", "API Parameters"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:42.543Z
---

# Temperature and Sampling: Controlling LLM Output Creativity

> Master the sampling parameters that control LLM behavior — temperature, top-p, top-k, frequency penalty, and presence penalty — with practical examples showing when to use each.

## How LLMs Choose Their Words

When an LLM generates text, it does not produce words directly. At each step, it computes a probability distribution over its entire vocabulary — typically 50,000 to 100,000 tokens. The model assigns a probability to every possible next token, and then it samples from that distribution. The sampling parameters you set control how that sampling happens, which in turn controls the character of the output.

This is the most practical lever you have for controlling LLM behavior without changing the prompt itself.

## Temperature: The Master Dial

Temperature scales the logits (raw scores) before they are converted to probabilities via the softmax function. It is the single most important sampling parameter.

```mermaid
flowchart TD
    LOGITS(["Raw logits
one score per vocab token"])
    PEN["Frequency and presence
penalties adjust logits"]
    TEMP["Temperature scaling
divide logits by T"]
    SOFT["Softmax
probability distribution"]
    FILTER["Top-k / top-p
filter candidates"]
    SAMPLE(["Sample next token"])
    LOGITS --> PEN --> TEMP --> SOFT --> FILTER --> SAMPLE
    style TEMP fill:#4f46e5,stroke:#4338ca,color:#fff
    style FILTER fill:#f59e0b,stroke:#d97706,color:#1f2937
    style SAMPLE fill:#059669,stroke:#047857,color:#fff
```

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """
    Apply temperature to logits before softmax.

    temperature < 1.0: sharper distribution (more deterministic)
    temperature = 1.0: probabilities unchanged
    temperature > 1.0: flatter distribution (more random)
    """
    scaled_logits = logits / temperature
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))
    return exp_logits / np.sum(exp_logits)

# Example: model raw logits for 5 candidate tokens
logits = np.array([5.0, 3.0, 1.0, 0.5, 0.1])
tokens = ["the", "a", "this", "my", "that"]

for temp in [0.1, 0.5, 1.0, 1.5, 2.0]:
    probs = softmax_with_temperature(logits, temperature=temp)
    print(f"Temperature {temp}:")
    for token, prob in zip(tokens, probs):
        bar = "#" * int(prob * 50)
        print(f"  {token:6s} {prob:.4f} {bar}")
    print()
```

At temperature 0.1, the highest-probability token gets almost all the weight — the output becomes nearly deterministic. At temperature 2.0, the probabilities are spread more evenly, and the model frequently picks less-likely tokens.
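These probabilities can be fed straight into a sampler. A minimal sketch (re-defining the softmax so the snippet runs on its own) shows how draw frequencies shift with temperature:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    scaled = np.asarray(logits) / temperature
    exp = np.exp(scaled - np.max(scaled))
    return exp / exp.sum()

def sample_token(tokens, logits, temperature, rng):
    """Draw one token according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choice(tokens, p=probs)

rng = np.random.default_rng(0)
logits = [5.0, 3.0, 1.0, 0.5, 0.1]
tokens = ["the", "a", "this", "my", "that"]

low = [sample_token(tokens, logits, 0.1, rng) for _ in range(20)]
high = [sample_token(tokens, logits, 2.0, rng) for _ in range(20)]
print("T=0.1:", low.count("the"), "of 20 draws are 'the'")
print("T=2.0:", high.count("the"), "of 20 draws are 'the'")
```

At T=0.1 essentially every draw is "the"; at T=2.0 a substantial fraction of draws land on the lower-probability tokens.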

**Temperature 0** is a special case. Most APIs treat it as greedy decoding: always pick the highest-probability token. This makes the output effectively deterministic (the same input almost always produces the same output, with the caveats discussed in the FAQ below):

```python
from openai import OpenAI

client = OpenAI()

# Deterministic output: always produces the same response
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    temperature=0,  # Greedy decoding — deterministic
)
print(response.choices[0].message.content)

# Creative output: varies between runs
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about programming."}],
    temperature=1.2,  # Higher creativity
)
print(response.choices[0].message.content)
```

## Top-p (Nucleus Sampling): Dynamic Vocabulary Filtering

Top-p sampling, also called nucleus sampling, takes a different approach. Instead of scaling all probabilities, it only considers the smallest set of tokens whose cumulative probability exceeds the threshold p:

```python
def top_p_sampling(logits, p=0.9):
    """
    Nucleus sampling: only consider the top tokens
    whose cumulative probability exceeds p.
    """
    probs = softmax_with_temperature(logits, temperature=1.0)

    # Sort tokens by probability (descending)
    sorted_indices = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_indices]

    # Find the cutoff where cumulative probability exceeds p
    cumulative = np.cumsum(sorted_probs)
    cutoff_index = np.searchsorted(cumulative, p) + 1

    # Zero out tokens below the cutoff
    allowed_indices = sorted_indices[:cutoff_index]
    filtered_probs = np.zeros_like(probs)
    filtered_probs[allowed_indices] = probs[allowed_indices]

    # Re-normalize
    filtered_probs /= np.sum(filtered_probs)

    return filtered_probs

logits = np.array([5.0, 3.0, 1.0, 0.5, 0.1, -1.0, -2.0])
tokens = ["the", "a", "this", "my", "that", "our", "their"]

for p in [0.5, 0.9, 0.95]:
    probs = top_p_sampling(logits, p=p)
    active = [(t, pr) for t, pr in zip(tokens, probs) if pr > 0.001]
    print(f"top_p={p}: {len(active)} tokens considered: {active}")
```

The advantage of top-p over temperature is adaptability. When the model is confident (one token has 95% probability), top-p=0.9 keeps only that token. When the model is uncertain (many tokens with similar probabilities), it lets more through. Temperature applies the same scaling regardless of the distribution shape.
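This adaptivity is easy to check numerically. A small helper (name illustrative) counts how many tokens survive the cutoff for a confident versus an uncertain distribution:

```python
import numpy as np

def nucleus_size(probs, p=0.9):
    """Size of the smallest token set whose cumulative probability reaches p."""
    sorted_probs = np.sort(probs)[::-1]
    return int(np.searchsorted(np.cumsum(sorted_probs), p) + 1)

confident = np.array([0.95, 0.02, 0.01, 0.01, 0.01])
uncertain = np.array([0.22, 0.21, 0.20, 0.19, 0.18])

print(nucleus_size(confident, p=0.9))  # → 1: only the dominant token survives
print(nucleus_size(uncertain, p=0.9))  # → 5: all five tokens needed to reach 0.9
```

The same p value keeps one candidate in the first case and five in the second, which is exactly the behavior a fixed temperature cannot provide.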

## Top-k Sampling: Fixed Vocabulary Cutoff

Top-k is the simplest filtering strategy: keep the k highest-probability tokens, discard the rest:

```python
def top_k_sampling(logits, k=10):
    """Only consider the top k tokens."""
    probs = softmax_with_temperature(logits, temperature=1.0)

    # Find indices of top k tokens
    top_k_indices = np.argsort(probs)[-k:]

    # Zero out everything else
    filtered_probs = np.zeros_like(probs)
    filtered_probs[top_k_indices] = probs[top_k_indices]
    filtered_probs /= np.sum(filtered_probs)

    return filtered_probs
```

Top-k is less commonly used with modern APIs because it does not adapt to the confidence level. With k=50, the model considers 50 tokens whether it is very confident or very uncertain. Top-p is generally preferred for this reason.

## Frequency and Presence Penalties

These parameters address repetition, one of the most common LLM failure modes:

```python
# Frequency penalty: reduces probability proportional to how many times
# a token has already appeared. Higher values = less repetition.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a paragraph about the ocean."}],
    frequency_penalty=0.5,  # Range: -2.0 to 2.0
)

# Presence penalty: reduces probability of any token that has appeared at all,
# regardless of how many times. Encourages topic diversity.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a paragraph about the ocean."}],
    presence_penalty=0.5,  # Range: -2.0 to 2.0
)
```

The difference is subtle but important. Frequency penalty penalizes tokens more each time they appear — saying "ocean" three times gets penalized more than saying it once. Presence penalty applies a flat penalty once a token has appeared at all. Use frequency penalty to reduce repetitive phrases within a response and presence penalty to encourage the model to explore new topics.
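OpenAI's API reference describes both penalties as a direct adjustment to the logits before sampling, of the form `mu[j] -= c[j] * alpha_frequency + (c[j] > 0) * alpha_presence`, where `c[j]` counts how often token j has already appeared. A sketch of that adjustment (helper name illustrative):

```python
from collections import Counter

def apply_penalties(logits, generated_token_ids,
                    frequency_penalty=0.0, presence_penalty=0.0):
    """Adjust per-token logits based on tokens already generated."""
    counts = Counter(generated_token_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        # Frequency penalty scales with the count; presence penalty is flat.
        adjusted[token_id] -= count * frequency_penalty + presence_penalty
    return adjusted

logits = [2.0, 1.0, 0.5]
history = [0, 0, 0, 1]  # token 0 appeared three times, token 1 once
result = apply_penalties(logits, history,
                         frequency_penalty=0.5, presence_penalty=0.2)
print([round(x, 3) for x in result])  # → [0.3, 0.3, 0.5]
```

Token 0 loses 3 × 0.5 + 0.2 = 1.7 from its logit, token 1 loses 0.5 + 0.2 = 0.7, and token 2, which never appeared, is untouched.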

## Practical Parameter Recommendations

Different use cases call for different parameter combinations:

```python
# Factual Q&A: deterministic, focused
factual_params = {
    "temperature": 0,
    "top_p": 1.0,
    "frequency_penalty": 0,
    "presence_penalty": 0,
}

# Code generation: low temperature, slight penalty for repetition
code_params = {
    "temperature": 0.2,
    "top_p": 0.95,
    "frequency_penalty": 0.1,
    "presence_penalty": 0,
}

# Creative writing: higher temperature, topic diversity
creative_params = {
    "temperature": 0.9,
    "top_p": 0.95,
    "frequency_penalty": 0.3,
    "presence_penalty": 0.5,
}

# Brainstorming: high temperature, strong diversity
brainstorm_params = {
    "temperature": 1.2,
    "top_p": 0.9,
    "frequency_penalty": 0.5,
    "presence_penalty": 0.8,
}

# Data extraction / classification: fully deterministic
extraction_params = {
    "temperature": 0,
    "top_p": 1.0,
    "frequency_penalty": 0,
    "presence_penalty": 0,
}
```

## The Interaction Between Temperature and Top-p

A common mistake is setting both temperature and top-p to extreme values simultaneously. They interact in ways that can produce unexpected results:

```python
# GOOD: Use one or the other as your primary control
# Option A: Temperature-based control
{"temperature": 0.3, "top_p": 1.0}   # top_p=1.0 means no filtering

# Option B: top_p-based control
{"temperature": 1.0, "top_p": 0.5}   # temperature=1.0 means no scaling

# AVOID: Both aggressive simultaneously
{"temperature": 0.2, "top_p": 0.5}   # Double restriction — very rigid
{"temperature": 1.5, "top_p": 0.99}  # Temperature adds randomness that top_p barely filters
```

OpenAI's documentation recommends adjusting either temperature or top-p, but not both. In practice, temperature is the more intuitive control for most developers.
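To see the double restriction concretely, here is a sketch that composes the two stages (temperature scaling first, then nucleus filtering; actual API internals may differ in detail):

```python
import numpy as np

def prepare_distribution(logits, temperature=1.0, top_p=1.0):
    """Apply temperature scaling, then nucleus (top-p) filtering."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Keep the smallest set of tokens whose cumulative probability reaches top_p
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = 1.0
    probs *= mask
    return probs / probs.sum()

logits = [5.0, 3.0, 1.0, 0.5, 0.1]

# Both aggressive: only one token survives the combined restriction
rigid = prepare_distribution(logits, temperature=0.2, top_p=0.5)
print(np.count_nonzero(rigid > 1e-9))  # → 1

# High temperature with a near-1 top_p: all five tokens remain candidates
loose = prepare_distribution(logits, temperature=1.5, top_p=0.99)
print(np.count_nonzero(loose > 1e-9))  # → 5
```

The aggressive pair collapses to greedy-like behavior, while the loose pair leaves the added randomness essentially unfiltered, matching the two AVOID cases above.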

## FAQ

### What temperature should I use for a production chatbot?

For most production chatbots, start with temperature 0.7 and top_p 1.0. This produces natural-sounding responses with enough variation to avoid feeling robotic, while staying focused enough to be reliable. For customer service bots where accuracy matters more than creativity, drop to 0.3. For creative applications like story generation, go up to 0.9 or 1.0. Always test with real user queries before committing to a value.

### Why does temperature 0 sometimes give different outputs?

Floating-point arithmetic on GPUs is not perfectly deterministic across different hardware configurations. Even with temperature 0, tiny numerical differences can cause a different token to be selected when two tokens have very similar probabilities. OpenAI provides a `seed` parameter that improves determinism but does not guarantee it. For applications requiring exact reproducibility, cache the responses rather than relying on deterministic generation.

### Can I change sampling parameters mid-conversation?

Yes. Sampling parameters are set per API call, not per conversation. You can use temperature 0 for a factual lookup, then switch to temperature 0.8 for a creative follow-up. This is a useful technique for multi-step agents that need different modes for different tasks — structured data extraction with temperature 0 followed by user-facing summary generation with temperature 0.7.

---

#Temperature #Sampling #LLM #PromptEngineering #APIParameters #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/temperature-and-sampling-controlling-llm-output-creativity
