Temperature, top_p, top_k: The Parameters That Control How Your LLM Thinks

One number changes your LLM from a deterministic calculator to a creative writer. Here's what temperature and sampling parameters actually do.

Author

AICredits Team

Published

25 Mar 2026

Reading time

9 min read

The question nobody explains properly

When you first call an LLM API, you see a wall of parameters: temperature, top_p, top_k, frequency_penalty, presence_penalty, seed. Most tutorials say "temperature controls creativity" and move on. That explanation is technically correct and practically useless.

This post explains what these parameters actually do at the sampling step, clearly enough that you can make confident decisions about which values to use for your specific use case.

How an LLM picks the next word

Before diving into parameters, you need a mental model of how generation works.

An LLM doesn't generate a whole sentence at once. It generates one token at a time, where a token is roughly 3-4 characters of English text. For each new token, the model looks at the prompt plus everything it has generated so far and produces a score (called a logit) for every token in its vocabulary, typically 32,000 to 128,000 possible tokens.

Raw logit scores aren't probabilities. They're just numbers, possibly negative, and they don't sum to 1. To turn them into a probability distribution, the model applies softmax:

probability(token_i) = exp(logit_i) / sum(exp(all logits))

After softmax, every token has a probability between 0 and 1, and all probabilities sum to 1. The model then samples from this distribution — it's like rolling a weighted die. High-probability tokens get picked more often; low-probability tokens rarely get picked.
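To make the softmax step and the weighted die concrete, here is a minimal sketch in plain Python. The four-token vocabulary and its logits are invented for illustration; a real model does the same thing over tens of thousands of tokens.

```python
import math
import random

def softmax(logits):
    """Convert raw logit scores into probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing; it does not change the result.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical four-token vocabulary with made-up logits.
vocab = ["Paris", "Lyon", "Nice", "banana"]
logits = [5.0, 2.0, 1.0, -3.0]

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token:>7}: {p:.4f}")

# The "weighted die": draw one token according to its probability.
rng = random.Random(0)
next_token = rng.choices(vocab, weights=probs, k=1)[0]
print("sampled:", next_token)
```

Note that a negative logit still ends up with a small positive probability. Nothing is ever ruled out completely unless a later filter (such as top_p or top_k) removes it.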

Temperature is applied before softmax, and it changes the shape of the distribution before sampling happens. Everything flows from that.

Temperature: the most important parameter

Temperature rescales the logit scores before softmax is applied:

adjusted_logit_i = logit_i / temperature

Dividing by a small number (< 1) sharpens the distribution — the highest-probability tokens become even more dominant. Dividing by a large number (> 1) flattens the distribution — probabilities become more uniform across tokens.

Here's what this looks like in practice. Imagine the model is generating the next word after "The capital of France is ___". The top five candidates might have probabilities like these after softmax at different temperature settings (the figures are illustrative, and the rest of the vocabulary is ignored):

| Token     | temp=0.1 | temp=0.7 | temp=1.0 | temp=2.0 |
|-----------|----------|----------|----------|----------|
| Paris     | 99.8%    | 94.2%    | 87.5%    | 62.1%    |
| Lyon      | 0.1%     | 2.8%     | 5.3%     | 14.2%    |
| Bordeaux  | 0.05%    | 1.5%     | 3.1%     | 10.4%    |
| Nice      | 0.03%    | 0.9%     | 2.2%     | 7.8%     |
| Marseille | 0.02%    | 0.6%     | 1.9%     | 5.5%     |

At temperature=0.1, the model almost always picks "Paris". At temperature=2.0, it still picks "Paris" most of the time, but now Lyon, Bordeaux, and Nice are meaningful alternatives.
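If you want to see the sharpening and flattening for yourself, the rescaling can be sketched in a few lines. The logits below are made up, so the printed numbers will not match the table above; what matters is the trend as temperature rises.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply softmax. Requires temperature > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 is handled as greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three candidate tokens.
logits = [3.0, 1.5, 1.0]

for t in (0.1, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temp={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

The ranking of the tokens never changes. Temperature only controls how much of the probability mass the front-runner keeps.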

temperature=0.0 is a special case. Division by zero is undefined, so APIs interpret this as greedy decoding: always pick the single highest-probability token. In principle the output becomes deterministic: given the same prompt, you get the same response. In practice, hosted APIs can still show small run-to-run differences because of floating-point and infrastructure effects.
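Here is a rough sketch of how a sampler might special-case zero. The function name pick_token and the logits are hypothetical; real inference servers differ in the details, but the branch is the same idea.

```python
import math
import random

def pick_token(logits, temperature, rng):
    """Return the index of the next token.

    temperature == 0 is treated as greedy decoding (argmax);
    anything above 0 samples from the temperature-scaled softmax.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # unnormalized is fine for choices()
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.8, 0.5]  # made-up scores; index 0 is the top candidate
rng = random.Random(123)

greedy = {pick_token(logits, 0, rng) for _ in range(100)}
sampled = {pick_token(logits, 1.0, rng) for _ in range(100)}
print("greedy picks:", greedy)
print("sampled picks:", sampled)
```

Greedy decoding only ever returns index 0 here, while sampling at temperature 1.0 spreads its picks across several indices.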

temperature > 1.0 is technically valid but often produces incoherent output because it gives real probability mass to tokens the model considers very unlikely. You rarely want this outside of creative experimentation.

The practical range for most use cases is 0.0 to 1.0.

from openai import OpenAI
 
client = OpenAI(
    base_url="https://api.aicredits.in/v1",
    api_key="sk-your-aicredits-key",
)
 
prompt = "Write a one-sentence product description for a noise-cancelling headphone."
 
# Factual, consistent output
response_low = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1,
)
 
# Creative, varied output
response_high = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.9,
)
 
print("Low temp:", response_low.choices[0].message.content)
print("High temp:", response_high.choices[0].message.content)

Run this ten times. At temperature=0.1, you'll get nearly identical sentences. At temperature=0.9, every run produces something different.

top_p: nucleus sampling

top_p (also called nucleus sampling) is a different way to constrain which tokens the model considers when sampling.

Instead of using the full probability distribution, top_p builds the smallest set of tokens whose cumulative probability reaches the threshold P, and samples only from that set.

For example, with top_p=0.9:

Sort all tokens by probability, highest first.
Keep adding tokens to the "nucleus" until their cumulative probability reaches 90%.
Sample only from those tokens (renormalizing probabilities to sum to 1).

The key insight: the nucleus size adapts to the situation. When the model is very confident (low-entropy scenarios), the nucleus might contain only 3-5 tokens. When the model is uncertain (high entropy, many plausible continuations), the nucleus expands to dozens or hundreds of tokens.
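The three steps above fit in a short function. This is a minimal sketch with two invented 10-token distributions, one peaked and one spread out, to show how the nucleus grows and shrinks for the same top_p.

```python
def nucleus(probs, top_p):
    """Return (index, renormalized probability) pairs for the top-p nucleus."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:  # stop once the threshold is reached
            break
    total = sum(p for _, p in kept)
    return [(index, p / total) for index, p in kept]

# Two made-up distributions over a 10-token vocabulary.
confident = [0.85, 0.08, 0.03, 0.01, 0.01, 0.01, 0.005, 0.003, 0.001, 0.001]
uncertain = [0.12, 0.11, 0.11, 0.10, 0.10, 0.10, 0.10, 0.09, 0.09, 0.08]

print("confident nucleus size:", len(nucleus(confident, 0.9)))
print("uncertain nucleus size:", len(nucleus(uncertain, 0.9)))
```

With top_p=0.9, the peaked distribution keeps 2 tokens and the spread-out one keeps 9, even though the setting is identical.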

This is more nuanced than top_k (explained next) because it doesn't arbitrarily cap the candidate count — it responds to the actual shape of the probability distribution at each step.

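A top_p=0.9 setting means you're excluding the least likely tokens that together account for the last 10% of probability mass. That is not the same as dropping 10% of the vocabulary: the excluded tail is often the vast majority of tokens. In practice, top_p=0.9 or top_p=0.95 are common defaults that work well across most tasks.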

top_k: keeping only the K best candidates

top_k is simpler: before sampling, discard every token except the K most probable ones.

With top_k=50, the model only ever considers the 50 most likely next tokens, regardless of their absolute probabilities. All tokens ranked 51st and below are set to zero probability.

The downside compared to top_p is that top_k is a fixed count. In a highly uncertain situation, a cap of 50 can cut off continuations that are still perfectly plausible. In a highly certain situation, even the 2nd-most-likely token might have negligible probability, making most of the 50 tokens noise.
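For comparison with top_p, a top_k filter can be sketched like this. The distributions are made up, and k=3 is used instead of 50 so the effect is visible on a five-token vocabulary.

```python
def top_k_filter(probs, k):
    """Zero out everything except the k most probable tokens, then renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    keep_set = set(keep)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep_set else 0.0 for i in range(len(probs))]

# Made-up distributions: one peaked, one flat.
peaked = [0.97, 0.01, 0.01, 0.005, 0.005]
flat = [0.2, 0.2, 0.2, 0.2, 0.2]

print("peaked, k=3:", [round(p, 3) for p in top_k_filter(peaked, 3)])
print("flat,   k=3:", [round(p, 3) for p in top_k_filter(flat, 3)])
```

In the peaked case, two of the three survivors are near-irrelevant. In the flat case, two equally good candidates are thrown away. The fixed count fits neither situation well.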

top_k is not supported by all providers. Notably, OpenAI's API does not expose top_k, while Google Gemini and Anthropic Claude do. For cross-provider compatibility, top_p is the safer choice.

Temperature vs. top_p: which to use?

The common misconception is that you should tune both simultaneously. In practice, pick one as your primary dial.

The recommendation from OpenAI and Anthropic is to alter temperature OR top_p, not both. Using both compounds the effect and makes behavior harder to reason about.

If you want to control randomness / creativity: use temperature.
If you want to control diversity by trimming the probability tail: use top_p with temperature fixed at 1.0.
If you're unsure: stick with temperature and leave top_p=1.0 (disabled).

The main scenario where top_p has an edge over temperature alone: tasks with variable difficulty. When the correct answer is obvious (e.g., completing "2 + 2 ="), you want the model to be very focused. When the answer is genuinely ambiguous (e.g., "describe the mood of this painting"), you want more options. top_p naturally handles both without you manually adjusting temperature.

frequency_penalty and presence_penalty: reducing repetition

These two parameters tackle a distinct problem: LLMs tend to repeat themselves, especially in long outputs.

frequency_penalty (range: -2.0 to 2.0) reduces the probability of a token proportionally to how many times it has already appeared in the output. The more a token has been used, the more it's penalized. A value of 0.5 is a reasonable starting point for long-form content.

presence_penalty (range: -2.0 to 2.0) applies a flat penalty to any token that has appeared at least once, regardless of how many times. It discourages repetition of any previously used word, pushing the model toward novel vocabulary.
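One way to picture the difference is as a subtraction on the logits before sampling. The sketch below follows the additive formula OpenAI describes in its documentation; treat it as an assumption for any other provider. The function name, token ids, and logits are all hypothetical.

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Return a penalized copy of `logits` (a dict of token_id -> logit)."""
    counts = Counter(generated_ids)
    adjusted = dict(logits)
    for token_id, count in counts.items():
        if token_id in adjusted:
            adjusted[token_id] -= count * frequency_penalty  # grows with each repeat
            adjusted[token_id] -= presence_penalty           # flat, applied once
    return adjusted

# Hypothetical token ids: token 7 has already appeared three times,
# token 9 once, token 4 never.
logits = {4: 1.0, 7: 1.0, 9: 1.0}
history = [7, 9, 7, 7]

print(apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.3))
```

All three tokens start with the same score. After the penalties, the unused token is untouched, the once-used token drops a little, and the thrice-used token drops the most.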

In practice:

frequency_penalty is better for preventing the model from overusing a specific phrase.
presence_penalty is better for encouraging topic diversity in longer outputs.
Both at 0.0 means no penalty (default behavior).
Negative values do the opposite — they encourage repetition, which is almost never what you want.

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a 200-word essay about innovation."}],
    temperature=0.7,
    frequency_penalty=0.5,   # discourage word repetition
    presence_penalty=0.3,    # encourage new vocabulary
)

seed: reproducibility for testing

The seed parameter sets the random number generator to a fixed state before sampling. When you use the same seed value with the same model, temperature, and prompt, you should get identical (or near-identical) output every time.

This is invaluable for:

Regression testing — verifying a prompt change actually changes output.
A/B testing — comparing two prompts with the same randomness.
Debugging — reproducing a specific output to investigate an issue.

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Suggest a startup name for an AI calendar app."}],
    temperature=0.9,
    seed=42,  # same seed should reproduce the output (best-effort)
)

Note: seed reproducibility is best-effort. Model updates, infrastructure changes, or floating-point differences across hardware can cause minor variations even with the same seed. Don't rely on byte-for-byte identical output in production logic.
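The underlying idea is easy to try locally with Python's own random number generator. This is only an analogy for what happens server-side: the candidate names and weights are invented, and a real provider's sampler is not exposed like this.

```python
import random

def sample_sequence(seed, length=5):
    """Draw `length` tokens from a fixed, made-up distribution using a seeded RNG."""
    vocab = ["Tempo", "Slate", "Orbit", "Daybreak", "Cadence"]
    weights = [0.3, 0.25, 0.2, 0.15, 0.1]
    rng = random.Random(seed)
    return rng.choices(vocab, weights=weights, k=length)

run_a = sample_sequence(seed=42)
run_b = sample_sequence(seed=42)
run_c = sample_sequence(seed=7)

print(run_a == run_b)  # True: identical seed, identical draws
print(run_a)
print(run_c)
```

On one machine with one RNG, the match is exact. A hosted model adds GPUs, batching, and version changes on top, which is why the API guarantee is weaker than this local one.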

max_tokens: cost control, not quality control

max_tokens caps the number of tokens the model generates in its response. It is not a quality parameter — truncating output doesn't make it better or worse, it just stops it mid-sentence if the model would have generated more.

Use max_tokens for:

Cost control: a runaway response won't drain your balance unexpectedly.
Latency control: shorter responses return faster.
Structured tasks: if you know the answer should be 1-2 sentences, cap it at 100 tokens.

Don't set it so low that legitimate responses get truncated. For conversational use cases, 512-2048 is usually plenty. For long-form generation, you might go up to 4096 or higher.
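A practical way to find out whether your cap is too tight is to check why generation stopped. OpenAI-style chat completion responses report finish_reason == "length" when the token limit was hit; other providers may name this differently, so check their documentation. The sketch uses stand-in objects so it runs without a network call, and was_truncated is a hypothetical helper name.

```python
from types import SimpleNamespace

def was_truncated(response):
    """True if the first choice stopped because it ran into the token cap."""
    return response.choices[0].finish_reason == "length"

# Stand-in objects shaped like a chat completion response, so the sketch runs offline.
cut_off = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
complete = SimpleNamespace(choices=[SimpleNamespace(finish_reason="stop")])

print(was_truncated(cut_off))   # True
print(was_truncated(complete))  # False
```

If you log this flag in production and see it fire often, raise max_tokens or ask the model for shorter answers in the prompt.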

Practical settings cheat sheet

| Use Case             | temperature | top_p | frequency_penalty | Notes                                   |
|----------------------|-------------|-------|-------------------|-----------------------------------------|
| Factual Q&A          | 0.1         | 0.9   | 0.0               | Near-deterministic, accurate            |
| Data extraction/JSON | 0.0         | n/a   | 0.0               | Always use greedy for structured output |
| Code generation      | 0.2         | 0.9   | 0.0               | Low variance; correctness matters       |
| Chat assistant       | 0.7         | 1.0   | 0.3               | Natural, not repetitive                 |
| Creative writing     | 0.9         | 0.95  | 0.5               | Varied, engaging prose                  |
| Brainstorming        | 1.0         | 1.0   | 0.6               | Maximum diversity of ideas              |
| Summarization        | 0.3         | 0.9   | 0.2               | Faithful to source, slight variety      |
| Translation          | 0.1         | 0.9   | 0.0               | Precision over creativity               |

These are starting points. Every model and every domain is different — treat this table as a first experiment, not a final answer.

Provider differences

Not every provider supports every parameter. Here's what you can rely on across the main providers available through AICredits:

| Parameter         | OpenAI | Anthropic Claude | Google Gemini | DeepSeek | Mistral |
|-------------------|--------|------------------|---------------|----------|---------|
| temperature       | Yes    | Yes              | Yes           | Yes      | Yes     |
| top_p             | Yes    | Yes              | Yes           | Yes      | Yes     |
| top_k             | No     | Yes              | Yes           | No       | No      |
| frequency_penalty | Yes    | No               | No            | Yes      | Yes     |
| presence_penalty  | Yes    | No               | No            | Yes      | Yes     |
| seed              | Yes    | No               | No            | Yes      | No      |
| max_tokens        | Yes    | Yes (max_tokens) | Yes           | Yes      | Yes     |

When writing provider-agnostic code, limit yourself to temperature, top_p, and max_tokens — these work everywhere. If you need top_k or seed, check the specific model's documentation first.
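If you do want to carry optional parameters around, one approach is to strip out whatever the target provider doesn't accept before making the call. The sketch below copies the support matrix from the table above; the provider keys and the filter_params helper are hypothetical, and the matrix should be re-checked against current provider documentation.

```python
# Support matrix copied from the table above; verify against current provider docs.
SUPPORTED = {
    "openai":    {"temperature", "top_p", "frequency_penalty", "presence_penalty", "seed", "max_tokens"},
    "anthropic": {"temperature", "top_p", "top_k", "max_tokens"},
    "gemini":    {"temperature", "top_p", "top_k", "max_tokens"},
    "deepseek":  {"temperature", "top_p", "frequency_penalty", "presence_penalty", "seed", "max_tokens"},
    "mistral":   {"temperature", "top_p", "frequency_penalty", "presence_penalty", "max_tokens"},
}

def filter_params(provider, params):
    """Split params into (supported, dropped) for the given provider key."""
    allowed = SUPPORTED[provider]
    supported = {k: v for k, v in params.items() if k in allowed}
    dropped = sorted(k for k in params if k not in allowed)
    return supported, dropped

wanted = {"temperature": 0.7, "top_p": 0.9, "top_k": 40, "seed": 42, "max_tokens": 256}
kept, dropped = filter_params("openai", wanted)
print(kept)
print("dropped:", dropped)
```

Returning the dropped names, rather than silently discarding them, lets you log a warning so a missing seed or top_k doesn't surprise you later.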

Code example: the same prompt, three temperatures

from openai import OpenAI
 
client = OpenAI(
    base_url="https://api.aicredits.in/v1",
    api_key="sk-your-aicredits-key",
)
 
prompt = "Explain what blockchain is in one sentence."
temperatures = [0.0, 0.7, 1.2]
 
for temp in temperatures:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=80,
    )
    print(f"\ntemperature={temp}:")
    print(response.choices[0].message.content)

Running this will show you the effect concretely. At temperature=0.0 you should get the same, or very nearly the same, output on every run. At temperature=1.2 you might get an unusual analogy or an unexpected framing: sometimes illuminating, sometimes incoherent.

A note on cost

None of the sampling parameters is billed. You pay for the number of input tokens (your prompt) plus output tokens (the model's response). Temperature, top_p, top_k, penalties, and seed change which tokens are generated and carry no charge of their own, although a different sample can come out slightly longer or shorter.

The only parameters that directly affect cost are max_tokens (caps output) and the model you choose. Everything else is free to experiment with.
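The arithmetic behind that is simple enough to sketch. The prices below are placeholders, not real rates for any model, and estimate_cost is a hypothetical helper; plug in the numbers from your provider's pricing page.

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Estimated cost of one request, given per-million-token prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# Placeholder prices, NOT real rates: check your provider's pricing page.
INPUT_PRICE = 0.50   # currency units per 1M input tokens (made up)
OUTPUT_PRICE = 2.00  # currency units per 1M output tokens (made up)

for max_tokens in (100, 512, 2048):
    worst_case = estimate_cost(1_000, max_tokens, INPUT_PRICE, OUTPUT_PRICE)
    print(f"max_tokens={max_tokens}: at most {worst_case:.6f} per request")
```

Because max_tokens is an upper bound on output tokens, it gives you a worst-case figure per request. The actual cost is usually lower, since most responses stop before the cap.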

If you're optimizing costs, the levers are: choosing a smaller model, shortening your system prompt, caching repeated identical requests, and setting a reasonable max_tokens. Read our guide on cutting LLM API costs by 50% for a full breakdown.

Summary

Temperature is the primary creativity dial. Use 0.0 for deterministic output, 0.1–0.3 for factual/technical tasks, 0.7–1.0 for creative tasks.
top_p trims the probability tail dynamically. More sophisticated than top_k; prefer it for cross-provider code.
top_k limits candidates to K tokens. Provider support is inconsistent; use top_p instead when possible.
frequency_penalty / presence_penalty reduce repetition in longer outputs. Start at 0.3–0.5 for conversational use cases.
seed enables reproducibility. Useful for testing; don't rely on it for production logic.
max_tokens controls cost and latency. Set it to a reasonable ceiling, not a tight constraint.
None of the sampling parameters costs anything by itself. Cost is purely a function of token count and model choice.

The instinct to leave everything at default is understandable — defaults are sensible. But understanding these parameters takes maybe 30 minutes and pays back every time you're debugging unexpected output, tuning a production prompt, or explaining to a stakeholder why the model "said something weird."
