Context Window Management: Don't Waste Tokens

Your system prompt, conversation history, and injected documents all compete for the same context window. Here is how to manage token budget and avoid costly waste.

Author: AICredits Team
Published: 1 Apr 2026
Reading time: 7 min read

Why context window management matters

Every token you send in a prompt costs money. A 5,000-token system prompt repeated across 10,000 daily requests adds up to 50 million input tokens per day. At Claude Sonnet input prices (₹289 per million tokens), that is ₹14,450 per day for the system prompt alone. Cut it to 1,000 tokens and the daily cost falls to ₹2,890, a saving of ₹11,560.
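
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the function name `daily_prompt_cost_inr` is made up for illustration, and the price is the ₹289 per million tokens figure quoted in this article, not a value fetched from any API.

```python
def daily_prompt_cost_inr(prompt_tokens: int, requests_per_day: int,
                          price_per_million_inr: float) -> float:
    """Daily input cost of a fixed prompt that is resent on every request."""
    tokens_per_day = prompt_tokens * requests_per_day
    return tokens_per_day / 1_000_000 * price_per_million_inr

PRICE = 289  # ₹ per million input tokens (figure quoted in the article)

before = daily_prompt_cost_inr(5_000, 10_000, PRICE)  # 5,000-token system prompt
after = daily_prompt_cost_inr(1_000, 10_000, PRICE)   # trimmed to 1,000 tokens

print(f"before: ₹{before:,.0f}  after: ₹{after:,.0f}  saved: ₹{before - after:,.0f}")
```

Plugging in your own request volume and model price gives a quick estimate of what each 1,000 tokens of system prompt costs you per day.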

Context window management is also about quality. Irrelevant context dilutes the signal and increases the chance the model ignores important instructions.

Auditing your token budget

Start by measuring what you are actually sending:

import tiktoken

# tiktoken ships OpenAI tokenizers; counts for other providers' models are approximate.
enc = tiktoken.encoding_for_model("gpt-4o")

INPUT_PRICE_INR_PER_M = 14  # ₹ per million input tokens (GPT-4o Mini)

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def audit_request(system: str, history: list[dict], user_message: str) -> dict:
    """Break a request's input tokens down by component (message content only)."""
    system_tokens  = count_tokens(system)
    history_tokens = sum(count_tokens(m["content"]) for m in history)
    user_tokens    = count_tokens(user_message)
    total          = system_tokens + history_tokens + user_tokens

    return {
        "system_prompt":          system_tokens,
        "conversation_history":   history_tokens,
        "user_message":           user_tokens,
        "total_input":            total,
        "estimated_cost_inr":     round(total / 1_000_000 * INPUT_PRICE_INR_PER_M, 5),
    }
 
breakdown = audit_request(
    system="You are a helpful assistant. Be concise. Do not repeat yourself.",
    history=[
        {"role": "user",      "content": "What is an API gateway?"},
        {"role": "assistant", "content": "An API gateway is middleware that..."},
    ],
    user_message="How does it handle rate limiting?",
)
print(breakdown)

Sliding window for conversation history

For multi-turn conversations, keep only the last N turns:

from openai import OpenAI
 
client = OpenAI(base_url="https://api.aicredits.in/v1", api_key="sk-your-aicredits-key")
 
def chat_with_window(history: list[dict], new_message: str, window: int = 10) -> str:
    """Send only the first message plus the last `window` turns to control token cost."""
    first  = history[:1]   # always keep the first message (richest context)
    # Each turn is 2 messages (user + assistant). Guard window <= 0, because [-0:] would return everything.
    recent = history[1:][-window * 2:] if window > 0 else []
 
    messages = first + recent + [{"role": "user", "content": new_message}]
    response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=messages,
    )
 
    reply = response.choices[0].message.content
    history.append({"role": "user",      "content": new_message})
    history.append({"role": "assistant", "content": reply})
    return reply
 
history = [{"role": "user", "content": "I'm building a customer support bot."}]
print(chat_with_window(history, "What model should I use for ticket classification?"))
print(chat_with_window(history, "And for generating the actual reply?"))
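
To see which messages a window retains without calling the API, the slicing can be pulled out into a pure function. The sketch below is illustrative only: `select_window` and the demo history are hypothetical names and data, and the logic assumes the same layout as above (one opening message followed by user/assistant pairs).

```python
def select_window(history: list[dict], window: int) -> list[dict]:
    """Return the first message plus the last `window` turns (2 messages per turn)."""
    if not history:
        return []
    first = history[:1]
    rest = history[1:]
    # A window of 0 must keep nothing from `rest`; rest[-0:] would keep all of it.
    recent = rest[-window * 2:] if window > 0 else []
    return first + recent

# Hypothetical history: one opening message followed by three full turns.
demo = [{"role": "user", "content": "opening"}]
for i in range(1, 4):
    demo.append({"role": "user", "content": f"question {i}"})
    demo.append({"role": "assistant", "content": f"answer {i}"})

kept = select_window(demo, window=1)
print([m["content"] for m in kept])  # ['opening', 'question 3', 'answer 3']
```

Keeping the selection logic separate from the network call also makes it straightforward to unit-test.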

Trimming the system prompt

Treat your system prompt like production code — review it regularly and remove redundancy:

| Before | After | Savings |
|--------|-------|---------|
| "Always be helpful, polite, and accurate" | (delete; the model defaults to this) | ~12 tokens |
| "When the user asks a question that you are unsure about, you should let them know..." | "Acknowledge uncertainty. Recommend verification for uncertain answers." | ~20 tokens |
| 3,000-token prompt with examples | 800-token prompt + dynamic few-shot retrieval | ~2,200 tokens |
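
The last row mentions dynamic few-shot retrieval: instead of hard-coding every example in the system prompt, pick the few most relevant ones per request. The sketch below is one simple way to do that, not a prescribed method. It ranks examples by word overlap with the query as a stand-in for embedding similarity, and the example pool, `pick_examples`, and `build_prompt` are all hypothetical.

```python
import re

# Hypothetical example pool; in practice this would come from your own data.
EXAMPLES = [
    {"input": "My card was charged twice", "output": "billing"},
    {"input": "I cannot log in to my account", "output": "auth"},
    {"input": "The API returns a 429 error", "output": "rate_limit"},
    {"input": "How do I reset my password?", "output": "auth"},
]

def words(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def pick_examples(query: str, pool: list[dict], k: int = 2) -> list[dict]:
    """Return the k pool entries that share the most words with the query."""
    q = words(query)
    ranked = sorted(pool, key=lambda ex: len(q & words(ex["input"])), reverse=True)
    return ranked[:k]

def build_prompt(base: str, query: str, pool: list[dict], k: int = 2) -> str:
    """Append only the selected examples to a short base prompt."""
    shots = "\n".join(
        f"Input: {ex['input']}\nLabel: {ex['output']}"
        for ex in pick_examples(query, pool, k)
    )
    return f"{base}\n\n{shots}"

chosen = pick_examples("I forgot my password and cannot log in", EXAMPLES)
print([ex["input"] for ex in chosen])
```

With this approach the prompt carries two short examples per request rather than the whole pool, which is where the token savings come from.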

Hard limits per component

Set explicit token budgets in code to prevent gradual drift:

MAX_SYSTEM_TOKENS   = 1_000
MAX_HISTORY_TOKENS  = 4_000
MAX_CONTEXT_TOKENS  = 3_000
MAX_USER_TOKENS     = 2_000
 
def validate_token_budget(system: str, history: list[dict], context: str, user_message: str) -> None:
    """Raise ValueError if any component exceeds its budget (unlike assert, this survives python -O)."""
    checks = [
        ("System prompt", count_tokens(system), MAX_SYSTEM_TOKENS),
        ("Context", count_tokens(context), MAX_CONTEXT_TOKENS),
        ("User message", count_tokens(user_message), MAX_USER_TOKENS),
        ("History", sum(count_tokens(m["content"]) for m in history), MAX_HISTORY_TOKENS),
    ]
    for name, used, limit in checks:
        if used > limit:
            raise ValueError(f"{name} too long: {used} tokens (limit {limit})")

Review and adjust these limits quarterly based on actual usage data from your AICredits dashboard.
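
Rejecting an over-budget request is one option; for conversation history, another is to drop the oldest messages until the rest fits. The following is a minimal sketch of that idea, not part of any AICredits API. `trim_history` and `rough_tokens` are hypothetical helpers, and the 4-characters-per-token estimate is a crude assumption that keeps the example dependency-free; pass a real counter such as the `count_tokens` function above via the `count` argument for accurate numbers.

```python
def rough_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)

def trim_history(history: list[dict], max_tokens: int, count=rough_tokens) -> list[dict]:
    """Keep the newest messages whose combined token count fits within max_tokens."""
    kept: list[dict] = []
    used = 0
    for message in reversed(history):      # walk from newest to oldest
        cost = count(message["content"])
        if used + cost > max_tokens:
            break                          # this message and everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()                         # restore chronological order
    return kept

# Hypothetical history with estimated sizes of ~100, ~100, and ~50 tokens.
demo = [
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 200},
]
trimmed = trim_history(demo, max_tokens=160)
print(len(trimmed))  # 2
```

Unlike a fixed turn count, a token budget adapts to message length: a handful of long messages and many short ones are both held to the same limit.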
