Chain-of-Thought Prompting: A Deep Practical Guide for Developers

Learn how chain-of-thought prompting works under the hood, when to use it, and how to implement zero-shot, few-shot, and tree-of-thought variants without blowing your token budget.

Author

AICredits Team

Published

15 Mar 2026

Reading time

14 min read

What is chain-of-thought prompting and why does it work?

Chain-of-thought (CoT) prompting is a technique where you instruct a language model to reason through a problem step by step before producing its final answer. Instead of asking "What is the answer?", you ask the model to show its work — and that act of showing work turns out to dramatically improve accuracy on tasks that require multiple reasoning steps.

The technique was introduced by Google researchers in 2022 (Wei et al.), who prompted models with a few worked examples. Later that year, Kojima et al. showed that simply appending the phrase "Let's think step by step" to a prompt, with no examples at all, raised accuracy on the MultiArith arithmetic benchmark from around 18% to over 78% on one GPT-3-family model. That is not a small tweak: accuracy more than quadrupled from five words.

Why does it actually work?

There are two complementary explanations.

The information-theoretic view: When a model generates an answer directly, all of the multi-step reasoning has to happen inside the forward passes that produce the answer tokens. When it writes out intermediate steps, each step becomes additional context that informs the next token prediction. The model is, in effect, using its own output as working memory. An LLM carries nothing from one token to the next except what is in its context window (the KV cache is just a cached representation of that context), so written-out steps give later tokens more signal to work from.

The training distribution view: Human-written text that addresses hard problems (textbooks, worked solutions, engineering documentation, legal reasoning) almost always includes intermediate steps. When you prompt the model to reason step by step, you are steering the generation distribution toward this high-quality subset of its training data. The model has seen thousands of examples of "problem → steps → answer" and relatively few of "problem → answer" for non-trivial problems.

Both effects are real and they compound. This is why CoT helps most on tasks that are genuinely multi-step: math, logic puzzles, code debugging, planning sequences, and legal or financial analysis.


Zero-shot CoT: The simplest version

Zero-shot CoT requires no examples. You just add a reasoning instruction to your prompt.

Without CoT

from openai import OpenAI
 
client = OpenAI(
    base_url="https://api.aicredits.in/v1",
    api_key="sk-your-aicredits-key",
)
 
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": (
                "A train travels from Mumbai to Pune, a distance of 148 km. "
                "It departs at 6:15 AM and arrives at 8:45 AM. "
                "It makes one stop of 12 minutes along the way. "
                "What is the average moving speed of the train in km/h?"
            ),
        }
    ],
)
print(response.choices[0].message.content)

Typical output without CoT:

The average moving speed is approximately 59.2 km/h.

This answer is wrong. The model divided 148 by 2.5 hours without subtracting the stop time.

With zero-shot CoT

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": (
                "A train travels from Mumbai to Pune, a distance of 148 km. "
                "It departs at 6:15 AM and arrives at 8:45 AM. "
                "It makes one stop of 12 minutes along the way. "
                "What is the average moving speed of the train in km/h?\n\n"
                "Let's think step by step."
            ),
        }
    ],
)
print(response.choices[0].message.content)

Typical output with zero-shot CoT:

Step 1: Find the total elapsed time.
Departure: 6:15 AM, Arrival: 8:45 AM
Total elapsed time = 2 hours 30 minutes = 150 minutes

Step 2: Subtract the stop time.
Moving time = 150 minutes - 12 minutes = 138 minutes = 2.3 hours

Step 3: Calculate the average moving speed.
Speed = Distance / Moving time = 148 km / 2.3 hours ≈ 64.35 km/h

The average moving speed of the train is approximately 64.3 km/h.

The difference is not stylistic — the second answer is correct and the first is not.
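The arithmetic in this example is easy to confirm independently of any model. The sketch below recomputes the speed from the figures in the prompt, once ignoring the stop and once excluding it; the variable names are illustrative only.

```python
# Recompute the train problem by hand to check a model's reasoning chain.
distance_km = 148
departure_min = 6 * 60 + 15   # 6:15 AM as minutes after midnight
arrival_min = 8 * 60 + 45     # 8:45 AM as minutes after midnight
stop_min = 12

elapsed_min = arrival_min - departure_min   # total time on the clock
moving_min = elapsed_min - stop_min         # time actually spent moving

naive_speed = distance_km / (elapsed_min / 60)    # ignores the stop
moving_speed = distance_km / (moving_min / 60)    # excludes the stop

print(f"Naive speed:  {naive_speed:.2f} km/h")
print(f"Moving speed: {moving_speed:.2f} km/h")
```

A check like this is worth keeping next to any prompt you use as a regression test, because it gives you a ground-truth value to compare model output against.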

Other zero-shot CoT triggers that work well

  • "Let's think step by step." — the classic, works broadly
  • "Think carefully before answering." — gentler, useful for shorter problems
  • "Work through this methodically." — good for engineering or debugging tasks
  • "Break this down into smaller parts." — useful for planning tasks
  • "First, identify what is being asked. Then..." — useful when the question is ambiguous

Note: "Let's think step by step" works better on capable models (GPT-4 class, Claude Sonnet, Gemini 1.5 Pro). On very small models (under 7B parameters), zero-shot CoT can produce verbose but wrong reasoning — in that case, few-shot CoT is more reliable.


Few-shot CoT: Providing worked examples

Few-shot CoT is more powerful than zero-shot for domain-specific reasoning. You provide two to five examples that demonstrate the reasoning pattern you want, and the model follows that pattern on the new problem.

FEW_SHOT_SYSTEM = """
You are a financial reasoning assistant. When given a problem, always reason
through it step by step using the same structure as these examples.
 
Example 1:
Problem: A SaaS product charges ₹999/month. They have 1,200 subscribers.
Their monthly server cost is ₹4,50,000 and team cost is ₹8,00,000.
What is their monthly profit margin?
 
Reasoning:
- Monthly Revenue = 1,200 × ₹999 = ₹11,98,800
- Total Monthly Costs = ₹4,50,000 + ₹8,00,000 = ₹12,50,000
- Monthly Profit = ₹11,98,800 − ₹12,50,000 = −₹51,200 (a loss)
- Profit Margin = (−₹51,200 / ₹11,98,800) × 100 ≈ −4.3%
 
Answer: The company is currently losing money with a margin of approximately −4.3%.
 
Example 2:
Problem: An e-commerce store has a 3.2% conversion rate on 45,000 monthly
visitors. Average order value is ₹1,850. Their CAC is ₹320. What is their
approximate monthly ROAS?
 
Reasoning:
- Monthly conversions = 45,000 × 0.032 = 1,440 orders
- Monthly revenue = 1,440 × ₹1,850 = ₹26,64,000
- Estimated monthly ad spend = 1,440 customers × ₹320 CAC = ₹4,60,800
- ROAS = Revenue / Ad Spend = ₹26,64,000 / ₹4,60,800 ≈ 5.78
 
Answer: The approximate ROAS is 5.78x.
"""
 
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": FEW_SHOT_SYSTEM},
        {
            "role": "user",
            "content": (
                "Problem: A startup raised ₹2 crore in seed funding 18 months ago. "
                "They are burning ₹9,50,000 per month and currently have ₹42,00,000 "
                "in the bank. How many months of runway do they have, and what has "
                "their average monthly burn been since the raise?"
            ),
        },
    ],
)
print(response.choices[0].message.content)

What makes a good few-shot CoT example?

  1. The reasoning steps match the structure of the target problem. If your target involves subtraction before division, your examples should too.
  2. The examples are diverse enough to show the pattern, not just a template. Two examples that are too similar teach memorization, not reasoning.
  3. The label (final answer) is clearly separated from the reasoning. The model needs to know where the reasoning ends and the answer begins.
  4. Avoid examples where the reasoning steps happen to be trivial. If your example only needs one step, the model will not learn to use multiple steps on hard problems.
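The criteria above can be enforced in code rather than by eye. The following is a minimal sketch, assuming you keep examples as structured data; `CoTExample`, `format_example`, and `build_few_shot_prompt` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class CoTExample:
    """One worked example: problem, reasoning steps, and a separate answer."""
    problem: str
    steps: list[str]
    answer: str


def format_example(ex: CoTExample) -> str:
    """Render an example with the answer clearly separated from the reasoning."""
    if len(ex.steps) < 2:
        # Single-step examples do not demonstrate multi-step reasoning.
        raise ValueError("each example should show at least two reasoning steps")
    steps = "\n".join(f"- {s}" for s in ex.steps)
    return f"Problem: {ex.problem}\n\nReasoning:\n{steps}\n\nAnswer: {ex.answer}"


def build_few_shot_prompt(instruction: str, examples: list[CoTExample]) -> str:
    """Join the instruction and numbered examples into one system prompt."""
    blocks = [
        f"Example {i}:\n{format_example(ex)}"
        for i, ex in enumerate(examples, start=1)
    ]
    return instruction.strip() + "\n\n" + "\n\n".join(blocks)


prompt = build_few_shot_prompt(
    "You are a financial reasoning assistant. Reason step by step.",
    [
        CoTExample(
            problem="1,200 subscribers pay ₹999/month. Costs are ₹12,50,000. Profit?",
            steps=[
                "Revenue = 1,200 × ₹999 = ₹11,98,800",
                "Profit = ₹11,98,800 − ₹12,50,000 = −₹51,200",
            ],
            answer="A loss of ₹51,200 per month.",
        ),
    ],
)
print(prompt)
```

Keeping the steps as a list makes it cheap to reject trivial examples and guarantees every example uses the same "Reasoning:" and "Answer:" layout.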

Auto-CoT: Letting the model generate its own examples

When you have a large dataset of questions but no hand-written reasoning chains, you can use Auto-CoT: ask the model to generate reasoning chains for a sample of questions, then use those as few-shot examples for the rest.

def generate_reasoning_chain(question: str) -> str:
    """Have the model generate a CoT example for a given question."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": (
                    f"Question: {question}\n\n"
                    "Let's think step by step. Show your full reasoning, "
                    "then give the final answer on a new line starting with 'Answer:'"
                ),
            }
        ],
        temperature=0.0,  # Deterministic for example generation
    )
    return response.choices[0].message.content
 
 
def build_auto_cot_prompt(seed_questions: list[str], target_question: str) -> str:
    """Build a few-shot CoT prompt using auto-generated examples."""
    examples = []
    for q in seed_questions[:3]:  # Takes the first 3 seeds; choose diverse ones upstream
        chain = generate_reasoning_chain(q)
        examples.append(f"Question: {q}\n{chain}")
 
    examples_text = "\n\n---\n\n".join(examples)
    return (
        f"{examples_text}\n\n---\n\n"
        f"Question: {target_question}\n"
        "Let's think step by step."
    )

Auto-CoT works well when you can afford the upfront cost of generating examples (which you pay once, then cache) and when you have a diverse set of seed questions that covers the reasoning patterns you need.
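Paying once and then caching could look something like the sketch below. It stores each generated chain on disk keyed by a hash of the question; `cached_chain` is a hypothetical helper, and the stub generator stands in for a real model call so the example runs offline.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_chain(question: str, generate: Callable[[str], str], cache_dir: Path) -> str:
    """Return a stored reasoning chain, generating it only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(question.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["chain"]
    chain = generate(question)  # the only place a model call would happen
    path.write_text(
        json.dumps({"question": question, "chain": chain}, ensure_ascii=False),
        encoding="utf-8",
    )
    return chain


# Demo with a stub generator that records how often it is called.
calls = []

def stub_generate(question: str) -> str:
    calls.append(question)
    return "Step 1: 6 × 7 = 42\nAnswer: 42"

with tempfile.TemporaryDirectory() as tmp:
    first = cached_chain("What is 6 × 7?", stub_generate, Path(tmp))
    second = cached_chain("What is 6 × 7?", stub_generate, Path(tmp))

print(len(calls))
```

In real use you would pass a function that calls the model in place of `stub_generate` and point `cache_dir` at a persistent location.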

Watch out: Auto-CoT can propagate errors. If the model reasons incorrectly on a seed question and you use that as an example, you are teaching the model to reason incorrectly on similar problems. Always sanity-check generated reasoning chains on problems where you know the correct answer.
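One way to do that sanity check is to keep only chains whose final answer matches a known value. This is a rough sketch under the assumption that chains end with an "Answer:" line; `extract_answer` and `keep_verified` are hypothetical helpers, and exact string matching is deliberately strict.

```python
from typing import Optional


def extract_answer(chain: str) -> Optional[str]:
    """Return the text after the last line starting with 'Answer:', if any."""
    for line in reversed(chain.splitlines()):
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return None


def keep_verified(chains: dict[str, str], known: dict[str, str]) -> dict[str, str]:
    """Keep only chains whose final answer matches the known answer."""
    return {
        q: chain
        for q, chain in chains.items()
        if q in known and extract_answer(chain) == known[q]
    }


chains = {
    "What is 12 × 12?": "12 × 10 = 120, 12 × 2 = 24, 120 + 24 = 144\nAnswer: 144",
    # This chain contains an arithmetic slip (5% of 80 is 4, not 3).
    "What is 15% of 80?": "10% of 80 is 8, 5% is 3, so 8 + 3 = 11\nAnswer: 11",
}
known = {"What is 12 × 12?": "144", "What is 15% of 80?": "12"}

verified = keep_verified(chains, known)
print(list(verified))
```

Only the first chain survives, so the faulty reasoning never becomes a few-shot example.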


Tree-of-Thought: Branching reasoning for complex problems

Tree-of-Thought (ToT) extends chain-of-thought by exploring multiple reasoning paths simultaneously and selecting the most promising branch. It is useful when the problem has branches (e.g., multi-step planning, code that could be structured multiple ways) and a single linear reasoning chain might get stuck.

The key idea: instead of one chain A → B → C → answer, you generate multiple partial chains (A → B1, A → B2, A → B3), score each partial path, and continue expanding only the promising ones.
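The generate, score, and prune loop can be sketched without a model at all. In the toy version below, `expand` and `score` are plain functions on tuples of numbers; in a real Tree-of-Thought setup each would be a model call, and all names here are illustrative assumptions.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def tree_search(
    root: T,
    expand: Callable[[T], list[T]],
    score: Callable[[T], float],
    beam_width: int = 2,
    depth: int = 3,
) -> T:
    """Expand partial paths level by level, keeping the best `beam_width` each time."""
    frontier = [root]
    for _ in range(depth):
        candidates = [child for node in frontier for child in expand(node)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]   # prune everything else
    return max(frontier, key=score)


# Toy stand-in for a reasoning problem: pick three numbers from {1, 2, 3} that sum to 7.
TARGET = 7

def expand(path: tuple) -> list:
    return [path + (n,) for n in (1, 2, 3)]

def score(path: tuple) -> float:
    return -abs(TARGET - sum(path))

best = tree_search((), expand, score, beam_width=2, depth=3)
print(best, sum(best))
```

The beam width controls cost directly: each level evaluates at most `beam_width` times the branching factor, which is where the extra token spend of ToT comes from.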

def tree_of_thought(problem: str, branches: int = 3) -> str:
    """
    Simplified, single-level Tree-of-Thought: propose several approaches,
    then ask the model to pick one and develop it to a conclusion.
    """
    # Step 1: Generate initial thoughts
    initial_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": (
                    f"Problem: {problem}\n\n"
                    f"Generate {branches} different initial approaches to solving "
                    "this problem. Number each approach. Be concise."
                ),
            }
        ],
    )
    approaches = initial_response.choices[0].message.content
 
    # Step 2: Evaluate and continue the most promising approach
    evaluation_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": (
                    f"Problem: {problem}\n\n"
                    f"Here are {branches} possible approaches:\n{approaches}\n\n"
                    "Which approach is most likely to lead to a correct solution and why? "
                    "Then fully develop that approach to its conclusion, "
                    "showing all reasoning steps."
                ),
            }
        ],
    )
    return evaluation_response.choices[0].message.content
 
 
# Example: Scheduling problem that has multiple valid solutions
result = tree_of_thought(
    problem=(
        "We need to schedule 5 database migrations on a production system. "
        "Migrations 2 and 3 depend on migration 1. Migration 5 depends on "
        "migrations 3 and 4. Migration 4 has no dependencies. "
        "We can run at most 2 migrations in parallel. "
        "Find a schedule that minimizes total time if each migration takes 10 minutes."
    )
)
print(result)
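Because the migration problem has hard constraints, any schedule the model proposes can be checked mechanically. The sketch below is one way to do that; `validate_schedule` is a hypothetical helper, and the proposed schedule is hand-written for illustration rather than actual model output.

```python
def validate_schedule(
    slots: list[set[int]],
    deps: dict[int, set[int]],
    max_parallel: int,
) -> bool:
    """Check that every migration runs once, after its dependencies, within the parallel limit."""
    done: set[int] = set()
    for slot in slots:
        if len(slot) > max_parallel:
            return False
        for m in slot:
            # Unknown migration, repeated migration, or unmet dependency.
            if m not in deps or m in done or not deps[m] <= done:
                return False
        done |= slot
    return done == set(deps)


# Dependencies from the problem statement above.
deps = {1: set(), 2: {1}, 3: {1}, 4: set(), 5: {3, 4}}

proposed = [{1, 4}, {2, 3}, {5}]   # three 10-minute slots
is_valid = validate_schedule(proposed, deps, max_parallel=2)
total_minutes = len(proposed) * 10
print(is_valid, total_minutes)
```

Three slots (30 minutes) is also the minimum, since the chain 1 → 3 → 5 forces three sequential steps regardless of parallelism.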

ToT is overkill for most tasks. Use it when:

  • The solution space has genuinely branching structure (planning, scheduling, combinatorial problems)
  • A wrong early decision cannot be recovered from easily
  • You can afford 3–5x the token cost of standard CoT

When CoT helps most

CoT consistently improves accuracy on:

Multi-step arithmetic and algebra. Any problem requiring more than two calculation steps. Percentages of percentages, unit conversions, compound interest — the model needs to track intermediate values.

Logical and deductive reasoning. Syllogisms, constraint satisfaction, puzzles. "If A is true and B implies not-A, what can we conclude?" Direct answers here are often wrong; working through the premises one at a time is noticeably more reliable.

Code debugging and review. Asking the model to trace through code execution step by step ("What is the value of x after line 3? After line 7?") dramatically improves bug detection.

Planning and scheduling. Any task with dependencies, ordering constraints, or resource limits. The model needs to track state across steps.

Long document Q&A. When the answer requires synthesizing information from multiple sections. "First, find what the document says about X. Then find what it says about Y. Now reconcile these two claims."

Legal and financial analysis. Multi-clause contracts, tax calculations, eligibility checks with multiple conditions. The model needs to work through each condition explicitly.


When CoT hurts

CoT is not always the right choice. There are real cases where it makes things worse or adds cost for no benefit.

Simple factual lookups. "What is the capital of Karnataka?" Adding CoT here wastes tokens and can introduce hallucinated reasoning steps that lead to a wrong answer. The model knows the answer directly.

Classification with clear criteria. "Is this email spam?" If you have a clear system prompt with rules, direct classification is faster and equally accurate.

Latency-sensitive applications. CoT generates 3–10x more tokens than a direct answer on many problems. If you are building a real-time chat interface or a high-throughput batch processor, CoT can make response times unacceptable or cost prohibitive.

Very short, simple prompts on large models. Frontier models like GPT-4o or Claude Opus have enough capacity to answer many problems correctly without explicit CoT. Adding "Let's think step by step" can cause over-elaboration and actually reduce answer conciseness without improving accuracy.

Rule of thumb: Use CoT when the task has more than two logical dependencies, when accuracy matters more than latency, and when you can verify the reasoning chain (either manually or programmatically). Skip it for lookups, simple classification, and anything where you are optimizing for throughput.


Practical code examples using AICredits

Here is a small CoT wrapper you can adapt for an existing application:

from openai import OpenAI
from typing import Literal
 
client = OpenAI(
    base_url="https://api.aicredits.in/v1",
    api_key="sk-your-aicredits-key",
)
 
 
def ask_with_cot(
    question: str,
    mode: Literal["zero_shot", "structured", "off"] = "zero_shot",
    model: str = "gpt-4o-mini",
    system_prompt: str | None = None,
) -> dict:
    """
    Ask a question with optional chain-of-thought reasoning.
 
    Returns a dict with 'reasoning' and 'answer' fields.
    When mode='off', 'reasoning' will be None.
    """
    messages = []
 
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
 
    if mode == "zero_shot":
        content = f"{question}\n\nLet's think step by step."
    elif mode == "structured":
        content = (
            f"{question}\n\n"
            "Please reason through this carefully:\n"
            "1. Identify what is being asked\n"
            "2. Identify the relevant information\n"
            "3. Work through the solution step by step\n"
            "4. State your final answer clearly\n\n"
            "Begin reasoning:"
        )
    else:
        content = question
 
    messages.append({"role": "user", "content": content})
 
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.1 if mode != "off" else 0.7,
    )
 
    full_response = response.choices[0].message.content
    usage = response.usage
 
    # For structured mode, split reasoning from the answer at the LAST mention
    # of "final answer", so an early restatement of the instructions does not
    # cut the reasoning short.
    marker_at = full_response.lower().rfind("final answer")
    if mode == "structured" and marker_at != -1:
        reasoning = full_response[:marker_at]
        answer = full_response[marker_at:]
    else:
        reasoning = full_response if mode != "off" else None
        answer = full_response
 
    return {
        "reasoning": reasoning,
        "answer": answer,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
    }
 
 
# Example usage
result = ask_with_cot(
    question=(
        "A developer at an Indian startup uses the AICredits API. "
        "She makes 500 requests/day, each consuming an average of 800 input tokens "
        "and 400 output tokens, using GPT-4o-mini. "
        "At a cost of $0.15/1M input tokens and $0.60/1M output tokens, "
        "and an exchange rate of ₹86/USD with 5% markup, "
        "what is her approximate monthly bill in INR?"
    ),
    mode="structured",
    model="gpt-4o-mini",
)
 
print("Reasoning:\n", result["reasoning"])
print("\nAnswer:\n", result["answer"])
print(f"\nTokens used: {result['total_tokens']}")
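The question in this usage example has a checkable answer, which makes it a handy test case for the wrapper. Here is the calculation as a short sketch, assuming a 30-day month and the 5% markup applied to the converted total (neither is stated in the question).

```python
# Figures from the question above.
requests_per_day = 500
input_tokens_per_request = 800
output_tokens_per_request = 400
days_per_month = 30                 # assumption, not stated in the question

input_price_usd_per_million = 0.15
output_price_usd_per_million = 0.60
inr_per_usd = 86
markup = 0.05                       # assumed to apply to the converted total

monthly_requests = requests_per_day * days_per_month
input_tokens = monthly_requests * input_tokens_per_request     # 12,000,000
output_tokens = monthly_requests * output_tokens_per_request   # 6,000,000

cost_usd = (
    input_tokens / 1_000_000 * input_price_usd_per_million
    + output_tokens / 1_000_000 * output_price_usd_per_million
)
cost_inr = cost_usd * inr_per_usd * (1 + markup)

print(f"Monthly cost: ${cost_usd:.2f} = ₹{cost_inr:.2f}")
```

Under those assumptions the bill comes to about $5.40, or roughly ₹488 per month, so a model answer far from that figure signals a reasoning error.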

Cost implications and how to manage them via AICredits

CoT is not free. Here is what it actually costs in tokens:

| Task type | Direct answer | Zero-shot CoT | Few-shot CoT (3 examples) |
|---|---|---|---|
| Simple math | ~20 tokens | ~150 tokens | ~500 tokens |
| Multi-step word problem | ~30 tokens | ~250 tokens | ~700 tokens |
| Code debugging | ~100 tokens | ~400 tokens | ~1,000 tokens |
| Planning task | ~80 tokens | ~500 tokens | ~1,200 tokens |

On GPT-4o-mini ($0.15 input / $0.60 output per 1M tokens), an extra 500 output tokens costs roughly $0.0003, which is trivial for a single call. At 100,000 calls/day, that is $30/day extra, or about ₹2,580/day at ₹86/USD.
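To rerun that estimate for your own volumes, a small helper is enough. This is a sketch only; `extra_daily_cost_usd` is a hypothetical function, and the ₹86/USD rate is the one used in the earlier billing example.

```python
def extra_daily_cost_usd(
    extra_output_tokens: int,
    calls_per_day: int,
    output_price_usd_per_million: float,
) -> float:
    """Daily cost of the additional output tokens that CoT generates."""
    return extra_output_tokens * calls_per_day * output_price_usd_per_million / 1_000_000


daily_usd = extra_daily_cost_usd(500, 100_000, 0.60)
daily_inr = daily_usd * 86   # ₹86/USD, an assumed exchange rate
print(f"${daily_usd:.2f}/day, about ₹{daily_inr:,.0f}/day")
```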

Strategies to manage CoT cost on AICredits

1. Use a tiered model strategy. Route CoT-heavy tasks to a cheaper model. GPT-4o-mini and Gemini 2.0 Flash handle CoT well at a fraction of the cost of frontier models. Use the AICredits unified API to switch models without changing any other code:

# For CoT-heavy analytical tasks, use a cost-efficient model
COT_MODEL = "gpt-4o-mini"       # ₹ efficient, good CoT
DIRECT_MODEL = "gpt-4o-mini"    # Same model, no CoT trigger
COMPLEX_MODEL = "claude-sonnet-4-5"  # For problems that need more power
 
def route_request(question: str, complexity: str) -> str:
    if complexity == "simple":
        return ask_with_cot(question, mode="off", model=DIRECT_MODEL)["answer"]
    elif complexity == "medium":
        return ask_with_cot(question, mode="zero_shot", model=COT_MODEL)["answer"]
    else:
        return ask_with_cot(question, mode="structured", model=COMPLEX_MODEL)["answer"]

2. Cache CoT results for repeated questions. AICredits has built-in semantic caching. If your users ask variations of the same analytical question (e.g., "calculate my bill for X requests at Y model"), the cached result will be returned without hitting the LLM at all — zero tokens, full answer.

3. Set per-key budgets. Use the AICredits API key budget feature to cap monthly spend per service or per customer. If a particular workflow starts generating unexpectedly long CoT chains, the budget cap prevents runaway costs.

4. Monitor token usage per workflow. The usage field in every response tells you prompt + completion tokens. Log these and alert if a workflow's average completion token count spikes — it often means CoT is being triggered unnecessarily or the model is over-explaining.
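For the monitoring strategy, a rolling average per workflow is often enough to catch spikes. The sketch below is one possible shape; `TokenMonitor`, its window size, and its alert factor are assumptions to tune, and in practice you would feed it `response.usage.completion_tokens`.

```python
from collections import deque


class TokenMonitor:
    """Track recent completion-token counts for one workflow and flag spikes."""

    def __init__(self, baseline: float, window: int = 50, factor: float = 2.0):
        self.baseline = baseline          # expected average completion tokens
        self.factor = factor              # alert when average exceeds baseline * factor
        self.recent = deque(maxlen=window)

    def record(self, completion_tokens: int) -> bool:
        """Store one observation; return True if the rolling average is too high."""
        self.recent.append(completion_tokens)
        average = sum(self.recent) / len(self.recent)
        return average > self.baseline * self.factor


monitor = TokenMonitor(baseline=250, window=5, factor=2.0)
alerts = [monitor.record(n) for n in (240, 260, 900, 1100, 1200)]
print(alerts)
```

Averaging over a window avoids alerting on a single long response while still reacting within a few calls when a workflow starts over-explaining.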


Common mistakes to avoid

Mistake 1: Truncating the reasoning chain

Some developers strip out the reasoning before storing or displaying the response to reduce storage cost. This is fine for the final output — but do not truncate mid-response. If you cut off the model before it finishes reasoning and before it states its answer, you get neither the full reasoning nor the answer.

If you want to separate reasoning from answers, use structured CoT with a clear delimiter:

# Ask the model to use a clear separator
content = (
    f"{question}\n\n"
    "Think through this carefully, showing your work. "
    "When you have your final answer, write it after the line: "
    "---FINAL ANSWER---"
)

Then split on ---FINAL ANSWER--- in your code.
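The split itself might look like the sketch below. `split_reasoning` is a hypothetical helper; it splits on the last delimiter and returns `None` for the answer when the delimiter is missing, which is what a truncated response looks like.

```python
from typing import Optional, Tuple

DELIMITER = "---FINAL ANSWER---"


def split_reasoning(text: str) -> Tuple[str, Optional[str]]:
    """Split a response into (reasoning, answer) on the last delimiter.

    Returns (text, None) if the delimiter is missing, for example when the
    response was cut off before the model reached its answer.
    """
    reasoning, sep, answer = text.rpartition(DELIMITER)
    if not sep:
        return text.strip(), None
    return reasoning.strip(), answer.strip()


sample = "150 min elapsed, minus 12 min stopped = 138 min.\n---FINAL ANSWER---\n64.3 km/h"
reasoning, answer = split_reasoning(sample)
truncated = split_reasoning("150 min elapsed, minus 12")

print(answer)
print(truncated[1])
```

Treating a missing delimiter as "no answer" lets calling code retry or raise instead of silently storing half a reasoning chain as the result.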

Mistake 2: Using few-shot examples from the wrong domain

Few-shot examples teach the model the reasoning pattern. If your few-shot examples are arithmetic problems but your target question is a logical deduction problem, the model may try to apply an arithmetic reasoning structure to a logic problem — and fail. Match your examples to the structure of the target task, not just the surface topic.

Mistake 3: Too many reasoning steps for simple problems

Adding "Let's think step by step" to "What is 2 + 2?" can cause the model to generate:

First, I need to identify the operands: 2 and 2.
Next, I apply the addition operation...
The result is 4.

Three tokens would have sufficed. Worse, on some models, very long reasoning chains for trivial problems can loop or hallucinate intermediate steps. Build a lightweight classifier or use heuristics (e.g., if the question is under 50 tokens and contains no numbers or logical operators, skip CoT).
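A heuristic along those lines could be sketched as follows. The word list, the whitespace-based token estimate, and the name `should_use_cot` are all assumptions for illustration, not a tested classifier.

```python
import re

# Small, assumed list of words that hint at conditional or deductive structure.
LOGIC_WORDS = {"if", "then", "unless", "implies", "therefore", "either", "neither"}


def should_use_cot(question: str, max_simple_tokens: int = 50) -> bool:
    """Rough heuristic: skip CoT for short questions with no numbers or logic words."""
    words = re.findall(r"[a-z']+", question.lower())
    approx_tokens = len(question.split())   # whitespace words as a crude token proxy
    is_short = approx_tokens < max_simple_tokens
    has_number = bool(re.search(r"\d", question))
    has_logic = any(w in LOGIC_WORDS for w in words)
    return (not is_short) or has_number or has_logic


lookup = should_use_cot("What is the capital of Karnataka?")
word_problem = should_use_cot("A train covers 148 km in 2.5 hours. What is its speed?")
deduction = should_use_cot("If A is true and B implies not-A, what can we conclude?")
print(lookup, word_problem, deduction)
```

Note the limitation: any digit triggers CoT, so "What is 2 + 2?" would still be routed to step-by-step reasoning. Catching trivial arithmetic needs a stricter rule or a learned classifier.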

Mistake 4: Using CoT with temperature > 0.5 for reasoning tasks

High temperature introduces randomness into reasoning steps. The model might reason "Step 1: assume X" where X is false, then correctly derive a wrong conclusion from it. For CoT, use temperature=0.0 to temperature=0.2. Save higher temperatures for creative generation where you want variety, not for structured reasoning where you want correctness.

Mistake 5: Assuming CoT reasoning is always trustworthy

Critically, CoT output shows you the model's stated reasoning — not its actual computational process. A model can write plausible-looking reasoning steps that lead to a wrong answer, or write steps that are internally consistent but based on a false premise. Always validate CoT-generated answers on problems where ground truth is available, especially for financial calculations, medical queries, or legal analysis.
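Validation against ground truth can be automated for numeric answers. The harness below is a minimal sketch: `last_number` and `accuracy` are hypothetical helpers, the "last number in the response" rule is a simplifying assumption, and the canned responses stand in for real model calls.

```python
import re
from typing import Callable, Optional


def last_number(text: str) -> Optional[float]:
    """Return the last number that appears in the text, ignoring thousands commas."""
    matches = re.findall(r"(?<!\d)-?\d[\d,]*\.?\d*", text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def accuracy(
    ask: Callable[[str], str],
    dataset: list[tuple[str, float]],
    tolerance: float = 0.01,
) -> float:
    """Fraction of questions whose final number is within `tolerance` of the truth."""
    correct = 0
    for question, truth in dataset:
        predicted = last_number(ask(question))
        if predicted is not None and abs(predicted - truth) <= tolerance:
            correct += 1
    return correct / len(dataset)


# Canned responses standing in for a model call.
canned = {
    "Train speed?": "Moving time is 2.3 hours, so speed ≈ 64.35 km/h",
    "Monthly orders?": "45,000 × 0.032 = 1,440 orders",
    "ROAS?": "Revenue / spend is roughly 6.2",
}
dataset = [("Train speed?", 64.35), ("Monthly orders?", 1440.0), ("ROAS?", 5.78)]

score = accuracy(lambda q: canned[q], dataset)
print(f"{score:.2f}")
```

Running the same dataset with and without a CoT trigger gives you a direct measurement of whether the extra tokens are buying accuracy on your task.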


Putting it all together

Chain-of-thought prompting is one of the highest-leverage techniques in your prompt engineering toolkit. A single line, "Let's think step by step", has lifted accuracy from around 20% to 70%+ on some multi-step reasoning benchmarks. Few-shot CoT can push accuracy higher still by teaching the model the exact reasoning structure you need.

The practical workflow for Indian developers building on AICredits:

  1. Start with zero-shot CoT for any task involving math, logic, or planning. Measure accuracy on a test set.
  2. If zero-shot CoT is not accurate enough, move to few-shot CoT with 2–3 hand-written examples that match your domain.
  3. For problems with branching structure (scheduling, optimization, multi-path planning), try Tree-of-Thought.
  4. Use AICredits' budget controls and semantic caching to keep CoT costs predictable at scale.
  5. Monitor token usage per workflow and route simple queries to direct answering to avoid unnecessary overhead.

The token cost of CoT is real but manageable. For multi-step tasks, the accuracy improvement is usually worth the 3–10x increase in output tokens, because the alternative is paying for API calls that return wrong answers that your users or systems act on.
