Streaming LLM Responses in Python: The Complete Guide

Server-sent events, async generators, error handling, and UI integration — everything you need to stream LLM responses to your users in real time.

Author: AICredits Team

Published: 5 Apr 2026

Reading time: 8 min read

Why streaming matters for user experience

A non-streaming LLM call makes the user wait in silence until the entire response is ready, which can take 5–30 seconds for long outputs. With streaming, the first text typically arrives within 300–500 ms of the request. The total generation time does not change, but the perceived latency drops from "waiting forever" to "I can see it working".

Streaming is now expected behaviour in any conversational AI interface.

Basic streaming with the OpenAI Python SDK

from openai import OpenAI
 
client = OpenAI(
    base_url="https://api.aicredits.in/v1",
    api_key="sk-your-aicredits-key",
)
 
stream = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    stream=True,
    messages=[{"role": "user", "content": "Explain how JWT tokens work in 3 sentences."}],
)
 
full_response = ""
for chunk in stream:
    if not chunk.choices:
        continue  # some providers send chunks that carry no choices
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # tokens appear as they arrive
        full_response += delta
 
print()  # newline after stream ends
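
To check the latency figures quoted earlier against your own setup, you can time the arrival of each delta. The sketch below is a minimal, hypothetical helper (the names timed_deltas, summarize_latency and fake_stream are not part of any SDK) that works on any iterable of text deltas. It uses a fake stream so it runs without network access.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def timed_deltas(deltas: Iterable[str]) -> Iterator[Tuple[float, str]]:
    """Yield (seconds_since_iteration_started, delta) for each text delta."""
    start = time.perf_counter()
    for delta in deltas:
        yield time.perf_counter() - start, delta


def summarize_latency(deltas: Iterable[str]) -> dict:
    """Consume the stream and report time to first delta and total time."""
    first: Optional[float] = None
    last = 0.0
    parts = []
    for elapsed, delta in timed_deltas(deltas):
        if first is None:
            first = elapsed  # time until the first visible text
        last = elapsed
        parts.append(delta)
    return {"first_token_s": first, "total_s": last, "text": "".join(parts)}


def fake_stream() -> Iterator[str]:
    """Stand-in for a model stream: three deltas with a short pause before each."""
    for piece in ["Hello", ", ", "world"]:
        time.sleep(0.01)
        yield piece


if __name__ == "__main__":
    print(summarize_latency(fake_stream()))
```

The clock starts when iteration begins. To include the request round-trip in the measurement, pass a generator that only issues the API call once it is first iterated.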

Async streaming for FastAPI backends

For web backends, use AsyncOpenAI and return a StreamingResponse:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
 
app = FastAPI()
client = AsyncOpenAI(
    base_url="https://api.aicredits.in/v1",
    api_key="sk-your-aicredits-key",
)
 
@app.get("/chat")
async def chat(prompt: str):
    async def generate():
        stream = await client.chat.completions.create(
            model="openai/gpt-4o-mini",
            stream=True,
            messages=[{"role": "user", "content": prompt}],
        )
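        async for chunk in stream:
            if not chunk.choices:
                continue  # skip chunks that carry no choices
            delta = chunk.choices[0].delta.content
            if delta:
                # An SSE data line cannot contain a raw newline, so send one
                # data: line per line of text; the client rejoins them with "\n".
                yield "".join(f"data: {line}\n" for line in delta.split("\n")) + "\n"
        yield "data: [DONE]\n\n"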
 
    return StreamingResponse(generate(), media_type="text/event-stream")
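
If the endpoint later needs to send more than plain text (an error notice, a completion marker, token counts), one common option is to JSON-encode each event. The sketch below shows that variant under stated assumptions: the sse_json helper and the "type" and "text" field names are illustrative choices, not part of FastAPI or the OpenAI SDK.

```python
import json
from typing import Any


def sse_json(kind: str, **fields: Any) -> str:
    """Serialize one event as a single-line JSON payload inside an SSE frame.

    json.dumps escapes newlines inside strings, so the payload always fits
    on one data: line regardless of what the model generated.
    """
    payload = json.dumps({"type": kind, **fields}, ensure_ascii=False)
    return f"data: {payload}\n\n"


if __name__ == "__main__":
    print(sse_json("delta", text="line one\nline two"), end="")
    print(sse_json("error", message="upstream connection lost"), end="")
    print(sse_json("done"), end="")
```

A client would then parse each payload as JSON and switch on the type field instead of comparing against a sentinel string.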

Consuming streams in the browser

const response = await fetch('/chat?prompt=Explain+streaming', {
  headers: { 'Accept': 'text/event-stream' },
});
 
const reader = response.body.getReader();
const decoder = new TextDecoder();
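let output = '';
let buffer = '';      // partial line carried over between reads
let dataLines = [];   // data: lines of the event currently being read
 
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
 
  // stream: true keeps multi-byte characters intact across chunk boundaries
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop();  // the last piece may be an incomplete line
 
  for (const line of lines) {
    if (line.startsWith('data: ')) {
      dataLines.push(line.slice(6));
    } else if (line === '' && dataLines.length > 0) {
      // A blank line ends the event; multi-line data is rejoined with "\n"
      const data = dataLines.join('\n');
      dataLines = [];
      if (data !== '[DONE]') {
        output += data;
        document.getElementById('output').textContent = output;
      }
    }
  }
}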
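
The same stream can be consumed from Python, which is handy for testing the endpoint without a browser. The sketch below is a simplified parser, not a full implementation of the SSE specification: it assumes "\n" line endings and a "data: " prefix with a single space, as produced by the /chat endpoint in this guide. The function name iter_sse_data is hypothetical. It takes any iterable of raw byte chunks, so it can be fed from whichever HTTP client you use.

```python
import codecs
from typing import Iterable, Iterator, List


def iter_sse_data(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield the data payload of each SSE event from raw byte chunks."""
    # An incremental decoder copes with multi-byte characters split across chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""                  # holds an incomplete trailing line
    data_lines: List[str] = []   # data: lines of the event being read

    for chunk in chunks:
        buffer += decoder.decode(chunk)
        *lines, buffer = buffer.split("\n")  # last piece may be a partial line
        for line in lines:
            if line.startswith("data: "):
                data_lines.append(line[len("data: "):])
            elif line == "" and data_lines:
                # A blank line ends the event; rejoin multi-line data.
                payload = "\n".join(data_lines)
                data_lines = []
                if payload == "[DONE]":
                    return
                yield payload


if __name__ == "__main__":
    raw = "data: héllo\n\ndata: wor".encode("utf-8")
    # The first chunk ends in the middle of the two-byte character "é".
    chunks = [raw[:8], raw[8:], b"ld\n\ndata: [DONE]\n\n"]
    print(list(iter_sse_data(chunks)))
```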

Error handling in streams

Streaming errors are tricky because the HTTP response starts with status 200 before the error occurs mid-stream. Always wrap stream iteration in a try/except:

from openai import APIStatusError, APIConnectionError
 
def stream_reply(client, prompt):
    """Yield text deltas; on a stream error, yield a notice instead of raising."""
    full_response = ""
    try:
        stream = client.chat.completions.create(
            model="openai/gpt-4o-mini",
            stream=True,
            messages=[{"role": "user", "content": prompt}],
        )
        for chunk in stream:
            if not chunk.choices:
                continue  # skip chunks that carry no choices
            delta = chunk.choices[0].delta.content
            if delta:
                full_response += delta
                yield delta
    except (APIStatusError, APIConnectionError) as e:
        # Partial response may be usable: log and surface to user
        print(f"Stream error after {len(full_response)} chars: {e}")
        yield "\n\n[Response interrupted. Please try again.]"
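
One design question the pattern above leaves open is when a retry is safe. A reasonable rule, sketched below as an assumption rather than a recommendation from any SDK, is to retry only if the failure happened before any text reached the user, since restarting after partial output would duplicate it. The names stream_with_retry and make_flaky_stream are hypothetical, and the built-in ConnectionError stands in for whichever exceptions you choose to treat as retryable (for example APIConnectionError).

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, Type


def stream_with_retry(
    open_stream: Callable[[], Iterable[str]],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
) -> Iterator[str]:
    """Yield deltas from open_stream(), retrying only before any output."""
    for attempt in range(1, max_attempts + 1):
        emitted = False
        try:
            for delta in open_stream():
                emitted = True
                yield delta
            return  # stream finished normally
        except retry_on:
            if emitted or attempt == max_attempts:
                # Text is already on screen, or retries are exhausted:
                # tell the user instead of restarting.
                yield "\n\n[Response interrupted. Please try again.]"
                return
            time.sleep(base_delay_s * 2 ** (attempt - 1))  # exponential backoff


def make_flaky_stream(failures: int) -> Callable[[], Iterator[str]]:
    """Test double: fails `failures` times before any output, then succeeds."""
    state = {"left": failures}

    def open_stream() -> Iterator[str]:
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("simulated drop")
        yield "ok"

    return open_stream


if __name__ == "__main__":
    print(list(stream_with_retry(make_flaky_stream(1), base_delay_s=0)))
```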

Token counting for streamed responses

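Pass stream_options={"include_usage": True} to receive token counts in one extra chunk at the end of the stream. With the OpenAI API, that chunk carries usage but an empty choices list, so check chunk.choices before indexing it:

stream = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    stream=True,
    stream_options={"include_usage": True},
    messages=[{"role": "user", "content": prompt}],
)
 
usage = None
for chunk in stream:
    if chunk.usage:
        usage = chunk.usage  # only set on the final chunk
    if not chunk.choices:
        continue  # the usage chunk has no choices to read
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
 
if usage:
    # Example rates in ₹ per 1M tokens; substitute the current rates for your model
    input_rate, output_rate = 14, 58
    cost_inr = (usage.prompt_tokens / 1_000_000 * input_rate) + (usage.completion_tokens / 1_000_000 * output_rate)
    print(f"\nTokens: {usage.prompt_tokens} in + {usage.completion_tokens} out | Cost: ₹{cost_inr:.5f}")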
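
For a chat session with several streamed requests, the per-request calculation can be accumulated into a running total. The sketch below is one minimal way to do that. The UsageTally class is hypothetical, and the rates passed to it are the example figures used in this section, not authoritative pricing.

```python
from dataclasses import dataclass


@dataclass
class UsageTally:
    """Running token and cost total across several requests."""

    input_rate_inr: float    # ₹ per 1M prompt tokens (assumed example rate)
    output_rate_inr: float   # ₹ per 1M completion tokens (assumed example rate)
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def add(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Record the usage reported by one finished stream."""
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens

    @property
    def cost_inr(self) -> float:
        """Total cost so far, using the same per-million formula as above."""
        return (
            self.prompt_tokens / 1_000_000 * self.input_rate_inr
            + self.completion_tokens / 1_000_000 * self.output_rate_inr
        )


if __name__ == "__main__":
    tally = UsageTally(input_rate_inr=14, output_rate_inr=58)
    tally.add(prompt_tokens=120, completion_tokens=350)
    tally.add(prompt_tokens=480, completion_tokens=900)
    print(f"{tally.prompt_tokens} in + {tally.completion_tokens} out | ₹{tally.cost_inr:.5f}")
```

After each stream ends, call tally.add(usage.prompt_tokens, usage.completion_tokens) with the usage object captured from the final chunk.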
