
Streaming LLM Responses in Python: The Complete Guide
Server-sent events, async generators, error handling, and UI integration — everything you need to stream LLM responses to your users in real time.
Author: AICredits Team
Published: 5 Apr 2026
Reading time: 8 min read

Why streaming matters for user experience
A non-streaming LLM call makes the user wait in silence for the entire response before any text appears, which can take 5–30 seconds for long outputs. Streaming typically starts delivering text within 300–500 ms of the request, so the perceived latency drops from "waiting forever" to "I can see it working".
Streaming is now expected behaviour in any conversational AI interface.
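
The gap between total latency and time to first chunk can be measured directly. The sketch below is a minimal illustration, not a benchmark: `measure_stream` and `slow_chunks` are hypothetical names, and `slow_chunks` only simulates a model with `time.sleep` instead of calling an API.

```python
import time
from typing import Iterable, Iterator, Tuple


def slow_chunks(n: int = 5, delay: float = 0.01) -> Iterator[str]:
    """Stand-in for a streaming response: yields n chunks, one every `delay` seconds."""
    for i in range(n):
        time.sleep(delay)
        yield f"chunk{i} "


def measure_stream(chunks: Iterable[str]) -> Tuple[float, float, str]:
    """Return (time_to_first_chunk, total_time, full_text) for any iterable of text chunks."""
    start = time.perf_counter()
    first = None
    parts = []
    for chunk in chunks:
        if first is None:
            first = time.perf_counter() - start  # moment the user first sees text
        parts.append(chunk)
    total = time.perf_counter() - start
    return (first if first is not None else total), total, "".join(parts)


ttft, total, text = measure_stream(slow_chunks())
print(f"first chunk after {ttft:.3f}s, complete after {total:.3f}s")
```

Passing a real stream's text deltas to `measure_stream` in place of `slow_chunks()` would give the same two numbers for an actual request.
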
Basic streaming with the OpenAI Python SDK

    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.aicredits.in/v1",
        api_key="sk-your-aicredits-key",
    )

    stream = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        stream=True,
        messages=[{"role": "user", "content": "Explain how JWT tokens work in 3 sentences."}],
    )

    full_response = ""
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # tokens appear as they arrive
            full_response += delta
    print()  # newline after stream ends

Async streaming for FastAPI backends
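
The endpoint in this section is built on `async for` over an asynchronous stream. That pattern can be tried without a network call. In the sketch below, `fake_stream` and `collect` are hypothetical names, and `fake_stream` is a stand-in for the SDK's stream object, not part of any library.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(tokens: List[str], delay: float = 0.0) -> AsyncIterator[str]:
    """Stand-in for an async LLM stream: yields tokens one at a time."""
    for token in tokens:
        await asyncio.sleep(delay)  # hands control back to the event loop
        yield token


async def collect(stream: AsyncIterator[str]) -> str:
    """Consume an async stream chunk by chunk and return the joined text."""
    parts = []
    async for token in stream:
        parts.append(token)
    return "".join(parts)


result = asyncio.run(collect(fake_stream(["Hel", "lo", ", ", "world"])))
print(result)
```

Because each `await` yields to the event loop, one worker can serve many such streams concurrently, which is the main reason to prefer the async client in a web backend.
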
For web backends, use AsyncOpenAI and return a StreamingResponse:

    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse
    from openai import AsyncOpenAI

    app = FastAPI()
    client = AsyncOpenAI(
        base_url="https://api.aicredits.in/v1",
        api_key="sk-your-aicredits-key",
    )

    @app.get("/chat")
    async def chat(prompt: str):
        async def generate():
            stream = await client.chat.completions.create(
                model="openai/gpt-4o-mini",
                stream=True,
                messages=[{"role": "user", "content": prompt}],
            )
            async for chunk in stream:
                if not chunk.choices:  # some chunks carry no choices
                    continue
                delta = chunk.choices[0].delta.content
                if delta:
                    # An SSE data line cannot contain a raw newline, so send
                    # one "data:" line per line of text, then a blank line.
                    yield "".join(f"data: {line}\n" for line in delta.split("\n")) + "\n"
            yield "data: [DONE]\n\n"

        return StreamingResponse(generate(), media_type="text/event-stream")

Consuming streams in the browser
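
An SSE consumer has to cope with events that arrive split across network chunks and with payloads that span several `data:` lines. That parsing logic is sketched here in Python so it can be unit-tested or used to exercise the endpoint from a script. `iter_sse_data` is a hypothetical helper name, and the sketch is simplified: it assumes `\n` line endings and a space after `data:`, which the full SSE format does not require.

```python
import codecs
from typing import Iterable, Iterator


def iter_sse_data(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield the data payload of each complete SSE event in a chunked byte stream."""
    # The incremental decoder tolerates multi-byte characters split across chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in chunks:
        buffer += decoder.decode(chunk)
        # A blank line ends an event; the last piece may be incomplete, so keep it.
        *events, buffer = buffer.split("\n\n")
        for event in events:
            data_lines = [line[6:] for line in event.split("\n") if line.startswith("data: ")]
            if data_lines:
                yield "\n".join(data_lines)


# Both an event and the two-byte character "ö" are split across chunk boundaries.
raw = "data: Hello\n\ndata: wörld\n\ndata: [DONE]\n\n".encode("utf-8")
chunks = [raw[:9], raw[9:21], raw[21:]]
payloads = list(iter_sse_data(chunks))
print(payloads)
```

In the browser, the stream is read with `fetch` and a stream reader:
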
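    const response = await fetch('/chat?prompt=Explain+streaming', {
      headers: { 'Accept': 'text/event-stream' },
    });
    if (!response.ok) throw new Error(`Request failed: ${response.status}`);

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let buffer = '';
    let output = '';

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true keeps multi-byte characters intact across chunk boundaries
      buffer += decoder.decode(value, { stream: true });
      // A blank line ends an event; keep any incomplete event in the buffer
      const events = buffer.split('\n\n');
      buffer = events.pop();
      for (const event of events) {
        const data = event
          .split('\n')
          .filter((line) => line.startsWith('data: '))
          .map((line) => line.slice(6))
          .join('\n');
        if (data === '[DONE]') continue;
        output += data;
        document.getElementById('output').textContent = output;
      }
    }

Error handling in streams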
Streaming errors are tricky because the HTTP response starts with status 200 before the error occurs mid-stream. Always wrap stream iteration in a try/except:

    from openai import APIConnectionError, APIStatusError

    def stream_completion(client, prompt):
        """Yield text deltas; on a mid-stream failure, yield a notice instead of raising."""
        full_response = ""
        try:
            stream = client.chat.completions.create(
                model="openai/gpt-4o-mini",
                stream=True,
                messages=[{"role": "user", "content": prompt}],
            )
            for chunk in stream:
                delta = chunk.choices[0].delta.content
                if delta:
                    full_response += delta
                    yield delta
        except (APIStatusError, APIConnectionError) as e:
            # Partial response may be usable: log and surface to user
            print(f"Stream error after {len(full_response)} chars: {e}")
            yield "\n\n[Response interrupted. Please try again.]"

Token counting for streamed responses
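Pass stream_options={"include_usage": True} to receive token counts in the final chunk. That final chunk can arrive with an empty choices list, so check for it before reading the delta: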

    stream = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        stream=True,
        stream_options={"include_usage": True},
        messages=[{"role": "user", "content": prompt}],
    )

    usage = None
    for chunk in stream:
        if chunk.usage:
            usage = chunk.usage  # only set on the final chunk
        if not chunk.choices:  # the usage chunk may carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)

    if usage:
        # Rates are in ₹ per million tokens: 14 for input, 58 for output
        cost_inr = (usage.prompt_tokens / 1_000_000 * 14) + (usage.completion_tokens / 1_000_000 * 58)
        print(f"\nTokens: {usage.prompt_tokens} in + {usage.completion_tokens} out | Cost: ₹{cost_inr:.5f}")
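
The cost arithmetic in this section can be factored into a small helper. This is a sketch: `stream_cost_inr` is a hypothetical name, and the default rates (₹14 and ₹58 per million input and output tokens) are simply the figures used in the example above, not authoritative pricing.

```python
def stream_cost_inr(
    prompt_tokens: int,
    completion_tokens: int,
    input_rate: float = 14.0,   # assumed ₹ per million input tokens
    output_rate: float = 58.0,  # assumed ₹ per million output tokens
) -> float:
    """Return the cost in rupees for one request, with rates given per million tokens."""
    return (prompt_tokens * input_rate + completion_tokens * output_rate) / 1_000_000


# 1,200 input tokens and 350 output tokens:
# (1200 * 14 + 350 * 58) / 1,000,000 = 37,100 / 1,000,000 = ₹0.0371
cost = stream_cost_inr(prompt_tokens=1_200, completion_tokens=350)
print(f"₹{cost:.5f}")
```

Keeping the rates as parameters means the same function works when the model or its pricing changes.
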