
Production Observability for AI Gateways
What to monitor in a unified AI gateway: latency, provider errors, fallback rates, token drift, and wallet burn.
Author: Reliability Team
Published: 17 Jan 2026
Reading time: 8 min read
Why AI gateway observability is different
Traditional API observability focuses on latency, error rate, and throughput. AI gateways add a cost dimension: every request burns tokens, and token burn rate varies by model, prompt length, and response verbosity. An AI gateway that is "healthy" by traditional metrics can still be quietly exhausting your budget.
Good observability for an AI gateway tracks both reliability signals (is it working?) and economic signals (is it spending correctly?).
Golden signals
Track p50/p95 latency per provider-model pair, non-retryable error rates, and fallback activation frequency. Add budget burn velocity by API key.
These signals make routing incidents visible before customers notice degradation.
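As a rough illustration of the per-pair latency signal, the sketch below groups request records by provider and model and computes p50 and p95 with Python's standard library. The record fields (`provider`, `model`, `latency_ms`) and the function name are assumptions made for this example, not a documented log schema.

```python
from collections import defaultdict
from statistics import quantiles


def latency_percentiles(records):
    """Return {(provider, model): {"p50": ..., "p95": ...}} for request records.

    Each record is assumed to be a dict with hypothetical keys
    "provider", "model", and "latency_ms".
    """
    by_pair = defaultdict(list)
    for record in records:
        pair = (record["provider"], record["model"])
        by_pair[pair].append(record["latency_ms"])

    result = {}
    for pair, values in by_pair.items():
        if len(values) < 2:
            # statistics.quantiles needs at least two data points
            continue
        # n=100 yields 99 cut points; index 49 is p50, index 94 is p95
        cuts = quantiles(values, n=100, method="inclusive")
        result[pair] = {"p50": cuts[49], "p95": cuts[94]}
    return result
```

Keeping the key as a (provider, model) tuple means a spike on one pair stays visible instead of being averaged away across the whole gateway.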
The five metrics that matter most
1. Provider latency by model (p50, p95, p99): LLM latency varies enormously. A fast GPT-4o Mini call might be 400ms while a slow Claude Sonnet response could be 8s. Track these separately per provider and model. A p95 spike on one model while others are stable points to that provider having issues.
2. Fallback activation rate: What percentage of requests triggered a fallback to an alternative provider? A rate above 5% in a 5-minute window is a signal worth investigating. Above 20% means your primary provider is severely degraded and you may need to manually adjust routing.
3. Token drift: Monitor the ratio of output tokens to input tokens over time. A sudden increase in this ratio means your model is generating longer responses: either your prompts changed or the model behaviour changed. Both can significantly impact cost without triggering any error.
4. Wallet burn velocity: How fast is your INR balance decreasing? Set an alert if the 1-hour burn rate exceeds 2× the 7-day average. This catches runaway loops, misconfigured prompts, and unexpected traffic spikes before they exhaust your balance.
5. Per-key cost concentration: What percentage of your total spend comes from a single API key? If one key accounts for more than 60% of spend unexpectedly, investigate that key's usage pattern. It may indicate a feature that is over-calling the API or a staging environment with no budget limit.
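The economic metrics above reduce to simple arithmetic. The following is a minimal sketch under stated assumptions: all function names are hypothetical, and the "7-day average" is interpreted here as the mean hourly spend over the last 7 days, which is one reasonable reading rather than a confirmed definition.

```python
def token_drift(output_tokens, input_tokens, baseline_ratio):
    """Relative change of the output/input token ratio versus a baseline ratio."""
    ratio = output_tokens / input_tokens
    return (ratio - baseline_ratio) / baseline_ratio


def burn_alert(spend_last_hour, spend_last_7_days, multiplier=2.0):
    """True if the last hour's spend exceeds `multiplier` times the 7-day hourly mean."""
    avg_hourly = spend_last_7_days / (7 * 24)
    return spend_last_hour > multiplier * avg_hourly


def top_key_share(spend_by_key):
    """Return (key, share) for the API key with the largest share of total spend."""
    total = sum(spend_by_key.values())
    if total == 0:
        return None, 0.0
    key = max(spend_by_key, key=spend_by_key.get)
    return key, spend_by_key[key] / total
```

For example, `top_key_share({"prod": 70, "staging": 30})` reports that `prod` carries a 0.7 share, which would cross the 60% line mentioned in the list and warrant a look at that key's usage.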
Alerting model
Set separate alerts for availability, latency, and cost anomalies. One blended alert tends to hide root cause and increases mean time to recovery.
Recommended alert thresholds for production:
| Signal | Warning | Critical |
|--------|---------|----------|
| Provider error rate | > 2% / 5min | > 10% / 5min |
| Fallback activation rate | > 5% / 5min | > 25% / 5min |
| p95 latency delta from baseline | +50% | +200% |
| Wallet burn rate vs 7-day avg | 2× | 5× |
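One way to keep these alerts separate per signal is a small lookup of warning and critical thresholds. This sketch encodes the recommended values as fractions and multiples; the signal names are hypothetical, and treating every threshold as "strictly greater than" is an assumption for the latency and burn-rate rows, where the recommendations do not state the comparison.

```python
# (warning, critical) thresholds; rates and deltas as fractions, burn as a multiple
THRESHOLDS = {
    "provider_error_rate": (0.02, 0.10),  # over a 5-minute window
    "fallback_rate": (0.05, 0.25),        # over a 5-minute window
    "p95_latency_delta": (0.50, 2.00),    # relative increase over baseline
    "burn_rate_multiple": (2.0, 5.0),     # current burn / 7-day average
}


def classify(signal, value):
    """Return "ok", "warning", or "critical" for one signal's current value."""
    warning, critical = THRESHOLDS[signal]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"
```

Evaluating each signal on its own, rather than folding them into one score, keeps the alert name tied to a likely root cause.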
Using AICredits dashboard for observability
AICredits logs every request with model, provider, tokens, cost, and latency. The usage dashboard gives you per-request data exportable as CSV. For real-time alerting, set budget limits per API key — when a key hits its limit, requests fail immediately and you get a clear signal to investigate.
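If you work from the CSV export, per-key spend can be totalled offline with a few lines. The column names used here (`api_key`, `cost`) are placeholders for illustration only. Check the header row of your actual export and adjust them.

```python
import csv
import io
from collections import defaultdict


def cost_by_key(csv_text, key_column="api_key", cost_column="cost"):
    """Sum the cost column per API key from CSV text (column names are assumed)."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[key_column]] += float(row[cost_column])
    return dict(totals)


# Made-up sample data, for illustration only
sample = "api_key,cost\nkey_a,1.5\nkey_b,1.0\nkey_a,2.0\n"
totals = cost_by_key(sample)
```

The resulting dictionary can feed a per-key concentration check or a simple daily report.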