
Production Observability for AI Gateways

What to monitor in a unified AI gateway: latency, provider errors, fallback rates, token drift, and wallet burn.

Author

Reliability Team

Published

17 Jan 2026

Reading time

8 min read

Why AI gateway observability is different

Traditional API observability focuses on latency, error rate, and throughput. AI gateways add a cost dimension: every request burns tokens, and token burn rate varies by model, prompt length, and response verbosity. An AI gateway that is "healthy" by traditional metrics can still be quietly exhausting your budget.

Good observability for an AI gateway tracks both reliability signals (is it working?) and economic signals (is it spending correctly?).

Golden signals

Track p50/p95 latency per provider-model pair, non-retryable error rates, and fallback activation frequency. Add budget burn velocity by API key.

These signals make routing incidents visible before customers notice degradation.
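As a rough illustration, the latency and fallback signals above could be computed from request logs along these lines. This is a minimal sketch, not a definitive implementation: the `RequestRecord` shape and its field names are assumptions made for the example, not a documented log format.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestRecord:
    # Hypothetical log record; field names are assumptions for this sketch.
    provider: str
    model: str
    latency_ms: float
    used_fallback: bool


def latency_percentiles(records):
    """Return {(provider, model): {"p50": ..., "p95": ...}} in milliseconds."""
    by_pair = defaultdict(list)
    for r in records:
        by_pair[(r.provider, r.model)].append(r.latency_ms)

    result = {}
    for pair, values in by_pair.items():
        if len(values) < 2:
            # statistics.quantiles needs at least two data points.
            result[pair] = {"p50": values[0], "p95": values[0]}
            continue
        # n=100 yields 99 cut points; index 49 is p50 and index 94 is p95.
        cuts = quantiles(values, n=100, method="inclusive")
        result[pair] = {"p50": cuts[49], "p95": cuts[94]}
    return result


def fallback_rate(records):
    """Fraction of requests in the window that were served by a fallback."""
    if not records:
        return 0.0
    return sum(r.used_fallback for r in records) / len(records)
```

Grouping by the provider-model pair rather than by provider alone keeps a slow model from being averaged away by faster ones on the same provider.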

The five metrics that matter most

1. Provider latency by model (p50, p95, p99)
LLM latency varies enormously: a fast GPT-4o Mini call might take 400 ms while a slow Claude Sonnet response could take 8 s. Track these separately per provider and model. A p95 spike on one model while the others stay stable points to an issue with that provider.

2. Fallback activation rate
What percentage of requests triggered a fallback to an alternative provider? A rate above 5% in a 5-minute window is a signal worth investigating. Above 20% means your primary provider is severely degraded and you may need to adjust routing manually.

3. Token drift
Monitor the ratio of output tokens to input tokens over time. A sudden increase in this ratio means your model is generating longer responses: either your prompts changed or the model's behaviour changed. Both can significantly impact cost without triggering any error.

4. Wallet burn velocity
How fast is your INR balance decreasing? Set an alert if the 1-hour burn rate exceeds 2× the 7-day average. This catches runaway loops, misconfigured prompts, and unexpected traffic spikes before they exhaust your balance.

5. Per-key cost concentration
What percentage of your total spend comes from a single API key? If one key unexpectedly accounts for more than 60% of spend, investigate that key's usage pattern. It may indicate a feature that is over-calling the API or a staging environment with no budget limit.
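The three economic metrics in this list (token drift, wallet burn velocity, and per-key cost concentration) reduce to simple arithmetic over aggregated usage data. The sketch below shows one way to express them. All function names are hypothetical, and it assumes that "7-day average" means the average hourly spend over the last 7 days.

```python
def output_input_ratio(input_tokens, output_tokens):
    """Output tokens per input token for one time window."""
    return output_tokens / input_tokens if input_tokens else 0.0


def token_drift(baseline_ratio, current_ratio):
    """Relative change in the output/input ratio against a baseline.

    A result of 0.5 means 50% more output tokens per input token.
    """
    if baseline_ratio == 0:
        return 0.0
    return (current_ratio - baseline_ratio) / baseline_ratio


def burn_alert(last_hour_spend, last_7d_spend, multiple=2.0):
    """True if the last hour's spend exceeds `multiple` x the 7-day hourly average.

    Assumption: the 7-day average is total 7-day spend divided by 168 hours.
    """
    avg_hourly = last_7d_spend / (7 * 24)
    return avg_hourly > 0 and last_hour_spend > multiple * avg_hourly


def top_key_share(cost_by_key):
    """Return (key, share) for the API key with the largest share of spend."""
    total = sum(cost_by_key.values())
    if total == 0:
        return None, 0.0
    key = max(cost_by_key, key=cost_by_key.get)
    return key, cost_by_key[key] / total
```

For example, `top_key_share(costs)[1] > 0.6` would flag the 60% concentration case described in item 5, where `costs` is a hypothetical mapping from API key to spend.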

Alerting model

Set separate alerts for availability, latency, and cost anomalies. One blended alert tends to hide root cause and increases mean time to recovery.

Recommended alert thresholds for production:

| Signal | Warning | Critical |
|--------|---------|----------|
| Provider error rate | > 2% / 5min | > 10% / 5min |
| Fallback activation rate | > 5% / 5min | > 25% / 5min |
| p95 latency delta from baseline | +50% | +200% |
| Wallet burn rate vs 7-day avg | 2× | 5× |
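The thresholds in the table can be encoded as data so that each signal is evaluated on its own, in line with the advice to keep alerts separate. This is a minimal sketch: the signal keys and the `classify` function are hypothetical names, rates are expressed as fractions, and it assumes every threshold is a strict "greater than" comparison.

```python
# (warning, critical) thresholds taken from the table above.
# Rates and deltas are fractions: 0.02 means 2%, 2.00 means +200%.
THRESHOLDS = {
    "provider_error_rate": (0.02, 0.10),  # measured over a 5-minute window
    "fallback_rate": (0.05, 0.25),        # measured over a 5-minute window
    "p95_latency_delta": (0.50, 2.00),    # relative increase over baseline
    "burn_rate_multiple": (2.0, 5.0),     # multiple of the 7-day average
}


def classify(signal, value):
    """Return "ok", "warning", or "critical" for a single signal."""
    warning, critical = THRESHOLDS[signal]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"
```

Keeping one entry per signal means an alert names the signal that fired, which preserves the root-cause information that a blended alert would hide.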

Using the AICredits dashboard for observability

AICredits logs every request with model, provider, tokens, cost, and latency. The usage dashboard gives you per-request data exportable as CSV. For real-time alerting, set budget limits per API key: when a key hits its limit, requests fail immediately and you get a clear signal to investigate.
