Traditional application monitoring (CPU, memory, request latency, error rate) is necessary but not sufficient for LLM apps. A model can return a 200 OK in 800ms while producing completely wrong output. AI observability means monitoring what the model actually does, not just whether it responded.
The 5 key metrics for LLM production monitoring
Output quality score
Automated evaluation of response quality using LLM-as-judge or golden dataset comparison. The most important metric, and the hardest to get right.
Time-to-first-token (TTFT)
User-perceived latency for streaming responses. Track p50, p95, and p99: tail latency on LLMs is much worse than on traditional APIs.
Cost per request
(Input tokens × input price) + (cached tokens × cached price) + (output tokens × output price), broken down by model, endpoint, and user segment. Catches cost regressions before they hit your bill.
Task completion rate
For agentic workflows: % of tasks completed without human intervention or fallback. Drops indicate model degradation or prompt drift.
Retry and fallback rate
% of requests that hit an error, exceeded the context window, or triggered a fallback to a backup model. A rising rate means something is wrong upstream.
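The cost-per-request formula above can be sketched as a simple lookup-and-multiply. The model name and per-million-token prices below are illustrative placeholders, not any provider's actual rates:

```python
# USD per 1M tokens; values are made up for illustration only.
PRICES = {
    "fast-model": {"input": 0.15, "cached_input": 0.075, "output": 0.60},
}

def request_cost(model: str, input_tokens: int, cached_tokens: int,
                 output_tokens: int) -> float:
    """Compute USD cost for one request from its token counts."""
    p = PRICES[model]
    return (
        input_tokens * p["input"]
        + cached_tokens * p["cached_input"]
        + output_tokens * p["output"]
    ) / 1_000_000

cost = request_cost("fast-model", 1200, 800, 350)  # 0.00045 USD
```

Tag each computed cost with model, endpoint, and user-segment labels when you emit it as a metric, so regressions can be sliced the same way you bill.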
Distributed tracing for agents
A single user request to an AI agent may involve dozens of LLM calls, tool invocations, and database queries. Without distributed tracing, debugging failures is nearly impossible. Every span should capture: model, prompt hash, token counts, latency, cost, and output quality score.
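A minimal, stdlib-only sketch of capturing those per-call span attributes; in production you would set these as attributes on an OpenTelemetry span rather than build a dict. The `llm_span` helper and its field names are assumptions for illustration:

```python
import hashlib
import time

def llm_span(model: str, prompt: str, call_fn) -> dict:
    """Wrap one LLM call and record the span fields listed above.

    `call_fn` is any callable that takes the prompt and returns a dict
    with token counts and cost (a hypothetical client interface).
    """
    start = time.monotonic()
    result = call_fn(prompt)
    return {
        "model": model,
        # Hash, not raw text: avoids logging sensitive prompt content.
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest()[:16],
        "input_tokens": result["input_tokens"],
        "output_tokens": result["output_tokens"],
        "latency_ms": (time.monotonic() - start) * 1000,
        "cost_usd": result["cost_usd"],
        "quality_score": None,  # filled in later by an async evaluator
    }
```

Storing a prompt hash instead of the prompt itself keeps traces joinable across calls (same prompt, same hash) without leaking user data into your tracing backend.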
Detecting prompt drift
Prompt drift happens when model behavior changes without any code change, caused by model updates, input distribution shifts, or gradual prompt degradation. Detect it by:
- Golden dataset regression tests: Run a fixed set of test cases on every deployment. Alert if quality score drops more than 5%.
- Output distribution monitoring: Track distributions of output length, format compliance, and sentiment. Sudden shifts indicate drift.
- User feedback signals: Thumbs up/down, edit rates, and re-ask rates are the ground truth signal for quality degradation.
- A/B evaluation: When rolling out a new prompt version, shadow-test against the old version and compare quality scores before full rollout.
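The golden-dataset regression check from the first bullet reduces to comparing mean quality scores against a relative threshold. A minimal sketch, assuming scores are already computed per test case (the function name and 5% default are illustrative):

```python
def regression_check(baseline_scores: list[float],
                     current_scores: list[float],
                     threshold: float = 0.05) -> dict:
    """Alert if mean quality drops more than `threshold` (5%) vs. baseline."""
    baseline = sum(baseline_scores) / len(baseline_scores)
    current = sum(current_scores) / len(current_scores)
    relative_drop = (baseline - current) / baseline
    return {
        "baseline": baseline,
        "current": current,
        "drop": relative_drop,
        "alert": relative_drop > threshold,
    }

# A drop from 0.90 to 0.80 mean quality is ~11%, well past the 5% gate.
result = regression_check([0.9, 0.9, 0.9], [0.8, 0.8, 0.8])
```

Run this on every deployment against the same fixed test set; a per-case diff (which specific golden cases regressed) is usually more actionable than the aggregate alone.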
Observability stack recommendations
- Tracing: OpenTelemetry + MoltBot SDK (automatic LLM span instrumentation)
- Metrics: Prometheus + Grafana or MoltBot built-in dashboards
- Quality evaluation: MoltBot eval suite (LLM-as-judge + golden dataset)
- Alerts: PagerDuty or OpsGenie on quality score drops, cost spikes, and error rate thresholds
Full AI observability built into MoltBot
Automatic tracing, cost per request, quality scoring, drift alerts. Zero config. 14-day free trial.
Start Free Trial →