Traditional application monitoring (CPU, memory, request latency, error rate) is necessary but not sufficient for LLM apps. A model can return a 200 OK in 800ms while producing completely wrong output. AI observability means monitoring what the model actually does, not just whether it responded.
The 5 key metrics for LLM production monitoring
Output quality score
Automated evaluation of response quality using LLM-as-judge or golden dataset comparison. The most important metric, and the hardest to get right.
Time-to-first-token (TTFT)
User-perceived latency for streaming responses. Track p50, p95, and p99: tail latency on LLMs is much worse than on traditional APIs.
Cost per request
(Input tokens × input price) + (cached tokens × cached price) + (output tokens × output price), broken down by model, endpoint, and user segment. Catches cost regressions before they hit your bill.
Task completion rate
For agentic workflows: % of tasks completed without human intervention or fallback. Drops indicate model degradation or prompt drift.
Retry and fallback rate
% of requests that hit an error, exceeded the context window, or triggered a fallback to a backup model. A rising rate means something is wrong upstream.
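The cost-per-request formula above can be sketched as a simple lookup-and-multiply. The model name and per-million-token prices below are illustrative placeholders, not any provider's actual rates:

```python
# USD per 1M tokens; values are made up for illustration only.
PRICES = {
    "fast-model": {"input": 0.15, "cached_input": 0.075, "output": 0.60},
}

def request_cost(model: str, input_tokens: int, cached_tokens: int,
                 output_tokens: int) -> float:
    """Compute USD cost for one request from its token counts."""
    p = PRICES[model]
    return (
        input_tokens * p["input"]
        + cached_tokens * p["cached_input"]
        + output_tokens * p["output"]
    ) / 1_000_000

cost = request_cost("fast-model", 1200, 800, 350)  # 0.00045 USD
```

Tag each computed cost with model, endpoint, and user-segment labels when you emit it as a metric, so regressions can be sliced the same way you bill.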
Distributed tracing for agents
A single user request to an AI agent may involve dozens of LLM calls, tool invocations, and database queries. Without distributed tracing, debugging failures is nearly impossible. Every span should capture: model, prompt hash, token counts, latency, cost, and output quality score.
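A minimal, stdlib-only sketch of capturing those per-call span attributes; in production you would set these as attributes on an OpenTelemetry span rather than build a dict. The `llm_span` helper and its field names are assumptions for illustration:

```python
import hashlib
import time

def llm_span(model: str, prompt: str, call_fn) -> dict:
    """Wrap one LLM call and record the span fields listed above.

    `call_fn` is any callable that takes the prompt and returns a dict
    with token counts and cost (a hypothetical client interface).
    """
    start = time.monotonic()
    result = call_fn(prompt)
    return {
        "model": model,
        # Hash, not raw text: avoids logging sensitive prompt content.
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest()[:16],
        "input_tokens": result["input_tokens"],
        "output_tokens": result["output_tokens"],
        "latency_ms": (time.monotonic() - start) * 1000,
        "cost_usd": result["cost_usd"],
        "quality_score": None,  # filled in later by an async evaluator
    }
```

Storing a prompt hash instead of the prompt itself keeps traces joinable across calls (same prompt, same hash) without leaking user data into your tracing backend.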
Detecting prompt drift
Prompt drift happens when model behavior changes without any code change, caused by model updates, input distribution shifts, or gradual prompt degradation. Detect it by:
- Golden dataset regression tests: Run a fixed set of test cases on every deployment. Alert if quality score drops more than 5%.
- Output distribution monitoring: Track distributions of output length, format compliance, and sentiment. Sudden shifts indicate drift.
- User feedback signals: Thumbs up/down, edit rates, and re-ask rates are the ground truth signal for quality degradation.
- A/B evaluation: When rolling out a new prompt version, shadow-test against the old version and compare quality scores before full rollout.
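The golden-dataset regression check from the first bullet reduces to comparing mean quality scores against a relative threshold. A minimal sketch, assuming scores are already computed per test case (the function name and 5% default are illustrative):

```python
def regression_check(baseline_scores: list[float],
                     current_scores: list[float],
                     threshold: float = 0.05) -> dict:
    """Alert if mean quality drops more than `threshold` (5%) vs. baseline."""
    baseline = sum(baseline_scores) / len(baseline_scores)
    current = sum(current_scores) / len(current_scores)
    relative_drop = (baseline - current) / baseline
    return {
        "baseline": baseline,
        "current": current,
        "drop": relative_drop,
        "alert": relative_drop > threshold,
    }

# A drop from 0.90 to 0.80 mean quality is ~11%, well past the 5% gate.
result = regression_check([0.9, 0.9, 0.9], [0.8, 0.8, 0.8])
```

Run this on every deployment against the same fixed test set; a per-case diff (which specific golden cases regressed) is usually more actionable than the aggregate alone.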
Observability stack recommendations
- Tracing: OpenTelemetry + MoltBot SDK (automatic LLM span instrumentation)
- Metrics: Prometheus + Grafana or MoltBot built-in dashboards
- Quality evaluation: MoltBot eval suite (LLM-as-judge + golden dataset)
- Alerts: PagerDuty or OpsGenie on quality score drops, cost spikes, and error rate thresholds
Full AI observability built into MoltBot
Automatic tracing, cost per request, quality scoring, drift alerts. Zero config. 14-day free trial.
Start Free Trial →