AI agents fail in ways that are qualitatively different from traditional software. A 504 timeout is obvious. An agent that confidently returns a plausible-but-wrong answer is not. Effective observability for agents means capturing not just runtime metrics but also the semantic quality of agent behavior.
The 4 layers of agent observability
Execution traces
Full step-by-step logs of every agent run: which tools were called, with what arguments, what they returned, and how long each step took. The equivalent of distributed tracing for agent workflows.
LLM call logs
Every prompt sent to the LLM and every response received, including token counts, latency, model version, and temperature. Essential for debugging unexpected outputs and cost attribution.
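A minimal sketch of a wrapper that logs each LLM call with the fields listed above. `call_fn` stands in for whatever provider client you use, and the whitespace-based token count is a crude proxy for the provider's real usage numbers; names here are illustrative, not a specific vendor API.

```python
import time
import uuid

def log_llm_call(log, model, prompt, call_fn, temperature=0.0):
    """Invoke an LLM and append a structured log record for the call."""
    start = time.perf_counter()
    response = call_fn(prompt)
    log.append({
        "id": str(uuid.uuid4()),
        "model": model,
        "temperature": temperature,
        "prompt": prompt,
        "response": response,
        # Crude whitespace proxy; in practice read token counts from the provider's response.
        "prompt_tokens": len(prompt.split()),
        "completion_tokens": len(response.split()),
        "latency_ms": (time.perf_counter() - start) * 1000,
    })
    return response

log = []
fake_llm = lambda p: "Paris is the capital of France."  # stub model for illustration
answer = log_llm_call(log, "example-model", "What is the capital of France?", fake_llm)
```

In production the `log` list would be replaced by a write to your logging backend, but the record shape (prompt, response, tokens, latency, model version, temperature) is the part that matters for debugging and cost attribution.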
Output evaluations (evals)
Automated quality scoring of agent outputs against ground truth or LLM-as-judge rubrics. Catches quality regressions before users notice them. Run on every output or sampled at a configured rate.
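The sampled mode can be made deterministic by hashing the run ID, so a given run is always (or never) selected at a given rate, even across retries. A sketch, with `should_eval` as an illustrative helper name:

```python
import hashlib

def should_eval(run_id: str, sample_rate: float) -> bool:
    """Deterministically sample a fraction of runs for evaluation.

    Hashing the run ID maps it to a uniform bucket in [0, 1), so the
    decision is stable across retries and replays.
    """
    digest = hashlib.sha256(run_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

# At a 5% rate, roughly 50 of 1000 runs get evaluated.
sampled = [rid for rid in (f"run-{i}" for i in range(1000)) if should_eval(rid, 0.05)]
```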
Operational metrics & alerts
Aggregate metrics: task success rate, error rate by type, p50/p95 latency, token spend per task, and cost per outcome. Alerts via Slack or PagerDuty when thresholds are breached.
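The latency percentiles above can be computed with a simple nearest-rank calculation over a window of samples; a sketch (a real deployment would use a metrics library or time-series database rather than in-memory lists):

```python
def percentile(values, pct):
    """Nearest-rank percentile over a list of samples."""
    ranked = sorted(values)
    idx = max(0, int(round(pct / 100 * len(ranked))) - 1)
    return ranked[idx]

# One slow outlier dominates p95 but barely moves p50.
latencies_ms = [120, 95, 110, 3400, 130, 105, 90, 125, 115, 100]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

This is why the section recommends tracking both: p50 describes the typical run, while p95 surfaces the tail that users actually complain about.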
Implementing traces in MoltBot
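A minimal sketch of one way to wire tool-call tracing into an agent, using a decorator that records arguments, results, and latency for each call. The `traced` helper and in-memory `TRACE` list are illustrative stand-ins, not a documented MoltBot API:

```python
import functools
import time

TRACE = []  # in a real system, ship these records to your tracing backend

def traced(tool_name):
    """Decorator that records each tool call's args, result, and latency."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACE.append({
                "tool": tool_name,
                "args": args,
                "kwargs": kwargs,
                "result": result,
                "duration_ms": (time.perf_counter() - start) * 1000,
            })
            return result
        return inner
    return wrap

@traced("search_docs")
def search_docs(query):
    # Hypothetical tool; a real one would hit a search index.
    return [f"doc matching {query!r}"]

search_docs("refund policy")
```

Decorating each tool once gives you the full step-by-step record described above without touching the agent's control flow, which is the same idea distributed-tracing instrumentation uses.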
Running evals in production
Evals are the hardest part of agent observability to get right. There are three approaches, each with different cost/coverage tradeoffs:
- Ground truth comparison: Check agent output against labeled examples. High precision, requires labeled data. Good for structured output tasks (classification, extraction).
- LLM-as-judge: Use a separate model (typically GPT-5 or Claude Opus 4) to score agent outputs on rubrics: accuracy, helpfulness, safety, format compliance. No labeled data required. Works for open-ended generation.
- Human-in-the-loop spot-check: Sample 2–5% of production outputs for human review. Catches distribution shifts that automated evals miss. Essential for high-stakes tasks.
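As one illustration of the LLM-as-judge approach above, a sketch that builds a rubric prompt, asks a judge model for JSON scores, and averages them. `judge_fn` is a placeholder for whatever chat-completion client you use; the stub below just returns canned scores:

```python
import json

RUBRIC = """Score the answer from 1-5 on accuracy, helpfulness, and format compliance.
Reply with JSON like {"accuracy": 4, "helpfulness": 5, "format": 5}."""

def judge(question, answer, judge_fn):
    """Score an agent output against the rubric using a judge model.

    `judge_fn` stands in for any chat-completion call; wire in your
    provider's client and add retry/parse-failure handling in practice.
    """
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"
    scores = json.loads(judge_fn(prompt))
    scores["mean"] = sum(scores.values()) / len(scores)
    return scores

# Stub judge for illustration; replace with a real model call.
stub = lambda prompt: '{"accuracy": 4, "helpfulness": 5, "format": 5}'
result = judge("What is 2+2?", "4", stub)
```

Per-dimension scores are worth keeping alongside the mean: a regression in format compliance alone points to a very different fix than a regression in accuracy.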
What to alert on
Configure alerts for: task error rate > 5% over a 15-minute window, p95 latency > 30s, eval score drop > 10% vs the 7-day average, and any tool call showing a 3× spike in its error rate. Don't alert on individual failures; alert on rates.
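These thresholds can be checked in one place per aggregation window. A sketch, with hypothetical field names for the window aggregates:

```python
def check_alerts(window):
    """Evaluate one aggregation window against the alert thresholds.

    `window` is a dict of aggregates (illustrative field names);
    returns the list of alert messages to route to Slack or PagerDuty.
    """
    alerts = []
    if window["error_rate"] > 0.05:
        alerts.append(f"task error rate {window['error_rate']:.1%} > 5%")
    if window["p95_latency_s"] > 30:
        alerts.append(f"p95 latency {window['p95_latency_s']}s > 30s")
    drop = 1 - window["eval_score"] / window["eval_score_7d_avg"]
    if drop > 0.10:
        alerts.append(f"eval score down {drop:.0%} vs 7-day average")
    return alerts

window = {"error_rate": 0.08, "p95_latency_s": 12.0,
          "eval_score": 0.70, "eval_score_7d_avg": 0.82}
fired = check_alerts(window)  # error rate and eval drop fire; latency does not
```

Note that every check compares a rate or aggregate, not a single failure, matching the rule above.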
Built-in observability on MoltBot
Full traces, eval scoring, and operational metrics out of the box. 14-day free trial.
Start Free Trial →