AI agents fail in ways that are qualitatively different from traditional software. A 504 timeout is obvious. An agent that confidently returns a plausible-but-wrong answer is not. Effective observability for agents means capturing not just runtime metrics but also the semantic quality of agent behavior.
The 4 layers of agent observability
Execution traces
Full step-by-step logs of every agent run: which tools were called, with what arguments, what they returned, and how long each step took. The equivalent of distributed tracing for agent workflows.
LLM call logs
Every prompt sent to the LLM and every response received, including token counts, latency, model version, and temperature. Essential for debugging unexpected outputs and cost attribution.
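A minimal sketch of a wrapper that logs each LLM call with the fields listed above. `call_fn` stands in for whatever provider client you use, and the whitespace-based token count is a crude proxy for the provider's real usage numbers; names here are illustrative, not a specific vendor API.

```python
import time
import uuid

def log_llm_call(log, model, prompt, call_fn, temperature=0.0):
    """Invoke an LLM and append a structured log record for the call."""
    start = time.perf_counter()
    response = call_fn(prompt)
    log.append({
        "id": str(uuid.uuid4()),
        "model": model,
        "temperature": temperature,
        "prompt": prompt,
        "response": response,
        # Crude whitespace proxy; in practice read token counts from the provider's response.
        "prompt_tokens": len(prompt.split()),
        "completion_tokens": len(response.split()),
        "latency_ms": (time.perf_counter() - start) * 1000,
    })
    return response

log = []
fake_llm = lambda p: "Paris is the capital of France."  # stub model for illustration
answer = log_llm_call(log, "example-model", "What is the capital of France?", fake_llm)
```

In production the `log` list would be replaced by a write to your logging backend, but the record shape (prompt, response, tokens, latency, model version, temperature) is the part that matters for debugging and cost attribution.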
Output evaluations (evals)
Automated quality scoring of agent outputs against ground truth or LLM-as-judge rubrics. Catches quality regressions before users notice them. Run on every output or sampled at a configured rate.
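The sampled mode can be made deterministic by hashing the run ID, so a given run is always (or never) selected at a given rate, even across retries. A sketch, with `should_eval` as an illustrative helper name:

```python
import hashlib

def should_eval(run_id: str, sample_rate: float) -> bool:
    """Deterministically sample a fraction of runs for evaluation.

    Hashing the run ID maps it to a uniform bucket in [0, 1), so the
    decision is stable across retries and replays.
    """
    digest = hashlib.sha256(run_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

# At a 5% rate, roughly 50 of 1000 runs get evaluated.
sampled = [rid for rid in (f"run-{i}" for i in range(1000)) if should_eval(rid, 0.05)]
```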
Operational metrics & alerts
Aggregate metrics: task success rate, error rate by type, p50/p95 latency, token spend per task, and cost per outcome. Alerts via Slack or PagerDuty when thresholds are breached.
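The latency percentiles above can be computed with a simple nearest-rank calculation over a window of samples; a sketch (a real deployment would use a metrics library or time-series database rather than in-memory lists):

```python
def percentile(values, pct):
    """Nearest-rank percentile over a list of samples."""
    ranked = sorted(values)
    idx = max(0, int(round(pct / 100 * len(ranked))) - 1)
    return ranked[idx]

# One slow outlier dominates p95 but barely moves p50.
latencies_ms = [120, 95, 110, 3400, 130, 105, 90, 125, 115, 100]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

This is why the section recommends tracking both: p50 describes the typical run, while p95 surfaces the tail that users actually complain about.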
Implementing traces in MoltBot
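A minimal sketch of one way to wire tool-call tracing into an agent, using a decorator that records arguments, results, and latency for each call. The `traced` helper and in-memory `TRACE` list are illustrative stand-ins, not a documented MoltBot API:

```python
import functools
import time

TRACE = []  # in a real system, ship these records to your tracing backend

def traced(tool_name):
    """Decorator that records each tool call's args, result, and latency."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACE.append({
                "tool": tool_name,
                "args": args,
                "kwargs": kwargs,
                "result": result,
                "duration_ms": (time.perf_counter() - start) * 1000,
            })
            return result
        return inner
    return wrap

@traced("search_docs")
def search_docs(query):
    # Hypothetical tool; a real one would hit a search index.
    return [f"doc matching {query!r}"]

search_docs("refund policy")
```

Decorating each tool once gives you the full step-by-step record described above without touching the agent's control flow, which is the same idea distributed-tracing instrumentation uses.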
Running evals in production
Evals are the hardest part of agent observability to get right. There are three approaches, each with different cost/coverage tradeoffs:
- Ground truth comparison: Check agent output against labeled examples. High precision, requires labeled data. Good for structured output tasks (classification, extraction).
- LLM-as-judge: Use a separate model (typically GPT-5 or Claude Opus 4) to score agent outputs on rubrics: accuracy, helpfulness, safety, format compliance. No labeled data required. Works for open-ended generation.
- Human-in-the-loop spot-check: Sample 2–5% of production outputs for human review. Catches distribution shifts that automated evals miss. Essential for high-stakes tasks.
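As one illustration of the LLM-as-judge approach above, a sketch that builds a rubric prompt, asks a judge model for JSON scores, and averages them. `judge_fn` is a placeholder for whatever chat-completion client you use; the stub below just returns canned scores:

```python
import json

RUBRIC = """Score the answer from 1-5 on accuracy, helpfulness, and format compliance.
Reply with JSON like {"accuracy": 4, "helpfulness": 5, "format": 5}."""

def judge(question, answer, judge_fn):
    """Score an agent output against the rubric using a judge model.

    `judge_fn` stands in for any chat-completion call; wire in your
    provider's client and add retry/parse-failure handling in practice.
    """
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"
    scores = json.loads(judge_fn(prompt))
    scores["mean"] = sum(scores.values()) / len(scores)
    return scores

# Stub judge for illustration; replace with a real model call.
stub = lambda prompt: '{"accuracy": 4, "helpfulness": 5, "format": 5}'
result = judge("What is 2+2?", "4", stub)
```

Per-dimension scores are worth keeping alongside the mean: a regression in format compliance alone points to a very different fix than a regression in accuracy.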
What to alert on
Configure alerts for: task error rate > 5% over a 15-minute window, p95 latency > 30s, eval score drop > 10% vs the 7-day average, and any tool call showing a 3× spike in its error rate. Don't alert on individual failures; alert on rates.
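These thresholds can be checked in one place per aggregation window. A sketch, with hypothetical field names for the window aggregates:

```python
def check_alerts(window):
    """Evaluate one aggregation window against the alert thresholds.

    `window` is a dict of aggregates (illustrative field names);
    returns the list of alert messages to route to Slack or PagerDuty.
    """
    alerts = []
    if window["error_rate"] > 0.05:
        alerts.append(f"task error rate {window['error_rate']:.1%} > 5%")
    if window["p95_latency_s"] > 30:
        alerts.append(f"p95 latency {window['p95_latency_s']}s > 30s")
    drop = 1 - window["eval_score"] / window["eval_score_7d_avg"]
    if drop > 0.10:
        alerts.append(f"eval score down {drop:.0%} vs 7-day average")
    return alerts

window = {"error_rate": 0.08, "p95_latency_s": 12.0,
          "eval_score": 0.70, "eval_score_7d_avg": 0.82}
fired = check_alerts(window)  # error rate and eval drop fire; latency does not
```

Note that every check compares a rate or aggregate, not a single failure, matching the rule above.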
Built-in observability on MoltBot
Full traces, eval scoring, and operational metrics out of the box. 14-day free trial.
Start Free Trial →