Unit testing a function is straightforward. Evaluating an agent โ which may take dozens of tool calls, produce non-deterministic outputs, and interact with external systems โ is fundamentally different. You need a framework that tolerates variance while still catching meaningful regressions.
The four evaluation methods
๐ 1. Golden Dataset Evaluation
A curated set of input/expected-output pairs. Run your agent on every input and score its output against the expected answer using exact match, fuzzy match, or a rubric. The most interpretable eval method.
๐ค 2. LLM-as-Judge
Use a strong model (Claude Opus 4, GPT-5) to grade your agent's outputs against a rubric. Scales to open-ended tasks where exact-match scoring doesn't work. The judge evaluates helpfulness, accuracy, tone, and task completion.
๐ 3. Regression Testing
Re-run your eval suite every time you change the model, prompt, or tool configuration. Track scores over time. Alert when any scenario drops below its baseline. Prevents silent regressions when you upgrade to a new model version.
๐ด 4. Red-Teaming
Adversarial testing: prompt injection attempts, edge case inputs, malformed tool outputs, and out-of-distribution requests. Systematically tries to make the agent fail, misbehave, or produce harmful outputs.
MoltBot eval configuration
What to measure
- Task completion rate: Did the agent complete the assigned task? Binary yes/no, measured per scenario.
- Correctness: For tasks with ground truth, what % of outputs are correct? Track by category.
- Tool call accuracy: Did the agent call the right tools in the right order? Hallucinated or spurious tool calls are a common regression signal.
- Cost per task: Track token usage per eval run. Regressions in quality often come with regressions in cost.
- Latency: P50 and P95 completion time. New model upgrades sometimes improve quality but regress on latency.
Built-in eval suite on MoltBot
Golden datasets, LLM-as-judge, regression tracking, CI/CD integration. 14-day free trial.
Start Free Trial โ