Testing LLM Applications: Unit Tests, Regression Suites & Evals

When you change a prompt, you can accidentally break outputs that were previously working. Without a test suite, you discover this through user complaints in production. The solution is a structured testing approach adapted for the non-deterministic nature of LLMs.

The four LLM testing strategies

1. Golden Dataset (Regression Testing)

A curated set of 50–200 input/expected-output pairs. Run your pipeline against these after every change. Flag outputs that deviate from the expected results. This is the single most valuable test type for catching regressions.

When: Every prompt change, model upgrade, or retrieval system update.

2. LLM-as-Judge

Use a separate, high-quality LLM (e.g., Claude Opus) to evaluate your agent's outputs for correctness, relevance, and safety. Score outputs 1–5 and alert when average score drops below threshold. More flexible than exact-match evaluation.

When: Open-ended outputs (summaries, drafts) where exact-match comparison is too strict.

3. Property-Based Tests

Assert structural properties of outputs that must always hold: "response is valid JSON," "no PII in output," "response length is under 500 tokens," "language is EN." Fast, cheap, and catches format regressions instantly.

When: Any structured output — JSON extraction, classification, data transformation.

4. Adversarial / Red-Team Tests

A fixed set of adversarial inputs — prompt injection attempts, ambiguous queries, edge cases — run against every release. Ensures safety and robustness properties don't regress as you iterate.

When: Any agent with tool access or user-facing deployment.

Automated eval suite with MoltBot

from moltbot.eval import EvalSuite, GoldenDataset, LLMJudge

suite = EvalSuite(
    dataset=GoldenDataset.load("./evals/golden.jsonl"),
    judges=[
        LLMJudge(model="claude-opus-4", criteria=[
            "correctness", "relevance", "safety"
        ]),
    ],
    property_checks=[
        lambda r: r.is_valid_json(),
        lambda r: r.contains_no_pii(),
        lambda r: len(r.text) < 2000,
    ],
    fail_threshold=0.85,   # fail CI if score drops below 85%
)

results = suite.run(agent)
suite.report()             # prints per-test breakdown
      

CI/CD integration

Run property-based tests on every PR (fast, cheap, ~5 seconds).
Run golden dataset + LLM-as-judge on merge to main (1–5 minutes).
Run full red-team suite before production releases.
Set score thresholds in your pipeline — block deployment if eval score drops below baseline.
Track eval scores over time — a gradual drift is as dangerous as a sharp regression.

Built-in eval suite on MoltBot

Golden datasets, LLM-as-judge, property checks, and CI/CD integration — zero setup. 14-day free trial.

Start Free Trial →