Evaluation & Regression Testing

You upgrade the model. Agent behavior changes. No tests caught it.

On this page

The Failure Scenario
Why This Matters
How to Implement
Production Checklist
Common Pitfalls
Terminal Output

The Failure Scenario

Your team runs an insurance-claims agent on GPT-4-Turbo. It works well: it correctly classifies claims, routes urgent cases, and generates denial letters with appropriate regulatory language. The provider releases a new model version. Your team upgrades on a Friday because the new model is cheaper and faster. Nobody runs the eval suite, because there is no eval suite.

Monday morning, the compliance team flags that the agent has started approving claims that should require manual review. The new model interprets the classification rubric differently, treating "borderline" cases as approvals instead of escalations. By the time anyone notices, 340 claims have been auto-approved without human review, totaling $2.1M in payouts that should have been audited.

The root cause isn't the model upgrade. It's the absence of a test suite that would have caught the behavioral shift in staging. Traditional software ships with unit tests, integration tests, and regression tests. Agent systems ship with vibes and a Slack message that says "seems to be working."

Why This Matters

Agents are non-deterministic by nature. The same input can produce different outputs across runs, model versions, and even temperature settings. This makes traditional assertion-based testing insufficient on its own, but it does not make testing optional. It means you need a different testing strategy: one built around behavioral expectations, statistical thresholds, and golden datasets rather than exact string matching.

The cost of not testing is amplified by the speed at which agent behavior can change. A traditional software regression requires a code change, a merge, and a deployment. An agent regression can happen because a provider updated their model weights, changed their system prompt processing, or modified their safety filters. You didn't change anything, but your agent's behavior changed anyway.

Evaluation frameworks are also the only reliable way to compare agent architectures. Should you use chain-of-thought prompting or ReAct? Should the retrieval step come before or after classification? Without a scored eval suite, these decisions are made on intuition. With one, they're made on data.

How to Implement

Start with a golden dataset: 50-200 input-output pairs that represent your agent's core behaviors. Each entry should include the input, the expected output (or acceptable output range), and the behavioral assertion. Not "output must equal X", but "output must contain a denial reason" or "output must route to the escalation queue." Store these as versioned fixtures alongside your agent code.

Build your eval runner as a CI-integrated script that executes every golden-dataset case against the current agent configuration and scores the results. Use assertion functions that check semantic properties: does the response mention the required regulatory clause? Did the agent call the correct tool? Did it refuse an out-of-scope request? Score each case as pass/fail and track the pass rate over time. Set a threshold. For example, the agent must pass 95% of golden cases to ship.

For model upgrades specifically, run the eval suite against both the old and new model in parallel. Diff the results. Any case that passed on the old model and fails on the new one is a regression candidate that needs human review before the upgrade ships.

evals/eval_runner.py

import json
from dataclasses import dataclass
from agent import run_agent  # your agent entrypoint

@dataclass
class EvalCase:
    id: str
    input: str
    assertions: list[dict]  # {"type": "contains"|"tool_called"|"classified_as", "value": str}

@dataclass
class EvalResult:
    case_id: str
    passed: bool
    failures: list[str]

def run_eval_suite(suite_path: str, model: str) -> dict:
    with open(suite_path) as f:
        cases = [EvalCase(**c) for c in json.load(f)]

    results = []
    for case in cases:
        output = run_agent(case.input, model=model)
        failures = []
        for assertion in case.assertions:
            if assertion["type"] == "contains" and assertion["value"] not in output.text:
                failures.append(f"Missing: {assertion['value']}")
            if assertion["type"] == "tool_called" and assertion["value"] not in output.tools_used:
                failures.append(f"Tool not called: {assertion['value']}")
            if assertion["type"] == "classified_as" and output.classification != assertion["value"]:
                failures.append(f"Expected class '{assertion['value']}', got '{output.classification}'")
        results.append(EvalResult(case.id, len(failures) == 0, failures))

    passed = sum(1 for r in results if r.passed)
    return {"total": len(results), "passed": passed, "rate": passed / len(results), "results": results}

Production Checklist

✓Build a golden dataset of at least 50 input-output pairs covering core agent behaviors, edge cases, and refusal scenarios
✓Implement assertion-based checks (contains, tool_called, classified_as, refused) rather than exact-match comparisons
✓Run the eval suite in CI on every prompt-template or agent-config change, with a pass-rate threshold that blocks deployment
✓Before any model upgrade, run the full suite against both old and new models and review all regressions
✓Track eval pass rates as a time-series metric. Declining scores indicate drift even without explicit changes
✓Include adversarial cases in the golden dataset: prompt injections, out-of-scope requests, and attempts to bypass guardrails
✓Version your golden dataset alongside your agent code. Eval cases are test fixtures, not throwaway scripts
✓Set up a nightly eval run against production prompts to catch provider-side model changes that happen without notice
✓Add a human-review workflow for cases where the new model produces different but potentially valid outputs

Common Pitfalls

The biggest mistake is writing evals that are too brittle. If your assertion is "output must exactly equal this 200-word paragraph," every model change will break it. Evals should test behavioral properties: did the agent take the right action, refuse the wrong request, include the required information? Use semantic checks, not string equality.

Another pitfall is building a golden dataset once and never updating it. As your agent's scope expands, your eval suite must expand with it. Every production bug should become an eval case. Every new feature should ship with at least 5 new golden-dataset entries. Treat your eval suite the way you treat your test suite. It's a living artifact.

Finally, teams often skip adversarial eval cases because they feel like security's problem. They're not. An agent that passes all happy-path evals but folds under a basic prompt injection is not a passing agent. Include at least 10-15% adversarial cases in every eval suite: jailbreak attempts, scope violations, and data-exfiltration prompts.

Terminal Output

terminal

$ clawproof --check 07

  CHECK 07 — Evaluation & Regression Testing
  ─────────────────────────────────────────────
  ✓ Golden dataset found: 127 cases in evals/golden/
  ✓ Assertion types: contains (43), tool_called (31), classified_as (28), refused (25)
  ✓ CI integration detected: eval suite runs on PR merge
  ✗ FAIL: No model-comparison workflow configured for upgrades
  ✓ Adversarial cases present: 19/127 (15%)
  ✗ FAIL: Last eval run was 23 days ago — no nightly schedule detected
  ✓ Pass-rate threshold set: 95% (current: 97.6%)

  Result: 2 issues found — configure upgrade diffing and nightly runs
  Severity: MEDIUM — behavioral regressions will go undetected

$ clawproof --related

Referenced In

playbookSecuring LangChain Agent Pipelines

Previous← #06 Secrets Management Next#08 Data Boundaries & RAG Governance →