📊 Evaluate
Test sets and golden datasets
Black-box, glass-box, and white-box evaluation.
Three modes of evaluation, all necessary
- Black-box (final-response) eval scores only the end result. Cheap, simple, and blind to bad process. Example check: did it resolve the ticket? (yes/no)
- Glass-box (trajectory) eval scores the sequence of tool calls and decisions. Catches the right-answer-wrong-route failure mode. Example check: did it call retrieve-policy before drafting?
- White-box (single-step) eval scores each decision in isolation. Best for regression testing and for isolating which step regressed when quality drops. Example check: given state X, did it pick tool Y?
Production teams run all three. They cost different amounts to maintain and tell you different things.
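The three modes can be sketched as three grading functions over one recorded agent run. The trajectory format (a list of tool-call dicts) and the tool names are illustrative assumptions, not any real framework's API:

```python
# One recorded agent run: a tool-call trajectory plus a final answer.
# The shape of these records is an assumption for this sketch.
trajectory = [
    {"tool": "lookupAccount", "args": {"user_id": "u-42"}},
    {"tool": "retrievePolicy", "args": {"topic": "refunds"}},
    {"tool": "draftReply", "args": {"tone": "formal"}},
]
final_answer = "Your refund has been issued."

def black_box(answer: str) -> bool:
    # Judge only the end result: does the reply address the ticket?
    return "refund" in answer.lower()

def glass_box(traj: list[dict]) -> bool:
    # Judge the route: policy must be retrieved before drafting.
    tools = [step["tool"] for step in traj]
    return tools.index("retrievePolicy") < tools.index("draftReply")

def single_step(traj: list[dict], state_index: int, expected_tool: str) -> bool:
    # Judge one decision in isolation: given the state at step N,
    # did the agent pick the expected tool?
    return traj[state_index]["tool"] == expected_tool
```

Note how each grader needs progressively more of the run: black-box reads only the answer, glass-box reads the whole trajectory, and single-step reads one state transition.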
Golden datasets, silver datasets
Your test set is the most valuable thing in your eval infrastructure. Two tiers:
- Gold — hand-curated, human-approved, representative of your most common and most critical interactions. Each case includes the user input, expected facts, expected tool trajectory, and expected parameters. Grows slowly; you protect it.
- Silver — synthetic or production-sampled. Much larger, much cheaper to grow, and much less trusted. You promote silver → gold through human review.
A typical pattern: silver is where you catch regressions; gold is where you publish pass rates.
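The silver → gold promotion gate can be sketched as a minimal in-memory store; the field names and the `reviewed` flag are illustrative assumptions:

```python
# Gold: hand-curated and human-approved; grows slowly.
gold: list[dict] = []

# Silver: synthetic or production-sampled; larger and less trusted.
silver: list[dict] = [
    {"input": "Where is my refund?", "expected": "refund status", "reviewed": True},
    {"input": "asdf refund??", "expected": "refund status", "reviewed": False},
]

def promote_reviewed(silver_set: list[dict], gold_set: list[dict]) -> None:
    # Promotion gate: only cases a human has approved move to gold.
    for case in list(silver_set):
        if case["reviewed"]:
            silver_set.remove(case)
            gold_set.append(case)

promote_reviewed(silver, gold)
```

The point of the gate is that nothing enters gold without a human sign-off, so gold pass rates stay trustworthy even as silver grows cheaply.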
How to build the initial golden dataset
- Pick 20–50 cases that span: the happy path, three realistic edge cases, two adversarial inputs, two ambiguous requests, and two handoff cases.
- For each, write the user input and the expected outcome in plain English.
- For a handful, also specify the expected tool trajectory (the "should have called `lookupAccount` before `draftReply`" check).
- Review with a domain expert before you wire it into CI.
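A gold case in the shape the steps above describe, plus the trajectory-order check, might look like this (the field names are illustrative assumptions):

```python
# One gold case: user input, expected outcome in plain English,
# and the expected tool trajectory for the ordering check.
case = {
    "input": "Please update the email on my account and confirm by reply.",
    "expected_outcome": "Email updated; confirmation drafted.",
    "expected_trajectory": ["lookupAccount", "updateEmail", "draftReply"],
}

def trajectory_passes(actual_tools: list[str], case: dict) -> bool:
    # Require the expected tools to appear in order; extra calls are allowed.
    # (Subsequence check: `in` on an iterator consumes until a match.)
    it = iter(actual_tools)
    return all(tool in it for tool in case["expected_trajectory"])

# An agent run that inserts an extra retrieval step still passes:
run = ["lookupAccount", "retrievePolicy", "updateEmail", "draftReply"]
```

Treating the expected trajectory as an ordered subsequence rather than an exact match keeps the test from breaking every time the agent adds a harmless extra step.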
Synthetic test generation
Two legitimate ways to grow your test set with synthetic data:
- Paraphrase expansion — take a gold case, ask a model to generate N paraphrasings of the user input, verify that the expected outcome still applies, promote to gold.
- Failure-mode generation — take the errors the agent has made, and ask a model to generate variations that would trigger similar errors. Great for red-teaming; requires careful human review.
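The paraphrase-expansion pipeline can be sketched with the model call stubbed out; `generate_paraphrases` stands in for a real LLM call and `human_verifies` for the review step, both assumptions of this sketch. The human-verification gate is the part that matters:

```python
def generate_paraphrases(text: str, n: int) -> list[str]:
    # Stub: a real implementation would call a model here.
    return [f"{text} (variant {i})" for i in range(n)]

def human_verifies(candidate: dict) -> bool:
    # Placeholder for the human-review step; this sketch always defers,
    # so nothing is promoted without a real reviewer.
    return False

def expand_gold_case(case: dict, n: int = 3) -> list[dict]:
    # Generate N paraphrases of the gold input, then keep only the
    # candidates a human confirms still match the expected outcome.
    candidates = [
        {**case, "input": p, "reviewed": False}
        for p in generate_paraphrases(case["input"], n)
    ]
    return [c for c in candidates if human_verifies(c)]
```

With the review step stubbed to reject everything, the pipeline promotes nothing; swapping in an actual reviewer (or reviewer UI) is what turns silver candidates into gold.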
Sources and further reading
- Anthropic — Demystifying evals for AI agents
- Maxim — The Evolution of AI Quality: From Benchmarks to Simulation
- Confident AI — LLM Agent Evaluation Guide
- LangChain — agentevals