📊 Evaluate
Test sets and golden datasets
Black-box, glass-box, and white-box evaluation.
Three modes of evaluation, all necessary
- Black-box (final-response) eval scores only the end result. Cheap, simple, and blind to bad process. Example check: did it resolve the ticket? (yes/no)
- Glass-box (trajectory) eval scores the sequence of tool calls and decisions. Catches the right-answer-wrong-route failure mode. Example check: did it call retrieve-policy before drafting?
- White-box (single-step) eval scores each decision in isolation. Best for regression testing and for isolating which step regressed when quality drops. Example check: given state X, did it pick tool Y?
Production teams run all three. They cost different amounts to maintain and tell you different things.
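The three modes can be sketched as three grading functions over one recorded agent run. The trajectory format (a list of tool-call dicts) and the tool names are illustrative assumptions, not any real framework's API:

```python
# One recorded agent run: a tool-call trajectory plus a final answer.
# The shape of these records is an assumption for this sketch.
trajectory = [
    {"tool": "lookupAccount", "args": {"user_id": "u-42"}},
    {"tool": "retrievePolicy", "args": {"topic": "refunds"}},
    {"tool": "draftReply", "args": {"tone": "formal"}},
]
final_answer = "Your refund has been issued."

def black_box(answer: str) -> bool:
    # Judge only the end result: does the reply address the ticket?
    return "refund" in answer.lower()

def glass_box(traj: list[dict]) -> bool:
    # Judge the route: policy must be retrieved before drafting.
    tools = [step["tool"] for step in traj]
    return tools.index("retrievePolicy") < tools.index("draftReply")

def single_step(traj: list[dict], state_index: int, expected_tool: str) -> bool:
    # Judge one decision in isolation: given the state at step N,
    # did the agent pick the expected tool?
    return traj[state_index]["tool"] == expected_tool
```

Note how each grader needs progressively more of the run: black-box reads only the answer, glass-box reads the whole trajectory, and single-step reads one state transition.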
Golden datasets, silver datasets
Your test set is the most valuable thing in your eval infrastructure. Two tiers:
- Gold — hand-curated, human-approved, representative of your most common and most critical interactions. Each case includes the user input, expected facts, expected tool trajectory, and expected parameters. Grows slowly; you protect it.
- Silver — synthetic or production-sampled. Much larger, much cheaper to grow, and much less trusted. You promote silver → gold through human review.
A typical pattern: silver is where you catch regressions; gold is where you publish pass rates.
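The silver → gold promotion gate can be sketched as a minimal in-memory store; the field names and the `reviewed` flag are illustrative assumptions:

```python
# Gold: hand-curated and human-approved; grows slowly.
gold: list[dict] = []

# Silver: synthetic or production-sampled; larger and less trusted.
silver: list[dict] = [
    {"input": "Where is my refund?", "expected": "refund status", "reviewed": True},
    {"input": "asdf refund??", "expected": "refund status", "reviewed": False},
]

def promote_reviewed(silver_set: list[dict], gold_set: list[dict]) -> None:
    # Promotion gate: only cases a human has approved move to gold.
    for case in list(silver_set):
        if case["reviewed"]:
            silver_set.remove(case)
            gold_set.append(case)

promote_reviewed(silver, gold)
```

The point of the gate is that nothing enters gold without a human sign-off, so gold pass rates stay trustworthy even as silver grows cheaply.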
How to build the initial golden dataset
- Pick 20–50 cases that span: the happy path, three realistic edge cases, two adversarial inputs, two ambiguous requests, and two handoff cases.
- For each, write the user input and the expected outcome in plain English.
- For a handful, also specify the expected tool trajectory (the "should have called `lookupAccount` before `draftReply`" check).
- Review with a domain expert before you wire it into CI.
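A gold case in the shape the steps above describe, plus the trajectory-order check, might look like this (the field names are illustrative assumptions):

```python
# One gold case: user input, expected outcome in plain English,
# and the expected tool trajectory for the ordering check.
case = {
    "input": "Please update the email on my account and confirm by reply.",
    "expected_outcome": "Email updated; confirmation drafted.",
    "expected_trajectory": ["lookupAccount", "updateEmail", "draftReply"],
}

def trajectory_passes(actual_tools: list[str], case: dict) -> bool:
    # Require the expected tools to appear in order; extra calls are allowed.
    # (Subsequence check: `in` on an iterator consumes until a match.)
    it = iter(actual_tools)
    return all(tool in it for tool in case["expected_trajectory"])

# An agent run that inserts an extra retrieval step still passes:
run = ["lookupAccount", "retrievePolicy", "updateEmail", "draftReply"]
```

Treating the expected trajectory as an ordered subsequence rather than an exact match keeps the test from breaking every time the agent adds a harmless extra step.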
Synthetic test generation
Two legitimate ways to grow your test set with synthetic data:
- Paraphrase expansion — take a gold case, ask a model to generate N paraphrasings of the user input, verify that the expected outcome still applies, promote to gold.
- Failure-mode generation — take the errors the agent has made, and ask a model to generate variations that would trigger similar errors. Great for red-teaming; requires careful human review.
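The paraphrase-expansion pipeline can be sketched with the model call stubbed out; `generate_paraphrases` stands in for a real LLM call and `human_verifies` for the review step, both assumptions of this sketch. The human-verification gate is the part that matters:

```python
def generate_paraphrases(text: str, n: int) -> list[str]:
    # Stub: a real implementation would call a model here.
    return [f"{text} (variant {i})" for i in range(n)]

def human_verifies(candidate: dict) -> bool:
    # Placeholder for the human-review step; this sketch always defers,
    # so nothing is promoted without a real reviewer.
    return False

def expand_gold_case(case: dict, n: int = 3) -> list[dict]:
    # Generate N paraphrases of the gold input, then keep only the
    # candidates a human confirms still match the expected outcome.
    candidates = [
        {**case, "input": p, "reviewed": False}
        for p in generate_paraphrases(case["input"], n)
    ]
    return [c for c in candidates if human_verifies(c)]
```

With the review step stubbed to reject everything, the pipeline promotes nothing; swapping in an actual reviewer (or reviewer UI) is what turns silver candidates into gold.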
Sources and further reading
- Anthropic — Demystifying evals for AI agents
- Maxim — The Evolution of AI Quality: From Benchmarks to Simulation
- Confident AI — LLM Agent Evaluation Guide
- LangChain — agentevals