📊 Evaluate
Why evaluation matters
From model benchmarks to agent-level simulation.
Why evaluation is now a first-class pillar
Through 2024, most AI evaluation work happened at the model level — benchmarks like MMLU, GSM8K, and HumanEval. In 2025, the industry shifted decisively to agent-level and product-level evaluation. The reason: benchmark scores say little about whether your agent correctly triages a Tier-1 ticket, cites the right source, or refuses the right jailbreak.
Evaluation is where design meets reality. It is also where governance gets its evidence.
What makes agent evaluation hard
Three things distinguish agent eval from model eval:
- Outputs are trajectories, not just answers. An agent can reach the right answer via the wrong route — calling a high-risk tool it should have refused, or leaking data it should have masked. End-state evaluation alone cannot catch this.
- Ground truth is subjective for many tasks. Was the customer reply good? Was the research summary faithful? Reasonable humans disagree. You need rubrics, calibration, and often LLM-as-judge to scale.
- Production traffic keeps changing. An eval suite that passes today can drift into irrelevance as the user population, the product, or the world changes. Evaluation is a flywheel, not a launch checklist.
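The first point — scoring the route, not just the end state — can be sketched as a small trajectory check. This is a minimal illustration, not any framework's API: `ToolCall`, `Trajectory`, and the `HIGH_RISK_TOOLS` policy list are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical trajectory record — illustrative, not from a specific framework.
@dataclass
class ToolCall:
    name: str
    args: dict

@dataclass
class Trajectory:
    tool_calls: list          # ordered ToolCall objects the agent made
    final_answer: str         # the end-state output

# Assumed policy list: tools the agent should never invoke for this task tier.
HIGH_RISK_TOOLS = {"delete_account", "wire_transfer"}

def evaluate_trajectory(traj: Trajectory, expected_answer: str) -> dict:
    """Score both the end state and the route taken to reach it."""
    answer_ok = expected_answer.lower() in traj.final_answer.lower()
    violations = [c.name for c in traj.tool_calls if c.name in HIGH_RISK_TOOLS]
    return {
        "answer_correct": answer_ok,
        "policy_violations": violations,
        # A correct answer reached via a forbidden tool still fails.
        "passed": answer_ok and not violations,
    }
```

With this shape, an agent that produces the right reply but calls `wire_transfer` along the way fails the eval — exactly the case end-state scoring misses.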
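For the subjective-ground-truth point, an LLM-as-judge setup typically pairs a rubric prompt with structured verdict parsing. A minimal sketch under assumptions: the rubric axes, the JSON verdict schema, and the pass threshold are all hypothetical; the actual judge call is whatever LLM client you use and is omitted here.

```python
import json

# Assumed rubric: axes and 1-5 scale are illustrative, not a standard.
RUBRIC = (
    "Score the reply from 1 to 5 on each axis and return only JSON:\n"
    '- "faithfulness": every claim is supported by the cited source\n'
    '- "tone": the reply matches the support style guide\n'
    'Format: {"faithfulness": <int>, "tone": <int>, "rationale": "<str>"}'
)

def build_judge_prompt(reply: str, source: str) -> str:
    """Assemble the prompt you would send to your judge model."""
    return f"{RUBRIC}\n\nSOURCE:\n{source}\n\nREPLY:\n{reply}"

def parse_verdict(raw_judge_output: str, threshold: int = 4) -> dict:
    """Parse the judge's JSON and apply a pass threshold on the weakest axis."""
    verdict = json.loads(raw_judge_output)
    verdict["passed"] = min(verdict["faithfulness"], verdict["tone"]) >= threshold
    return verdict
```

Gating on the weakest axis (rather than an average) is one calibration choice among several; whichever you pick, spot-check judge verdicts against human labels before trusting them at scale.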
What this module covers
- Success metrics — what to measure.
- Test sets — how to build a golden dataset.
- LLM-as-judge — rubrics, bias, calibration.
- Red-teaming — adversarial testing, OWASP LLM Top 10.
- HITL review — humans in the review loop, at scale.
- Monitoring — post-deployment drift and regressions.
- The toolkit — what the working practitioner uses.
- Five tests — the minimum every agent must pass.