Scenarios Canvases Playgrounds My Creations Glossary Tutor

📊 Evaluate

Evaluate

Evals, test sets, LLM-as-judge, red-teaming, monitoring.

Why evaluation matters

From model benchmarks to agent-level simulation.

Success metrics

What to actually measure.

Test sets and golden datasets

Black-box, glass-box, and white-box evaluation.

Rubrics, calibration, and the bias traps.

Prompt injection, tool misuse, OWASP LLM Top 10.

Human-in-the-loop review

Sampling, inter-rater reliability, avoiding approval fatigue.

Post-deployment monitoring

Drift, regressions, and circuit breakers.

The practitioner toolkit

Braintrust, Langfuse, Arize, Weave, Inspect AI, more.

Five tests every agent must pass

Happy, Edge, Adversarial, Ambiguous, Handoff.

📊 EvaluateQuick quiz

0/5

Q1.What does trajectory (glass-box) eval catch that black-box eval misses?

Q2.A classic LLM-as-judge pitfall is…

Q3.What's the golden-vs-silver distinction?

Q4.Which is the strongest mitigation for prompt injection?

Q5.Approval fatigue is best addressed by…