📊 Evaluate
Evaluate
Evals, test sets, LLM-as-judge, red-teaming, monitoring.
Lesson 1
Why evaluation matters
From model benchmarks to agent-level simulation.
Open
Lesson 2
Success metrics
What to actually measure.
Open
Lesson 3
Test sets and golden datasets
Black-box, glass-box, and white-box evaluation.
Open
Lesson 4
LLM-as-judge
Rubrics, calibration, and the bias traps.
Open
Lesson 5
Red-teaming
Prompt injection, tool misuse, OWASP LLM Top 10.
Open
Lesson 6
Human-in-the-loop review
Sampling, inter-rater reliability, avoiding approval fatigue.
Open
Lesson 7
Post-deployment monitoring
Drift, regressions, and circuit breakers.
Open
Lesson 8
The practitioner toolkit
Braintrust, Langfuse, Arize, Weave, Inspect AI, more.
Open
Lesson 9
Five tests every agent must pass
Happy, Edge, Adversarial, Ambiguous, Handoff.
Open
📊 EvaluateQuick quiz
0/5
Q1.What does trajectory (glass-box) eval catch that black-box eval misses?
Q2.A classic LLM-as-judge pitfall is…
Q3.What's the golden-vs-silver distinction?
Q4.Which is the strongest mitigation for prompt injection?
Q5.Approval fatigue is best addressed by…