Human-in-the-loop review
Sampling, inter-rater reliability, avoiding approval fatigue.
Human review, at scale, without the rubber-stamping trap
LLM judges scale judgement, but they do not substitute for it. Every production agent needs a human-review loop. The challenge is keeping review meaningful as volume rises.
The pathology to avoid: approval fatigue. When humans have to approve thousands of indistinguishable agent actions, they stop reading carefully. Only about 17% of reviewers are still engaging meaningfully after 15+ back-to-back decisions. Your oversight becomes a legal fiction.
Sampling strategies
You cannot review every action. Pick a sampling strategy that fits the risk profile.
- Random sampling — easy baseline. Good for coverage; bad for rare high-risk cases.
- Risk-tiered sampling — sample high-risk actions at 100%, medium at 10%, low at 1%. This is the default for regulated industries.
- Disagreement-triggered — when the LLM-judge and a cheap rule-based check disagree, sample at 100%. Catches edge cases cheaply.
- Uncertainty-triggered — when the judge reports low confidence, sample at 100%.
- Anomaly-triggered — when the action pattern is statistically unusual (new tool combination, new user cohort), sample at 100%.
Most production setups combine two or three of these.
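A combined policy can be a single decision function. The sketch below merges risk-tiered, disagreement-triggered, and uncertainty-triggered sampling; the field names, the 0.6 confidence threshold, and the tier rates are illustrative assumptions, not a prescribed schema.

```python
import random

# Assumed base sampling rates per risk tier (adjust to your risk profile).
RISK_TIER_RATES = {"high": 1.0, "medium": 0.10, "low": 0.01}

def should_review(action, rng=random):
    """Return True if this agent action should go to a human reviewer."""
    # Escalation triggers always sample at 100%.
    if action["judge_verdict"] != action["rule_check_verdict"]:
        return True  # LLM-judge and rule-based check disagree
    if action["judge_confidence"] < 0.6:  # assumed low-confidence threshold
        return True
    # Otherwise fall back to the risk-tiered base rate.
    return rng.random() < RISK_TIER_RATES[action["risk_tier"]]
```

Anomaly triggers would slot in as one more early-return check before the base-rate fallback.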
Inter-rater reliability
When multiple humans review the same cases, you want them to agree with each other. Disagreement signals an ambiguous rubric.
- Track inter-rater agreement (Cohen's κ or similar) on a rotating sample.
- When agreement falls below a threshold, pause and re-clarify the rubric.
- Rotate reviewers so a case gets two independent eyes about 5–10% of the time.
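Cohen's κ is cheap to compute from the double-reviewed sample. A minimal sketch, assuming two reviewers label the same cases:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labelling the same cases."""
    n = len(labels_a)
    # Observed agreement: fraction of cases where both reviewers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each reviewer's marginal label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single label throughout
    return (observed - expected) / (1 - expected)
```

A common rule of thumb is to treat κ below about 0.6 as a cue to re-clarify the rubric, though the right threshold depends on how subjective your labels are.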
Avoiding approval fatigue
- Risk-tier the approvals. Humans approve only the decisions that need judgement, not every mechanical action.
- Batch and dashboard low-risk actions. A reviewer sees "the agent took these 500 low-risk actions this morning, any flags?" rather than approving each one.
- Rotate reviewers. Same reviewer on the same queue for weeks → fatigue. Mix the queue or mix the reviewer.
- Monitor the monitors. Track per-reviewer approval rates. Sudden 99% approval is a signal to investigate, not a signal of quality.
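"Monitor the monitors" can be as simple as a per-reviewer approval-rate scan. A sketch, with the 98% threshold and 50-decision minimum as assumed tuning parameters:

```python
from collections import defaultdict

def flag_rubber_stampers(decisions, threshold=0.98, min_decisions=50):
    """Flag reviewers whose approval rate looks like rubber-stamping.

    decisions: iterable of (reviewer_id, approved: bool) pairs.
    """
    approvals = defaultdict(int)
    totals = defaultdict(int)
    for reviewer, approved in decisions:
        totals[reviewer] += 1
        approvals[reviewer] += approved
    # Only flag reviewers with enough volume for the rate to be meaningful.
    return [
        r for r in totals
        if totals[r] >= min_decisions
        and approvals[r] / totals[r] >= threshold
    ]
```

A flagged reviewer is a prompt to investigate (queue mix, fatigue, rubric drift), not an accusation.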
The data flywheel
Human-review disagreements, corrections, and escalations are the richest signal you have. Pipe them back into:
- Your golden dataset (promote corrected cases).
- Your system prompt (are we consistently catching the same failure class?).
- Your tool design (did a wrong tool fire? Should it have required approval?).
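The first leg of the flywheel, promoting corrected cases, can be sketched as a small pipeline step. The record fields here are hypothetical; the point is that whenever the human label disagrees with the judge, the human label wins and the case becomes a golden example:

```python
def promote_corrections(review_log, golden_dataset):
    """Fold human corrections from a review log into the golden dataset."""
    for entry in review_log:
        if entry["human_label"] != entry["judge_label"]:
            # Disagreement: the human correction becomes a golden example.
            golden_dataset.append({
                "input": entry["input"],
                "expected": entry["human_label"],
                "source": "human_correction",
            })
    return golden_dataset
```

Running this on every review batch means the eval set hardens exactly where the judge was wrong.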
Teams that do this compound quality month over month. Teams that skip it stall after the first deployment.