Human-in-the-loop review
Sampling, inter-rater reliability, avoiding approval fatigue.
Human review, at scale, without the rubber-stamping trap
LLM judges scale judgement, but they do not substitute for it. Every production agent needs a human-review loop. The challenge is keeping review meaningful as volume rises.
The pathology to avoid: approval fatigue. When humans have to approve thousands of indistinguishable agent actions, they stop reading carefully. Only about 17% of reviewers are still engaging meaningfully after 15+ back-to-back decisions. Your oversight becomes a legal fiction.
Sampling strategies
You cannot review every action. Pick a sampling strategy that fits the risk profile.
- Random sampling — easy baseline. Good for coverage; bad for rare high-risk cases.
- Risk-tiered sampling — sample high-risk actions at 100%, medium at 10%, low at 1%. This is the default for regulated industries.
- Disagreement-triggered — when the LLM-judge and a cheap rule-based check disagree, sample at 100%. Catches edge cases cheaply.
- Uncertainty-triggered — when the judge reports low confidence, sample at 100%.
- Anomaly-triggered — when the action pattern is statistically unusual (new tool combination, new user cohort), sample at 100%.
Most production setups combine two or three of these.
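A combined policy can be a single decision function. The sketch below merges risk-tiered, disagreement-triggered, and uncertainty-triggered sampling; the field names, the 0.6 confidence threshold, and the tier rates are illustrative assumptions, not a prescribed schema.

```python
import random

# Assumed base sampling rates per risk tier (adjust to your risk profile).
RISK_TIER_RATES = {"high": 1.0, "medium": 0.10, "low": 0.01}

def should_review(action, rng=random):
    """Return True if this agent action should go to a human reviewer."""
    # Escalation triggers always sample at 100%.
    if action["judge_verdict"] != action["rule_check_verdict"]:
        return True  # LLM-judge and rule-based check disagree
    if action["judge_confidence"] < 0.6:  # assumed low-confidence threshold
        return True
    # Otherwise fall back to the risk-tiered base rate.
    return rng.random() < RISK_TIER_RATES[action["risk_tier"]]
```

Anomaly triggers would slot in as one more early-return check before the base-rate fallback.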
Inter-rater reliability
When multiple humans review the same cases, you want them to agree with each other. Disagreement signals an ambiguous rubric.
- Track inter-rater agreement (Cohen's κ or similar) on a rotating sample.
- When agreement falls below a threshold, pause and re-clarify the rubric.
- Rotate reviewers so a case gets two independent eyes about 5–10% of the time.
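Cohen's κ is cheap to compute from the double-reviewed sample. A minimal sketch, assuming two reviewers label the same cases:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labelling the same cases."""
    n = len(labels_a)
    # Observed agreement: fraction of cases where both reviewers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each reviewer's marginal label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single label throughout
    return (observed - expected) / (1 - expected)
```

A common rule of thumb is to treat κ below about 0.6 as a cue to re-clarify the rubric, though the right threshold depends on how subjective your labels are.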
Avoiding approval fatigue
- Risk-tier the approvals. Humans approve only the decisions that need judgement, not every mechanical action.
- Batch and dashboard low-risk actions. A reviewer sees "the agent took these 500 low-risk actions this morning, any flags?" rather than approving each one.
- Rotate reviewers. Same reviewer on the same queue for weeks → fatigue. Mix the queue or mix the reviewer.
- Monitor the monitors. Track per-reviewer approval rates. Sudden 99% approval is a signal to investigate, not a signal of quality.
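"Monitor the monitors" can be as simple as a per-reviewer approval-rate scan. A sketch, with the 98% threshold and 50-decision minimum as assumed tuning parameters:

```python
from collections import defaultdict

def flag_rubber_stampers(decisions, threshold=0.98, min_decisions=50):
    """Flag reviewers whose approval rate looks like rubber-stamping.

    decisions: iterable of (reviewer_id, approved: bool) pairs.
    """
    approvals = defaultdict(int)
    totals = defaultdict(int)
    for reviewer, approved in decisions:
        totals[reviewer] += 1
        approvals[reviewer] += approved
    # Only flag reviewers with enough volume for the rate to be meaningful.
    return [
        r for r in totals
        if totals[r] >= min_decisions
        and approvals[r] / totals[r] >= threshold
    ]
```

A flagged reviewer is a prompt to investigate (queue mix, fatigue, rubric drift), not an accusation.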
The data flywheel
Human-review disagreements, corrections, and escalations are the richest signal you have. Pipe them back into:
- Your golden dataset (promote corrected cases).
- Your system prompt (are we consistently catching the same failure class?).
- Your tool design (did a wrong tool fire? Should it have required approval?).
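The first leg of the flywheel, promoting corrected cases, can be sketched as a small pipeline step. The record fields here are hypothetical; the point is that whenever the human label disagrees with the judge, the human label wins and the case becomes a golden example:

```python
def promote_corrections(review_log, golden_dataset):
    """Fold human corrections from a review log into the golden dataset."""
    for entry in review_log:
        if entry["human_label"] != entry["judge_label"]:
            # Disagreement: the human correction becomes a golden example.
            golden_dataset.append({
                "input": entry["input"],
                "expected": entry["human_label"],
                "source": "human_correction",
            })
    return golden_dataset
```

Running this on every review batch means the eval set hardens exactly where the judge was wrong.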
Teams that do this compound quality month over month. Teams that skip it stall after the first deployment.