📊 Evaluate
LLM-as-judge
Rubrics, calibration, and the bias traps.
Scaling judgement
For most agent outputs, deterministic rules (exact match, regex, JSON schema validation) do not capture what "good" looks like. You need judgement at scale — and the most practical way to get it is LLM-as-judge.
An LLM-as-judge grader takes the input, the agent's output, and optionally ground truth or a rubric, and produces a score plus reasoning. Calibrated well, it reaches 80–90% agreement with human experts on subjective criteria, at a fraction of the cost.
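In code, the shape of such a grader is small: format a prompt, call the judge model, parse the verdict. A minimal sketch — `call_model` is a hypothetical stub here; in practice it wraps your provider's chat API:

```python
import json

def call_model(prompt: str) -> str:
    """Stub for a real chat-completion call (hypothetical; swap in your provider's API)."""
    return '{"score": 4, "reasoning": "Cites a source and answers the question."}'

def judge(task_input: str, output: str, rubric: str) -> dict:
    """Build a grading prompt and parse the judge's JSON verdict."""
    prompt = (
        "You are an impartial grader.\n"
        f"Rubric:\n{rubric}\n\n"
        f"Input:\n{task_input}\n\n"
        f"Output to grade:\n{output}\n\n"
        'Return JSON: {"score": 1-5, "reasoning": "..."}'
    )
    return json.loads(call_model(prompt))

verdict = judge("How do I reset my password?",
                "Go to Settings > Security > Reset password.",
                "Accuracy; helpfulness; under 200 words")
```

Parsing the verdict as JSON (rather than free text) is what makes the judge usable in CI: a malformed response fails loudly instead of silently polluting scores.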
Rubric design
A good rubric is:
- Specific. Not "Is this a good answer?" but "Does the answer (a) cite a source, (b) answer the user's question, (c) stay under 200 words?"
- Comparable. Each criterion is scored on the same scale — typically 1–5 or pass/fail.
- Calibrated. Before using it at scale, grade 20 cases by hand, then have the LLM grade the same 20. Agreement should be above 80%; if not, the rubric is ambiguous.
For example, a judge prompt for a support agent:

```text
You are grading a customer-support reply against these four criteria.
For each, score 1 (fail) to 5 (excellent), and briefly justify.

1. Accuracy — are the facts correct?
2. Helpfulness — does it address the user's actual question?
3. Tone — is it professional, empathetic, on-brand?
4. Scope — did it avoid promising things outside the agent's authority?

Return JSON: { "scores": {...}, "reasoning": "..." }
```
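The calibration step reduces to a simple agreement calculation over the hand-graded set. A sketch with pass/fail grades — the grades below are made up for illustration:

```python
def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of cases where judge and human gave the same grade."""
    assert len(human) == len(judge), "grade lists must align case-by-case"
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)

# Hypothetical grades for 10 hand-labelled cases (in practice, use 20+).
human = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]

rate = agreement_rate(human, judge)  # 0.8 here; below 0.8 means the rubric is ambiguous
```

For graded (1–5) scales, exact match is often too strict; counting within-one-point agreement, or using a chance-corrected statistic such as Cohen's kappa, gives a fairer picture.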
Pairwise vs absolute scoring
Two patterns:
- Absolute — "grade this output against this rubric". Simple, but prone to grade inflation and to scores drifting between runs.
- Pairwise — "which of these two outputs better satisfies the rubric?". More reliable for subjective criteria; roughly doubles the cost, since each case needs two candidate outputs and the judge reads both.
For launch decisions, pairwise is worth the extra spend. For day-to-day CI, absolute is usually fine.
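A pairwise judge follows the same call-and-parse shape as an absolute one, but asks for a choice instead of a score. A sketch, again with a hypothetical stubbed `call_model`:

```python
def call_model(prompt: str) -> str:
    """Stub for a real chat-completion call (hypothetical)."""
    return "A"

def pairwise_judge(task_input: str, output_a: str, output_b: str, rubric: str) -> str:
    """Ask which of two outputs better satisfies the rubric; expect 'A' or 'B'."""
    prompt = (
        f"Rubric:\n{rubric}\n\n"
        f"Input:\n{task_input}\n\n"
        f"Output A:\n{output_a}\n\n"
        f"Output B:\n{output_b}\n\n"
        "Which output better satisfies the rubric? Answer with exactly one letter: A or B."
    )
    answer = call_model(prompt).strip().upper()
    if answer not in ("A", "B"):
        raise ValueError(f"unparseable verdict: {answer!r}")
    return answer

winner = pairwise_judge("Cancel my order, please.",
                        "Done — your order is cancelled; refund in 3-5 days.",
                        "Sorry, I can't help with that.",
                        "Helpfulness; scope")
```

Constraining the answer to a single letter, and rejecting anything else, keeps the comparison machine-checkable.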
The bias traps
LLM judges have known systematic biases. Test for these explicitly before trusting a judge:
- Position bias — the judge prefers the first (or last) option in a pairwise comparison. Mitigate: randomise order; run each pair twice with swapped positions.
- Length bias — the judge prefers longer answers, even when shorter is better. Mitigate: explicit rubric criterion for conciseness; length-controlled pairs.
- Confidence bias — the judge prefers outputs that sound confident, even when a hedged answer is more accurate. Mitigate: ask the judge to reason before scoring; calibrate against humans.
- Self-preference — a judge tends to prefer outputs from the same model family. Mitigate: use a different model provider as judge than as agent.
What a good LLM-judge pipeline looks like
1. Run the agent on the test set.
2. Run the judge on each output, with the rubric.
3. Sample 5–10% of judged outputs for human review each week.
4. Re-calibrate the rubric when judge-vs-human agreement drops.
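The sampling and re-calibration loop amounts to a periodic check. A sketch with made-up data; the `judge_grade`/`human_grade` field names are illustrative, not a standard schema:

```python
import random

def weekly_review(outputs: list[dict], sample_frac: float = 0.1,
                  threshold: float = 0.8, seed: int = 0) -> dict:
    """Sample judged outputs for human review and flag rubric drift.

    Each output dict is assumed to carry a 'judge_grade' and, once a
    human has reviewed it, a 'human_grade'.
    """
    rng = random.Random(seed)
    k = max(1, int(len(outputs) * sample_frac))
    sample = rng.sample(outputs, k)
    reviewed = [o for o in sample if "human_grade" in o]
    if not reviewed:
        return {"agreement": None, "recalibrate": False}
    agreement = sum(o["judge_grade"] == o["human_grade"]
                    for o in reviewed) / len(reviewed)
    return {"agreement": agreement, "recalibrate": agreement < threshold}

# Made-up week of data: 3 of 4 reviewed cases agree -> agreement 0.75, below 0.8.
week = [{"judge_grade": "pass", "human_grade": "pass"},
        {"judge_grade": "pass", "human_grade": "fail"},
        {"judge_grade": "fail", "human_grade": "fail"},
        {"judge_grade": "pass", "human_grade": "pass"}]
report = weekly_review(week, sample_frac=1.0)
```

Seeding the sampler keeps the weekly draw reproducible, so a flagged drop can be re-examined against the exact same cases.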
Sources
- Evidently AI — LLM-as-a-judge: complete guide
- Confident AI — LLM Agent Evaluation