Success metrics
What to actually measure.
Good success metrics are the ones you could defend to a regulator, a board, and the user. They are rarely a single number. For a production agent, expect to track a small cluster covering quality, safety, cost, and usefulness.
The core cluster
| Metric | What it measures | Why it matters |
|---|---|---|
| Task completion rate | % of runs where the agent produced the intended outcome. | The headline quality number. Defined per use case. |
| Tool-call accuracy | % of tool calls with the expected tool, arguments, and ordering. | Catches right-answer-wrong-route failures. |
| Groundedness | % of factual claims traceable to a cited source. | Catches hallucination. Critical for regulated domains. |
| Cost per run | Total AI Gateway spend per successful task. | Economic viability and the knee of your model-routing curve. |
| Latency (p50, p95) | Wall-clock time, including tool calls. | User experience and whether the agent is usable in a real workflow. |
| User satisfaction | Explicit thumbs-up/down, CSAT, and escalation rates. | The ground truth nothing else captures. |
| Refusal calibration | % of out-of-scope requests safely refused; % of in-scope refused in error. | Governance signal — is the envelope holding? |
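Most of the core cluster reduces to ratios over per-run logs. Here is a minimal sketch of that computation; the `RunRecord` shape and every field name are assumptions for illustration, not a real schema.

```typescript
// Assumed per-run log record; field names are invented for illustration.
interface RunRecord {
  succeeded: boolean;       // task completion, defined per use case
  toolCallsCorrect: number; // calls with expected tool, arguments, ordering
  toolCallsTotal: number;
  groundedClaims: number;   // factual claims traceable to a cited source
  totalClaims: number;
  costUsd: number;          // gateway spend for this run
  latencyMs: number;        // wall clock, including tool calls
}

// Nearest-rank percentile over a pre-sorted array.
function percentile(sorted: number[], p: number): number {
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, idx))];
}

function coreCluster(runs: RunRecord[]) {
  const ratio = (num: number, den: number) => (den === 0 ? 0 : num / den);
  const successes = runs.filter((r) => r.succeeded);
  const latencies = runs.map((r) => r.latencyMs).sort((a, b) => a - b);
  return {
    taskCompletionRate: ratio(successes.length, runs.length),
    toolCallAccuracy: ratio(
      runs.reduce((s, r) => s + r.toolCallsCorrect, 0),
      runs.reduce((s, r) => s + r.toolCallsTotal, 0),
    ),
    groundedness: ratio(
      runs.reduce((s, r) => s + r.groundedClaims, 0),
      runs.reduce((s, r) => s + r.totalClaims, 0),
    ),
    // Cost per *successful* task, per the table above, not per run.
    costPerSuccessfulRun: ratio(
      runs.reduce((s, r) => s + r.costUsd, 0),
      successes.length,
    ),
    latencyP50Ms: percentile(latencies, 50),
    latencyP95Ms: percentile(latencies, 95),
  };
}
```

Note the denominator choice on cost: dividing total spend by successful runs (rather than all runs) makes retries and failures show up as higher unit cost instead of disappearing.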
Per-task metrics
Beyond the core cluster, each task needs its own success signal. For a support-triage agent, this might be correct-tier classification and resolution without escalation. For a research agent, citation completeness and fact-check pass rate.
Pick two or three per-task metrics that matter to the product. Track them per deploy, and wire them into the monitoring surface.
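One way to keep per-task metrics small and wired in is a registry of pure functions over run records, evaluated on each deploy. This is a sketch only; the `TriageRun` fields and metric names are hypothetical, using the support-triage example above.

```typescript
// Hypothetical run record for the support-triage example.
interface TriageRun {
  predictedTier: string;
  expectedTier: string;
  escalated: boolean;
}

// Two or three per-task metrics, each a pure function over run records.
const triageMetrics = {
  correctTierRate: (runs: TriageRun[]) =>
    runs.filter((r) => r.predictedTier === r.expectedTier).length / runs.length,
  resolvedWithoutEscalation: (runs: TriageRun[]) =>
    runs.filter((r) => !r.escalated).length / runs.length,
};

// Run on each deploy; push the resulting name -> value map to monitoring.
function evaluate(runs: TriageRun[]): Record<string, number> {
  return Object.fromEntries(
    Object.entries(triageMetrics).map(([name, fn]) => [name, fn(runs)]),
  );
}
```

A research agent would get its own registry (citation completeness, fact-check pass rate) with the same shape, so the monitoring surface stays uniform across tasks.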
Leading vs lagging indicators
- Leading — per-run metrics visible immediately: tool-call accuracy, groundedness, latency, cost.
- Lagging — product-level metrics visible over weeks: CSAT, escalation rates, deflection rates.
Leading indicators are where you iterate. Lagging indicators are where you justify the agent to the business. A healthy eval culture tracks both.
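If both horizons live in one metrics store, a simple tag keeps the iteration dashboard and the business dashboard separate. The structure below is an assumption; the metric names just mirror the lists above.

```typescript
type Horizon = "leading" | "lagging";

// Assumed tagging of each tracked metric by horizon.
const metricHorizons: Record<string, Horizon> = {
  toolCallAccuracy: "leading",
  groundedness: "leading",
  latencyP95Ms: "leading",
  costPerSuccessfulRun: "leading",
  csat: "lagging",
  escalationRate: "lagging",
  deflectionRate: "lagging",
};

// Select the metric names for one dashboard.
const byHorizon = (h: Horizon): string[] =>
  Object.keys(metricHorizons).filter((name) => metricHorizons[name] === h);
```

The leading set feeds per-run alerting and eval gates; the lagging set feeds the weekly product review.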
What not to measure (alone)
Never ship an agent with only a task-completion number. It hides tool misuse, quiet hallucination, and refusal miscalibration. Any of those will bite you in production, and the blast radius depends on how long it takes you to notice.