
Post-deployment monitoring

Drift, regressions, and circuit breakers.

Evaluation does not stop at launch

Pre-launch evaluation establishes that the agent can behave correctly. Post-launch monitoring establishes that it still does. The two problems are different, and the second is harder.

Things change. User behaviour drifts. The upstream corpus drifts. The model updates. The agent's own long-term memory drifts. Each of these can quietly move the agent out of the envelope you shipped with.

What to monitor

Five families of signal, all measured live in production:

Input drift

The distribution of user inputs changes over time. New topics, new languages, new query shapes. Measure: a sliding window of input embeddings vs the training baseline. Alert when the divergence crosses a threshold.
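One way to make this concrete is to compare the centroid of a sliding window of live input embeddings against the baseline centroid. A minimal sketch, assuming embeddings arrive as plain float vectors and using a hypothetical threshold of 0.2; the threshold and the distance choice are illustrative, not recommendations:

```python
import math

def mean_vec(embs):
    # element-wise mean of a list of equal-length embedding vectors
    n = len(embs)
    return [sum(v[i] for v in embs) / n for i in range(len(embs[0]))]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def input_drift_alert(baseline_embs, window_embs, threshold=0.2):
    # alert when the live window's centroid has drifted away
    # from the baseline centroid by more than the threshold
    return cosine_distance(mean_vec(baseline_embs), mean_vec(window_embs)) > threshold
```

Centroid distance is the cheapest divergence signal; richer tests (MMD, population stability index) catch shape changes that leave the mean in place.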

Output drift

The distribution of agent outputs changes. Longer replies, more refusals, new phrases. Measure: key output statistics (length, tool-call frequency, refusal rate) per day. Alert on step changes.
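A step change in any per-day statistic can be flagged with a simple z-score against trailing history. A minimal sketch, assuming one aggregated value per day; the z-threshold of 3.0 is an illustrative default:

```python
from statistics import mean, stdev

def step_change(daily_values, z_threshold=3.0):
    # flag today's value if it sits more than z_threshold standard
    # deviations from the trailing history
    history, today = daily_values[:-1], daily_values[-1]
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold
```

Run the same check independently for each statistic (reply length, tool-call frequency, refusal rate); a step in any one of them is worth a look even if the others are flat.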

Tool-call regressions

Tool usage patterns drift. A tool's success rate drops; a tool gets called less often; a new tool combination appears. Measure: tool-call mix and success rate per day. Alert when either moves.
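The success-rate half of this check can be sketched as a per-tool comparison of today's counts against a baseline window. Assuming counts are kept as `(successes, calls)` pairs per tool; the 10-point drop threshold is illustrative:

```python
def tool_regressions(baseline, today, drop_threshold=0.10):
    # baseline/today: {tool_name: (successes, calls)}
    # return the tools whose success rate dropped by more than drop_threshold
    alerts = []
    for tool, (ok, total) in today.items():
        if tool not in baseline or total == 0:
            continue
        b_ok, b_total = baseline[tool]
        if b_total == 0:
            continue
        if (b_ok / b_total) - (ok / total) > drop_threshold:
            alerts.append(tool)
    return alerts
```

The mix half of the check is the same shape: compare each tool's share of total calls today against its baseline share, and alert on the delta.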

Quality regressions

A silver subset of production traffic, graded each day by the same LLM-judge rubric. Plot the aggregate score over time. Alert when it drops.
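The alerting step can be as simple as comparing today's mean judge score against a trailing window. A minimal sketch, assuming one aggregate score per day on a 0–1 scale; the 7-day window and 0.05 drop are illustrative defaults:

```python
def quality_alert(daily_scores, window=7, drop=0.05):
    # compare today's mean judge score against the trailing-window mean
    *history, today = daily_scores[-(window + 1):]
    baseline = sum(history) / len(history)
    return baseline - today > drop
```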

Cost and latency

Per-run p50/p95 cost and latency. Cost creep is usually a silent sign the agent is falling back to longer reasoning or retrying more often.
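p50/p95 over a day's per-run measurements can be computed with a nearest-rank percentile; a minimal sketch, with no interpolation (production metrics stores usually interpolate, which shifts values slightly):

```python
import math

def percentile(values, p):
    # nearest-rank percentile: smallest value with at least p% of
    # the sample at or below it
    s = sorted(values)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]
```

Tracking p95 alongside p50 matters here: retry loops and longer reasoning fall almost entirely in the tail, so p50 can stay flat while p95 cost doubles.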

Circuit breakers

When something goes wrong, you want a hard stop, not a debate. Circuit breakers are automatic rules that halt the agent when:

  • Per-user spend exceeds a cap.
  • Per-session tool calls exceed a cap.
  • Error rate on a specific tool crosses a threshold.
  • A canary query starts failing (a fixed input whose correct answer never changes).

Kill-switches must live outside the agent's runtime. If the agent can modify its own kill switch, it does not have one.

Drift detection patterns

  • Canaries — a small set of pinned inputs with known-good outputs. Run them every hour. The cheapest smoke test you will build.
  • Golden regressions — run the full golden dataset once per deploy, and once per day against the current model snapshot. Flag any delta.
  • Shadow mode — run a new system prompt, model, or tool in parallel with the current one, without exposing the new output. Compare distributions for a week before switching.
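The canary pattern above reduces to a few lines once the agent is behind a callable. A minimal sketch, assuming exact-match answers and a hypothetical `call_agent` function standing in for the real invocation:

```python
def run_canaries(canaries, call_agent):
    # canaries: {input_text: expected_output}
    # returns the canary inputs whose answers no longer match
    return [q for q, expected in canaries.items() if call_agent(q) != expected]
```

Wire the returned failures into the circuit-breaker rules above: a non-empty list from an hourly run is a halt condition, not a dashboard line.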

The link back to REMIT-M

Everything on this page is the Monitoring pillar (M) of REMIT made concrete. If you cannot reconstruct what the agent did and why within 24 hours, you do not have governance — you have hope.

Sources and further reading

  • See the toolkit page for the observability platforms that implement these patterns in production.
  • The REMIT Monitoring lesson covers the governance frame these signals serve.