The practitioner toolkit

Braintrust, Langfuse, Arize, Weave, Inspect AI, and more.

You do not need to build evaluation infrastructure from scratch. The ecosystem is mature enough that most teams should pick one hosted platform for observability and evals, pick an open-source framework for structured evaluation scripts, and write a few internal scripts for the domain-specific parts.

Hosted platforms (evals + observability)

Platform	Where it fits
Braintrust	Strong evals-first platform. Opinionated, fast, good TypeScript SDK. Popular with AI-native teams.
Langfuse	Open-source, self-hostable observability + evals. Strong trace visualisation.
Arize / Phoenix	OSS Phoenix for tracing; Arize for the hosted offering. Strong on drift monitoring.
Weights & Biases Weave	Integrates naturally if you already use W&B for ML experiments.

Pick one and commit. Running two in parallel tends to mean you use neither well.

Open-source eval frameworks

Framework	Use when
UK AISI Inspect	You need rigorous safety evaluations. This is the framework the UK AI Security Institute (formerly the AI Safety Institute) uses for its own evaluations.
OpenAI Evals	You want a lightweight declarative eval harness.
LangChain agentevals	You specifically want trajectory evaluators for multi-step agents.
Confident AI DeepEval	You want batteries-included metrics with pytest-style ergonomics.
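
To make "trajectory evaluator" concrete: the idea is to compare the sequence of tool calls an agent actually made against a reference sequence. The sketch below is plain Python and does not use the agentevals API; the `ToolCall` shape and both function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One step in an agent trajectory (hypothetical shape, not a library type)."""
    name: str


def trajectory_names(trajectory):
    """Reduce a trajectory to the ordered list of tool names it called."""
    return [step.name for step in trajectory]


def strict_match(actual, expected):
    """True only if the agent called exactly the same tools in the same order."""
    return trajectory_names(actual) == trajectory_names(expected)


def subsequence_match(actual, expected):
    """True if the expected calls appear in order, with extra steps allowed between them."""
    remaining = iter(trajectory_names(actual))
    # `name in remaining` consumes the iterator up to the first match,
    # so each expected step must occur after the previous one.
    return all(name in remaining for name in trajectory_names(expected))


expected = [ToolCall("search"), ToolCall("summarise")]
actual = [ToolCall("search"), ToolCall("fetch_page"), ToolCall("summarise")]

print(strict_match(actual, expected))       # False: an extra step was taken
print(subsequence_match(actual, expected))  # True: required steps happened in order
```

Strict matching suits workflows where any detour is a bug; the subsequence check is more forgiving when the agent is allowed to explore, as long as it hits the required steps in order.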

On the Vercel platform

Vercel Agent — AI-powered code review and production investigation. Useful for PR review and for triaging incidents in your own product after launch.
AI Gateway — unified observability across providers. Traces, latencies, and costs in one place; swap models without code changes.

What "good" looks like

Every agent run traced end-to-end, with tool calls, reasoning, and costs attached.
A golden dataset in the platform, re-run on every deploy.
A weekly review ritual where a human reads a sampled trace and writes a one-line note.
Canaries running every hour; alerts wired to a Slack channel the on-call watches.
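
As one illustration of the golden-dataset item above, here is a minimal sketch of a deploy gate, assuming the agent can be called as a plain function from prompt to text. `GoldenCase`, `run_golden_set`, `gate`, `fake_agent`, the substring check, and the 0.9 threshold are all hypothetical; a hosted platform would normally store the dataset and supply richer scorers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    case_id: str
    prompt: str
    expected_substring: str  # deliberately simple check; real scorers are usually richer


def run_golden_set(agent: Callable[[str], str], cases: list[GoldenCase]) -> dict:
    """Run every golden case through the agent and summarise the results."""
    failures = []
    for case in cases:
        output = agent(case.prompt)
        if case.expected_substring.lower() not in output.lower():
            failures.append(case.case_id)
    passed = len(cases) - len(failures)
    # An empty dataset counts as a failing run rather than a vacuous pass.
    pass_rate = passed / len(cases) if cases else 0.0
    return {"pass_rate": pass_rate, "failures": failures}


def gate(report: dict, min_pass_rate: float = 0.9) -> int:
    """Return a process exit code: 0 lets the deploy continue, 1 blocks it."""
    return 0 if report["pass_rate"] >= min_pass_rate else 1


def fake_agent(prompt: str) -> str:
    """Stand-in for the real agent call (assumption for this example)."""
    answers = {"capital of France?": "The capital is Paris."}
    return answers.get(prompt, "I don't know.")


GOLDEN = [
    GoldenCase("geo-1", "capital of France?", "Paris"),
    GoldenCase("geo-2", "capital of Japan?", "Tokyo"),
]

report = run_golden_set(fake_agent, GOLDEN)
print(report)        # {'pass_rate': 0.5, 'failures': ['geo-2']}
print(gate(report))  # 1, so this deploy would be blocked
```

In a CI step, passing the result of `gate(report)` to `sys.exit` would fail the pipeline whenever the pass rate drops below the threshold. The same runner, pointed at a handful of cases on an hourly schedule, could serve as a canary.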

None of this is glamorous, but it is the work that lets teams ship agents and keep shipping them.
