📊 Evaluate
The practitioner toolkit
Braintrust, Langfuse, Arize, Weave, Inspect AI, and more.
You do not need to build evaluation infrastructure from scratch. The ecosystem is mature enough that most teams should pick one platform for observability and evals, pick one open-source framework for structured evaluation scripts, and write a few internal scripts for the checks that are specific to their domain.
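To make "internal scripts for domain-specific checks" concrete, here is a minimal sketch of one. Everything in it is hypothetical: the `ORD-` order ID format, the forbidden phrases, and the `check_support_reply` name are invented for illustration and stand in for whatever rules your own product has.

```python
import re
from dataclasses import dataclass

# Hypothetical domain rules for a support agent; replace with your own.
ORDER_ID = re.compile(r"\bORD-\d{6}\b")
FORBIDDEN_PHRASES = ("guaranteed refund", "legal advice")


@dataclass
class CheckResult:
    passed: bool
    reasons: list[str]


def check_support_reply(reply: str, expected_order_id: str) -> CheckResult:
    """Apply rules that a generic eval metric would not know about."""
    reasons = []

    # Rule 1: the reply must reference the order the customer asked about.
    if expected_order_id not in ORDER_ID.findall(reply):
        reasons.append(f"reply does not mention order {expected_order_id}")

    # Rule 2: the reply must not make commitments the business forbids.
    lowered = reply.lower()
    for phrase in FORBIDDEN_PHRASES:
        if phrase in lowered:
            reasons.append(f"reply contains forbidden phrase: {phrase!r}")

    return CheckResult(passed=not reasons, reasons=reasons)
```

A check like this is cheap to run on every trace, and its result can usually be attached to the trace as a custom score in whichever platform you choose.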
Hosted platforms (evals + observability)
| Platform | Where it fits |
|---|---|
| Braintrust | Strong evals-first platform. Opinionated, fast, good TypeScript SDK. Popular with AI-native teams. |
| Langfuse | Open-source, self-hostable observability + evals. Strong trace visualisation. |
| Arize / Phoenix | OSS Phoenix for tracing; Arize for the hosted offering. Strong on drift monitoring. |
| Weights & Biases Weave | Integrates naturally if you already use W&B for ML experiments. |
Pick one and commit. Running two in parallel tends to mean you use neither well.
Open-source eval frameworks
| Framework | Use when |
|---|---|
| UK AISI Inspect | You need rigorous safety evaluations. It is the framework the UK AI Security Institute (formerly the AI Safety Institute) uses for its own evaluations. |
| OpenAI Evals | You want a lightweight declarative eval harness. |
| LangChain agentevals | You specifically want trajectory evaluators for multi-step agents. |
| Confident AI DeepEval | You want batteries-included metrics with pytest-style ergonomics. |
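For readers new to trajectory evaluation, the idea is to score the sequence of tool calls an agent made, not only its final answer. The sketch below illustrates two common matching modes using only the standard library. It is not the agentevals API: `ToolCall`, `call`, `strict_match`, and `tool_order_match` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: tuple  # sorted (key, value) pairs, so calls compare by value


def call(name: str, **args) -> ToolCall:
    """Convenience constructor that normalises argument order."""
    return ToolCall(name, tuple(sorted(args.items())))


def strict_match(actual, reference) -> bool:
    """Pass only if the run made the same calls, with the same args, in the same order."""
    return list(actual) == list(reference)


def tool_order_match(actual, reference) -> bool:
    """Pass if the reference tool names appear in order within the run.

    Extra calls in between are tolerated and arguments are ignored.
    """
    remaining = iter(c.name for c in actual)
    # `name in remaining` consumes the iterator up to the match,
    # which makes this an in-order subsequence check.
    return all(name in remaining for name in (c.name for c in reference))


reference = [call("search_orders", customer="c1"), call("issue_refund", order="o1")]
run = [
    call("search_orders", customer="c1"),
    call("get_policy"),  # extra step not in the reference trajectory
    call("issue_refund", order="o1"),
]
```

With these inputs, `strict_match(run, reference)` fails because of the extra `get_policy` call, while `tool_order_match(run, reference)` passes. Which mode is appropriate depends on whether extra steps are harmless or a sign of a confused agent.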
On the Vercel platform
- Vercel Agent — AI-powered code review and production investigation. Useful for PR review and for triaging incidents in your own product after launch.
- AI Gateway — unified observability across providers. Traces, latencies, and costs in one place; swap models without code changes.
What "good" looks like
- Every agent run traced end-to-end, with tool calls, reasoning, and costs attached.
- A golden dataset in the platform, re-run on every deploy.
- A weekly review ritual where a human reads a sampled trace and writes a one-line note.
- Canaries running every hour; alerts wired to a Slack channel the on-call watches.
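The "golden dataset re-run on every deploy" item can be sketched as a small gate script. This is an illustration under stated assumptions, not any platform's SDK: `GOLDEN_CASES`, `fake_agent`, the substring scorer, and the 90% threshold are all hypothetical, and in practice the cases would come from your eval platform and the agent call would hit your real system.

```python
from typing import Callable

# Hypothetical golden cases; normally pulled from the eval platform.
GOLDEN_CASES = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]


def run_golden_set(agent: Callable[[str], str], cases, min_pass_rate: float = 0.9):
    """Run every golden case and return (pass_rate, failures, ok)."""
    if not cases:
        raise ValueError("golden dataset is empty")
    failures = []
    for case in cases:
        output = agent(case["input"])
        # Substring match is a deliberately crude scorer; swap in your own.
        if case["expected"].lower() not in output.lower():
            failures.append({"input": case["input"], "output": output})
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures, pass_rate >= min_pass_rate


def fake_agent(prompt: str) -> str:
    """Stand-in for a real agent call, so the sketch runs offline."""
    return {"What is 2 + 2?": "The answer is 4."}.get(prompt, "I don't know.")


def main() -> int:
    """Return a process exit code: 0 if the gate passes, 1 otherwise."""
    rate, failed, ok = run_golden_set(fake_agent, GOLDEN_CASES)
    print(f"pass rate: {rate:.0%}, failures: {len(failed)}")
    return 0 if ok else 1
```

In a deploy pipeline you would end the script with `raise SystemExit(main())`, so that a pass rate below the threshold produces a non-zero exit code and blocks the release. The same function can back the hourly canary by running a smaller case list on a schedule and alerting on failure.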
None of this is glamorous, but it is the work that lets teams ship agents and keep shipping them.