
Choosing the right model

Opus, Sonnet, Haiku — and when to blend.

The right model is the one that matches the task

There is no single "best" model in 2026. The frontier is crowded, prices vary by three orders of magnitude, and the answer depends on what you actually care about: coding quality, pure reasoning, cost per call, latency, context window, or deployability.

This lesson covers the three decisions you will make:

  1. Which provider / family — Anthropic, OpenAI, Google, xAI, Meta, DeepSeek, Alibaba...
  2. Which tier within that family — flagship, mid, small. Sometimes you will see a specific parameter count instead (e.g. 70B, 32B, 7B)
  3. Cloud-only or self-hosted open-weight — do you call a hosted model, or run the weights yourself?

Cloud-only vs open-weight — the big fork

The biggest architectural choice is not "which model" but "can I self-host it". Two worlds:

| Dimension | Cloud-only (closed-weight) | Open-weight |
|---|---|---|
| Examples | Claude Opus/Sonnet/Haiku, GPT-5.4 family, Gemini 3.x, Grok 4 | Llama 4, DeepSeek V3.2, Qwen 3, GPT-OSS, Gemma |
| Where it runs | The provider's infra only | Any infra that can host it — cloud, VPC, your datacentre, maybe even your laptop or phone |
| Pricing model | Per-token, provider-set | Per-token if hosted via an API like Bedrock or OpenRouter; otherwise you pay for GPUs and power |
| Data residency | Constrained to provider regions | Total control over where data goes |
| Fine-tuning | Limited (API-exposed tuning only) | Full — you own the weights |
| Quality ceiling | Usually the higher ceiling | Catching up fast; top open models now match closed mid-tier |
| Support | Vendor SLAs, safety review, security work | You own operations, scaling, and safety work |
| Licence | Proprietary | Varies — Apache 2.0, MIT, community licences with caveats |

The honest 2026 picture: closed models still hold the ceiling on hard reasoning and agentic coding, but the gap has narrowed. For many production workloads — especially ones where data residency or cost matter more than peak capability — open-weight models like DeepSeek V3.2 or Qwen3-Max are legitimate primary options.

The Claude family (this course's default)

Earlier versions of this lesson focused on the Claude tiers. They remain the simplest way to think about the hosted-model ladder, and the same pattern applies to every other family too.

  • Opus — maximum intelligence. Use for nuanced judgement, complex reasoning, high-stakes decisions, novel problem spaces.
  • Sonnet — balanced workhorse. The default for most agent work — strong reasoning at a reasonable price.
  • Haiku — speed and efficiency. High-volume, low-complexity, cost-sensitive routing.

Every frontier family has the same three-tier structure: a top intelligence tier, a balanced daily tier, and a fast/cheap tier. OpenAI's is GPT-5.4 Pro / 5.4 / 5.4-mini; Google's is Gemini 3.1 Pro / 3 Pro / 3 Flash. Pick the tier, not just the family.
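That ladder can be sketched as a small routing table. The model ID strings below are illustrative placeholders built from the names in this lesson, not real API identifiers; check your provider's docs for the actual strings.

```python
# Hypothetical tier ladder using this lesson's model names.
# The ID strings are illustrative assumptions, not real API identifiers.
TIERS = {
    "anthropic": {"top": "claude-opus-4.7", "mid": "claude-sonnet-4.6", "fast": "claude-haiku-4.5"},
    "openai":    {"top": "gpt-5.4-pro",     "mid": "gpt-5.4",           "fast": "gpt-5.4-mini"},
    "google":    {"top": "gemini-3.1-pro",  "mid": "gemini-3-pro",      "fast": "gemini-3-flash"},
}

def pick_model(provider: str, tier: str) -> str:
    """Resolve (provider, tier) to a model name: pick the tier, not just the family."""
    return TIERS[provider][tier]
```

Keeping the mapping in one place means a generation bump (say, a new mid-tier release) is a one-line change rather than a codebase-wide search.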

The benchmarks — and the use cases behind them

Before you look at any comparison chart, know what the benchmarks actually measure. Each one is a proxy for a different product use case, and a model's ranking can shuffle dramatically between them. Pick the benchmark that matches the work you will ship.

| Benchmark | Use case | What a high score signals |
|---|---|---|
| SWE-Bench Verified | Agentic coding, autonomous PR agents | Can resolve real-world GitHub issues end-to-end. The single best signal for coding agents. |
| GPQA Diamond | Research agents, technical Q&A | Graduate-level science reasoning on adversarial questions. Measures the ceiling, not the mean. |
| AIME 2025 | Symbolic math, optimisation, provable reasoning | Math-olympiad-level problem solving. |
| MMLU Pro | Knowledge-heavy workflows, generalist assistants | Broad 14-subject knowledge + reasoning. |
| Aider Polyglot | Code editing inside an IDE or CI | Precision edits across six languages. Best for refactor / fix workflows. |

The interactive landscape

The chart below plots cost per typical agent run (log scale) against benchmark performance. Switch the benchmark to change which use case you're optimising for — the bold tag next to each option is the one-word use case. Toggle closed vs open-weight to compare head-to-head. Hover a point for details.

Chart: Cost per run vs SWE-Bench Verified score. SWE-Bench Verified is 500 hand-verified real-world GitHub issues, the headline benchmark for agentic software-engineering work; higher is better. Cost per run assumes ~2k input + 800 output tokens, a typical agent turn. The X-axis is log-scale because the price range spans three orders of magnitude. Open-weight models have an outlined border.
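The per-run figure is plain arithmetic over the per-million-token prices. A minimal sketch, using the lesson's ~2k-in / 800-out assumption:

```python
def cost_per_run(in_per_m: float, out_per_m: float,
                 tokens_in: int = 2_000, tokens_out: int = 800) -> float:
    """Dollar cost of one agent turn, given $/1M-token input and output prices."""
    return (tokens_in * in_per_m + tokens_out * out_per_m) / 1_000_000

# Claude Sonnet 4.6 at $3.00 in / $15.00 out (prices from the grid):
#   2,000 * 3.00/1e6 + 800 * 15.00/1e6 = 0.006 + 0.012 = $0.018 per run
```

Running the same arithmetic for DeepSeek V3.2 at $0.28 / $0.42 gives roughly $0.0009 per run, about twenty times cheaper, which is why the chart needs a log-scale axis at all.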

Comparative grid
| Model | Provider | Openness | Context | $/1M in | $/1M out | SWE-Bench Verified | Notes |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | Anthropic | Closed | 1000K | $5.00 | $25.00 | 87.6% | Top agentic-coding scores in 2026. The go-to for hard reasoning or final-stage judgement in a blend. |
| GPT-5.4 Pro | OpenAI | Closed | 1050K | $30.00 | $180.00 | 82% | Max-intelligence tier. Perfect AIME, top-tier GPQA — but the price reflects it. |
| Gemini 3.1 Pro | Google | Closed | 1000K | $2.00 | $12.00 | 80.6% | Top of MMLU Pro. Very strong multimodal. Huge context window with good recall. |
| GPT-5 Codex | OpenAI | Closed | 400K | $1.25 | $10.00 | 78% | Code-specialised GPT-5 branch. Top of the Aider leaderboard. |
| DeepSeek V3.2 (Thinking) | DeepSeek | MIT | 128K | $0.28 | $0.42 | 78% | Same weights, reasoning mode. Cheap reasoning model — often the price-performance winner. |
| Claude Sonnet 4.6 | Anthropic | Closed | 1000K | $3.00 | $15.00 | 77% | Balanced daily-driver. The default for most production agent workloads. |
| Qwen3-Coder Plus | Alibaba | Apache 2.0 | 1000K | $1.00 | $5.00 | 75% | Code-specialised Qwen. 1M context. Strong SWE-Bench for an open model. |
| GPT-5.4 | OpenAI | Closed | 1050K | $2.50 | $15.00 | 74.9% | OpenAI's daily flagship. Strong coding + reasoning at a Sonnet-comparable price. |
| Qwen3-Max | Alibaba | Apache 2.0 | 262K | $1.20 | $6.00 | 72% | Alibaba's flagship Qwen. Strong multilingual and agent performance. |
| Grok 4.20 Reasoning | xAI | Closed | 2000K | $2.00 | $6.00 | 70% | xAI's reasoning tier. 2M context. Competitive on tool-heavy agent work. |
| DeepSeek V3.2 | DeepSeek | MIT | 128K | $0.28 | $0.42 | 68.8% | Open-weight MoE. Remarkable price-performance — on many coding tasks it rivals closed models. |
| OpenAI o3 | OpenAI | Closed | 200K | $2.00 | $8.00 | 62% | Reasoning-tuned series. Slower (thinks out loud) but strong on hard problems. |
| GPT-5.4 mini | OpenAI | Closed | 400K | $0.75 | $4.50 | 60% | Mid-tier. Closer to Sonnet on many tasks at a much lower price. |
| Gemini 3 Flash | Google | Closed | 1000K | $0.50 | $3.00 | 58% | Google's speed tier. Sonnet-adjacent quality at a Haiku-adjacent price — compelling for throughput work. |
| Claude Haiku 4.5 | Anthropic | Closed | 200K | $1.00 | $5.00 | 52% | Fast + cheap. Great for classification, routing, cheap LLM-judge graders. |
| GPT-OSS 120B | OpenAI | Apache 2.0 | 131K | $0.35 | $0.75 | 52% | OpenAI's open-weight release. Can be self-hosted. Quality below GPT-5-mini but very cheap to run. |
| Llama 4 Maverick (402B MoE) | Meta | Llama 4 Community | 128K | $0.24 | $0.97 | 48% | Flagship Llama 4. Self-hostable at serious infra cost, or hosted cheaply on the Gateway. |
| Llama 4 Scout (109B MoE) | Meta | Llama 4 Community | 128K | $0.17 | $0.66 | 38% | Smaller Llama 4 MoE. Cheap and commonly self-hosted for low-sensitivity bulk work. |

For a wider, side-by-side view of the chart and the full comparative grid, open the Model chooser playground.

How to read the chart

  • The top-left corner is the Pareto frontier: high score at low cost. Look here first.
  • Points to the right that sit only slightly above the frontier are paying a premium — usually worth it only when the task is at the edge of what the cheaper model can do.
  • Moving the benchmark switcher reshuffles the chart. A model that leads on MMLU Pro might trail on SWE-Bench. There is no single "best". Pick the benchmark that matches your use case.
  • Open-weight models (outlined) tend to cluster in the lower-left — cheap, with competitive but not leading scores.
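The frontier test is mechanical: a model is on the frontier when no other model is both cheaper and at least as good. A sketch, using per-run costs (2k in / 800 out) and SWE-Bench Verified scores derived from the grid above:

```python
def pareto_frontier(models):
    """Keep models for which no other model is cheaper-or-equal AND scores at least as well."""
    frontier = []
    for name, cost, score in models:
        dominated = any(
            o_name != name and o_cost <= cost and o_score >= score
            for o_name, o_cost, o_score in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

# (name, cost per run in $, SWE-Bench Verified %)
models = [
    ("Claude Opus 4.7",   0.0300, 87.6),
    ("GPT-5.4 Pro",       0.2040, 82.0),
    ("Claude Sonnet 4.6", 0.0180, 77.0),
    ("DeepSeek V3.2",     0.0009, 68.8),
]
```

On this benchmark GPT-5.4 Pro drops off the frontier: Opus 4.7 is both cheaper per run and higher-scoring, which is exactly the "paying a premium" case described above.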

Blending: the real production pattern

No team ships with a single model behind an agent. The mature production pattern is a blend:

  • Haiku / GPT-5.4-nano / Gemini 3 Flash / Llama 4 Scout — classify the incoming request.
  • Sonnet / GPT-5.4 / Gemini 3 Pro / DeepSeek V3.2 Thinking — handle the main reasoning.
  • Opus / GPT-5.4 Pro — reserved for the hardest calls.

Play with the Cost / Quality simulator to see how shifting tier within a blend moves your cost curve. You will usually find the knee well below the top tier.
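A back-of-envelope version of that simulator: pick a traffic mix across tiers and take the expected per-run cost. The per-run costs below follow from the lesson's 2k-in / 800-out assumption and the grid's Claude prices; the 70/25/5 mix is an illustrative guess, not a recommendation.

```python
def blended_cost(mix: dict[str, float], cost: dict[str, float]) -> float:
    """Expected cost per request for a traffic mix whose shares sum to 1.0."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9
    return sum(share * cost[tier] for tier, share in mix.items())

# Per-run costs at 2k input + 800 output tokens (Claude grid prices):
cost = {"haiku": 0.006, "sonnet": 0.018, "opus": 0.030}
# Illustrative blend: route 70% cheap, 25% balanced, 5% escalated.
mix  = {"haiku": 0.70,  "sonnet": 0.25,  "opus": 0.05}

blended_cost(mix, cost)  # ~$0.0102 per request, vs $0.030 for all-Opus
```

Even this crude version shows the knee: the blend costs roughly a third of routing everything to the top tier, while the hardest 5% of requests still reach it.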

A decision checklist

Before you ship, answer these five:

  1. Does your data need to stay on specific infrastructure? → Open-weight or a closed model in a sovereign region.
  2. What benchmark most closely matches your actual task? → That is your north-star score.
  3. What is the cheapest tier that reliably passes your evals on that benchmark? → Your default model.
  4. Which tier do you need for the hardest 5% of cases? → Your escalation model.
  5. How do you switch between models as a new generation drops? → Abstract the provider behind a gateway or a thin internal interface. You will re-evaluate at least quarterly.
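Point 5 of the checklist is the one teams skip and then regret. A minimal sketch of a thin internal interface (the class and role names here are hypothetical, not any real SDK): callers name a role, never a vendor model ID, so swapping in a new generation is a registry edit rather than a codebase-wide change.

```python
class ModelGateway:
    """Thin internal interface: agent code asks for a role
    ("classifier", "reasoner", "escalation"), never a vendor model ID."""

    def __init__(self) -> None:
        self._models: dict[str, str] = {}

    def register(self, role: str, model_id: str) -> None:
        self._models[role] = model_id

    def resolve(self, role: str) -> str:
        return self._models[role]

gw = ModelGateway()
gw.register("classifier", "claude-haiku-4.5")   # swap to "gemini-3-flash" in one line
gw.register("reasoner",   "claude-sonnet-4.6")
gw.register("escalation", "claude-opus-4.7")
```

The same indirection is what hosted gateways (Bedrock, OpenRouter, and similar) give you externally; this is the in-process version for when you re-evaluate each quarter.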
