Choosing the right model
Opus, Sonnet, Haiku — and when to blend.
The right model is the one that matches the task
There is no single "best" model in 2026. The frontier is crowded, prices vary by three orders of magnitude, and the answer depends on what you actually care about: coding quality, pure reasoning, cost per call, latency, context window, or deployability.
This lesson covers the three decisions you will make:
- Which provider / family — Anthropic, OpenAI, Google, xAI, Meta, DeepSeek, Alibaba...
- Which tier within that family — flagship, mid, small. Sometimes you will get a specific parameter count instead (e.g. 70B, 32B, 7B)
- Cloud-only or self-hosted open-weight — do you call a hosted model, or run the weights yourself?
Cloud-only vs open-weight — the big fork
The biggest architectural choice is not "which model" but "can I self-host it". Two worlds:
| Dimension | Cloud-only (closed-weight) | Open-weight |
|---|---|---|
| Examples | Claude Opus/Sonnet/Haiku, GPT-5.4 family, Gemini 3.x, Grok 4 | Llama 4, DeepSeek V3.2, Qwen 3, GPT-OSS, Gemma |
| Where it runs | The provider's infra only | Any infra that can host it — cloud, VPC, your datacentre, maybe even your laptop or phone |
| Pricing model | Per-token, provider-set | Per-token if hosted via an API like Bedrock or OpenRouter; otherwise you pay for GPUs and power |
| Data residency | Constrained to provider regions | Total control over where data goes |
| Fine-tuning | Limited (API-exposed tuning only) | Full — you own the weights |
| Quality ceiling | Usually the higher ceiling | Catching up fast; top open models now match closed mid-tier |
| Support | Vendor SLAs, safety review, security work | You own operations, scaling, and safety work |
| Licence | Proprietary | Varies — Apache 2.0, MIT, community licences with caveats |
The honest 2026 picture: closed models still hold the ceiling on hard reasoning and agentic coding, but the gap has narrowed. For many production workloads — especially ones where data residency or cost matter more than peak capability — open-weight models like DeepSeek V3.2 or Qwen3-Max are legitimate primary options.
The Claude family (this course's default)
A previous version of this lesson focused on the Claude tiers. They are still the simplest way to think about the hosted-model ladder, and the same structure applies to every other family too.
- Opus — maximum intelligence. Use for nuanced judgement, complex reasoning, high-stakes decisions, novel problem spaces.
- Sonnet — balanced workhorse. The default for most agent work — strong reasoning at a reasonable price.
- Haiku — speed and efficiency. High-volume, low-complexity, cost-sensitive routing.
Every frontier family has the same three-tier structure: a top intelligence tier, a balanced daily tier, and a fast/cheap tier. OpenAI's is GPT-5.4 Pro / 5.4 / 5.4-mini; Google's is Gemini 3.1 Pro / 3 Pro / 3 Flash. Pick the tier, not just the family.
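The tier-not-family point can be captured in a tiny lookup. This is a hypothetical sketch — the model ids follow the lesson's examples but are illustrative placeholders, not a live catalogue:

```python
# The same three-tier ladder, expressed as data. Model ids are placeholders.
TIERS = {
    "anthropic": {"top": "claude-opus-4.7", "mid": "claude-sonnet-4.6", "fast": "claude-haiku-4.5"},
    "openai":    {"top": "gpt-5.4-pro",     "mid": "gpt-5.4",           "fast": "gpt-5.4-mini"},
    "google":    {"top": "gemini-3.1-pro",  "mid": "gemini-3-pro",      "fast": "gemini-3-flash"},
}

def pick_model(family: str, tier: str) -> str:
    """Resolve a (family, tier) pair to a model id."""
    return TIERS[family][tier]

print(pick_model("openai", "fast"))  # gpt-5.4-mini
```

Code that asks for `("openai", "fast")` rather than a hard-coded model id can move families, or follow a version bump, with a one-line change to the table.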
The benchmarks — and the use cases behind them
Before you look at any comparison chart, know what the benchmarks actually measure. Each one is a proxy for a different product use case, and a model's ranking can shuffle dramatically between them. Pick the benchmark that matches the work you will ship.
| Benchmark | Use case | What a high score signals |
|---|---|---|
| SWE-Bench Verified | Agentic coding, autonomous PR agents | Can resolve real-world GitHub issues end-to-end. The single best signal for coding agents. |
| GPQA Diamond | Research agents, technical Q&A | Graduate-level science reasoning on adversarial questions. Measures the ceiling, not the mean. |
| AIME 2025 | Symbolic math, optimisation, provable reasoning | Math-olympiad-level problem solving. |
| MMLU Pro | Knowledge-heavy workflows, generalist assistants | Broad 14-subject knowledge + reasoning. |
| Aider Polyglot | Code editing inside an IDE or CI | Precision edits across six languages. Best for refactor / fix workflows. |
The interactive landscape
The chart below plots cost per typical agent run (log scale) against benchmark performance. Switch the benchmark to change which use case you are optimising for — the bold tag next to each option is the one-word use case. Toggle closed vs open-weight to compare head-to-head, and hover a point for details.
The default benchmark, SWE-Bench Verified, is 500 hand-verified real-world GitHub issues — the headline benchmark for agentic software-engineering work. Higher is better.
Cost per run assumes ~2k input + 800 output tokens — a typical agent turn. The X-axis is log-scale because the price range spans three orders of magnitude. Open-weight models are drawn with an outlined border.
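The per-run cost is simple arithmetic. A minimal sketch, assuming the stated ~2k-in / 800-out token mix and the per-million prices listed in the table below:

```python
def cost_per_run(price_in_per_m: float, price_out_per_m: float,
                 tokens_in: int = 2_000, tokens_out: int = 800) -> float:
    """Dollar cost of one agent turn at the given token mix."""
    return tokens_in / 1e6 * price_in_per_m + tokens_out / 1e6 * price_out_per_m

# Using the table's listed prices:
print(round(cost_per_run(3.00, 15.00), 4))  # Claude Sonnet 4.6 -> 0.018
print(round(cost_per_run(0.28, 0.42), 6))   # DeepSeek V3.2    -> 0.000896
```

Note the roughly 20x gap between those two runs — this is why the chart needs a log-scale axis.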
| Model | Provider | Openness | Context | $/1M in | $/1M out | SWE-Bench Verified | Notes |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | Anthropic | Closed | 1000K | $5.00 | $25.00 | 87.6% | Top agentic-coding scores in 2026. The go-to for hard reasoning or final-stage judgement in a blend. |
| GPT-5.4 Pro | OpenAI | Closed | 1050K | $30.00 | $180.00 | 82% | Max-intelligence tier. Perfect AIME, top-tier GPQA — but the price reflects it. |
| Gemini 3.1 Pro | Google | Closed | 1000K | $2.00 | $12.00 | 80.6% | Top of MMLU Pro. Very strong multimodal. Huge context window with good recall. |
| GPT-5 Codex | OpenAI | Closed | 400K | $1.25 | $10.00 | 78% | Code-specialised GPT-5 branch. Top of the Aider leaderboard. |
| DeepSeek V3.2 (Thinking) | DeepSeek | MIT | 128K | $0.28 | $0.42 | 78% | Same weights, reasoning mode. Cheap reasoning model — often the price-performance winner. |
| Claude Sonnet 4.6 | Anthropic | Closed | 1000K | $3.00 | $15.00 | 77% | Balanced daily-driver. The default for most production agent workloads. |
| Qwen3-Coder Plus | Alibaba | Apache 2.0 | 1000K | $1.00 | $5.00 | 75% | Code-specialised Qwen. 1M context. Strong SWE-Bench for an open model. |
| GPT-5.4 | OpenAI | Closed | 1050K | $2.50 | $15.00 | 74.9% | OpenAI's daily flagship. Strong coding + reasoning at a Sonnet-comparable price. |
| Qwen3-Max | Alibaba | Apache 2.0 | 262K | $1.20 | $6.00 | 72% | Alibaba's flagship Qwen. Strong multilingual and agent performance. |
| Grok 4.20 Reasoning | xAI | Closed | 2000K | $2.00 | $6.00 | 70% | xAI's reasoning tier. 2M context. Competitive on tool-heavy agent work. |
| DeepSeek V3.2 | DeepSeek | MIT | 128K | $0.28 | $0.42 | 68.8% | Open-weight MoE. Remarkable price-performance — on many coding tasks it rivals closed models. |
| OpenAI o3 | OpenAI | Closed | 200K | $2.00 | $8.00 | 62% | Reasoning-tuned series. Slower (thinks out loud) but strong on hard problems. |
| GPT-5.4 mini | OpenAI | Closed | 400K | $0.75 | $4.50 | 60% | Mid-tier. Closer to Sonnet on many tasks at a much lower price. |
| Gemini 3 Flash | Google | Closed | 1000K | $0.50 | $3.00 | 58% | Google's speed tier. Sonnet-adjacent quality at a Haiku-adjacent price — compelling for throughput work. |
| Claude Haiku 4.5 | Anthropic | Closed | 200K | $1.00 | $5.00 | 52% | Fast + cheap. Great for classification, routing, cheap LLM-judge graders. |
| GPT-OSS 120B | OpenAI | Apache 2.0 | 131K | $0.35 | $0.75 | 52% | OpenAI's open-weight release. Can be self-hosted. Quality below GPT-5-mini but very cheap to run. |
| Llama 4 Maverick (402B MoE) | Meta | Llama 4 Community | 128K | $0.24 | $0.97 | 48% | Flagship Llama 4. Self-hostable at serious infra cost, or hosted cheaply on the Gateway. |
| Llama 4 Scout (109B MoE) | Meta | Llama 4 Community | 128K | $0.17 | $0.66 | 38% | Smaller Llama 4 MoE. Cheap and commonly self-hosted for low-sensitivity bulk work. |
For a wider, side-by-side view of the chart and the full comparative grid, open the Model chooser playground.
How to read the chart
- The top-left corner is the Pareto frontier: high score at low cost. Look here first.
- Points further right that are not much higher than the frontier are paying a premium — usually worth it only when the task sits at the edge of what the cheaper model can do.
- Moving the benchmark switcher reshuffles the chart. A model that leads on MMLU Pro might trail on SWE-Bench. There is no single "best". Pick the benchmark that matches your use case.
- Open-weight models (outlined) tend to cluster in the lower-left — cheap, with competitive but not leading scores.
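The Pareto test in the first bullet can be made mechanical. A sketch that flags models no other model beats on both cost and score, using a few illustrative rows from the table above (costs at the ~2k-in / 800-out mix):

```python
def pareto_frontier(models):
    """Return names of models not dominated by any other model.

    A model is dominated if some other entry is at least as cheap AND
    at least as good, and strictly better on one of the two.
    `models` is a list of (name, cost_per_run, score) tuples.
    """
    frontier = []
    for name, cost, score in models:
        dominated = any(
            c <= cost and s >= score and (c < cost or s > score)
            for _, c, s in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Illustrative subset: cost per run derived from the table's prices,
# score is SWE-Bench Verified.
models = [
    ("Claude Opus 4.7",           0.030000, 87.6),
    ("GPT-5.4 Pro",               0.204000, 82.0),
    ("DeepSeek V3.2 (Thinking)",  0.000896, 78.0),
    ("Claude Sonnet 4.6",         0.018000, 77.0),
]
print(pareto_frontier(models))  # ['Claude Opus 4.7', 'DeepSeek V3.2 (Thinking)']
```

In this subset GPT-5.4 Pro is dominated by Opus (cheaper and higher-scoring) and Sonnet is dominated by DeepSeek — exactly the "premium" points the bullet above warns about.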
Blending: the real production pattern
Few teams ship a single model behind an agent. The mature production pattern is a blend:
- Haiku / GPT-5.4-nano / Gemini 3 Flash / Llama 4 Scout — classify the incoming request.
- Sonnet / GPT-5.4 / Gemini 3 Pro / DeepSeek V3.2 Thinking — handle the main reasoning.
- Opus / GPT-5.4 Pro — reserved for the hardest calls.
Play with the Cost / Quality simulator to see how shifting tier within a blend moves your cost curve. You will usually find the knee well below the top tier.
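A back-of-envelope version of that simulation: expected cost per request is just the traffic-weighted average of per-run costs. The per-run figures below are illustrative, derived from the table's prices at the ~2k-in / 800-out mix:

```python
def blended_cost(mix, costs):
    """Expected dollar cost per request for a traffic mix over tiers.

    mix:   {tier: fraction of requests}, fractions must sum to 1
    costs: {tier: cost per run in dollars}
    """
    assert abs(sum(mix.values()) - 1.0) < 1e-9, "mix must sum to 1"
    return sum(frac * costs[tier] for tier, frac in mix.items())

# Illustrative per-run costs (Claude tiers, ~2k in + 800 out tokens):
costs = {"haiku": 0.006, "sonnet": 0.018, "opus": 0.030}

# Route 70% to the fast tier, 25% mid, 5% top:
print(round(blended_cost({"haiku": 0.70, "sonnet": 0.25, "opus": 0.05}, costs), 4))  # 0.0102
```

That blend costs about $0.0102 per request — well under half the all-Sonnet figure — which is the "knee" the simulator makes visible.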
A decision checklist
Before you ship, answer these five questions:
- Does your data need to stay on specific infrastructure? → Open-weight or a closed model in a sovereign region.
- What benchmark most closely matches your actual task? → That is your north-star score.
- What is the cheapest tier that reliably passes your evals on that benchmark? → Your default model.
- Which tier do you need for the hardest 5% of cases? → Your escalation model.
- How will you switch models when a new generation drops? → Abstract the provider behind a gateway or a thin internal interface. You will re-evaluate at least quarterly.
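The last item is the one that pays off over time. Here is a minimal sketch of such a thin internal interface — all names and model ids are hypothetical — so that swapping model generations is a config change, not a code change:

```python
from dataclasses import dataclass
from typing import Protocol

class ChatClient(Protocol):
    """Any provider adapter that can complete a prompt with a given model."""
    def complete(self, model: str, prompt: str) -> str: ...

@dataclass
class ModelConfig:
    default: str      # cheapest tier that reliably passes your evals
    escalation: str   # tier reserved for the hardest ~5% of cases

# Re-pointing these two ids is the whole migration when a generation drops.
CONFIG = ModelConfig(default="claude-sonnet-4.6", escalation="claude-opus-4.7")

def answer(client: ChatClient, prompt: str, hard: bool = False) -> str:
    """Every call site goes through here, never through a provider SDK directly."""
    model = CONFIG.escalation if hard else CONFIG.default
    return client.complete(model=model, prompt=prompt)
```

Because call sites depend only on the `ChatClient` protocol, the same code runs against a closed API, a gateway, or a self-hosted open-weight endpoint.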
Sources
- Live model catalogue and pricing — Vercel AI Gateway models API.
- Vellum LLM Leaderboard — Apr 2026 aggregate scores.
- Artificial Analysis evaluations — independent benchmarks.
- Vals AI MMLU Pro board.
- MindStudio GPT-5.4 / Opus 4.6 / Gemini 3.1 comparison.
- Spheron — DeepSeek V3.2 vs Llama 4 vs Qwen 3 for production 2026.