
Choosing the right model

Opus, Sonnet, Haiku — and when to blend.

The right model is the one that matches the task

There is no single "best" model in 2026. The frontier is crowded, prices vary by three orders of magnitude, and the answer depends on what you actually care about: coding quality, pure reasoning, cost per call, latency, context window, or deployability.

This lesson covers the three decisions you will make:

  1. Which provider / family — Anthropic, OpenAI, Google, xAI, Meta, DeepSeek, Alibaba...
  2. Which tier within that family — flagship, mid, small. Sometimes you will see a specific parameter count instead (e.g. 70B, 32B, 7B)
  3. Cloud-only or self-hosted open-weight — do you call a hosted model, or run the weights yourself?

Cloud-only vs open-weight — the big fork

The biggest architectural choice is not "which model" but "can I self-host it". Two worlds:

| Dimension | Cloud-only (closed-weight) | Open-weight |
|---|---|---|
| Examples | Claude Opus/Sonnet/Haiku, GPT-5.4 family, Gemini 3.x, Grok 4 | Llama 4, DeepSeek V3.2, Qwen 3, GPT-OSS, Gemma |
| Where it runs | The provider's infra only | Any infra that can host it — cloud, VPC, your datacentre, maybe even your laptop or phone |
| Pricing model | Per-token, provider-set | Per-token if hosted via an API like Bedrock or OpenRouter; otherwise you pay for GPUs and power |
| Data residency | Constrained to provider regions | Total control over where data goes |
| Fine-tuning | Limited (API-exposed tuning only) | Full — you own the weights |
| Quality ceiling | Usually the higher ceiling | Catching up fast; top open models now match closed mid-tier |
| Support | Vendor SLAs, safety review, security work | You own operations, scaling, and safety work |
| Licence | Proprietary | Varies — Apache 2.0, MIT, community licences with caveats |

The honest 2026 picture: closed models still hold the ceiling on hard reasoning and agentic coding, but the gap has narrowed. For many production workloads — especially ones where data residency or cost matter more than peak capability — open-weight models like DeepSeek V3.2 or Qwen3-Max are legitimate primary options.

The Claude family (this course's default)

Earlier versions of this lesson focused on the Claude tiers. They remain the simplest way to think about the hosted-model ladder, and the same pattern applies to every other family too.

  • Opus — maximum intelligence. Use for nuanced judgement, complex reasoning, high-stakes decisions, novel problem spaces.
  • Sonnet — balanced workhorse. The default for most agent work — strong reasoning at a reasonable price.
  • Haiku — speed and efficiency. High-volume, low-complexity, cost-sensitive routing.

Every frontier family has the same three-tier structure: a top intelligence tier, a balanced daily tier, and a fast/cheap tier. OpenAI's is GPT-5.4 Pro / 5.4 / 5.4-mini; Google's is Gemini 3.1 Pro / 3 Pro / 3 Flash. Pick the tier, not just the family.
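That ladder can be sketched as a small routing table. The model ID strings below are illustrative placeholders built from the names in this lesson, not real API identifiers; check your provider's docs for the actual strings.

```python
# Hypothetical tier ladder using this lesson's model names.
# The ID strings are illustrative assumptions, not real API identifiers.
TIERS = {
    "anthropic": {"top": "claude-opus-4.7", "mid": "claude-sonnet-4.6", "fast": "claude-haiku-4.5"},
    "openai":    {"top": "gpt-5.4-pro",     "mid": "gpt-5.4",           "fast": "gpt-5.4-mini"},
    "google":    {"top": "gemini-3.1-pro",  "mid": "gemini-3-pro",      "fast": "gemini-3-flash"},
}

def pick_model(provider: str, tier: str) -> str:
    """Resolve (provider, tier) to a model name: pick the tier, not just the family."""
    return TIERS[provider][tier]
```

Keeping the mapping in one place means a generation bump (say, a new mid-tier release) is a one-line change rather than a codebase-wide search.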

The benchmarks — and the use cases behind them

Before you look at any comparison chart, know what the benchmarks actually measure. Each one is a proxy for a different product use case, and a model's ranking can shuffle dramatically between them. Pick the benchmark that matches the work you will ship.

| Benchmark | Use case | What a high score signals |
|---|---|---|
| SWE-Bench Verified | Agentic coding, autonomous PR agents | Can resolve real-world GitHub issues end-to-end. The single best signal for coding agents. |
| GPQA Diamond | Research agents, technical Q&A | Graduate-level science reasoning on adversarial questions. Measures the ceiling, not the mean. |
| AIME 2025 | Symbolic math, optimisation, provable reasoning | Math-olympiad-level problem solving. |
| MMLU Pro | Knowledge-heavy workflows, generalist assistants | Broad 14-subject knowledge + reasoning. |
| Aider Polyglot | Code editing inside an IDE or CI | Precision edits across six languages. Best for refactor / fix workflows. |

The interactive landscape

The chart below plots cost per typical agent run (log scale) against benchmark performance. Switch the benchmark to change which use case you're optimising for — the bold tag next to each option is the one-word use case. Toggle closed vs open-weight to compare head-to-head. Hover a point for details.

Chart: Cost per run vs SWE-Bench Verified score. SWE-Bench Verified is 500 hand-verified real-world GitHub issues, the headline benchmark for agentic software-engineering work; higher is better. Cost per run assumes ~2k input + 800 output tokens, a typical agent turn. The X-axis is log-scale because the price range spans three orders of magnitude. Open-weight models have an outlined border.
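The per-run figure is plain arithmetic over the per-million-token prices. A minimal sketch, using the lesson's ~2k-in / 800-out assumption:

```python
def cost_per_run(in_per_m: float, out_per_m: float,
                 tokens_in: int = 2_000, tokens_out: int = 800) -> float:
    """Dollar cost of one agent turn, given $/1M-token input and output prices."""
    return (tokens_in * in_per_m + tokens_out * out_per_m) / 1_000_000

# Claude Sonnet 4.6 at $3.00 in / $15.00 out (prices from the grid):
#   2,000 * 3.00/1e6 + 800 * 15.00/1e6 = 0.006 + 0.012 = $0.018 per run
```

Running the same arithmetic for DeepSeek V3.2 at $0.28 / $0.42 gives roughly $0.0009 per run, about twenty times cheaper, which is why the chart needs a log-scale axis at all.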

Comparative grid
| Model | Provider | Openness | Context | $/1M in | $/1M out | SWE-Bench Verified | Notes |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | Anthropic | Closed | 1000K | $5.00 | $25.00 | 87.6% | Top agentic-coding scores in 2026. The go-to for hard reasoning or final-stage judgement in a blend. |
| GPT-5.4 Pro | OpenAI | Closed | 1050K | $30.00 | $180.00 | 82% | Max-intelligence tier. Perfect AIME, top-tier GPQA — but the price reflects it. |
| Gemini 3.1 Pro | Google | Closed | 1000K | $2.00 | $12.00 | 80.6% | Top of MMLU Pro. Very strong multimodal. Huge context window with good recall. |
| GPT-5 Codex | OpenAI | Closed | 400K | $1.25 | $10.00 | 78% | Code-specialised GPT-5 branch. Top of the Aider leaderboard. |
| DeepSeek V3.2 (Thinking) | DeepSeek | MIT | 128K | $0.28 | $0.42 | 78% | Same weights, reasoning mode. Cheap reasoning model — often the price-performance winner. |
| Claude Sonnet 4.6 | Anthropic | Closed | 1000K | $3.00 | $15.00 | 77% | Balanced daily-driver. The default for most production agent workloads. |
| Qwen3-Coder Plus | Alibaba | Apache 2.0 | 1000K | $1.00 | $5.00 | 75% | Code-specialised Qwen. 1M context. Strong SWE-Bench for an open model. |
| GPT-5.4 | OpenAI | Closed | 1050K | $2.50 | $15.00 | 74.9% | OpenAI's daily flagship. Strong coding + reasoning at a Sonnet-comparable price. |
| Qwen3-Max | Alibaba | Apache 2.0 | 262K | $1.20 | $6.00 | 72% | Alibaba's flagship Qwen. Strong multilingual and agent performance. |
| Grok 4.20 Reasoning | xAI | Closed | 2000K | $2.00 | $6.00 | 70% | xAI's reasoning tier. 2M context. Competitive on tool-heavy agent work. |
| DeepSeek V3.2 | DeepSeek | MIT | 128K | $0.28 | $0.42 | 68.8% | Open-weight MoE. Remarkable price-performance — on many coding tasks it rivals closed models. |
| OpenAI o3 | OpenAI | Closed | 200K | $2.00 | $8.00 | 62% | Reasoning-tuned series. Slower (thinks out loud) but strong on hard problems. |
| GPT-5.4 mini | OpenAI | Closed | 400K | $0.75 | $4.50 | 60% | Mid-tier. Closer to Sonnet on many tasks at a much lower price. |
| Gemini 3 Flash | Google | Closed | 1000K | $0.50 | $3.00 | 58% | Google's speed tier. Sonnet-adjacent quality at a Haiku-adjacent price — compelling for throughput work. |
| Claude Haiku 4.5 | Anthropic | Closed | 200K | $1.00 | $5.00 | 52% | Fast + cheap. Great for classification, routing, cheap LLM-judge graders. |
| GPT-OSS 120B | OpenAI | Apache 2.0 | 131K | $0.35 | $0.75 | 52% | OpenAI's open-weight release. Can be self-hosted. Quality below GPT-5-mini but very cheap to run. |
| Llama 4 Maverick (402B MoE) | Meta | Llama 4 Community | 128K | $0.24 | $0.97 | 48% | Flagship Llama 4. Self-hostable at serious infra cost, or hosted cheaply on the Gateway. |
| Llama 4 Scout (109B MoE) | Meta | Llama 4 Community | 128K | $0.17 | $0.66 | 38% | Smaller Llama 4 MoE. Cheap and commonly self-hosted for low-sensitivity bulk work. |

For a wider, side-by-side view of the chart and the full comparative grid, open the Model chooser playground.

How to read the chart

  • The top-left corner is the Pareto frontier: high score at low cost. Look here first.
  • Points to the right that sit only slightly above the frontier are paying a premium — usually worth it only when the task is at the edge of what the cheaper model can do.
  • Moving the benchmark switcher reshuffles the chart. A model that leads on MMLU Pro might trail on SWE-Bench. There is no single "best". Pick the benchmark that matches your use case.
  • Open-weight models (outlined) tend to cluster in the lower-left — cheap, with competitive but not leading scores.
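The frontier test is mechanical: a model is on the frontier when no other model is both cheaper and at least as good. A sketch, using per-run costs (2k in / 800 out) and SWE-Bench Verified scores derived from the grid above:

```python
def pareto_frontier(models):
    """Keep models for which no other model is cheaper-or-equal AND scores at least as well."""
    frontier = []
    for name, cost, score in models:
        dominated = any(
            o_name != name and o_cost <= cost and o_score >= score
            for o_name, o_cost, o_score in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

# (name, cost per run in $, SWE-Bench Verified %)
models = [
    ("Claude Opus 4.7",   0.0300, 87.6),
    ("GPT-5.4 Pro",       0.2040, 82.0),
    ("Claude Sonnet 4.6", 0.0180, 77.0),
    ("DeepSeek V3.2",     0.0009, 68.8),
]
```

On this benchmark GPT-5.4 Pro drops off the frontier: Opus 4.7 is both cheaper per run and higher-scoring, which is exactly the "paying a premium" case described above.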

Blending: the real production pattern

No team ships with a single model behind an agent. The mature production pattern is a blend:

  • Haiku / GPT-5.4-nano / Gemini 3 Flash / Llama 4 Scout — classify the incoming request.
  • Sonnet / GPT-5.4 / Gemini 3 Pro / DeepSeek V3.2 Thinking — handle the main reasoning.
  • Opus / GPT-5.4 Pro — reserved for the hardest calls.

Play with the Cost / Quality simulator to see how shifting tier within a blend moves your cost curve. You will usually find the knee well below the top tier.
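A back-of-envelope version of that simulator: pick a traffic mix across tiers and take the expected per-run cost. The per-run costs below follow from the lesson's 2k-in / 800-out assumption and the grid's Claude prices; the 70/25/5 mix is an illustrative guess, not a recommendation.

```python
def blended_cost(mix: dict[str, float], cost: dict[str, float]) -> float:
    """Expected cost per request for a traffic mix whose shares sum to 1.0."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9
    return sum(share * cost[tier] for tier, share in mix.items())

# Per-run costs at 2k input + 800 output tokens (Claude grid prices):
cost = {"haiku": 0.006, "sonnet": 0.018, "opus": 0.030}
# Illustrative blend: route 70% cheap, 25% balanced, 5% escalated.
mix  = {"haiku": 0.70,  "sonnet": 0.25,  "opus": 0.05}

blended_cost(mix, cost)  # ~$0.0102 per request, vs $0.030 for all-Opus
```

Even this crude version shows the knee: the blend costs roughly a third of routing everything to the top tier, while the hardest 5% of requests still reach it.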

A decision checklist

Before you ship, answer these five:

  1. Does your data need to stay on specific infrastructure? → Open-weight or a closed model in a sovereign region.
  2. What benchmark most closely matches your actual task? → That is your north-star score.
  3. What is the cheapest tier that reliably passes your evals on that benchmark? → Your default model.
  4. Which tier do you need for the hardest 5% of cases? → Your escalation model.
  5. How do you switch between models as a new generation drops? → Abstract the provider behind a gateway or a thin internal interface. You will re-evaluate at least quarterly.
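Point 5 of the checklist is the one teams skip and then regret. A minimal sketch of a thin internal interface (the class and role names here are hypothetical, not any real SDK): callers name a role, never a vendor model ID, so swapping in a new generation is a registry edit rather than a codebase-wide change.

```python
class ModelGateway:
    """Thin internal interface: agent code asks for a role
    ("classifier", "reasoner", "escalation"), never a vendor model ID."""

    def __init__(self) -> None:
        self._models: dict[str, str] = {}

    def register(self, role: str, model_id: str) -> None:
        self._models[role] = model_id

    def resolve(self, role: str) -> str:
        return self._models[role]

gw = ModelGateway()
gw.register("classifier", "claude-haiku-4.5")   # swap to "gemini-3-flash" in one line
gw.register("reasoner",   "claude-sonnet-4.6")
gw.register("escalation", "claude-opus-4.7")
```

The same indirection is what hosted gateways (Bedrock, OpenRouter, and similar) give you externally; this is the in-process version for when you re-evaluate each quarter.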
