Model chooser
Explore the cost vs benchmark landscape across frontier and open-weight families. Switch benchmarks to re-rank by use case, pick a family and tier to highlight, and read across to the comparative grid for raw pricing and context-window numbers.
SWE-Bench Verified: 500 hand-verified real-world GitHub issues. The headline benchmark for agentic software-engineering work. Higher is better.
Cost per run vs SWE-Bench Verified score
Cost per run assumes ~2k input + 800 output tokens — a typical agent turn. Log-scale X-axis: the price range spans three orders of magnitude. Open-weight models have an outlined border.
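The per-run figure on the X-axis can be reproduced from the grid's $/1M-token prices. A minimal sketch, assuming the ~2k input + 800 output token turn described above (the helper name `cost_per_run` is illustrative, not part of the chart):

```python
def cost_per_run(price_in_per_m: float, price_out_per_m: float,
                 input_tokens: int = 2_000, output_tokens: int = 800) -> float:
    """Dollar cost of one agent turn, given $/1M-token input and output prices."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000

# Example: DeepSeek V3.2 at $0.28 in / $0.42 out -> under a tenth of a cent per turn
print(f"${cost_per_run(0.28, 0.42):.6f}")  # $0.000896
```

The three-orders-of-magnitude spread falls out directly: the same turn on GPT-5.4 Pro ($30 in / $180 out) costs $0.204, roughly 230× the DeepSeek figure, which is why the X-axis is log-scaled.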
Comparative grid
| Model | Provider | Openness | Context | $/1M in | $/1M out | SWE-Bench Verified | Notes |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | Anthropic | Closed | 1000K | $5.00 | $25.00 | 87.6% | Top agentic-coding scores in 2026. The go-to for hard reasoning or final-stage judgement in a blend. |
| GPT-5.4 Pro | OpenAI | Closed | 1050K | $30.00 | $180.00 | 82% | Max-intelligence tier. Perfect AIME, top-tier GPQA — but the price reflects it. |
| Gemini 3.1 Pro | Google | Closed | 1000K | $2.00 | $12.00 | 80.6% | Top of MMLU Pro. Very strong multimodal. Huge context window with good recall. |
| GPT-5 Codex | OpenAI | Closed | 400K | $1.25 | $10.00 | 78% | Code-specialised GPT-5 branch. Top of the Aider leaderboard. |
| DeepSeek V3.2 (Thinking) | DeepSeek | MIT | 128K | $0.28 | $0.42 | 78% | Same weights, reasoning mode. Cheap reasoning model — often the price-performance winner. |
| Claude Sonnet 4.6 | Anthropic | Closed | 1000K | $3.00 | $15.00 | 77% | Balanced daily-driver. The default for most production agent workloads. |
| Qwen3-Coder Plus | Alibaba | Apache 2.0 | 1000K | $1.00 | $5.00 | 75% | Code-specialised Qwen. 1M context. Strong SWE-Bench for an open model. |
| GPT-5.4 | OpenAI | Closed | 1050K | $2.50 | $15.00 | 74.9% | OpenAI's daily flagship. Strong coding + reasoning at a Sonnet-comparable price. |
| Qwen3-Max | Alibaba | Apache 2.0 | 262K | $1.20 | $6.00 | 72% | Alibaba's flagship Qwen. Strong multilingual and agent performance. |
| Grok 4.20 Reasoning | xAI | Closed | 2000K | $2.00 | $6.00 | 70% | xAI's reasoning tier. 2M context. Competitive on tool-heavy agent work. |
| DeepSeek V3.2 | DeepSeek | MIT | 128K | $0.28 | $0.42 | 68.8% | Open-weight MoE. Remarkable price-performance — on many coding tasks it rivals closed models. |
| OpenAI o3 | OpenAI | Closed | 200K | $2.00 | $8.00 | 62% | Reasoning-tuned series. Slower (thinks out loud) but strong on hard problems. |
| GPT-5.4 mini | OpenAI | Closed | 400K | $0.75 | $4.50 | 60% | Mid-tier. Closer to Sonnet on many tasks at a much lower price. |
| Gemini 3 Flash | Google | Closed | 1000K | $0.50 | $3.00 | 58% | Google's speed tier. Sonnet-adjacent quality at a Haiku-adjacent price — compelling for throughput work. |
| Claude Haiku 4.5 | Anthropic | Closed | 200K | $1.00 | $5.00 | 52% | Fast + cheap. Great for classification, routing, cheap LLM-judge graders. |
| GPT-OSS 120B | OpenAI | Apache 2.0 | 131K | $0.35 | $0.75 | 52% | OpenAI's open-weight release. Can be self-hosted. Quality below GPT-5-mini but very cheap to run. |
| Llama 4 Maverick (402B MoE) | Meta | Llama 4 Community | 128K | $0.24 | $0.97 | 48% | Flagship Llama 4. Self-hostable at serious infra cost, or hosted cheaply on the Gateway. |
| Llama 4 Scout (109B MoE) | Meta | Llama 4 Community | 128K | $0.17 | $0.66 | 38% | Smaller Llama 4 MoE. Cheap and commonly self-hosted for low-sensitivity bulk work. |
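The "re-rank by use case" behaviour the chooser describes can be sketched against a few rows of the grid. This is an illustrative assumption, not the page's actual ranking code: it scores a subset of models by SWE-Bench points per dollar of run cost, using the same ~2k in / 800 out turn as the chart.

```python
# (name, $/1M in, $/1M out, SWE-Bench Verified %) — values copied from the grid
MODELS = [
    ("Claude Opus 4.7",          5.00, 25.00, 87.6),
    ("DeepSeek V3.2 (Thinking)", 0.28,  0.42, 78.0),
    ("Qwen3-Coder Plus",         1.00,  5.00, 75.0),
    ("Claude Haiku 4.5",         1.00,  5.00, 52.0),
]

def run_cost(p_in: float, p_out: float,
             tokens_in: int = 2_000, tokens_out: int = 800) -> float:
    """Dollar cost of one agent turn at the chart's assumed token mix."""
    return (tokens_in * p_in + tokens_out * p_out) / 1_000_000

# Rank by benchmark points per dollar spent on a single run (higher = better value)
ranked = sorted(MODELS, key=lambda m: m[3] / run_cost(m[1], m[2]), reverse=True)
for name, p_in, p_out, score in ranked:
    print(f"{name:28s} {score:5.1f}%  ${run_cost(p_in, p_out):.6f}/run")
```

Under this metric the cheap open-weight reasoner tops the list and the max-intelligence tiers fall to the bottom, which matches the grid's recurring "price-performance winner" framing; a different benchmark column would of course produce a different ordering.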