Model chooser
Explore the cost vs benchmark landscape across frontier and open-weight families. Switch benchmarks to re-rank by use case, pick a family and tier to highlight, and read across to the comparative grid for raw pricing and context-window numbers.
SWE-Bench Verified: 500 hand-verified real-world GitHub issues. The headline benchmark for agentic software-engineering work. Higher is better.
Cost per run vs SWE-Bench Verified score
Cost per run assumes ~2k input + 800 output tokens — a typical agent turn. Log-scale X-axis: the price range spans three orders of magnitude. Open-weight models have an outlined border.
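The per-run figure on the X-axis can be reproduced from the grid's $/1M-token prices. A minimal sketch, assuming the ~2k input + 800 output token turn described above (the helper name `cost_per_run` is illustrative, not part of the chart):

```python
def cost_per_run(price_in_per_m: float, price_out_per_m: float,
                 input_tokens: int = 2_000, output_tokens: int = 800) -> float:
    """Dollar cost of one agent turn, given $/1M-token input and output prices."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000

# Example: DeepSeek V3.2 at $0.28 in / $0.42 out -> under a tenth of a cent per turn
print(f"${cost_per_run(0.28, 0.42):.6f}")  # $0.000896
```

The three-orders-of-magnitude spread falls out directly: the same turn on GPT-5.4 Pro ($30 in / $180 out) costs $0.204, roughly 230× the DeepSeek figure, which is why the X-axis is log-scaled.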
Comparative grid
| Model | Provider | Openness | Context | $/1M in | $/1M out | SWE-Bench Verified | Notes |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | Anthropic | Closed | 1000K | $5.00 | $25.00 | 87.6% | Top agentic-coding scores in 2026. The go-to for hard reasoning or final-stage judgement in a blend. |
| GPT-5.4 Pro | OpenAI | Closed | 1050K | $30.00 | $180.00 | 82% | Max-intelligence tier. Perfect AIME, top-tier GPQA — but the price reflects it. |
| Gemini 3.1 Pro | Google | Closed | 1000K | $2.00 | $12.00 | 80.6% | Top of MMLU Pro. Very strong multimodal. Huge context window with good recall. |
| GPT-5 Codex | OpenAI | Closed | 400K | $1.25 | $10.00 | 78% | Code-specialised GPT-5 branch. Top of the Aider leaderboard. |
| DeepSeek V3.2 (Thinking) | DeepSeek | MIT | 128K | $0.28 | $0.42 | 78% | Same weights, reasoning mode. Cheap reasoning model — often the price-performance winner. |
| Claude Sonnet 4.6 | Anthropic | Closed | 1000K | $3.00 | $15.00 | 77% | Balanced daily-driver. The default for most production agent workloads. |
| Qwen3-Coder Plus | Alibaba | Apache 2.0 | 1000K | $1.00 | $5.00 | 75% | Code-specialised Qwen. 1M context. Strong SWE-Bench for an open model. |
| GPT-5.4 | OpenAI | Closed | 1050K | $2.50 | $15.00 | 74.9% | OpenAI's daily flagship. Strong coding + reasoning at a Sonnet-comparable price. |
| Qwen3-Max | Alibaba | Apache 2.0 | 262K | $1.20 | $6.00 | 72% | Alibaba's flagship Qwen. Strong multilingual and agent performance. |
| Grok 4.20 Reasoning | xAI | Closed | 2000K | $2.00 | $6.00 | 70% | xAI's reasoning tier. 2M context. Competitive on tool-heavy agent work. |
| DeepSeek V3.2 | DeepSeek | MIT | 128K | $0.28 | $0.42 | 68.8% | Open-weight MoE. Remarkable price-performance — on many coding tasks it rivals closed models. |
| OpenAI o3 | OpenAI | Closed | 200K | $2.00 | $8.00 | 62% | Reasoning-tuned series. Slower (thinks out loud) but strong on hard problems. |
| GPT-5.4 mini | OpenAI | Closed | 400K | $0.75 | $4.50 | 60% | Mid-tier. Closer to Sonnet on many tasks at a much lower price. |
| Gemini 3 Flash | Google | Closed | 1000K | $0.50 | $3.00 | 58% | Google's speed tier. Sonnet-adjacent quality at a Haiku-adjacent price — compelling for throughput work. |
| Claude Haiku 4.5 | Anthropic | Closed | 200K | $1.00 | $5.00 | 52% | Fast + cheap. Great for classification, routing, cheap LLM-judge graders. |
| GPT-OSS 120B | OpenAI | Apache 2.0 | 131K | $0.35 | $0.75 | 52% | OpenAI's open-weight release. Can be self-hosted. Quality below GPT-5-mini but very cheap to run. |
| Llama 4 Maverick (402B MoE) | Meta | Llama 4 Community | 128K | $0.24 | $0.97 | 48% | Flagship Llama 4. Self-hostable at serious infra cost, or hosted cheaply on the Gateway. |
| Llama 4 Scout (109B MoE) | Meta | Llama 4 Community | 128K | $0.17 | $0.66 | 38% | Smaller Llama 4 MoE. Cheap and commonly self-hosted for low-sensitivity bulk work. |
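The "re-rank by use case" behaviour the chooser describes can be sketched against a few rows of the grid. This is an illustrative assumption, not the page's actual ranking code: it scores a subset of models by SWE-Bench points per dollar of run cost, using the same ~2k in / 800 out turn as the chart.

```python
# (name, $/1M in, $/1M out, SWE-Bench Verified %) — values copied from the grid
MODELS = [
    ("Claude Opus 4.7",          5.00, 25.00, 87.6),
    ("DeepSeek V3.2 (Thinking)", 0.28,  0.42, 78.0),
    ("Qwen3-Coder Plus",         1.00,  5.00, 75.0),
    ("Claude Haiku 4.5",         1.00,  5.00, 52.0),
]

def run_cost(p_in: float, p_out: float,
             tokens_in: int = 2_000, tokens_out: int = 800) -> float:
    """Dollar cost of one agent turn at the chart's assumed token mix."""
    return (tokens_in * p_in + tokens_out * p_out) / 1_000_000

# Rank by benchmark points per dollar spent on a single run (higher = better value)
ranked = sorted(MODELS, key=lambda m: m[3] / run_cost(m[1], m[2]), reverse=True)
for name, p_in, p_out, score in ranked:
    print(f"{name:28s} {score:5.1f}%  ${run_cost(p_in, p_out):.6f}/run")
```

Under this metric the cheap open-weight reasoner tops the list and the max-intelligence tiers fall to the bottom, which matches the grid's recurring "price-performance winner" framing; a different benchmark column would of course produce a different ordering.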