Model chooser

Explore the cost vs benchmark landscape across frontier and open-weight families. Switch benchmarks to re-rank by use case, pick a family and tier to highlight, and read across to the comparative grid for raw pricing and context-window numbers.

SWE-Bench Verified: 500 hand-verified real-world GitHub issues. The headline benchmark for agentic software-engineering work. Higher is better.

Cost per run vs SWE-Bench Verified score

Cost per run assumes ~2k input + 800 output tokens — a typical agent turn. Log-scale X-axis: the price range spans three orders of magnitude. Open-weight models have an outlined border.
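For concreteness, the arithmetic behind the cost-per-run axis is sketched below in Python. The `cost_per_run` helper is hypothetical (the chooser doesn't expose any API); the 2,000/800 token split is the chart's stated assumption, and the example prices come from the grid further down.

```python
# Cost of one agent turn under the chart's assumption:
# ~2,000 input + 800 output tokens, at $/1M-token prices.

def cost_per_run(in_price: float, out_price: float,
                 in_tokens: int = 2_000, out_tokens: int = 800) -> float:
    """Dollar cost of one agent turn at the given $/1M-token prices."""
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

# Example prices taken from the comparative grid:
print(f"Claude Opus 4.7: ${cost_per_run(5.00, 25.00):.4f}")    # $0.0300
print(f"DeepSeek V3.2:   ${cost_per_run(0.28, 0.42):.6f}")     # $0.000896
print(f"GPT-5.4 Pro:     ${cost_per_run(30.00, 180.00):.3f}")  # $0.204
```

The spread between those three runs is why the X-axis is log-scale: per-turn cost differs by orders of magnitude across the grid.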

Comparative grid
| Model | Provider | Openness | Context | $/1M in | $/1M out | SWE-Bench Verified | Notes |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | Anthropic | Closed | 1000K | $5.00 | $25.00 | 87.6% | Top agentic-coding scores in 2026. The go-to for hard reasoning or final-stage judgement in a blend. |
| GPT-5.4 Pro | OpenAI | Closed | 1050K | $30.00 | $180.00 | 82% | Max-intelligence tier. Perfect AIME, top-tier GPQA, but the price reflects it. |
| Gemini 3.1 Pro | Google | Closed | 1000K | $2.00 | $12.00 | 80.6% | Top of MMLU Pro. Very strong multimodal. Huge context window with good recall. |
| GPT-5 Codex | OpenAI | Closed | 400K | $1.25 | $10.00 | 78% | Code-specialised GPT-5 branch. Top of the Aider leaderboard. |
| DeepSeek V3.2 (Thinking) | DeepSeek | MIT | 128K | $0.28 | $0.42 | 78% | Same weights, reasoning mode. Cheap reasoning model; often the price-performance winner. |
| Claude Sonnet 4.6 | Anthropic | Closed | 1000K | $3.00 | $15.00 | 77% | Balanced daily driver. The default for most production agent workloads. |
| Qwen3-Coder Plus | Alibaba | Apache 2.0 | 1000K | $1.00 | $5.00 | 75% | Code-specialised Qwen. 1M context. Strong SWE-Bench for an open model. |
| GPT-5.4 | OpenAI | Closed | 1050K | $2.50 | $15.00 | 74.9% | OpenAI's daily flagship. Strong coding + reasoning at a Sonnet-comparable price. |
| Qwen3-Max | Alibaba | Apache 2.0 | 262K | $1.20 | $6.00 | 72% | Alibaba's flagship Qwen. Strong multilingual and agent performance. |
| Grok 4.20 Reasoning | xAI | Closed | 2000K | $2.00 | $6.00 | 70% | xAI's reasoning tier. 2M context. Competitive on tool-heavy agent work. |
| DeepSeek V3.2 | DeepSeek | MIT | 128K | $0.28 | $0.42 | 68.8% | Open-weight MoE. Remarkable price-performance; on many coding tasks it rivals closed models. |
| OpenAI o3 | OpenAI | Closed | 200K | $2.00 | $8.00 | 62% | Reasoning-tuned series. Slower (thinks out loud) but strong on hard problems. |
| GPT-5.4 mini | OpenAI | Closed | 400K | $0.75 | $4.50 | 60% | Mid-tier. Closer to Sonnet on many tasks at a much lower price. |
| Gemini 3 Flash | Google | Closed | 1000K | $0.50 | $3.00 | 58% | Google's speed tier. Sonnet-adjacent quality at a Haiku-adjacent price; compelling for throughput work. |
| Claude Haiku 4.5 | Anthropic | Closed | 200K | $1.00 | $5.00 | 52% | Fast + cheap. Great for classification, routing, cheap LLM-judge graders. |
| GPT-OSS 120B | OpenAI | Apache 2.0 | 131K | $0.35 | $0.75 | 52% | OpenAI's open-weight release. Can be self-hosted. Quality below GPT-5-mini but very cheap to run. |
| Llama 4 Maverick (402B MoE) | Meta | Llama 4 Community | 128K | $0.24 | $0.97 | 48% | Flagship Llama 4. Self-hostable at serious infra cost, or hosted cheaply on the Gateway. |
| Llama 4 Scout (109B MoE) | Meta | Llama 4 Community | 128K | $0.17 | $0.66 | 38% | Smaller Llama 4 MoE. Cheap and commonly self-hosted for low-sensitivity bulk work. |
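To re-rank the grid by value rather than raw score, one reasonable heuristic is benchmark points per dollar of a typical agent turn. A minimal sketch under that assumption, using a few rows from the grid above (the score-per-dollar metric is an illustration, not something the chooser itself computes):

```python
# A few rows from the grid: (model, $/1M in, $/1M out, SWE-Bench Verified %).
GRID = [
    ("Claude Opus 4.7",   5.00, 25.00, 87.6),
    ("Claude Sonnet 4.6", 3.00, 15.00, 77.0),
    ("DeepSeek V3.2",     0.28,  0.42, 68.8),
    ("Gemini 3 Flash",    0.50,  3.00, 58.0),
]

def cost_per_run(in_price, out_price, in_tokens=2_000, out_tokens=800):
    # Same per-turn assumption as the chart: ~2k input + 800 output tokens.
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

# Rank by score per dollar of one agent turn (higher = better value).
ranked = sorted(GRID, key=lambda m: m[3] / cost_per_run(m[1], m[2]), reverse=True)
for name, inp, outp, score in ranked:
    print(f"{name:<18} {score:5.1f}%  ${cost_per_run(inp, outp):.6f}/run")
```

On this metric the open-weight DeepSeek V3.2 comes out far ahead of the closed flagships, which is the grid's "price-performance" story in miniature; a real chooser would also weigh context window, latency, and openness.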