LearnAIAgents
🔨 Build

Context windows

The shared budget every agent lives in — sizes, cost, and compute.

Every agent lives inside a context window

Before we talk about tools, skills, memory, or RAG, you need a clear mental model of the thing they all share: the context window. It is the single piece of working memory the model gets on every call, and everything else in this module is a strategy for spending it well.

The first thing to internalise: the model has no memory of its own. Every time the user types a new message, the client sends the entire conversation so far — system prompt, every prior user message, every prior assistant reply, plus the new user message — back to the server as one big payload. There is no server-side "session" holding the chat; the server is stateless and the history is the prompt. Each turn the prompt gets longer, even if the user only adds a sentence.

Every turn, the whole history is sent again. The system prompt (top) plus every prior message plus the new user message — each request gets bigger, even if the user types one line.
[Diagram: four requests shown side by side along a time axis. Turn 1 sends 2 items (system prompt + user message); turn 2 sends 4; turn 3 sends 6; turn 4 sends 8. The payload grows every turn. Legend: system prompt (always sent), user message, assistant reply.]

That re-sending is what makes prompt caching and conversation compaction so important: by turn 20 of a working session you may be sending hundreds of thousands of tokens with every single message, and you're paying for them every time.
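A quick sketch of that growth, assuming a fixed token count per message for simplicity (real messages vary and token counts come from the provider's tokenizer):

```python
def tokens_sent(system_tokens, msg_tokens, turns):
    """Total input tokens billed across a session in which the full
    history is re-sent on every turn (one user message and one
    assistant reply per turn, each msg_tokens long)."""
    total = 0
    history = system_tokens
    for _ in range(turns):
        history += msg_tokens   # the new user message joins the prompt
        total += history        # the whole prompt is sent to the server
        history += msg_tokens   # the assistant reply joins the history
    return total

# 2k system prompt, 500-token messages, 20 turns: the user typed only
# 10k tokens of new text, but 240k tokens were actually sent.
print(tokens_sent(2_000, 500, 20))  # → 240000
```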

The context window is a fixed-size token budget. On each call the model can see:

  • The system prompt you wrote.
  • The tool definitions the agent has been given.
  • Any skills loaded into the conversation.
  • The conversation history so far.
  • Any retrieved documents injected by RAG.
  • Any scratchpad / reasoning the model produced earlier.

When all of that together exceeds the window, the call fails or older content is dropped. Hence the core discipline of this module: deciding what gets space, and making sure everything that does earns its keep.
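The budget check can be sketched directly. The component names mirror the list above; the token counts are assumed to be pre-computed by a tokenizer, and the window size is illustrative:

```python
CONTEXT_WINDOW = 200_000  # e.g. a Haiku-class model

def headroom(system, tools, skills, history, retrieved, scratchpad):
    """Sum every component competing for the window and report how
    much space is left. A negative result means the call will fail
    or older content will be dropped."""
    used = system + tools + skills + history + retrieved + scratchpad
    return CONTEXT_WINDOW - used

free = headroom(system=8_300, tools=14_100, skills=7_400,
                history=150_000, retrieved=12_000, scratchpad=5_000)
print(free)  # → 3200
```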

A real example — Claude Code's /context

Claude Code ships a /context slash command that shows you exactly what has been spent. Here's a real snapshot from a session building this course:

Claude Opus 4.7 (1M context), captured mid-session: 549.2k / 1.00M tokens used (54.9%), 450.9k free.

The raw numbers, broken down:

Category | Tokens | Share
System prompt | 8.3k | 0.8%
System tools (definitions) | 14.1k | 1.4%
Custom agents | 983 | 0.1%
Skills | 7.4k | 0.7%
Messages (conversation) | 485.4k | 48.5%
Autocompact buffer | 33.0k | 3.3%
Used | 549.2k | 54.9%
Free | 450.9k | 45.1%

Observations:

First, note the scale: this is Opus's very large 1M window, and the session has already consumed more than half of it.

  • The conversation dominates. 88% of everything spent is message history. System plumbing (prompt + tools + skills + agents) is a rounding error by comparison.
  • Free space is not really free. The auto-compact buffer (~3% here) is reserved for the moment the window fills and the tool has to summarise and compress older messages to keep going.
  • Tool definitions are cheap. 14k for all tools is a tiny price to pay, which is why it's usually wrong to prune tools for context reasons. Prune them for confusion reasons.

Context sizes vary by ~16× across the frontier

Not all models have the same window. Pick a model whose window fits the shape of the problem.

Context window by model. Ranges from 128K (Llama, DeepSeek) to 2M (Grok 4). Most frontier closed models have converged on ~1M.
  • Grok 4.20 Reasoning: 2.00M
  • GPT-5.4 Pro: 1.05M
  • GPT-5.4: 1.05M
  • Claude Opus 4.7: 1.00M
  • Claude Sonnet 4.6: 1.00M
  • Gemini 3.1 Pro: 1.00M
  • Gemini 3 Flash: 1.00M
  • Qwen3-Coder Plus: 1.00M
  • GPT-5.4 mini: 400K
  • GPT-5 Codex: 400K
  • Qwen3-Max: 262K
  • Claude Haiku 4.5: 200K
  • OpenAI o3: 200K
  • GPT-OSS 120B: 131K
  • Llama 4 Maverick (402B MoE): 128K
  • Llama 4 Scout (109B MoE): 128K
  • DeepSeek V3.2: 128K
  • DeepSeek V3.2 (Thinking): 128K

The spread matters:

  • 128K (Llama 4, DeepSeek V3.2, GPT-OSS) — enough for a long document or a short agent session; not enough for a multi-hour coding run.
  • 200K (Claude Haiku 4.5, OpenAI o3) — comfortable for most single-task agent work.
  • 1M (Claude Opus/Sonnet 4.x, Gemini 3, GPT-5.4, Qwen3-Coder Plus) — the current frontier ceiling for most closed models. Enough to hold a whole codebase or a week of chat.
  • 2M (Grok 4) — headroom for exceptionally long agent runs or corpus-scale ingestion.

The pattern in 2026 is clear: closed-model vendors have converged on ~1M as the standard; open models are still mostly 128K. If your workflow genuinely needs 1M of live context, that narrows your provider choice significantly.
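That provider-narrowing decision can be made mechanical. A sketch of a fit check against the window sizes above (model identifiers are made up for illustration; sizes are transcribed from the chart, treating 1M as an even 1,000,000 tokens):

```python
WINDOWS = {
    "grok-4.20-reasoning": 2_000_000,
    "gpt-5.4": 1_050_000,
    "claude-opus-4.7": 1_000_000,
    "claude-haiku-4.5": 200_000,
    "llama-4-maverick": 128_000,
    "deepseek-v3.2": 128_000,
}

def models_that_fit(prompt_tokens, margin=0.1):
    """Models whose window holds the prompt plus a safety margin,
    reserved for the model's own output and compaction overhead."""
    budget = prompt_tokens * (1 + margin)
    return [m for m, w in WINDOWS.items() if w >= budget]

print(models_that_fit(500_000))
# → ['grok-4.20-reasoning', 'gpt-5.4', 'claude-opus-4.7']
```

A 500k-token live context rules out every 128K and 200K model immediately, which is the "narrows your provider choice" point in practice.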

Bigger is not free, and isn't always better

A larger window sounds like pure upside. It is not. Three real costs come with spending more:

1. Compute and latency

Attention is the technology at the heart of a transformer, and classic attention scales quadratically with sequence length. Modern models use tricks (sparse attention, sliding windows, KV-cache optimisations) to make long context feasible, but the economics still rhyme: more tokens in, more compute per call, more waiting before the first output token, more wall-clock time to the last.

In practice: a 500k-token input is not just expensive, it is noticeably slower to respond than a 5k-token one, even on the same model.
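Back-of-envelope arithmetic, under the classic quadratic-attention assumption (real serving stacks flatten this considerably, but the direction holds):

```python
def naive_attention_ratio(long_ctx, short_ctx):
    """How much more attention compute a longer input needs if cost
    scales with the square of sequence length (classic attention)."""
    return (long_ctx / short_ctx) ** 2

# 100x the tokens, but ~10,000x the naive attention compute.
print(naive_attention_ratio(500_000, 5_000))  # → 10000.0
```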

2. Cost

Every token in the input is billed on every turn. If the conversation has grown to 400k tokens and you take 20 more turns, you are paying for 400k×20 = 8M input tokens even if the user says very little new.

The mitigation is prompt caching: the provider stores the KV-cache for the repeated parts of the prompt and bills you a fraction of the usual price to reuse it. With caching on, the unchanged part of the prompt (system, tools, earlier messages) can be billed at around 10% of its usual price. It costs a little more upfront but much less over time; without it, long sessions get expensive fast.
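Rough arithmetic for the 400k-token example, at a hypothetical $3 per million input tokens (real prices vary by model; the cached figure is an idealised floor that treats the whole history as a 10% cache read, ignoring the one-time write premium and each turn's new tokens):

```python
history_tokens = 400_000
turns = 20
price_per_million = 3.00    # hypothetical input price, $/1M tokens

total_input = history_tokens * turns             # 8,000,000 tokens re-sent
uncached = total_input / 1e6 * price_per_million
cached = uncached * 0.1                          # everything read from cache
print(f"${uncached:.2f} uncached vs ~${cached:.2f} with caching")
# → $24.00 uncached vs ~$2.40 with caching
```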

3. Quality decays — the "dumb zone"

A model that can hold 1M tokens rarely uses all of them equally well. Research has shown that models are generally best at paying attention to the beginning and end of their context windows, leading to the model appearing to forget things from the middle.

Two landmark pieces of research make the point.

Liu et al. (2023), Lost in the Middle — the first study to systematically show that LLM recall follows a U-shape. Performance is highest when relevant information sits at the beginning or end of the context, and significantly worse when it sits in the middle. They saw this on multi-document QA and key-value retrieval, across both off-the-shelf models and ones explicitly advertised as long-context.

Chroma Research (2025), Context Rot — the 2025 update. They evaluated 18 models across four vendors (all Claude 4 / 3.x, GPT-4.x and o3, Gemini 2.5, Qwen3-235B down to 8B) on four task types: needle-in-a-haystack with variable similarity, distractor robustness, haystack structure, and LongMemEval-style conversational recall. The findings sharpen the earlier picture:

  • Degradation is non-uniform — every model tested shows decline as input grows, but the shape and severity vary by model and task.
  • Distractors compound. Even a single distractor drops performance below baseline, and multiple distractors compound the damage; the second and third distractor positions drew the most hallucinations.
  • Structure matters — in surprising ways. Logically coherent haystacks actually hurt recall more than shuffled ones, likely because coherent structure competes for attention with the target.
  • Model styles differ. Claude models refuse and abstain more (Opus 4 refused the repeated-words task ~2.9% of the time), GPT models hallucinate more confidently, and Gemini generates random non-input words past ~500–750 tokens.
  • Real work is harder than the benchmark. Classical needle-in-a-haystack tests lexical retrieval; real agent workloads require semantic synthesis across distractors, which degrades faster.

The qualitative U-shape from the Lost in the Middle paper holds up; the Context Rot paper shows the curve collapses further when you add realistic distractor load.

The dumb zone. Illustrative U-shaped recall across a long context. Performance is highest at the beginning and end; the middle is where information gets lost.

Curve is illustrative, adapted from the pattern described in Liu et al. (2023), Lost in the Middle, and in Chroma Research's 2025 Context Rot study. The real shape and depth depend on model, task, and distractor load — but the qualitative pattern is remarkably stable.

The practical implications for agent builders:

  • Put the most important content at the end. System prompt and tools up top (cacheable, ignorable); retrieved facts, user query, and required output format near the bottom.
  • Fewer, cleaner retrievals beat more. If you can pull five very relevant chunks, do that rather than 20 loosely-relevant ones — distractors cost more than missing context.
  • Don't treat 1M as a license to stuff. A model that scores well at 5k tokens may score materially worse at 500k on the same task. Measure on realistic input sizes.
  • Summarise aggressively for older turns. The middle of a long conversation is the dumb zone. Rolling summaries are not just a cost mitigation — they are a quality mitigation.
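Those four rules translate directly into how a prompt gets assembled. A sketch, assuming older turns arrive pre-compacted (all helper names are illustrative; summarisation would be a separate LLM call in practice):

```python
def assemble_prompt(system, tool_defs, skills,
                    history_summary, retrieved_chunks, user_query):
    """Stable, cacheable parts first; the decision-critical parts
    (fresh facts, the actual question) last, where recall is strongest.
    The compacted middle stays small so less falls into the dumb zone."""
    top_chunks = retrieved_chunks[:5]   # few, clean retrievals beat many
    return "\n\n".join([
        system,                  # stable prefix: prime caching target
        tool_defs,
        skills,
        history_summary,         # rolling summary of older turns
        "\n\n".join(top_chunks),
        user_query,              # the ask goes at the very end
    ])
```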


Strategies for living inside the window

Four moves that every serious agent uses:

  1. Rolling summaries. When conversation history starts crowding other content, compact older turns to a summary. Claude Code's auto-compact is an automatic version of this.
  2. RAG instead of stuffing. If a whole document library is relevant, don't paste it — index it and retrieve the relevant chunks per turn. Covered in RAG and context engineering.
  3. Per-turn selection. At each step, decide what memory and which skills are actually relevant, and load only those. This is the unglamorous part of agent engineering that most moves the quality needle past prototype.
  4. Prompt caching — covered in detail below.
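Move 1 in miniature. Here `count_tokens` and `summarize` are placeholders for a real tokenizer and a real LLM summarisation call, and the budget is illustrative:

```python
MAX_HISTORY_TOKENS = 50_000

def compact(messages, count_tokens, summarize):
    """While the history exceeds its budget, fold the oldest chunk of
    messages into a single summary message (a rolling summary)."""
    def total(msgs):
        return sum(count_tokens(m) for m in msgs)

    while total(messages) > MAX_HISTORY_TOKENS and len(messages) > 2:
        fold = max(2, len(messages) // 2)    # always fold at least two
        summary = summarize(messages[:fold])
        messages = [{"role": "user", "content": summary}] + messages[fold:]
    return messages
```

Claude Code's auto-compact does a more sophisticated version of the same thing, triggered automatically when the reserved buffer is reached.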

Prompt caching

The single biggest cost lever for any multi-turn agent is prompt caching. Because every turn re-sends the same system prompt, the same tool definitions, and the same earlier messages, most of the payload on turn N was already sent on turn N−1. Prompt caching lets the provider recognise that unchanged prefix and charge you a fraction of the usual price to reuse it.

The picture is the same as the one at the top of the lesson, except that the system prompt now carries a cache marker. Turn 1 writes the prefix to the cache; turns 2+ read it back.

With caching, the system prompt (and any stable prefix) is written once and read thereafter. The cached segment is billed at 0.1× per read; unchanged content keeps its discount until anything in the prefix changes.
[Diagram: the same four growing turns along a time axis. On turn 1 the system prompt is marked "write 2×"; on turns 2–4 it is marked "hit 0.1×". Legend: system prompt (always sent), user message, assistant reply, cached prefix (read at ~10% price).]

The catch: writing to the cache costs 2× the normal input price. That one-time premium means a single-turn interaction with caching on is more expensive than one without. The question is how quickly the 0.1× read price pays that premium back.

Cumulative input cost over a session. Cache writes cost 2× and cache reads cost 0.1× the normal input price. The cached line is higher on turn 1 (the write) but pulls ahead as early as turn 3.

Model: 5k-token cached prefix, +500 tokens per turn, one cache write on turn 1 at 2×, cache reads thereafter at 0.1×. Real prefixes are often larger (system prompt + tool definitions + skills commonly total 20–50k), which makes the break-even sooner and the long-tail savings steeper.

Read the shape of the chart, not the absolute numbers. With a realistic cacheable prefix, the two curves cross at around turn 3 — and from then on the gap widens every turn. The implications:

  • One-shot classifiers and single-turn calls: caching is usually not worth it. The write premium never amortises.
  • Any multi-turn agent (>3 turns): caching is almost always a win, and the savings compound. A 20-turn working session typically ends up costing a third or less of the uncached equivalent.
  • Put stable things at the top. System prompt, tool definitions, long skills, reference docs. Anything that changes every turn goes at the bottom — once a cached prefix's tail changes, the cache is invalidated from that point on.
  • Measure your real prefix. The 5k prefix in the chart is conservative; most production agents have 20–50k tokens of system prompt + tools + skills, which makes the break-even even earlier and the long-tail savings bigger.
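The chart's break-even model is small enough to verify by hand. Using the same assumptions (5k cacheable prefix, 500 new tokens per turn, one 2× cache write on turn 1, 0.1× reads thereafter), in units of the normal input token price:

```python
PREFIX, PER_TURN = 5_000, 500
WRITE, READ = 2.0, 0.1      # cache write / read price multipliers

def cumulative_cost(turns, cached):
    """Cumulative input cost after `turns`, in normal-price token units."""
    total = 0.0
    for t in range(1, turns + 1):
        conversation = PER_TURN * t              # grows every turn, full price
        if not cached:
            total += PREFIX + conversation
        elif t == 1:
            total += WRITE * PREFIX + conversation   # one-time write premium
        else:
            total += READ * PREFIX + conversation    # cheap cache reads
    return total

print(cumulative_cost(1, True) > cumulative_cost(1, False))  # → True (the write costs more)
print(cumulative_cost(3, True) < cumulative_cost(3, False))  # → True (ahead by turn 3)
```

By turn 20 this conservative 5k prefix gives 124,500 units cached against 205,000 uncached, about 61% of the bill; rerun it with a realistic 30k prefix and the cached session drops to roughly a third.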

Play with the cost

Drag the sliders to see how context fill, output size, and session length move cost — and how different models compare at each setting.

[Interactive cost calculator. Example settings: 100k input tokens, 1.0k output tokens, 20 turns. At those settings: $0.0720 cost per turn (input + output, with caching if on) and $1.440 cost per 20-turn session, with a per-model comparison chart below.]

The gap widens with input size. At 500k tokens per turn, cheap models with small context windows can't even serve the request — they appear faded because the input has been clamped to their ceiling.

Three things worth noticing as you play with it:

  • With prompt caching on, growing the input mostly adds cheap cache reads; without caching, every extra token is re-billed at full price on every single turn. Caching is the single biggest lever.
  • Small-context models disappear off the chart as you push past 128k — they simply can't run. That is the cliff they hit in production.
  • The cost-per-turn gap between Sonnet and GPT-5.4 is modest; the gap between either of them and DeepSeek V3.2 is an order of magnitude.