RAG and context engineering

When retrieval beats long context, and when it doesn’t.

The three ways to put knowledge into an agent

When an agent needs information it does not already have, there are three levers:

  • Long context — fit the information into the prompt directly.
  • RAG (retrieval-augmented generation) — fetch relevant documents at query time and inject them into the prompt.
  • Fine-tuning — bake the information into the model's weights.

These are not competing religions. They are complementary tools, and the right choice depends on three variables: volatility of the information, size of the corpus, and latency budget.

When long context wins

Context windows keep growing. If the knowledge relevant to a single task fits in the context window — and if it is small enough that the per-call cost is acceptable — you often do not need RAG. Just paste the documents in.

Long context wins when:

  • The corpus is small (say, under 200K tokens).
  • The information is short-lived (you update it between calls).
  • A retrieval step would add more latency than simply including everything costs in tokens.
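In code, the long-context approach is barely an approach at all: concatenate the documents and check they fit. A minimal sketch, where the 200K budget and the ~4-characters-per-token estimate are rough assumptions, not provider guarantees:

```python
# Long-context sketch: if the whole corpus fits under a token budget,
# skip retrieval entirely and paste everything into the prompt.

def build_prompt(question: str, documents: list[str],
                 budget_tokens: int = 200_000) -> str:
    corpus = "\n\n---\n\n".join(documents)
    est_tokens = len(corpus) // 4  # rough heuristic: ~4 chars per token
    if est_tokens > budget_tokens:
        raise ValueError(
            f"Corpus is ~{est_tokens} tokens; over budget — consider RAG"
        )
    return (
        "Answer using only the documents below.\n\n"
        f"{corpus}\n\nQuestion: {question}"
    )

prompt = build_prompt(
    "What is our refund policy?",
    ["Refunds are issued within 14 days of purchase."],
)
```

In a real system you would use the provider's tokenizer rather than a character heuristic, but the shape of the decision is the same: one size check, then paste.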

When RAG wins

RAG is the default for most production agents. You keep a vector index (or a text search index, or both) of your corpus, and at query time you retrieve the top-k most relevant chunks and inject them into the prompt.

RAG wins when:

  • The corpus is large (cannot fit in context).
  • The corpus is frequently updated (you want to re-index, not re-fine-tune).
  • You need citations back to specific sources (essential for groundedness).
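The retrieve-then-inject loop above can be sketched end to end. Here `embed` is a toy bag-of-words stand-in for a real embedding model, and the numbered-source prompt format is one common convention for making citations possible, not a fixed standard:

```python
# RAG sketch: rank chunks by cosine similarity to the query,
# keep the top k, and inject them into the prompt with source numbers.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. Swap in a real embedding model here.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def rag_prompt(query: str, chunks: list[str], k: int = 3) -> str:
    context = "\n".join(
        f"[{i + 1}] {c}" for i, c in enumerate(retrieve(query, chunks, k))
    )
    return f"Answer using the numbered sources and cite them:\n{context}\n\nQ: {query}"
```

A production version replaces `embed` with an embedding API and the sorted list with a vector index, but the flow — embed, rank, take top-k, inject with identifiers — is exactly this.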

When fine-tuning wins

Fine-tuning bakes knowledge into the model's weights. It is the most expensive option to set up and the hardest to change.

Fine-tuning wins when:

  • The knowledge is style, not fact — how your team writes, how your product speaks.
  • Latency of retrieval is unacceptable at your scale.
  • The behaviour needs to be consistent across thousands of turns without retrieval noise.
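What "style, not fact" looks like in practice is a training set of input/output pairs that demonstrate voice. A sketch of such data in chat-style JSONL — the exact schema (the `messages`/`role`/`content` keys) varies by provider, so treat this as illustrative and check your provider's fine-tuning docs:

```python
# Sketch of style fine-tuning data: pairs that teach *how* to say things,
# not *what* is true. Facts belong in RAG; this is tone and phrasing.
import json

examples = [
    {"messages": [
        {"role": "user", "content": "Tell the user their export failed."},
        {"role": "assistant",
         "content": "Your export didn't go through. We're on it — "
                    "try again in a few minutes."},
    ]},
]

with open("style_tuning.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Note what is absent: no product facts, no prices, no dates. Those change; the voice does not, which is why the voice can safely live in the weights.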

The pyramid

Start at the bottom and climb only when evaluation tells you to.

  1. System prompt — rules and tone go here.
  2. Skills — domain knowledge and decision frameworks.
  3. RAG — large corpora and cite-able sources.
  4. Long context — one-off large inputs (whole documents, whole meetings).
  5. Fine-tuning — only if everything above has plateaued.

Context engineering

"Context engineering" is the practice of shaping what the model sees at each turn. It includes:

  • Summarisation — rolling summaries of long conversations to avoid blowing the window.
  • Retrieval re-ranking — using a cheap model (Haiku) to reorder retrieved chunks by relevance before passing them to the expensive model.
  • Tool response compression — tools that return summaries, not raw dumps.
  • Lossy selection — at each turn, deciding which memory and which skills to load.
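The last item, lossy selection, reduces to a budgeting problem: given more candidate context than fits, keep the highest-priority pieces. A minimal sketch, where the priority scores and the character-based token estimate are placeholder assumptions:

```python
# Lossy selection sketch: greedily pack the highest-priority context
# items (memories, skills, retrieved chunks) into a per-turn budget.

def select_context(items: list[tuple[float, str]],
                   budget_tokens: int) -> list[str]:
    """items are (priority, text) pairs; higher priority wins."""
    chosen, used = [], 0
    for priority, text in sorted(items, key=lambda it: it[0], reverse=True):
        cost = len(text) // 4  # rough heuristic: ~4 chars per token
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return chosen
```

Real systems score priority with recency, relevance, or a cheap model rather than hand-set numbers, but every context engineering stack contains some version of this loop: score, sort, pack, drop the rest.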

This is the unglamorous part of agent engineering, and the part that moves the quality needle most once you're past the prototype stage.