Tools and MCP
What an agent can DO, and how it connects.
Tools are what the agent can DO
A model without tools can only produce text. An agent with tools can query systems, run code, send messages, make payments. Tools are how the model's reasoning touches reality.
Every tool deserves a risk rating. The simplest rating is three tiers:
- Read-only (low) — the tool can fetch, query, search, look up. No side effects. Examples: `getCalendarEvents`, `searchDocs`, `lookupAccount`.
- Write/modify (medium) — the tool changes state, but within a bounded, reversible scope. Examples: `createDraftTicket`, `updateTag`, `postToScratchChannel`.
- Irreversible (high) — the tool has real-world consequences that cannot be undone cheaply. Examples: `sendEmail`, `makePayment`, `deleteRecord`, `publishPost`.
The risk rating of an agent's action space is the maximum risk of any of its tools, not the average. One irreversible tool makes the whole agent a high-risk system.
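The max-not-average rule is simple enough to encode directly. A minimal sketch (the type and function names here are illustrative, not from any real SDK):

```typescript
// Hypothetical tool metadata; only the max-risk logic matters here.
type Risk = "low" | "medium" | "high";
const riskOrder: Record<Risk, number> = { low: 0, medium: 1, high: 2 };

interface ToolMeta {
  name: string;
  risk: Risk;
}

// The agent's risk is the MAXIMUM risk of any tool it can reach,
// not the average: one irreversible tool makes the whole agent high-risk.
function agentRisk(tools: ToolMeta[]): Risk {
  return tools.reduce<Risk>(
    (worst, t) => (riskOrder[t.risk] > riskOrder[worst] ? t.risk : worst),
    "low",
  );
}

const exampleTools: ToolMeta[] = [
  { name: "getCalendarEvents", risk: "low" },
  { name: "createDraftTicket", risk: "medium" },
  { name: "sendEmail", risk: "high" }, // this one dominates the whole action space
];
```

With this in place, `agentRisk(exampleTools)` comes back `"high"` even though two of the three tools are benign, which is exactly the point.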
Model Context Protocol (MCP)
MCP is the Model Context Protocol, an open standard Anthropic introduced in November 2024. Think of it as USB-C for AI: one protocol, many connections. An MCP server exposes resources (files, database rows, search indices) and tools (functions the agent can call); an MCP client (the agent runtime) discovers and invokes them uniformly.
Why this matters for a product team:
- Portability. A tool you expose via MCP can be consumed by Claude, by Cursor, by VS Code, by any MCP-aware host. You are not locked into a vendor's tool format.
- Provenance. MCP servers are explicit, named, and versioned. An agent that calls an MCP server has a traceable line: which server, which tool, which version.
- Governance surface. MCP gives you a natural boundary at which to enforce authorisation, logging, and rate-limits — between the agent and the MCP server, not buried inside each tool implementation.
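That governance boundary can be made concrete as a wrapper around the call path between the agent and a server. The sketch below is a hypothetical middleware, not part of the MCP SDK; the names (`withGovernance`, `ToolCall`) are assumptions for illustration:

```typescript
// Hypothetical governance layer at the agent <-> MCP-server boundary.
type ToolCall = { server: string; tool: string; args: unknown };
type Handler = (call: ToolCall) => Promise<unknown>;

function withGovernance(
  inner: Handler,
  opts: {
    allow: (call: ToolCall) => boolean; // authorisation policy
    log: (entry: object) => void;       // audit trail: which server, which tool
    maxCallsPerMinute: number;          // rate limit
  },
): Handler {
  let timestamps: number[] = [];
  return async (call) => {
    const now = Date.now();
    timestamps = timestamps.filter((t) => now - t < 60_000);
    if (timestamps.length >= opts.maxCallsPerMinute) {
      throw new Error(`rate limit exceeded for ${call.server}`);
    }
    if (!opts.allow(call)) {
      throw new Error(`not authorised: ${call.server}/${call.tool}`);
    }
    timestamps.push(now);
    opts.log({ at: now, server: call.server, tool: call.tool });
    return inner(call);
  };
}
```

The useful property is that authorisation, logging, and rate-limiting live in one place, at the protocol boundary, rather than being re-implemented inside every tool.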
Building and consuming tools
A tool definition has four parts: a name, a description (the model reads this to decide when to call it), an input schema (typed arguments), and an implementation (the function that runs). Good descriptions are disproportionately important — a model that is uncertain about when to use a tool will either hesitate or misuse it.
```typescript
import { z } from "zod";

tool({
  name: "lookupAccount",
  description:
    "Retrieve an account's profile by email. Use this only when the user has given their email and you need account details to answer. Returns null for unknown emails.",
  inputSchema: z.object({ email: z.string().email() }),
  execute: async ({ email }) => database.findAccountByEmail(email),
});
```
How many tools? And should one do everything?
The honest answer is "it depends" — but the wrong answers are easier to name. Three anti-patterns show up repeatedly in practice:
1. Too many tools in one agent
An agent with 30 tools tends to pick worse than an agent with 8. Every extra tool adds description tokens, dilutes the model's attention when selecting, and increases the chance it picks a plausible-but-wrong one. Past roughly 15–20 tools, most agents start paying a measurable accuracy tax on every turn. The symptom is subtle: not total failure, just a rising rate of "almost right" tool calls on routine tasks.
2. The swiss-army knife tool
The opposite temptation is to collapse twenty tools into one generic tool: `executeSQL`, `runCommand`, `callService(endpoint, method, payload)`. You get one tool to maintain instead of twenty, and the description stays short. But you have only moved the discrimination problem inside the tool: the model now has to compose the right arguments without schema guardrails, without per-tool descriptions, and without a human checkpoint calibrated to what it is actually about to do. The result is hard to test, hard to evaluate, and hard to rate-limit.
Worst of all, the tool's blast radius becomes the maximum of everything it can reach. A single high-risk action in the reachable set turns the whole tool into a high-risk tool.
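The difference shows up concretely in what each shape can reject. A contrast sketch (both functions are hypothetical; the point is where validation lives):

```typescript
// Swiss-army knife: accepts anything, so nothing can be rejected up front.
function callService(endpoint: string, method: string, payload: unknown): string {
  return `${method} ${endpoint}`; // stand-in for an arbitrary dispatch
}

// Narrow tool: a per-tool schema catches a malformed call
// before it reaches the real system.
function refundPayment(args: { paymentId: string; amountCents: number }): string {
  if (!/^pay_[a-z0-9]+$/.test(args.paymentId)) throw new Error("bad paymentId");
  if (!Number.isInteger(args.amountCents) || args.amountCents <= 0) {
    throw new Error("bad amount");
  }
  return `refund ${args.amountCents} on ${args.paymentId}`;
}
```

A negative refund sails straight through `callService` and fails (or worse, succeeds) somewhere downstream; `refundPayment` rejects it before any side effect, and that check is exactly the kind of thing you can unit-test, evaluate, and rate-limit per tool.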
3. Too few tools
The under-tooled agent is quieter but just as dangerous. It falls back to hallucinating data it should have fetched, asks the user for IDs it should have looked up, or invents workarounds that look correct until they don't. The floor matters as much as the ceiling.
Case study: the agent that got the right answer by cheating
In the early development of an agent for analysing performance profile data, a prototype was wired up with a set of investigative tools (query the data, sample rows, inspect metrics) and asked to diagnose a performance problem. It returned the right answer. Excellent. Except the trace showed it hadn't called a single tool! It had read the surrounding code, spotted a suspicious pattern, and guessed correctly.
Right answer, wrong process. On the next case — with data the agent couldn't shortcut by reading code — it would have invented something plausible and been confidently wrong. The toolkit is a hypothesis about how the agent will reason; if the agent ignores it and still succeeds, you haven't built the agent you think you have.
The lesson: check that tools are being used the way you expect, not just that the final answers look right. Tool-call trajectories belong in your eval suite alongside outcomes, and a "pass" on output with zero tool calls should usually be a fail.
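That rule, an output "pass" with zero tool calls is a fail, is easy to enforce mechanically. A minimal sketch of a trajectory-aware eval check (the trace shape and names are assumptions, not from any eval framework):

```typescript
// Hypothetical trace format: a flat list of agent steps.
interface TraceStep {
  type: "tool_call" | "message";
  name?: string; // tool name, when type === "tool_call"
}

// Grade a case on both the outcome AND the tool-call trajectory.
function evalCase(opts: {
  trace: TraceStep[];
  outputCorrect: boolean;
  requiredTools: string[]; // tools this task is expected to exercise
}): "pass" | "fail" {
  const called = new Set(
    opts.trace.filter((s) => s.type === "tool_call").map((s) => s.name),
  );
  if (!opts.outputCorrect) return "fail";
  if (called.size === 0) return "fail"; // right answer, wrong process
  if (!opts.requiredTools.every((t) => called.has(t))) return "fail";
  return "pass";
}
```

The cheating prototype above would have been caught on day one: correct output, empty tool-call set, verdict "fail".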
Many tools, split across agents
The genuinely big toolbox problem is almost always solved by splitting, not by squeezing. You can have a large surface area of tools overall — dozens, hundreds even — as long as each individual agent only ever sees the subset it needs.
The standard pattern: keep the top-level agent's tool list small, and move the long tail into
specialist subagents. Each subagent has its own narrow toolkit, its own system prompt, its
own eval harness. From the top-level agent's perspective a subagent is a single delegate tool:
"ask the invoices subagent to do X". From the invoices subagent's perspective, its whole prompt
is about invoicing and it has exactly the six tools it needs.
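From the top-level agent's side, the whole subagent collapses into one tool. A sketch of that delegate wrapper (all names hypothetical; the subagent here is a stub standing in for a full agent loop with its own prompt, model, and toolkit):

```typescript
type Tool = {
  name: string;
  execute: (args: { task: string }) => Promise<string>;
};

// Wrap a subagent runtime as a single delegate tool.
function makeDelegateTool(
  subagentName: string,
  run: (task: string) => Promise<string>,
): Tool {
  return {
    name: `ask_${subagentName}`,
    execute: ({ task }) => run(task),
  };
}

// Stub: in practice this is a full agent with its own narrow toolkit.
async function runInvoicesSubagent(task: string): Promise<string> {
  return `invoices subagent handled: ${task}`;
}

const topLevelTools: Tool[] = [
  makeDelegateTool("invoices", runInvoicesSubagent),
  // ...a handful of other delegates, instead of dozens of raw tools
];
```

The top-level agent's tool list stays short and distinct, while the invoicing tools (and their risk) exist only inside the subagent.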
In a setup like this, the top-level Guide Agent can run on a cheaper model with only two tools, while the billing and issue specialists carry the domain-specific toolkits. The orchestrator itself has no business tools at all, only the routing logic.
This pattern scales cleanly:
- Selection accuracy stays high. The user-facing agent has a short, distinct tool list. Each subagent does too.
- Evaluation gets tractable. You can build golden datasets for each subagent on its narrow job, rather than trying to evaluate a 40-tool generalist.
- Governance gets clearer. Each subagent can have a different model, a different authority level, and a different set of guardrails. A research subagent can read broadly; an action subagent can be tightly scoped and gated by HITL.
- Blast radius shrinks. The irreversible tools live inside one narrow subagent, not spread across the top-level prompt where any turn might reach them.
A useful mental model: a tool is what the agent can do; a subagent is a colleague with a different job. You hire a specialist when the generalist starts dropping things — and the same logic applies to agents. The design is covered in more depth in Orchestration.