The Boundary with Prompt Engineering — How to Tune Each of the 5 Context Layers
Chapter 1 listed the 5 context layers. This chapter takes each apart: what goes in, how many tokens it eats, how to debug in isolation, how to find the culprit.
Every context engineering technique (rerank, memory, sub-agents) is surgery on one of these layers.
What the 5 Layers Actually Look Like
Below is the full message that JR omni-report's daily-jobs routine sends to Claude right after scraping LinkedIn:

# tested: 2026-04-26 · anthropic@0.40.0 · model: claude-sonnet-4-6
import anthropic

client = anthropic.Anthropic()

client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    # === Layer 1: System instruction ===
    system=[
        {
            "type": "text",
            "text": "You are JR Academy's daily-jobs picker...",  # ~800 tokens
            "cache_control": {"type": "ephemeral"}  # 5-min-TTL prompt cache
        }
    ],
    # === Layer 2: Tool definitions ===
    tools=[
        {"name": "WebFetch", "description": "...", "input_schema": {...}},  # ~200 tokens
        {"name": "Write", "description": "...", "input_schema": {...}},  # ~150 tokens
        # 6 tools in total ≈ 1100 tokens
    ],
    messages=[
        # === Layer 3: Memory / chat history ===
        {"role": "user", "content": "Job IDs already picked in the previous 7 days..."},  # ~300 tokens
        {"role": "assistant", "content": "Noted."},
        # === Layer 4: Retrieved context ===
        {"role": "user", "content": [
            {"type": "text", "text": "Full text of the 30 jobs scraped from LinkedIn:"},
            {"type": "text", "text": "<job1>...</job1><job2>...</job2>"}  # ~12000 tokens
        ]},
        # === Layer 5: User input (the actual instruction for this task) ===
        {"role": "user", "content": "From the 30 jobs above, pick 3: "
            "1 aspirational / 1 actionable / 1 special. "
            "Output JSON strictly matching the ${OUT} file schema."}  # ~80 tokens
    ]
)
Total context here lands around 14500 tokens. Prompt engineering only cares about that last user input — 80 tokens, 0.5%. The other 99.5% is context engineering territory.
Layer by Layer
Layer 1: System instruction — Role + Invariant Rules
What goes in: model role, rules that never change (output format, prohibited actions, token limits), long descriptions worth caching (reference docs, style guides).
Tune: must stay stable; any edit invalidates the cache entirely. Anthropic's prompt cache has a 5-minute TTL, and a cache hit requires the system block to be byte-identical (docs). Add cache_control: {"type": "ephemeral"} and cached system tokens are billed at 10% of the base input rate on reused calls.
Debug: run system alone plus a minimal user input and check whether the model acts in role. If changing system leaves the output identical, system isn't taking effect; usually it was placed in the wrong slot (content that belongs in system ended up in a user message).
Layer 2: Tool definitions — The Tool Menu
What goes in: each function calling / MCP tool's name + description + input_schema, including "when to use" (write into description).
Tune: each tool schema costs 100-300 tokens. 15 tools ≈ 3000 tokens, all fighting for context space. A Claude Code setup on MCP can pull in 17 servers × N tools, which is why most tools are deferred (loaded on demand via ToolSearch).
Debug: cut the tool list down to 1 and see whether the model still calls tools at random. If it does, the description isn't clear enough and the model is "guessing" (classic context pollution).
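A single-tool ablation can be scripted with a helper like this (the helper name and placeholder schemas are ours, not part of the JR pipeline); each variant request exposes exactly one tool, so you can see which description the model actually keys on:

```python
# Hypothetical ablation helper: build one request per tool so each call
# exposes exactly one tool definition. Tool names/schemas are placeholders.
def single_tool_variants(base_request: dict, tools: list[dict]) -> list[dict]:
    """One copy of base_request per tool, with only that tool attached."""
    variants = []
    for tool in tools:
        req = dict(base_request)  # shallow copy: we only swap the tools list
        req["tools"] = [tool]
        variants.append(req)
    return variants

tools = [
    {"name": "WebFetch", "description": "Fetch a URL", "input_schema": {"type": "object"}},
    {"name": "Write", "description": "Write a file", "input_schema": {"type": "object"}},
]
base = {"model": "claude-sonnet-4-6", "max_tokens": 1024,
        "messages": [{"role": "user", "content": "Fetch https://example.com"}]}

for req in single_tool_variants(base, tools):
    print(req["tools"][0]["name"])  # send each variant and watch which tool gets called
```

If a variant with only WebFetch still triggers a Write-style guess, the problem is in the description, not the model.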
Layer 3: Memory / chat history — What Happened Before
What goes in: previous N turns (usually summarized), scratchpad of agent task steps, long-term memory (user preferences).
Tune: the layer where the context budget blows up first. 100 turns × 1500 tokens = 150K tokens, eating 75% of a 200K window. It needs the three-layer memory architecture (Chapter 6): scratchpad / working / persistent, each with its own compression strategy.
Debug: cut history to last turn. Can model still finish? Yes — history was redundant. No — you need better summarization, not deletion.
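The "last turn only" cut can be sketched as a small helper (the function name is ours); it keeps only the trailing user turn so you can re-run the task against it:

```python
# Hypothetical helper: truncate chat history to the final user turn for a
# "did the history matter?" ablation. All earlier turns are simply dropped.
def last_turn_only(messages: list[dict]) -> list[dict]:
    for i in range(len(messages) - 1, -1, -1):
        if messages[i]["role"] == "user":
            return messages[i:]
    return messages  # no user turn found; leave the history untouched

history = [
    {"role": "user", "content": "Job IDs picked in the previous 7 days..."},
    {"role": "assistant", "content": "Noted."},
    {"role": "user", "content": "Pick 3 of the 30 jobs above."},
]
print(last_turn_only(history))  # only the final user turn survives
```

Run the task on the truncated history: same result means the history was redundant; a worse result means you need summarization, not deletion.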
Layer 4: Retrieved context — Facts Stuffed in on the Fly
What goes in: RAG passages, WebFetch page content, prior tool call outputs.
Tune: the biggest token volume and the most volatility. 20 passages × 500 chars = 10K tokens. Lost in the Middle (Liu et al. 2023, arXiv:2307.03172): documents placed in the middle of the context are the ones the model effectively cannot see (Chapter 3 covers this in detail).
Debug: inspect what model cited vs retrieved. 10 retrieved, 0 cited — selection failed, add rerank (Chapter 5).
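One way to make "cited vs retrieved" measurable is a naive substring check over document IDs (a simplifying assumption; production systems usually ask the model to emit explicit citation tags):

```python
# Hypothetical citation-rate check: which retrieved doc IDs actually show up
# in the model's answer? A rate near 0 points at selection, not generation.
def citation_rate(answer: str, retrieved_ids: list[str]) -> float:
    cited = [doc_id for doc_id in retrieved_ids if doc_id in answer]
    return len(cited) / len(retrieved_ids) if retrieved_ids else 0.0

retrieved = ["job-101", "job-102", "job-103", "job-104"]
answer = "Top pick: job-102 (actionable), runner-up: job-104 (special)."
print(citation_rate(answer, retrieved))  # → 0.5
```

Track this per request: a consistently low rate is the signal to add a rerank stage rather than keep rewriting the prompt.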
Layer 5: User input — This Turn's Actual Instruction
What goes in: what the user typed + task goal + output requirements.
Tune: traditional prompt engineering turf. Difference from the other 4: changes every call, can't be cached.
Debug: lock the first 4 layers, change user input wording, watch output shift — standard prompt engineering A/B setup.
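That A/B setup can be expressed as freezing the first four layers in one dict and only swapping the final user message (a sketch; the frozen fields mirror the request at the top of the chapter):

```python
import copy

# Hypothetical A/B helper: the first 4 layers live in `frozen`; each variant
# appends a different final user message, so any output shift is attributable
# to the wording alone.
def ab_requests(frozen: dict, user_variants: list[str]) -> list[dict]:
    requests = []
    for text in user_variants:
        req = copy.deepcopy(frozen)  # never mutate the frozen layers
        req["messages"].append({"role": "user", "content": text})
        requests.append(req)
    return requests

frozen = {
    "model": "claude-sonnet-4-6",
    "system": [{"type": "text", "text": "role + invariant rules"}],       # Layer 1
    "tools": [],                                                          # Layer 2
    "messages": [{"role": "user", "content": "retrieved jobs..."}],       # Layers 3-4
}
variants = ["Pick 3 jobs.", "Pick exactly 3 jobs; output JSON only."]

for req in ab_requests(frozen, variants):
    print(req["messages"][-1]["content"])
```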
Anthropic's 5 Agent Patterns = 5 Kinds of Context Engineering
Building Effective Agents (Anthropic, 2024-12-20) breaks agentic systems into 5 patterns:
| Anthropic pattern | Context engineering decision |
|---|---|
| Prompt chaining | Big task into multiple calls, each only sees what it needs |
| Routing | Cheap model classifies → picks template → hands to expensive |
| Parallelization | Multiple LLMs run, results merge |
| Orchestrator-workers | Main agent splits task, sub-agents do pieces |
| Evaluator-optimizer | One generates, one evaluates, loop |
The engineering decision behind every pattern isn't prompt wording; it's how to slice, pass, and collect context.
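As an illustration, the Routing row might reduce to something like this sketch (the labels, templates, and keyword classifier are invented; a real router would use a cheap model for the classify step):

```python
# Hypothetical router: a cheap classifier picks a prompt template, and only
# the chosen template's context ever reaches the expensive model.
TEMPLATES = {
    "billing": "You are a billing specialist. Refund policy: ...",
    "technical": "You are a support engineer. Debug steps: ...",
}

def classify(query: str) -> str:
    # Stand-in for a cheap-model call; keyword rules keep the sketch runnable.
    return "billing" if "refund" in query.lower() else "technical"

def route(query: str) -> dict:
    label = classify(query)
    # Request destined for the expensive model: one template, not all of them.
    return {"system": TEMPLATES[label], "user": query}

print(route("I want a refund")["system"])
```

The context decision is visible in `route`: the expensive call sees one template's worth of context instead of every template concatenated.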
JR Real Case: classroom-deck-builder skill
JR's classroom-deck-builder skill (.claude/skills/classroom-deck-builder/) compiles a Quest lesson into a "live class" (slide + voiceover + teaching gestures).
It started in "one-shot mode": one LLM call generated all N slides plus voiceover. The slide-to-slide narrative broke and the voiceover tone was inconsistent, so it was rewritten into two stages:
Stage 1: outline stage
Context = lesson goal + style guide
→ outputs N SceneOutlines (title + one-line direction only)
[human reviews the outline; scenes can be edited or deleted]
Stage 2: finalize stage (SSE streaming)
Context = lesson goal + style guide + the full approved outline
→ generates the complete slide + voiceover, scene by scene
Stage 2 packs in one extra layer that Stage 1 lacks — "the full approved outline". Each scene knows its position in the overall narrative. One-shot mode lacks that, so narrative breaks.
Textbook context engineering: don't change prompt wording. Change how context flows between LLM calls.
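Stripped to its control flow, the two-stage pipeline looks roughly like this (the `llm` callable and the auto-approve step are stand-ins for the real model call and human review; the point is what context each call sees):

```python
# Hypothetical two-stage pipeline: stage 2's context includes the full
# approved outline, which is exactly what one-shot mode lacked.
def build_deck(lesson_goal: str, style_guide: str, llm) -> list[str]:
    # Stage 1: outline only (titles + one-line directions)
    outline = llm(f"{lesson_goal}\n{style_guide}\nProduce scene outlines.")
    approved = [s for s in outline if s]  # stand-in for the human review step

    # Stage 2: each scene sees the lesson goal, style guide, AND the whole outline
    slides = []
    for i, scene in enumerate(approved):
        ctx = f"{lesson_goal}\n{style_guide}\nFull outline: {approved}\nScene {i}: {scene}"
        slides.append(llm(ctx))
    return slides

# Fake model so the sketch runs offline: returns an outline for stage 1,
# a slide marker for stage 2.
fake_llm = lambda ctx: ["intro", "demo", "recap"] if "outlines" in ctx else f"slide<{len(ctx)}>"
print(len(build_deck("Teach SQL joins", "JR style", fake_llm)))  # → 3
```

Every stage-2 call carries the approved outline, so each scene knows its place in the narrative.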
Single-Layer Context vs Layered Context — Trade-off
Most people dump all 5 layers into one prompt string in their first LLM app. It runs but debugging is hell. Layered processing needs more engineering but you can tune each layer independently.
| Dimension | Single-layer mixed prompt | Layered context |
|---|---|---|
| Write | One string format, 30 lines | ContextBuilder class, 100+ lines |
| Debug | Errors = rewrite whole thing | Mock any single layer |
| Token cost | No prompt cache | First 2 layers cacheable, 60-90% saved |
| Multi-person | One edit affects everyone | system / tools have own owners |
| Use when | Single-task scripts | Production LLM apps |
| Don't use when | Complex agentic | One-shot demos |
JR rule: more than 3 LLM calls or more than one maintainer — must be layered.
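A minimal sketch of such a layered builder (the class name comes from the table above; the method names are our invention) shows why mocking any single layer becomes trivial:

```python
# Minimal layered builder: each of the 5 layers is a separate field, so any
# one of them can be mocked or swapped in a test without touching the rest.
class ContextBuilder:
    def __init__(self):
        self.system = None     # Layer 1
        self.tools = []        # Layer 2
        self.memory = []       # Layer 3
        self.retrieved = []    # Layer 4

    def with_system(self, text: str):
        self.system = [{"type": "text", "text": text}]
        return self

    def with_tools(self, tools: list[dict]):
        self.tools = tools
        return self

    def with_memory(self, turns: list[dict]):
        self.memory = turns
        return self

    def with_retrieved(self, docs: list[str]):
        self.retrieved = [{"role": "user", "content": "\n".join(docs)}]
        return self

    def build(self, user_input: str) -> dict:  # Layer 5 arrives last
        return {
            "system": self.system,
            "tools": self.tools,
            "messages": self.memory + self.retrieved
                        + [{"role": "user", "content": user_input}],
        }

req = (ContextBuilder()
       .with_system("daily-jobs picker")
       .with_retrieved(["<job1>...</job1>"])
       .build("Pick 3 jobs."))
print(len(req["messages"]))  # → 2
```

In a test you call `.with_retrieved(["fixture doc"])` and leave the other layers at their defaults; that is the "mock any single layer" column of the table in practice.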
Takeaway
Prompt engineering cares about 1 / 5 (user input). Context engineering cares about 5 / 5 — plus how layers pass, cache, isolate. 80% of production LLM app engineering lives in the first 4 layers.
References
- Anthropic. (2024-12-20). Building Effective Agents.
- Anthropic. Prompt caching docs — 5-minute TTL ephemeral cache.
- Anthropic. Cookbook — agent patterns.
- Anthropic. Tool use docs.
- Liu et al. (2023-07-06). Lost in the Middle. arXiv:2307.03172.
Production case: JR Academy classroom-deck-builder skill (.claude/skills/classroom-deck-builder/) — two-stage pipeline implements context isolation.
❓ FAQ
The most commonly searched questions on this chapter's topic.
What is the difference between a system prompt and a user prompt?
System holds the role plus the rules that never change; it is reused on every call and goes through the prompt cache, cutting its cost by 90%. User is this turn's task instruction and changes on every call. Mixing the two into one string makes debugging hell; in production they must be kept separate.
How many tokens do system / tools / memory / retrieval / user each take in my context?
A typical production RAG breakdown: system 800 + tools 1100 + memory 300 + retrieval 12000 + user 80 = 14280 tokens. Measure each segment with the Anthropic SDK's token counter. The retrieval layer is always the biggest chunk.
What are Anthropic's 5 agent patterns?
The 5 patterns: Prompt chaining (split into steps), Routing (classify first, then pick a template), Parallelization (run in parallel, then merge), Orchestrator-workers (a main agent plus sub-agents), and Evaluator-optimizer (a generate-and-evaluate loop). Each one is an engineering decision about how to slice and pass context, not about prompt wording.
I'm already using the OpenAI Chat Completions API. Do the 5 context layers still apply?
Yes. The 5 layers are an engineering abstraction, not tied to any model: OpenAI's system role + tools parameter + messages history + retrieval injection + user message map one-to-one. The Gemini / DeepSeek / Qwen APIs use different field names but exactly the same layering; migrating is just a matter of swapping the SDK package.
For a simple use case like a customer-service bot, do I need all 5 layers?
No. A customer-service bot can ship with 3 layers: system (role + business rules) + retrieval (top-3 hits from the FAQ knowledge base) + user. Add the memory layer only if you need to remember customer preferences across sessions, and add the tools layer only when the bot has to act on orders or refunds. Ship with 3 layers first, then add layers as ticket reviews demand.
Which layer do beginners crash on most?
The retrieval layer. 80% of self-taught projects fail at "the relevant documents were recalled but the LLM still answered wrong": people assume the prompt is at fault and tweak the system wording, when the real problem is that retrieval did no selection (filter / rerank / judge). Chapters 3 and 5 cover this in depth.