02

The Boundary with Prompt Engineering — Tuning Each of the 5 Context Layers

⏱️ 20 min


Chapter 1 listed the 5 context layers. This chapter takes each one apart: what goes in, how many tokens it eats, how to debug it in isolation, and how to find the culprit when output degrades.

Every context engineering technique (rerank, memory, sub-agents) is surgery on one of these layers.

What the 5 Layers Actually Look Like

Below is the full message JR omni-report's daily-jobs routine sends to Claude right after scraping LinkedIn —

# tested: 2026-04-26 · anthropic@0.40.0 · model: claude-sonnet-4-6
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,

    # === Layer 1: System instruction ===
    system=[
        {
            "type": "text",
            "text": "You are JR Academy's daily-jobs picker...",  # ~800 tokens
            "cache_control": {"type": "ephemeral"}  # 5-minute-TTL prompt cache
        }
    ],

    # === Layer 2: Tool definitions ===
    tools=[
        {"name": "WebFetch", "description": "...", "input_schema": {...}},  # ~200 tokens
        {"name": "Write",    "description": "...", "input_schema": {...}},  # ~150 tokens
        # 6 tools total ≈ 1100 tokens
    ],

    messages=[
        # === Layer 3: Memory / chat history ===
        {"role": "user",      "content": "Job IDs already picked in the past 7 days..."},  # ~300 tokens
        {"role": "assistant", "content": "Noted."},

        # === Layer 4: Retrieved context ===
        {"role": "user", "content": [
            {"type": "text", "text": "Full text of the 30 jobs scraped from LinkedIn:"},
            {"type": "text", "text": "<job1>...</job1><job2>...</job2>"}  # ~12000 tokens
        ]},

        # === Layer 5: User input (the actual instruction for this task) ===
        {"role": "user", "content": "Pick 3 of the 30 jobs above: "
                                    "1 aspirational / 1 actionable / 1 special. "
                                    "Output JSON strictly matching the ${OUT} file schema."}  # ~80 tokens
    ]
)

Total context here lands around 14,500 tokens. Prompt engineering only touches that last user input — 80 tokens, about 0.5%. The other 99.5% is context engineering territory.
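To see where those tokens go, a per-layer breakdown can be scripted. The sketch below uses a crude ~4-characters-per-token heuristic (an assumption, not the real tokenizer) — for exact counts the Anthropic SDK exposes client.messages.count_tokens.

```python
# Rough per-layer token accounting for a layered request.
# The ~4-chars-per-token ratio is a heuristic for English text;
# for exact numbers use the Anthropic SDK's client.messages.count_tokens.

def rough_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def layer_report(layers: dict[str, str]) -> dict[str, float]:
    """Return each layer's share of the total estimated tokens, in percent."""
    counts = {name: rough_tokens(text) for name, text in layers.items()}
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

# Stand-in payloads sized to match the example above.
report = layer_report({
    "system":    "x" * 3200,    # ~800 tokens
    "tools":     "x" * 4400,    # ~1100 tokens
    "memory":    "x" * 1200,    # ~300 tokens
    "retrieval": "x" * 48000,   # ~12000 tokens
    "user":      "x" * 320,     # ~80 tokens
})
print(report)  # retrieval dominates; user input is well under 1%
```

Running this on your own layers makes the 99.5% / 0.5% split concrete before any optimization work.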

Layer by Layer

Layer 1: System instruction — Role + Invariant Rules

What goes in: model role, rules that never change (output format, prohibited actions, token limits), long descriptions worth caching (reference docs, style guides).

Tune: must stay stable — any change invalidates the cache entirely. Anthropic's prompt cache has a 5-minute TTL, and a cache hit requires the system block to be byte-identical (docs). Add cache_control: ephemeral and reads of the cached system block are billed at 10% of the normal input price.

Debug: system alone + minimal user input; see if the model acts in role. If changing system leaves the output identical — system isn't taking effect, usually because it was placed wrong (content that belongs in system ended up in a user message).

Layer 2: Tool definitions — The Tool Menu

What goes in: each function calling / MCP tool's name + description + input_schema, including "when to use" (write into description).

Tune: each tool schema runs 100-300 tokens. 15 tools ≈ 3000 tokens, all fighting for context space. Claude Code over MCP can pull in 17 servers × N tools — that's why most tools are deferred (loaded on demand via ToolSearch).

Debug: cut tool list to 1, see if model still calls tools randomly. If yes — description isn't clear, model is "guessing" (classic context pollution).
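A cheap pre-check before that experiment is linting the tool descriptions themselves. The sketch below is a hypothetical heuristic — the 40-character floor and the "when to use" keyword check are assumptions, not official rules.

```python
# Heuristic lint for Layer 2: flag tool definitions likely to make the
# model "guess". Both thresholds below are assumptions, not rules.

def lint_tools(tools: list[dict]) -> list[str]:
    warnings = []
    for tool in tools:
        desc = tool.get("description", "")
        if len(desc) < 40:
            warnings.append(f"{tool['name']}: description too short to disambiguate")
        if "when" not in desc.lower():
            warnings.append(f"{tool['name']}: no 'when to use' guidance")
    return warnings

tools = [
    {"name": "WebFetch", "description": "Fetch a URL."},
    {"name": "Write", "description": "Write text to a file on disk. Use this when the task asks to persist output."},
]
for w in lint_tools(tools):
    print(w)  # flags WebFetch on both checks; Write passes
```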

Layer 3: Memory / chat history — What Happened Before

What goes in: previous N turns (usually summarized), scratchpad of agent task steps, long-term memory (user preferences).

Tune: the layer where the context budget blows up first. 100 turns × 1500 tokens = 150K, which eats 75% of a 200K window. It needs the three-layer memory architecture (Chapter 6): scratchpad / working / persistent, each with its own compression strategy.

Debug: cut history to last turn. Can model still finish? Yes — history was redundant. No — you need better summarization, not deletion.
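The "cut history to the last turn" experiment can be sketched as a pure function over Anthropic-style message dicts (the helper name last_turn_only is hypothetical):

```python
# Layer 3 debug sketch: keep only the most recent user turn and rerun
# the task with everything earlier removed.

def last_turn_only(history: list[dict]) -> list[dict]:
    """Drop everything before the most recent user turn."""
    for i in range(len(history) - 1, -1, -1):
        if history[i]["role"] == "user":
            return history[i:]
    return history  # no user turn found — leave unchanged

history = [
    {"role": "user", "content": "Job IDs picked in the past 7 days: ..."},
    {"role": "assistant", "content": "Noted."},
    {"role": "user", "content": "Pick 3 of the 30 jobs above."},
]
trimmed = last_turn_only(history)
print(len(trimmed))  # 1 — only the final user turn survives
```

If the model still finishes the task on `trimmed`, the dropped turns were redundant context.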

Layer 4: Retrieved context — Facts Stuffed in on the Fly

What goes in: RAG passages, WebFetch page content, prior tool call outputs.

Tune: the biggest token volume and the most volatility. 20 passages × 500 tokens = 10K tokens. Lost in the Middle (Liu et al. 2023, arXiv:2307.03172): the model effectively cannot see documents placed in the middle of the context (Chapter 3 covers the details).

Debug: inspect what model cited vs retrieved. 10 retrieved, 0 cited — selection failed, add rerank (Chapter 5).
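One way to quantify "cited vs retrieved" is a simple citation rate, assuming doc IDs appear verbatim in the answer as [docN] — a format assumption; adapt the check to your own citation convention.

```python
# Layer 4 debug sketch: what fraction of retrieved docs did the model
# actually cite? Assumes citations appear as [docN] in the answer text.

def citation_rate(retrieved_ids: list[str], answer: str) -> float:
    cited = {i for i in retrieved_ids if f"[{i}]" in answer}
    return len(cited) / len(retrieved_ids)

retrieved = [f"doc{i}" for i in range(1, 11)]   # 10 retrieved passages
answer = "Salary data comes from [doc2] and the visa notes from [doc7]."
rate = citation_rate(retrieved, answer)
print(rate)  # 0.2 — 8 of 10 passages were dead weight; consider rerank
```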

Layer 5: User input — This Turn's Actual Instruction

What goes in: what the user typed + task goal + output requirements.

Tune: traditional prompt engineering turf. Difference from the other 4: changes every call, can't be cached.

Debug: lock the first 4 layers, change user input wording, watch output shift — standard prompt engineering A/B setup.
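The locked-layers setup can be sketched as pure payload construction — the layer contents below are placeholders, not the real prompts:

```python
# Layer 5 A/B sketch: build two payloads identical in the first 4
# layers, differing only in the user-input wording.
import copy

BASE = {
    "model": "claude-sonnet-4-6",
    "system": [{"type": "text", "text": "role + invariant rules"}],
    "tools": [],
    "messages": [
        {"role": "user", "content": "retrieved context ..."},  # Layers 3-4 stand-in
    ],
}

def with_user_input(instruction: str) -> dict:
    payload = copy.deepcopy(BASE)
    payload["messages"].append({"role": "user", "content": instruction})
    return payload

a = with_user_input("Pick 3 jobs. Output JSON.")
b = with_user_input("Select exactly 3 jobs and reply with JSON only.")
# The first 4 layers are byte-identical across the two variants:
assert a["system"] == b["system"] and a["messages"][:-1] == b["messages"][:-1]
```

Any output difference between the two calls is then attributable to the user-input wording alone.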

Anthropic's 5 Agent Patterns = 5 Kinds of Context Engineering

Building Effective Agents (Anthropic, 2024-12-20) breaks agentic systems into 5 patterns:

| Anthropic pattern | Context engineering decision |
| --- | --- |
| Prompt chaining | Big task split into multiple calls; each only sees what it needs |
| Routing | Cheap model classifies → picks template → hands off to the expensive model |
| Parallelization | Multiple LLMs run; results are merged |
| Orchestrator-workers | Main agent splits the task; sub-agents do the pieces |
| Evaluator-optimizer | One generates, one evaluates, loop |

The engineering decision behind every pattern isn't prompt wording. It's how to slice, pass, and collect context.

JR Real Case: classroom-deck-builder skill

JR's classroom-deck-builder skill (.claude/skills/classroom-deck-builder/) compiles a Quest lesson into a "live class" (slide + voiceover + teaching gestures).

It started in "one-shot mode" — one LLM call generated N slides plus voiceover. Slide-to-slide narrative broke and the voiceover tone was inconsistent. It was rewritten into two stages:

Stage 1: outline stage
  Context = lesson goal + style guide
  → outputs N SceneOutlines (title + one-line instruction only)

[Human reviews the outline; edits and deletions allowed]

Stage 2: finalize stage (SSE streaming)
  Context = lesson goal + style guide + the full approved outline
  → generates the complete slide + voiceover, scene by scene

Stage 2 carries one layer Stage 1 lacks — the full approved outline — so each scene knows its position in the overall narrative. One-shot mode has no equivalent, which is why its narrative breaks.

Textbook context engineering: don't change prompt wording. Change how context flows between LLM calls.
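A minimal sketch of that flow, with hypothetical function names (the real implementation lives in .claude/skills/classroom-deck-builder/): each Stage 2 call receives the full approved outline plus its scene index.

```python
# Two-stage context flow sketch. Function and field names are
# hypothetical illustrations of the pipeline described above.

def stage1_context(lesson_goal: str, style_guide: str) -> str:
    """Stage 1 sees only the goal and the style guide."""
    return f"{lesson_goal}\n{style_guide}"

def stage2_context(lesson_goal: str, style_guide: str,
                   approved_outline: list[str], scene_index: int) -> str:
    # The key difference: every scene sees the FULL approved outline,
    # so it knows its position in the overall narrative.
    outline = "\n".join(f"{i+1}. {t}" for i, t in enumerate(approved_outline))
    return (f"{lesson_goal}\n{style_guide}\n"
            f"Approved outline:\n{outline}\n"
            f"Now write scene {scene_index + 1} in full.")

ctx = stage2_context("Teach Git branching", "casual, second person",
                     ["What is a branch", "Merge vs rebase", "Conflicts"], 1)
print("Merge vs rebase" in ctx)  # True — scene 2 sees its neighbours too
```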

Single-Layer Context vs Layered Context — Trade-off

Most people dump all 5 layers into one prompt string in their first LLM app. It runs, but debugging is hell. Layered processing takes more engineering, but you can tune each layer independently.

| Dimension | Single-layer mixed prompt | Layered context |
| --- | --- | --- |
| Writing | One format string, ~30 lines | ContextBuilder class, 100+ lines |
| Debugging | Any error means rewriting the whole thing | Mock any single layer |
| Token cost | No prompt cache | First 2 layers cacheable; 60-90% saved |
| Multiple maintainers | One edit affects everyone | system / tools have their own owners |
| Use when | Single-task scripts | Production LLM apps |
| Don't use when | Complex agentic systems | One-shot demos |

JR rule: more than 3 LLM calls or more than one maintainer — must be layered.
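What a minimal ContextBuilder might look like, assuming Anthropic's messages format — a sketch under those assumptions, not the class the table refers to:

```python
# Minimal ContextBuilder sketch: one setter per layer, so each layer
# can be mocked or tuned independently. Assumes Anthropic's messages
# format; class and method names are illustrative.

class ContextBuilder:
    def __init__(self) -> None:
        self.system: list[dict] = []     # Layer 1
        self.tools: list[dict] = []      # Layer 2
        self.memory: list[dict] = []     # Layer 3
        self.retrieval: list[dict] = []  # Layer 4
        self.user: dict | None = None    # Layer 5

    def set_system(self, text: str, cache: bool = True) -> "ContextBuilder":
        block = {"type": "text", "text": text}
        if cache:
            block["cache_control"] = {"type": "ephemeral"}
        self.system = [block]
        return self

    def set_user(self, instruction: str) -> "ContextBuilder":
        self.user = {"role": "user", "content": instruction}
        return self

    def build(self) -> dict:
        assert self.user is not None, "Layer 5 (user input) is mandatory"
        return {
            "system": self.system,
            "tools": self.tools,
            "messages": self.memory + self.retrieval + [self.user],
        }

payload = (ContextBuilder()
           .set_system("You are the daily-jobs picker.")
           .set_user("Pick 3 jobs.")
           .build())
print(list(payload))  # ['system', 'tools', 'messages']
```

Because each layer lives behind its own setter, a test can swap in a fake retrieval layer or an empty memory layer without touching the rest — the "mock any single layer" row in the table above.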

Takeaway

Prompt engineering cares about 1 of the 5 layers (user input). Context engineering cares about all 5 — plus how the layers are passed, cached, and isolated. 80% of production LLM app engineering lives in the first 4 layers.


References

  1. Anthropic. (2024-12-20). Building Effective Agents.
  2. Anthropic. Prompt caching documentation — 5-minute-TTL ephemeral cache.
  3. Anthropic. Cookbook — agent patterns.
  4. Anthropic. Tool use documentation.
  5. Liu, N. F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.

Production case: JR Academy classroom-deck-builder skill (.claude/skills/classroom-deck-builder/) — two-stage pipeline implements context isolation.


❓ Frequently Asked Questions

The most commonly searched questions on this chapter's topic.

What's the difference between the system prompt and the user prompt?

System holds the invariant rules plus the role; it is reused on every call and goes through the prompt cache, saving ~90% of cost. User is this turn's task instruction and changes every time. Mixing the two into one string makes debugging hell — in production they must be separated.

How many tokens do system / tools / memory / retrieval / user each take in my context?

A typical production RAG breakdown: system 800 + tools 1100 + memory 300 + retrieval 12000 + user 80 = 14280 tokens. Measure each segment with the Anthropic SDK's token counter. The retrieval layer is always the biggest chunk.

What are Anthropic's 5 agent patterns?

The 5 patterns: Prompt chaining (split into steps), Routing (classify first, then pick a template), Parallelization (run in parallel, then merge), Orchestrator-workers (main agent + sub-agents), Evaluator-optimizer (generate + evaluate loop). Each one is an engineering decision about how to slice and pass context, not about prompt wording.

I'm already using the OpenAI Chat Completions API — do the 5 context layers still apply?

Yes. The 5 layers are an engineering abstraction, not bound to any model: OpenAI's system role + tools parameter + messages history + retrieval injection + user message map one-to-one. Gemini / DeepSeek / Qwen APIs use different field names but the exact same layering; migrating is just a matter of swapping the SDK package name.

For something simple like a customer-service bot, do I need all 5 layers?

No. A customer-service bot gets by with 3 layers at minimum: system (role + business rules) + retrieval (top-3 from the FAQ knowledge base) + user. Add the memory layer only if you need to remember customer preferences across sessions, and the tools layer only when the bot has to act on orders or refunds. Ship with 3 layers first, then add layers based on ticket reviews.

Which layer do beginners crash on most often?

The retrieval layer. 80% of self-taught projects die on "the relevant documents were recalled but the LLM still answered wrong" — people assume the prompt is at fault and tweak the system wording, when the real problem is that retrieval did no selection (filter / rerank / judge). Chapters 3 and 5 cover this in depth.