02

The Boundary with Prompt Engineering — Tuning Each of the 5 Context Layers

⏱️ 20 min


Chapter 1 listed the 5 context layers. This chapter takes each one apart: what goes in, how many tokens it eats, how to debug it in isolation, and how to find the culprit when output degrades.

Every context engineering technique (rerank, memory, sub-agents) is surgery on one of these layers.

What the 5 Layers Actually Look Like

Below is the full message JR omni-report's daily-jobs routine sends to Claude right after scraping LinkedIn —

# tested: 2026-04-26 · anthropic@0.40.0 · model: claude-sonnet-4-6
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,

    # === Layer 1: System instruction ===
    system=[
        {
            "type": "text",
            "text": "You are JR Academy's daily-jobs picker...",  # ~800 tokens
            "cache_control": {"type": "ephemeral"}  # 5-minute-TTL prompt cache
        }
    ],

    # === Layer 2: Tool definitions ===
    tools=[
        {"name": "WebFetch", "description": "...", "input_schema": {...}},  # ~200 tokens
        {"name": "Write",    "description": "...", "input_schema": {...}},  # ~150 tokens
        # 6 tools total ≈ 1100 tokens
    ],

    messages=[
        # === Layer 3: Memory / chat history ===
        {"role": "user",      "content": "Job IDs already picked in the past 7 days..."},  # ~300 tokens
        {"role": "assistant", "content": "Noted."},

        # === Layer 4: Retrieved context ===
        {"role": "user", "content": [
            {"type": "text", "text": "Full text of the 30 jobs scraped from LinkedIn:"},
            {"type": "text", "text": "<job1>...</job1><job2>...</job2>"}  # ~12000 tokens
        ]},

        # === Layer 5: User input (the actual instruction for this task) ===
        {"role": "user", "content": "Pick 3 of the 30 jobs above: "
                                    "1 aspirational / 1 actionable / 1 special. "
                                    "Output JSON strictly matching the ${OUT} file schema."}  # ~80 tokens
    ]
)

Total context here lands around 14,500 tokens. Prompt engineering only touches that last user input — 80 tokens, about 0.5%. The other 99.5% is context engineering territory.
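To see where those tokens go, a per-layer breakdown can be scripted. The sketch below uses a crude ~4-characters-per-token heuristic (an assumption, not the real tokenizer) — for exact counts the Anthropic SDK exposes client.messages.count_tokens.

```python
# Rough per-layer token accounting for a layered request.
# The ~4-chars-per-token ratio is a heuristic for English text;
# for exact numbers use the Anthropic SDK's client.messages.count_tokens.

def rough_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def layer_report(layers: dict[str, str]) -> dict[str, float]:
    """Return each layer's share of the total estimated tokens, in percent."""
    counts = {name: rough_tokens(text) for name, text in layers.items()}
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

# Stand-in payloads sized to match the example above.
report = layer_report({
    "system":    "x" * 3200,    # ~800 tokens
    "tools":     "x" * 4400,    # ~1100 tokens
    "memory":    "x" * 1200,    # ~300 tokens
    "retrieval": "x" * 48000,   # ~12000 tokens
    "user":      "x" * 320,     # ~80 tokens
})
print(report)  # retrieval dominates; user input is well under 1%
```

Running this on your own layers makes the 99.5% / 0.5% split concrete before any optimization work.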

Layer by Layer

Layer 1: System instruction — Role + Invariant Rules

What goes in: model role, rules that never change (output format, prohibited actions, token limits), long descriptions worth caching (reference docs, style guides).

Tune: must stay stable — any change invalidates the cache entirely. Anthropic's prompt cache has a 5-minute TTL, and a cache hit requires the system block to be byte-identical (docs). Add cache_control: ephemeral and reads of the cached system block are billed at 10% of the normal input price.

Debug: system alone + minimal user input; see if the model acts in role. If changing system leaves the output identical — system isn't taking effect, usually because it was placed wrong (content that belongs in system ended up in a user message).

Layer 2: Tool definitions — The Tool Menu

What goes in: each function calling / MCP tool's name + description + input_schema, including "when to use" (write into description).

Tune: each tool schema runs 100-300 tokens. 15 tools ≈ 3000 tokens, all fighting for context space. Claude Code over MCP can pull in 17 servers × N tools — that's why most tools are deferred (loaded on demand via ToolSearch).

Debug: cut tool list to 1, see if model still calls tools randomly. If yes — description isn't clear, model is "guessing" (classic context pollution).
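A cheap pre-check before that experiment is linting the tool descriptions themselves. The sketch below is a hypothetical heuristic — the 40-character floor and the "when to use" keyword check are assumptions, not official rules.

```python
# Heuristic lint for Layer 2: flag tool definitions likely to make the
# model "guess". Both thresholds below are assumptions, not rules.

def lint_tools(tools: list[dict]) -> list[str]:
    warnings = []
    for tool in tools:
        desc = tool.get("description", "")
        if len(desc) < 40:
            warnings.append(f"{tool['name']}: description too short to disambiguate")
        if "when" not in desc.lower():
            warnings.append(f"{tool['name']}: no 'when to use' guidance")
    return warnings

tools = [
    {"name": "WebFetch", "description": "Fetch a URL."},
    {"name": "Write", "description": "Write text to a file on disk. Use this when the task asks to persist output."},
]
for w in lint_tools(tools):
    print(w)  # flags WebFetch on both checks; Write passes
```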

Layer 3: Memory / chat history — What Happened Before

What goes in: previous N turns (usually summarized), scratchpad of agent task steps, long-term memory (user preferences).

Tune: the layer where the context budget blows up first. 100 turns × 1500 tokens = 150K, which eats 75% of a 200K window. It needs the three-layer memory architecture (Chapter 6): scratchpad / working / persistent, each with its own compression strategy.

Debug: cut history to last turn. Can model still finish? Yes — history was redundant. No — you need better summarization, not deletion.
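The "cut history to the last turn" experiment can be sketched as a pure function over Anthropic-style message dicts (the helper name last_turn_only is hypothetical):

```python
# Layer 3 debug sketch: keep only the most recent user turn and rerun
# the task with everything earlier removed.

def last_turn_only(history: list[dict]) -> list[dict]:
    """Drop everything before the most recent user turn."""
    for i in range(len(history) - 1, -1, -1):
        if history[i]["role"] == "user":
            return history[i:]
    return history  # no user turn found — leave unchanged

history = [
    {"role": "user", "content": "Job IDs picked in the past 7 days: ..."},
    {"role": "assistant", "content": "Noted."},
    {"role": "user", "content": "Pick 3 of the 30 jobs above."},
]
trimmed = last_turn_only(history)
print(len(trimmed))  # 1 — only the final user turn survives
```

If the model still finishes the task on `trimmed`, the dropped turns were redundant context.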

Layer 4: Retrieved context — Facts Stuffed in on the Fly

What goes in: RAG passages, WebFetch page content, prior tool call outputs.

Tune: the biggest token volume and the most volatility. 20 passages × 500 tokens = 10K tokens. Lost in the Middle (Liu et al. 2023, arXiv:2307.03172): the model effectively cannot see documents placed in the middle of the context (Chapter 3 covers the details).

Debug: inspect what model cited vs retrieved. 10 retrieved, 0 cited — selection failed, add rerank (Chapter 5).
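One way to quantify "cited vs retrieved" is a simple citation rate, assuming doc IDs appear verbatim in the answer as [docN] — a format assumption; adapt the check to your own citation convention.

```python
# Layer 4 debug sketch: what fraction of retrieved docs did the model
# actually cite? Assumes citations appear as [docN] in the answer text.

def citation_rate(retrieved_ids: list[str], answer: str) -> float:
    cited = {i for i in retrieved_ids if f"[{i}]" in answer}
    return len(cited) / len(retrieved_ids)

retrieved = [f"doc{i}" for i in range(1, 11)]   # 10 retrieved passages
answer = "Salary data comes from [doc2] and the visa notes from [doc7]."
rate = citation_rate(retrieved, answer)
print(rate)  # 0.2 — 8 of 10 passages were dead weight; consider rerank
```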

Layer 5: User input — This Turn's Actual Instruction

What goes in: what the user typed + task goal + output requirements.

Tune: traditional prompt engineering turf. Difference from the other 4: changes every call, can't be cached.

Debug: lock the first 4 layers, change user input wording, watch output shift — standard prompt engineering A/B setup.
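The locked-layers setup can be sketched as pure payload construction — the layer contents below are placeholders, not the real prompts:

```python
# Layer 5 A/B sketch: build two payloads identical in the first 4
# layers, differing only in the user-input wording.
import copy

BASE = {
    "model": "claude-sonnet-4-6",
    "system": [{"type": "text", "text": "role + invariant rules"}],
    "tools": [],
    "messages": [
        {"role": "user", "content": "retrieved context ..."},  # Layers 3-4 stand-in
    ],
}

def with_user_input(instruction: str) -> dict:
    payload = copy.deepcopy(BASE)
    payload["messages"].append({"role": "user", "content": instruction})
    return payload

a = with_user_input("Pick 3 jobs. Output JSON.")
b = with_user_input("Select exactly 3 jobs and reply with JSON only.")
# The first 4 layers are byte-identical across the two variants:
assert a["system"] == b["system"] and a["messages"][:-1] == b["messages"][:-1]
```

Any output difference between the two calls is then attributable to the user-input wording alone.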

Anthropic's 5 Agent Patterns = 5 Kinds of Context Engineering

Building Effective Agents (Anthropic, 2024-12-20) breaks agentic systems into 5 patterns:

| Anthropic pattern | Context engineering decision |
| --- | --- |
| Prompt chaining | Big task split into multiple calls; each only sees what it needs |
| Routing | Cheap model classifies → picks template → hands off to the expensive model |
| Parallelization | Multiple LLMs run; results are merged |
| Orchestrator-workers | Main agent splits the task; sub-agents do the pieces |
| Evaluator-optimizer | One generates, one evaluates, loop |

The engineering decision behind every pattern isn't prompt wording. It's how to slice, pass, and collect context.

JR Real Case: classroom-deck-builder skill

JR's classroom-deck-builder skill (.claude/skills/classroom-deck-builder/) compiles a Quest lesson into a "live class" (slide + voiceover + teaching gestures).

It started in "one-shot mode" — one LLM call generated N slides plus voiceover. Slide-to-slide narrative broke and the voiceover tone was inconsistent. It was rewritten into two stages:

Stage 1: outline stage
  Context = lesson goal + style guide
  → outputs N SceneOutlines (title + one-line instruction only)

[Human reviews the outline; edits and deletions allowed]

Stage 2: finalize stage (SSE streaming)
  Context = lesson goal + style guide + the full approved outline
  → generates the complete slide + voiceover, scene by scene

Stage 2 carries one layer Stage 1 lacks — the full approved outline — so each scene knows its position in the overall narrative. One-shot mode has no equivalent, which is why its narrative breaks.

Textbook context engineering: don't change prompt wording. Change how context flows between LLM calls.
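A minimal sketch of that flow, with hypothetical function names (the real implementation lives in .claude/skills/classroom-deck-builder/): each Stage 2 call receives the full approved outline plus its scene index.

```python
# Two-stage context flow sketch. Function and field names are
# hypothetical illustrations of the pipeline described above.

def stage1_context(lesson_goal: str, style_guide: str) -> str:
    """Stage 1 sees only the goal and the style guide."""
    return f"{lesson_goal}\n{style_guide}"

def stage2_context(lesson_goal: str, style_guide: str,
                   approved_outline: list[str], scene_index: int) -> str:
    # The key difference: every scene sees the FULL approved outline,
    # so it knows its position in the overall narrative.
    outline = "\n".join(f"{i+1}. {t}" for i, t in enumerate(approved_outline))
    return (f"{lesson_goal}\n{style_guide}\n"
            f"Approved outline:\n{outline}\n"
            f"Now write scene {scene_index + 1} in full.")

ctx = stage2_context("Teach Git branching", "casual, second person",
                     ["What is a branch", "Merge vs rebase", "Conflicts"], 1)
print("Merge vs rebase" in ctx)  # True — scene 2 sees its neighbours too
```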

Single-Layer Context vs Layered Context — Trade-off

Most people dump all 5 layers into one prompt string in their first LLM app. It runs, but debugging is hell. Layered processing takes more engineering, but you can tune each layer independently.

| Dimension | Single-layer mixed prompt | Layered context |
| --- | --- | --- |
| Writing | One format string, ~30 lines | ContextBuilder class, 100+ lines |
| Debugging | Any error means rewriting the whole thing | Mock any single layer |
| Token cost | No prompt cache | First 2 layers cacheable; 60-90% saved |
| Multiple maintainers | One edit affects everyone | system / tools have their own owners |
| Use when | Single-task scripts | Production LLM apps |
| Don't use when | Complex agentic systems | One-shot demos |

JR rule: more than 3 LLM calls or more than one maintainer — must be layered.
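What a minimal ContextBuilder might look like, assuming Anthropic's messages format — a sketch under those assumptions, not the class the table refers to:

```python
# Minimal ContextBuilder sketch: one setter per layer, so each layer
# can be mocked or tuned independently. Assumes Anthropic's messages
# format; class and method names are illustrative.

class ContextBuilder:
    def __init__(self) -> None:
        self.system: list[dict] = []     # Layer 1
        self.tools: list[dict] = []      # Layer 2
        self.memory: list[dict] = []     # Layer 3
        self.retrieval: list[dict] = []  # Layer 4
        self.user: dict | None = None    # Layer 5

    def set_system(self, text: str, cache: bool = True) -> "ContextBuilder":
        block = {"type": "text", "text": text}
        if cache:
            block["cache_control"] = {"type": "ephemeral"}
        self.system = [block]
        return self

    def set_user(self, instruction: str) -> "ContextBuilder":
        self.user = {"role": "user", "content": instruction}
        return self

    def build(self) -> dict:
        assert self.user is not None, "Layer 5 (user input) is mandatory"
        return {
            "system": self.system,
            "tools": self.tools,
            "messages": self.memory + self.retrieval + [self.user],
        }

payload = (ContextBuilder()
           .set_system("You are the daily-jobs picker.")
           .set_user("Pick 3 jobs.")
           .build())
print(list(payload))  # ['system', 'tools', 'messages']
```

Because each layer lives behind its own setter, a test can swap in a fake retrieval layer or an empty memory layer without touching the rest — the "mock any single layer" row in the table above.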

Takeaway

Prompt engineering cares about 1 of the 5 layers (user input). Context engineering cares about all 5 — plus how the layers are passed, cached, and isolated. 80% of production LLM app engineering lives in the first 4 layers.


References

  1. Anthropic. (2024-12-20). Building Effective Agents.
  2. Anthropic. Prompt caching documentation — 5-minute-TTL ephemeral cache.
  3. Anthropic. Cookbook — agent patterns.
  4. Anthropic. Tool use documentation.
  5. Liu, N. F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.

Production case: JR Academy classroom-deck-builder skill (.claude/skills/classroom-deck-builder/) — two-stage pipeline implements context isolation.


❓ Frequently Asked Questions

The most commonly searched questions on this chapter's topic.

What's the difference between the system prompt and the user prompt?

System holds the invariant rules plus the role; it is reused on every call and goes through the prompt cache, saving ~90% of cost. User is this turn's task instruction and changes every time. Mixing the two into one string makes debugging hell — in production they must be separated.

How many tokens do system / tools / memory / retrieval / user each take in my context?

A typical production RAG breakdown: system 800 + tools 1100 + memory 300 + retrieval 12000 + user 80 = 14280 tokens. Measure each segment with the Anthropic SDK's token counter. The retrieval layer is always the biggest chunk.

What are Anthropic's 5 agent patterns?

The 5 patterns: Prompt chaining (split into steps), Routing (classify first, then pick a template), Parallelization (run in parallel, then merge), Orchestrator-workers (main agent + sub-agents), Evaluator-optimizer (generate + evaluate loop). Each one is an engineering decision about how to slice and pass context, not about prompt wording.

I'm already using the OpenAI Chat Completions API — do the 5 context layers still apply?

Yes. The 5 layers are an engineering abstraction, not bound to any model: OpenAI's system role + tools parameter + messages history + retrieval injection + user message map one-to-one. Gemini / DeepSeek / Qwen APIs use different field names but the exact same layering; migrating is just a matter of swapping the SDK package name.

For something simple like a customer-service bot, do I need all 5 layers?

No. A customer-service bot gets by with 3 layers at minimum: system (role + business rules) + retrieval (top-3 from the FAQ knowledge base) + user. Add the memory layer only if you need to remember customer preferences across sessions, and the tools layer only when the bot has to act on orders or refunds. Ship with 3 layers first, then add layers based on ticket reviews.

Which layer do beginners crash on most often?

The retrieval layer. 80% of self-taught projects die on "the relevant documents were recalled but the LLM still answered wrong" — people assume the prompt is at fault and tweak the system wording, when the real problem is that retrieval did no selection (filter / rerank / judge). Chapters 3 and 5 cover this in depth.