What is Context Engineering — Karpathy's Rename
2025-06-25, Karpathy posts on X: "+1 for 'context engineering' over 'prompt engineering'. People associate prompts with short task descriptions you'd give an LLM in your day-to-day use. When in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step."
That tweet was a relay handoff from Shopify CEO Tobi Lutke's post 7 days earlier — "I really like the term 'context engineering' over prompt engineering" (@tobi, 2025-06-18). Once an LLM app stuffs 50K to 200K tokens of context per call, the prompt is just a small slice of the whole thing.
Pick Your Path First
Jump based on the engineering problem:
| You're stuck on | Jump to |
|---|---|
| RAG retrieves right docs but LLM answers wrong | Ch 3, Ch 5 |
| Agent burns through context window in a few steps | Ch 4, Ch 6 |
| Picking Cursor / Claude Code / Cline | Ch 8 |
| Building production RAG yourself | Ch 10 |
| Just finished Prompt Engineering | Read 1→2→3→4→5 |
Where Prompt Ends and Context Begins
Tobi Lutke's definition — "the art of providing all the context for the task to be plausibly solvable by the LLM".
Prompt is the few lines you typed. Context is every token the LLM actually sees:
1. System instruction — the system prompt set in the API
2. Tool definitions — the schema for every function-calling / MCP tool
3. Memory / chat history — earlier turns (or summaries of them)
4. Retrieved context — RAG-fetched docs, vector passages, scraped pages
5. User input — what the user typed this turn
Prompt engineering writes item 5; context engineering decides how 1-5 stack into a 200K window without blowing it up.
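A minimal sketch of that assembly, with hypothetical names and a crude 4-characters-per-token heuristic (swap in a real tokenizer such as tiktoken for production). The fixed layers are counted first; retrieval gets whatever budget is left:

```python
# Sketch of five-layer context assembly. All names are illustrative,
# not from any real SDK.

def count_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token. Use a real tokenizer
    # in production.
    return max(1, len(text) // 4)

def assemble_context(system: str, tools: str, memory: str,
                     retrieved: list[str], user_input: str,
                     window: int = 200_000) -> str:
    # Layers 1, 2, 3, 5 are fixed; layer 4 (retrieval) is elastic.
    fixed = [system, tools, memory, user_input]
    budget = window - sum(count_tokens(t) for t in fixed)
    picked, used = [], 0
    for doc in retrieved:                 # assumed pre-sorted by relevance
        cost = count_tokens(doc)
        if used + cost > budget:
            break                         # drop the rest, never overflow
        picked.append(doc)
        used += cost
    # Stack order: 1 system, 2 tools, 3 memory, 4 retrieval, 5 user input
    return "\n\n".join([system, tools, memory, *picked, user_input])
```

The design choice worth noting: the window budget is enforced at assembly time, not discovered as an API error at call time.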
JR's omni-report runs 17 routines. One of them, "JR AI Visibility Weekly", sweeps 20 real student queries each week to check whether AI engines recommend JR Academy. Its prompt makes four context engineering moves:
```text
# Real excerpt: omni-report/JR AI Visibility Weekly routine prompt
# tested: 2026-04-26 · model: claude-sonnet-4-6
[Phase 1: write skeleton + commit]
Write the `ai-visibility/$DATE.md` skeleton (10 _TBD_ placeholders)
commit + push: feat(ai-visibility): scaffold $DATE
[Phase 2: 4 batches × 5 queries × 2 layers]
Each batch handles 5 queries → Edit the matching table → commit + push
(avoids stream idle timeout)
```
- Context budget — the task is split into 6 phases with forced commits, so output never piles up past the stream timeout
- Context selection — use `ls` to push upstream paths into context instead of letting the LLM guess
- Context scaffolding — the skeleton file comes first (10 _TBD_ placeholders), giving later Edits structural anchors
- Context isolation — each batch commits independently, so earlier mistakes don't pollute later batches
None of these are prompt wording. All are structural decisions.
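The isolation move can be sketched as: every batch call starts from the shared skeleton plus only its own queries, so a bad answer in batch 1 never enters batch 3's context. `make_batches` and `run_batch` below are hypothetical stand-ins, not the routine's real code:

```python
# Sketch of per-batch context isolation: each batch sees the skeleton
# plus its own 5 queries, never another batch's output.

def make_batches(queries: list[str], size: int = 5) -> list[list[str]]:
    return [queries[i:i + size] for i in range(0, len(queries), size)]

def run_batch(skeleton: str, batch: list[str]) -> str:
    # Stand-in for one LLM call + Edit + commit. Real code would call
    # the model here and git-commit the result so progress survives a
    # stream timeout.
    context = skeleton + "\n" + "\n".join(batch)   # fresh context per batch
    return f"processed {len(batch)} queries with {len(context)} chars of context"

queries = [f"query {i}" for i in range(20)]
results = [run_batch("## skeleton with 10 _TBD_ slots", b)
           for b in make_batches(queries)]          # 4 batches × 5 queries
```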
Why This Term Only Showed Up in 2025
ChatGPT shipped at the end of 2022. It took two and a half years for someone to say "we should rename this" — engineering reality shifted. Three things stacked up:
Thing 1: RAG went mainstream (2023-2024)
LangChain shipped v0.1 in January 2024, LlamaIndex around the same time. RAG turned LLM input from "a paragraph" into "a paragraph + 5 retrieved passages + a summary of the last 3 turns". Stanford's Lost in the Middle in July 2023 (arXiv:2307.03172, Liu et al.) showed that recall drops sharply for information placed in the middle of the context — having docs in the prompt ≠ the model using them.
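One common mitigation (a community pattern, not from the paper itself): interleave ranked passages so the strongest land at the edges of the prompt, where attention holds up, and the weakest land in the middle. A sketch:

```python
# "Edge reorder": place the best-ranked docs at the start and end of
# the context, pushing the weakest toward the middle, where Lost in
# the Middle shows attention is worst.

def edge_reorder(docs_by_rank: list[str]) -> list[str]:
    front, back = [], []
    for i, doc in enumerate(docs_by_rank):     # rank 0 = most relevant
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]                  # strong docs at both ends

# With 5 docs ranked d0 (best) .. d4 (worst), the output order is
# d0, d2, d4, d3, d1: best first, second-best last, worst in the middle.
```

LangChain ships a similar transformer (`LongContextReorder`) if you'd rather not hand-roll this.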
Thing 2: Agents took off (2024)
Anthropic dropped Building Effective Agents on 2024-12-20, breaking agentic systems into five patterns: prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer. The engineering challenge behind each is "make sure each step only sees what it should see".
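The routing pattern makes "only sees what it should see" concrete: classify the request first, then build a context containing only that route's tools and docs, not the whole catalogue. All names below are illustrative:

```python
# Minimal routing sketch in the spirit of Anthropic's "routing" pattern.
# Route names, tools, and docs are made up for illustration.
ROUTES = {
    "billing": {"tools": ["refund", "invoice_lookup"], "docs": ["billing_faq"]},
    "tech":    {"tools": ["log_search"],               "docs": ["runbook"]},
}

def route(user_msg: str) -> str:
    # Stand-in classifier; production would use a cheap LLM call here.
    return "billing" if "refund" in user_msg.lower() else "tech"

def build_step_context(user_msg: str) -> dict:
    r = ROUTES[route(user_msg)]
    # The downstream step sees only its route's tools and docs.
    return {"tools": r["tools"], "docs": r["docs"], "input": user_msg}
```

The payoff is twofold: fewer tokens per call, and no chance of the model grabbing a tool that was never meant for this request.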
Thing 3: Long-context models (2024-2025)
Claude 3.5 Sonnet runs a 200K context window; Gemini 1.5 Pro went straight to 1M. But being able to stuff more isn't the same as needing to: overstuffing drops retrieval accuracy (Lost in the Middle gets worse at 100K), cost climbs linearly, and latency degrades. With more tokens available, token allocation became the new engineering problem.
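The cost and latency side of that trade-off is simple arithmetic. A back-of-envelope sketch — the price and latency constants here are assumptions for illustration, not any provider's real numbers:

```python
# Assumed constants, for illustration only.
PRICE_PER_MTOK = 3.00   # input price, USD per million tokens (assumed)
MS_PER_KTOK    = 8      # prefill latency per 1K input tokens (assumed)

def call_cost_usd(input_tokens: int) -> float:
    return input_tokens / 1_000_000 * PRICE_PER_MTOK

def prefill_latency_ms(input_tokens: int) -> float:
    return input_tokens / 1_000 * MS_PER_KTOK

# Doubling context from 50K to 100K doubles both cost and prefill
# latency, whether or not the extra 50K improves the answer.
```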
These three threads converged by mid-2025, and Tobi and Karpathy said nearly the same thing within seven days of each other in June. Simon Willison's 2025-06-27 post recaps the moment: not a naming gimmick, but two years of engineering muscle memory finally getting a name.
Prompt Engineering vs Context Engineering — Not a Replacement, a Layer Above
A lot of people misread this as "prompt engineering is obsolete". The prompt is part of the context. Their relationship:
| Dimension | Prompt Engineering | Context Engineering |
|---|---|---|
| Focus | Wording of one instruction, few-shot | How all 5 context layers assemble |
| Typical scenario | ChatGPT chat box | Production LLM app |
| Optimization goal | Nail this one prompt | Pipeline stays reliable across 1000 queries |
| Failure mode | Model misunderstands you | Model "understood right but answered wrong" (pollution, decay, overflow) |
| Prerequisite | None | Must know prompt engineering first |
| Doesn't apply when | One-shot, stateless | Toy demo |
Prompt engineering solves the "communication problem". Context engineering solves the "system problem". First is like writing an email. Second is like designing the email protocol.
Why "Engineering" Instead of "Design"
Other people call this "context design" or "prompt craftsmanship". Karpathy deliberately picked the word engineering — that's a signal:
- Measurable — context quality runs through an eval set of 1000 queries, not "I think it looks good"
- Reusable — context selection strategy is code, drops into the next project
- Can break — one deploy tweaks retrieval threshold, accuracy drops 15% — incident, not taste
- Trade-offs explicit — 5 more passages vs cost 2x vs latency +800ms, pick one
Anthropic's prompt caching docs make the point: the 5-minute TTL is an engineering trade-off. A cache hit cuts that portion of input cost by 90%, but once the TTL lapses you pay to rebuild the cache.
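That trade-off can be put in numbers. A sketch using the multipliers from Anthropic's documented pricing model as I understand it — cache writes at 1.25x the base input price, cache reads at 0.1x; verify current figures against their docs before relying on them:

```python
# Break-even sketch for prompt caching. Multipliers assume cache writes
# cost 1.25x base input price and cache reads 0.1x (check current docs).

def cached_cost(prefix_tokens: int, calls_within_ttl: int,
                base_price_per_tok: float = 3e-6) -> float:
    write = prefix_tokens * base_price_per_tok * 1.25                      # first call
    reads = prefix_tokens * base_price_per_tok * 0.10 * (calls_within_ttl - 1)
    return write + reads

def uncached_cost(prefix_tokens: int, calls: int,
                  base_price_per_tok: float = 3e-6) -> float:
    return prefix_tokens * base_price_per_tok * calls

# With 2+ calls inside the TTL, caching already wins; with exactly 1,
# you paid the 25% write premium for nothing.
assert cached_cost(50_000, 2) < uncached_cost(50_000, 2)
assert cached_cost(50_000, 1) > uncached_cost(50_000, 1)
```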
What's Coming Across the Next 9 Chapters
| # | Chapter | What |
|---|---|---|
| 2 | Boundary with Prompt Engineering | 5 context layers debugging |
| 3 | Context Selection — retrieval ≠ correct answer | Attention decay + Lost in the Middle |
| 4 | Token Budget — split 200K | 5 layers fight for the pool |
| 5 | Rerank — retrieval to selection | bi-encoder + cross-encoder + LLM-judge |
| 6 | Agent Memory three-layer | scratchpad / working / persistent |
| 7 | Context cost of tool calls | Tool schema tokens, MCP dynamic discovery |
| 8 | Cursor / Claude Code / Cline compared | 80% experience gap is context strategy |
| 9 | Multi-agent context isolation | sub-agent + summary-back |
| 10 | Build production RAG in 7 days | Eval set, monitoring, checklist |
Every chapter unpacks a real engineering problem with code + JR omni-report cases.
Takeaway
Prompt is "what you say". Context is "what the LLM sees". Between the two sit four more layers — system prompt, tool definitions, memory, retrieval. Those four layers are where 80% of the engineering work in a production LLM app lives.
References
- Karpathy. (2025-06-25). Tweet on context engineering.
- Tobi Lutke. (2025-06-18). Tweet on context engineering.
- Anthropic. (2024-12-20). Building Effective Agents.
- Liu et al. (2023-07-06). Lost in the Middle. arXiv:2307.03172.
- Simon Willison. (2025-06-27). Context engineering.
- Anthropic. Prompt caching docs.
Production case: JR Academy omni-report — context engineering across 17 routines.
❓ FAQ
The most commonly searched questions about this chapter's topic.
What's the difference between context engineering and prompt engineering?
Context engineering covers 5 layers: system prompt, tool definitions, memory, retrieved results, user input. Prompt engineering only optimizes the wording of layer 5; context engineering optimizes how all 5 layers are assembled. Karpathy's 2025-06-25 tweet: the prompt is just a small slice of the context.
Why did this term only appear in 2025?
Once RAG and agents went mainstream, a single LLM call routinely took 50K-200K tokens of input from multiple sources, and selecting, ordering, and compressing context became a standalone engineering problem. Tobi Lutke (Shopify CEO) tweeted on 2025-06-18, Karpathy relayed on 2025-06-25, and the term stuck.
Do I need to learn prompt engineering first?
Yes. The prompt is layer 5 of the 5 context layers; skipping prompt fundamentals to learn context engineering is like learning piano without reading sheet music. Finish JR Academy's Prompt Mastery track first.
Is context engineering just another hype word?
No. Anthropic's 2024-12-20 "Building Effective Agents" blog was already doing context engineering, just without the name. Cursor / Claude Code / Cline call the same models yet differ 80% in experience, and the difference is all context strategy.
How much does it cost to run all the demos for this track myself?
$15-25 total: Anthropic API $5-10 (demos for the 10 chapters plus the 7-day RAG build in Ch 10), Cohere rerank $2-5, and the free tiers of Pinecone/Qdrant are enough for the vector store. Running the chunking and eval scripts locally costs nothing; a CPU laptop is fine.
I can study 1 hour a day — how long to finish all 10 chapters?
3-4 weeks: each of the 10 chapters takes 15-25 minutes of reading plus 30-45 minutes of hands-on verification, plus the 7-day RAG build in Ch 10. At 1 hour a day that's about 21 days; at 2 hours a day, about 14. Save the Ch 10 build for the final week and build up understanding as you read the earlier chapters.
Can I skip context engineering and just fine-tune a model?
No. Fine-tuning solves style, format, and domain tone; it doesn't solve real-time data, tool calls, multi-turn memory, or large-document retrieval — and those are all context engineering's job. OpenAI's and Anthropic's official docs both recommend the same order: do RAG + prompting first, and fine-tune only once you've proven that isn't enough. 95% of scenarios never get there.