Token Budget — How to Allocate Your 200K Window
200K sounds like a lot. Two copies of the novel To Live (~120K Chinese characters each) would fit. But in production LLM apps, 200K is never enough: five context layers fight over the same pool, every extra retrieval chunk costs a turn of history, and every extra tool costs a block of memory.
What 200K actually buys you
JR omni-report's daily-jobs routine — by turn 5 of the conversation:
# tested: 2026-04-26 · model: claude-sonnet-4-6 · max_context: 200K
Layer 1  System instruction              ~800 tok    (cached)
Layer 2  Tool definitions (6 tools)      ~1,100 tok  (cached)
Layer 3  History (5 turns + scratchpad)  ~8,500 tok
Layer 4  Retrieved context               ~32,000 tok (30 LinkedIn job posts, full text)
Layer 5  User input                      ~120 tok
─────────────────────────────────────────────────────
Input subtotal                           ~42,520 tok (21% used)
Output budget (max_tokens)               ~4,096 tok
─────────────────────────────────────────────────────
Effective budget                         ~46,616 tok (23%)
Five turns in and 23% is gone. Run 15 more turns and history quadruples; keep pulling 30K of retrieval per turn and you hit the ceiling around turn 20. 200K isn't "infinite" — it's a turn budget.
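A back-of-envelope sketch of that turn budget: the fixed-layer and retrieval sizes below come from the breakdown above, while the per-turn dialogue growth is an assumed average, not a measured figure.

```python
# Turn-budget estimator — illustrative numbers, not measurements.
WINDOW = 200_000
FIXED = 1_900               # system + tools
OUTPUT_RESERVE = 4_096      # max_tokens held back for generation
RETRIEVAL_PER_TURN = 30_000
DIALOGUE_PER_TURN = 1_700   # user msg + assistant msg, assumed average

def turns_until_full(retrieval_stays_in_history: bool) -> int:
    history, turns = 0, 0
    while FIXED + history + RETRIEVAL_PER_TURN + OUTPUT_RESERVE <= WINDOW:
        turns += 1
        history += DIALOGUE_PER_TURN
        if retrieval_stays_in_history:
            history += RETRIEVAL_PER_TURN  # retrieved chunks kept in the transcript

    return turns

print(turns_until_full(False))  # retrieval replaced each turn: dozens of turns fit
print(turns_until_full(True))   # retrieval kept in the transcript: single digits
```

Whether retrieved chunks stay in the transcript or get replaced each turn is the single biggest knob on how many turns the window buys you.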
Not every token costs the same — pricing dictates the strategy
Anthropic's current pricing puts tokens in four tiers (relative ratios, with a Claude Sonnet 4.6 cache hit as the 1× baseline):
| Token type | Relative cost | What it means |
|---|---|---|
| Cache hit (read) | 1× | Input tokens that hit a 5-minute prompt cache |
| Cache write | 12.5× | First write into the cache (25% premium on input) |
| Input (uncached) | 10× | Plain old input tokens |
| Output | 50× | Tokens the model generates (the priciest tier) |
If system + tools are 2000 tokens and never change, cache hits make them cost like 200 tokens of plain input. Anthropic's docs say it plainly: "prompt caching can reduce costs by up to 90%." Output costs 5× input, so capping max_tokens=4096 is a direct cost lever, not just a style preference.
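To see what the tiers mean for a single call, here's a minimal cost model built on the ratios above; the token counts are illustrative.

```python
# Relative cost of one call under the four tiers above
# (1x = cache-hit input; all figures are ratios, not dollars).
RATE = {"cache_hit": 1, "cache_write": 12.5, "input": 10, "output": 50}

def call_cost(cached_tok: int, fresh_tok: int, out_tok: int, warm: bool) -> float:
    fixed = cached_tok * (RATE["cache_hit"] if warm else RATE["cache_write"])
    return fixed + fresh_tok * RATE["input"] + out_tok * RATE["output"]

# 2,000 fixed tokens (system + tools), 8,000 fresh input, 1,000 output:
print(call_cost(2_000, 8_000, 1_000, warm=False))  # cold: 25,000 + 80,000 + 50,000
print(call_cost(2_000, 8_000, 1_000, warm=True))   # warm:  2,000 + 80,000 + 50,000
```

On warm calls the fixed layers all but vanish; output dominates either way, which is why the max_tokens cap is the other lever.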
Five budget techniques (ROI-sorted)
1. Cache the fixed layers (highest ROI)
Add cache_control: {"type": "ephemeral"} to the system block and to the last tool definition (a breakpoint caches the whole prefix before it, and Anthropic allows at most 4 breakpoints, so don't mark every tool), 5-minute TTL:
# tested: 2026-04-26 · anthropic@0.40.0
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,  # ~800 tok, defined elsewhere
            "cache_control": {"type": "ephemeral"},
        }
    ],
    tools=[
        *TOOL_DEFS[:-1],
        # one breakpoint on the last tool caches all ~1,100 tok of definitions
        {**TOOL_DEFS[-1], "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": user_input}],
)
Cost: first call pays +25% cache-write tax; next 5 minutes those layers cost 10% per call. A high-frequency workload (≥6 calls/min) saves 60-90% daily.
Catch: the cache key is byte-exact. Change one character in system/tools and the cache misses entirely.
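You can verify the cache is actually working from the response's usage block — the Messages API reports cache writes and reads as separate fields. A minimal probe, reusing SYSTEM_PROMPT from the snippet above:

```python
import anthropic

client = anthropic.Anthropic()
resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=64,
    system=[{"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.usage.cache_creation_input_tokens)  # > 0 on the first (write) call
print(resp.usage.cache_read_input_tokens)      # > 0 on warm calls; always 0 = no hits
```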
2. Summarize history (next ROI)
History blows up first. The pattern:
- Keep the most recent N turns verbatim (5-10)
- Push older turns through a cheap model (Haiku) to compress, and append the summary to the end of the system prompt under a marker like [History summary]
LangChain's ConversationSummaryBufferMemory wraps this pattern; a hand-rolled sketch follows below. Mem0 / Letta are heavier stacks (chapter 6).
Catch: summaries lose info. Tasks needing exact numbers / proper nouns must keep originals or re-fetch via RAG.
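A minimal hand-rolled sketch of the pattern, assuming an Anthropic client; the Haiku model name and the summarization prompt are illustrative choices, not prescriptions.

```python
import anthropic

client = anthropic.Anthropic()
KEEP_VERBATIM = 8  # most recent messages kept word-for-word

def compress_history(messages: list[dict]) -> tuple[str, list[dict]]:
    """Summarize everything older than the last KEEP_VERBATIM messages."""
    old, recent = messages[:-KEEP_VERBATIM], messages[-KEEP_VERBATIM:]
    if not old:
        return "", recent
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    resp = client.messages.create(
        model="claude-haiku-4-5",  # assumption: any cheap model works here
        max_tokens=512,
        messages=[{"role": "user", "content":
                   "Compress this dialogue into a summary; keep decisions, "
                   "exact numbers, and proper nouns:\n" + transcript}],
    )
    return resp.content[0].text, recent

# messages: the running transcript (list of {"role", "content"} dicts)
summary, recent = compress_history(messages)
system = SYSTEM_PROMPT + ("\n\n[History summary]\n" + summary if summary else "")
```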
3. Lazy-load tools
15 tools × 200 tokens = 3000 tokens on every call. But 80% of tasks only touch 2-3.
- Phase 1: ship a single meta tool, searchTools(query), that returns the names of relevant tools
- Phase 2: when it's called, inject the full schemas for just those tools (sketch after this list)
- Claude Code defaults to this — ToolSearch for on-demand loading
Catch: needs multi-turn flow. One API call can't do it.
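A sketch of the two-phase flow; searchTools and its wiring are hypothetical names for illustration, and TOOL_DEFS is the full tool list from the caching snippet above.

```python
# Phase 1 ships only this ~100-token meta tool;
# Phase 2 injects full schemas for whatever it returns.
META_TOOL = {
    "name": "searchTools",
    "description": "Return the names of tools relevant to the current task.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def tools_for_turn(requested_names: list[str] | None) -> list[dict]:
    if not requested_names:            # Phase 1: meta tool only
        return [META_TOOL]
    return [META_TOOL,                 # Phase 2: add just the schemas asked for
            *(t for t in TOOL_DEFS if t["name"] in requested_names)]
```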
4. Rerank retrieval (preview of chapter 5)
Retrieval is the biggest token sink. The funnel: a bi-encoder pulls 50 candidates → a cross-encoder reranks them to 10 → an LLM-as-judge picks the final 3. Recall stays around 90% while retrieval tokens drop from 25K to 1.5K.
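A sketch of the funnel; the sentence-transformers models are stand-ins, and judge_pick_3 is a hypothetical LLM-as-judge helper, not a library call.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder

bi = SentenceTransformer("all-MiniLM-L6-v2")               # cheap bi-encoder
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # reranker

def retrieve(query: str, corpus: list[str]) -> list[str]:
    # Stage 1: bi-encoder pulls top 50 by embedding similarity
    # (in practice the corpus embeddings live in a vector index, not here).
    sims = bi.similarity(bi.encode([query]), bi.encode(corpus))[0]
    top50 = [corpus[int(i)] for i in sims.argsort(descending=True)[:50]]
    # Stage 2: cross-encoder reranks the 50 down to 10.
    scores = ce.predict([(query, doc) for doc in top50])
    top10 = [doc for _, doc in sorted(zip(scores, top50), reverse=True)[:10]]
    # Stage 3: LLM-as-judge picks the final 3 (hypothetical helper).
    return judge_pick_3(query, top10)
```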
5. Output budget cap
max_tokens=4096 vs 8192 — neither runs full, but the model's generation strategy shifts based on max_tokens. Give it 8K and it leans long-form; 4K and it self-compresses.
JR experience: JSON-output tasks at max_tokens=2048 produce valid JSON more reliably than 8192 — with a long budget the model gets chatty and slips prose into JSON.
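A minimal way to measure that effect yourself, assuming an Anthropic client; the task prompt is a placeholder, and max_tokens is the only knob.

```python
import json
import anthropic

client = anthropic.Anthropic()

def json_task_is_valid(max_tokens: int) -> bool:
    """Run the same JSON task at a given output budget; True if it parses."""
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=max_tokens,
        messages=[{"role": "user",
                   "content": "Return ONLY a JSON object summarizing ..."}],
    )
    try:
        json.loads(resp.content[0].text)
        return True
    except json.JSONDecodeError:
        return False

# Run each budget N times and compare valid-JSON rates:
# json_task_is_valid(2048) vs json_task_is_valid(8192)
```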
Real JR case: omni-report routine budget strategy
The 17 omni-report routines run on Claude Code Remote, with a hidden constraint: stream idle timeout ≈ 10 minutes — go that long without output and the session gets cut off. Every routine therefore chops big tasks into phases with forced commits. Take JR AI Visibility Weekly:
Phase 0: prep + read upstream                (~30 sec, 5K tok)
Phase 1: write skeleton + commit             (~1 min, 2K tok)
Phase 2: 4 batches × 5 queries               (~6 min × 4, 30K tok)
└─ after each batch → Edit → commit + push
Phase 3: dashboard + insights + action list  (~2 min, 10K tok)
Phase 4: full sync to Notion                 (~30 sec, 8K tok)
Each phase outputs <8K tokens; together they total 60K+. One LLM call doing the whole thing would trip the idle timeout — splitting into phases is both a budget strategy and a reliability strategy.
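A sketch of the phase-runner shape; run_phase stands in for one bounded LLM call per phase, and the phase names are illustrative.

```python
import subprocess

PHASES = ["prep", "skeleton", "batch_1", "batch_2", "batch_3", "batch_4",
          "dashboard_insights_actions", "notion_sync"]

def commit_and_push(msg: str) -> None:
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", msg], check=True)
    subprocess.run(["git", "push"], check=True)

for phase in PHASES:
    run_phase(phase, max_tokens=8_000)         # each phase emits < 8K tokens
    commit_and_push(f"routine: {phase} done")  # forced commit after every phase
```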
Aggressive vs Generous — Trade-off
| Dimension | Aggressive (squeeze every layer) | Generous (buffer) |
|---|---|---|
| Token cost | Very low (cache fully utilized) | 2-5× higher |
| Latency | Low (short context) | Medium-high |
| Accuracy | Medium (may drop context) | High |
| Maintenance | High (fine-tune each cap) | Low |
| First LLM app | Don't | ✅ Recommended |
| Large-scale production | ✅ Recommended | Burns money |
JR's rule: the first app starts generous; only after it works, has an eval set, and has run 100 real queries does it go aggressive. Go aggressive on day one and you drop context without an eval that can tell you why.
Takeaway
200K context isn't a budget. It's a turn capacity that five layers fight over. Token cost isn't uniform — a cache hit is 1×, a cache write 12.5×, output 50×. The engineering core for production apps: run the first two layers (system + tools) through the cache, the third (history) through summarization, and the fourth (retrieval) through reranking.
References
- Anthropic. Pricing — the four billing tiers: input, output, cache read, cache write.
- Anthropic. Prompt caching documentation — 5-minute-TTL ephemeral cache and the up-to-90% cost-savings claim.
- Anthropic. Long context tips — advice on organizing information in long contexts.
- Anthropic. Models overview — context window sizes and model comparison.
- LangChain. ConversationSummaryBufferMemory — summary-based history implementation.
- Production case: JR Academy omni-report routines — the "skeleton + progressive fill + commit each segment" phase-splitting pattern, a budget strategy for coping with stream idle timeouts.
❓ FAQ
The most commonly searched questions on this chapter's topic.
How much of a 200K context can you actually use?
200K is a turn budget, not unlimited: a typical production RAG app burns 23% in 5 turns. Five context layers compete for the same pool, and retrieval is always the biggest share (30K-50K). Twenty turns of history plus 30K of retrieval per turn hits the ceiling.
How does Anthropic's prompt cache save 90% of cost?
Cache-hit input tokens cost only 10% of the normal input price; cache writes cost +25%; TTL is 5 minutes. Add cache_control: ephemeral to system + tools, and a high-frequency workload (≥6 calls/min) saves 60-90% daily. Catch: the cache key is a byte-exact match — change one character in system/tools and the entire cache misses.
How much more expensive are output tokens than input?
Sonnet 4.6 output is priced at 5× input. max_tokens=4096 isn't just a length limit, it's a direct cost cut. JR's field result: JSON-output tasks at max_tokens=2048 produce valid JSON more reliably than at 8192 — with a long budget the model rambles and breaks the JSON with explanatory prose.
I'm not on Anthropic — do OpenAI / Gemini have a similar prompt cache?
All three do, with different mechanics: OpenAI caches automatically (prefixes ≥1024 tokens, hits priced at 50%, no cache_control needed); Gemini has an explicit cachedContent API (custom TTL, minimum token threshold for writes); Anthropic uses explicit cache_control (hits at 10%, 5-minute TTL, the biggest savings). All three match on byte-exact prefixes, so the change-one-character-and-everything-misses pitfall is universal.
Does a solo-dev side project need a token budget too?
Yes. A solo project runs on a tight monthly budget and can burn through it overnight: a side project running RAG with no cache control can burn $50+ in an hour of tinkering. At minimum do two things: put cache_control on system + tools, and cap max_tokens at 2048. Those two steps are enough to keep monthly cost under $20.
What's the most common context-budget pitfall?
Adding the prompt cache while the system prompt still dynamically concatenates a date / user ID / random string — the cache never hits (0%). How to check: in the API response, look at whether cache_creation_input_tokens is > 0 on every call and cache_read_input_tokens stays at 0. If both hold, your system/tools are unstable; move the variables into the user message.