07

Context Engineering & Memory

⏱️ 35 minutes

Context engineering and memory keep LLM responses relevant without blowing token budgets.

1) Goals

  • Provide just enough context (instructions + facts) for accuracy.
  • Control cost/latency by trimming or structuring history.
  • Maintain conversational continuity where needed.

2) Instruction Hierarchy

  • System: non-negotiable rules (role, language, safety).
  • Task/User: current request and constraints.
  • History: only necessary turns; summarize older content.
  • Tools: function specs and expectations.
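
The hierarchy above can be sketched as a helper that assembles a chat-style message list in priority order. This is a minimal illustration, assuming the common OpenAI-style role/content message schema; `build_messages` is a hypothetical helper, not a library API:

```python
def build_messages(system_rules, task, history=None, tool_specs=None):
    """Assemble messages in instruction-hierarchy order:
    system rules first, then trimmed history, then the current task.
    Tool specs are passed alongside rather than inside the messages."""
    messages = [{"role": "system", "content": system_rules}]
    messages.extend(history or [])  # only the turns that survived trimming
    messages.append({"role": "user", "content": task})
    return messages

msgs = build_messages(
    system_rules="You are a support bot. Answer in English. Refuse unsafe requests.",
    task="How do I reset my password?",
    history=[{"role": "assistant", "content": "Summary of earlier turns: user enabled 2FA."}],
)
```

Keeping the system rules in index 0 makes the later regression check ("are core instructions still present?") a simple lookup.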

3) History Management

  • Sliding window: keep recent N turns.
  • Summarization: compress older history into bullets; include IDs/time.
  • Topical caches: store per-topic summaries; swap in/out as topic changes.
  • Reset triggers: on a new topic, re-send core instructions and drop stale history.
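
The sliding-window and summarization strategies combine naturally in one trimming step. A minimal sketch, where the default join-based summarizer is a stand-in for a real summarization call (an LLM or extractive summarizer):

```python
def trim_history(turns, keep_last=4,
                 summarize=lambda old: "Summary: " + "; ".join(t["content"] for t in old)):
    """Keep the most recent `keep_last` turns verbatim (sliding window)
    and compress everything older into a single summary turn."""
    if len(turns) <= keep_last:
        return list(turns)
    old, recent = turns[:-keep_last], turns[-keep_last:]
    return [{"role": "system", "content": summarize(old)}] + recent

turns = [{"role": "user", "content": f"turn {i}"} for i in range(6)]
trimmed = trim_history(turns, keep_last=4)
```

Passing `summarize` as a parameter lets the same trimmer back both plain sliding windows and topical caches (swap in a per-topic summarizer on topic change).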

4) Context Packing for RAG/Chat

  • Strict budget: target ≤ 60-70% of context limit; reserve for output.
  • Ordering: instructions → constraints → retrieved snippets (with IDs) → question.
  • Deduplicate snippets; group by source; include citation IDs.
  • Dynamic selection: choose top-k by relevance + recency + source diversity.
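
A minimal packing sketch, assuming snippets arrive pre-ranked by the relevance/recency/diversity score and that a whitespace word count approximates tokens (a real system would use the model's tokenizer):

```python
def pack_context(snippets, budget_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily pack ranked (id, text) snippets until the budget is spent.
    Duplicate texts are skipped; IDs are kept inline for citation."""
    packed, used, seen = [], 0, set()
    for sid, text in snippets:
        if text in seen:                      # deduplicate identical snippets
            continue
        cost = count_tokens(text)
        if used + cost > budget_tokens:       # strict budget: never exceed it
            continue
        packed.append(f"[{sid}] {text}")
        used += cost
        seen.add(text)
    return packed

ranked = [("D1", "alpha beta"), ("D2", "alpha beta"), ("D3", "gamma delta epsilon")]
result = pack_context(ranked, budget_tokens=5)
```

The budget passed in should already be the reduced figure (≤ 60-70% of the model limit), so output headroom is reserved before packing starts.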

5) Structured Facts

  • Provide facts as bullet lists or key-value blocks, not prose.
  • Use IDs for each fact for citation/traceback.
  • For numbers/dates, keep canonical units and formats.
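
As an illustration of the ID-tagged key-value format, a small renderer (the `[F1]`-style ID scheme is an assumed convention, not a standard):

```python
def render_facts(facts):
    """Render (key, value) facts as an ID-tagged key-value block,
    so the model can cite [F1], [F2], ... instead of paraphrasing prose."""
    return "\n".join(f"[F{i}] {key}: {value}"
                     for i, (key, value) in enumerate(facts, start=1))

block = render_facts([("order_id", "A-1042"), ("ship_date", "2024-03-01")])
```

Values should already be in canonical units/formats (ISO dates, fixed currency) before rendering, so the model never has to normalize them.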

6) Session Memory Patterns

  • Short-term: recent dialog + working set.
  • Long-term: vector or key-value store of facts/preferences; retrieve by query + tenant/user.
  • Ephemeral: auto-expire or rotate; respect privacy/PII limits.
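
A toy key-value store showing the tenant/user namespacing and ephemeral-expiry patterns together; the class and method names are assumptions for illustration, not a specific library's API:

```python
import time

class SessionMemory:
    """Minimal long-term key-value memory. Keys are namespaced by
    (tenant, user) so retrieval is filtered by design; an optional TTL
    implements the ephemeral auto-expire pattern."""
    def __init__(self):
        self._store = {}

    def put(self, tenant, user, key, value, ttl=None):
        expires = time.time() + ttl if ttl is not None else None
        self._store[(tenant, user, key)] = (value, expires)

    def get(self, tenant, user, key):
        entry = self._store.get((tenant, user, key))
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and time.time() > expires:
            del self._store[(tenant, user, key)]   # lazily drop expired entries
            return None
        return value
```

Because the tenant/user pair is part of the key, a lookup for one tenant can never surface another tenant's data, which is the same isolation property the retrieval filters in section 7 enforce.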

7) Safety & Leakage Prevention

  • Drop user-provided prompt fragments from summaries to avoid prompt injection persistence.
  • Redact secrets/PII before storing/retrieving.
  • Tag data by tenant/user/region; filter on retrieval.
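
A regex-based redaction sketch for the write/read boundary. The secret pattern (an `sk-` prefix) is an assumed example format; a production system would use a dedicated secret/PII scanner rather than two regexes:

```python
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
API_KEY = re.compile(r"\bsk-[A-Za-z0-9]{8,}\b")   # assumed secret format for illustration

def redact(text):
    """Replace emails and secret-looking tokens with placeholders
    before the text is stored in memory or retrieved into a prompt."""
    text = EMAIL.sub("[EMAIL]", text)
    text = API_KEY.sub("[SECRET]", text)
    return text

clean = redact("Contact bob@example.com, key sk-abcdef12345678")
```

Running the same `redact` on both the write path and the read path means a value that slipped into storage before the rule existed is still caught at retrieval.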

8) Testing & Validation

  • Token audits: measure context size under typical/peak conditions.
  • Regression checks: ensure core instructions remain present after packing.
  • Topic-switch tests: verify that summaries and resets behave as expected.
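
The token audit and the instruction-regression check can share one helper. `audit_context` is a hypothetical sketch using a word count as a stand-in for real tokenization:

```python
def audit_context(messages, core_rules, limit_tokens,
                  count_tokens=lambda s: len(s.split())):
    """Return (tokens_used, fill_ratio, rules_present): a token audit plus
    a regression check that every core instruction survived packing."""
    total = sum(count_tokens(m["content"]) for m in messages)
    joined = " ".join(m["content"] for m in messages)
    rules_ok = all(rule in joined for rule in core_rules)
    return total, total / limit_tokens, rules_ok

msgs = [{"role": "system", "content": "Answer in English."},
        {"role": "user", "content": "hello there"}]
tokens, ratio, ok = audit_context(msgs, ["Answer in English."], limit_tokens=10)
```

Running this under both typical and peak fixtures, and asserting `ratio <= 0.7` and `ok` in CI, covers the first two checklist items below.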

9) Minimal Checklist

  • Instruction hierarchy enforced; core rules always included.
  • History trimmed/summarized with IDs; budgeted context ≤ 70% of limit.
  • Retrieved snippets deduped, cited, and filtered by tenant.

📚 Related Resources

❓ FAQ

The most frequently searched questions about this chapter's topic, with answers below

History keeps growing and the token count explodes; how do you manage it?

Combine three strategies: (1) a sliding window keeps only the most recent N turns; (2) summarization compresses older content into bullets, preserving IDs/time; (3) topical caches store per-topic summaries and are swapped in/out on topic changes. A topic switch should also trigger a reset: re-send the core instructions and drop stale history. Target context usage ≤ 60-70% of the model limit, reserving roughly a third for output.

How should system / user / history / tools be ordered?

Instruction hierarchy: (1) System holds the non-negotiable rules (role, language, safety) at highest priority; (2) Task/User carries the current request and constraints; (3) History includes only the necessary turns, with older content summarized first; (4) Tools holds the function specs and expectations. For RAG, the order is instructions → constraints → retrieved snippets (with IDs) → question. Put key information at the beginning or end, not the middle.

How do you store long-term memory (user preferences, historical facts)?

Short-term memory is the current dialog plus the working set; long-term memory uses a vector store or key-value store for facts/preferences, retrieved by query + tenant/user. Ephemeral memory auto-expires or rotates to avoid PII accumulation. Store facts as bullet lists or key-value blocks rather than prose; give each fact an ID for citation and traceback; keep numbers/dates in canonical units and formats.

How does a memory system defend against prompt injection?

Three gates: (1) strip user-provided prompt fragments when summarizing, so injections cannot persist into memory; (2) redact secrets and PII before writing or reading; (3) tag data by tenant/user/region and filter strictly at retrieval. Production also needs token audits (context size under typical and peak conditions) and regression checks (core instructions still present after packing).

How do you verify that context packing hasn't squeezed out the core instructions?

Minimal checklist: (1) instruction hierarchy enforced, with core rules included every time; (2) history trimmed/summarized with IDs, budget ≤ 70% of the model limit; (3) retrieved snippets deduplicated, cited, and filtered by tenant. Then run topic-switch tests: deliberately change topics and verify the summary and reset behavior. Run the regression suite on every release without exception.