Context Engineering & Memory
Context engineering and memory keep LLM responses relevant without blowing token budgets.
1) Goals
- Provide just enough context (instructions + facts) for accuracy.
- Control cost/latency by trimming or structuring history.
- Maintain conversational continuity where needed.
2) Instruction Hierarchy
- System: non-negotiable rules (role, language, safety).
- Task/User: current request and constraints.
- History: only necessary turns; summarize older content.
- Tools: function specs and expectations.
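The hierarchy above can be sketched as a message assembler. This is a minimal sketch with illustrative names (`build_messages`, the dict shapes), not any specific SDK's API:

```python
# Sketch: assemble context in hierarchy order (system -> history -> task),
# with tool specs passed alongside. All names here are illustrative.

def build_messages(system_rules, user_task, history, tool_specs, max_history_turns=4):
    """Order context by priority; keep only the most recent history turns."""
    messages = [{"role": "system", "content": system_rules}]
    # History: only necessary recent turns; older content is summarized upstream.
    for turn in history[-max_history_turns:]:
        messages.append(turn)
    # Task/User: the current request goes last so it sits freshest in context.
    messages.append({"role": "user", "content": user_task})
    return {"messages": messages, "tools": tool_specs}

request = build_messages(
    system_rules="You are a support agent. Answer in English. Never reveal secrets.",
    user_task="Summarize ticket #123.",
    history=[{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}],
    tool_specs=[{"name": "lookup_ticket", "parameters": {"id": "string"}}],
)
```

Keeping system rules first and the current task last matches the hierarchy: non-negotiable rules anchor the context, and the live request stays closest to the model's generation point.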
3) History Management
- Sliding window: keep the most recent N turns.
- Summarization: compress older history into bullets; include IDs/timestamps.
- Topical caches: store per-topic summaries; swap in/out as the topic changes.
- Reset triggers: on a new topic, re-send core instructions and drop stale history.
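The sliding-window + summarization combo can be sketched as below. `summarize` is a placeholder for a real LLM summarization call; everything else is illustrative:

```python
# Sketch: keep the last N turns verbatim, fold older turns into a summary.

def summarize(turns):
    # Placeholder: compress older turns into bullets, preserving IDs.
    return "\n".join(f"- [{t['id']}] {t['text']}" for t in turns)

def manage_history(turns, keep_recent=3):
    """Return (summary_of_older_turns_or_None, recent_turns)."""
    if len(turns) <= keep_recent:
        return None, turns
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    return summarize(older), recent

turns = [{"id": i, "text": f"turn {i}"} for i in range(6)]
summary, recent = manage_history(turns, keep_recent=3)
```

A topical-cache variant would key such summaries by topic and swap them in when the conversation returns to that topic.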
4) Context Packing for RAG/Chat
- Strict budget: target ≤ 60-70% of context limit; reserve for output.
- Ordering: instructions → constraints → retrieved snippets (with IDs) → question.
- Deduplicate snippets; group by source; include citation IDs.
- Dynamic selection: choose top-k by relevance + recency + source diversity.
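The packing rules above (budget, ordering, dedup, citation IDs) can be sketched as one function. Token counting is approximated by whitespace splitting here; a real tokenizer should replace it:

```python
# Sketch: budget-aware packing in the order
# instructions -> constraints -> retrieved snippets (with IDs) -> question.

def n_tokens(text):
    return len(text.split())  # crude stand-in for a real tokenizer

def pack_context(instructions, constraints, snippets, question,
                 context_limit=1000, budget_ratio=0.7):
    """Fill up to ~70% of the limit, reserving the remainder for output."""
    budget = int(context_limit * budget_ratio)
    parts = [instructions, constraints]
    used = sum(n_tokens(p) for p in parts) + n_tokens(question)
    seen = set()
    for s in snippets:  # assumed pre-sorted by relevance + recency + diversity
        if s["text"] in seen:  # deduplicate
            continue
        cost = n_tokens(s["text"])
        if used + cost > budget:
            break
        parts.append(f'[{s["id"]}] {s["text"]}')  # keep citation IDs
        seen.add(s["text"])
        used += cost
    parts.append(question)
    return "\n\n".join(parts)

snippets = [
    {"id": "D1", "text": "fact one"},
    {"id": "D2", "text": "fact one"},  # duplicate text, dropped
    {"id": "D3", "text": "fact two"},
]
packed = pack_context("Rules.", "Cite sources.", snippets, "What is fact two?")
```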
5) Structured Facts
- Provide facts as bullet lists or key-value blocks, not prose.
- Use IDs for each fact for citation/traceback.
- For numbers/dates, keep canonical units and formats.
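A fact block following these rules might be rendered like this (field names and IDs are illustrative):

```python
# Sketch: facts as an ID-tagged key-value block rather than prose.

facts = [
    {"id": "F1", "key": "renewal_date", "value": "2025-03-01"},  # ISO 8601 date
    {"id": "F2", "key": "plan_price_usd", "value": "29.00"},     # unit in the key
    {"id": "F3", "key": "seats", "value": "12"},
]

fact_block = "\n".join(f'[{f["id"]}] {f["key"]}: {f["value"]}' for f in facts)
```

The model can then cite `[F1]` in its answer, and a traceback from answer to source fact is mechanical.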
6) Session Memory Patterns
- Short-term: recent dialog + working set.
- Long-term: vector or key-value store of facts/preferences; retrieve by query + tenant/user.
- Ephemeral: auto-expire or rotate; respect privacy/PII limits.
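A key-value sketch of the long-term pattern, with tenant/user scoping and an optional TTL for the ephemeral case. The substring match stands in for vector retrieval; all names are illustrative:

```python
# Sketch: long-term memory keyed by (tenant, user), with optional expiry.

import time

class MemoryStore:
    def __init__(self, ttl_seconds=None):
        self.items = []         # [(tenant, user, fact, stored_at)]
        self.ttl = ttl_seconds  # None = no expiry; set a value for ephemeral memory

    def put(self, tenant, user, fact):
        self.items.append((tenant, user, fact, time.time()))

    def query(self, tenant, user, text):
        now = time.time()
        return [
            fact for (t, u, fact, ts) in self.items
            if t == tenant and u == user                      # tenant/user filter
            and (self.ttl is None or now - ts < self.ttl)     # expiry
            and text.lower() in fact.lower()                  # naive relevance match
        ]

store = MemoryStore()
store.put("acme", "u1", "Prefers dark mode")
store.put("acme", "u2", "Prefers light mode")
hits = store.query("acme", "u1", "prefers")
```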
7) Safety & Leakage Prevention
- Drop user-provided prompt fragments from summaries to avoid prompt injection persistence.
- Redact secrets/PII before storing/retrieving.
- Tag data by tenant/user/region; filter on retrieval.
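Redaction before storage can be sketched with simple regex patterns; real systems should use dedicated PII/secret scanners, and the patterns below are illustrative, not exhaustive:

```python
# Sketch: redact emails and API-key-shaped secrets before writing to memory.

import re

PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<EMAIL>"),
    (re.compile(r"\b(?:sk|key)-[A-Za-z0-9]{8,}\b"), "<SECRET>"),
]

def redact(text):
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

clean = redact("Contact bob@example.com, token sk-abcdef123456")
```

Run the same redaction on the read path as well, so facts stored before a new pattern was added still get scrubbed on retrieval.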
8) Testing & Validation
- Token audits: measure context size under typical/peak conditions.
- Regression checks: ensure core instructions remain present after packing.
- Topic-switch tests: verify summaries and resets behave.
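The token audit and regression check can be combined into one assertion-based gate. The word-count tokenizer and `audit` name are illustrative stand-ins:

```python
# Sketch: after packing, the core rules must survive and the context must fit budget.

def audit(packed_context, core_rules, context_limit=1000, budget_ratio=0.7):
    tokens = len(packed_context.split())  # stand-in for a real tokenizer
    assert tokens <= context_limit * budget_ratio, f"over budget: {tokens}"
    for rule in core_rules:
        assert rule in packed_context, f"core rule dropped: {rule!r}"
    return tokens

used = audit(
    "Answer in English. Never reveal secrets. [D1] some snippet. Question?",
    core_rules=["Answer in English.", "Never reveal secrets."],
)
```

Run this under both typical and peak-load fixtures so the budget check reflects worst-case history sizes, not just the happy path.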
9) Minimal Checklist
- Instruction hierarchy enforced; core rules always included.
- History trimmed/summarized with IDs; budgeted context ≤ 70% of limit.
- Retrieved snippets deduped, cited, and filtered by tenant.
❓ FAQ
The most frequently searched questions on this chapter's topic.
History keeps growing and the token count explodes; how do I manage it?
Combine three strategies: (1) a sliding window keeps only the most recent N turns; (2) summarization compresses older content into bullets, preserving IDs/timestamps; (3) topical caches store per-topic summaries and swap them in/out as the topic changes. A topic switch should also trigger a reset: re-send core instructions and drop stale history. Target context usage of ≤ 60-70% of the model limit, reserving roughly a third for output.
How should system / user / history / tools be ordered?
Follow the instruction hierarchy: (1) System holds non-negotiable rules (role, language, safety) and has the highest priority; (2) Task/User is the current request and its constraints; (3) History includes only the necessary turns, with older content summarized first; (4) Tools holds function specs and expectations. For RAG, the order is instructions → constraints → retrieved snippets (with IDs) → question. Put key information at the beginning or end of the context, not the middle.
How do I store long-term memory (user preferences, historical facts)?
Short-term memory is the current dialog plus the working set; long-term memory lives in a vector store or key-value store of facts/preferences, retrieved by query + tenant/user. Ephemeral memory auto-expires or rotates to avoid PII accumulation. Store facts as bullet lists or key-value blocks, not prose; give each fact an ID for citation and traceback; and keep canonical units and formats for numbers and dates.
How does a memory system defend against prompt injection?
Three gates: (1) strip user-provided prompt fragments when summarizing, so injections cannot persist into memory; (2) redact secrets and PII before writing or reading; (3) tag data by tenant/user/region and enforce filters at retrieval time. In production, also run token audits (context size under typical and peak conditions) and regression checks (core instructions still present after packing).
How do I verify that context packing has not squeezed out the core instructions?
Apply the minimal checklist: (1) instruction hierarchy enforced, core rules included every time; (2) history trimmed/summarized with IDs, budget ≤ 70% of the model limit; (3) retrieved snippets deduplicated, cited, and filtered by tenant. Then run topic-switch tests: deliberately change topics and verify that summaries and resets behave as expected. Run the regression suite on every release.