Agent Memory — Three-Layer Architecture and Toolchain
An agent runs 5 steps fine; run 50 and the context blows up. Most people's first instinct is "switch to a 1M-context model." But running 50 steps at 1M costs 100× more than 5 steps at 200K, and Lost in the Middle gets worse at 1M, not better.
The right move is giving the agent layered memory, not buying a bigger context budget.
Three layers of memory — split by lifetime
The split borrows OS terminology, and the Letta project (GitHub) is built on exactly this OS metaphor:
| Layer | Lifetime | OS analogy | What goes in | Size (tokens) |
|---|---|---|---|---|
| Scratchpad | One task | CPU register | tool-call intermediate results, temp vars | 1-10K |
| Working | One session | RAM | conversation history, task plan, user input | 5-50K |
| Persistent | Across sessions | Disk | user prefs, learned facts, past task summaries | Unbounded |
Each layer has a different toolchain — mixing them up is why contexts blow up.
Layer 1: Scratchpad — task-internal scratch
What goes in: a search tool's return value, the intermediate markdown table from step two, step three's reasoning that picks the top 3.
The scratchpad uses no memory service — it lives inside the task's messages array. After every tool call, the result comes back as a tool_result message that the next LLM turn sees (see the sketch after this list):
- Once the task ends, the scratchpad is thrown away
- Token ceiling: 10K; beyond that, split into sub-tasks (chapter 9)
- If a task ran 30 steps and the scratchpad still bloats, the task needs splitting, not a memory upgrade
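A minimal sketch of the mechanism, assuming the Anthropic Python SDK — the TOOLS definition and run_tool dispatcher are hypothetical stand-ins:

```python
# The task-local `messages` list IS the scratchpad — no memory service involved.
import anthropic

client = anthropic.Anthropic()

TOOLS = [{  # toy tool so the loop has something to call
    "name": "web_search",
    "description": "Search the web",
    "input_schema": {"type": "object",
                     "properties": {"q": {"type": "string"}},
                     "required": ["q"]},
}]

def run_tool(name: str, args: dict) -> str:
    return f"(stub result for {name}: {args})"  # replace with a real dispatcher

messages = [{"role": "user", "content": "Find the top 3 options and compare them"}]

while True:
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption: any current Claude model works
        max_tokens=1024,
        tools=TOOLS,
        messages=messages,
    )
    if resp.stop_reason != "tool_use":
        break
    messages.append({"role": "assistant", "content": resp.content})
    for block in resp.content:
        if block.type == "tool_use":
            messages.append({"role": "user", "content": [{
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_tool(block.name, block.input),
            }]})

# Task over: `messages` is simply dropped — nothing is persisted anywhere.
```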
JR omni-report's "Phase 0/1/2/3/4" is exactly the scratchpad pattern — every phase commits its output, and the next phase re-reads from git instead of from the previous phase's scratchpad, which avoids contamination.
Layer 2: Working memory — session-scoped rolling
What goes in: every turn of the conversation, current multi-step task plan, session-level config (language preference, verbosity).
Simple: rolling window
Keep only the last N turns; truncate older ones. LangChain's ConversationBufferWindowMemory:
```python
# tested: 2026-04-26 · langchain@0.3.x
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=10)  # keep only the last 10 turns
```
Good for: short conversations, Q&A assistants. Bad for: recalling 30-turn-old details.
Advanced: rolling window + history summary
Keep the most recent N turns verbatim; compress earlier turns with a cheap model (Haiku) and stuff the summary into the system prompt. LangChain's ConversationSummaryBufferMemory and LlamaIndex's ChatSummaryMemoryBuffer both do this.
Catch: summaries lose information. Turns containing exact numbers, citations, or code snippets can't be safely compressed.
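A minimal sketch of the hybrid pattern, assuming langchain@0.3.x with the langchain-anthropic package installed:

```python
from langchain.memory import ConversationSummaryBufferMemory
from langchain_anthropic import ChatAnthropic

memory = ConversationSummaryBufferMemory(
    llm=ChatAnthropic(model="claude-3-5-haiku-latest"),  # cheap model does the summarizing
    max_token_limit=2000,  # recent turns under this budget stay verbatim
)
memory.save_context({"input": "We shipped v2 on 2026-04-20"}, {"output": "Noted."})
# Turns pushed past the budget get folded into memory.moving_summary_buffer
```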
Anthropic native: Message Batches API
For the rare 100K+-turn history, Anthropic's Message Batches API can process the whole conversation history asynchronously — effectively RAG on history: batch-summarize old chunks offline, then retrieve over the summaries.
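A sketch of the offline half, assuming the anthropic SDK's Batches API; the chunking and the retrieval step over the resulting summaries are stubbed out:

```python
import anthropic

client = anthropic.Anthropic()

# Stand-in for the real split of an oversized history into ~50-turn slices.
chunks = ["turn 1 ... turn 50", "turn 51 ... turn 100"]

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"history-chunk-{i}",
            "params": {
                "model": "claude-3-5-haiku-latest",
                "max_tokens": 512,
                "messages": [{
                    "role": "user",
                    "content": f"Summarize these turns; keep exact facts:\n{chunk}",
                }],
            },
        }
        for i, chunk in enumerate(chunks)
    ]
)
# Poll client.messages.batches.retrieve(batch.id) until processing completes,
# then index the summaries for retrieval.
```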
Layer 3: Persistent memory — cross-session fact store
What goes in: user prefs (language, style, expertise), learned facts (works at X, runs Y, follows Z), summaries of past task outcomes (not raw output).
The biggest gap between production agents and toy demos. Only persistent memory lets an agent "remember" across sessions.
Mem0 — arXiv:2504.19413
Two-phase extract-update: at the end of every conversation, the extraction phase pulls out salient facts; the update phase asks an LLM to decide add / update / delete / no-op for each memory entry.
Mem0's reported benchmarks vs OpenAI's default thread memory: 26% higher accuracy, 91% lower p95 latency, 90%+ lower token cost.
```python
# tested: 2026-04-26 · mem0ai@0.1.x
from mem0 import Memory

m = Memory()
m.add("User works in Sydney as an AI engineer", user_id="alice")
results = m.search("Which city does alice work in?", user_id="alice")
# → "User works in Sydney"
```
Letta — GitHub
A heavier, OS-style three-layer design (core + archival + recall). In Letta's own benchmark, a plain filesystem (one markdown file per user) hit 74% accuracy and beat plenty of specialized vector-store memory libraries — simple usually works.
OpenAI Threads / an Anthropic equivalent
OpenAI Assistants threads come with persistent context built in. Anthropic has no thread equivalent, but prompt caching (5-minute TTL) plus your own conversation-history storage gets the same effect.
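A sketch of that DIY equivalent — the thread is a JSON file you own, and a cache_control marker lets the 5-minute ephemeral cache absorb the stable prefix (the threads/ layout is hypothetical):

```python
import json
import anthropic

client = anthropic.Anthropic()
path = "threads/alice.json"  # hypothetical per-user thread file

with open(path) as f:
    history = json.load(f)

# New turn; the marker caches everything up to and including this block.
history.append({"role": "user", "content": [
    {"type": "text", "text": "Pick up where we left off.",
     "cache_control": {"type": "ephemeral"}},
]})

resp = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumption: any current Claude model works
    max_tokens=1024,
    messages=history,
)
history.append({"role": "assistant", "content": resp.content[0].text})

with open(path, "w") as f:
    json.dump(history, f)  # the file, not the API, is the persistent thread
```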
DIY — Filesystem / DB
One markdown file per user; at conversation end, an LLM appends a summary. On retrieval, read the whole file into context. The Letta benchmark shows this beats a vector store when user-level state stays at or under ~10K characters.
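A minimal sketch of the whole pattern — the memory/{user_id}.md layout is illustrative:

```python
from pathlib import Path

MEM_DIR = Path("memory")  # illustrative layout: memory/{user_id}.md

def recall(user_id: str) -> str:
    """Retrieval is just reading the whole file into context — no search step."""
    f = MEM_DIR / f"{user_id}.md"
    return f.read_text() if f.exists() else ""

def remember(user_id: str, summary: str) -> None:
    """At conversation end, append the LLM-written summary of new facts."""
    MEM_DIR.mkdir(exist_ok=True)
    with (MEM_DIR / f"{user_id}.md").open("a") as f:
        f.write(summary.strip() + "\n")
```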
Real JR case: skills-data-manager and prod-state.json
JR Academy's skills-data-manager (tools/skills-data-manager/) handles bootcamp curriculum sync — edit locally, diff against prod, sync in one click. Its persistent memory:
```
curriculum/{bootcamp-slug}/public/prod-state.json
└─ the "last known prod state", cached each time data is pulled back from prod
└─ the next diff computes changes as local content − prod-state
└─ persists across sessions, avoiding a fresh prod pull every time
```
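The diff step reduces to a dictionary comparison — a simplified sketch, not the actual tool code:

```python
import json

def diff_against_prod_state(local: dict, state_file: str) -> dict:
    """Changes = local content minus the cached last-known prod state."""
    with open(state_file) as f:
        prod_state = json.load(f)
    # Entries that changed locally or don't exist in the cached prod state.
    return {key: value for key, value in local.items()
            if prod_state.get(key) != value}
```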
"Filesystem as persistent memory" in the wild — one JSON file is the persistent memory, no vector store, no Mem0. Simple, debuggable, git-trackable. Only when entry count balloons past 1000+ and semantic recall becomes necessary do you upgrade to Mem0/Letta. Don't pre-empt for "we might need it someday."
Managed service (Mem0 / Letta) vs DIY filesystem — trade-offs
| Dimension | Managed (Mem0 / Letta) | DIY filesystem |
|---|---|---|
| Onboarding | < 1 day | 1-3 days |
| Recall quality | 26% above baseline | Tied under 100 entries; falls behind past 1,000 |
| Latency | 100-500 ms | 5-50 ms |
| Cost | $0.01-0.10/MAU + tokens | Near zero |
| Debuggability | Black box | 100% readable |
| Data compliance | Data leaves the company | Stays under your control |
| Multi-user | Built-in user_id isolation | DIY |
| Best for | 100+ users with frequent sessions | < 100 users / internal tools / POCs |
JR rule: start on the filesystem, run it for 3 months, watch entry count and recall quality, and migrate to Mem0 only after crossing the threshold. Letta's 74% filesystem benchmark is the basis for this.
Takeaway
Memory isn't one thing — it's three: scratchpad (within-task) / working (within-session) / persistent (cross-session). Mix them up and the context blows up. Production agents must layer: the scratchpad lives in the messages array, working memory uses a rolling window + summary, and persistent memory starts on the filesystem and graduates to Mem0.
References
- Mem0 team. (2025-04-28). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413.
- Mem0 research benchmark. Benchmarking Mem0's token-efficient memory algorithm — 26% accuracy / 91% latency / 90% token savings vs the OpenAI baseline.
- Letta. Letta documentation — OS-inspired three-layer memory (core / archival / recall) and the 74% filesystem benchmark surprise.
- LangChain. Migrating from ConversationSummaryBufferMemory — the rolling-window + summary working-memory pattern.
- Anthropic. Message Batches API — asynchronous processing of very long conversations.
- mem0ai. GitHub — the open-source Mem0 implementation.
- Production case: JR Academy skills-data-manager (tools/skills-data-manager/) — prod-state.json as filesystem persistent memory.
FAQ
The most frequently searched questions on this chapter's topic.
Why can't an agent just use a 1M-context model and be done with it?
1M context doesn't solve the problem: going from 200K to 1M raises cost 5×, and Lost in the Middle gets worse at 1M (the blind zone in the middle grows). The right direction is layered memory for the agent (scratchpad / working / persistent), not a bigger context budget.
Mem0 or Letta — which fits my project?
Segment by user count: under 100 users, use the filesystem; past 1,000 users, use Mem0. Mem0 runs the two-phase extraction-update pipeline and, vs the OpenAI baseline, reports 26% higher accuracy, 91% lower p95 latency, and 90% token savings. Letta takes the OS three-layer route (core/archival/recall), but its benchmark shows a plain filesystem hitting 74% and beating the vector store.
Where is the boundary between working memory and persistent memory?
Cut by lifetime: working = the rolling context within the current session (use LangChain's ConversationSummaryBufferMemory to summarize earlier turns); persistent = the cross-session fact store (user preferences, learned facts, summaries of past task outcomes). When the session ends, working memory is cleared; persistent memory survives until the next one.
Do I need all three memory layers for a single-user assistant?
No: a single-user assistant needs at minimum the working + persistent layers. Use LangChain's ConversationSummaryBufferMemory for automatic summarization in working memory, and SQLite with a hand-rolled schema for persistent user preferences and facts. A scratchpad only matters when the agent runs multi-step reasoning; simple chat doesn't need one.
Is Mem0 / Letta expensive to run month to month?
Mem0 self-hosted is free (open source, Apache 2.0); the managed Mem0 Cloud free tier — 1K memories + 1K searches/month — is enough for a demo. Letta is 100% open source and self-hostable: PostgreSQL plus one Python service, so the cloud cost is just the PG instance (~$15/month for an RDS t4g.micro). Neither will blow up your costs.
What's the most common failure mode of a memory system?
Blindly writing every LLM output to persistent memory: after 3 months the store balloons to 50K+ entries and every search recalls stale or contradictory content. The fix: gate writes with an LLM-as-judge that decides "should this be remembered," add expiration (30/90 days), and periodically run dedup + contradiction merging. Mem0 does these by default; if you hand-roll, you have to add them yourself — see the sketch below.
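A sketch of that write gate, with judge_llm as a hypothetical callable that returns the model's text:

```python
from datetime import datetime, timedelta, timezone

def should_remember(fact: str, judge_llm) -> bool:
    """LLM-as-judge gate in front of every persistent write."""
    verdict = judge_llm(
        "Is this a durable fact worth remembering across sessions? "
        f"Answer YES or NO only.\nFact: {fact}"
    )
    return verdict.strip().upper().startswith("YES")

def write_memory(store: list, fact: str, judge_llm, ttl_days: int = 90) -> None:
    """Gated write with an expiration stamp; dedup/merge runs as a separate job."""
    if should_remember(fact, judge_llm):
        expires = datetime.now(timezone.utc) + timedelta(days=ttl_days)
        store.append({"fact": fact, "expires": expires.isoformat()})
```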