Agent Memory — Three-Layer Architecture and Toolchain
An agent runs 5 steps fine; run 50 and the context blows up. Most people's first instinct is "switch to a 1M-context model." But running 50 steps at 1M costs 100× more than 5 steps at 200K, and Lost in the Middle gets worse at 1M, not better.
The right move is giving the agent layered memory, not buying a bigger context budget.
Three layers of memory — split by lifetime
The split borrows OS terminology, and the Letta project (GitHub) is built on exactly this OS metaphor:
| Layer | Lifetime | OS analogy | What goes in | Size (tokens) |
|---|---|---|---|---|
| Scratchpad | One task | CPU register | tool-call intermediate results, temp vars | 1-10K |
| Working | One session | RAM | conversation history, task plan, user input | 5-50K |
| Persistent | Across sessions | Disk | user prefs, learned facts, past task summaries | Unbounded |
Each layer has a different toolchain — mixing them up is why contexts blow up.
Layer 1: Scratchpad — task-internal scratch
What goes in: a search tool's return value, the intermediate markdown table from step two, step three's reasoning that picks the top 3.
The scratchpad uses no memory service — it lives inside the task's messages array. After every tool call, the result comes back as a tool_result message that the next LLM turn sees (see the sketch after this list):
- Once the task ends, the scratchpad is thrown away
- Token ceiling: 10K; beyond that, split into sub-tasks (chapter 9)
- If a task ran 30 steps and the scratchpad still bloats, the task needs splitting, not a memory upgrade
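A minimal sketch of the mechanism, assuming the Anthropic Python SDK — the TOOLS definition and run_tool dispatcher are hypothetical stand-ins:

```python
# The task-local `messages` list IS the scratchpad — no memory service involved.
import anthropic

client = anthropic.Anthropic()

TOOLS = [{  # toy tool so the loop has something to call
    "name": "web_search",
    "description": "Search the web",
    "input_schema": {"type": "object",
                     "properties": {"q": {"type": "string"}},
                     "required": ["q"]},
}]

def run_tool(name: str, args: dict) -> str:
    return f"(stub result for {name}: {args})"  # replace with a real dispatcher

messages = [{"role": "user", "content": "Find the top 3 options and compare them"}]

while True:
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption: any current Claude model works
        max_tokens=1024,
        tools=TOOLS,
        messages=messages,
    )
    if resp.stop_reason != "tool_use":
        break
    messages.append({"role": "assistant", "content": resp.content})
    for block in resp.content:
        if block.type == "tool_use":
            messages.append({"role": "user", "content": [{
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_tool(block.name, block.input),
            }]})

# Task over: `messages` is simply dropped — nothing is persisted anywhere.
```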
JR omni-report's "Phase 0/1/2/3/4" is exactly the scratchpad pattern — every phase commits its output, and the next phase re-reads from git instead of from the previous phase's scratchpad, which avoids contamination.
Layer 2: Working memory — session-scoped rolling
What goes in: every turn of the conversation, current multi-step task plan, session-level config (language preference, verbosity).
Simple: rolling window
Keep only the last N turns; truncate older ones. LangChain's ConversationBufferWindowMemory:
```python
# tested: 2026-04-26 · langchain@0.3.x
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=10)  # keep only the last 10 turns
```
Good for: short conversations, Q&A assistants. Bad for: recalling 30-turn-old details.
Advanced: rolling window + history summary
Keep the most recent N turns verbatim; compress earlier turns with a cheap model (Haiku) and stuff the summary into the system prompt. LangChain's ConversationSummaryBufferMemory and LlamaIndex's ChatSummaryMemoryBuffer both do this.
Catch: summaries lose information. Turns containing exact numbers, citations, or code snippets can't be safely compressed.
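A minimal sketch of the hybrid pattern, assuming langchain@0.3.x with the langchain-anthropic package installed:

```python
from langchain.memory import ConversationSummaryBufferMemory
from langchain_anthropic import ChatAnthropic

memory = ConversationSummaryBufferMemory(
    llm=ChatAnthropic(model="claude-3-5-haiku-latest"),  # cheap model does the summarizing
    max_token_limit=2000,  # recent turns under this budget stay verbatim
)
memory.save_context({"input": "We shipped v2 on 2026-04-20"}, {"output": "Noted."})
# Turns pushed past the budget get folded into memory.moving_summary_buffer
```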
Anthropic native: Message Batches API
For the rare 100K+-turn history, Anthropic's Message Batches API can process the whole conversation history asynchronously — effectively RAG on history: batch-summarize old chunks offline, then retrieve over the summaries.
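A sketch of the offline half, assuming the anthropic SDK's Batches API; the chunking and the retrieval step over the resulting summaries are stubbed out:

```python
import anthropic

client = anthropic.Anthropic()

# Stand-in for the real split of an oversized history into ~50-turn slices.
chunks = ["turn 1 ... turn 50", "turn 51 ... turn 100"]

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"history-chunk-{i}",
            "params": {
                "model": "claude-3-5-haiku-latest",
                "max_tokens": 512,
                "messages": [{
                    "role": "user",
                    "content": f"Summarize these turns; keep exact facts:\n{chunk}",
                }],
            },
        }
        for i, chunk in enumerate(chunks)
    ]
)
# Poll client.messages.batches.retrieve(batch.id) until processing completes,
# then index the summaries for retrieval.
```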
Layer 3: Persistent memory — cross-session fact store
What goes in: user prefs (language, style, expertise), learned facts (works at X, runs Y, follows Z), summaries of past task outcomes (not raw output).
The biggest gap between production agents and toy demos. Only persistent memory lets an agent "remember" across sessions.
Mem0 — arXiv:2504.19413
Two-phase extract-update: at the end of every conversation, the extraction phase pulls out salient facts; the update phase asks an LLM to decide add / update / delete / no-op for each memory entry.
Mem0's reported benchmarks vs OpenAI's default thread memory: 26% higher accuracy, 91% lower p95 latency, 90%+ lower token cost.
```python
# tested: 2026-04-26 · mem0ai@0.1.x
from mem0 import Memory

m = Memory()
m.add("User works in Sydney as an AI engineer", user_id="alice")
results = m.search("Which city does alice work in?", user_id="alice")
# → "User works in Sydney"
```
Letta — GitHub
A heavier, OS-style three-layer design (core + archival + recall). In Letta's own benchmark, a plain filesystem (one markdown file per user) hit 74% accuracy and beat plenty of specialized vector-store memory libraries — simple usually works.
OpenAI Threads / an Anthropic equivalent
OpenAI Assistants threads come with persistent context built in. Anthropic has no thread equivalent, but prompt caching (5-minute TTL) plus your own conversation-history storage gets the same effect.
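A sketch of that DIY equivalent — the thread is a JSON file you own, and a cache_control marker lets the 5-minute ephemeral cache absorb the stable prefix (the threads/ layout is hypothetical):

```python
import json
import anthropic

client = anthropic.Anthropic()
path = "threads/alice.json"  # hypothetical per-user thread file

with open(path) as f:
    history = json.load(f)

# New turn; the marker caches everything up to and including this block.
history.append({"role": "user", "content": [
    {"type": "text", "text": "Pick up where we left off.",
     "cache_control": {"type": "ephemeral"}},
]})

resp = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumption: any current Claude model works
    max_tokens=1024,
    messages=history,
)
history.append({"role": "assistant", "content": resp.content[0].text})

with open(path, "w") as f:
    json.dump(history, f)  # the file, not the API, is the persistent thread
```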
DIY — Filesystem / DB
One markdown file per user; at conversation end, an LLM appends a summary. On retrieval, read the whole file into context. The Letta benchmark shows this beats a vector store when user-level state stays at or under ~10K characters.
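A minimal sketch of the whole pattern — the memory/{user_id}.md layout is illustrative:

```python
from pathlib import Path

MEM_DIR = Path("memory")  # illustrative layout: memory/{user_id}.md

def recall(user_id: str) -> str:
    """Retrieval is just reading the whole file into context — no search step."""
    f = MEM_DIR / f"{user_id}.md"
    return f.read_text() if f.exists() else ""

def remember(user_id: str, summary: str) -> None:
    """At conversation end, append the LLM-written summary of new facts."""
    MEM_DIR.mkdir(exist_ok=True)
    with (MEM_DIR / f"{user_id}.md").open("a") as f:
        f.write(summary.strip() + "\n")
```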
Real JR case: skills-data-manager and prod-state.json
JR Academy's skills-data-manager (tools/skills-data-manager/) handles bootcamp curriculum sync — edit locally, diff against prod, sync in one click. Its persistent memory:
```
curriculum/{bootcamp-slug}/public/prod-state.json
└─ the "last known prod state", cached each time data is pulled back from prod
└─ the next diff computes changes as local content − prod-state
└─ persists across sessions, avoiding a fresh prod pull every time
```
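The diff step reduces to a dictionary comparison — a simplified sketch, not the actual tool code:

```python
import json

def diff_against_prod_state(local: dict, state_file: str) -> dict:
    """Changes = local content minus the cached last-known prod state."""
    with open(state_file) as f:
        prod_state = json.load(f)
    # Entries that changed locally or don't exist in the cached prod state.
    return {key: value for key, value in local.items()
            if prod_state.get(key) != value}
```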
"Filesystem as persistent memory" in the wild — one JSON file is the persistent memory, no vector store, no Mem0. Simple, debuggable, git-trackable. Only when entry count balloons past 1000+ and semantic recall becomes necessary do you upgrade to Mem0/Letta. Don't pre-empt for "we might need it someday."
Managed service (Mem0 / Letta) vs DIY filesystem — trade-offs
| Dimension | Managed (Mem0 / Letta) | DIY filesystem |
|---|---|---|
| Onboarding | < 1 day | 1-3 days |
| Recall quality | 26% above baseline | Tied under 100 entries; falls behind past 1,000 |
| Latency | 100-500 ms | 5-50 ms |
| Cost | $0.01-0.10/MAU + tokens | Near zero |
| Debuggability | Black box | 100% readable |
| Data compliance | Data leaves the company | Stays under your control |
| Multi-user | Built-in user_id isolation | DIY |
| Best for | 100+ users with frequent sessions | < 100 users / internal tools / POCs |
JR rule: start on the filesystem, run it for 3 months, watch entry count and recall quality, and migrate to Mem0 only after crossing the threshold. Letta's 74% filesystem benchmark is the basis for this.
Takeaway
Memory isn't one thing — it's three: scratchpad (within-task) / working (within-session) / persistent (cross-session). Mix them up and the context blows up. Production agents must layer: the scratchpad lives in the messages array, working memory uses a rolling window + summary, and persistent memory starts on the filesystem and graduates to Mem0.
References
- Mem0 team. (2025-04-28). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413.
- Mem0 research benchmark. Benchmarking Mem0's token-efficient memory algorithm — 26% accuracy / 91% latency / 90% token savings vs the OpenAI baseline.
- Letta. Letta documentation — OS-inspired three-layer memory (core / archival / recall) and the 74% filesystem benchmark surprise.
- LangChain. Migrating from ConversationSummaryBufferMemory — the rolling-window + summary working-memory pattern.
- Anthropic. Message Batches API — asynchronous processing of very long conversations.
- mem0ai. GitHub — the open-source Mem0 implementation.
- Production case: JR Academy skills-data-manager (tools/skills-data-manager/) — prod-state.json as filesystem persistent memory.
FAQ
The most frequently searched questions on this chapter's topic.
Why can't an agent just use a 1M-context model and be done with it?
1M context doesn't solve the problem: going from 200K to 1M raises cost 5×, and Lost in the Middle gets worse at 1M (the blind zone in the middle grows). The right direction is layered memory for the agent (scratchpad / working / persistent), not a bigger context budget.
Mem0 or Letta — which fits my project?
Segment by user count: under 100 users, use the filesystem; past 1,000 users, use Mem0. Mem0 runs the two-phase extraction-update pipeline and, vs the OpenAI baseline, reports 26% higher accuracy, 91% lower p95 latency, and 90% token savings. Letta takes the OS three-layer route (core/archival/recall), but its benchmark shows a plain filesystem hitting 74% and beating the vector store.
Where is the boundary between working memory and persistent memory?
Cut by lifetime: working = the rolling context within the current session (use LangChain's ConversationSummaryBufferMemory to summarize earlier turns); persistent = the cross-session fact store (user preferences, learned facts, summaries of past task outcomes). When the session ends, working memory is cleared; persistent memory survives until the next one.
Do I need all three memory layers for a single-user assistant?
No: a single-user assistant needs at minimum the working + persistent layers. Use LangChain's ConversationSummaryBufferMemory for automatic summarization in working memory, and SQLite with a hand-rolled schema for persistent user preferences and facts. A scratchpad only matters when the agent runs multi-step reasoning; simple chat doesn't need one.
Is Mem0 / Letta expensive to run month to month?
Mem0 self-hosted is free (open source, Apache 2.0); the managed Mem0 Cloud free tier — 1K memories + 1K searches/month — is enough for a demo. Letta is 100% open source and self-hostable: PostgreSQL plus one Python service, so the cloud cost is just the PG instance (~$15/month for an RDS t4g.micro). Neither will blow up your costs.
What's the most common failure mode of a memory system?
Blindly writing every LLM output to persistent memory: after 3 months the store balloons to 50K+ entries and every search recalls stale or contradictory content. The fix: gate writes with an LLM-as-judge that decides "should this be remembered," add expiration (30/90 days), and periodically run dedup + contradiction merging. Mem0 does these by default; if you hand-roll, you have to add them yourself — see the sketch below.
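A sketch of that write gate, with judge_llm as a hypothetical callable that returns the model's text:

```python
from datetime import datetime, timedelta, timezone

def should_remember(fact: str, judge_llm) -> bool:
    """LLM-as-judge gate in front of every persistent write."""
    verdict = judge_llm(
        "Is this a durable fact worth remembering across sessions? "
        f"Answer YES or NO only.\nFact: {fact}"
    )
    return verdict.strip().upper().startswith("YES")

def write_memory(store: list, fact: str, judge_llm, ttl_days: int = 90) -> None:
    """Gated write with an expiration stamp; dedup/merge runs as a separate job."""
    if should_remember(fact, judge_llm):
        expires = datetime.now(timezone.utc) + timedelta(days=ttl_days)
        store.append({"fact": fact, "expires": expires.isoformat()})
```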