Context Compression & Optimization
Context Compression Strategies
When agent sessions generate massive conversation history, compression becomes necessary. The intuitive approach is to minimize tokens-per-request, but the correct target is tokens-per-task: the total tokens needed to finish the job, including the re-fetch cost incurred when compression drops critical information.
The right goal isn't "shortest single request" — it's "lowest total cost to complete the task."
- Optimize tokens-per-task, not tokens-per-request.
- Structured summaries beat aggressive compression for long tasks.
- Artifact trail is the hardest information to preserve.
- Trigger compression at 70-80% context.
- Use probe questions to evaluate quality.
What You'll Learn
- Trade-offs between three mainstream compression strategies
- Why "structured summarization" is the safest engineering practice
- How to evaluate compression quality with probe questions
When to Activate
Activate this skill when:
- Agent sessions exceed context window limits
- Designing conversation summarization strategies
- Evaluating different compression approaches for production systems
- Debugging cases where agents "forget" what files they modified
- Building evaluation frameworks for compression quality
- Optimizing long-running coding or debugging sessions
Core Concepts
Context compression is a trade-off between token savings and information loss. Three production-ready approaches:
- Anchored Iterative Summarization: Maintains a structured, continuously updated summary containing session intent, file modifications, decisions, and next steps. On trigger, only the newly truncated portion gets summarized and merged. The structure itself forces retention of critical information.
- Opaque Compression: Chases maximum compression ratios (99%+), but interpretability is low and you can't verify what was retained.
- Regenerative Full Summary: Generates a complete summary each time. Readable, but multi-round compression keeps shedding details.
Key conclusion: structured summaries "force retention" and prevent silent information drift.
Detailed Topics
Why Tokens-Per-Task Matters
Traditional metrics only look at tokens-per-request — that's the wrong optimization target. Once compression drops a file path or error message, the agent re-fetches, re-explores, and ends up consuming more tokens overall.
The correct metric is tokens-per-task: total consumption from task start to completion. Saving 0.5% on tokens but incurring 20% re-fetch overhead makes things more expensive, not less.
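The arithmetic is easy to sketch. The numbers below are illustrative, not measurements:

```python
# Toy comparison of tokens-per-request vs tokens-per-task.
# All figures are illustrative, not benchmarked.

def tokens_per_task(request_tokens: int, requests: int, refetch_tokens: int) -> int:
    """Total tokens consumed from task start to completion."""
    return request_tokens * requests + refetch_tokens

# Aggressive compression shaves 0.5% off every request, but lost file
# paths force ~20% extra re-fetch traffic over the whole task.
baseline = tokens_per_task(request_tokens=10_000, requests=20, refetch_tokens=0)
aggressive = tokens_per_task(request_tokens=9_950, requests=20, refetch_tokens=40_000)

assert aggressive > baseline  # "cheaper" requests, more expensive task
```

The per-request win is real but tiny; the re-fetch penalty dominates, which is why tokens-per-task is the metric worth optimizing.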
The Artifact Trail Problem
The artifact trail is the weakest dimension across all compression methods, scoring only 2.2–2.5/5 in evaluations. Even structured summaries struggle to consistently preserve complete file trails.
Coding agents need to know:
- Which files were created
- Which files were modified and what changed
- Which files were read but not modified
- Function names, variable names, error messages
This usually requires dedicated mechanisms beyond natural language summaries.
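One such dedicated mechanism is to keep the file trail as structured data outside the summary and render it into the `## Files Modified` section verbatim on every compression. A minimal sketch (class and method names are illustrative, not from any particular framework):

```python
from dataclasses import dataclass, field

@dataclass
class ArtifactTrail:
    """Tracks file interactions outside the natural-language summary,
    so compression can never silently drop them."""
    modified: dict = field(default_factory=dict)   # path -> change note
    read_only: set = field(default_factory=set)

    def record_read(self, path: str) -> None:
        if path not in self.modified:
            self.read_only.add(path)

    def record_write(self, path: str, note: str) -> None:
        self.read_only.discard(path)
        self.modified[path] = note

    def render(self) -> str:
        """Emit a '## Files Modified' section to splice into every summary."""
        lines = ["## Files Modified"]
        lines += [f"- {p}: {n}" for p, n in sorted(self.modified.items())]
        lines += [f"- {p}: No changes (read only)" for p in sorted(self.read_only)]
        return "\n".join(lines)

trail = ArtifactTrail()
trail.record_read("auth.controller.ts")
trail.record_write("config/redis.ts", "Fixed connection pooling configuration")
print(trail.render())
```

Because the trail is data rather than prose, it survives every summarization round unchanged; the summarizer only ever appends to it.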
Structured Summary Sections
An effective structured summary should include:
## Session Intent
[What the user is trying to accomplish]
## Files Modified
- auth.controller.ts: Fixed JWT token generation
- config/redis.ts: Updated connection pooling
- tests/auth.test.ts: Added mock setup for new config
## Decisions Made
- Using Redis connection pool instead of per-request connections
- Retry logic with exponential backoff for transient failures
## Current State
- 14 tests passing, 2 failing
- Remaining: mock setup for session service tests
## Next Steps
1. Fix remaining test failures
2. Run full test suite
3. Update documentation
The point of structure is to "force coverage of critical information" and prevent omissions.
Compression Trigger Strategies
When to compress matters as much as how:
| Strategy | Trigger Point | Trade-off |
|---|---|---|
| Fixed threshold | 70-80% context utilization | Simple but may trigger early |
| Sliding window | Keep last N turns + summary | Predictable |
| Importance-based | Compress low-relevance first | Complex but preserves signal |
| Task-boundary | Compress at task boundaries | Readable but unpredictable |
For coding agents, sliding window + structured summary is usually the best balance.
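The two pieces of that recommended combination can be sketched in a few lines, assuming you can count the tokens currently in context (function names are illustrative):

```python
def should_compress(used_tokens: int, context_window: int, threshold: float = 0.75) -> bool:
    """Fixed-threshold trigger: compress once 70-80% of the window is used."""
    return used_tokens / context_window >= threshold

def sliding_window(messages: list, summary: str, keep_last: int = 10) -> list:
    """Sliding-window layout: structured summary first, last N turns verbatim."""
    head = [{"role": "system", "content": summary}]
    return head + messages[-keep_last:]

# Example: 156k tokens used in a 200k window -> 78% utilization, trigger fires.
assert should_compress(156_000, 200_000)
```

The threshold leaves headroom for the compression call itself and for the next few turns, which is why 70-80% beats waiting until the window is nearly full.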
Probe-Based Evaluation
ROUGE/embedding similarity can't measure functional fidelity. A summary might "look similar" but be missing a critical file path.
Probe-based evaluation tests retention through questions:
| Probe Type | What It Tests | Example Question |
|---|---|---|
| Recall | Factual retention | "What was the original error message?" |
| Artifact | File tracking | "Which files have we modified?" |
| Continuation | Task planning | "What should we do next?" |
| Decision | Reasoning chain | "What did we decide about the Redis issue?" |
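A probe harness can be sketched as follows; `ask_agent` and `judge_answer` are placeholders for your agent call and an LLM-as-judge scorer (both are assumptions, not a specific API):

```python
# Hypothetical probe harness: ask each probe against the compressed
# context and have a judge score the answer for functional fidelity.

PROBES = [
    ("recall",       "What was the original error message?"),
    ("artifact",     "Which files have we modified?"),
    ("continuation", "What should we do next?"),
    ("decision",     "What did we decide about the Redis issue?"),
]

def run_probes(compressed_context, ask_agent, judge_answer):
    """Returns a score per probe type, 1 (information lost) .. 5 (faithful)."""
    scores = {}
    for probe_type, question in PROBES:
        answer = ask_agent(compressed_context, question)
        scores[probe_type] = judge_answer(question, answer)
    return scores
```

Run the same probes against the uncompressed transcript as a reference answer; the delta per probe type tells you which kind of information the compressor is losing.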
Evaluation Dimensions
Six dimensions for measuring compression quality:
- Accuracy: Are technical details correct?
- Context Awareness: Does it match the current conversation state?
- Artifact Trail: Is the file trail complete?
- Completeness: Does it cover the key points?
- Continuity: Can the task resume seamlessly?
- Instruction Following: Are constraints respected?
Accuracy has the widest variance. Artifact Trail is consistently the weakest.
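If each dimension is scored 1-5, the aggregate is a design choice rather than a fixed formula. A minimal unweighted version (weighting, e.g. up-weighting Artifact Trail for coding agents, is a reasonable variation):

```python
DIMENSIONS = ["accuracy", "context_awareness", "artifact_trail",
              "completeness", "continuity", "instruction_following"]

def quality_score(scores: dict) -> float:
    """Unweighted mean over the six evaluation dimensions."""
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
```

Reporting per-dimension scores alongside the mean matters here, since an average can hide the consistently weak artifact-trail dimension.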
Practical Guidance
Implementing Anchored Iterative Summarization
- Define summary sections (tailored to your task type)
- First compression: generate a complete structured summary
- Subsequent compressions: only summarize the newly truncated portion and merge
- Don't regenerate from scratch — that causes detail drift
- Record summary provenance for debugging
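The steps above can be sketched as a single merge function; the function names and the comment-style provenance marker are illustrative:

```python
# Minimal sketch of anchored iterative compression.
# `summarize_llm(prev_summary, truncated_turns)` is assumed to return an
# updated structured summary with the same fixed sections as the anchor.

def compress(messages, summary, summarize_llm, keep_last=10):
    """Summarize only the newly truncated turns and merge into the anchor."""
    truncated, kept = messages[:-keep_last], messages[-keep_last:]
    if not truncated:
        return messages, summary          # nothing new to fold in
    new_summary = summarize_llm(summary, truncated)
    # Provenance for debugging: record what each summary version absorbed.
    new_summary += f"\n<!-- merged {len(truncated)} turns -->"
    return kept, new_summary
```

The key property is that the previous summary is always an input, never regenerated from scratch, so details retained once stay retained unless the merge prompt explicitly supersedes them.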
When to Use Each Approach
Use anchored iterative summarization when:
- Sessions are long (100+ messages)
- File tracking is critical
- You need verifiable information retention
Use opaque compression when:
- Maximum compression ratio is the priority
- Sessions are relatively short
- Re-fetch cost is low
Use regenerative summaries when:
- Summary readability is paramount
- There are clear phase boundaries
- You can accept repeated review passes
Compression Ratio Considerations
| Method | Compression Ratio | Quality Score | Trade-off |
|---|---|---|---|
| Anchored Iterative | 98.6% | 3.70 | Best quality, slightly less compression |
| Regenerative | 98.7% | 3.44 | Medium quality |
| Opaque | 99.3% | 3.35 | Best compression, quality drops |
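Compression ratio here means the fraction of tokens removed. As an illustrative calculation (the 1,250-token figure is made up for the example), an 89,000-token session reduced to roughly 1,250 tokens gives about 98.6%:

```python
def compression_ratio(original_tokens: int, compressed_tokens: int) -> float:
    """Fraction of tokens removed by compression."""
    return 1 - compressed_tokens / original_tokens

assert round(compression_ratio(89_000, 1_250), 3) == 0.986
```

Note how little headroom separates the methods at this scale: the gap between 98.6% and 99.3% is a few hundred tokens, which is cheap insurance for the quality difference.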
Examples
Example 1: Debugging Session Compression
Original context (89,000 tokens, 178 messages):
- 401 error on /api/auth/login endpoint
- Traced through auth controller, middleware, session store
- Found stale Redis connection
- Fixed connection pooling, added retry logic
- 14 tests passing, 2 failing
Structured summary after compression:
## Session Intent
Debug 401 Unauthorized error on /api/auth/login despite valid credentials.
## Root Cause
Stale Redis connection in session store. JWT generated correctly but session could not be persisted.
## Files Modified
- auth.controller.ts: No changes (read only)
- middleware/cors.ts: No changes (examined)
- config/redis.ts: Fixed connection pooling configuration
- services/session.service.ts: Added retry logic for transient failures
- tests/auth.test.ts: Updated mock setup
## Test Status
14 passing, 2 failing (mock setup issues)
## Next Steps
1. Fix remaining test failures (mock session service)
2. Run full test suite
3. Deploy to staging
Example 2: Probe Response Quality
After compression, asking "What was the original error?":
Good response (structured summarization):
"The original error was a 401 Unauthorized response from the /api/auth/login endpoint. Users received this error with valid credentials. Root cause was stale Redis connection in session store."
Poor response (aggressive compression):
"We were debugging an authentication issue. The login was failing. We fixed some configuration problems."
Guidelines
- Optimize for tokens-per-task, not tokens-per-request
- Use structured summaries with explicit sections for file tracking
- Trigger compression at 70-80% context utilization
- Implement incremental merging rather than full regeneration
- Test compression quality with probe-based evaluation
- Track artifact trail separately if file tracking is critical
- Accept slightly lower compression ratios for better quality retention
- Monitor re-fetching frequency as a compression quality signal
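The last guideline can be automated: count how often the agent re-reads something it already fetched. A minimal sketch over a log of tool calls:

```python
from collections import Counter

def refetch_rate(tool_calls: list) -> float:
    """Fraction of calls that repeat an already-seen (tool, target) pair.
    A rising rate right after compression suggests the summary dropped
    information the agent still needed."""
    seen = Counter()
    repeats = 0
    for call in tool_calls:
        if seen[call]:
            repeats += 1
        seen[call] += 1
    return repeats / len(tool_calls) if tool_calls else 0.0

calls = [("read_file", "config/redis.ts"),
         ("read_file", "auth.controller.ts"),
         ("read_file", "config/redis.ts")]   # re-read after compression
assert refetch_rate(calls) == 1 / 3
```

Comparing the rate in the turns before and after each compression event gives a cheap, always-on quality signal that needs no LLM judge.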
Practice Task
- Write a structured summary template for your project using the sections above
- Design 3 probe questions to verify whether critical facts were retained
Related Pages
- Claude Code Examples
- Context Engineering Fundamentals
- Context Degradation Patterns
- Advanced Evaluation
Integration
This skill connects to:
- context-degradation - Compression is a mitigation strategy
- context-optimization - Compression is one optimization technique
- evaluation - Probe-based evaluation applies to compression testing
- memory-systems - Compression relates to scratchpad and summary memory patterns
References
Related skills:
- context-degradation - Understanding what compression prevents
- context-optimization - Broader optimization strategies
- evaluation - Building evaluation frameworks
External resources:
- Factory Research: Evaluating Context Compression for AI Agents (December 2025)
- Research on LLM-as-judge evaluation methodology (Zheng et al., 2023)
Skill Metadata
Created: 2025-12-22 Last Updated: 2025-12-22 Author: Agent Skills for Context Engineering Contributors Version: 1.0.0
Frequently Asked Questions
The most commonly searched questions on this chapter's topic.
Should context compression aim for the fewest possible tokens?
No. The target is tokens-per-task (total cost of the task), not tokens-per-request (fewest per call). Compress too aggressively and you lose critical information like file paths and error messages; the agent then re-retrieves and re-explores, making the whole task more expensive. Saving 0.5% of tokens while incurring 20% re-fetch cost is a net loss.
Which scenarios suit each of the three compression strategies?
(1) Anchored Iterative Summarization: a continuously updated structured summary, 98.6% compression / 3.70 quality score; best for long sessions (100+ messages) where file tracking is critical. (2) Opaque Compression: 99.3% compression ratio but low interpretability; best for short sessions with low re-fetch cost. (3) Regenerative Full Summary: highly readable, but multi-round compression sheds detail; best for tasks with clear phase boundaries.
When should compression trigger? 70% or 90%?
70-80% is the sweet spot. Common trigger strategies: fixed threshold (70-80% utilization; simple but may fire early), sliding window (keep the last N turns plus a summary; predictable), importance-based (compress low-relevance content first; complex but preserves signal), and task-boundary (compress at task boundaries; readable but unpredictable). For coding agents, sliding window plus structured summary is the most balanced combination.
Which sections should a structured summary contain?
Five fixed sections: ## Session Intent (user goal), ## Files Modified (with specific change notes), ## Decisions Made (key decisions), ## Current State (tests passing/failing, current progress), and ## Next Steps (action list). The point of structure is to force coverage of critical information; the artifact trail (file trail) is the lowest-scoring evaluation dimension (2.2-2.5/5) and must be listed explicitly.
How do you verify that compression didn't lose critical information?
Use probe-based evaluation rather than ROUGE / embedding similarity alone (those measure surface similarity, not functional fidelity). Ask four probe types: Recall (what was the original error message?), Artifact (which files did we modify?), Continuation (what should we do next?), and Decision (what did we decide about Redis?). Then score across six dimensions: Accuracy / Context Awareness / Artifact Trail / Completeness / Continuity / Instruction Following.