RAG & Agent Strategy: Complex System Design
RAG and Agent are the two directions where AI PMs are most easily led astray by buzzwords. Many roadmaps open with "build an Agent" or "add RAG," but the user task hasn't even been defined yet. Result: lots of architecture terminology, very little product value.
This page isn't about implementation details. It's about judging from a PM perspective: when to use RAG, when to use Agent, and when neither should be rushed.
Bottom Line: Ask About the User Task First, Then Choose System Form
A more stable decision sequence:
- Does the user need "more accurate answers" or "more complex execution"?
- Are you lacking knowledge access capability or task orchestration capability?
- Do the risks and costs justify introducing a more complex system?
If these three steps aren't thought through first, teams easily mistake technical complexity for product progress.
What Problems Are Better Suited for RAG
RAG's core value isn't making answers "smarter." It's making answers more grounded.
It fits better when:
| Scenario | Why it fits |
|---|---|
| Internal knowledge Q&A | Needs to reference company docs and rules |
| Help center / support copilot | Needs to answer based on existing knowledge |
| Policy, process, product docs retrieval | Needs source-backed answers |
| Long document Q&A | The model doesn't inherently know your private content |
If the problem is fundamentally "the model doesn't know this material," RAG is usually the right direction.
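That pattern can be sketched in a few lines. This is a minimal illustration, not a production design: retrieval here is naive keyword overlap standing in for embedding search, `generate_answer` is a placeholder for a grounded model call, and all document names and contents are hypothetical.

```python
# Minimal RAG sketch: retrieve relevant private documents, then answer
# only from what was retrieved. A real system would use embedding search
# and an actual model call.

DOCS = {
    "refund-policy": "Refunds are accepted within 30 days of purchase.",
    "shipping": "Standard shipping takes 5-7 business days.",
}

def retrieve(query: str, k: int = 1) -> list[tuple[str, str]]:
    """Rank documents by naive keyword overlap with the query."""
    words = set(query.lower().split())
    scored = sorted(
        DOCS.items(),
        key=lambda item: len(words & set(item[1].lower().split())),
        reverse=True,
    )
    return scored[:k]

def generate_answer(query: str, sources: list[tuple[str, str]]) -> str:
    """Stand-in for a model call: answer grounded in retrieved context."""
    context = " ".join(text for _, text in sources)
    return f"Based on [{sources[0][0]}]: {context}"

answer = generate_answer("how long do refunds take",
                         retrieve("how long do refunds take"))
```

The structural point survives even in this toy: the answer is assembled from retrieved material the model "doesn't know," not from the model's own weights.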
What Problems Are Better Suited for Agent
Agent's core value isn't being "more human-like." It's being able to execute multi-step tasks.
It fits better when:
| Scenario | Why it fits |
|---|---|
| Multi-step research workflow | Needs to search, organize, and generate results |
| Tasks requiring multiple tool calls | e.g., query data, write report, send notification |
| Complex operation workflows | Needs to judge next actions |
| Semi-automated task execution | Not just answering -- actually doing things |
If the task is fundamentally "do A first, then judge, then do B," Agent starts to make sense.
When You Shouldn't Rush to Add RAG
| Situation | Reason |
|---|---|
| Document quality is poor | Garbage in, garbage answers out |
| Knowledge update process isn't established | Will go stale quickly |
| What users actually want isn't Q&A | You might be optimizing the wrong problem |
| Team hasn't defined source trust | Retrieved doesn't mean usable |
Many cases of "RAG performing poorly" aren't actually model problems. They're knowledge base governance failures.
When You Shouldn't Casually Add Agent
| Situation | Reason |
|---|---|
| Task steps are actually fixed | Regular workflow automation might be more stable |
| One wrong step is very costly | Agent's autonomy amplifies risk |
| Single-step quality isn't solved yet | Multi-step chains only amplify problems |
| Users don't actually need autonomy | You're adding complexity, not value |
Agent isn't an upgraded chat box. Many products actually only need a clear flow + a few tool calls, not a full agent loop.
A More Practical Decision Framework
Start with this table:
| Problem type | More likely solution |
|---|---|
| Lacking knowledge | RAG |
| Lacking step execution | Agent |
| Lacking both | RAG + Agent, but layer them first |
| Just a regular form or rule flow | Might not need AI system complexity at all |
What PMs should avoid most is choosing the most complex solution because "it sounds more advanced."
RAG PM Focus Points: Beyond Retrieval Accuracy
More important questions to watch:
| Decision point | Why PMs should care |
|---|---|
| Source coverage | Does the knowledge base cover what users will ask |
| Update freshness | How often is knowledge refreshed |
| Citation UX | Can users see the source |
| Failure handling | What happens when nothing is found |
| Trust boundary | Which sources can be trusted |
Whether RAG works well depends heavily on knowledge operations, not just per-retrieval metrics.
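The failure-handling and citation rows in the table can be made concrete with a small sketch. The relevance scores and the threshold are illustrative assumptions; the point is that a low-confidence retrieval should produce an explicit fallback, and a confident answer should always surface its source.

```python
# Sketch of citation UX + failure handling for a RAG answer:
# below a confidence threshold, refuse and escalate instead of answering.

FALLBACK = "I couldn't find this in the knowledge base. Escalating to support."

def answer_with_citation(hits: list[tuple[str, float, str]],
                         min_score: float = 0.5) -> str:
    """hits: (source_id, relevance_score, text), highest score first."""
    if not hits or hits[0][1] < min_score:
        return FALLBACK  # explicit failure path, not a hallucinated answer
    source, _, text = hits[0]
    return f"{text} (source: {source})"

good = answer_with_citation([("faq-42", 0.91, "Resets take effect immediately.")])
bad = answer_with_citation([("faq-7", 0.12, "Unrelated text.")])
```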
Agent PM Focus Point: Controllability
The question Agents should face isn't "is it cool" but:
- Will it call tools it shouldn't call?
- Will it keep executing based on wrong premises?
- Is each step observable?
- Can it abort or escalate to a human on failure?
If these can't be answered clearly, the Agent approach usually isn't mature enough.
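One way to make these questions concrete is a guarded execution step: an allowlist of tools, a step log for observability, and an escalate-to-human result on failure. The tool names and the sentinel value below are assumptions for illustration, not a specific framework's API.

```python
# Guarded agent step: allowlist + step log + human-escalation path.

ALLOWED_TOOLS = {"lookup_order", "draft_reply"}

def run_step(tool: str, call, arg, log: list[str]):
    if tool not in ALLOWED_TOOLS:
        log.append(f"BLOCKED {tool}")
        raise PermissionError(f"tool not allowed: {tool}")
    try:
        result = call(arg)
        log.append(f"OK {tool}({arg!r}) -> {result!r}")  # every step observable
        return result
    except Exception as exc:
        log.append(f"FAIL {tool}: {exc}")
        return "ESCALATE_TO_HUMAN"  # abort the loop, hand off to a person

log: list[str] = []
ok = run_step("lookup_order", lambda o: f"order {o}: shipped", "A1", log)
fallback = run_step("draft_reply", lambda _: 1 / 0, "A1", log)

blocked = False
try:
    run_step("delete_account", lambda _: None, "A1", log)
except PermissionError:
    blocked = True
```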
When RAG + Agent Ship Together, Layer Them First
A more stable approach isn't building one "big comprehensive Agent" at once. Break it into:
knowledge layer -> retrieval layer -> decision layer -> action layer
This way you can distinguish:
- Was it a retrieval error?
- Was it a judgment error?
- Was it a tool execution error?
Once systems get complex, the worst thing is having nobody know which layer the error happened in.
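A minimal sketch of that layered split, with trivial placeholder implementations for each layer: because the layers run separately, a failure can be attributed to a specific layer instead of "the Agent got it wrong."

```python
# Layered pipeline: knowledge -> retrieval -> decision -> action,
# with the stopping layer reported so errors can be localized.

def run_layers(query: str) -> tuple[str, str]:
    """Returns (result, layer), where layer names the last layer that ran."""
    layers = [
        ("knowledge", lambda q: ["doc-a", "doc-b"]),            # what exists
        ("retrieval", lambda docs: docs[:1]),                   # what's relevant
        ("decision",  lambda hits: "reply" if hits else None),  # what to do
        ("action",    lambda plan: f"executed {plan}"),         # do it
    ]
    value = query
    for name, fn in layers:
        value = fn(value)
        if value is None:
            return ("aborted", name)  # the failing layer is named explicitly
    return (value, "action")

result, layer = run_layers("where is my order")
```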
Most Overlooked Costs
RAG and Agent costs aren't limited to API bills.
They also include:
- Document governance costs
- Eval and monitoring costs
- Prompt / workflow maintenance costs
- Bad case handling costs
If PMs only budget "model call costs," they'll typically underestimate significantly.
A Sufficient Set of Strategy Review Questions
Before discussing RAG or Agent, have the team answer:
- What exactly is the user task?
- What's the worst-case failure consequence?
- Why isn't a simpler workflow sufficient?
- How will quality be monitored post-launch?
- Which layer can you roll back to when issues arise?
If these five questions can't be answered clearly, hold off on drawing complex architecture diagrams.
Practice
Take your most-wanted AI feature. First determine which category it resembles:
- Knowledge-type problem
- Multi-step execution problem
- Both
- Actually just regular automation
Get the classification right, and the system design direction usually won't be too far off.
❓ FAQ
The most commonly searched questions on this chapter's topic.
When should you use RAG, and when an Agent?
If you lack knowledge, use RAG (internal knowledge Q&A, support copilot, policy retrieval, long-document Q&A); if you lack multi-step execution, use an Agent (research workflows, multiple tool calls, complex operations, semi-automated execution). If you lack both, use RAG + Agent, but layer them first. A plain form or rule-based flow doesn't need AI system complexity at all. Don't pick the most complex solution because it "sounds more advanced."
When shouldn't you rush to add RAG?
Four red lights: document quality is poor (garbage into the knowledge base only makes answers messier), the knowledge update process isn't established (it will go stale quickly), what users actually want isn't Q&A (you're optimizing the wrong problem), and the team hasn't defined source trust (retrieved doesn't mean usable). Many cases of "RAG performing poorly" aren't model problems at all; they're knowledge base governance failures.
Why is the Agent PM's focus controllability rather than capability?
An agent's autonomy amplifies risk. Ask: (1) will it call tools it shouldn't call, (2) will it keep executing on wrong premises, (3) is each step observable, (4) can it abort or escalate to a human on failure? If one wrong step is very costly, single-step quality isn't solved yet, or users don't actually need autonomy, an Agent adds complexity rather than value.
Why must RAG + Agent be layered when shipped together?
Split the system into knowledge layer -> retrieval layer -> decision layer -> action layer so you can tell whether retrieval, judgment, or tool execution went wrong. The biggest problem with building one "big comprehensive Agent" in one go isn't performance; it's that when something breaks, nobody knows which layer it broke in.
What five questions must the team answer before evaluating a RAG / Agent plan?
(1) What exactly is the user task, (2) what's the worst-case failure consequence, (3) why isn't a simpler workflow sufficient, (4) how will quality be monitored post-launch, and (5) which layer can you roll back to when issues arise? If these five can't be answered clearly, hold off on drawing complex architecture diagrams. Technical complexity isn't product progress.