RAG & Agent Strategy: Complex System Design
RAG and Agent are the two directions where AI PMs are most easily led astray by buzzwords. Many roadmaps open with "build an Agent" or "add RAG," but the user task hasn't even been defined yet. Result: lots of architecture terminology, very little product value.
This page isn't about implementation details. It's about judging from a PM perspective: when to use RAG, when to use Agent, and when neither should be rushed.
Bottom Line: Ask About the User Task First, Then Choose System Form
A more stable decision sequence:
- Does the user need "more accurate answers" or "more complex execution"?
- Are you lacking knowledge access capability or task orchestration capability?
- Do the risks and costs justify introducing a more complex system?
If these three steps aren't thought through first, teams easily mistake technical complexity for product progress.
What Problems Are Better Suited for RAG
RAG's core value isn't making answers "smarter." It's making answers more grounded.
It fits better when:
| Scenario | Why it fits |
|---|---|
| Internal knowledge Q&A | Needs to reference company docs and rules |
| Help center / support copilot | Needs to answer based on existing knowledge |
| Policy, process, product docs retrieval | Needs source-backed answers |
| Long document Q&A | The model doesn't inherently know your private content |
If the problem is fundamentally "the model doesn't know this material," RAG is usually the right direction.
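That pattern can be sketched in a few lines. This is a minimal illustration, not a production design: retrieval here is naive keyword overlap standing in for embedding search, `generate_answer` is a placeholder for a grounded model call, and all document names and contents are hypothetical.

```python
# Minimal RAG sketch: retrieve relevant private documents, then answer
# only from what was retrieved. A real system would use embedding search
# and an actual model call.

DOCS = {
    "refund-policy": "Refunds are accepted within 30 days of purchase.",
    "shipping": "Standard shipping takes 5-7 business days.",
}

def retrieve(query: str, k: int = 1) -> list[tuple[str, str]]:
    """Rank documents by naive keyword overlap with the query."""
    words = set(query.lower().split())
    scored = sorted(
        DOCS.items(),
        key=lambda item: len(words & set(item[1].lower().split())),
        reverse=True,
    )
    return scored[:k]

def generate_answer(query: str, sources: list[tuple[str, str]]) -> str:
    """Stand-in for a model call: answer grounded in retrieved context."""
    context = " ".join(text for _, text in sources)
    return f"Based on [{sources[0][0]}]: {context}"

answer = generate_answer("how long do refunds take",
                         retrieve("how long do refunds take"))
```

The structural point survives even in this toy: the answer is assembled from retrieved material the model "doesn't know," not from the model's own weights.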
What Problems Are Better Suited for Agent
Agent's core value isn't being "more human-like." It's being able to execute multi-step tasks.
It fits better when:
| Scenario | Why it fits |
|---|---|
| Multi-step research workflow | Needs to search, organize, and generate results |
| Tasks requiring multiple tool calls | e.g., query data, write report, send notification |
| Complex operation workflows | Needs to judge next actions |
| Semi-automated task execution | Not just answering -- actually doing things |
If the task is fundamentally "do A first, then judge, then do B," Agent starts to make sense.
When You Shouldn't Rush to Add RAG
| Situation | Reason |
|---|---|
| Document quality is poor | Garbage in, garbage answers out |
| Knowledge update process isn't established | Will go stale quickly |
| What users actually want isn't Q&A | You might be optimizing the wrong problem |
| Team hasn't defined source trust | Retrieved doesn't mean usable |
Many cases of "RAG performing poorly" aren't actually model problems. They're knowledge base governance failures.
When You Shouldn't Casually Add Agent
| Situation | Reason |
|---|---|
| Task steps are actually fixed | Regular workflow automation might be more stable |
| One wrong step is very costly | Agent's autonomy amplifies risk |
| Single-step quality isn't solved yet | Multi-step chains only amplify problems |
| Users don't actually need autonomy | You're adding complexity, not value |
Agent isn't an upgraded chat box. Many products actually only need a clear flow + a few tool calls, not a full agent loop.
A More Practical Decision Framework
Start with this table:
| Problem type | More likely solution |
|---|---|
| Lacking knowledge | RAG |
| Lacking step execution | Agent |
| Lacking both | RAG + Agent, but layer them first |
| Just a regular form or rule flow | Might not need AI system complexity at all |
What PMs should avoid most is choosing the most complex solution because "it sounds more advanced."
RAG PM Focus Points: Beyond Retrieval Accuracy
More important questions to watch:
| Decision point | Why PMs should care |
|---|---|
| Source coverage | Does the knowledge base cover what users will ask |
| Update freshness | How often is knowledge refreshed |
| Citation UX | Can users see the source |
| Failure handling | What happens when nothing is found |
| Trust boundary | Which sources can be trusted |
Whether RAG works well depends heavily on knowledge operations, not just per-retrieval metrics.
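The failure-handling and citation rows in the table can be made concrete with a small sketch. The relevance scores and the threshold are illustrative assumptions; the point is that a low-confidence retrieval should produce an explicit fallback, and a confident answer should always surface its source.

```python
# Sketch of citation UX + failure handling for a RAG answer:
# below a confidence threshold, refuse and escalate instead of answering.

FALLBACK = "I couldn't find this in the knowledge base. Escalating to support."

def answer_with_citation(hits: list[tuple[str, float, str]],
                         min_score: float = 0.5) -> str:
    """hits: (source_id, relevance_score, text), highest score first."""
    if not hits or hits[0][1] < min_score:
        return FALLBACK  # explicit failure path, not a hallucinated answer
    source, _, text = hits[0]
    return f"{text} (source: {source})"

good = answer_with_citation([("faq-42", 0.91, "Resets take effect immediately.")])
bad = answer_with_citation([("faq-7", 0.12, "Unrelated text.")])
```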
Agent PM Focus Point: Controllability
The question Agents should face isn't "is it cool" but:
- Will it call tools it shouldn't call?
- Will it keep executing based on wrong premises?
- Is each step observable?
- Can it abort or escalate to a human on failure?
If these can't be answered clearly, the Agent approach usually isn't mature enough.
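One way to make these questions concrete is a guarded execution step: an allowlist of tools, a step log for observability, and an escalate-to-human result on failure. The tool names and the sentinel value below are assumptions for illustration, not a specific framework's API.

```python
# Guarded agent step: allowlist + step log + human-escalation path.

ALLOWED_TOOLS = {"lookup_order", "draft_reply"}

def run_step(tool: str, call, arg, log: list[str]):
    if tool not in ALLOWED_TOOLS:
        log.append(f"BLOCKED {tool}")
        raise PermissionError(f"tool not allowed: {tool}")
    try:
        result = call(arg)
        log.append(f"OK {tool}({arg!r}) -> {result!r}")  # every step observable
        return result
    except Exception as exc:
        log.append(f"FAIL {tool}: {exc}")
        return "ESCALATE_TO_HUMAN"  # abort the loop, hand off to a person

log: list[str] = []
ok = run_step("lookup_order", lambda o: f"order {o}: shipped", "A1", log)
fallback = run_step("draft_reply", lambda _: 1 / 0, "A1", log)

blocked = False
try:
    run_step("delete_account", lambda _: None, "A1", log)
except PermissionError:
    blocked = True
```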
When RAG + Agent Ship Together, Layer Them First
A more stable approach isn't building one "big comprehensive Agent" at once. Break it into:
knowledge layer -> retrieval layer -> decision layer -> action layer
This way you can distinguish:
- Was it a retrieval error?
- Was it a judgment error?
- Was it a tool execution error?
Once systems get complex, the worst thing is having nobody know which layer the error happened in.
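A minimal sketch of that layered split, with trivial placeholder implementations for each layer: because the layers run separately, a failure can be attributed to a specific layer instead of "the Agent got it wrong."

```python
# Layered pipeline: knowledge -> retrieval -> decision -> action,
# with the stopping layer reported so errors can be localized.

def run_layers(query: str) -> tuple[str, str]:
    """Returns (result, layer), where layer names the last layer that ran."""
    layers = [
        ("knowledge", lambda q: ["doc-a", "doc-b"]),            # what exists
        ("retrieval", lambda docs: docs[:1]),                   # what's relevant
        ("decision",  lambda hits: "reply" if hits else None),  # what to do
        ("action",    lambda plan: f"executed {plan}"),         # do it
    ]
    value = query
    for name, fn in layers:
        value = fn(value)
        if value is None:
            return ("aborted", name)  # the failing layer is named explicitly
    return (value, "action")

result, layer = run_layers("where is my order")
```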
Most Overlooked Costs
RAG and Agent costs aren't limited to API bills.
They also include:
- Document governance costs
- Eval and monitoring costs
- Prompt / workflow maintenance costs
- Bad case handling costs
If PMs only budget "model call costs," they'll typically underestimate significantly.
A Sufficient Set of Strategy Review Questions
Before discussing RAG or Agent, have the team answer:
- What exactly is the user task?
- What's the worst-case failure consequence?
- Why isn't a simpler workflow sufficient?
- How will quality be monitored post-launch?
- Which layer can you roll back to when issues arise?
If these five questions can't be answered clearly, hold off on drawing complex architecture diagrams.
Practice
Take your most-wanted AI feature. First determine which category it resembles:
- Knowledge-type problem
- Multi-step execution problem
- Both
- Actually just regular automation
Get the classification right, and the system design direction usually won't be too far off.
❓ FAQ
The most commonly searched questions on this chapter's topic.
When should you use RAG, and when an Agent?
If you lack knowledge, use RAG (internal knowledge Q&A, support copilot, policy retrieval, long-document Q&A); if you lack multi-step execution, use an Agent (research workflows, multiple tool calls, complex operations, semi-automated execution). If you lack both, use RAG + Agent, but layer them first. A plain form or rule-based flow doesn't need AI system complexity at all. Don't pick the most complex solution because it "sounds more advanced."
When shouldn't you rush to add RAG?
Four red lights: document quality is poor (garbage into the knowledge base only makes answers messier), the knowledge update process isn't established (it will go stale quickly), what users actually want isn't Q&A (you're optimizing the wrong problem), and the team hasn't defined source trust (retrieved doesn't mean usable). Many cases of "RAG performing poorly" aren't model problems at all; they're knowledge base governance failures.
Why is the Agent PM's focus controllability rather than capability?
An agent's autonomy amplifies risk. Ask: (1) will it call tools it shouldn't call, (2) will it keep executing on wrong premises, (3) is each step observable, (4) can it abort or escalate to a human on failure? If one wrong step is very costly, single-step quality isn't solved yet, or users don't actually need autonomy, an Agent adds complexity rather than value.
Why must RAG + Agent be layered when shipped together?
Split the system into knowledge layer -> retrieval layer -> decision layer -> action layer so you can tell whether retrieval, judgment, or tool execution went wrong. The biggest problem with building one "big comprehensive Agent" in one go isn't performance; it's that when something breaks, nobody knows which layer it broke in.
What five questions must the team answer before evaluating a RAG / Agent plan?
(1) What exactly is the user task, (2) what's the worst-case failure consequence, (3) why isn't a simpler workflow sufficient, (4) how will quality be monitored post-launch, and (5) which layer can you roll back to when issues arise? If these five can't be answered clearly, hold off on drawing complex architecture diagrams. Technical complexity isn't product progress.