08

RAG & Agent Strategy: Complex System Design

⏱️ 75 min

RAG and Agent are the two directions where AI PMs are most easily led astray by buzzwords. Many roadmaps open with "build an Agent" or "add RAG," but the user task hasn't even been defined yet. Result: lots of architecture terminology, very little product value.

This page isn't about implementation details. It's about judging from a PM perspective: when to use RAG, when to use Agent, and when neither should be rushed.

RAG Agent Strategy Map


Bottom Line: Ask About the User Task First, Then Choose System Form

A more stable decision sequence:

  1. Does the user need "more accurate answers" or "more complex execution"?
  2. Are you lacking knowledge access capability or task orchestration capability?
  3. Are the risk and cost worth introducing a more complex system?

If these three steps aren't thought through first, teams easily mistake technical complexity for product progress.


What Problems Are Better Suited for RAG

RAG's core value isn't making answers "smarter." It's making answers more grounded.

It fits better when:

| Scenario | Why it fits |
| --- | --- |
| Internal knowledge Q&A | Needs to reference company docs and rules |
| Help center / support copilot | Needs to answer based on existing knowledge |
| Policy, process, product docs retrieval | Needs source-backed answers |
| Long document Q&A | The model doesn't inherently know your private content |

If the problem is fundamentally "the model doesn't know this material," RAG is usually the right direction.
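The "ground the answer in retrieved material" shape can be sketched minimally. This is illustrative only: the `search` and `answer_with_rag` names are hypothetical, and a toy in-memory dictionary stands in for a real index over private documents.

```python
def search(query: str) -> list[str]:
    # Stand-in for a real vector / keyword index over private docs.
    docs = {
        "refund": "Refunds are processed within 14 days per policy v3.",
        "vpn": "VPN access requires a ticket approved by IT security.",
    }
    return [text for key, text in docs.items() if key in query.lower()]

def answer_with_rag(query: str) -> str:
    passages = search(query)
    if not passages:
        # Explicit failure path: say so rather than let the model guess.
        return "No source found -- escalate instead of guessing."
    context = "\n".join(passages)
    # In a real system this context would go into the model prompt;
    # here we just show that the answer carries its source material.
    return f"Based on internal docs: {context}"

print(answer_with_rag("How do refunds work?"))
```

The key property is that the answer is grounded in what was retrieved, not in model memory, and the empty-result case is handled explicitly.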


What Problems Are Better Suited for Agent

Agent's core value isn't being "more human-like." It's being able to execute multi-step tasks.

It fits better when:

| Scenario | Why it fits |
| --- | --- |
| Multi-step research workflow | Needs to search, organize, and generate results |
| Tasks requiring multiple tool calls | e.g., query data, write a report, send a notification |
| Complex operation workflows | Needs to judge next actions |
| Semi-automated task execution | Not just answering, but actually doing things |

If the task is fundamentally "do A first, then judge, then do B," Agent starts to make sense.


When You Shouldn't Rush to Add RAG

| Situation | Reason |
| --- | --- |
| Document quality is poor | Garbage in, garbage answers out |
| Knowledge update process isn't established | Will go stale quickly |
| What users actually want isn't Q&A | You might be optimizing the wrong problem |
| Team hasn't defined source trust | Retrieved doesn't mean usable |

Many cases of "RAG performing poorly" aren't actually model problems. They're knowledge base governance failures.


When You Shouldn't Casually Add Agent

| Situation | Reason |
| --- | --- |
| Task steps are actually fixed | Regular workflow automation might be more stable |
| One wrong step is very costly | Agent's autonomy amplifies risk |
| Single-step quality isn't solved yet | Multi-step chains only amplify problems |
| Users don't actually need autonomy | You're adding complexity, not value |

Agent isn't an upgraded chat box. Many products actually only need a clear flow + a few tool calls, not a full agent loop.
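For contrast, a "clear flow + a few tool calls" is just an ordinary function: the steps are fixed at write time, not chosen by a model at runtime. A sketch with hypothetical tool names:

```python
def query_data(user_id: str) -> dict:
    # Stand-in for a database or API call.
    return {"user": user_id, "orders": 3}

def write_report(data: dict) -> str:
    # Stand-in for a report-generation step.
    return f"{data['user']} placed {data['orders']} orders"

def run_weekly_report(user_id: str) -> str:
    # The steps never change, so a plain function is more stable
    # (and more testable) than an agent deciding them on the fly.
    data = query_data(user_id)
    return write_report(data)

print(run_weekly_report("u42"))
```

If this shape covers the task, the agent loop adds failure modes without adding value.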


A More Practical Decision Framework

Start with this table:

| Problem type | More likely solution |
| --- | --- |
| Lacking knowledge | RAG |
| Lacking step execution | Agent |
| Lacking both | RAG + Agent, but layer them first |
| Just a regular form or rule flow | Might not need AI system complexity at all |

What PMs should avoid most is choosing the most complex solution because "it sounds more advanced."
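The routing table can be mirrored as a tiny classifier. This is a sketch of the decision logic, not an implementation; the two boolean inputs are hypothetical simplifications of the earlier questions.

```python
def recommend(lacks_knowledge: bool, lacks_execution: bool) -> str:
    # Classify the gap before choosing a system form.
    if lacks_knowledge and lacks_execution:
        return "RAG + Agent, layered"
    if lacks_knowledge:
        return "RAG"
    if lacks_execution:
        return "Agent"
    return "Plain workflow -- no AI system complexity needed"

print(recommend(lacks_knowledge=True, lacks_execution=False))
```

Note that the default branch is the simplest option, not the most advanced one.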


RAG PM Focus Points: Beyond Retrieval Accuracy

More important questions to watch:

| Decision point | Why PMs should care |
| --- | --- |
| Source coverage | Does the knowledge base cover what users will ask? |
| Update freshness | How often is knowledge refreshed? |
| Citation UX | Can users see the source? |
| Failure handling | What happens when nothing is found? |
| Trust boundary | Which sources can be trusted? |

Whether RAG works well depends heavily on knowledge operations, not just per-retrieval metrics.
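Two of these points, trust boundary and update freshness, can be enforced mechanically before any retrieved document reaches the model. A sketch, assuming a hypothetical trusted-source set and freshness budget:

```python
from datetime import date, timedelta

TRUSTED_SOURCES = {"policy-wiki", "product-docs"}  # hypothetical trust boundary
MAX_AGE = timedelta(days=180)                      # hypothetical freshness budget

def usable(doc: dict, today: date) -> bool:
    # "Retrieved" does not mean "usable": enforce trust and freshness first.
    return doc["source"] in TRUSTED_SOURCES and today - doc["updated"] <= MAX_AGE

docs = [
    {"source": "policy-wiki", "updated": date(2024, 5, 1), "text": "..."},
    {"source": "random-slack-dump", "updated": date(2024, 5, 1), "text": "..."},
]
today = date(2024, 6, 1)
print([d["source"] for d in docs if usable(d, today)])
```

The untrusted document is filtered out even though retrieval returned it, which is exactly the governance gap behind many "RAG performs poorly" complaints.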


Agent PM Focus Point: Controllability

The questions an Agent must face aren't "is it cool?" but:

  1. Will it call tools it shouldn't call?
  2. Will it keep executing based on wrong premises?
  3. Is each step observable?
  4. Can it abort or escalate to a human on failure?

If these can't be answered clearly, the Agent approach usually isn't mature enough.
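Questions 1, 3, and 4 in particular can be made mechanical: gate every tool call against an allowlist, log each step, and escalate on violation. A sketch with hypothetical tool names:

```python
ALLOWED_TOOLS = {"search_docs", "create_ticket"}  # hypothetical allowlist

def call_tool(name: str, step_log: list) -> str:
    # Hard gate: the agent cannot reach tools outside policy.
    if name not in ALLOWED_TOOLS:
        step_log.append(f"BLOCKED {name} -> escalate to human")
        return "escalated"
    step_log.append(f"CALLED {name}")  # per-step observability
    return "ok"

log = []
print(call_tool("search_docs", log))      # allowed
print(call_tool("delete_database", log))  # blocked and escalated
print(log)
```

The allowlist is a product decision, not a model capability, which is why it belongs on a PM's review checklist.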


When RAG + Agent Ship Together, Layer Them First

A more stable approach isn't building one "big comprehensive Agent" at once. Break it into:

knowledge layer -> retrieval layer -> decision layer -> action layer

This way you can distinguish:

  • Was it a retrieval error
  • Was it a judgment error
  • Was it a tool execution error

Once systems get complex, the worst thing is having nobody know which layer the error happened in.
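The layering also makes errors attributable: if each layer reports its own failure, "which layer broke" stops being a mystery. A sketch with stand-in layer functions, where the decision layer deliberately fails:

```python
def retrieval(q: str) -> str:
    return q + " +docs"

def decision(q: str) -> str:
    raise ValueError("no plan for this input")  # simulated failure

def action(q: str) -> str:
    return q + " +done"

def run_layered(query: str) -> tuple:
    # Run the layers in order; tag any failure with the layer name
    # so a bad answer can be traced to retrieval, judgment, or execution.
    for name, fn in (("retrieval", retrieval),
                     ("decision", decision),
                     ("action", action)):
        try:
            query = fn(query)
        except Exception as e:
            return ("error", f"{name} layer: {e}")
    return ("ok", query)

print(run_layered("user question"))
```

A monolithic agent loop collapses these three failure modes into one opaque "it gave a bad answer."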


Most Overlooked Costs

RAG and Agent costs aren't limited to API bills.

They also include:

  • Document governance costs
  • Eval and monitoring costs
  • Prompt / workflow maintenance costs
  • Bad case handling costs

If PMs only budget "model call costs," they'll typically underestimate significantly.


A Sufficient Strategy Review Question Set

Before discussing RAG or Agent, have the team answer:

  1. What exactly is the user task?
  2. What's the worst-case failure consequence?
  3. Why isn't a simpler workflow sufficient?
  4. How will quality be monitored post-launch?
  5. Which layer can you roll back to when issues arise?

If these 5 questions can't be answered clearly, hold off on drawing complex architecture diagrams.


Practice

Take the AI feature you most want to build. First determine which category it resembles:

  1. Knowledge-type problem
  2. Multi-step execution problem
  3. Both
  4. Actually just regular automation

Get the classification right, and the system design direction usually won't be too far off.

📚 Related Resources

❓ FAQ

The most frequently searched questions about this chapter's topic.

When should you use RAG, and when an Agent?

If you lack knowledge, use RAG (internal knowledge Q&A, support copilots, policy retrieval, long-document Q&A); if you lack multi-step execution, use an Agent (research workflows, multiple tool calls, complex operations, semi-automated execution). If you lack both, use RAG + Agent, but layer them first. A plain form or rule-based flow doesn't need AI system complexity at all; don't pick the most complex solution because it "sounds more advanced."

When shouldn't you rush to add RAG?

Four red flags: document quality is poor (garbage in the knowledge base only makes answers worse), no knowledge update process is established (it will go stale quickly), what users actually want isn't Q&A (you're optimizing the wrong problem), and the team hasn't defined source trust (retrieved doesn't mean usable). Many cases of "poor RAG performance" aren't model problems at all; they're knowledge base governance failures.

Why is the PM focus for Agents controllability rather than capability?

An Agent's autonomy amplifies risk. Ask: (1) will it call tools it shouldn't, (2) will it keep executing on wrong premises, (3) is each step observable, and (4) can it abort or escalate to a human on failure? If one wrong step is very costly, single-step quality isn't solved yet, or users don't actually need autonomy, an Agent adds complexity rather than value.

Why must you layer when shipping RAG + Agent together?

Split the system into knowledge layer → retrieval layer → decision layer → action layer so you can tell whether a failure was a retrieval error, a judgment error, or a tool execution error. The biggest problem with building one "big comprehensive Agent" in one go isn't performance; it's that when something breaks, nobody knows which layer it broke in.

What 5 questions must the team answer before evaluating a RAG / Agent proposal?

(1) What exactly is the user task? (2) What's the worst-case failure consequence? (3) Why isn't a simpler workflow sufficient? (4) How will quality be monitored after launch? (5) Which layer can you roll back to when issues arise? If these 5 questions can't be answered clearly, hold off on drawing complex architecture diagrams: technical complexity isn't product progress.