
RAG & Agent Strategy: Complex System Design

⏱️ 75 min


RAG and Agent are the two directions where AI PMs are most easily led astray by buzzwords. Many roadmaps open with "build an Agent" or "add RAG," but the user task hasn't even been defined yet. Result: lots of architecture terminology, very little product value.

This page isn't about implementation details. It's about judging from a PM perspective: when to use RAG, when to use Agent, and when neither should be rushed.

[Figure: RAG & Agent Strategy Map]


Bottom Line: Ask About the User Task First, Then Choose System Form

A more stable decision sequence:

  1. Does the user need "more accurate answers" or "more complex execution"?
  2. Are you lacking knowledge access capability or task orchestration capability?
  3. Are the risk and cost worth introducing a more complex system?

If these three steps aren't thought through first, teams easily mistake technical complexity for product progress.


What Problems Are Better Suited for RAG

RAG's core value isn't making answers "smarter." It's making answers more grounded.

It fits better when:

| Scenario | Why it fits |
| --- | --- |
| Internal knowledge Q&A | Needs to reference company docs and rules |
| Help center / support copilot | Needs to answer based on existing knowledge |
| Policy, process, and product docs retrieval | Needs source-backed answers |
| Long document Q&A | The model doesn't inherently know your private content |

If the problem is fundamentally "the model doesn't know this material," RAG is usually the right direction.


What Problems Are Better Suited for Agent

Agent's core value isn't being "more human-like." It's being able to execute multi-step tasks.

It fits better when:

| Scenario | Why it fits |
| --- | --- |
| Multi-step research workflow | Needs to search, organize, and generate results |
| Tasks requiring multiple tool calls | e.g., query data, write a report, send a notification |
| Complex operation workflows | Needs to judge next actions |
| Semi-automated task execution | Not just answering, but actually doing things |

If the task is fundamentally "do A first, then judge, then do B," Agent starts to make sense.


When You Shouldn't Rush to Add RAG

| Situation | Reason |
| --- | --- |
| Document quality is poor | Garbage in, garbage answers out |
| Knowledge update process isn't established | The knowledge base will go stale quickly |
| What users actually want isn't Q&A | You might be optimizing the wrong problem |
| Team hasn't defined source trust | Retrieved doesn't mean usable |

Many cases of "RAG performing poorly" aren't actually model problems. They're knowledge base governance failures.


When You Shouldn't Casually Add Agent

| Situation | Reason |
| --- | --- |
| Task steps are actually fixed | Regular workflow automation might be more stable |
| One wrong step is very costly | Agent autonomy amplifies risk |
| Single-step quality isn't solved yet | Multi-step chains only amplify problems |
| Users don't actually need autonomy | You're adding complexity, not value |

Agent isn't an upgraded chat box. Many products actually only need a clear flow + a few tool calls, not a full agent loop.


A More Practical Decision Framework

Start with this table:

| Problem type | More likely solution |
| --- | --- |
| Lacking knowledge | RAG |
| Lacking step execution | Agent |
| Lacking both | RAG + Agent, but layer them first |
| Just a regular form or rule flow | Might not need AI system complexity at all |

What PMs should avoid most is choosing the most complex solution because "it sounds more advanced."
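The table above can be sketched as a tiny decision function. Everything here is illustrative: `pick_direction` and its return strings are hypothetical names, not a real API.

```python
def pick_direction(lacking_knowledge: bool, lacking_step_execution: bool) -> str:
    """Map the two 'what are we lacking' questions to a rough direction.

    Illustrative sketch only; real decisions also weigh risk and cost.
    """
    if lacking_knowledge and lacking_step_execution:
        return "RAG + Agent, layered"
    if lacking_knowledge:
        return "RAG"
    if lacking_step_execution:
        return "Agent"
    return "regular workflow; no AI system complexity needed"
```

The point of writing it down this way is the last branch: the default answer is no extra AI system at all, not the most advanced one.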


RAG PM Focus Points: Beyond Retrieval Accuracy

More important questions to watch:

| Decision point | Why PMs should care |
| --- | --- |
| Source coverage | Does the knowledge base cover what users will ask |
| Update freshness | How often is knowledge refreshed |
| Citation UX | Can users see the source |
| Failure handling | What happens when nothing is found |
| Trust boundary | Which sources can be trusted |

Whether RAG works well depends heavily on knowledge operations, not just per-retrieval metrics.
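The failure-handling and citation-UX points above can be sketched in a few lines. This is a minimal sketch under assumptions: `retrieve(query)` is a hypothetical function returning dicts with `text`, `source`, and `score` keys, and the threshold is an arbitrary placeholder.

```python
RELEVANCE_THRESHOLD = 0.6  # assumed cutoff; tune per product

def generate(query: str, context: str) -> str:
    """Stand-in for the real model call."""
    return f"Answer to {query!r}, grounded in {len(context)} chars of context."

def answer_with_citations(query: str, retrieve) -> dict:
    """Filter retrieved chunks by score, surface sources, fail loudly when empty."""
    hits = [h for h in retrieve(query) if h["score"] >= RELEVANCE_THRESHOLD]
    if not hits:
        # Failure handling: an explicit "not found" beats a confident guess.
        return {"answer": None,
                "message": "No trusted source found for this question."}
    context = "\n".join(h["text"] for h in hits)
    return {"answer": generate(query, context),
            "sources": sorted({h["source"] for h in hits})}
```

Note that both product decisions, the threshold and the "nothing found" message, live outside the model. That's the knowledge-operations surface PMs actually own.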


Agent PM Focus Point: Controllability

The question Agents should face isn't "is it cool" but:

  1. Will it call tools it shouldn't call?
  2. Will it keep executing based on wrong premises?
  3. Is each step observable?
  4. Can it abort or escalate to a human on failure?

If these can't be answered clearly, the Agent approach usually isn't mature enough.
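The four controllability questions map directly onto guardrails in the agent loop. A minimal sketch; the tool names, step cap, and `run_agent` signature are all assumptions for illustration, not a real framework.

```python
ALLOWED_TOOLS = {"query_data", "write_report"}  # explicit allowlist (hypothetical names)
MAX_STEPS = 5  # assumed cap on autonomy

def run_agent(plan, tools):
    """Execute (tool_name, kwargs) steps with guardrails:
    an allowlist (question 1), a step cap against runaway loops (question 2),
    an observable log (question 3), and escalation on failure (question 4)."""
    log = []
    for step, (name, kwargs) in enumerate(plan):
        if step >= MAX_STEPS:
            log.append(("abort", "step limit reached"))
            break
        if name not in ALLOWED_TOOLS:
            log.append(("escalate", f"tool {name!r} not on the allowlist"))
            break
        try:
            log.append(("ok", name, tools[name](**kwargs)))
        except Exception as exc:
            log.append(("escalate", f"{name} failed: {exc}"))
            break
    return log
```

If a team can't say where these guardrails live in their design, the four questions above have no answer yet.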


When RAG + Agent Ship Together, Layer Them First

A more stable approach isn't building one "big comprehensive Agent" at once. Break it into:

knowledge layer -> retrieval layer -> decision layer -> action layer

This way you can distinguish:

  • Was it a retrieval error?
  • Was it a judgment error?
  • Was it a tool execution error?

Once systems get complex, the worst thing is having nobody know which layer the error happened in.
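One way to keep the layers separable is to wrap each one so that any failure carries its layer's name. A hypothetical sketch: the callables `retrieve`, `decide`, and `act` stand in for whatever each layer actually does.

```python
class LayerError(Exception):
    """Wraps a failure together with the layer it happened in."""
    def __init__(self, layer: str, cause: Exception):
        super().__init__(f"[{layer}] {cause}")
        self.layer = layer

def run_layered(query, knowledge, retrieve, decide, act):
    """Run retrieval, decision, and action as separate layers so a
    failure names its layer instead of surfacing as one opaque error."""
    try:
        docs = retrieve(query, knowledge)
    except Exception as exc:
        raise LayerError("retrieval", exc)
    try:
        plan = decide(query, docs)
    except Exception as exc:
        raise LayerError("decision", exc)
    try:
        return act(plan)
    except Exception as exc:
        raise LayerError("action", exc)
```

With this shape, "which layer did it fail in" becomes a field you can log and chart, not a debugging argument.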


Most Overlooked Costs

RAG and Agent costs aren't limited to API bills.

They also include:

  • Document governance costs
  • Eval and monitoring costs
  • Prompt / workflow maintenance costs
  • Bad case handling costs

If PMs only budget "model call costs," they'll typically underestimate significantly.


A Sufficient Strategy Review Question Set

Before discussing RAG or Agent, have the team answer:

  1. What exactly is the user task?
  2. What's the worst-case failure consequence?
  3. Why isn't a simpler workflow sufficient?
  4. How will quality be monitored post-launch?
  5. Which layer can you roll back to when issues arise?

If these 5 questions can't be answered clearly, hold off on drawing complex architecture diagrams.


Practice

Take the AI feature you most want to build. First determine which category it falls into:

  1. Knowledge-type problem
  2. Multi-step execution problem
  3. Both
  4. Actually just regular automation

Get the classification right, and the system design direction usually won't be too far off.

📚 Related Resources