RAG & Agent Strategy: Complex System Design
RAG and Agent are the two directions where AI PMs are most easily led astray by buzzwords. Many roadmaps open with "build an Agent" or "add RAG," but the user task hasn't even been defined yet. Result: lots of architecture terminology, very little product value.
This page isn't about implementation details. It's about judging from a PM perspective: when to use RAG, when to use Agent, and when neither should be rushed.
Bottom Line: Ask About the User Task First, Then Choose the System Form
A more stable decision sequence:
- Does the user need "more accurate answers" or "more complex execution"?
- Are you lacking knowledge access capability or task orchestration capability?
- Is introducing a more complex system worth the risk and cost?
If these three steps aren't thought through first, teams easily mistake technical complexity for product progress.
What Problems Are Better Suited for RAG
RAG's core value isn't making answers "smarter." It's making answers more grounded.
It fits better when:
| Scenario | Why it fits |
|---|---|
| Internal knowledge Q&A | Needs to reference company docs and rules |
| Help center / support copilot | Needs to answer based on existing knowledge |
| Policy, process, product docs retrieval | Needs source-backed answers |
| Long document Q&A | The model doesn't inherently know your private content |
If the problem is fundamentally "the model doesn't know this material," RAG is usually the right direction.
What Problems Are Better Suited for Agent
Agent's core value isn't being "more human-like." It's being able to execute multi-step tasks.
It fits better when:
| Scenario | Why it fits |
|---|---|
| Multi-step research workflow | Needs to search, organize, generate results |
| Tasks requiring multiple tool calls | e.g., query data, write report, send notification |
| Complex operation workflows | Needs to judge next actions |
| Semi-automated task execution | Not just answering -- actually doing things |
If the task is fundamentally "do A first, then judge, then do B," Agent starts to make sense.
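The "do A, then judge, then do B" shape can be sketched as a minimal loop. Everything here is an illustrative placeholder, not a real framework's API; in a real agent, the judgment step would be an LLM call choosing the next tool.

```python
def run_research_task(query, tools, max_steps=5):
    """Minimal multi-step loop: act, observe, then judge the next action.

    `tools` maps an action name to a callable; all names are hypothetical.
    """
    history = []
    action = "search"
    for _ in range(max_steps):
        result = tools[action](query, history)   # do A
        history.append((action, result))
        action = decide_next_action(history)     # judge
        if action == "done":                     # stop when the judge says so
            break
    return history

def decide_next_action(history):
    # Stand-in for the model's judgment step; hard-coded for the sketch.
    return "done" if len(history) >= 2 else "summarize"
```

The point of the sketch is the control flow: the next step is not known up front, which is exactly what separates an agent from a fixed workflow.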
When You Shouldn't Rush to Add RAG
| Situation | Reason |
|---|---|
| Document quality is poor | Garbage in, garbage answers out |
| Knowledge update process isn't established | Will go stale quickly |
| What users actually want isn't Q&A | You might be optimizing the wrong problem |
| Team hasn't defined source trust | Retrieved doesn't mean usable |
Many cases of "RAG performing poorly" aren't actually model problems. They're knowledge base governance failures.
When You Shouldn't Casually Add Agent
| Situation | Reason |
|---|---|
| Task steps are actually fixed | Regular workflow automation might be more stable |
| One wrong step is very costly | Agent's autonomy amplifies risk |
| Single-step quality isn't solved yet | Multi-step chains only amplify problems |
| Users don't actually need autonomy | You're adding complexity, not value |
Agent isn't an upgraded chat box. Many products actually only need a clear flow + a few tool calls, not a full agent loop.
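The "clear flow + a few tool calls" alternative is just ordinary code. A sketch using the query-data / write-report / send-notification example from above, with all three tools as hypothetical callables:

```python
def fixed_workflow(customer_id, query_data, write_report, send_notification):
    """Fixed three-step flow: the steps are hard-coded, so there is no
    model-driven planning to go wrong. Tool callables are placeholders.
    """
    data = query_data(customer_id)       # step 1: fetch
    report = write_report(data)          # step 2: generate
    send_notification(report)           # step 3: deliver
    return report
```

If this version covers the user task, an agent loop adds failure modes without adding value.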
A More Practical Decision Framework
Start with this table:
| Problem type | More likely solution |
|---|---|
| Lacking knowledge | RAG |
| Lacking step execution | Agent |
| Lacking both | RAG + Agent, but layer them first |
| Just a regular form or rule flow | Might not need AI system complexity at all |
What PMs should avoid most is choosing the most complex solution because "it sounds more advanced."
RAG PM Focus Points: Beyond Retrieval Accuracy
More important questions to watch:
| Decision point | Why PMs should care |
|---|---|
| Source coverage | Does the knowledge base cover what users will ask |
| Update freshness | How often is knowledge refreshed |
| Citation UX | Can users see the source |
| Failure handling | What happens when nothing is found |
| Trust boundary | Which sources can be trusted |
Whether RAG works well depends heavily on knowledge operations, not just per-retrieval metrics.
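The citation-UX and failure-handling rows above can be made concrete. A minimal sketch, assuming a generic scored retriever and a relevance threshold (both are assumptions, not any specific library's API):

```python
from dataclasses import dataclass

@dataclass
class Doc:
    source: str
    text: str
    score: float

def answer_with_citations(query, retrieve, min_score=0.5):
    """Return (answer, sources), or an explicit fallback when nothing is found.

    `retrieve` is any callable returning scored Docs; illustrative only.
    """
    docs = [d for d in retrieve(query) if d.score >= min_score]
    if not docs:
        # Failure handling: say so, instead of letting the model guess.
        return ("I couldn't find this in the knowledge base.", [])
    answer = generate_answer(query, docs)        # LLM call in a real system
    return (answer, [d.source for d in docs])    # citation UX: show sources

def generate_answer(query, docs):
    # Stand-in for the generation step.
    return "Based on %d document(s): ..." % len(docs)
```

Note that the threshold and the "not found" message are product decisions, not model decisions, which is why they belong on the PM's list.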
Agent PM Focus Point: Controllability
The questions an Agent must answer aren't "is it cool?" but:
- Will it call tools it shouldn't call?
- Will it keep executing on top of wrong premises?
- Is each step observable?
- Can it abort or escalate to a human on failure?
If these can't be answered clearly, the Agent approach usually isn't mature enough.
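The four controllability questions map naturally onto guardrails in code: a tool allowlist, per-step logging, and an explicit escalation path. A sketch under assumed names (the planner is a placeholder for the model's planning call):

```python
def run_controlled_agent(task, tools, allowed, max_steps=10, log=print):
    """Execute steps with a tool allowlist, per-step logging, and
    escalation to a human on failure. All names are illustrative.
    """
    for step in range(max_steps):
        action, args = plan_next_step(task, step)
        if action == "done":
            return "completed"
        if action not in allowed:                # won't call tools it shouldn't
            log("step %d: blocked tool '%s'" % (step, action))
            return "escalated_to_human"
        try:
            result = tools[action](**args)
            log("step %d: %s -> %s" % (step, action, result))  # observable
        except Exception as exc:
            log("step %d: %s failed: %s" % (step, action, exc))
            return "escalated_to_human"          # abort instead of guessing on
    return "escalated_to_human"                  # step budget exhausted

def plan_next_step(task, step):
    # Stand-in for the model's planning call; hard-coded for the sketch.
    return ("done", {}) if step > 0 else ("lookup", {"query": task})
```

The useful property is that every exit path is explicit: completed, blocked, failed, or out of budget, each of which is observable after the fact.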
When RAG + Agent Ship Together, Layer Them First
A more stable approach isn't building one "big comprehensive Agent" at once. Break it into:
knowledge layer -> retrieval layer -> decision layer -> action layer
This way you can distinguish:
- Was it a retrieval error
- Was it a judgment error
- Was it a tool execution error
Once systems get complex, the worst thing is having nobody know which layer the error happened in.
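The layering idea can be sketched as a pipeline where each layer is a separate step with its own trace, so a bad outcome can be attributed to exactly one layer. All names are hypothetical; `knowledge` stands in for the knowledge layer, and retrieval is a trivial substring match for illustration:

```python
def run_pipeline(query, knowledge, decide, act):
    """knowledge -> retrieval -> decision -> action, with a per-layer trace."""
    trace = {}
    trace["retrieval"] = retrieved = [d for d in knowledge if query in d]
    if not retrieved:
        return {"error_layer": "retrieval", "trace": trace}
    trace["decision"] = plan = decide(retrieved)
    if plan is None:
        return {"error_layer": "decision", "trace": trace}
    try:
        trace["action"] = act(plan)
    except Exception:
        return {"error_layer": "action", "trace": trace}
    return {"error_layer": None, "trace": trace}
```

With this shape, "RAG performing poorly" debugging becomes a question with an answer: which layer's trace is wrong?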
Most Overlooked Costs
RAG and Agent costs aren't limited to API bills.
They also include:
- Document governance costs
- Eval and monitoring costs
- Prompt / workflow maintenance costs
- Bad case handling costs
If PMs budget only for model call costs, they will significantly underestimate the total.
A Sufficient Set of Strategy Review Questions
Before discussing RAG or Agent, have the team answer:
- What exactly is the user task?
- What's the worst-case failure consequence?
- Why isn't a simpler workflow sufficient?
- How will quality be monitored post-launch?
- Which layer can you roll back to when issues arise?
If these 5 questions can't be answered clearly, hold off on drawing complex architecture diagrams.
Practice
Take the AI feature you most want to build. First determine which category it resembles:
- Knowledge-type problem
- Multi-step execution problem
- Both
- Actually just regular automation
Get the classification right, and the system design direction usually won't be too far off.