Workflow Automation
Many AI apps look great in a single-request demo. But the moment they become long-running tasks, batch jobs, or multi-step pipelines, problems surface: tasks get lost, retries go haywire, state is unclear, and there's no way to insert human review. The real value of an AI engineer often isn't tuning a single prompt to 90% accuracy -- it's making the entire workflow recoverable, observable, and scalable.
So this page is about production workflow automation, not just "chaining a few APIs together."
Bottom line: build explicit state first, then automate
The worst thing for AI workflows is implicit state.
Which step a task is on, how many times it's failed, whether it should go to a human -- if all of that can only be guessed from logs, the automation will spiral out of control fast.
A more stable principle:
- Every step has a clear state
- Every step can be retried or recovered
- Every step can be observed
When you actually need a workflow system
| Scenario | Why a single request isn't enough |
|---|---|
| multi-step research | Needs search, parse, summarize, merge |
| batch summarization | One request can't handle a whole document batch |
| embedding / indexing pipeline | Has ingest, split, embed, store -- multiple steps |
| HITL review flow | Model finishes, but humans still need to review |
| tool orchestration | Calls multiple external systems |
If the task spans time, tools, and states, a workflow isn't optional.
A more reliable workflow skeleton
```
ingest
  -> queue
  -> worker step
  -> checkpoint
  -> next step or retry
  -> human review if needed
  -> completion
```
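A minimal sketch of that skeleton, focused on the checkpoint part: the task runner records each completed step, so a crash mid-pipeline resumes from the last checkpoint instead of redoing everything. The function name, `steps` shape, and JSON-file checkpoint format are all illustrative; in production you'd checkpoint to a database or workflow engine.

```python
import json
import pathlib

def run_task(task_id: str, steps, checkpoint_dir="checkpoints"):
    """Run an ordered list of (name, fn) steps, checkpointing after each one.

    On restart, steps already recorded as done are skipped, so a crash
    halfway through resumes from the last completed step.
    """
    path = pathlib.Path(checkpoint_dir) / f"{task_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    state = json.loads(path.read_text()) if path.exists() else {"done": [], "data": None}
    for name, fn in steps:
        if name in state["done"]:
            continue  # completed in a previous run; skip on resume
        state["data"] = fn(state["data"])
        state["done"].append(name)
        path.write_text(json.dumps(state))  # checkpoint after every step
    return state["data"]
```

The important design choice is that the checkpoint is written after every step, not only at the end: the cost of a crash becomes one step's worth of work instead of the whole task.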
The critical parts here aren't the queue itself -- they're checkpoints and retry policy.
Why a state machine beats "scripts chained together"
A more production-like task state should at least include:
| State | Meaning |
|---|---|
| pending | Created, not yet processed |
| running | Currently executing |
| success | Completed |
| failed | Failed this round, needs evaluation |
| needs_review | Waiting for human intervention |
| cancelled | Aborted or rolled back |
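The table above maps naturally onto an enum plus an explicit transition table, so an illegal transition fails loudly instead of silently corrupting state. The transition set below is one reasonable choice, not the only one:

```python
from enum import Enum

class TaskState(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"
    NEEDS_REVIEW = "needs_review"
    CANCELLED = "cancelled"

# Legal transitions; anything outside this table is a bug, not a log curiosity.
TRANSITIONS = {
    TaskState.PENDING: {TaskState.RUNNING, TaskState.CANCELLED},
    TaskState.RUNNING: {TaskState.SUCCESS, TaskState.FAILED, TaskState.CANCELLED},
    TaskState.FAILED: {TaskState.RUNNING, TaskState.NEEDS_REVIEW, TaskState.CANCELLED},
    TaskState.NEEDS_REVIEW: {TaskState.RUNNING, TaskState.SUCCESS, TaskState.CANCELLED},
    TaskState.SUCCESS: set(),      # terminal
    TaskState.CANCELLED: set(),    # terminal
}

def transition(current: TaskState, target: TaskState) -> TaskState:
    """Move a task to a new state, rejecting anything not in the table."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```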
With this state layer, SLAs, reruns, monitoring, and auditing all become much easier.
Retry isn't just "failed, try again"
A solid retry policy needs to define at minimum:
| Item | Why it needs to be explicit |
|---|---|
| max attempts | Prevents infinite retries |
| backoff strategy | Avoids instantly hammering dependencies |
| idempotency key | Prevents duplicate side effects |
| escalation threshold | When to go to human or DLQ |
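All four knobs from the table can live in one small policy object. This is a sketch, not a library recommendation: the class name and defaults are illustrative, the backoff uses exponential growth with full jitter, and the idempotency key is just a stable hash of task plus step so a rerun can be deduplicated downstream.

```python
import hashlib
import random

class RetryPolicy:
    """Explicit retry knobs: max attempts, backoff, escalation threshold."""

    def __init__(self, max_attempts=5, base_delay=1.0, max_delay=60.0,
                 escalation_threshold=3):
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.escalation_threshold = escalation_threshold

    def backoff(self, attempt: int) -> float:
        # Exponential backoff with full jitter, capped at max_delay,
        # so retries don't instantly hammer a struggling dependency.
        return random.uniform(0, min(self.max_delay, self.base_delay * 2 ** attempt))

    def should_retry(self, attempt: int) -> bool:
        return attempt < self.max_attempts

    def should_escalate(self, attempt: int) -> bool:
        # Past this threshold, route to a human or the DLQ instead of retrying.
        return attempt >= self.escalation_threshold

def idempotency_key(task_id: str, step: str) -> str:
    """Same task + step always yields the same key, so duplicate
    side effects can be detected and dropped by the consumer."""
    return hashlib.sha256(f"{task_id}:{step}".encode()).hexdigest()
```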
Many workflow incidents aren't caused by the first failure -- they're caused by failing the same way many times afterward.
LLM steps in workflows should be treated as unstable dependencies
This is key.
An LLM step isn't a deterministic function, so you should assume it will:
- time out
- fail schema validation
- drift in output format
- fluctuate in quality
That means every LLM step should ideally have:
- timeout
- schema validation
- fallback / retry
- low-confidence review path
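Those four guards can be wrapped around any single LLM call. In the sketch below, `call_model` and `validate` are placeholders for your own client and schema check (they are not a real library API); the wrapper enforces a timeout, retries on schema failure, and falls back or routes to review when everything fails.

```python
import concurrent.futures

def guarded_llm_step(call_model, validate, fallback=None,
                     timeout_s=30.0, max_attempts=2):
    """Guard one LLM step: timeout, schema validation, retry, fallback.

    call_model() returns raw model output; validate(raw) returns (ok, parsed).
    Both are stand-ins for your own client and schema layer.
    """
    last_error = None
    for _ in range(max_attempts):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(call_model)
        try:
            raw = future.result(timeout=timeout_s)  # hard timeout on the call
        except concurrent.futures.TimeoutError:
            last_error = "timeout"
            continue
        finally:
            pool.shutdown(wait=False)
        ok, parsed = validate(raw)
        if ok:
            return {"status": "success", "output": parsed}
        last_error = "schema"  # output didn't match the expected schema; retry
    if fallback is not None:
        return {"status": "fallback", "output": fallback()}
    # No fallback left: hand off to the low-confidence review path.
    return {"status": "needs_review", "error": last_error}
```

Note that the wrapper never raises on bad model output; it always returns an explicit status, which is what lets the surrounding state machine decide what to do next.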
Treating the LLM as a stable black box is why many workflows become unmaintainable later.
Design HITL paths early
Human-in-the-loop isn't a patch for when the system fails -- it's a formal workflow branch.
It typically fits:
| Scenario | Why humans need to step in |
|---|---|
| Low confidence output | Model itself isn't reliable |
| High-risk business content | One mistake is expensive |
| Repeatedly failed tasks | Automation is no longer worth it |
| Key customer tasks | SLA and trust demands are higher |
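Making the review branch formal can be as small as one routing predicate evaluated at the end of each step. The field names below (`confidence`, `risk_tier`, `attempts`, `customer_tier`) and the thresholds are illustrative assumptions, not a standard:

```python
def needs_human_review(task: dict) -> bool:
    """Route a task to the review branch based on the triggers above.
    Field names and thresholds are illustrative; tune them per workflow."""
    if task.get("confidence", 1.0) < 0.7:      # low-confidence output
        return True
    if task.get("risk_tier") == "high":        # high-risk business content
        return True
    if task.get("attempts", 0) >= 3:           # repeatedly failed task
        return True
    if task.get("customer_tier") == "key":     # key customer, higher SLA
        return True
    return False
```

Because this is an explicit branch rather than an afterthought, tasks that match it land in the `needs_review` state instead of being silently retried or dropped.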
If the review path is cobbled together after the fact, it's usually terrible to use.
Observability: don't just watch overall success rate
Look at these instead:
| Metric | Purpose |
|---|---|
| queue lag | Is the system backed up? |
| step-level P95 | Which step is slowing everything down? |
| retry count | Which step is most unstable? |
| DLQ size | How many tasks has automation given up on? |
| review rate | What percentage needs human intervention? |
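These metrics are cheap to compute from task records. The record shape below (`state`, `retries`, `step_ms`) and the `dead_letter` state for DLQ'd tasks are assumptions for illustration; the p95 uses a simple nearest-rank percentile, which is good enough for a dashboard:

```python
def p95(values):
    """95th percentile via nearest rank; adequate for dashboards."""
    values = sorted(values)
    return values[min(len(values) - 1, int(0.95 * len(values)))]

def workflow_metrics(tasks):
    """Aggregate step-level metrics from a list of task records.

    Assumed record shape: {"state": str, "retries": {step: int},
    "step_ms": {step: float}} -- illustrative, not a standard.
    """
    per_step_ms, retry_count = {}, {}
    for t in tasks:
        for step, ms in t.get("step_ms", {}).items():
            per_step_ms.setdefault(step, []).append(ms)
        for step, n in t.get("retries", {}).items():
            retry_count[step] = retry_count.get(step, 0) + n
    return {
        "dlq_size": sum(t["state"] == "dead_letter" for t in tasks),
        "review_rate": sum(t["state"] == "needs_review" for t in tasks) / len(tasks),
        "step_p95_ms": {s: p95(v) for s, v in per_step_ms.items()},
        "retry_count": retry_count,
    }
```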
If you only watch overall success, you'll discover local bad trends way too late.
Common failure points
| Problem | Root cause |
|---|---|
| Rerun writes duplicate results | No idempotency |
| Crash mid-way means full redo | No checkpoints |
| Hard to insert human review | No explicit review state |
| Debugging is pure guesswork | No per-task trace |
The most valuable part of AI workflow engineering is turning these "used to rely on luck" aspects into system capabilities.
Practice
Take one of your multi-step tasks and map out these 5 things:
- task states
- retry policy
- checkpoint locations
- review triggers
- failure metrics
Once these 5 are defined, your workflow automation is already a level above script chaining.
❓ FAQ
Frequently asked questions about this chapter's topic
When is a single request not enough, so a workflow system becomes mandatory?
Five scenarios clearly need one: multi-step research (search + parse + summarize + merge), batch summarization (a whole document batch doesn't fit in one request), embedding/indexing pipelines (ingest + split + embed + store, multiple steps), HITL review flows (humans still review after the model finishes), and tool orchestration (calling multiple external systems). When a task spans time, tools, and states, a workflow isn't optional.
Why is a state machine more stable than chaining scripts together?
Because a production task needs at least six explicit states: pending / running / success / failed / needs_review / cancelled. When scripts are chained together, all state hides in memory or logs, so SLAs, reruns, monitoring, and auditing are impossible; explicit state tells you exactly which step a task is on, how many times it has failed, and whether it should go to a human.
What makes a retry policy reliable?
Define at least four things: max attempts to prevent infinite retries, a backoff strategy to avoid instantly hammering downstream dependencies, an idempotency key to prevent duplicate side effects, and an escalation threshold that decides when to hand off to a human or the DLQ. The root cause of incidents is usually not the first failure but the system repeating the same mistake many times afterward.
What assumptions should you design around for an LLM step in a workflow?
Treat it as an unstable dependency by default: it will time out, fail schema validation, drift in output, and fluctuate in quality. Every LLM step needs at least four things: a timeout, schema validation, fallback/retry, and a low-confidence review path. Treating the LLM as a deterministic function is the root cause of many workflows becoming hard to maintain later.
Overall success rate isn't enough for workflow observability -- what else should you watch?
Five metrics: queue lag (is the system backed up?), step-level P95 (which step slows everything down?), retry count (which step is most unstable?), DLQ size (how many tasks has automation given up on?), and review rate (what share needs human intervention?). If you only watch overall success, local bad trends surface far too late.