Workflow Automation
Many AI apps look great in a single-request demo. But the moment they become long-running tasks, batch jobs, or multi-step pipelines, problems surface: tasks get lost, retries go haywire, state is unclear, and there's no way to insert human review. The real value of an AI engineer often isn't tuning a single prompt to 90% accuracy -- it's making the entire workflow recoverable, observable, and scalable.
So this page is about production workflow automation, not just "chaining a few APIs together."
Bottom line: build explicit state first, then automate
The worst thing for AI workflows is implicit state.
Which step a task is on, how many times it's failed, whether it should go to a human -- if all of that can only be guessed from logs, the automation will spiral out of control fast.
A more stable principle:
- Every step has a clear state
- Every step can be retried or recovered
- Every step can be observed
When you actually need a workflow system
| Scenario | Why a single request isn't enough |
|---|---|
| multi-step research | Needs search, parse, summarize, merge |
| batch summarization | One request can't handle a whole document batch |
| embedding / indexing pipeline | Has ingest, split, embed, store -- multiple steps |
| HITL review flow | Model finishes, but humans still need to review |
| tool orchestration | Calls multiple external systems |
If the task spans time, tools, and states, a workflow isn't optional.
A more reliable workflow skeleton
```
ingest
  -> queue
  -> worker step
  -> checkpoint
  -> next step or retry
  -> human review if needed
  -> completion
```
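A minimal sketch of that skeleton, focused on the checkpoint part: the task runner records each completed step, so a crash mid-pipeline resumes from the last checkpoint instead of redoing everything. The function name, `steps` shape, and JSON-file checkpoint format are all illustrative; in production you'd checkpoint to a database or workflow engine.

```python
import json
import pathlib

def run_task(task_id: str, steps, checkpoint_dir="checkpoints"):
    """Run an ordered list of (name, fn) steps, checkpointing after each one.

    On restart, steps already recorded as done are skipped, so a crash
    halfway through resumes from the last completed step.
    """
    path = pathlib.Path(checkpoint_dir) / f"{task_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    state = json.loads(path.read_text()) if path.exists() else {"done": [], "data": None}
    for name, fn in steps:
        if name in state["done"]:
            continue  # completed in a previous run; skip on resume
        state["data"] = fn(state["data"])
        state["done"].append(name)
        path.write_text(json.dumps(state))  # checkpoint after every step
    return state["data"]
```

The important design choice is that the checkpoint is written after every step, not only at the end: the cost of a crash becomes one step's worth of work instead of the whole task.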
The critical parts here aren't the queue itself -- they're checkpoints and retry policy.
Why a state machine beats "scripts chained together"
A more production-like task state should at least include:
| State | Meaning |
|---|---|
| pending | Created, not yet processed |
| running | Currently executing |
| success | Completed |
| failed | Failed this round, needs evaluation |
| needs_review | Waiting for human intervention |
| cancelled | Aborted or rolled back |
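The table above maps naturally onto an enum plus an explicit transition table, so an illegal transition fails loudly instead of silently corrupting state. The transition set below is one reasonable choice, not the only one:

```python
from enum import Enum

class TaskState(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"
    NEEDS_REVIEW = "needs_review"
    CANCELLED = "cancelled"

# Legal transitions; anything outside this table is a bug, not a log curiosity.
TRANSITIONS = {
    TaskState.PENDING: {TaskState.RUNNING, TaskState.CANCELLED},
    TaskState.RUNNING: {TaskState.SUCCESS, TaskState.FAILED, TaskState.CANCELLED},
    TaskState.FAILED: {TaskState.RUNNING, TaskState.NEEDS_REVIEW, TaskState.CANCELLED},
    TaskState.NEEDS_REVIEW: {TaskState.RUNNING, TaskState.SUCCESS, TaskState.CANCELLED},
    TaskState.SUCCESS: set(),      # terminal
    TaskState.CANCELLED: set(),    # terminal
}

def transition(current: TaskState, target: TaskState) -> TaskState:
    """Move a task to a new state, rejecting anything not in the table."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```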
With this state layer, SLAs, reruns, monitoring, and auditing all become much easier.
Retry isn't just "failed, try again"
A solid retry policy needs to define at minimum:
| Item | Why it needs to be explicit |
|---|---|
| max attempts | Prevents infinite retries |
| backoff strategy | Avoids instantly hammering dependencies |
| idempotency key | Prevents duplicate side effects |
| escalation threshold | When to go to human or DLQ |
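All four knobs from the table can live in one small policy object. This is a sketch, not a library recommendation: the class name and defaults are illustrative, the backoff uses exponential growth with full jitter, and the idempotency key is just a stable hash of task plus step so a rerun can be deduplicated downstream.

```python
import hashlib
import random

class RetryPolicy:
    """Explicit retry knobs: max attempts, backoff, escalation threshold."""

    def __init__(self, max_attempts=5, base_delay=1.0, max_delay=60.0,
                 escalation_threshold=3):
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.escalation_threshold = escalation_threshold

    def backoff(self, attempt: int) -> float:
        # Exponential backoff with full jitter, capped at max_delay,
        # so retries don't instantly hammer a struggling dependency.
        return random.uniform(0, min(self.max_delay, self.base_delay * 2 ** attempt))

    def should_retry(self, attempt: int) -> bool:
        return attempt < self.max_attempts

    def should_escalate(self, attempt: int) -> bool:
        # Past this threshold, route to a human or the DLQ instead of retrying.
        return attempt >= self.escalation_threshold

def idempotency_key(task_id: str, step: str) -> str:
    """Same task + step always yields the same key, so duplicate
    side effects can be detected and dropped by the consumer."""
    return hashlib.sha256(f"{task_id}:{step}".encode()).hexdigest()
```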
Many workflow incidents aren't caused by the first failure -- they're caused by failing the same way many times afterward.
LLM steps in workflows should be treated as unstable dependencies
This is key.
An LLM step isn't a deterministic function, so you should assume it will:
- time out
- fail schema validation
- drift in output format
- fluctuate in quality
That means every LLM step should ideally have:
- timeout
- schema validation
- fallback / retry
- low-confidence review path
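Those four guards can be wrapped around any single LLM call. In the sketch below, `call_model` and `validate` are placeholders for your own client and schema check (they are not a real library API); the wrapper enforces a timeout, retries on schema failure, and falls back or routes to review when everything fails.

```python
import concurrent.futures

def guarded_llm_step(call_model, validate, fallback=None,
                     timeout_s=30.0, max_attempts=2):
    """Guard one LLM step: timeout, schema validation, retry, fallback.

    call_model() returns raw model output; validate(raw) returns (ok, parsed).
    Both are stand-ins for your own client and schema layer.
    """
    last_error = None
    for _ in range(max_attempts):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(call_model)
        try:
            raw = future.result(timeout=timeout_s)  # hard timeout on the call
        except concurrent.futures.TimeoutError:
            last_error = "timeout"
            continue
        finally:
            pool.shutdown(wait=False)
        ok, parsed = validate(raw)
        if ok:
            return {"status": "success", "output": parsed}
        last_error = "schema"  # output didn't match the expected schema; retry
    if fallback is not None:
        return {"status": "fallback", "output": fallback()}
    # No fallback left: hand off to the low-confidence review path.
    return {"status": "needs_review", "error": last_error}
```

Note that the wrapper never raises on bad model output; it always returns an explicit status, which is what lets the surrounding state machine decide what to do next.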
Treating the LLM as a stable black box is why many workflows become unmaintainable later.
Design HITL paths early
Human-in-the-loop isn't a patch for when the system fails -- it's a formal workflow branch.
It typically fits:
| Scenario | Why humans need to step in |
|---|---|
| Low confidence output | Model itself isn't reliable |
| High-risk business content | One mistake is expensive |
| Repeatedly failed tasks | Automation is no longer worth it |
| Key customer tasks | SLA and trust demands are higher |
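Making the review branch formal can be as small as one routing predicate evaluated at the end of each step. The field names below (`confidence`, `risk_tier`, `attempts`, `customer_tier`) and the thresholds are illustrative assumptions, not a standard:

```python
def needs_human_review(task: dict) -> bool:
    """Route a task to the review branch based on the triggers above.
    Field names and thresholds are illustrative; tune them per workflow."""
    if task.get("confidence", 1.0) < 0.7:      # low-confidence output
        return True
    if task.get("risk_tier") == "high":        # high-risk business content
        return True
    if task.get("attempts", 0) >= 3:           # repeatedly failed task
        return True
    if task.get("customer_tier") == "key":     # key customer, higher SLA
        return True
    return False
```

Because this is an explicit branch rather than an afterthought, tasks that match it land in the `needs_review` state instead of being silently retried or dropped.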
If the review path is cobbled together after the fact, it's usually terrible to use.
Observability: don't just watch overall success rate
Look at these instead:
| Metric | Purpose |
|---|---|
| queue lag | Is the system backed up? |
| step-level P95 | Which step is slowing everything down? |
| retry count | Which step is most unstable? |
| DLQ size | How many tasks has automation given up on? |
| review rate | What percentage needs human intervention? |
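These metrics are cheap to compute from task records. The record shape below (`state`, `retries`, `step_ms`) and the `dead_letter` state for DLQ'd tasks are assumptions for illustration; the p95 uses a simple nearest-rank percentile, which is good enough for a dashboard:

```python
def p95(values):
    """95th percentile via nearest rank; adequate for dashboards."""
    values = sorted(values)
    return values[min(len(values) - 1, int(0.95 * len(values)))]

def workflow_metrics(tasks):
    """Aggregate step-level metrics from a list of task records.

    Assumed record shape: {"state": str, "retries": {step: int},
    "step_ms": {step: float}} -- illustrative, not a standard.
    """
    per_step_ms, retry_count = {}, {}
    for t in tasks:
        for step, ms in t.get("step_ms", {}).items():
            per_step_ms.setdefault(step, []).append(ms)
        for step, n in t.get("retries", {}).items():
            retry_count[step] = retry_count.get(step, 0) + n
    return {
        "dlq_size": sum(t["state"] == "dead_letter" for t in tasks),
        "review_rate": sum(t["state"] == "needs_review" for t in tasks) / len(tasks),
        "step_p95_ms": {s: p95(v) for s, v in per_step_ms.items()},
        "retry_count": retry_count,
    }
```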
If you only watch overall success, you'll discover local bad trends way too late.
Common failure points
| Problem | Root cause |
|---|---|
| Rerun writes duplicate results | No idempotency |
| Crash mid-way means full redo | No checkpoints |
| Hard to insert human review | No explicit review state |
| Debugging is pure guesswork | No per-task trace |
The most valuable part of AI workflow engineering is turning these "used to rely on luck" aspects into system capabilities.
Practice
Take one of your multi-step tasks and map out these 5 things:
- task states
- retry policy
- checkpoint locations
- review triggers
- failure metrics
Once these 5 are defined, your workflow automation is already a level above script chaining.
❓ FAQ
Frequently asked questions about this chapter's topic
When is a single request not enough, so a workflow system becomes mandatory?
Five scenarios clearly need one: multi-step research (search + parse + summarize + merge), batch summarization (a whole document batch doesn't fit in one request), embedding/indexing pipelines (ingest + split + embed + store, multiple steps), HITL review flows (humans still review after the model finishes), and tool orchestration (calling multiple external systems). When a task spans time, tools, and states, a workflow isn't optional.
Why is a state machine more stable than chaining scripts together?
Because a production task needs at least six explicit states: pending / running / success / failed / needs_review / cancelled. When scripts are chained together, all state hides in memory or logs, so SLAs, reruns, monitoring, and auditing are impossible; explicit state tells you exactly which step a task is on, how many times it has failed, and whether it should go to a human.
What makes a retry policy reliable?
Define at least four things: max attempts to prevent infinite retries, a backoff strategy to avoid instantly hammering downstream dependencies, an idempotency key to prevent duplicate side effects, and an escalation threshold that decides when to hand off to a human or the DLQ. The root cause of incidents is usually not the first failure but the system repeating the same mistake many times afterward.
What assumptions should you design around for an LLM step in a workflow?
Treat it as an unstable dependency by default: it will time out, fail schema validation, drift in output, and fluctuate in quality. Every LLM step needs at least four things: a timeout, schema validation, fallback/retry, and a low-confidence review path. Treating the LLM as a deterministic function is the root cause of many workflows becoming hard to maintain later.
Overall success rate isn't enough for workflow observability -- what else should you watch?
Five metrics: queue lag (is the system backed up?), step-level P95 (which step slows everything down?), retry count (which step is most unstable?), DLQ size (how many tasks has automation given up on?), and review rate (what share needs human intervention?). If you only watch overall success, local bad trends surface far too late.