24

Workflow & Automation

⏱️ 40 min

Workflow Automation

Many AI apps look great in a single-request demo. But the moment they become long tasks, batch jobs, or multi-step pipelines, problems surface: tasks get lost, retries go haywire, state is unclear, and there's no way to insert human review. The real value of an AI engineer often isn't squeezing a single prompt from 90% to 95% -- it's making the entire workflow recoverable, observable, and scalable.

So this page is about production workflow automation, not just "chaining a few APIs together."

Workflow Automation Architecture


Bottom line: build explicit state first, then automate

The worst thing for AI workflows is implicit state.

Which step a task is on, how many times it's failed, whether it should go to a human -- if all of that can only be guessed from logs, the automation will spiral out of control fast.

A more stable principle:

  1. Every step has a clear state
  2. Every step can be retried or recovered
  3. Every step can be observed

When you actually need a workflow system

| Scenario | Why a single request isn't enough |
| --- | --- |
| Multi-step research | Needs search, parse, summarize, and merge |
| Batch summarization | One request can't handle a whole document batch |
| Embedding / indexing pipeline | Multiple steps: ingest, split, embed, store |
| HITL review flow | The model finishes, but humans still need to review |
| Tool orchestration | Calls multiple external systems |

If the task spans time, tools, and states, a workflow isn't optional.


A more reliable workflow skeleton

ingest
  -> queue
  -> worker step
  -> checkpoint
  -> next step or retry
  -> human review if needed
  -> completion

The critical parts here aren't the queue itself -- they're checkpoints and retry policy.
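The skeleton above can be sketched as a small driver loop. Everything here is an illustrative stand-in: `STEPS`, the fake `run_step`, and the in-memory `checkpoints` dict would be your real pipeline stages, step functions, and a durable store in practice.

```python
# Sketch of the skeleton: steps run in order, every completed step writes a
# checkpoint, failures retry up to a cap and then escalate to human review.
from collections import deque

STEPS = ["ingest", "split", "summarize", "merge"]
checkpoints: dict = {}  # task_id -> last completed step (durable in real life)

def run_step(task_id: str, step: str) -> bool:
    return True  # placeholder: pretend every step succeeds

def drive(task_id: str, step_fn=run_step, max_attempts: int = 3) -> str:
    queue = deque(STEPS)
    # resume after the last checkpoint instead of redoing finished work
    done = checkpoints.get(task_id)
    if done in STEPS:
        while queue and queue[0] != done:
            queue.popleft()
        queue.popleft()
    while queue:
        step = queue[0]
        attempts = 0
        while not step_fn(task_id, step):
            attempts += 1
            if attempts >= max_attempts:
                return "needs_review"  # retry budget exhausted -> human branch
        checkpoints[task_id] = step    # checkpoint after every step
        queue.popleft()
    return "completed"
```

Note how a crash between steps costs only the current step: the next `drive` call skips everything before the checkpoint.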


Why a state machine beats "scripts chained together"

A more production-like task state should at least include:

| State | Meaning |
| --- | --- |
| pending | Created, not yet processed |
| running | Currently executing |
| success | Completed |
| failed | Failed this round, needs evaluation |
| needs_review | Waiting for human intervention |
| cancelled | Aborted or rolled back |

With this state layer, SLAs, reruns, monitoring, and auditing all become much easier.
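The state table can be made enforceable with an explicit transition map: moves not listed are rejected, which is exactly what chained scripts cannot guarantee. The `TRANSITIONS` edges below are one reasonable choice, not the only valid one.

```python
# State machine sketch: legal moves between the six states from the table.
# Anything not in the map raises, so illegal transitions fail loudly.
TRANSITIONS = {
    "pending": {"running", "cancelled"},
    "running": {"success", "failed", "cancelled"},
    "failed": {"running", "needs_review", "cancelled"},  # retry or escalate
    "needs_review": {"running", "cancelled"},            # human decides
    "success": set(),                                    # terminal
    "cancelled": set(),                                  # terminal
}

def transition(current: str, target: str) -> str:
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```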


Retry isn't just "failed, try again"

A solid retry policy needs to define at minimum:

| Item | Why it needs to be explicit |
| --- | --- |
| max attempts | Prevents infinite retries |
| backoff strategy | Avoids instantly hammering dependencies |
| idempotency key | Prevents duplicate side effects |
| escalation threshold | Decides when to go to a human or the DLQ |

Many workflow incidents aren't caused by the first failure -- they're caused by failing the same way many times after.
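All four items can be combined into one retry wrapper. This is a sketch under stated assumptions: `side_effect_log` and `dead_letters` stand in for a real datastore and a real dead-letter queue, and the key format is illustrative.

```python
# Retry sketch: capped attempts, exponential backoff, an idempotency key
# guarding the side effect, and escalation to a DLQ when the cap is hit.
import time

side_effect_log: set = set()  # idempotency keys already applied
dead_letters: list = []       # tasks automation has given up on

def retry(task_id: str, op, max_attempts: int = 3, base_delay: float = 0.01) -> bool:
    key = f"{task_id}:write"          # idempotency key for this side effect
    if key in side_effect_log:
        return True                   # rerun: side effect already applied, skip
    for attempt in range(max_attempts):
        try:
            op()
            side_effect_log.add(key)  # record success so reruns are no-ops
            return True
        except Exception:
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    dead_letters.append(task_id)      # escalation threshold hit -> DLQ
    return False
```

The idempotency check is what makes "rerun the whole batch" safe: the second pass sees the key and skips the write instead of duplicating it.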


LLM steps in workflows should be treated as unstable dependencies

This is key.

An LLM step isn't a deterministic function, so you should assume it will:

  • time out
  • fail schema validation
  • drift in output format
  • fluctuate in quality

That means every LLM step should ideally have:

  • timeout
  • schema validation
  • fallback / retry
  • low-confidence review path

Treating the LLM as a stable black box is why many workflows become unmaintainable later.
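The four safeguards can be wrapped around any model call. `call_model` below is a hypothetical stand-in for your actual client (which should enforce the timeout itself), and the `summary`/`confidence` schema and 0.7 threshold are assumptions for the sketch.

```python
# Guarded LLM step sketch: validate the schema, retry on bad output, route
# low-confidence results to review, and fall back when attempts run out.
import json

def guarded_llm_step(call_model, prompt: str, retries: int = 2,
                     min_confidence: float = 0.7) -> dict:
    for _ in range(retries + 1):
        try:
            raw = call_model(prompt)  # assumed to raise on its own timeout
            out = json.loads(raw)     # JSONDecodeError is a ValueError
            # schema validation: require the fields the next step depends on
            if not isinstance(out, dict) or not {"summary", "confidence"} <= out.keys():
                raise ValueError("schema fail")
            if out["confidence"] < min_confidence:
                return {"route": "needs_review", "output": out}
            return {"route": "next_step", "output": out}
        except ValueError:
            continue                  # malformed output: retry this step
    return {"route": "fallback"}      # all attempts exhausted
```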


Design HITL paths early

Human-in-the-loop isn't a patch for when the system fails -- it's a formal workflow branch.

It typically fits:

| Scenario | Why humans need to step in |
| --- | --- |
| Low-confidence output | The model itself isn't reliable |
| High-risk business content | One mistake is expensive |
| Repeatedly failed tasks | Automation is no longer worth it |
| Key customer tasks | SLA and trust demands are higher |

If the review path is cobbled together after the fact, it's usually terrible to use.
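Designing the branch early can be as simple as one routing function that checks the four triggers from the table. The field names and thresholds below are illustrative, not a fixed schema.

```python
# HITL routing sketch: the four review triggers mapped to one explicit branch,
# so "send to a human" is a formal workflow edge, not an afterthought.
def route(task: dict) -> str:
    if task.get("confidence", 1.0) < 0.7:  # low-confidence output
        return "needs_review"
    if task.get("high_risk"):              # expensive-to-get-wrong content
        return "needs_review"
    if task.get("attempts", 0) >= 3:       # automation stopped paying off
        return "needs_review"
    if task.get("key_customer"):           # tighter SLA and trust demands
        return "needs_review"
    return "auto_complete"
```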


Observability: don't just watch overall success rate

Look at these instead:

| Metric | Purpose |
| --- | --- |
| queue lag | Is the system backed up? |
| step-level P95 | Which step is slowing everything down? |
| retry count | Which step is most unstable? |
| DLQ size | How many tasks has automation given up on? |
| review rate | What percentage needs human intervention? |

If you only watch overall success, you'll discover local bad trends way too late.
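The five metrics can be computed from raw task events rather than a single success counter. The event shape (`latency_ms`, `retries`, `reviewed`) is an assumption for the sketch, and P95 uses a simple nearest-rank estimate.

```python
# Observability sketch: derive the table's metrics from per-task events,
# so a bad trend in one step is visible before overall success moves.
import math

def metrics(events: list, queue_depth: int, dlq_depth: int) -> dict:
    total = len(events)
    lat = sorted(e["latency_ms"] for e in events)
    # nearest-rank P95: the value below which ~95% of latencies fall
    p95 = lat[max(0, math.ceil(0.95 * len(lat)) - 1)]
    return {
        "queue_lag": queue_depth,                         # backlog depth
        "p95_latency_ms": p95,                            # slow-step signal
        "retry_count": sum(e["retries"] for e in events), # instability signal
        "dlq_size": dlq_depth,                            # given-up tasks
        "review_rate": sum(1 for e in events if e["reviewed"]) / total,
    }
```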


Common failure points

| Problem | Root cause |
| --- | --- |
| Rerun writes duplicate results | No idempotency |
| Crash mid-way means a full redo | No checkpoints |
| Hard to insert human review | No explicit review state |
| Debugging is pure guesswork | No per-task trace |

The most valuable part of AI workflow engineering is turning these "used to rely on luck" aspects into system capabilities.
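The last root cause, no per-task trace, is the cheapest to fix: record every event keyed by task id, so debugging becomes a lookup rather than guesswork. The in-memory dict below is an illustrative stand-in for a real trace store.

```python
# Per-task trace sketch: every step emits an event under its task_id,
# so "what happened to task X" is one query instead of a log grep.
import time

traces: dict = {}  # task_id -> ordered list of events

def record(task_id: str, step: str, event: str, **detail) -> None:
    traces.setdefault(task_id, []).append(
        {"ts": time.time(), "step": step, "event": event, **detail}
    )

def history(task_id: str) -> list:
    # compact "step:event" timeline for quick inspection
    return [f'{e["step"]}:{e["event"]}' for e in traces.get(task_id, [])]
```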


Practice

Take one of your multi-step tasks and map out these 5 things:

  1. task states
  2. retry policy
  3. checkpoint locations
  4. review triggers
  5. failure metrics

Once these 5 are defined, your workflow automation is already a level above script chaining.

📚 Related Resources

❓ FAQ

Answers to the most frequently searched questions about this chapter's topic.

When is a single request not enough, so that you need a workflow system?

Five scenarios clearly need one: multi-step research (search + parse + summarize + merge), batch summarization (one request can't process a whole batch of documents), embedding/indexing pipelines (multiple steps: ingest + split + embed + store), HITL review flows (humans still review after the model finishes), and tool orchestration (calling multiple external systems). Once a task spans time, tools, and states, a workflow is no longer optional.

Why is a state machine more reliable than chaining scripts together?

Because a production task needs at least six explicit states: pending / running / success / failed / needs_review / cancelled. When scripts are chained, all state hides in memory or logs, making SLAs, reruns, monitoring, and auditing impossible; explicit state lets you know exactly which step a task is on, how many times it has failed, and whether it should go to a human.

What makes a retry policy solid?

Define at least four items: max attempts to prevent infinite retries, a backoff strategy to avoid instantly hammering downstream dependencies, an idempotency key to prevent duplicate side effects, and an escalation threshold to decide when to hand off to a human or the DLQ. The root cause of incidents is often not the first failure but the system repeating the same mistake many times afterward.

What assumptions should an LLM step in a workflow be designed around?

Assume by default that it is an unstable dependency: it will time out, fail schema validation, drift in output, and fluctuate in quality. Every LLM step should have at least four safeguards: a timeout, schema validation, a fallback/retry, and a low-confidence review path. Treating the LLM as a deterministic function is the root cause of many workflows becoming hard to maintain later.

Overall success rate isn't enough for workflow observability -- what else should you watch?

Five metrics: queue lag (is the system backed up?), step-level P95 (which step slows everything down?), retry count (which step is least stable?), DLQ size (how many tasks has automation given up on?), and review rate (what share needs human intervention?). If you only watch overall success, local bad trends surface far too late.