24

Workflows and Automation

⏱️ 40 minutes

Workflow automation handles long-running jobs, retries, and persistent state so that LLM apps stay reliable when work spans more than a single request.

1) When You Need Workflows

Long tasks: multi-step research, batch summarization, doc pipelines.
Tool calls: scrape → parse → embed → answer.
Human-in-the-loop: review or labeling steps.

2) Building Blocks

Queue + workers: decouple ingestion and processing; control concurrency.
Scheduler: cron-like triggers for refresh/reindex.
State machine: explicit step states (pending/running/success/failed/needs_review).
Idempotency: key every task by a stable task ID so that a retried or redelivered task does not repeat side effects.
Dead letters: capture poison messages; alert for triage.
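
The building blocks above can be combined in a few dozen lines. The following is a minimal in-process sketch, not a production design: it assumes Python's standard `queue.Queue` stands in for a real message broker, and the names `run_worker`, `handler`, `MAX_ATTEMPTS`, and the `(task_id, payload, attempt)` message shape are hypothetical choices made for illustration.

```python
import queue
from enum import Enum


class State(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"
    NEEDS_REVIEW = "needs_review"  # listed for completeness; unused in this sketch


MAX_ATTEMPTS = 3  # assumed cap; tune per workload


def run_worker(tasks, handler, states, dead_letters):
    """Drain the queue, skipping finished task IDs and dead-lettering poison messages."""
    while True:
        try:
            task_id, payload, attempt = tasks.get_nowait()
        except queue.Empty:
            return
        if states.get(task_id) == State.SUCCESS:
            continue  # idempotency: a duplicate delivery is a no-op
        states[task_id] = State.RUNNING
        try:
            handler(payload)
            states[task_id] = State.SUCCESS
        except Exception as exc:
            if attempt + 1 < MAX_ATTEMPTS:
                states[task_id] = State.PENDING
                tasks.put((task_id, payload, attempt + 1))  # requeue for another try
            else:
                states[task_id] = State.FAILED
                dead_letters.append((task_id, payload, repr(exc)))  # keep for triage


def handler(payload):
    # Hypothetical processing step: the payload "bad" always fails.
    if payload == "bad":
        raise ValueError("cannot parse")


tasks = queue.Queue()
for item in [("t1", "ok", 0), ("t1", "ok", 0), ("t2", "bad", 0)]:
    tasks.put(item)
states, dead_letters = {}, []
run_worker(tasks, handler, states, dead_letters)
print(states["t1"].value, states["t2"].value, len(dead_letters))  # success failed 1
```

The duplicate `t1` message is skipped because its ID is already marked as succeeded, and `t2` lands in the dead-letter list only after the attempt cap is reached. In a real deployment the state table and dead-letter store would live in durable storage rather than in memory.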

3) Patterns for LLM Workflows

Fan-out/fan-in: split a corpus → parallel summarize → merge.
Map-reduce summarization: chunk → summarize → synthesize.
Tool+LLM loop: detect needed tool, call tool, feed result back to LLM.
Checkpointing: persist intermediate context for resume after failure.
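
As a rough illustration of fan-out/fan-in with checkpointing, the sketch below summarizes chunks in parallel and persists each partial summary to a JSON file so a rerun can resume. The `summarize` and `synthesize` functions are hypothetical stand-ins for LLM calls, and keying the checkpoint by chunk index assumes the chunking is stable between runs.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

CALLS = []  # records which chunks were actually summarized


def summarize(text):
    """Stand-in for an LLM call (hypothetical): keeps the first three words."""
    CALLS.append(text)
    return " ".join(text.split()[:3])


def synthesize(parts):
    """Stand-in for the final LLM synthesis step (hypothetical): joins the parts."""
    return " | ".join(parts)


def load_checkpoint(path):
    return json.loads(path.read_text()) if path.exists() else {}


def map_reduce(chunks, checkpoint_path):
    done = load_checkpoint(checkpoint_path)  # chunk index (as str) -> summary
    todo = [i for i in range(len(chunks)) if str(i) not in done]
    # Fan-out: summarize only the chunks that have no checkpointed result.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = pool.map(lambda i: summarize(chunks[i]), todo)
        for i, summary in zip(todo, results):
            done[str(i)] = summary
            checkpoint_path.write_text(json.dumps(done))  # persist after each result
    # Fan-in: merge the partial summaries in their original order.
    return synthesize([done[str(i)] for i in range(len(chunks))])


chunks = ["alpha beta gamma delta", "one two three four", "red green blue cyan"]
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "summaries.json"
    first = map_reduce(chunks, ckpt)
    second = map_reduce(chunks, ckpt)  # resumes from the checkpoint
print(first)       # alpha beta gamma | one two three | red green blue
print(len(CALLS))  # 3: the second run re-summarized nothing
```

Writing the checkpoint after every result trades some I/O for the ability to resume at chunk granularity; batching the writes would be a reasonable alternative for large corpora.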

4) Reliability & Timeouts

Step timeouts: set one per tool/LLM call, and enforce a global deadline for the whole task to meet the SLA.
Retries with backoff; cap attempts; mark for human review after N failures.
Circuit breakers: pause a tool or model that keeps failing intermittently; route to a fallback.
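
One possible way to combine these reliability measures is sketched below. It is only an illustration under stated assumptions: `CircuitBreaker`, `call_with_retries`, and all default values are hypothetical, per-call timeouts are assumed to be enforced by the client being called (surfacing as exceptions), and `deadline_s` plays the role of the overall time budget.

```python
import random
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows a trial call after `cooldown` seconds."""

    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.cooldown

    def record(self, ok):
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()


def call_with_retries(primary, fallback, breaker,
                      max_attempts=3, base_delay=0.01, deadline_s=5.0):
    """Return (status, result); status is 'success', 'fallback', or 'needs_review'."""
    deadline = time.monotonic() + deadline_s  # overall budget for this step
    for attempt in range(max_attempts):
        if not breaker.allow():
            return "fallback", fallback()  # breaker is open: skip the primary
        try:
            result = primary()
            breaker.record(True)
            return "success", result
        except Exception:
            breaker.record(False)
        if attempt == max_attempts - 1:
            break  # attempts exhausted; no point sleeping
        delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.0)  # jittered backoff
        if time.monotonic() + delay > deadline:
            break  # retrying would exceed the overall budget
        time.sleep(delay)
    return "needs_review", None  # hand off to a human after repeated failures


attempts = []


def flaky():
    # Hypothetical model call that times out twice, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("model timed out")
    return "answer"


print(call_with_retries(flaky, lambda: "cached answer", CircuitBreaker(threshold=5)))
# ('success', 'answer')
```

With a lower threshold, the breaker opens mid-retry and the call is served by the fallback instead; with a high threshold and a permanently failing dependency, the task is marked for human review once the attempt cap is hit.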

5) Observability

Trace per task: task_id, steps, attempts, durations.
Metrics: success rate, P95 per step, retries, DLQ size, queue lag.
Alerts: queue lag high, DLQ growth, success rate drop.
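
To make the trace fields and metrics above concrete, here is a small sketch that records one trace entry per step attempt and derives a success rate and a P95 from the recorded data. It keeps everything in memory; `traced_step`, `p95`, and `success_rate` are hypothetical helper names, and a real system would more likely export these to a tracing or metrics backend.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager

step_durations = defaultdict(list)  # step name -> durations in seconds
trace = []  # one record per step attempt


@contextmanager
def traced_step(task_id, step, attempt=1):
    """Time a step and record its outcome, even when it raises."""
    start = time.perf_counter()
    status = "success"
    try:
        yield
    except Exception:
        status = "failed"
        raise
    finally:
        duration = time.perf_counter() - start
        step_durations[step].append(duration)
        trace.append({"task_id": task_id, "step": step, "attempt": attempt,
                      "status": status, "duration_s": duration})


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


def success_rate(records):
    return sum(r["status"] == "success" for r in records) / len(records)


with traced_step("task-1", "parse"):
    pass  # the real step would run here
try:
    with traced_step("task-1", "embed"):
        raise RuntimeError("embedding service unavailable")
except RuntimeError:
    pass

print(success_rate(trace))         # 0.5
print(p95(list(range(1, 101))))    # 95
```

Alert rules would then compare values like these against thresholds, for example firing when the success rate over a window drops below an agreed floor.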

6) Human-in-the-Loop (HITL)

Surfaces: review UI for flagged tasks (low confidence, schema fail).
Capture feedback: corrections feed back to models/prompts/evals.
Prioritize: SLA tiers; important tasks bypass queues or get higher worker count.
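
A review path like the one described above might be wired up as follows. This is a sketch under assumptions: the confidence threshold, the required JSON keys, and the numeric tier scheme (lower number means more urgent) are all invented for illustration, and `route` is a hypothetical function name.

```python
import heapq
import itertools
import json

CONFIDENCE_THRESHOLD = 0.7            # assumed cut-off for "low confidence"
REQUIRED_KEYS = {"title", "summary"}  # assumed output schema

review_queue = []  # heap of (tier, sequence, task_id, reason); lower tier = more urgent
_sequence = itertools.count()  # tie-breaker that keeps FIFO order within a tier


def route(task_id, raw_output, confidence, tier=2):
    """Return 'accepted' or 'needs_review'; flagged tasks go onto the review queue."""
    reason = None
    try:
        parsed = json.loads(raw_output)
        if not isinstance(parsed, dict) or not REQUIRED_KEYS.issubset(parsed):
            reason = "schema fail"
    except json.JSONDecodeError:
        reason = "schema fail"
    if reason is None and confidence < CONFIDENCE_THRESHOLD:
        reason = "low confidence"
    if reason is None:
        return "accepted"
    heapq.heappush(review_queue, (tier, next(_sequence), task_id, reason))
    return "needs_review"


print(route("a", '{"title": "T", "summary": "S"}', confidence=0.9))  # accepted
print(route("b", '{"title": "T", "summary": "S"}', confidence=0.4))  # needs_review
print(route("c", "not json", confidence=0.9, tier=1))                # needs_review
print(heapq.heappop(review_queue)[2:])  # ('c', 'schema fail')
```

Task `c` is reviewed first even though it was flagged later, because its tier is more urgent. Reviewer corrections are not modeled here; they would be stored alongside the task ID so they can feed prompts and evals.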

7) Data & Storage

Store intermediate artifacts (parsed text, embeddings, tool outputs) with TTL if possible.
Version data: record source doc version and processing code version.
Clean-up jobs: expire temp data and logs per policy.
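
The storage rules above could look roughly like this in code. The sketch is an in-memory stand-in for a real database or object store; `ArtifactStore`, its method names, and the key and version strings are hypothetical, and the injectable `clock` exists only so expiry can be demonstrated without waiting.

```python
import time
from dataclasses import dataclass


@dataclass
class Artifact:
    value: object
    source_version: str  # version of the source document
    code_version: str    # version of the processing code
    expires_at: float


class ArtifactStore:
    def __init__(self, clock=time.time):
        self._items = {}
        self._clock = clock

    def put(self, key, value, source_version, code_version, ttl_s):
        expires_at = self._clock() + ttl_s
        self._items[key] = Artifact(value, source_version, code_version, expires_at)

    def get(self, key, source_version, code_version):
        """Return the value only if it is unexpired and was produced from the same versions."""
        item = self._items.get(key)
        if item is None or item.expires_at <= self._clock():
            return None
        if (item.source_version, item.code_version) != (source_version, code_version):
            return None
        return item.value

    def cleanup(self):
        """Delete expired artifacts; return how many were removed."""
        now = self._clock()
        expired = [k for k, v in self._items.items() if v.expires_at <= now]
        for k in expired:
            del self._items[k]
        return len(expired)


now = [1000.0]  # fake clock so the example is deterministic
store = ArtifactStore(clock=lambda: now[0])
store.put("doc-1/parsed", "parsed text", source_version="v3", code_version="1.2.0", ttl_s=60)
print(store.get("doc-1/parsed", "v3", "1.2.0"))  # parsed text
print(store.get("doc-1/parsed", "v4", "1.2.0"))  # None: source document changed
now[0] += 61
print(store.cleanup())  # 1
```

Treating a version mismatch as a cache miss forces reprocessing whenever the source document or the pipeline code changes, which is the main reason to record both versions.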

8) Minimal Checklist

Idempotent steps + request IDs.
Per-step timeouts + retries + DLQ.
Metrics/alerts for lag, retries, failures.
HITL path for low-confidence or repeated failures.

📚 Related Resources

LangChain RAG Tutorial