24

Workflows and Automation

⏱️ 40 minutes

Workflow automation covers long-running jobs, retries, and state management, so that LLM apps stay reliable when work spans more than a single request.

1) When You Need Workflows

  • Long tasks: multi-step research, batch summarization, doc pipelines.
  • Tool calls: scrape → parse → embed → answer.
  • Human-in-the-loop: review or labeling steps.

2) Building Blocks

  • Queue + workers: decouple ingestion and processing; control concurrency.
  • Scheduler: cron-like triggers for refresh/reindex.
  • State machine: explicit step states (pending/running/success/failed/needs_review).
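  • Idempotency: key each task by a stable task ID so that a retry or duplicate delivery does not repeat side effects.
  • Dead-letter queue (DLQ): capture poison messages that keep failing; alert for triage.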
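
The building blocks above can be sketched together in a few lines. This is a minimal in-memory illustration, not a production design: `run_worker`, the `(task_id, payload, attempts)` tuple layout, and `max_attempts` are hypothetical names chosen for this example, and a real system would use a durable queue and a database for state.

```python
from collections import deque
from enum import Enum


class State(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


def run_worker(queue, handler, states, results, dead_letters, max_attempts=3):
    """Drain the queue, retrying failures and dead-lettering poison tasks.

    queue holds (task_id, payload, attempts) tuples (an assumed layout).
    """
    while queue:
        task_id, payload, attempts = queue.popleft()
        if states.get(task_id) is State.SUCCESS:
            continue  # idempotency: a duplicate delivery is a no-op
        states[task_id] = State.RUNNING
        try:
            results[task_id] = handler(payload)
            states[task_id] = State.SUCCESS
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                # Poison message: stop retrying and park it for triage.
                states[task_id] = State.FAILED
                dead_letters.append((task_id, payload, repr(exc)))
            else:
                states[task_id] = State.PENDING
                queue.append((task_id, payload, attempts))
```

Because completion is recorded per task ID, enqueuing the same task twice runs the handler once, and a payload that always raises ends up in `dead_letters` after `max_attempts` tries instead of blocking the queue.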

3) Patterns for LLM Workflows

  • Fan-out/fan-in: split a corpus → parallel summarize → merge.
  • Map-reduce summarization: chunk → summarize → synthesize.
  • Tool+LLM loop: detect needed tool, call tool, feed result back to LLM.
  • Checkpointing: persist intermediate context for resume after failure.
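
As a rough sketch of how fan-out, map-reduce, and checkpointing combine, the function below summarizes chunks in parallel and persists each finished summary to a JSON file, so a rerun after a failure only processes the missing chunks. The names `map_reduce`, `summarize_fn`, `synthesize_fn`, and the checkpoint file layout are assumptions for illustration; the two callables stand in for whatever LLM calls the application makes.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def map_reduce(chunks, checkpoint_path, summarize_fn, synthesize_fn, max_workers=4):
    """Summarize chunks in parallel, checkpointing each result to disk."""
    path = Path(checkpoint_path)
    # Resume: load summaries finished by a previous run (JSON keys are strings).
    done = json.loads(path.read_text()) if path.exists() else {}
    todo = [i for i in range(len(chunks)) if str(i) not in done]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        summaries = pool.map(lambda i: summarize_fn(chunks[i]), todo)
        # pool.map yields results in input order; an exception surfaces here,
        # after the earlier results have already been written.
        for i, summary in zip(todo, summaries):
            done[str(i)] = summary
            path.write_text(json.dumps(done))

    # Fan-in: merge the per-chunk summaries in their original order.
    return synthesize_fn([done[str(i)] for i in range(len(chunks))])
```

Threads suit this case because LLM and tool calls are typically I/O-bound. Rewriting the whole checkpoint file after every chunk is simple but would not scale to large corpora, where a per-chunk record in a database is the more likely choice.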

4) Reliability & Timeouts

  • Step timeouts: set one per tool/LLM call, and enforce a global deadline so the whole task stays within its SLA.
  • Retries with backoff; cap attempts; mark for human review after N failures.
  • Circuit breakers: pause a tool or model that flaps (fails intermittently); route to a fallback.
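
One possible shape for capped retries with backoff and a circuit breaker is sketched below. Everything here is illustrative: `call_with_retries`, `NeedsReview`, `CircuitBreaker`, and the default thresholds are hypothetical, and the `sleep` and `clock` parameters exist only so the behavior can be exercised without real waiting.

```python
import random
import time


class NeedsReview(Exception):
    """Raised when a step exhausts its attempts and should go to a human."""


def call_with_retries(fn, max_attempts=3, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call fn, retrying with capped exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts:
                raise NeedsReview(f"gave up after {attempt} attempts") from exc
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads out retries


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: after the cooldown, let one trial call through.
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, ok):
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
```

A caller would check `breaker.allow()` before each tool or model call, route to the fallback when it returns `False`, and report the outcome with `breaker.record(...)`. Catching `NeedsReview` is the natural point to move the task into a human review state.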

5) Observability

  • Trace per task: task_id, steps, attempts, durations.
  • Metrics: success rate, P95 per step, retries, DLQ size, queue lag.
  • Alerts: queue lag high, DLQ growth, success rate drop.
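
To make the metrics concrete, here is a small sketch that derives per-step P95, retry count, and success rate from trace records. The record fields (`task_id`, `step`, `attempt`, `duration_s`, `ok`) are an assumed schema, and `p95` uses the nearest-rank definition, which is one of several common percentile conventions.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("p95 of empty list")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def summarize_records(records):
    """Aggregate trace records into per-step P95, retries, and success rate."""
    durations = defaultdict(list)
    retries = 0
    final_ok = {}
    for r in records:
        durations[r["step"]].append(r["duration_s"])
        if r["attempt"] > 1:
            retries += 1
        # Records are assumed to arrive in order, so the last attempt wins.
        final_ok[(r["task_id"], r["step"])] = r["ok"]
    return {
        "p95_s": {step: p95(v) for step, v in durations.items()},
        "retries": retries,
        "success_rate": sum(final_ok.values()) / len(final_ok),
    }
```

In practice these numbers usually come from a metrics backend rather than hand-rolled aggregation; the sketch only shows what each metric means in terms of the trace fields.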

6) Human-in-the-Loop (HITL)

  • Surfaces: review UI for flagged tasks (low confidence, schema validation failure).
  • Capture feedback: corrections feed back to models/prompts/evals.
  • Prioritize: SLA tiers; important tasks bypass queues or get a higher worker count.
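
A simple way to picture the flagging and prioritization rules is a review predicate plus a tiered priority queue, sketched below. The names `needs_review` and `PriorityTaskQueue`, the 0.8 confidence threshold, and the convention that a lower tier number means higher priority are all assumptions made for this example.

```python
import heapq
import itertools


def needs_review(output, required_keys, confidence, threshold=0.8):
    """Flag a task when required fields are missing or confidence is low."""
    missing = [k for k in required_keys if k not in output]
    return bool(missing) or confidence < threshold


class PriorityTaskQueue:
    """Tasks in a lower-numbered tier are served first; FIFO within a tier."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserves arrival order

    def put(self, task_id, tier):
        heapq.heappush(self._heap, (tier, next(self._counter), task_id))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

With this layout, a tier-0 task enqueued after several tier-2 tasks is still dequeued first, which approximates "bypassing the queue" without needing a separate worker pool.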

7) Data & Storage

  • Store intermediate artifacts (parsed text, embeddings, tool outputs) with a TTL where possible.
  • Version data: record source doc version and processing code version.
  • Clean-up jobs: expire temp data and logs per policy.
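
The storage rules above might look like the following in-memory sketch, where each artifact carries the source document version, the processing code version, and an expiry time. `ArtifactStore`, its method names, and the version strings are hypothetical; a real deployment would more likely rely on the TTL features of its database or object store.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Artifact:
    value: Any
    doc_version: str
    code_version: str
    expires_at: float


class ArtifactStore:
    """In-memory store for intermediate artifacts with TTL and version checks."""

    def __init__(self, clock: Callable[[], float] = time.time):
        self._items: Dict[str, Artifact] = {}
        self._clock = clock

    def put(self, key, value, doc_version, code_version, ttl_s):
        expires_at = self._clock() + ttl_s
        self._items[key] = Artifact(value, doc_version, code_version, expires_at)

    def get(self, key, doc_version, code_version) -> Optional[Any]:
        item = self._items.get(key)
        if item is None or item.expires_at <= self._clock():
            return None
        if (item.doc_version, item.code_version) != (doc_version, code_version):
            return None  # stale: produced from another doc or code version
        return item.value

    def cleanup(self) -> int:
        """Delete expired artifacts; return how many were removed."""
        now = self._clock()
        expired = [k for k, a in self._items.items() if a.expires_at <= now]
        for k in expired:
            del self._items[k]
        return len(expired)
```

Treating a version mismatch as a cache miss means a changed document or a new pipeline release triggers reprocessing rather than silently reusing old output, and `cleanup` is the kind of function a scheduled clean-up job would call.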

8) Minimal Checklist

  • Idempotent steps + request IDs.
  • Per-step timeouts + retries + DLQ.
  • Metrics/alerts for lag, retries, failures.
  • HITL path for low-confidence or repeated failures.

📚 Related Resources