10. Function Calling & Tool Use

⏱️ 35 min

Function calling turns LLMs into orchestration engines. This chapter outlines patterns for stable tool use.

1) When to Use

  • Structured actions: DB queries, API calls, code exec, search.
  • Constrained outputs: prefer tool calls over free-text for reliability.
  • Auditable actions: log tool name/args/results.

2) Tool Schema Design

  • Clear names/descriptions; types for every param; enums for constrained values.
  • Required vs optional fields; defaults kept server-side.
  • Validate inputs server-side; reject/repair before execution.
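As a concrete sketch, here is a hypothetical tool declared in the JSON-Schema style most function-calling APIs accept, plus a minimal server-side validator. The tool name, fields, and error messages are all illustrative, not a standard API:

```python
# Hypothetical tool schema in the JSON-Schema style most function-calling
# APIs accept; the tool name and fields are illustrative only.
SEARCH_ORDERS = {
    "name": "search_orders",
    "description": "Search customer orders. Use when the user asks about "
                   "order status or history. Defaults to the last 30 days.",
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string",
                            "description": "Internal customer ID"},
            "status": {"type": "string",
                       "enum": ["pending", "shipped", "delivered"],
                       "description": "Filter by order status"},
            "limit": {"type": "integer",
                      "description": "Max results (server default: 20)"},
        },
        "required": ["customer_id"],
    },
}

def validate_args(schema: dict, args: dict) -> list[str]:
    """Server-side check: flag missing required fields, unknown params,
    wrong types, and out-of-enum values before the tool executes."""
    props = schema["parameters"]["properties"]
    required = schema["parameters"].get("required", [])
    errors = [f"missing required param: {p}" for p in required if p not in args]
    type_map = {"string": str, "integer": int,
                "number": (int, float), "boolean": bool}
    for key, value in args.items():
        if key not in props:
            errors.append(f"unknown param: {key}")
            continue
        spec = props[key]
        if not isinstance(value, type_map[spec["type"]]):
            errors.append(f"{key}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{key}: must be one of {spec['enum']}")
    return errors
```

An empty error list means the call may proceed; a non-empty list can either abort the call or be fed back to the model for repair.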

3) Prompting for Tools

  • System: "Prefer calling tools when helpful; don't guess params; ask for missing info."
  • Few-shot: include examples of good tool calls and refusals.
  • Disallow hallucination: remind model to refuse if no suitable tool exists.
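The guidance above can be packaged directly into the conversation. A sketch in the common chat-message format, with illustrative wording and a hypothetical `search_orders` tool:

```python
# Sketch of a tool-use system prompt plus few-shot examples.
# Wording, tool names, and message shape are illustrative assumptions.
MESSAGES = [
    {"role": "system", "content": (
        "Prefer calling tools when they can answer the request. "
        "Never guess parameter values; ask the user for missing info. "
        "If no available tool fits, say so instead of inventing one.")},
    # Few-shot: a good tool call with only the params the user supplied.
    {"role": "user",
     "content": "What are the recent orders for customer c_88?"},
    {"role": "assistant",
     "tool_call": {"name": "search_orders", "args": {"customer_id": "c_88"}}},
    # Few-shot: a refusal when no suitable tool exists.
    {"role": "user", "content": "Book me a flight to Oslo."},
    {"role": "assistant", "content":
        "I don't have a flight-booking tool, so I can't do that directly."},
]
```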

4) Execution Loop

while not done:
  ask model → get tool call(s)
  validate/repair args
  run tool in sandbox with timeout
  append tool result back to model
  stop if final answer / max steps / time budget
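The loop above can be sketched in runnable form, assuming a `model` callable that returns either a tool call or a final answer, and a dict of tool functions (all names hypothetical). Note that a thread-based timeout cannot actually kill a runaway tool, which is why real sandboxes tend to use separate processes:

```python
import concurrent.futures

def run_tool(fn, args, timeout_s=10.0):
    """Run a tool in a worker thread with a hard timeout; return a result
    dict either way so the transcript stays well-formed for the model."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, **args)
    try:
        return {"ok": True, "result": future.result(timeout=timeout_s)}
    except concurrent.futures.TimeoutError:
        return {"ok": False, "error": f"tool timed out after {timeout_s}s"}
    except Exception as exc:  # surface tool failures; don't crash the loop
        return {"ok": False, "error": str(exc)}
    finally:
        pool.shutdown(wait=False, cancel_futures=True)

def agent_loop(model, tools, messages, max_steps=8):
    """Ask the model, execute any tool call, feed the result back,
    and stop on a final answer or the step budget."""
    for _ in range(max_steps):
        reply = model(messages)
        if reply.get("tool_call") is None:
            return reply["content"]            # final answer
        call = reply["tool_call"]
        outcome = run_tool(tools[call["name"]], call["args"])
        messages.append({"role": "tool", "name": call["name"],
                         "content": outcome})
    return "Stopped: step budget exhausted."
```

Argument validation/repair (section 2) would slot in between `reply` and `run_tool`; it is omitted here to keep the loop shape visible.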

5) Safety & Limits

  • Timeouts per tool; circuit-break noisy tools.
  • Allowlist domains/APIs; no raw shell without sandbox.
  • PII stripping before tool calls; redact secrets from logs.
  • Idempotency for mutating tools; confirmation steps for risky actions.
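Two of these limits sketched in code: a per-tool circuit breaker and a deterministic idempotency key for mutating calls. Thresholds, cooldowns, and the key scheme are illustrative assumptions, not a standard:

```python
import hashlib
import json
import time

class CircuitBreaker:
    """Stop calling a tool after repeated failures; retry after a cooldown."""
    def __init__(self, max_failures=3, cooldown_s=60.0):
        self.max_failures, self.cooldown_s = max_failures, cooldown_s
        self.failures, self.opened_at = 0, None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.failures, self.opened_at = 0, None  # half-open: try again
            return True
        return False

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

def idempotency_key(tool: str, args: dict) -> str:
    """Deterministic key so a retried mutating call can be deduplicated
    server-side instead of executing twice."""
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]
```

The executor checks `allow()` before each call, passes the idempotency key to the mutating backend, and calls `record()` with the outcome.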

6) Error Handling

  • Distinguish user errors (bad params) vs system errors (tool down).
  • Provide concise tool error back to model; let model replan or ask user.
  • Retry with backoff for transient failures; limit total attempts.
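A retry helper along these lines, with exponential backoff and jitter. The exception classes are hypothetical stand-ins for however your tool layer signals user vs system errors:

```python
import random
import time

class ToolDown(Exception):   # transient system error: worth retrying
    pass

class BadParams(Exception):  # user/model error: retrying won't help
    pass

def call_with_retries(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry transient failures with exponential backoff plus jitter;
    re-raise non-transient errors immediately so the model can replan."""
    for attempt in range(attempts):
        try:
            return fn()
        except BadParams:
            raise                                  # let the model fix its args
        except ToolDown:
            if attempt == attempts - 1:
                raise                              # attempts exhausted
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Injecting `sleep` keeps the helper testable; production code just uses the default.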

7) Multimodal Tools

  • Tools that accept files/URLs: validate size/type; pre-process (OCR/transcript).
  • Return handles/IDs instead of raw blobs; store artifacts with TTL.
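The handle-instead-of-blob pattern can be sketched with an in-memory artifact store; size limits, allowed types, and the handle format are illustrative assumptions:

```python
import time
import uuid

MAX_BYTES = 10 * 1024 * 1024  # illustrative cap
ALLOWED_TYPES = {"application/pdf", "image/png", "audio/wav"}
_store: dict[str, tuple[bytes, float]] = {}

def ingest(blob: bytes, mime: str, ttl_s: float = 3600.0) -> str:
    """Validate size/type, store the artifact with a TTL, and return an
    opaque handle the model can pass between tools instead of raw bytes."""
    if len(blob) > MAX_BYTES:
        raise ValueError("file too large")
    if mime not in ALLOWED_TYPES:
        raise ValueError(f"unsupported type: {mime}")
    handle = f"artifact_{uuid.uuid4().hex[:8]}"
    _store[handle] = (blob, time.monotonic() + ttl_s)
    return handle

def fetch(handle: str) -> bytes:
    """Resolve a handle back to bytes; expired artifacts are evicted."""
    blob, expires = _store[handle]
    if time.monotonic() > expires:
        del _store[handle]
        raise KeyError(f"{handle} expired")
    return blob
```

Pre-processing steps such as OCR or transcription would run inside `ingest`, storing the derived text alongside the original blob.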

8) Testing & Evals

  • Contract tests: schema compliance, required params present.
  • Golden cases: correct tool selection, refusal when no tool fits.
  • Load/chaos: inject tool errors and ensure graceful degradation.
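Contract tests can be plain assertions over the tool registry. This sketch assumes each tool is declared in the common JSON-Schema style; the example tool and checks are illustrative:

```python
# A hypothetical registry entry in the JSON-Schema style.
TOOLS = [
    {
        "name": "get_weather",
        "description": "Current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
]

def check_tool_contract(tool: dict) -> list[str]:
    """Schema-compliance checks: every tool and param has a description,
    every param has a type, and required params are actually declared."""
    problems = []
    if not tool.get("description"):
        problems.append(f"{tool['name']}: missing description")
    params = tool.get("parameters", {})
    props = params.get("properties", {})
    for req in params.get("required", []):
        if req not in props:
            problems.append(
                f"{tool['name']}: required param '{req}' not declared")
    for pname, spec in props.items():
        if "type" not in spec:
            problems.append(f"{tool['name']}.{pname}: missing type")
        if not spec.get("description"):
            problems.append(f"{tool['name']}.{pname}: missing description")
    return problems
```

Running this over the whole registry in CI catches schema drift before the model ever sees a broken tool.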

9) Minimal Checklist

  • Strong schemas + validation + allowlists.
  • Sandbox + timeouts + retries + circuit breakers.
  • Logs: tool, args (scrubbed), duration, success/fail, tokens.

❓ FAQ

The questions most often searched on this chapter's topic.

When should you use function calling instead of having the model generate text directly?

Three scenarios call for it: (1) structured actions, such as DB queries, API calls, code execution, and search; (2) constrained outputs, where a tool call is more reliable than free text; (3) auditable actions, where the tool name/args/result must be logged. A tool call separates the model's decision from the actual execution, and a parameter schema plus server-side validation is an order of magnitude more stable than prompting "please output JSON".

How should a tool schema be written?

Principles: clear names and descriptions, a type for every param, enums for constrained values, explicit required vs optional fields, and defaults kept server-side. Descriptions should include usage context, examples, and defaults; the consolidation principle says that if a human engineer can't tell which tool to use, the model is even less likely to pick the right one. All inputs must be validated again server-side, then rejected or repaired before execution.

What should the tool execution loop look like?

The standard loop: while not done → ask the model for tool call(s) → validate/repair args → run in a sandbox with a timeout → feed the tool result back to the model → check for a final answer, max steps, or an exhausted time budget → otherwise continue. Every tool needs a timeout, a circuit breaker (noisy tools trip automatically), an allowlist of domains/APIs, and no raw shell access (a sandbox is mandatory).

How should tool errors be handled? Should you retry?

Distinguish the error type: for user errors (bad params), return a concise error to the model so it can replan or ask the user; for system errors (tool down), retry with exponential backoff and cap the total attempts. Mutating tools must be idempotent (a retry must not double-charge); high-risk actions (deleting data, transferring money) need a confirmation step. Every tool-error log entry should include the tool name, args (scrubbed), duration, success/fail, and tokens.

How do you test a tool system?

Three kinds of tests: (1) contract tests: schema compliance, required params present; (2) golden cases: correct tool selection, and refusal when no tool fits; (3) load/chaos: inject tool errors deliberately and verify graceful degradation. Multimodal tools (accepting files/URLs) should validate size/type, pre-process (OCR/transcripts), return a handle/ID instead of a raw blob, and store artifacts with a TTL. The prompt should say: "Prefer calling tools when helpful; don't guess params; ask for missing info."