LLM Platformization and Gateways
LLM platformization builds a gateway so every model call goes through one place with consistent auth, routing, quotas, observability, and safety.
1) Goals
- One entrypoint for all LLM calls (OpenAI / Claude / Gemini / self-hosted).
- Uniform auth, rate limits, logging, schema checks, and safe rollout levers.
- Swappable providers with canary, A/B, and kill switch baked in.
2) Core capabilities
- Auth & secrets: per-tenant/service keys; never expose provider keys to clients.
- Rate limiting & quotas: global + per-tenant; burst control; model allowlists.
- Request shaping: default headers/timeouts; caps on max tokens/temperature/input size.
- Validation: required fields, JSON schema for tool/function calls.
- Safety: content filters, PII scrubbing, prompt-injection filters.
- Routing: provider selection by policy (cost/region/latency); model aliases.
- Retries & backoff: idempotent POST with request IDs; 429/5xx backoff.
- Observability: structured logs (traceId, model, tokens, latency); metrics to dashboards.
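The rate-limiting bullet above can be sketched as a per-tenant token bucket that allows bursts up to a capacity and refills continuously; the `capacity`/`refillPerSec` values below are illustrative defaults, not values from the config in the next section.

```typescript
// Per-tenant token bucket: admits bursts up to `capacity`,
// refills at `refillPerSec` tokens per second.
class TokenBucket {
  private tokens: number;
  private last: number;

  constructor(private capacity: number, private refillPerSec: number) {
    this.tokens = capacity;
    this.last = Date.now();
  }

  // Returns true if the call is admitted, false if it should get a 429.
  tryRemove(now: number = Date.now()): boolean {
    const elapsed = Math.max(0, (now - this.last) / 1000);
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSec);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// One bucket per tenant, e.g. rpm 120 -> capacity 120, refill 2/sec.
const buckets = new Map<string, TokenBucket>();
function admit(tenantId: string): boolean {
  let b = buckets.get(tenantId);
  if (!b) {
    b = new TokenBucket(120, 2);
    buckets.set(tenantId, b);
  }
  return b.tryRemove();
}
```

A global limiter is the same structure with a single shared bucket; burst control falls out of the capacity being larger than the steady refill rate.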
3) Config shape (example)
models:
  gpt-5:
    provider: openai
    model: gpt-5
    timeout_ms: 60000
    limits:
      max_tokens: 2000
      rpm: 120
    safety:
      pii_scrub: true
      blocklist_categories: [self_harm, violence]
    rollout:
      traffic_percent: 5   # canary
      allow_tenants: [beta-team]
  claude-3-5-sonnet:
    provider: anthropic
    model: claude-3-5-sonnet-latest
    fallback: gpt-4o
routes:
  chat:
    model_alias: gpt-5
    primary: gpt-5
    fallback: claude-3-5-sonnet
    limits:
      max_prompt_chars: 32000
    tools_schema: schemas/tool-calls.json
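A route like `chat` above implies an ordered fallback chain: the primary alias, then the route-level fallback, then any model-level fallbacks. A minimal resolver over that config shape might look like this; the `Config` interfaces mirror the YAML, and the chain-walking logic is a sketch, not part of any real gateway library.

```typescript
interface ModelEntry { provider: string; model: string; fallback?: string; }
interface Route { model_alias: string; primary: string; fallback?: string; }
interface Config { models: Record<string, ModelEntry>; routes: Record<string, Route>; }

// Resolve a route to the ordered list of model aliases to try:
// primary first, then the route fallback, then model-level fallbacks.
// The `seen` set prevents fallback cycles; unknown aliases end the chain.
function resolveChain(cfg: Config, routeName: string): string[] {
  const route = cfg.routes[routeName];
  if (!route) throw new Error(`unknown route: ${routeName}`);
  const chain: string[] = [];
  const seen = new Set<string>();
  let next: string | undefined = route.primary;
  while (next && !seen.has(next) && cfg.models[next]) {
    chain.push(next);
    seen.add(next);
    next = cfg.models[next].fallback ?? (chain.length === 1 ? route.fallback : undefined);
  }
  return chain;
}
```

With the config above this yields `['gpt-5', 'claude-3-5-sonnet']`; `gpt-4o` is named as a fallback but dropped because it has no `models` entry.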
4) Minimal gateway handler (Node/TypeScript example)
import express from 'express';
import axios from 'axios';
import { randomUUID } from 'node:crypto';
import { validate } from './schema';
import { pickModelByPolicy } from './routing';
import { rateLimit } from './ratelimit';
import { scrubPII } from './safety';
import { config } from './config';

const app = express();
app.use(express.json({ limit: '2mb' }));

app.post('/v1/chat', rateLimit(), async (req, res) => {
  const traceId = req.header('x-trace-id') || randomUUID();
  const { messages, tools, model_alias } = req.body;
  try {
    validate(messages, 'schemas/messages.json');
    if (tools) validate(tools, 'schemas/tools.json');

    const route = config.routes.chat;
    const model = model_alias || route.model_alias;
    const target = pickModelByPolicy(model, route, req);

    const payload = {
      model: target.model,
      messages: scrubPII(messages),
      tools,
      temperature: Math.min(req.body.temperature ?? 0.7, 1),
      max_tokens: Math.min(req.body.max_tokens ?? 512, target.limits?.max_tokens ?? 2048)
    };

    const resp = await axios.post(target.url, payload, {
      headers: { Authorization: `Bearer ${target.key}`, 'x-trace-id': traceId },
      timeout: target.timeout_ms || 60000
    });
    res.json({ traceId, provider: target.provider, data: resp.data });
  } catch (err: any) {
    const status = err.response?.status;
    // On 429/5xx, retry once via the configured fallback alias;
    // the x-fallback header guards against an infinite re-entry loop.
    if ((status === 429 || status >= 500) && !req.headers['x-fallback']) {
      const fallback = config.routes.chat.fallback;
      if (fallback && config.models[fallback]) {
        req.headers['x-fallback'] = '1';
        req.body.model_alias = fallback;
        return app.handle(req, res);
      }
    }
    res.status(status || 500).json({ traceId, error: err.message });
  }
});

app.listen(8080, () => console.log('LLM gateway running on :8080'));
5) Canary / A-B / Kill switch
- Canary: route 1-5% traffic to the new model/prompt; compare success/cost/latency before ramp.
- A/B: bucket by user/tenant; log bucket in traces for dashboards.
- Kill switch: mark a model or prompt version as disabled; route instantly to the last stable alias.
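Stable bucketing for both canary and A/B can be done with a deterministic hash of the user or tenant id, so the same user always lands in the same bucket; the FNV-1a hash and the percent threshold here are illustrative choices, not a prescribed algorithm.

```typescript
// FNV-1a 32-bit hash: deterministic, so a user always lands in the same bucket.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Map a user/tenant id to bucket 0..99.
function bucketOf(userId: string): number {
  return fnv1a(userId) % 100;
}

// Route to the canary when the bucket falls below traffic_percent
// (e.g. traffic_percent: 5 sends buckets 0-4 to the new model).
function useCanary(userId: string, trafficPercent: number): boolean {
  return bucketOf(userId) < trafficPercent;
}
```

Logging `bucketOf(userId)` alongside the traceId is what lets dashboards split success/cost/latency by experiment arm.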
6) Caching & idempotency
- Cache deterministic calls (FAQ, structured extraction) with hashed input + model alias.
- Enforce idempotency keys on long/expensive calls; dedupe repeats.
- Protect downstream tools with circuit breakers and timeouts.
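The caching and idempotency bullets can be sketched as a hash of the model alias plus normalized input for the cache key, and an in-flight map for deduping repeats. This is a minimal in-memory sketch; a production gateway would back the dedupe store with Redis or similar so it survives restarts and works across instances.

```typescript
import { createHash } from 'node:crypto';

// Cache key: hash of model alias + normalized messages, so the same
// question against the same model hits the same cache entry.
function cacheKey(modelAlias: string, messages: { role: string; content: string }[]): string {
  const normalized = JSON.stringify(messages);
  return createHash('sha256').update(modelAlias + '\n' + normalized).digest('hex');
}

// In-memory idempotency store: the first call with a key wins;
// repeats get the stored promise instead of a second provider call.
const inflight = new Map<string, Promise<unknown>>();

function dedupe<T>(key: string, run: () => Promise<T>): Promise<T> {
  const existing = inflight.get(key);
  if (existing) return existing as Promise<T>;
  const p = run().finally(() => inflight.delete(key));
  inflight.set(key, p);
  return p;
}
```

Including the alias in the key matters: the same prompt against `gpt-5` and `claude-3-5-sonnet` must not share a cache entry.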
7) Multi-tenant controls
- Tenant-scoped limits and model allowlists.
- Billing hooks: log tokens per tenant for showback/chargeback.
- Data isolation: avoid cross-tenant context mixing; filter by tenant_id.
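Tenant-scoped allowlists can be enforced in one guard before routing; the `TenantPolicy` shape and the example tenants below are assumptions for illustration, not part of the config format above.

```typescript
interface TenantPolicy {
  allowedModels: string[];  // model allowlist for this tenant
  rpm: number;              // tenant-scoped rate limit
}

// Hypothetical tenant policies; in practice these live in the config store.
const policies: Record<string, TenantPolicy> = {
  'beta-team': { allowedModels: ['gpt-5', 'claude-3-5-sonnet'], rpm: 120 },
  'free-tier': { allowedModels: ['claude-3-5-sonnet'], rpm: 10 },
};

// Reject unknown tenants and models outside the tenant's allowlist
// before any provider call is made.
function checkTenant(tenantId: string, modelAlias: string): { ok: boolean; reason?: string } {
  const policy = policies[tenantId];
  if (!policy) return { ok: false, reason: 'unknown tenant' };
  if (!policy.allowedModels.includes(modelAlias)) {
    return { ok: false, reason: `model ${modelAlias} not allowed for ${tenantId}` };
  }
  return { ok: true };
}
```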
8) Minimal checklist
- Config store (Git/DB) + hot reload.
- Health checks per provider; auto-disable unhealthy routes.
- Structured logs + metrics (success rate, P95 latency, cost, 429/5xx).
- Audit: who changed routing/prompt/config and when.
❓ FAQ
The most frequently asked questions about this chapter's topic.
Why build an LLM gateway instead of calling the OpenAI SDK directly?
Direct calls have four pitfalls: provider keys get exposed to clients; quotas and rate limits cannot be enforced uniformly; switching providers, canarying, or hitting a kill switch all require changes to business code; and observability is scattered everywhere. A gateway funnels every LLM call through one entrypoint with unified auth, routing, quotas, logging, and safety, so swapping models is a config change, not a code change.
What does the gateway config look like?
Two-part YAML: a models section defines each provider/model's timeout, max_tokens, rpm, PII scrubbing, blocklist, and canary traffic percentage; a routes section defines business routes (such as chat) pointing at a primary model, a fallback model, and a tools_schema. For example, gpt-5 sets traffic_percent: 5 for a canary, and claude-3-5-sonnet sets fallback: gpt-4o as a backstop.
What exactly is the difference between canary, A/B, and kill switch?
A canary shifts 1-5% of traffic to a new model or prompt and compares success/cost/latency before ramping up; A/B buckets by user/tenant and writes the bucket into traces for dashboards; a kill switch marks a model or prompt version as disabled and instantly routes back to the last stable alias, which is the stop-loss lever during incidents.
How does the gateway handle 429 / 5xx?
Three things: retry POSTs that carry idempotency keys with backoff, to avoid duplicate side effects; if a fallback model is configured, automatically switch to the fallback alias and resend (the example code re-enters via app.handle); and protect downstream tools with circuit breakers and timeouts, so one flaky provider cannot drag down the whole chain.
What controls must a gateway enforce in a multi-tenant setup?
Four: tenant-level rate limits plus model allowlists (not every tenant gets gpt-5); billing hooks that log per-tenant token usage for showback/chargeback; data isolation that filters by tenant_id and never mixes context across tenants; and an audit log of who changed routing/prompts/config and when.