Evaluation & Quality Monitoring

⏱️ 45分钟

Evaluation and quality monitoring keep your LLM features reliable. This chapter gives a practical playbook for building evals, catching regressions, and closing the feedback loop.

1) What to Measure

Relevance: Does the answer match the user intent?
Faithfulness: Is the answer grounded in sources (no hallucinations)?
Format: Does it follow the required schema/JSON/table?
Safety: No toxic/PII leakage/policy violations.
Latency & Cost: P95 response time, tokens per call.

2) Offline Eval Sets (Golden Tests)

Build a small, versioned dataset (e.g., 50-200 samples) with:
- Input: query + context (if RAG) + expected format.
- Labels: ideal answer, citations, expected refusal cases.
- Negative cases: empty context, conflicting docs, forbidden requests.
Keep per-direction/task evals (FAQ, RAG Q&A, extraction, summarization).

3) Automatic Checks (Cheap Baseline)

Format validation: JSON schema / regex / length limits.
Citation enforcement: answer must contain provided citation IDs.
Refusal checks: when no context, expect a refusal phrase.
Guardrails: block unsafe categories (violence/PII) via content filters.

4) LLM-as-Judge (Quality Scoring)

Use a judge prompt to score relevance/faithfulness/clarity on a 1-5 scale.
Ask judge to mark hallucinations and missing citations.
Run with a smaller, cheaper model for CI, a stronger model for nightly evals.

Sample Judge Prompt

You are an evaluator. Given:
- question
- retrieved context (with IDs)
- model answer
Rate 1-5 for: relevance, faithfulness, format compliance.
List any hallucinations or missing citations. Reply in JSON:
{
  "relevance": 1-5,
  "faithfulness": 1-5,
  "format": 1-5,
  "issues": ["..."]
}

5) Online Metrics & Instrumentation

Log per request: model, temperature, latency, tokens, response status, retry count.
For RAG: log top-k docs, scores, chosen citations.
Add user feedback hooks (thumbs up/down with optional comment).
Build dashboards: success rate, P95 latency, cost per 1k calls, top errors/429s.

6) Regression Gates

CI gate: run offline eval set; fail if score drops > threshold or format errors increase.
Schema gate: reject responses failing JSON schema; retry with “fix-format” prompt.
Safety gate: block outputs violating policy; return safe fallback.

7) Hallucination & Safety Mitigations

Prompt constraints: “answer only from provided context; if unknown, say UNKNOWN”.
Citations mandatory for factual claims.
Refusal logic: no context → refuse; unsafe request → safe decline.
Rerank/verify: cross-check with a secondary model or rule-based validators.

8) A/B and Canary

Route small traffic to new model/prompt; compare CTR/feedback/latency/cost.
Kill switch: instant rollback to previous config if metrics degrade.
Version configs: prompt/version ID baked into logs for traceability.

9) Red Teaming

Adversarial prompts: prompt injection, jailbreaks, policy bypass attempts.
Data exfiltration tests: attempts to leak secrets/PII.
Toxicity/abuse: profanity, harassment; ensure safe replies or refusal.

10) Cadence & Ownership

Daily/weekly: review dashboards, top failing prompts, cost anomalies.
Monthly: refresh eval sets with fresh data; rotate judge model if needed.
Ownership: assign an “LLM quality owner” per domain to triage regressions.

11) Minimal Toolkit

Eval runner: script to load golden set, call model, run checks, produce JSON report.
Schema validator: JSON schema for each task type.
Judge prompts: stored and versioned; tag with model version.
Dashboard: Grafana/Looker or vendor dashboard for latency/cost/success.

免费证书

12) Practice Tasks

Build a 50-sample golden set for your FAQ/RAG use case with citations.
Implement a judge prompt and JSON schema check; fail build if format score < 4 or faithfulness < 4.
Add user feedback logging (thumbs + comment) and surface top 10 failure queries in a dashboard.

📚 相关资源

OpenAI API 文档

1) What to Measure

2) Offline Eval Sets (Golden Tests)

3) Automatic Checks (Cheap Baseline)

4) LLM-as-Judge (Quality Scoring)

Sample Judge Prompt

5) Online Metrics & Instrumentation

6) Regression Gates

7) Hallucination & Safety Mitigations

8) A/B and Canary

9) Red Teaming

10) Cadence & Ownership

11) Minimal Toolkit

最新免费认证与项目

12) Practice Tasks

📚 相关资源