
Evaluation & Quality Monitoring

⏱️ 45 minutes

Evaluation and quality monitoring keep your LLM features reliable. This chapter gives a practical playbook for building evals, catching regressions, and closing the feedback loop.

1) What to Measure

  • Relevance: Does the answer match the user intent?
  • Faithfulness: Is the answer grounded in sources (no hallucinations)?
  • Format: Does it follow the required schema/JSON/table?
  • Safety: No toxic content, PII leakage, or policy violations.
  • Latency & Cost: P95 response time, tokens per call.

2) Offline Eval Sets (Golden Tests)

  • Build a small, versioned dataset (e.g., 50-200 samples) with:
    • Input: query + context (if RAG) + expected format.
    • Labels: ideal answer, citations, expected refusal cases.
    • Negative cases: empty context, conflicting docs, forbidden requests.
  • Maintain a separate eval set per task type (FAQ, RAG Q&A, extraction, summarization).
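A golden set is easiest to version and diff as JSONL, one record per line. The sketch below shows one possible record shape covering the fields above (input, labels, a negative case); the field names are illustrative, not a standard.

```python
import json

# Illustrative golden-set records; field names are an assumption, not a standard.
golden_samples = [
    {
        "id": "faq-001",
        "input": {"query": "What is the refund window?",
                  "context": ["[doc-12] Refunds are accepted within 30 days."]},
        "expected": {"answer_contains": "30 days",
                     "citations": ["doc-12"],
                     "should_refuse": False},
    },
    {
        # Negative case: empty context should trigger a refusal.
        "id": "faq-002",
        "input": {"query": "What is the CEO's home address?", "context": []},
        "expected": {"answer_contains": None, "citations": [],
                     "should_refuse": True},
    },
]

def save_golden_set(path, samples):
    """Write one JSON object per line (JSONL) for easy versioning/diffing."""
    with open(path, "w", encoding="utf-8") as f:
        for s in samples:
            f.write(json.dumps(s, ensure_ascii=False) + "\n")

def load_golden_set(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```

Keeping the file flat and line-oriented means a code review of a dataset change shows exactly which samples were added or edited.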

3) Automatic Checks (Cheap Baseline)

  • Format validation: JSON schema / regex / length limits.
  • Citation enforcement: answer must contain provided citation IDs.
  • Refusal checks: when no context, expect a refusal phrase.
  • Guardrails: block unsafe categories (violence/PII) via content filters.
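The first three checks above need no LLM at all. A minimal sketch, assuming a `[doc-N]` citation format and a fixed list of refusal phrases (both assumptions, adapt to your own conventions):

```python
import re

CITATION_RE = re.compile(r"\[doc-\d+\]")          # assumed citation format, e.g. "[doc-12]"
REFUSAL_PHRASES = ("unknown", "i don't know", "i cannot answer")  # illustrative

def check_format(answer: str, max_len: int = 2000) -> bool:
    """Cheap format gate: non-empty and within a length budget."""
    return bool(answer.strip()) and len(answer) <= max_len

def check_citations(answer: str, provided_ids: list[str]) -> bool:
    """Every citation in the answer must come from the provided context IDs."""
    cited = {m.strip("[]") for m in CITATION_RE.findall(answer)}
    return bool(cited) and cited.issubset(set(provided_ids))

def check_refusal(answer: str, context: list[str]) -> bool:
    """With no context, the answer must contain a refusal phrase."""
    if context:
        return True
    return any(p in answer.lower() for p in REFUSAL_PHRASES)
```

These run in microseconds, so they can gate every request in production as well as every sample in CI.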

4) LLM-as-Judge (Quality Scoring)

  • Use a judge prompt to score relevance/faithfulness/clarity on a 1-5 scale.
  • Ask judge to mark hallucinations and missing citations.
  • Run with a smaller, cheaper model for CI, a stronger model for nightly evals.

Sample Judge Prompt

You are an evaluator. Given:
- question
- retrieved context (with IDs)
- model answer
Rate 1-5 for: relevance, faithfulness, format compliance.
List any hallucinations or missing citations. Reply in JSON:
{
  "relevance": 1-5,
  "faithfulness": 1-5,
  "format": 1-5,
  "issues": ["..."]
}
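The judge's reply should itself be validated before its scores feed a gate; a malformed reply should trigger a retry, not a silent pass. A sketch validating the JSON shape shown above:

```python
import json

def parse_judge_reply(raw: str) -> dict:
    """Validate a judge reply against the schema above.
    Raises ValueError on malformed output so callers can retry the judge call."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"judge reply is not JSON: {e}") from e
    for key in ("relevance", "faithfulness", "format"):
        score = reply.get(key)
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"bad score for {key!r}: {score!r}")
    if not isinstance(reply.get("issues"), list):
        raise ValueError("'issues' must be a list")
    return reply
```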

5) Online Metrics & Instrumentation

  • Log per request: model, temperature, latency, tokens, response status, retry count.
  • For RAG: log top-k docs, scores, chosen citations.
  • Add user feedback hooks (thumbs up/down with optional comment).
  • Build dashboards: success rate, P95 latency, cost per 1k calls, top errors/429s.
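A single structured log line per request is enough to power all of these dashboards. A minimal sketch; the field names are illustrative and the sink defaults to stdout so it plugs into any log shipper:

```python
import json
import time
import uuid

def log_llm_call(model, temperature, start, tokens_in, tokens_out,
                 status, retry_count=0, rag_docs=None, sink=print):
    """Emit one structured JSON log line per LLM request.
    `start` is a time.monotonic() timestamp taken before the call;
    `rag_docs` optionally carries top-k doc IDs and retrieval scores."""
    record = {
        "request_id": str(uuid.uuid4()),
        "model": model,
        "temperature": temperature,
        "latency_ms": round((time.monotonic() - start) * 1000, 1),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "status": status,
        "retry_count": retry_count,
    }
    if rag_docs is not None:
        record["rag_docs"] = rag_docs
    sink(json.dumps(record))
    return record
```

With one line per request, P95 latency, cost per 1k calls, and 429 rates all become simple aggregations in whatever log store you already run.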

6) Regression Gates

  • CI gate: run offline eval set; fail if score drops > threshold or format errors increase.
  • Schema gate: reject responses failing JSON schema; retry with “fix-format” prompt.
  • Safety gate: block outputs violating policy; return safe fallback.
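The CI gate reduces to comparing the current eval run against a stored baseline. A sketch under the assumption that both runs are summarized as mean judge scores plus a format-error count:

```python
def regression_gate(current: dict, baseline: dict,
                    max_drop: float = 0.3, max_new_format_errors: int = 0) -> list[str]:
    """Return a list of failure reasons; an empty list means the gate passes.
    `current`/`baseline` hold mean judge scores plus a format-error count."""
    failures = []
    for metric in ("relevance", "faithfulness"):
        drop = baseline[metric] - current[metric]
        if drop > max_drop:
            failures.append(f"{metric} dropped by {drop:.2f} (> {max_drop})")
    if current["format_errors"] > baseline["format_errors"] + max_new_format_errors:
        failures.append("format errors increased")
    return failures
```

In CI, a non-empty return value fails the build and the reasons go straight into the job log.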

7) Hallucination & Safety Mitigations

  • Prompt constraints: “answer only from provided context; if unknown, say UNKNOWN”.
  • Citations mandatory for factual claims.
  • Refusal logic: no context → refuse; unsafe request → safe decline.
  • Rerank/verify: cross-check with a secondary model or rule-based validators.

8) A/B and Canary

  • Route small traffic to new model/prompt; compare CTR/feedback/latency/cost.
  • Kill switch: instant rollback to previous config if metrics degrade.
  • Version configs: prompt/version ID baked into logs for traceability.
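One common way to route a small slice of traffic is hash-based bucketing, which keeps each user on the same variant across requests. A sketch with illustrative config IDs:

```python
import hashlib

def route_canary(user_id: str, canary_percent: int = 5,
                 stable: str = "prompt-v1", canary: str = "prompt-v2") -> str:
    """Deterministically route ~canary_percent of users to the canary config.
    Hashing the user ID (rather than random choice) keeps a given user
    pinned to one variant, so their feedback is attributable to it."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return canary if bucket < canary_percent else stable
```

The kill switch is then just setting `canary_percent` to 0 in config, with no deploy needed.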

9) Red Teaming

  • Adversarial prompts: prompt injection, jailbreaks, policy bypass attempts.
  • Data exfiltration tests: attempts to leak secrets/PII.
  • Toxicity/abuse: profanity, harassment; ensure safe replies or refusal.

10) Cadence & Ownership

  • Daily/weekly: review dashboards, top failing prompts, cost anomalies.
  • Monthly: refresh eval sets with fresh data; rotate judge model if needed.
  • Ownership: assign an “LLM quality owner” per domain to triage regressions.

11) Minimal Toolkit

  • Eval runner: script to load golden set, call model, run checks, produce JSON report.
  • Schema validator: JSON schema for each task type.
  • Judge prompts: stored and versioned; tag with model version.
  • Dashboard: Grafana/Looker or vendor dashboard for latency/cost/success.
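The eval runner from the toolkit can be very small. A sketch tying the pieces together, where `call_model` and the named `checks` are supplied by the caller (the names here are illustrative):

```python
def run_evals(samples, call_model, checks):
    """Minimal eval runner: call the model on each golden sample, apply each
    named check to the answer, and return a JSON-serializable report.
    `call_model(input) -> str`; each check is `check(answer, sample) -> bool`."""
    results = []
    for sample in samples:
        answer = call_model(sample["input"])
        failed = [name for name, check in checks.items()
                  if not check(answer, sample)]
        results.append({"id": sample["id"], "failed_checks": failed})
    passed = sum(1 for r in results if not r["failed_checks"])
    return {"total": len(results), "passed": passed, "results": results}
```

Dump the return value with `json.dumps` and you have the JSON report the CI gate consumes.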

12) Practice Tasks

  1. Build a 50-sample golden set for your FAQ/RAG use case with citations.
  2. Implement a judge prompt and JSON schema check; fail build if format score < 4 or faithfulness < 4.
  3. Add user feedback logging (thumbs + comment) and surface top 10 failure queries in a dashboard.

📚 Related Resources