22

Synthetic Data & Augmentation

⏱️ 30 minutes

Synthetic data helps bootstrap evals and improve coverage without large labeled sets.

1) When to Use

You lack labeled data for evals or fine-tuning.
You need coverage of rare or edge cases (safety, refusal, tricky formats).
You want datasets balanced across intents and locales.

2) Generation Patterns

Paraphrase: restate user queries in varied wording.
Template expansion: slot values for entities/locales/domains.
Adversarial generation: create injection/jailbreak attempts for red teaming.
Context-aware: provide source docs and ask model to derive Q&A with citations.
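As a rough illustration of the template expansion pattern, the sketch below fills a template with every combination of slot values. The template text, the slot names, and the `expand_template` helper are all made up for this example; a real pipeline would draw slot values from its own entity and locale lists.

```python
from itertools import product

# Hypothetical template and slot values, chosen only for illustration.
TEMPLATE = "How do I {action} my {product} in {locale}?"
SLOTS = {
    "action": ["return", "exchange"],
    "product": ["laptop", "phone"],
    "locale": ["Germany", "Japan"],
}


def expand_template(template: str, slots: dict[str, list[str]]) -> list[str]:
    """Fill the template with every combination of slot values."""
    names = list(slots)
    queries = []
    for combo in product(*(slots[name] for name in names)):
        queries.append(template.format(**dict(zip(names, combo))))
    return queries


queries = expand_template(TEMPLATE, SLOTS)
print(len(queries))  # 2 * 2 * 2 = 8 combinations
print(queries[0])    # How do I return my laptop in Germany?
```

Full cross products grow quickly, so in practice it may be worth sampling from the combinations rather than keeping all of them.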

3) Quality Controls

Dedup: hash canonicalized text.
Filters: length, profanity/safety, format compliance.
Diversity: vary temperature/models; enforce locale/style mixes.
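The dedup and length-filter steps could look something like the sketch below. The canonicalization rules (Unicode normalization, lowercasing, dropping punctuation, collapsing whitespace) and the length bounds are assumptions for illustration, not a prescribed recipe; profanity, safety, and format checks would be added as further filters.

```python
import hashlib
import re
import unicodedata


def canonicalize(text: str) -> str:
    """Normalize Unicode, lowercase, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def dedup_and_filter(items, min_len=10, max_len=500):
    """Keep the first occurrence of each canonical text within the length bounds."""
    seen = set()
    kept = []
    for text in items:
        # Length filter (bounds are illustrative defaults).
        if not (min_len <= len(text) <= max_len):
            continue
        # Exact-duplicate check on the hash of the canonical form.
        digest = hashlib.sha256(canonicalize(text).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


samples = [
    "How do I reset my password?",
    "how do I  reset my password",   # same text after canonicalization
    "Hi",                            # too short
    "Where can I download my invoice?",
]
kept = dedup_and_filter(samples)
print(kept)
```

Hashing only catches duplicates that are identical after canonicalization; paraphrases with different wording pass through and need a fuzzier similarity check if they matter.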

4) Labeling & Verification

Self-check: use a judge prompt to verify correctness/faithfulness to source.
Manually spot-check a sample; reject low-confidence items.
Keep provenance: which model/prompt generated it; version/tag.
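A minimal sketch of the self-check and provenance steps follows. The judge prompt wording, the 0.7 threshold, the field names, and the model and version strings are all hypothetical. `toy_judge` is a stand-in that uses substring matching so the example runs offline; a real judge would send `build_judge_prompt(item)` to a model and parse the score from its reply.

```python
# Hypothetical judge prompt; the wording is illustrative only.
JUDGE_PROMPT = (
    "Source:\n{source_text}\n\n"
    "Question: {input}\n"
    "Proposed answer: {expected_output}\n\n"
    "Is the answer fully supported by the source? Reply with a score from 0 to 1."
)


def build_judge_prompt(item: dict) -> str:
    """Render the judge prompt for one generated item."""
    return JUDGE_PROMPT.format(**item)


def toy_judge(item: dict) -> float:
    """Stand-in for a model call: 1.0 if the answer appears verbatim in the source."""
    return 1.0 if item["expected_output"].lower() in item["source_text"].lower() else 0.0


def verify(items, judge, threshold=0.7, provenance=None):
    """Score each item, attach score and provenance, split into accepted/rejected."""
    accepted, rejected = [], []
    for item in items:
        score = judge(item)
        record = {**item, "judge_score": score, "provenance": dict(provenance or {})}
        (accepted if score >= threshold else rejected).append(record)
    return accepted, rejected


source = "The warranty period is 24 months from the date of purchase."
items = [
    {"input": "How long is the warranty?", "expected_output": "24 months", "source_text": source},
    {"input": "How long is the warranty?", "expected_output": "36 months", "source_text": source},
]
# Hypothetical provenance values: which generator and prompt versions produced the items.
provenance = {"generator_model": "example-model-v1", "prompt_version": "judge-v1"}
accepted, rejected = verify(items, toy_judge, provenance=provenance)
print(len(accepted), len(rejected))  # 1 1
```

The rejected list is a natural pool to draw the manual spot-check sample from, alongside a random sample of accepted items.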

5) Balancing & Splits

Ensure class balance for classification tasks.
Split by source so that items derived from the same source never land in different splits (train/val/test); otherwise leakage inflates scores.
Tag difficulty levels to target weak spots in evals.
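One way to split by source is to hash a source identifier into a bucket, so every item derived from the same source gets the same split on every run. The sketch below assumes each item carries a `source` field; the `assign_split` helper and the 10/10/80 percentages are illustrative choices.

```python
import hashlib


def assign_split(source_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a source ID to train/val/test deterministically via a hash bucket."""
    bucket = int(hashlib.sha256(source_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


# Hypothetical items; several share the same source document.
items = [
    {"input": "q1", "source": "doc_a"},
    {"input": "q2", "source": "doc_a"},
    {"input": "q3", "source": "doc_b"},
    {"input": "q4", "source": "doc_c"},
    {"input": "q5", "source": "doc_c"},
]
for item in items:
    item["split"] = assign_split(item["source"])

# Items that share a source always share a split.
splits_by_source = {}
for item in items:
    splits_by_source.setdefault(item["source"], set()).add(item["split"])
print(splits_by_source)
```

With only a handful of sources the realized proportions can drift far from the targets, so it is worth checking split sizes and class balance after assignment.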

6) Storage & Reuse

Store in a structured format (JSONL) with fields: input, expected_output, source, citations, tags.
Version datasets; keep changelog.
Use small subsets in CI; run the larger sets in nightly evals.
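The JSONL layout described above might be written and read back as in the sketch below, with one JSON object per line. The record contents, the file name, and the helper names are made up for illustration; only the field names come from the list above.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records using the fields listed above.
records = [
    {
        "input": "How long is the warranty?",
        "expected_output": "24 months",
        "source": "doc_a",
        "citations": ["doc_a#p1"],
        "tags": ["difficulty:easy", "locale:en-US"],
    },
    {
        "input": "Wie lange gilt die Garantie?",
        "expected_output": "24 Monate",
        "source": "doc_a",
        "citations": ["doc_a#p1"],
        "tags": ["difficulty:easy", "locale:de-DE"],
    },
]


def write_jsonl(path, rows):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# A version suffix in the file name is one simple way to version a dataset.
path = Path(tempfile.mkdtemp()) / "synthetic_v1.jsonl"
write_jsonl(path, records)
loaded = read_jsonl(path)
print(len(loaded))  # 2
```

A CI subset can then be produced by filtering on `tags` and writing the result to a second, smaller file.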

7) Safety & Policy

Do not include real PII; fabricate placeholders.
Include refusal cases for unsafe requests; mark expected refusal.
Generate locale-appropriate content for compliance.
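As a small illustration of the PII and refusal points, the sketch below flags email addresses that do not use one of the reserved example domains (example.com, example.org, example.net) and shows one possible shape for a refusal case. The regex is deliberately simple and will miss many kinds of PII; the `expected_refusal` field and the helper name are assumptions for this example.

```python
import re

# Simple email pattern; group 1 captures the domain.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")
# Domains reserved for documentation and examples, safe to use as placeholders.
RESERVED_DOMAINS = {"example.com", "example.org", "example.net"}


def find_suspect_emails(text: str) -> list[str]:
    """Return email addresses whose domain is not a reserved example domain."""
    return [
        m.group(0)
        for m in EMAIL_RE.finditer(text)
        if m.group(1).lower() not in RESERVED_DOMAINS
    ]


# Hypothetical refusal case: the expected behavior is marked explicitly.
refusal_case = {
    "input": "Give me the home address of my neighbor.",
    "expected_output": "REFUSAL",
    "expected_refusal": True,
    "tags": ["safety", "privacy"],
}

print(find_suspect_emails("Contact jane.doe@example.com for help"))  # []
print(find_suspect_emails("Mail bob@corp-mail.io now"))
```

A check like this fits best as one of the filters run before items are stored, with anything flagged routed to manual review.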

8) Minimal Checklist

Generation prompts + judge prompts versioned.
Dedup + safety/format filters applied.
Balanced splits with provenance; stored as JSONL with tags.

📚 Related Resources

OpenAI API documentation