22
Synthetic Data & Augmentation
Synthetic data helps bootstrap evals and improve coverage without large labeled sets.
1) When to Use
- You lack labeled data for evals or fine-tuning.
- Need coverage of rare/edge cases (safety, refusal, tricky formats).
- Want balanced datasets across intents/locales.
2) Generation Patterns
- Paraphrase: restate user queries in varied wording.
- Template expansion: slot values for entities/locales/domains.
- Adversarial generation: create injection/jailbreak attempts for red teaming.
- Context-aware: provide source docs and ask model to derive Q&A with citations.
3) Quality Controls
- Dedup: hash canonicalized text.
- Filters: length, profanity/safety, format compliance.
- Diversity: vary temperature/models; enforce locale/style mixes.
4) Labeling & Verification
- Self-check: use a judge prompt to verify correctness/faithfulness to source.
- Spot-check manually a sample; reject low-confidence items.
- Keep provenance: which model/prompt generated it; version/tag.
5) Balancing & Splits
- Ensure class balance for classification tasks.
- Split by source to avoid leakage (train/val/test).
- Tag difficulty levels to target weak spots in evals.
6) Storage & Reuse
- Store in a structured format (JSONL) with fields: input, expected_output, source, citations, tags.
- Version datasets; keep changelog.
- Share small subsets for CI; larger sets for nightly evals.
7) Safety & Policy
- Do not include real PII; fabricate placeholders.
- Include refusal cases for unsafe requests; mark expected refusal.
- Generate locale-appropriate content for compliance.
8) Minimal Checklist
- Generation prompts + judge prompts versioned.
- Dedup + safety/format filters applied.
- Balanced splits with provenance; stored as JSONL with tags.