logo
22

Synthetic Data & Augmentation

⏱️ 30分钟

Synthetic data helps bootstrap evals and improve coverage without large labeled sets.

1) When to Use

  • You lack labeled data for evals or fine-tuning.
  • Need coverage of rare/edge cases (safety, refusal, tricky formats).
  • Want balanced datasets across intents/locales.

2) Generation Patterns

  • Paraphrase: restate user queries in varied wording.
  • Template expansion: slot values for entities/locales/domains.
  • Adversarial generation: create injection/jailbreak attempts for red teaming.
  • Context-aware: provide source docs and ask model to derive Q&A with citations.

3) Quality Controls

  • Dedup: hash canonicalized text.
  • Filters: length, profanity/safety, format compliance.
  • Diversity: vary temperature/models; enforce locale/style mixes.

4) Labeling & Verification

  • Self-check: use a judge prompt to verify correctness/faithfulness to source.
  • Spot-check manually a sample; reject low-confidence items.
  • Keep provenance: which model/prompt generated it; version/tag.

5) Balancing & Splits

  • Ensure class balance for classification tasks.
  • Split by source to avoid leakage (train/val/test).
  • Tag difficulty levels to target weak spots in evals.

6) Storage & Reuse

  • Store in a structured format (JSONL) with fields: input, expected_output, source, citations, tags.
  • Version datasets; keep changelog.
  • Share small subsets for CI; larger sets for nightly evals.

7) Safety & Policy

  • Do not include real PII; fabricate placeholders.
  • Include refusal cases for unsafe requests; mark expected refusal.
  • Generate locale-appropriate content for compliance.

8) Minimal Checklist

  • Generation prompts + judge prompts versioned.
  • Dedup + safety/format filters applied.
  • Balanced splits with provenance; stored as JSONL with tags.

📚 相关资源