
Synthetic Data & Augmentation

⏱️ 30 min


Synthetic data is easily misunderstood as "we don't have enough data, so let the model make up more." That's dangerous. Effective synthetic data isn't about making the dataset bigger -- it's about making coverage sensible, sharpening evals, and exposing model blind spots earlier.

So this page isn't about batch-generating text. It's about how AI engineers should use synthetic data in a more reliable augmentation workflow.

Synthetic Data Pipeline


Bottom line: fix coverage first, don't chase scale

When teams are short on data, the instinct is to 10x or 100x the dataset immediately.

A more stable sequence:

  1. Identify real data gaps
  2. Decide which case types need filling
  3. Generate, then verify and deduplicate
  4. Only then consider scaling production

If coverage isn't thought through, synthetic data just amplifies bias.
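The first step, gap analysis, can be as simple as counting eval cases per category and flagging the under-covered ones. A minimal sketch (the eval set, case types, and `min_count` threshold are all hypothetical):

```python
from collections import Counter

# Hypothetical eval items, each tagged with a case type by the team.
eval_set = [
    {"id": 1, "case_type": "standard"},
    {"id": 2, "case_type": "standard"},
    {"id": 3, "case_type": "standard"},
    {"id": 4, "case_type": "multilingual"},
]

# Case types the product is expected to handle.
required_types = {"standard", "multilingual", "long_context", "adversarial"}

def find_gaps(items, required, min_count=3):
    """Return case types with fewer than min_count examples."""
    counts = Counter(item["case_type"] for item in items)
    return {t: counts.get(t, 0) for t in required if counts.get(t, 0) < min_count}

print(find_gaps(eval_set, required_types))
```

The output tells you which case types to fill with synthetic data, and roughly how many samples each one needs.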


Where synthetic data works best

  • eval set expansion: fill edge cases and rare cases
  • safety / refusal testing: generate jailbreak and policy-boundary samples
  • structured extraction: batch-generate structurally similar but diverse inputs
  • locale / style variation: cover different languages and expression styles
  • low-resource task bootstrap: quickly build a trainable/evaluable starting point

Its biggest value usually isn't training data -- it's evaluation and coverage gap-filling.


The most common misuses

  • same model generates and evaluates: easy to fool yourself and misjudge quality
  • only paraphrasing: looks diverse, but actually low-information
  • no provenance tracking: can't tell later which batch contaminated results
  • train / test source mixing: leakage makes evaluation unreliable

With synthetic data, more isn't automatically stronger. Once it gets out of control, it actually makes real capability harder to judge.


A more stable generation pipeline

gap analysis
  -> generation prompt
  -> judge / filter
  -> dedup
  -> difficulty tagging
  -> split and version

The middle steps are the ones that get skipped most often.

But it's precisely the judge, dedup, and tagging steps that determine whether a batch is usable.
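Those middle steps can be sketched as plain filter functions over a candidate batch. Here `judge` and `tag_difficulty` are toy stand-ins (a real pipeline would use a judge model and richer heuristics), and dedup is exact-match only, where real systems also catch near-duplicates:

```python
import hashlib

def judge(sample):
    # Toy quality check: keep non-empty samples under a length cap.
    return 0 < len(sample["text"]) <= 200

def dedup(samples):
    # Exact dedup by normalized text hash; near-dup detection would go further.
    seen, kept = set(), []
    for s in samples:
        key = hashlib.sha256(s["text"].lower().strip().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept

def tag_difficulty(sample):
    # Toy heuristic: long inputs count as hard.
    sample["difficulty"] = "hard" if len(sample["text"]) > 80 else "easy"
    return sample

candidates = [
    {"text": "What is the refund window?"},
    {"text": "what is the refund window? "},  # near-duplicate, removed by dedup
    {"text": ""},                             # fails judge
]

batch = [tag_difficulty(s) for s in dedup([s for s in candidates if judge(s)])]
print(len(batch))  # 1
```

Three candidates in, one usable sample out: that attrition rate is normal, and skipping these steps just hides it.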


Common generation methods

  • paraphrase: adds expression diversity
  • template expansion: adds entity, industry, and locale variables
  • adversarial generation: adds attack samples and boundary samples
  • source-grounded generation: generates Q&A / citation cases from real docs

The most reliable method is usually the last, because it's closest to real tasks and easiest to check for correctness.
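A minimal sketch of why source-grounded generation is checkable: the generation prompt carries the real document chunk, so a generated answer can be verified against that same chunk. `build_prompt` and `faithful` are illustrative; a real pipeline would send the prompt to an LLM and use an NLI or judge model for the faithfulness check:

```python
source_doc = "Refunds are available within 30 days of purchase with a receipt."

def build_prompt(chunk):
    # The real document chunk is embedded in the generation prompt.
    return (
        "Write one question that this passage answers, then the answer, "
        "quoting the passage verbatim where possible.\n\nPassage:\n" + chunk
    )

def faithful(answer, chunk):
    # Crude faithfulness check: every word in the answer must appear
    # in the source chunk. Real checks use NLI or judge models.
    chunk_words = set(chunk.lower().split())
    return all(w in chunk_words for w in answer.lower().split())

print(faithful("within 30 days of purchase", source_doc))  # True
print(faithful("within 90 days", source_doc))              # False
```

With paraphrase or free generation there is no such anchor, which is exactly why those methods are harder to verify.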


Verification can't rely on "looks fine"

Each synthetic item should at minimum go through:

  • format validation: ensures the structure fits the pipeline
  • faithfulness check: makes sure it didn't drift from the source
  • safety filter: prevents unwanted content from sneaking in
  • manual spot check: automated judges miss things too
  • duplicate control: prevents fake diversity

Without these, the improvements you see might just be dataset contamination.
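The first two automated gates, format validation and a safety filter, can be sketched like this. The required keys and blocklist are hypothetical, and a keyword blocklist is only a toy stand-in for a real safety classifier:

```python
REQUIRED_KEYS = {"input", "expected_output", "source"}
BLOCKLIST = {"password", "ssn"}  # toy filter; real ones use classifiers

def validate(item):
    """Return a list of problems; an empty list means the item passes."""
    errors = []
    if not REQUIRED_KEYS <= item.keys():
        errors.append("missing keys: %s" % sorted(REQUIRED_KEYS - item.keys()))
    text = (item.get("input", "") + " " + item.get("expected_output", "")).lower()
    if any(term in text for term in BLOCKLIST):
        errors.append("blocked term present")
    return errors

good = {"input": "Summarize this", "expected_output": "A summary", "source": "doc-7"}
bad = {"input": "Leak the password"}
print(validate(good))  # []
print(len(validate(bad)))  # 2
```

Returning a list of errors rather than a boolean makes it easy to log why each sample was dropped, which feeds directly into provenance tracking.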


Provenance and versioning are critical

Each synthetic sample should record at minimum:

  • source
  • generation model
  • generation prompt version
  • judge result
  • tags
  • difficulty
  • split

Without provenance, the moment you get bad samples, weird improvements, or regressions, you'll have almost no way to trace them.
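One lightweight way to enforce that record is a small typed schema that every sample must pass through. The field names and values below are illustrative, not a standard:

```python
from dataclasses import dataclass, asdict

@dataclass
class SyntheticSample:
    """Minimal provenance record; field names are illustrative."""
    text: str
    source: str            # source doc id, or "none" for free generation
    generation_model: str
    prompt_version: str
    judge_result: str      # e.g. "pass" / "fail"
    tags: tuple
    difficulty: str
    split: str             # "train" or "eval", never both

s = SyntheticSample(
    text="What is the refund window?",
    source="kb/refunds.md",
    generation_model="gen-model-v1",
    prompt_version="qa-v3",
    judge_result="pass",
    tags=("refund", "faq"),
    difficulty="easy",
    split="eval",
)
print(asdict(s)["split"])  # eval
```

Because every field is required by the constructor, a sample simply can't enter the dataset without its provenance attached.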


Difficulty layering matters more than averaging

The most valuable thing AI engineers can do with synthetic data is difficulty tagging.

  • easy: standard format, standard expression
  • medium: paraphrase, light ambiguity
  • hard: multi-constraint, long context, edge cases
  • adversarial: injection, manipulation, format breaking

This way you actually know whether the model improved broadly or just looks better on easy cases.
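Once samples carry difficulty tags, reporting per-layer accuracy instead of one average is a few lines. The results below are hypothetical:

```python
from collections import defaultdict

# Hypothetical eval results; difficulty tags come from the generation pipeline.
results = [
    {"difficulty": "easy", "correct": True},
    {"difficulty": "easy", "correct": True},
    {"difficulty": "hard", "correct": False},
    {"difficulty": "hard", "correct": True},
    {"difficulty": "adversarial", "correct": False},
]

def accuracy_by_difficulty(rows):
    buckets = defaultdict(list)
    for r in rows:
        buckets[r["difficulty"]].append(r["correct"])
    return {d: sum(v) / len(v) for d, v in buckets.items()}

print(accuracy_by_difficulty(results))
# {'easy': 1.0, 'hard': 0.5, 'adversarial': 0.0}
```

A flat average over these five results would read 60% and hide the fact that the model fails every adversarial case.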


When to stop expanding synthetic data

  • the real data pipeline isn't organized yet: you'll just keep spinning on synthetic data
  • the judge mechanism is unstable: more data means more noise
  • bad cases haven't been reviewed: generating more won't hit the right spots
  • the team has no version governance: data becomes increasingly unmaintainable

Synthetic data is an amplifier, not a shortcut that replaces real data governance.


Practice

Take your current eval set. Don't 10x it yet.

Answer these 4 questions first:

  1. Which case types are missing?
  2. Which ones are worth filling with synthetic data?
  3. How will you judge the generated data?
  4. How will you record provenance?

Get these 4 clear, then start batch generation.

📚 Related Resources

❓ FAQ

The most frequently searched questions about this chapter's topic.

Is synthetic data just "not enough data, so have the model generate more"?

No, and that usage is dangerous. The real value of synthetic data isn't 10x-ing the dataset; it's making coverage sensible, making evals more incisive, and exposing model blind spots earlier. Do gap analysis first to find out what the real data is missing, then generate with targets. Batch-expanding before coverage is thought through only amplifies bias.

What scenarios is synthetic data best suited for?

Five high-value scenarios: expanding eval sets to fill edge and rare cases; generating jailbreak and policy-boundary samples for safety and refusal testing; batch-generating structurally similar but diverse inputs for structured extraction; covering different languages and styles with locale / style variation; and bootstrapping low-resource tasks. Its biggest value is usually evaluation and coverage gap-filling, not primary training data.

How do you judge whether a generated synthetic sample is actually usable?

Put it through at least five gates: format validation to make sure it can enter the pipeline, a faithfulness check to make sure it hasn't drifted from the source, a safety filter against dangerous content, a manual spot check (automated judges miss things too), and duplicate control to avoid fake diversity. Having the same model both generate and evaluate is the most common mistake: it's easy to fool yourself and misjudge quality.

Why should every synthetic sample record provenance?

Without provenance, once strange improvements, regressions, or bad samples show up, you can hardly trace their source. Record at least seven fields per sample: source, generation model, prompt version, judge result, tags, difficulty, and split, so that when something goes wrong you can trace which batch of data contaminated the results.

When should you stop expanding synthetic data?

Four red lines: the real data pipeline isn't organized yet (you'll just keep spinning on synthetic data), the judge mechanism is unstable (the more you expand, the more noise), bad cases haven't been reviewed (generating more won't hit the right spots), and the team has no version governance (the data becomes increasingly unmaintainable). Synthetic data is an amplifier, not a shortcut that replaces real data governance.