Synthetic Data & Augmentation
Synthetic data is easily misunderstood as "don't have enough data, so let the model make up more." That's dangerous. Effective synthetic data isn't about making the dataset bigger -- it's about making coverage reasonable, making evals sharper, and exposing model blind spots earlier.
So this page isn't about batch-generating text. It's about how AI engineers should use synthetic data in a more reliable augmentation workflow.
Bottom line: fix coverage first, don't chase scale
When teams are short on data, the instinct is to 10x or 100x the dataset immediately.
A more stable sequence:
- Identify real data gaps
- Decide which case types need filling
- Generate, then verify and deduplicate
- Only then consider scaling up production
If coverage isn't thought through, synthetic data just amplifies bias.
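As a starting point, gap analysis can be as simple as counting how many eval items you have per case type. A minimal sketch in Python, assuming each item already carries a case_type tag from your own error analysis (the tag names and the MIN_PER_TYPE target below are illustrative):

```python
from collections import Counter

# Hypothetical eval items; in practice the case_type tags come from your own error analysis.
eval_items = [
    {"id": "e1", "case_type": "standard_query"},
    {"id": "e2", "case_type": "standard_query"},
    {"id": "e3", "case_type": "multilingual"},
]

# Case types you believe the eval set *should* cover, and an assumed per-type target.
EXPECTED_TYPES = ["standard_query", "multilingual", "long_context", "adversarial"]
MIN_PER_TYPE = 20

counts = Counter(item["case_type"] for item in eval_items)
for case_type in EXPECTED_TYPES:
    have = counts.get(case_type, 0)
    if have < MIN_PER_TYPE:
        print(f"{case_type}: have {have}, need {MIN_PER_TYPE - have} more")
```

Listing the expected types explicitly matters: a case type with zero examples never shows up in the counts, and those are usually the gaps worth filling first.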
Where synthetic data works best
| Scenario | Why it fits |
|---|---|
| eval set expansion | Fill edge cases and rare cases |
| safety / refusal testing | Generate jailbreak and policy-boundary samples |
| structured extraction | Batch-generate structurally similar but diverse inputs |
| locale / style variation | Cover different languages and expression styles |
| low-resource task bootstrap | Quickly build a trainable/evaluable starting point |
Its biggest value usually isn't training data -- it's evaluation and coverage gap-filling.
The most common misuses
| Misuse | Consequence |
|---|---|
| Same model generates and evaluates | Easy to fool yourself and misjudge quality |
| Only paraphrasing | Looks diverse, actually low information |
| No provenance tracking | Can't tell later which batch contaminated results |
| train / test source mixing | Leakage makes evaluation unreliable |
With synthetic data, more isn't automatically stronger. Once it gets out of control, it actually makes it harder to judge real capability.
A more stable generation pipeline
```
gap analysis
-> generation prompt
-> judge / filter
-> dedup
-> difficulty tagging
-> split and version
```
The middle steps are the ones that get skipped most often.
But it's precisely the judge, dedup, and tagging steps that determine whether a batch is usable.
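A minimal sketch of that wiring in Python, assuming you supply your own generate, judge, and tag_difficulty callables; only the dedup step is implemented here, and the function and field names are illustrative:

```python
from typing import Callable

def deduplicate(items: list[dict]) -> list[dict]:
    """Keep only the first occurrence of each whitespace-normalized text."""
    seen, kept = set(), []
    for item in items:
        key = " ".join(item["text"].lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(item)
    return kept

def run_pipeline(
    gaps: list[str],                              # output of your gap analysis
    generate: Callable[[str], list[dict]],        # gap -> candidate items with a "text" field
    judge: Callable[[dict], bool],                # filter step; ideally not the generating model
    tag_difficulty: Callable[[dict], str],        # e.g. easy / medium / hard / adversarial
    version_tag: str,
) -> list[dict]:
    candidates = [item for gap in gaps for item in generate(gap)]
    kept = [c for c in candidates if judge(c)]    # judge / filter
    kept = deduplicate(kept)                      # dedup before counting anything as coverage
    for item in kept:
        item["difficulty"] = tag_difficulty(item)
        item["version"] = version_tag             # split assignment happens at write time
    return kept
```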
Common generation methods
| Method | Good for |
|---|---|
| paraphrase | Adding expression diversity |
| template expansion | Adding entity, industry, locale variables |
| adversarial generation | Adding attack samples, boundary samples |
| source-grounded generation | Generating Q&A / citation cases from real docs |
The most reliable is usually the last, because it stays closest to real tasks and is the easiest to check for correctness.
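A sketch of what source-grounded generation can look like, assuming a call_llm client of your choice (the prompt text and field names here are illustrative, not a fixed API). Keeping the source passage on each generated item is what makes the later faithfulness check possible:

```python
# Source-grounded generation: every synthetic Q&A carries the source chunk it was
# built from, so faithfulness can later be checked against that exact text.
GEN_PROMPT = """You are building evaluation data.
Using ONLY the passage below, write one question a user might ask
and the answer, quoting the passage where possible.

Passage:
{passage}

Return JSON: {{"question": "...", "answer": "..."}}"""

def generate_grounded_item(passage: str, doc_id: str, call_llm) -> dict:
    raw = call_llm(GEN_PROMPT.format(passage=passage))  # call_llm is whatever client you use
    return {
        "source_doc": doc_id,      # provenance back to the real document
        "source_text": passage,    # kept so a faithfulness check can compare later
        "generation": raw,
    }
```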
Verification can't rely on "looks fine"
Each synthetic item should at minimum go through:
| Check | Why it's needed |
|---|---|
| format validation | Ensures structure fits the pipeline |
| faithfulness check | Makes sure it didn't drift from source |
| safety filter | Prevents unwanted content from sneaking in |
| manual spot check | Automated judges miss things too |
| duplicate control | Prevents fake diversity |
Without these, the improvements you see might just be dataset contamination.
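The first and last of these checks are easy to automate; the faithfulness check, safety filter, and manual spot check depend on your judges and reviewers. A minimal sketch of format validation plus a crude duplicate check, assuming generated items are JSON objects with question and answer fields (an assumption for this sketch, not a required schema):

```python
import json

REQUIRED_KEYS = {"question", "answer"}  # assumed schema for this sketch

def format_ok(raw: str) -> bool:
    """Format validation: the item must parse and carry the required fields."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()

def near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Crude duplicate control: token-overlap (Jaccard) between two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) >= threshold
```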
Provenance and versioning are critical
Each synthetic sample should record at minimum:
- source
- generation model
- generation prompt version
- judge result
- tags
- difficulty
- split
Without provenance, the moment you get bad samples, weird improvements, or regressions, you'll have almost no way to trace them.
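One way to keep this lightweight is a single record type written out as JSONL. A sketch with illustrative field names and values (the model name, tags, and paths are made up):

```python
from dataclasses import dataclass, field, asdict

@dataclass
class SyntheticRecord:
    """Minimum provenance for one synthetic sample; field names are illustrative."""
    text: str
    source: str               # real document or seed the item was derived from
    generation_model: str     # model name and version used to generate
    prompt_version: str       # version of the generation prompt
    judge_result: str         # pass / fail or score from the judge step
    difficulty: str           # easy / medium / hard / adversarial
    split: str                # train / dev / test, never changed after assignment
    tags: list[str] = field(default_factory=list)

record = SyntheticRecord(
    text="...", source="docs/faq.md", generation_model="gen-model-v1",
    prompt_version="v3", judge_result="pass", difficulty="hard",
    split="test", tags=["refusal", "policy-boundary"],
)
print(asdict(record))  # ready to write as one JSONL line
```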
Difficulty layering matters more than averaging
The most valuable thing AI engineers can do with synthetic data is difficulty tagging.
| Difficulty | Examples |
|---|---|
| easy | Standard format, standard expression |
| medium | Paraphrase, light ambiguity |
| hard | Multi-constraint, long context, edge case |
| adversarial | Injection, manipulation, format breaking |
This way you actually know whether the model improved broadly or just looks better on easy cases.
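Once every item carries a difficulty tag, the per-difficulty breakdown is a few lines. A sketch, assuming each eval result records the difficulty assigned at generation time and a boolean correct flag:

```python
from collections import defaultdict

# Hypothetical eval results; difficulty comes from the tag assigned at generation time.
results = [
    {"difficulty": "easy", "correct": True},
    {"difficulty": "easy", "correct": True},
    {"difficulty": "hard", "correct": False},
    {"difficulty": "adversarial", "correct": False},
]

by_difficulty = defaultdict(list)
for r in results:
    by_difficulty[r["difficulty"]].append(r["correct"])

for difficulty in ("easy", "medium", "hard", "adversarial"):
    outcomes = by_difficulty.get(difficulty)
    if outcomes:
        acc = sum(outcomes) / len(outcomes)
        print(f"{difficulty:12s} accuracy: {acc:.0%} ({len(outcomes)} items)")
```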
When to stop expanding synthetic data
| Situation | Why |
|---|---|
| Real data pipeline isn't organized yet | You'll just keep spinning on synthetic |
| Judge mechanism is unstable | More data means more noise |
| Bad cases haven't been reviewed | Generating more won't hit the right spots |
| Team has no version governance | Data becomes increasingly unmaintainable |
Synthetic data is an amplifier, not a shortcut that replaces real data governance.
Practice
Take your current eval set. Don't 10x it yet.
Answer these 4 questions first:
- Which case types are missing?
- Which ones are worth filling with synthetic data?
- How will you judge the generated data?
- How will you record provenance?
Get these 4 clear, then start batch generation.
❓ FAQ
The most commonly asked questions about this chapter's topic.
Is synthetic data just "not enough data, so have the model generate more"?
No, and that usage is dangerous. The real value of synthetic data isn't 10x-ing the dataset; it's filling coverage sensibly, making evals sharper, and exposing model blind spots earlier. Do a gap analysis first to find what the real data is missing, then generate for those gaps; batch-expanding before coverage is thought through only amplifies bias.
What scenarios is synthetic data best suited for?
Five high-value scenarios: eval set expansion to fill edge and rare cases; safety and refusal testing with jailbreak and policy-boundary samples; structured extraction with batches of structurally similar but content-diverse inputs; locale / style variation to cover different languages and expression styles; and bootstrapping low-resource tasks. Its biggest value is usually evaluation and coverage gap-filling, not primary training data.
How do I judge whether a generated synthetic sample is actually usable?
Run each item through at least five checks: format validation to make sure it fits the pipeline, a faithfulness check to make sure it hasn't drifted from the source, a safety filter for unwanted content, a manual spot check (automated judges miss things too), and duplicate control to prevent fake diversity. Letting the same model both generate and evaluate is the most common mistake: it's easy to fool yourself and misjudge quality.
Why does every synthetic sample need provenance?
Without provenance, the moment you see weird improvements, regressions, or bad samples, you can barely trace them back. Record at least these seven fields per sample: source, generation model, prompt version, judge result, tags, difficulty, and split, so that when something goes wrong you can trace exactly which batch contaminated the results.
When should I stop expanding synthetic data?
Four red-line situations: the real data pipeline isn't organized yet (you'll just keep spinning on synthetic), the judge mechanism is unstable (the more you expand, the more noise), bad cases haven't been reviewed (generating more won't hit the right spots), and the team has no version governance (the data becomes increasingly unmaintainable). Synthetic data is an amplifier, not a shortcut that replaces real data governance.