
Synthetic Data & Augmentation

⏱️ 30 min


Synthetic data is easily misunderstood as "we don't have enough data, so let the model make up more." That's dangerous. Effective synthetic data isn't about making the dataset bigger -- it's about making coverage sensible, sharpening evals, and exposing model blind spots earlier.

So this page isn't about batch-generating text. It's about how AI engineers should use synthetic data in a more reliable augmentation workflow.

Synthetic Data Pipeline


Bottom line: fix coverage first, don't chase scale

When teams are short on data, the instinct is to 10x or 100x the dataset immediately.

A more stable sequence:

  1. Identify real data gaps
  2. Decide which case types need filling
  3. Generate, then verify and deduplicate
  4. Only then consider scaling production

If coverage isn't thought through, synthetic data just amplifies bias.
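The first step, gap analysis, can be as simple as counting eval cases per category and flagging the under-covered ones. A minimal sketch (the eval set, case types, and `min_count` threshold are all hypothetical):

```python
from collections import Counter

# Hypothetical eval items, each tagged with a case type by the team.
eval_set = [
    {"id": 1, "case_type": "standard"},
    {"id": 2, "case_type": "standard"},
    {"id": 3, "case_type": "standard"},
    {"id": 4, "case_type": "multilingual"},
]

# Case types the product is expected to handle.
required_types = {"standard", "multilingual", "long_context", "adversarial"}

def find_gaps(items, required, min_count=3):
    """Return case types with fewer than min_count examples."""
    counts = Counter(item["case_type"] for item in items)
    return {t: counts.get(t, 0) for t in required if counts.get(t, 0) < min_count}

print(find_gaps(eval_set, required_types))
```

The output tells you which case types to fill with synthetic data, and roughly how many samples each one needs.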


Where synthetic data works best

  • eval set expansion: fill edge cases and rare cases
  • safety / refusal testing: generate jailbreak and policy-boundary samples
  • structured extraction: batch-generate structurally similar but diverse inputs
  • locale / style variation: cover different languages and expression styles
  • low-resource task bootstrap: quickly build a trainable/evaluable starting point

Its biggest value usually isn't training data -- it's evaluation and coverage gap-filling.


The most common misuses

  • same model generates and evaluates: easy to fool yourself and misjudge quality
  • only paraphrasing: looks diverse, but actually low-information
  • no provenance tracking: can't tell later which batch contaminated results
  • train / test source mixing: leakage makes evaluation unreliable

With synthetic data, more isn't automatically stronger. Once it gets out of control, it actually makes real capability harder to judge.


A more stable generation pipeline

gap analysis
  -> generation prompt
  -> judge / filter
  -> dedup
  -> difficulty tagging
  -> split and version

The middle steps are the ones that get skipped most often.

But it's precisely the judge, dedup, and tagging steps that determine whether a batch is usable.
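Those middle steps can be sketched as plain filter functions over a candidate batch. Here `judge` and `tag_difficulty` are toy stand-ins (a real pipeline would use a judge model and richer heuristics), and dedup is exact-match only, where real systems also catch near-duplicates:

```python
import hashlib

def judge(sample):
    # Toy quality check: keep non-empty samples under a length cap.
    return 0 < len(sample["text"]) <= 200

def dedup(samples):
    # Exact dedup by normalized text hash; near-dup detection would go further.
    seen, kept = set(), []
    for s in samples:
        key = hashlib.sha256(s["text"].lower().strip().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept

def tag_difficulty(sample):
    # Toy heuristic: long inputs count as hard.
    sample["difficulty"] = "hard" if len(sample["text"]) > 80 else "easy"
    return sample

candidates = [
    {"text": "What is the refund window?"},
    {"text": "what is the refund window? "},  # near-duplicate, removed by dedup
    {"text": ""},                             # fails judge
]

batch = [tag_difficulty(s) for s in dedup([s for s in candidates if judge(s)])]
print(len(batch))  # 1
```

Three candidates in, one usable sample out: that attrition rate is normal, and skipping these steps just hides it.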


Common generation methods

  • paraphrase: adds expression diversity
  • template expansion: adds entity, industry, and locale variables
  • adversarial generation: adds attack samples and boundary samples
  • source-grounded generation: generates Q&A / citation cases from real docs

The most reliable method is usually the last, because it's closest to real tasks and easiest to check for correctness.
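A minimal sketch of why source-grounded generation is checkable: the generation prompt carries the real document chunk, so a generated answer can be verified against that same chunk. `build_prompt` and `faithful` are illustrative; a real pipeline would send the prompt to an LLM and use an NLI or judge model for the faithfulness check:

```python
source_doc = "Refunds are available within 30 days of purchase with a receipt."

def build_prompt(chunk):
    # The real document chunk is embedded in the generation prompt.
    return (
        "Write one question that this passage answers, then the answer, "
        "quoting the passage verbatim where possible.\n\nPassage:\n" + chunk
    )

def faithful(answer, chunk):
    # Crude faithfulness check: every word in the answer must appear
    # in the source chunk. Real checks use NLI or judge models.
    chunk_words = set(chunk.lower().split())
    return all(w in chunk_words for w in answer.lower().split())

print(faithful("within 30 days of purchase", source_doc))  # True
print(faithful("within 90 days", source_doc))              # False
```

With paraphrase or free generation there is no such anchor, which is exactly why those methods are harder to verify.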


Verification can't rely on "looks fine"

Each synthetic item should at minimum go through:

  • format validation: ensures the structure fits the pipeline
  • faithfulness check: makes sure it didn't drift from the source
  • safety filter: prevents unwanted content from sneaking in
  • manual spot check: automated judges miss things too
  • duplicate control: prevents fake diversity

Without these, the improvements you see might just be dataset contamination.
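The first two automated gates, format validation and a safety filter, can be sketched like this. The required keys and blocklist are hypothetical, and a keyword blocklist is only a toy stand-in for a real safety classifier:

```python
REQUIRED_KEYS = {"input", "expected_output", "source"}
BLOCKLIST = {"password", "ssn"}  # toy filter; real ones use classifiers

def validate(item):
    """Return a list of problems; an empty list means the item passes."""
    errors = []
    if not REQUIRED_KEYS <= item.keys():
        errors.append("missing keys: %s" % sorted(REQUIRED_KEYS - item.keys()))
    text = (item.get("input", "") + " " + item.get("expected_output", "")).lower()
    if any(term in text for term in BLOCKLIST):
        errors.append("blocked term present")
    return errors

good = {"input": "Summarize this", "expected_output": "A summary", "source": "doc-7"}
bad = {"input": "Leak the password"}
print(validate(good))  # []
print(len(validate(bad)))  # 2
```

Returning a list of errors rather than a boolean makes it easy to log why each sample was dropped, which feeds directly into provenance tracking.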


Provenance and versioning are critical

Each synthetic sample should record at minimum:

  • source
  • generation model
  • generation prompt version
  • judge result
  • tags
  • difficulty
  • split

Without provenance, the moment you get bad samples, weird improvements, or regressions, you'll have almost no way to trace them.
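One lightweight way to enforce that record is a small typed schema that every sample must pass through. The field names and values below are illustrative, not a standard:

```python
from dataclasses import dataclass, asdict

@dataclass
class SyntheticSample:
    """Minimal provenance record; field names are illustrative."""
    text: str
    source: str            # source doc id, or "none" for free generation
    generation_model: str
    prompt_version: str
    judge_result: str      # e.g. "pass" / "fail"
    tags: tuple
    difficulty: str
    split: str             # "train" or "eval", never both

s = SyntheticSample(
    text="What is the refund window?",
    source="kb/refunds.md",
    generation_model="gen-model-v1",
    prompt_version="qa-v3",
    judge_result="pass",
    tags=("refund", "faq"),
    difficulty="easy",
    split="eval",
)
print(asdict(s)["split"])  # eval
```

Because every field is required by the constructor, a sample simply can't enter the dataset without its provenance attached.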


Difficulty layering matters more than averaging

The most valuable thing AI engineers can do with synthetic data is difficulty tagging.

  • easy: standard format, standard expression
  • medium: paraphrase, light ambiguity
  • hard: multi-constraint, long context, edge cases
  • adversarial: injection, manipulation, format breaking

This way you actually know whether the model improved broadly or just looks better on easy cases.
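Once samples carry difficulty tags, reporting per-layer accuracy instead of one average is a few lines. The results below are hypothetical:

```python
from collections import defaultdict

# Hypothetical eval results; difficulty tags come from the generation pipeline.
results = [
    {"difficulty": "easy", "correct": True},
    {"difficulty": "easy", "correct": True},
    {"difficulty": "hard", "correct": False},
    {"difficulty": "hard", "correct": True},
    {"difficulty": "adversarial", "correct": False},
]

def accuracy_by_difficulty(rows):
    buckets = defaultdict(list)
    for r in rows:
        buckets[r["difficulty"]].append(r["correct"])
    return {d: sum(v) / len(v) for d, v in buckets.items()}

print(accuracy_by_difficulty(results))
# {'easy': 1.0, 'hard': 0.5, 'adversarial': 0.0}
```

A flat average over these five results would read 60% and hide the fact that the model fails every adversarial case.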


When to stop expanding synthetic data

  • the real data pipeline isn't organized yet: you'll just keep spinning on synthetic data
  • the judge mechanism is unstable: more data means more noise
  • bad cases haven't been reviewed: generating more won't hit the right spots
  • the team has no version governance: data becomes increasingly unmaintainable

Synthetic data is an amplifier, not a shortcut that replaces real data governance.


Practice

Take your current eval set. Don't 10x it yet.

Answer these 4 questions first:

  1. Which case types are missing?
  2. Which ones are worth filling with synthetic data?
  3. How will you judge the generated data?
  4. How will you record provenance?

Get these 4 clear, then start batch generation.

📚 Related Resources

❓ FAQ

The most frequently searched questions about this chapter's topic.

Is synthetic data just "not enough data, so have the model generate more"?

No, and that usage is dangerous. The real value of synthetic data isn't 10x-ing the dataset; it's making coverage sensible, making evals more incisive, and exposing model blind spots earlier. Do gap analysis first to find out what the real data is missing, then generate with targets. Batch-expanding before coverage is thought through only amplifies bias.

What scenarios is synthetic data best suited for?

Five high-value scenarios: expanding eval sets to fill edge and rare cases; generating jailbreak and policy-boundary samples for safety and refusal testing; batch-generating structurally similar but diverse inputs for structured extraction; covering different languages and styles with locale / style variation; and bootstrapping low-resource tasks. Its biggest value is usually evaluation and coverage gap-filling, not primary training data.

How do you judge whether a generated synthetic sample is actually usable?

Put it through at least five gates: format validation to make sure it can enter the pipeline, a faithfulness check to make sure it hasn't drifted from the source, a safety filter against dangerous content, a manual spot check (automated judges miss things too), and duplicate control to avoid fake diversity. Having the same model both generate and evaluate is the most common mistake: it's easy to fool yourself and misjudge quality.

Why should every synthetic sample record provenance?

Without provenance, once strange improvements, regressions, or bad samples show up, you can hardly trace their source. Record at least seven fields per sample: source, generation model, prompt version, judge result, tags, difficulty, and split, so that when something goes wrong you can trace which batch of data contaminated the results.

When should you stop expanding synthetic data?

Four red lines: the real data pipeline isn't organized yet (you'll just keep spinning on synthetic data), the judge mechanism is unstable (the more you expand, the more noise), bad cases haven't been reviewed (generating more won't hit the right spots), and the team has no version governance (the data becomes increasingly unmaintainable). Synthetic data is an amplifier, not a shortcut that replaces real data governance.