
Synthetic Data & Augmentation

⏱️ 30 min


Synthetic data is easily misunderstood as "we don't have enough data, so let the model make up more." That framing is dangerous. Effective synthetic data isn't about making the dataset bigger -- it's about making coverage reasonable, sharpening evals, and exposing model blind spots earlier.

So this page isn't about batch-generating text. It's about how AI engineers should use synthetic data in a more reliable augmentation workflow.

Synthetic Data Pipeline


Bottom line: fix coverage first, don't chase scale

When teams are short on data, the instinct is to 10x or 100x the dataset immediately.

A more stable sequence:

  1. Identify real data gaps
  2. Decide which case types need filling
  3. Generate, then verify and deduplicate
  4. Only then consider scaling production

If coverage isn't thought through, synthetic data just amplifies bias.
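Step 1 of that sequence can be mechanical. A minimal sketch (the `case_type` field and the taxonomy are assumptions about how you label your data):

```python
from collections import Counter

def find_coverage_gaps(samples, taxonomy, min_per_type=20):
    """Compare observed case-type counts against a target taxonomy.

    `samples` is a list of dicts with a 'case_type' field; `taxonomy`
    is the set of case types you expect to cover. Returns the types
    that are missing or under-represented -- these, not the whole
    dataset, are the candidates for synthetic generation.
    """
    counts = Counter(s["case_type"] for s in samples)
    return sorted(t for t in taxonomy if counts.get(t, 0) < min_per_type)

samples = (
    [{"case_type": "standard"}] * 50
    + [{"case_type": "multilingual"}] * 3
)
taxonomy = {"standard", "multilingual", "adversarial"}
print(find_coverage_gaps(samples, taxonomy))
# -> ['adversarial', 'multilingual']
```

Generating only for the returned gap types keeps the synthetic batch small and targeted instead of amplifying whatever the dataset already over-represents.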


Where synthetic data works best

| Scenario | Why it fits |
|---|---|
| eval set expansion | Fill edge cases and rare cases |
| safety / refusal testing | Generate jailbreak and policy-boundary samples |
| structured extraction | Batch-generate structurally similar but diverse inputs |
| locale / style variation | Cover different languages and expression styles |
| low-resource task bootstrap | Quickly build a trainable/evaluable starting point |

Its biggest value usually isn't training data -- it's evaluation and coverage gap-filling.


The most common misuses

| Misuse | Consequence |
|---|---|
| Same model generates and evaluates | Easy to fool yourself and misjudge quality |
| Only paraphrasing | Looks diverse, but carries little new information |
| No provenance tracking | Can't tell later which batch contaminated results |
| train / test source mixing | Leakage makes evaluation unreliable |

With synthetic data, more is not automatically stronger. Once it gets out of control, it actually makes judging real capability harder.


A more stable generation pipeline

gap analysis
  -> generation prompt
  -> judge / filter
  -> dedup
  -> difficulty tagging
  -> split and version

The middle steps are the ones that get skipped most often.

But it's precisely the judge, dedup, and tagging steps that determine whether a batch is usable.
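Those middle steps can be sketched as one loop. This is a skeleton under stated assumptions: `judge` is whatever scorer you plug in (an LLM judge or a rule checker, returning 0..1), `tag_difficulty` is your own classifier, and dedup here is exact-match only:

```python
import hashlib

def run_pipeline(candidates, judge, tag_difficulty, min_score=0.7):
    """Skeleton of the generate -> judge -> dedup -> tag pipeline.

    `candidates` is a list of dicts with a 'text' field. `judge` and
    `tag_difficulty` are placeholders for your own components.
    """
    kept, seen = [], set()
    for item in candidates:
        score = judge(item)                       # judge / filter
        if score < min_score:
            continue
        # exact-duplicate control via a normalized content hash
        key = hashlib.sha256(item["text"].strip().lower().encode()).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        item["judge_score"] = score               # keep for provenance
        item["difficulty"] = tag_difficulty(item) # difficulty tagging
        kept.append(item)
    return kept
```

For example, `run_pipeline([{"text": "A"}, {"text": "a "}, {"text": "B"}], judge=lambda i: 1.0, tag_difficulty=lambda i: "easy")` keeps two items, because the second is a normalized duplicate of the first. Real pipelines usually add near-duplicate detection (embedding similarity or MinHash) on top of the exact-match hash.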


Common generation methods

| Method | Good for |
|---|---|
| paraphrase | Adding expression diversity |
| template expansion | Adding entity, industry, and locale variables |
| adversarial generation | Adding attack and boundary samples |
| source-grounded generation | Generating Q&A / citation cases from real docs |

The most reliable is usually the last, because it stays closest to real tasks and is the easiest to check for correctness.


Verification can't rely on "looks fine"

Each synthetic item should at minimum go through:

| Check | Why it's needed |
|---|---|
| format validation | Ensures structure fits the pipeline |
| faithfulness check | Makes sure it didn't drift from the source |
| safety filter | Prevents unwanted content from sneaking in |
| manual spot check | Automated judges miss things too |
| duplicate control | Prevents fake diversity |

Without these, the improvements you see might just be dataset contamination.
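Three of those checks can be sketched as code. This is deliberately crude: real faithfulness checks use a judge model rather than string overlap, and a real safety filter is more than a blocklist, but even this much catches the obvious failures:

```python
import difflib
import json

def verify_item(item, source_text, banned=("password", "ssn")):
    """Return the list of checks a synthetic Q&A item fails.

    `item` is a JSON string with 'question' and 'answer' fields;
    `source_text` is the passage it was generated from. An empty
    list means the item passed these (minimal) checks.
    """
    failures = []
    # format validation: must parse and carry the required fields
    try:
        obj = json.loads(item)
    except json.JSONDecodeError:
        return ["format"]
    if not all(field in obj for field in ("question", "answer")):
        failures.append("format")
    # faithfulness: crude string-overlap proxy for "didn't drift from source"
    if obj.get("answer") and difflib.SequenceMatcher(
        None, obj["answer"].lower(), source_text.lower()
    ).ratio() < 0.2:
        failures.append("faithfulness")
    # safety filter: blocklist sketch
    if any(term in item.lower() for term in banned):
        failures.append("safety")
    return failures
```

Items that fail any check are dropped or routed to manual review; duplicate control and spot checks (the other two rows of the table) run at the batch level rather than per item.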


Provenance and versioning are critical

Each synthetic sample should record at minimum:

  • source
  • generation model
  • generation prompt version
  • judge result
  • tags
  • difficulty
  • split

Without provenance, the moment you get bad samples, weird improvements, or regressions, you'll have almost no way to trace them.
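The record is small enough to pin down as a schema. A sketch mirroring the list above, with illustrative field values:

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class SyntheticSample:
    """Per-sample provenance record; fields mirror the list above."""
    text: str
    source: str               # e.g. a doc anchor or template name
    generation_model: str     # which model produced it
    prompt_version: str       # which version of the generation prompt
    judge_result: float       # judge score at generation time
    tags: tuple               # free-form labels, e.g. ("refund", "en")
    difficulty: str           # easy / medium / hard / adversarial
    split: str                # train / dev / test -- fixed at creation

sample = SyntheticSample(
    text="Q: ... A: ...",
    source="handbook-v2#s3",
    generation_model="gen-model-x",
    prompt_version="qa-gen-v4",
    judge_result=0.91,
    tags=("refund", "en"),
    difficulty="medium",
    split="dev",
)
record = asdict(sample)  # serialize and store alongside the sample
```

Fixing `split` at creation time is the point: it is what prevents a sample generated for the test set from later leaking into training.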


Difficulty layering matters more than averaging

The most valuable thing AI engineers can do with synthetic data is difficulty tagging.

| Difficulty | Examples |
|---|---|
| easy | Standard format, standard expression |
| medium | Paraphrase, light ambiguity |
| hard | Multi-constraint, long context, edge cases |
| adversarial | Injection, manipulation, format breaking |

This way you actually know whether the model improved broadly or just looks better on easy cases.
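Concretely, that means reporting per-tier pass rates instead of one flat average. A minimal sketch, assuming eval results arrive as `(difficulty, passed)` pairs:

```python
from collections import defaultdict

def score_by_difficulty(results):
    """Aggregate pass rates per difficulty tier instead of one average.

    `results` is a list of (difficulty, passed) pairs. A flat average
    hides regressions on hard/adversarial cases; per-tier rates don't.
    """
    buckets = defaultdict(list)
    for difficulty, passed in results:
        buckets[difficulty].append(passed)
    return {d: round(sum(v) / len(v), 2) for d, v in sorted(buckets.items())}

results = (
    [("easy", True)] * 9 + [("easy", False)]
    + [("hard", True)] * 2 + [("hard", False)] * 3
)
print(score_by_difficulty(results))
# -> {'easy': 0.9, 'hard': 0.4}
```

Here the flat average would be 11/15 ≈ 0.73 and look respectable, while the per-tier view shows the model failing most hard cases.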


When to stop expanding synthetic data

| Situation | Why |
|---|---|
| Real data pipeline isn't organized yet | You'll just keep spinning on synthetic data |
| Judge mechanism is unstable | More data means more noise |
| Bad cases haven't been reviewed | Generating more won't hit the right spots |
| Team has no version governance | Data becomes increasingly unmaintainable |

Synthetic data is an amplifier, not a shortcut that replaces real data governance.


Practice

Take your current eval set. Don't 10x it yet.

Answer these 4 questions first:

  1. Which case types are missing?
  2. Which ones are worth filling with synthetic data?
  3. How will you judge the generated data?
  4. How will you record provenance?

Get these 4 clear, then start batch generation.

📚 Related resources