YWT Data
Data Literacy

The Provenance Problem: How Synthetic Training Data Is Quietly Corrupting the Scientific Record

There is a version of this problem that sounds manageable. A research team, unable to obtain sufficient real-world patient records due to HIPAA constraints, generates a synthetic dataset using a generative model trained on a smaller observed sample. They note this in a footnote. They proceed. The paper is published. Another team, working on a related question, incorporates those published findings into a meta-analysis. The synthetic origin of the underlying data is not mentioned in the abstract, is not searchable in the database, and is not visible to the meta-analysts. The cycle continues.

That version of the problem is already happening. What data professionals should be more concerned about is the version that is harder to trace: synthetic data that enters pipelines not through disclosed research decisions, but through undocumented preprocessing steps, third-party data vendors, or augmentation routines embedded in shared code repositories. In those cases, no footnote exists to miss.

The Legitimate Case for Synthetic Data

It would be intellectually dishonest to open a critique of synthetic data without acknowledging why its adoption has accelerated. The barriers to obtaining real-world data for research are genuinely significant. Patient privacy protections, proprietary business records, national security classifications, and the practical difficulty of collecting longitudinal data at scale all create legitimate research bottlenecks. Synthetic data generation — using variational autoencoders, generative adversarial networks, diffusion models, or large language models fine-tuned on domain-specific corpora — offers a plausible technical response to those constraints.

The US Census Bureau itself adopted differential privacy for its 2020 Census data products and has been exploring synthetic data approaches for the ACS. The National Institutes of Health has funded synthetic data infrastructure for genomics research. These are not fringe applications. They reflect a considered institutional response to a real tension between data utility and data protection.

The problem is not synthetic data per se. The problem is synthetic data treated as if it were observed data — and the research infrastructure that currently lacks the tools, norms, or incentives to distinguish between the two.

What Synthetic Data Actually Preserves — and What It Loses

A generative model trained on observed data learns a statistical representation of that data's distribution. When it produces synthetic records, those records are plausible given the training distribution, but they are not drawn from the real-world phenomenon the original data was meant to capture. This distinction matters enormously.

Consider what a synthetic clinical dataset actually contains. It contains patterns that the generative model inferred from the training sample. If the training sample was unrepresentative (skewed toward certain demographic groups, collected at a particular type of facility, or truncated in time), those biases are not corrected in the synthetic output. They are, in many cases, amplified: generative models are optimized to reproduce the statistical regularities they observe, and a model that fits the dominant patterns well can underweight whatever was sparsely sampled. Systematic absences in training data become systematic absences in synthetic data, without any marker indicating that a gap exists.

Furthermore, synthetic records by construction cannot contain genuine anomalies, edge cases, or rare events that were not represented in the training distribution. For research questions focused on outliers — rare disease presentations, extreme economic shocks, unusual behavioral patterns — synthetic data is particularly ill-suited, precisely because it is engineered to be statistically typical.
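
A small simulation can make the loss of rare events concrete. The sketch below is illustrative only: the "observed" values are simulated, the 1% rate of extreme cases and the threshold are arbitrary assumptions, and a single fitted Gaussian stands in for a generative model. Real generators are far more expressive, but the same pressure toward typical values applies wherever extreme cases are thinly sampled.

```python
import random
import statistics

def tail_count(values, threshold):
    """Count values strictly above the threshold."""
    return sum(1 for v in values if v > threshold)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical "observed" measurements: 99% typical, 1% rare extreme cases.
observed = [
    rng.gauss(8.0, 1.0) if rng.random() < 0.01 else rng.gauss(0.0, 1.0)
    for _ in range(10_000)
]

# Deliberately simple stand-in for a generative model: fit one Gaussian
# to the observed sample, then draw synthetic records from it.
mu = statistics.fmean(observed)
sigma = statistics.stdev(observed)
synthetic = [rng.gauss(mu, sigma) for _ in range(10_000)]

THRESHOLD = 6.0  # arbitrary cutoff marking an "extreme" value
observed_extremes = tail_count(observed, THRESHOLD)
synthetic_extremes = tail_count(synthetic, THRESHOLD)

print("observed extremes: ", observed_extremes)
print("synthetic extremes:", synthetic_extremes)
```

The synthetic sample matches the observed mean and spread closely, yet the extreme cases that a study of outliers would care about are almost entirely missing from it.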

The Laundering Mechanism

The phrase "synthetic data laundering" may sound polemical, but it describes a real structural dynamic. When synthetic records enter a research pipeline without clear provenance documentation, they acquire a form of false legitimacy. Downstream researchers who access the dataset through a repository, an API, or a published supplementary file have no reliable way to determine whether they are working with observed measurements or generated approximations — unless the originating team documented that distinction clearly and that documentation was preserved through every subsequent transfer.

In practice, documentation degrades. Dataset readme files are not always read. Metadata schemas differ across repositories. Preprocessing scripts that introduce synthetic augmentation are not always committed to version control. A dataset that begins its life with a clear synthetic label may, after several handoffs and format conversions, arrive in a new researcher's working directory with no indication of its origins.
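
One way this degradation happens can be shown in a few lines. The sketch below assumes a hypothetical dataset that carries a dataset-level origin label alongside its rows; the field names and the `export_rows_to_csv` helper are invented for illustration. A routine conversion to CSV keeps the rows and silently drops the label.

```python
import csv
import io
import json

# Hypothetical dataset as first deposited: rows plus a dataset-level label.
deposited = {
    "metadata": {"origin": "synthetic", "generator": "hypothetical-model-v1"},
    "rows": [
        {"patient_id": "p1", "age": 54},
        {"patient_id": "p2", "age": 61},
    ],
}

def export_rows_to_csv(dataset):
    """Write only the tabular rows, as a typical format conversion would."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["patient_id", "age"])
    writer.writeheader()
    writer.writerows(dataset["rows"])
    return buffer.getvalue()

def load_csv(text):
    """Read the CSV back the way a downstream researcher would."""
    return list(csv.DictReader(io.StringIO(text)))

csv_text = export_rows_to_csv(deposited)
received = load_csv(csv_text)

# The rows survive, but nothing in the received file says "synthetic".
print(json.dumps(received))
print("origin label present:", "synthetic" in csv_text)
```

Nothing in this conversion is careless by ordinary standards, which is the point: the label is lost because the target format has no place to put it.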

This is not a hypothetical failure mode. Researchers studying the provenance of publicly available machine learning benchmark datasets have found that a substantial proportion contain records whose original collection methodology cannot be verified from available documentation — and that this uncertainty is rarely flagged in studies that use those benchmarks. The issue is structurally identical when synthetic data is involved, except that the uncertainty is more fundamental: the question is not merely how the data was collected, but whether it was collected at all.

Peer Review Is Not Equipped to Catch This

The peer review process, whatever its other limitations, was designed to evaluate the logic and rigor of research methods applied to data that reviewers could assume was real. That assumption is increasingly untenable, and the review process has not adapted accordingly.

Most journals do not require authors to specify whether datasets contain synthetic records. Data availability statements — now standard in many fields — ask where data can be accessed, not what the data fundamentally is. Reviewers are rarely domain experts in data generation methodology, and even when they are, they typically receive only the paper and its supplementary materials, not the full data pipeline documentation.

The result is a systematic blind spot. A paper that reports a statistically significant association between two variables, derived from a dataset that is partially or wholly synthetic, will pass review on the strength of its statistical methods — even though the inferential validity of those methods depends entirely on the assumption that the data represents observed reality.

What Rigorous Practice Looks Like

Data professionals who work with synthetic datasets — whether generating them, receiving them from collaborators, or accessing them through repositories — bear responsibility for the provenance standards that peer review currently fails to enforce. Several concrete practices raise the bar:

Demand and document generation methodology. When receiving a synthetic dataset, the minimum acceptable documentation includes the generative model architecture, the training data source and its known limitations, the fidelity metrics used to evaluate the synthetic output, and any known divergences between synthetic and observed distributions. If this documentation does not exist, that absence is itself a methodological finding.
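
The documentation items above can be treated as a checklist that is verified on receipt. The following is a minimal sketch, not an established schema: the `GenerationDoc` class and its field names are hypothetical, and `None` is assumed to mean "not documented" while an empty list means "documented, none known".

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class GenerationDoc:
    """Minimum documentation to request with a synthetic dataset."""
    model_architecture: Optional[str] = None
    training_data_source: Optional[str] = None
    training_data_limitations: Optional[list] = None
    fidelity_metrics: Optional[dict] = None
    known_divergences: Optional[list] = None

    def missing_items(self):
        """Return the names of items that were never documented."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

# Example: a dataset that arrived with partial documentation.
doc = GenerationDoc(
    model_architecture="variational autoencoder",
    training_data_source="hypothetical single-site clinical sample",
    known_divergences=[],  # documented: none known
)
print(doc.missing_items())
```

Recording the output of `missing_items()` alongside the analysis is one way to report the absence of documentation as a finding rather than leaving it implicit.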

Treat synthetic data as a distinct data type in all reporting. Results derived from synthetic data should be reported separately from results derived from observed data, with explicit acknowledgment of the inferential limitations this introduces. Combining synthetic and observed records in a single analysis without distinguishing them in the results is not a methodological shortcut — it is a misrepresentation.
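
Separate reporting is straightforward once each record carries an origin tag. The sketch below assumes a hypothetical `origin` field on every record and uses made-up blood pressure values; it simply summarizes each origin on its own rather than pooling them.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical records, each tagged with its origin.
records = [
    {"origin": "observed", "systolic_bp": 128.0},
    {"origin": "observed", "systolic_bp": 135.0},
    {"origin": "observed", "systolic_bp": 121.0},
    {"origin": "synthetic", "systolic_bp": 126.0},
    {"origin": "synthetic", "systolic_bp": 127.0},
]

def summarize_by_origin(rows, value_key):
    """Return count and mean of value_key for each origin separately."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["origin"]].append(row[value_key])
    return {
        origin: {"n": len(values), "mean": fmean(values)}
        for origin, values in groups.items()
    }

summary = summarize_by_origin(records, "systolic_bp")
for origin, stats in summary.items():
    print(f"{origin}: n={stats['n']}, mean={stats['mean']:.1f}")
```

A pooled mean over all five records would hide both the split in sample sizes and any difference between the two groups.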

Apply fidelity testing before use. Statistical fidelity metrics — including marginal distribution comparisons, pairwise correlation preservation, and downstream task performance benchmarks — should be applied to any synthetic dataset before it enters an analysis pipeline. Fidelity is not the same as validity, but its absence is a strong signal that validity assumptions are unwarranted.
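
Two of these checks can be sketched with the standard library alone. The example below uses the two-sample Kolmogorov-Smirnov statistic as one possible marginal comparison and the absolute difference in Pearson correlation as one possible measure of correlation preservation. Both are choices made for illustration, the toy columns are invented, and downstream task benchmarks are not covered.

```python
import bisect
import math
from statistics import fmean

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

def fidelity_report(observed, synthetic):
    """Compare datasets given as dicts of column name -> list of floats."""
    columns = sorted(observed)
    ks = {c: ks_statistic(observed[c], synthetic[c]) for c in columns}
    corr_gap = {}
    for i, c1 in enumerate(columns):
        for c2 in columns[i + 1:]:
            corr_gap[(c1, c2)] = abs(
                pearson(observed[c1], observed[c2])
                - pearson(synthetic[c1], synthetic[c2])
            )
    return {"ks": ks, "corr_gap": corr_gap}

# Toy data: the synthetic marginals resemble the observed ones,
# but the relationship between age and blood pressure is scrambled.
observed = {"age": [30, 40, 50, 60, 70], "bp": [110, 118, 125, 133, 140]}
synthetic = {"age": [32, 41, 49, 61, 69], "bp": [140, 111, 126, 119, 132]}

report = fidelity_report(observed, synthetic)
print(report)
```

In this toy case each column looks acceptable on its own while the correlation gap is large, which is why marginal comparisons alone are not a sufficient check.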

Advocate for repository-level provenance standards. Researchers who deposit datasets in public repositories should be required to specify, in machine-readable metadata, whether records are observed, synthetic, or mixed — and if mixed, in what proportions and by what method. This is a solvable infrastructure problem, and the data science community should be pressing repository operators and funding agencies to solve it.
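
As a rough illustration of what such machine-readable metadata might look like, the sketch below derives a dataset-level record from per-record origin flags. The field names (`record_origin`, `proportions`, `generation_method`) and the function itself are hypothetical and do not correspond to any existing repository standard.

```python
import json
from collections import Counter

ALLOWED_ORIGINS = {"observed", "synthetic"}

def build_provenance_metadata(record_origins, generation_method=None):
    """Derive dataset-level provenance metadata from per-record origin flags."""
    if not record_origins:
        raise ValueError("cannot describe an empty dataset")
    unknown = set(record_origins) - ALLOWED_ORIGINS
    if unknown:
        raise ValueError(f"unrecognized origin labels: {sorted(unknown)}")
    counts = Counter(record_origins)
    total = len(record_origins)
    # "mixed" when both origins are present, otherwise the single origin.
    kind = "mixed" if len(counts) > 1 else next(iter(counts))
    if counts["synthetic"] and not generation_method:
        raise ValueError("synthetic records require a generation method")
    return {
        "record_origin": kind,
        "proportions": {o: counts[o] / total for o in sorted(ALLOWED_ORIGINS)},
        "generation_method": generation_method,
    }

# Example: three observed records augmented with one synthetic record.
origins = ["observed", "observed", "observed", "synthetic"]
metadata = build_provenance_metadata(origins, "hypothetical GAN augmentation")
print(json.dumps(metadata, indent=2))
```

Refusing to emit metadata for synthetic records without a stated method is a design choice in this sketch: it makes the missing documentation a hard failure at deposit time instead of a gap discovered downstream.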

The Deeper Issue

Behind the specific risks of synthetic data lies a more general challenge that the data science profession has been slow to confront: the assumption that data is a neutral raw material, and that rigor is purely a function of what you do with it analytically. Provenance — the full documented history of where data came from, how it was transformed, and what assumptions were embedded in those transformations — is itself a form of statistical information. Ignoring it is not a conservative methodological choice. It is an error.

Synthetic data makes this error unusually consequential because it introduces a categorical uncertainty that no amount of downstream statistical sophistication can resolve. You can correct for measurement error. You can model selection bias. You cannot recover the relationship between synthetic records and the real-world phenomenon they were meant to approximate if the documentation of that relationship was never created or has been lost.

The scientific record is only as reliable as the data that underlies it. Data professionals who work at the interface of AI-generated datasets and empirical research have an obligation — to their field, to their downstream users, and to the policy decisions that research eventually informs — to treat provenance documentation not as administrative overhead, but as a core component of methodological rigor.
