YWT Data
Data Literacy

Borrowed Assumptions: What Pre-Packaged Datasets Are Doing to Your Research Before You Run a Single Query

There is a quiet transaction happening at the start of thousands of research projects every year. A data professional, pressed for time or simply following established practice, downloads a curated dataset from a government portal, an academic repository, or a commercial data provider. The file arrives labeled "cleaned," "harmonized," or "analysis-ready." The researcher opens it, confirms the columns look sensible, and begins work.

What rarely happens is a systematic examination of what that cleaning actually involved: who performed it, under what assumptions, and for what original purpose. Those invisible decisions, compounded across an entire research pipeline, can quietly corrupt findings in ways that standard validation checks will never surface.

The Illusion of Neutrality in Data Preparation

Cleaning a dataset is not a neutral act. Every choice — how to handle missing values, which records to exclude as outliers, how to reconcile conflicting entries across sources, what date ranges to preserve — reflects a set of assumptions about the data-generating process. When those assumptions are undocumented, anyone who inherits the dataset inherits the assumptions without knowing it.

Consider the treatment of outliers. A preprocessing team working on a wage dataset for one purpose might flag and remove observations above the 99th percentile of earnings. That decision is defensible in certain contexts. But a subsequent researcher using the same cleaned file to study executive compensation or income inequality at the top of the distribution will find those records simply absent — with no warning in the documentation and no obvious artifact in the data structure to signal the loss.
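The effect is easy to reproduce on synthetic data. The sketch below is illustrative only: the wage values are invented, and `percentile_cutoff` and `top_share` are hypothetical helper names, not part of any real dataset's cleaning pipeline. It trims observations above the 99th percentile, then compares the share of total earnings held by the top 1% before and after.

```python
import math


def percentile_cutoff(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def top_share(values, top_percent):
    """Share of the total held by the top `top_percent` percent of records."""
    ordered = sorted(values, reverse=True)
    k = max(1, len(ordered) * top_percent // 100)
    return sum(ordered[:k]) / sum(ordered)


# Invented data: 990 ordinary earners plus 10 very high earners.
raw_wages = [40_000 + 20 * i for i in range(990)] + [1_000_000] * 10

# The hypothetical cleaning step: drop everything above the 99th percentile.
cutoff = percentile_cutoff(raw_wages, 99)
cleaned_wages = [w for w in raw_wages if w <= cutoff]

raw_share = top_share(raw_wages, 1)
cleaned_share = top_share(cleaned_wages, 1)

print(f"records dropped: {len(raw_wages) - len(cleaned_wages)}")
print(f"top 1% share, raw file:     {raw_share:.3f}")
print(f"top 1% share, cleaned file: {cleaned_share:.3f}")
```

Nothing in the cleaned list signals that rows are missing. The only way to see the difference is to hold the cleaned file against the raw one, which is exactly what a downstream user usually cannot do.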

Similar dynamics appear in housing data. Several widely circulated residential property datasets used in US urban economics research have undergone deduplication routines that merge records based on address-matching algorithms. When those algorithms are imprecise — conflating unit-level records in multi-family buildings, for instance — the resulting dataset systematically undercounts dense urban housing stock. Researchers studying affordability or density patterns in cities like Chicago, Houston, or Los Angeles may be drawing conclusions from a file that has already smoothed away the very variation they are trying to measure.
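A toy version of that failure mode, under stated assumptions: the addresses are invented, and `normalize` and `dedupe` are hypothetical stand-ins for whatever matching logic a real vendor uses. The sketch deduplicates the same records twice, once on the street address alone and once on the address plus unit.

```python
def normalize(address):
    """Crude address normalization: lowercase, drop periods, collapse spaces."""
    return " ".join(address.lower().replace(".", "").split())


def dedupe(records, key):
    """Keep the first record seen for each key."""
    seen = {}
    for record in records:
        seen.setdefault(key(record), record)
    return list(seen.values())


# Invented records: three units in one building, plus one true duplicate.
records = [
    {"address": "1200 N Example Ave", "unit": "1A"},
    {"address": "1200 N Example Ave", "unit": "1B"},
    {"address": "1200 N. Example Ave.", "unit": "2A"},
    {"address": "45 Sample St", "unit": ""},
    {"address": "45 Sample St", "unit": ""},
]

# Matching on street address alone merges every unit in the building.
address_only = dedupe(records, key=lambda r: normalize(r["address"]))

# Including the unit removes only the genuine duplicate.
unit_aware = dedupe(
    records, key=lambda r: (normalize(r["address"]), r["unit"].upper())
)

print(len(address_only), "records after address-only matching")
print(len(unit_aware), "records after unit-aware matching")
```

Both outputs look like tidy, duplicate-free tables. Only the record counts, compared against an independent estimate of housing units, would reveal which rule was applied.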

Public Health Data and the Standardization Trap

In public health research, the pressures toward convenience are particularly acute. Datasets from sources such as the Behavioral Risk Factor Surveillance System (BRFSS) or the National Health Interview Survey (NHIS) are frequently distributed in pre-harmonized form by intermediary repositories. These harmonized versions resolve coding inconsistencies across survey years, which is genuinely useful. But they also make categorical choices — collapsing racial and ethnic subgroups, recoding self-reported health status variables, or truncating age ranges — that are rarely flagged in the summary documentation researchers actually read.

The consequence is that a researcher studying health disparities among specific demographic populations may be working with a file where those very population boundaries have been redrawn by someone else's harmonization logic. The resulting findings are not wrong in any detectable technical sense. They are simply answers to a slightly different question than the one the researcher believes they are asking.
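A small numerical illustration of how that happens. Everything here is invented for the purpose: the group labels `A1`, `A2`, and `B`, the counts, and the `HARMONIZED` mapping do not come from BRFSS, NHIS, or any real harmonization scheme. The sketch computes a prevalence rate by detailed subgroup, then again after two subgroups are collapsed into one.

```python
from collections import defaultdict

# Invented counts: (detailed group, respondents, respondents reporting poor health)
detailed_rows = [("A1", 400, 40), ("A2", 100, 40), ("B", 500, 75)]

# Hypothetical harmonization rule: A1 and A2 are merged into a single category.
HARMONIZED = {"A1": "A", "A2": "A", "B": "B"}


def rates(rows):
    """Prevalence rate per group."""
    return {group: cases / n for group, n, cases in rows}


def collapse(rows, mapping):
    """Sum respondents and cases within each harmonized category."""
    totals = defaultdict(lambda: [0, 0])
    for group, n, cases in rows:
        totals[mapping[group]][0] += n
        totals[mapping[group]][1] += cases
    return [(group, n, cases) for group, (n, cases) in totals.items()]


detailed_rates = rates(detailed_rows)
collapsed_rates = rates(collapse(detailed_rows, HARMONIZED))

for group, rate in detailed_rates.items():
    print(f"detailed   {group}: {rate:.2f}")
for group, rate in collapsed_rates.items():
    print(f"harmonized {group}: {rate:.2f}")
```

In the detailed rows, subgroup A2 has a rate several times that of A1. After collapsing, categories A and B look nearly identical, and a disparity analysis run on the harmonized file has no way to recover the difference.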

Economic Indicators and the Version Problem

Economic datasets present a distinct but related challenge: the version problem. Flagship indicators such as county-level employment figures from the Bureau of Labor Statistics or regional GDP estimates from the Bureau of Economic Analysis are periodically revised. Curated repositories that distribute these figures often do not clearly communicate which vintage of the data they are serving. Two researchers comparing findings may be working from different vintages of the same series, with figures that differ substantially, not because of any error but because the underlying methodology or source data was updated.

This is not a hypothetical concern. Documented cases in the applied economics literature have traced conflicting findings between research teams back to different vintages of the same nominal dataset. When neither team realizes they are working from different versions, the methodological debate that follows is a distraction from the actual source of disagreement.
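When two vintages of a series are both on hand, the comparison itself is simple. The sketch below assumes each vintage has been loaded as a mapping from period to value; the figures are invented, and `compare_vintages` and its `rel_tol` threshold are hypothetical names chosen for illustration, not an agency-provided tool.

```python
def compare_vintages(old, new, rel_tol=0.01):
    """Return relative revisions larger than rel_tol for periods in both vintages."""
    revisions = {}
    for period in sorted(old.keys() & new.keys()):
        before, after = old[period], new[period]
        if before == 0:
            continue  # relative change is undefined; inspect these by hand
        change = (after - before) / before
        if abs(change) > rel_tol:
            revisions[period] = change
    return revisions


# Invented employment counts for one county, as served in two different vintages.
vintage_a = {"2019": 1000, "2020": 950, "2021": 980}
vintage_b = {"2019": 1000, "2020": 910, "2021": 985}

material = compare_vintages(vintage_a, vintage_b)
for period, change in material.items():
    print(f"{period}: revised by {change:+.1%}")
```

The threshold is a judgment call. What counts as a material revision depends on the size of the effect being estimated, so it is worth setting `rel_tol` relative to that effect rather than using a fixed default.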

Building a Provenance Auditing Practice

The solution is not to abandon curated datasets — in many research contexts, they represent the only practical path to a sufficient sample. The solution is to treat provenance auditing as a mandatory step in any data acquisition workflow, not an optional one.

The following checklist represents a starting point for that practice:

1. Identify the full chain of custody. Determine not just where you downloaded the file, but who produced the original data, who cleaned or harmonized it, and whether any intermediary organization transformed it before distribution. Each link in that chain should be documented.

2. Request or locate the preprocessing documentation. Many repositories publish data dictionaries but not the scripts or rules used for cleaning. If the cleaning logic is not documented, treat the dataset with the same caution you would apply to an unverified secondary source.

3. Compare a sample against the raw source. Where raw data is available — even for a subset of records or a single survey year — compare it against the packaged version. Look specifically at the extremes of distributions, rare categories, and records with incomplete fields. Discrepancies in those areas reveal the most consequential cleaning choices.

4. Check for vintage and revision history. For any economic or administrative dataset, confirm the vintage of the figures you are using and note it explicitly in your methods documentation. If revised versions exist, assess whether the revisions are material to your research question.

5. Interrogate exclusion criteria. Ask what records are not in the dataset. If documentation does not specify exclusion criteria clearly, that absence is itself informative — and worth investigating before you proceed.

6. Test sensitivity to cleaning assumptions. Where feasible, reconstruct one or two key variables using alternative cleaning approaches and assess how much your findings shift. High sensitivity is a signal that the cleaning decisions deserve explicit discussion in your analysis.
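Parts of this checklist can be kept honest with very little code. The sketch below is one possible shape, not a standard: `ProvenanceRecord`, its field names, and the `sensitivity` helper are all hypothetical, and the sample values are invented. The record covers items 1, 2, 4, and 5 by making unanswered questions visible; the helper covers item 6 by running one estimate under several alternative cleaning rules.

```python
from dataclasses import dataclass, fields
from statistics import mean
from typing import Optional


@dataclass
class ProvenanceRecord:
    """What is known about a dataset's origin; None means not yet established."""
    original_producer: Optional[str] = None
    cleaned_by: Optional[str] = None
    distributor: Optional[str] = None
    cleaning_logic_source: Optional[str] = None
    vintage: Optional[str] = None
    exclusion_criteria: Optional[str] = None

    def gaps(self):
        """Names of the fields that are still unanswered."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


def sensitivity(data, estimator, cleaners):
    """Apply each cleaning rule to the data, then compute the same estimate."""
    return {name: estimator(clean(data)) for name, clean in cleaners.items()}


# Placeholder values; a real record would name actual organizations and releases.
record = ProvenanceRecord(
    original_producer="Hypothetical statistical agency",
    distributor="Hypothetical repository",
    vintage="Placeholder release label",
)
print("unresolved:", record.gaps())

# Invented measurements with one extreme value, under three cleaning rules.
values = [12, 15, 14, 13, 90]
cleaners = {
    "keep_all": lambda xs: list(xs),
    "drop_max": lambda xs: sorted(xs)[:-1],
    "cap_at_20": lambda xs: [min(x, 20) for x in xs],
}
results = sensitivity(values, mean, cleaners)
spread = max(results.values()) - min(results.values())
print(results, "spread:", round(spread, 2))
```

A non-empty `gaps()` list is not a reason to discard a dataset, but it is a list of things to state in the methods section. A large spread across cleaning rules suggests the cleaning choice is doing as much work as the analysis.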

A Cultural Shift, Not Just a Technical One

Ultimately, the problem with pre-cleaned datasets is not technical — it is cultural. The research and data science communities have developed strong norms around model validation, statistical significance, and reproducibility. Provenance auditing has not yet achieved the same standing, in part because it is slower, less visible, and harder to automate.

That needs to change. As datasets grow larger and the distance between raw data collection and analysis widens, the assumptions embedded in preprocessing decisions carry more weight, not less. A research culture that scrutinizes p-values with care but accepts curated datasets without question has its priorities inverted.

The data you did not clean yourself is not clean. It is cleaned — by someone, for some purpose, under some assumptions. Knowing what those are is not a courtesy to your methodology. It is the methodology.
