What Isn't There: How Systematically Absent Health Data Distorts US Research and Policy
Every dataset tells two stories: the one written by the values it contains, and the one shaped by the values it does not. In health research, the second story is frequently the more consequential of the two, and the more dangerous to ignore. Across major federal sources, including the Centers for Disease Control and Prevention's National Health Interview Survey (NHIS), claims data from the Centers for Medicare & Medicaid Services (CMS), and the Agency for Healthcare Research and Quality's Medical Expenditure Panel Survey (MEPS), missing observations are not distributed by chance. They follow patterns. And those patterns, left unexamined, quietly bend the conclusions that researchers, clinicians, and lawmakers draw from the numbers they can see.
The Distinction That Changes Everything
Statisticians have long distinguished among three mechanisms by which data goes missing. Missing Completely at Random (MCAR) describes the benign scenario in which absences bear no relationship to any variable in the dataset, observed or unobserved. Missing at Random (MAR) allows absences to depend on observed variables but, once those variables are accounted for, not on the missing value itself. The most treacherous category, Missing Not at Random (MNAR), describes situations in which the probability of a value being absent depends on what that value would have been, even after conditioning on everything observed.
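The practical difference between the three mechanisms is easiest to see in a simulation. The sketch below is illustrative only: the population, the variable names (income, score), and the missingness probabilities are all invented for the example. It compares the true mean of a health score with the mean of the values that survive under each mechanism.

```python
import math
import random
import statistics


def logistic(z):
    return 1 / (1 + math.exp(-z))


def simulate(n=20000, seed=0):
    """Return the true mean and the observed-data mean under each mechanism."""
    rng = random.Random(seed)
    # Hypothetical population: income is always observed,
    # and the health score rises with income.
    rows = []
    for _ in range(n):
        income = rng.gauss(0, 1)
        score = 50 + 5 * income + rng.gauss(0, 5)
        rows.append((income, score))

    true_mean = statistics.fmean(s for _, s in rows)
    observed = {"MCAR": [], "MAR": [], "MNAR": []}
    for income, score in rows:
        # MCAR: every score has the same 30% chance of being missing.
        if rng.random() > 0.3:
            observed["MCAR"].append(score)
        # MAR: missingness depends only on observed income
        # (lower income, more missing).
        if rng.random() > logistic(-1 - 1.5 * income):
            observed["MAR"].append(score)
        # MNAR: missingness depends on the score itself
        # (lower score, more missing).
        if rng.random() > logistic(-1 - (score - 50) / 5):
            observed["MNAR"].append(score)
    return true_mean, {k: statistics.fmean(v) for k, v in observed.items()}


true_mean, observed_means = simulate()
print(f"true mean: {true_mean:.2f}")
for mechanism, mean in observed_means.items():
    print(f"{mechanism}: observed mean {mean:.2f} (bias {mean - true_mean:+.2f})")
```

In this setup the naive observed mean is biased upward under both MAR and MNAR. The difference is that the MAR bias can in principle be corrected by conditioning on income, which is observed, while the MNAR bias cannot be recovered from the observed data alone.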
In US health datasets, MNAR patterns are far more common than the literature's treatment of them might suggest. When a low-income patient lacks a primary care provider and therefore never generates a CMS claim, that absence is not neutral noise. It is a structured silence. The populations least likely to appear in administrative health records are frequently those whose health outcomes are most precarious: uninsured individuals, undocumented residents, people experiencing homelessness, and rural communities with limited healthcare infrastructure. Researchers who proceed without accounting for these structural gaps are, in effect, building models of American health on the experiences of those already best served by the system.
Real-World Distortions in Published Research
Consider the body of research that has used CMS Medicare claims data to study treatment outcomes in older adults. Because Medicare eligibility begins at age 65 for most Americans, studies drawing exclusively from this source systematically exclude near-elderly adults aged 55 to 64—a cohort that carries a disproportionate share of chronic disease burden and economic precarity. Research conclusions about treatment efficacy drawn from Medicare populations have, in several documented instances, failed to generalize to slightly younger populations when those populations were subsequently studied through alternative data sources.
A second instructive case involves racial and ethnic data fields in hospital discharge records. CMS-required reporting of patient race and ethnicity has historically suffered from inconsistent collection practices across facilities, with non-white patients showing substantially higher rates of missing or miscoded racial identifiers. Studies published between 2010 and 2020 that used these records to examine racial disparities in cardiac care were later shown to have underestimated gaps in treatment rates for Black and Hispanic patients—precisely because those patients were more likely to appear in the dataset with incomplete demographic fields. The missing data was not random; it was correlated with the very variable under investigation.
The NHIS, one of the nation's primary sources of self-reported health information, presents a different challenge. Survey nonresponse has increased steadily over the past two decades, and the households least likely to complete the survey—those with lower educational attainment, non-English speakers, and residents of high-poverty ZIP codes—are not a random subset of the US population. Trend analyses that fail to weight for differential nonresponse can produce misleading conclusions about changes in population health over time.
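As a rough illustration of what weighting for differential nonresponse involves, the sketch below applies a simple weighting-class adjustment: each respondent is weighted by the inverse of the response rate in their class. The class labels and counts are invented, and this is not a description of how NHIS weights are built. Production survey weights are typically more elaborate, and the adjustment assumes respondents resemble nonrespondents within each class.

```python
def nonresponse_adjusted_mean(sampled, responses):
    """Weighting-class adjustment for unit nonresponse.

    sampled:   {class label: number of households sampled}
    responses: {class label: list of values reported by respondents}
    """
    total_sampled = sum(sampled.values())
    weighted_sum = 0.0
    for cls, n_sampled in sampled.items():
        values = responses[cls]
        weight = n_sampled / len(values)  # inverse of the class response rate
        weighted_sum += weight * sum(values)
    return weighted_sum / total_sampled


# Hypothetical survey: 1 = reports fair or poor health, 0 = otherwise.
sampled = {"high_poverty": 100, "low_poverty": 100}
responses = {
    "high_poverty": [1] * 10 + [0] * 15,  # 25 of 100 households responded
    "low_poverty": [1] * 15 + [0] * 60,   # 75 of 100 households responded
}

all_values = [v for values in responses.values() for v in values]
unweighted = sum(all_values) / len(all_values)
adjusted = nonresponse_adjusted_mean(sampled, responses)
print(f"unweighted: {unweighted:.2f}, adjusted: {adjusted:.2f}")
```

Here the unweighted estimate is 0.25 while the adjusted estimate is about 0.30, because the class with worse reported health is also the class that responds least.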
Frameworks for Detection Before Conclusion
For data professionals working with federal health datasets, several practical approaches can help surface missing-data mechanisms before analysis proceeds.
Missingness mapping should be an early step in any exploratory workflow. Rather than simply counting missing values per variable, analysts should cross-tabulate missingness against all available demographic and socioeconomic fields. If the rate of missing values in a clinical variable rises systematically with decreasing income or in geographic regions with lower provider density, that is a signal—not a nuisance to be discarded.
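A minimal sketch of this kind of cross-tabulation, using only the Python standard library, might look like the following. The records and field names (income_band, region, hba1c) are hypothetical, and a real workflow would more likely use a dataframe library over the full extract.

```python
from collections import defaultdict


def missing_rate_by(records, target, by):
    """Share of records where `target` is missing (None), per level of `by`."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        group = rec.get(by)
        totals[group] += 1
        if rec.get(target) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Hypothetical records; values are illustrative only.
records = [
    {"income_band": "low", "region": "rural", "hba1c": None},
    {"income_band": "low", "region": "rural", "hba1c": None},
    {"income_band": "low", "region": "urban", "hba1c": 7.9},
    {"income_band": "low", "region": "urban", "hba1c": None},
    {"income_band": "high", "region": "urban", "hba1c": 5.6},
    {"income_band": "high", "region": "urban", "hba1c": 6.1},
    {"income_band": "high", "region": "rural", "hba1c": None},
    {"income_band": "high", "region": "urban", "hba1c": 5.9},
]

for field in ("income_band", "region"):
    print(field, missing_rate_by(records, "hba1c", field))
```

In this toy data the clinical value is missing for 75% of low-income records against 25% of high-income records, which is the kind of gradient that should prompt further investigation rather than listwise deletion.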
Little's MCAR test offers a formal statistical check on whether observed missingness is consistent with a completely random mechanism. A significant result indicates that MCAR cannot be assumed, prompting deeper investigation into whether MAR or MNAR better characterizes the data. A non-significant result does not establish MCAR, since the test may simply lack power, and the test cannot distinguish between MAR and MNAR on its own. It is nonetheless a useful first filter.
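Little's test itself relies on maximum-likelihood estimates of means and covariances and is best taken from a vetted statistical package. The sketch below is not Little's test. It is a simpler check in the same spirit: a Pearson chi-square test of whether a missingness indicator is independent of one binary observed variable. The function name and the counts are invented for illustration.

```python
import math


def missingness_chi2(table):
    """Pearson chi-square test of independence on a 2x2 table.

    Rows are the two groups of an observed variable;
    columns are (missing, observed) counts for the target variable.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (count - expected) ** 2 / expected
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 30 of 100 low-income records are missing the value,
# against 10 of 100 high-income records.
chi2, p_value = missingness_chi2([[30, 70], [10, 90]])
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

A small p-value here is evidence against MCAR for that variable. As with Little's test, it says nothing about whether the remaining mechanism is MAR or MNAR.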
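Sensitivity analyses built on multiple imputation allow researchers to model the range of conclusions that would emerge under different assumptions about missing values. Rather than imputing a single best-guess value and proceeding as though the data were complete, multiple imputation generates several plausible datasets, runs the analysis on each, and pools the results, conventionally with Rubin's rules. The pooled variance combines the usual sampling variance with the spread of estimates across imputed datasets, which gives an honest representation of the uncertainty introduced by missing observations. Because standard imputation models assume MAR, the sensitivity step consists of repeating the exercise under deliberately different assumptions, including MNAR scenarios, and reporting how far the conclusions move.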
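The impute, analyze, and pool cycle can be sketched in a few lines. The example below makes several simplifying assumptions that are not from any particular study: the analysis is just a mean, imputation uses an approximate Bayesian bootstrap with no covariates, results are pooled with Rubin's rules, and a hypothetical `delta` shift applied to imputed values stands in for an MNAR scenario in which the missing values are systematically lower than the observed ones.

```python
import random
import statistics


def abb_impute(values, rng, delta=0.0):
    """One approximate-Bayesian-bootstrap imputation of a list containing None."""
    observed = [v for v in values if v is not None]
    # Step 1: resample the observed values so each imputation reflects
    # uncertainty about their distribution.
    donors = rng.choices(observed, k=len(observed))
    # Step 2: fill each gap with a random donor, shifted by delta
    # (delta = 0 treats the gaps as ignorable).
    return [v if v is not None else rng.choice(donors) + delta for v in values]


def pooled_mean(values, m=20, delta=0.0, seed=0):
    """Estimate the mean on m imputed datasets and pool with Rubin's rules."""
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        completed = abb_impute(values, rng, delta)
        estimates.append(statistics.fmean(completed))
        variances.append(statistics.variance(completed) / len(completed))
    q_bar = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)       # within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance
    total = within + (1 + 1 / m) * between     # Rubin's total variance
    return q_bar, total ** 0.5


# Hypothetical data: roughly 30% of values are missing.
data_rng = random.Random(1)
values = [data_rng.gauss(50, 10) if data_rng.random() > 0.3 else None
          for _ in range(500)]

for delta in (0.0, -5.0, -10.0):
    estimate, std_error = pooled_mean(values, delta=delta)
    print(f"delta={delta:+.0f}: mean {estimate:.2f} (SE {std_error:.2f})")
```

Reporting the estimate across a range of `delta` values shows readers how strongly the conclusion depends on an assumption that the observed data cannot verify.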
Auxiliary variable incorporation strengthens imputation models by including variables that predict both missingness and the missing values themselves, even if those auxiliary variables are not part of the primary analysis. In health data contexts, variables such as ZIP code poverty rate, insurance type, and facility characteristics often serve this function effectively.
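A small simulation can show why an auxiliary variable matters. In the sketch below, the insurance labels, group means, and missingness rates are all invented: insurance type predicts both the value and whether it is missing. Imputing within insurance classes (a simple hot deck, shown once here, though in practice it would be repeated and pooled as in multiple imputation) recovers the population mean far better than the complete-case mean does.

```python
import random
import statistics
from collections import defaultdict


def impute_within_classes(rows, rng):
    """Hot-deck imputation: fill each missing value from a random donor
    that shares the same auxiliary class. Assumes every class has donors."""
    donors = defaultdict(list)
    for aux, value in rows:
        if value is not None:
            donors[aux].append(value)
    return [(aux, value if value is not None else rng.choice(donors[aux]))
            for aux, value in rows]


# Hypothetical population: insurance type drives both the value
# and the chance that the value is missing.
rng = random.Random(0)
full, rows = [], []
for _ in range(4000):
    aux = rng.choice(["medicaid", "private"])
    value = rng.gauss(40 if aux == "medicaid" else 60, 5)
    p_missing = 0.6 if aux == "medicaid" else 0.1
    full.append(value)
    rows.append((aux, None if rng.random() < p_missing else value))

true_mean = statistics.fmean(full)
complete_case_mean = statistics.fmean(v for _, v in rows if v is not None)
class_imputed_mean = statistics.fmean(
    v for _, v in impute_within_classes(rows, rng))

print(f"true {true_mean:.2f}, complete-case {complete_case_mean:.2f}, "
      f"class-imputed {class_imputed_mean:.2f}")
```

The auxiliary variable does not need to appear in the final analysis model. Its role is to make the MAR assumption more plausible for the imputation step.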
Finally, researchers should consider linkage to complementary data sources as a structural remedy rather than a post-hoc correction. Linking CMS claims to the American Community Survey, for instance, can restore demographic context that administrative records lack. The Census Bureau's Longitudinal Employer-Household Dynamics data has been used in exactly this way to recover employment and income context for health research populations otherwise invisible in clinical records.
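At its simplest, linkage of this kind is a join on a shared geographic or person-level key. The sketch below is a hypothetical illustration: the field names (`area_code`, `poverty_rate`), the lookup table, and the values are assumptions, not the schema of any CMS or Census product. It keeps unmatched records instead of dropping them and reports the match rate, since records that fail to link are themselves a missingness pattern worth mapping.

```python
def link_area_context(claims, area_table, key="area_code"):
    """Left-join area-level context onto claims.

    Unmatched claims are kept with poverty_rate=None so that
    linkage failures stay visible in later missingness checks.
    """
    linked = []
    for claim in claims:
        context = area_table.get(claim.get(key))
        linked.append({
            **claim,
            "poverty_rate": context["poverty_rate"] if context else None,
            "linked": context is not None,
        })
    match_rate = sum(row["linked"] for row in linked) / len(linked)
    return linked, match_rate


# Hypothetical inputs.
claims = [
    {"claim_id": 1, "area_code": "A"},
    {"claim_id": 2, "area_code": "B"},
    {"claim_id": 3, "area_code": None},  # no usable geography on the record
]
area_table = {"A": {"poverty_rate": 0.31}, "B": {"poverty_rate": 0.08}}

linked, match_rate = link_area_context(claims, area_table)
print(f"match rate: {match_rate:.0%}")
```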
Seeing the Shape of What Is Missing
Data literacy, at its most rigorous, demands that analysts develop as much fluency with absence as with presence. The instinct to work with the data one has, rather than to interrogate the data one does not have, is understandable—but it is also the instinct most likely to produce findings that mislead rather than illuminate.
For researchers and data professionals relying on CDC, CMS, or AHRQ datasets, the question worth asking before any analysis proceeds is not only "what does this dataset contain?" but "who and what is structurally unlikely to appear here, and why?" The answers to those questions do not always prevent a study from moving forward. But they do determine whether the conclusions drawn from that study are honest about the population they actually describe.
At YWT Data, our position is straightforward: a dataset's limitations are as analytically important as its contents. Treating missing data as a minor technical inconvenience, rather than as a substantive source of potential bias, is one of the most common and consequential errors in applied health research. The frameworks described above are not exhaustive, but they represent a minimum standard for any professional analysis that aspires to inform decisions affecting real people—particularly those least visible in the records we use to study them.