Seven Measures That Tell a Richer Story Than the P-Value Ever Could
In 2016, the American Statistical Association did something it had never done in its 177-year history: it issued a formal statement warning researchers against misusing one of the most familiar tools in quantitative science. The target was the p-value — specifically, the entrenched habit of treating a threshold of 0.05 as a binary verdict on whether a finding is real or imaginary.
The ASA followed up in 2019 with a special issue of The American Statistician dedicated to the theme "Moving to a World Beyond 'p < 0.05,'" featuring more than forty contributed papers. The message from the statistical community could not have been clearer. And yet, open nearly any issue of a major US research journal today — in public health, behavioral economics, or sociology — and the p-value remains the gravitational center of the results section.
This is not an argument for abolishing p-values. It is an argument for supplementing them with measures that actually answer the questions researchers and their audiences care about. What follows is a practical introduction to seven such measures, each illustrated with examples drawn from published US research.
1. Effect Size: The Magnitude Question
A p-value answers the question: "Assuming no true effect exists, how surprising is this result?" It does not answer the question most readers actually want answered: "How large is the effect?" Effect size measures fill that gap.
Cohen's d, for instance, expresses the difference between two group means in standard deviation units. When a 2021 study published in JAMA Internal Medicine evaluated a workplace wellness program across a large US employer, it reported a statistically significant reduction in emergency department visits (p = 0.03). The effect size, however, was d = 0.08 — a difference so small that its practical significance for program design decisions was negligible. Reporting the p-value without the effect size would have misled administrators into overestimating the program's impact.
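To make the definition concrete, here is a minimal Python sketch of Cohen's d using the pooled standard deviation. The function name cohens_d and the sample values are illustrative assumptions, not data from the study described above.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical illustrative data, not taken from any published study.
treated = [12.1, 11.4, 13.0, 12.7, 11.9, 12.3]
control = [11.8, 11.2, 12.9, 12.4, 11.7, 12.0]
print(f"Cohen's d = {cohens_d(treated, control):.2f}")
```

Because d is expressed in standard deviation units, it can be compared across studies that measure outcomes on different scales, which a p-value cannot offer.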
For categorical outcomes, odds ratios, relative risks, and Cramér's V serve analogous functions. The key discipline is reporting them consistently, not selectively.
2. Confidence Intervals: Precision, Not Just Direction
A 95% confidence interval (CI) communicates something a p-value cannot: the range of effect sizes reasonably compatible with the observed data, and therefore how precisely the effect has been estimated. Two studies can both report p = 0.04 while having very different confidence intervals: one narrow and informative, the other so wide that it spans both clinically trivial and clinically dramatic effect sizes.
Consider a 2019 economic analysis of minimum wage increases across several US states. The point estimate suggested a modest reduction in teen employment, but the 95% CI ranged from a 4% decrease to a 1.2% increase. That interval tells a substantially different story than the point estimate alone — one of genuine uncertainty rather than settled evidence. Presenting the interval alongside the estimate is standard practice in epidemiology and is increasingly expected in economics journals. It should be universal.
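The point that identical p-values can hide very different precision can be sketched in a few lines of Python. This is a rough illustration that assumes a large-sample normal (z) approximation; the function name z_test_summary and both sets of estimates and standard errors are hypothetical, not figures from the minimum wage analysis.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_test_summary(estimate, std_error, level=0.95):
    """Two-sided p-value and confidence interval under a normal approximation."""
    z = estimate / std_error
    p_value = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    crit = STD_NORMAL.inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return p_value, (estimate - crit * std_error, estimate + crit * std_error)


# Two hypothetical studies with the same estimate-to-error ratio,
# and therefore the same p-value.
p_a, ci_a = z_test_summary(estimate=0.5, std_error=0.2435)
p_b, ci_b = z_test_summary(estimate=5.0, std_error=2.435)

print(f"Study A: p = {p_a:.3f}, 95% CI = ({ci_a[0]:.2f}, {ci_a[1]:.2f})")
print(f"Study B: p = {p_b:.3f}, 95% CI = ({ci_b[0]:.2f}, {ci_b[1]:.2f})")
```

Both hypothetical studies produce the same p-value, yet the second interval is ten times wider than the first. With small samples, a t-based interval would be the more appropriate choice than this z approximation.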
3. Bayesian Credible Intervals: Probability Where You Want It
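One of the most persistent misunderstandings in applied research is treating a 95% confidence interval as though it means "there is a 95% probability the true value lies within this range." Under classical frequentist theory it does not: the 95% describes how often the interval-building procedure captures the true value across repeated samples, not the probability that any single computed interval contains it. A Bayesian credible interval, given a model and a prior, does make that direct probability statement, and for many research audiences that is the more intuitive and actionable framing.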
Bayesian methods have seen substantial adoption in US clinical trial design and in computational social science. A credible interval incorporates prior knowledge about plausible parameter values and updates it with observed data, producing a posterior distribution that can be communicated in plain probabilistic terms. For policy-facing research, where decision-makers need to reason about probability directly, this is a meaningful advantage.
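As a minimal sketch of how a prior is updated into a posterior, the Python example below uses the simplest conjugate case: a normal prior on the effect combined with a normally distributed estimate whose standard error is treated as known. The function name normal_posterior, the prior, and the data values are all illustrative assumptions; real analyses usually need richer models and a dedicated Bayesian library.

```python
from statistics import NormalDist


def normal_posterior(prior_mean, prior_sd, estimate, std_error):
    """Posterior for an effect under a normal prior and a normal likelihood."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / std_error ** 2
    post_var = 1 / (prior_precision + data_precision)
    # The posterior mean is a precision-weighted average of prior and data.
    post_mean = post_var * (
        prior_precision * prior_mean + data_precision * estimate
    )
    return NormalDist(post_mean, post_var ** 0.5)


# Hypothetical inputs: a skeptical prior centered on zero and a modest estimate.
posterior = normal_posterior(
    prior_mean=0.0, prior_sd=1.0, estimate=0.30, std_error=0.15
)
lower, upper = posterior.inv_cdf(0.025), posterior.inv_cdf(0.975)
prob_positive = 1 - posterior.cdf(0.0)

print(f"95% credible interval: ({lower:.2f}, {upper:.2f})")
print(f"Posterior probability the effect is positive: {prob_positive:.3f}")
```

The last line is the kind of statement decision-makers tend to ask for: a direct probability about the effect itself, conditional on the stated prior and model.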
4. Power Analysis: The Honesty Check
Statistical power — the probability that a study will detect a true effect of a given size — is typically discussed in grant applications and then quietly forgotten. It deserves a permanent place in results reporting.
Underpowered studies are endemic in US social science and health research. A systematic review of randomized controlled trials published in US public health journals found that a substantial proportion had power below 0.80 for their primary outcome, meaning they had less than an 80% chance of detecting the effect they were designed to find. When such studies return null results, it is genuinely ambiguous whether the intervention failed or the study was simply too small to see a real signal.
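Reporting, alongside null findings, the smallest effect the study could reliably detect at its planned power, and being transparent about what effect sizes it was and was not equipped to detect, turns a potentially misleading null result into informative evidence. Post-hoc power computed from the observed effect adds little here, because it is a direct function of the p-value already reported.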
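The two quantities discussed in this section, power for a given effect size and the smallest effect a design can reliably detect, can be approximated as follows. This is a sketch under stated assumptions: two equal-sized groups, a two-sided test, and a normal approximation in place of the exact t-based calculation. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power to detect a standardized effect d (two-sided z-test)."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # expected value of the test statistic
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)


def minimum_detectable_effect(n_per_group, alpha=0.05, power=0.80):
    """Smallest standardized effect detectable at the requested power.

    Ignores the negligible chance of rejecting in the wrong direction.
    """
    return (Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)) * sqrt(2 / n_per_group)


# Hypothetical designs: a small trial and a larger one.
for n in (50, 393):
    print(
        f"n = {n} per group: power for d = 0.2 is {power_two_sample(0.2, n):.2f}, "
        f"minimum detectable d is {minimum_detectable_effect(n):.2f}"
    )
```

Stating the minimum detectable effect next to a null result tells readers which effect sizes the study can plausibly rule out and which it cannot.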
5. False Discovery Rate: Honest Accounting in Multiple Comparisons
Modern data science frequently involves testing dozens or hundreds of hypotheses simultaneously, whether across genomic datasets, behavioral surveys, or A/B testing platforms. Under these conditions, the expected number of false positives grows with the number of tests, even when the nominal significance threshold is held constant: at a threshold of 0.05, 100 tests of hypotheses with no true effect yield about five "significant" results on average.
The Benjamini-Hochberg procedure for controlling the false discovery rate (FDR) has become standard in genomics and is gaining traction in social science. Rather than asking whether any individual test clears a threshold, FDR control asks: across the full set of findings I am reporting, what proportion can I expect to be false positives? Reporting the FDR-adjusted q-value alongside the p-value provides readers with a more honest picture of inferential risk, particularly in exploratory analyses.
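For readers who want to see the mechanics, the following is a small pure-Python sketch of Benjamini-Hochberg adjusted p-values. The function name benjamini_hochberg and the p-values are illustrative assumptions; in practice an established statistics library would normally be used instead of hand-rolled code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted values
    # never decrease as the raw p-values increase.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values from an exploratory analysis with eight tests.
raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
adjusted = benjamini_hochberg(raw)
for p, q in zip(raw, adjusted):
    flag = "keep" if q <= 0.05 else "drop"
    print(f"p = {p:.3f}  adjusted = {q:.3f}  {flag} at 5% FDR")
```

Several of these hypothetical tests fall below 0.05 on their raw p-values but not after adjustment, which is exactly the gap that reporting both numbers makes visible.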
6. Prediction Intervals: What Happens in the Next Case
Researchers in clinical and applied settings frequently need to reason not about population averages but about individual outcomes. A prediction interval answers a distinct question from a confidence interval: not "where does the true mean lie?" but "where will the next observation likely fall?"
In a 2020 study examining hospital readmission rates following cardiac procedures at US medical centers, the mean readmission rate and its confidence interval were reported prominently. But for hospital administrators evaluating their own institutional performance against the benchmark, the prediction interval — which was substantially wider — was the more relevant measure. Conflating the two leads to misplaced confidence in the precision of individual-level predictions.
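The difference between the two intervals is easy to see numerically. The Python sketch below computes both from the same sample, assuming roughly normal data and using a normal quantile for simplicity (a t quantile would be more appropriate for a sample this small). The function name and the readmission rates are hypothetical, not values from the study described above.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci_and_prediction_interval(sample, level=0.95):
    """Interval for the mean and interval for a single new observation."""
    n = len(sample)
    centre = mean(sample)
    s = stdev(sample)
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    ci_half = crit * s / sqrt(n)          # uncertainty about the mean only
    pi_half = crit * s * sqrt(1 + 1 / n)  # adds case-to-case variability
    return (
        (centre - ci_half, centre + ci_half),
        (centre - pi_half, centre + pi_half),
    )


# Hypothetical readmission rates (percent) at eight hospitals.
rates = [14.2, 15.8, 13.5, 16.9, 15.1, 14.7, 17.3, 13.9]
ci, pi = mean_ci_and_prediction_interval(rates)
print(f"95% CI for the mean rate: ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"95% prediction interval:  ({pi[0]:.1f}, {pi[1]:.1f})")
```

The confidence interval shrinks toward zero width as the sample grows, while the prediction interval never falls below the spread of individual observations, which is why it is the relevant yardstick for a single institution.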
7. Practical Significance Thresholds: Contextualizing the Number
Perhaps the most underused tool in the applied researcher's toolkit is the simplest: a domain-specific definition of what constitutes a meaningful effect, established before data collection begins. In educational research, the What Works Clearinghouse defines minimum effect size thresholds for practical significance. In clinical medicine, the concept of the minimal clinically important difference (MCID) serves the same function.
A study of a reading intervention program across Title I schools in three US states reported a statistically significant improvement in third-grade literacy scores (p = 0.01, d = 0.12). Whether that effect size clears the threshold for practical significance depends on cost, implementation burden, and the availability of alternatives — context that only domain-specific benchmarks can provide. Reporting where an effect falls relative to an established practical significance threshold anchors statistical findings in the decisions they are meant to inform.
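One simple way to report an effect relative to a pre-registered threshold is to compare the whole confidence interval against it, not just the point estimate. The sketch below assumes larger effects are better; the function name, the interval around d = 0.12, and the threshold value are all hypothetical placeholders and not an official benchmark.

```python
def classify_against_threshold(ci_lower, ci_upper, threshold):
    """Place a confidence interval relative to a practical significance threshold."""
    if ci_lower >= threshold:
        return "meets threshold"   # even the low end is practically meaningful
    if ci_upper < threshold:
        return "below threshold"   # even the high end falls short
    return "inconclusive"          # the interval straddles the threshold


# Hypothetical interval around an estimate of d = 0.12 and a hypothetical
# threshold fixed before data collection.
print(classify_against_threshold(ci_lower=0.03, ci_upper=0.21, threshold=0.25))
```

Under these assumed numbers the result is statistically significant yet falls short of the threshold, a distinction that a p-value alone cannot express.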
Toward Richer Statistical Storytelling
None of these seven measures requires abandoning the p-value. They require surrounding it with context — the magnitude of the effect, the precision of the estimate, the probability of detection, the risk of false discovery, and the practical meaning of the finding in the domain where it will be applied.
The ASA's guidance was never a call to eliminate a tool. It was a call to stop using a single tool as a substitute for thought. For data scientists and researchers working across US healthcare, economics, and social science, the transition to richer statistical reporting is both technically straightforward and professionally overdue.
At YWT Data, we advocate for reporting standards that serve the actual consumers of research — not just peer reviewers, but the policymakers, practitioners, and analysts who translate findings into action. The p-value is a starting point. These seven measures are where the story actually begins.