# Significant others? Thinking beyond p-values in science

By Natalie Chazal, PhD student, OSU Department of Fisheries, Wildlife, & Conservation Sciences, Geospatial Ecology of Marine Megafauna Lab

Scientific inquiry relies on quantifying how certain we are of the differences we see in observations. This means we must evaluate phenomena using probabilities calculated from data collected through observation and sampling efforts. Historically, p-values have served as a relatively ubiquitous tool for assessing the strength of evidence in support of a hypothesis. However, as our understanding of statistical methods evolves, so does the scrutiny surrounding the appropriateness and interpretation of p-values. In the realm of research, the debate surrounding the use of p-values for determining statistical significance has sparked controversy and reflection within the academic community.

### What is a p-value?

To understand the debate itself, we need to understand what a p-value is. The p-value represents the probability of obtaining a result as extreme as, or more extreme than, the observed data, under the assumption that there is no true difference or relationship between groups or variables. Traditionally, a p-value below a predetermined threshold (often 0.05) is considered statistically significant, suggesting that the observed data would have been unlikely (i.e., less than a 5% probability) to occur by chance alone. Many statistical tests provide p-values, which give us a unified framework for interpretation across a range of analyses.

To illustrate this, imagine a study aimed at investigating the effects of underwater noise pollution on the foraging behavior of gray whales. Researchers collect data on the diving behavior of gray whales in both noisy and quiet regions of the ocean.

In this example, the researchers hypothesize that gray whales stop foraging and ultimately change their diving behavior in response to increased marine noise pollution. The data collected from this hypothetical scenario could come from tags equipped with sensors that record diving depth, duration, and location, allowing us to calculate the exact length of time spent foraging. Data would be collected from both noisy areas (maybe near shipping lanes or industrial sites) and quiet areas (more remote regions with minimal human activity).

To assess the significance of the differences between the two noise regimes, researchers may use statistical tests like t-tests to compare two groups. In our example, researchers use a t-test to compare the average foraging time between whales in noisy and quiet regimes. The next step would be to define hypotheses about the differences we expect to see. The null hypothesis (HN) would be that there is no difference in the average foraging time (X̄) between noisy and quiet areas:

HN: X̄ (noisy) = X̄ (quiet)

And the alternative hypothesis (HA) would be that there is a difference between the noisy and quiet areas:

HA: X̄ (noisy) ≠ X̄ (quiet)

For now, we will skip over the nitty gritty of a t-test and just say that the researchers get a “t-score” that indicates how far apart the group means (X̄) are relative to the variability in the data. A larger t-score suggests a difference in the means, whereas a smaller t-score indicates that the means are more similar. This t-score comes along with a p-value. Let’s say we get a t-score associated with a p-value of 0.03, which corresponds to the probability in the tails of the t-distribution beyond our observed t-score.

A p-value of 0.03 means that, assuming the null hypothesis is true (that there is no difference), there is a 3% probability of obtaining differences in foraging time between noisy and quiet areas at least this large purely by chance. We usually compare this p-value to a threshold to decide whether the finding is significant, and we set this threshold before looking at the results of the test. If our p-value falls below the threshold (e.g., a threshold of 0.05), we “reject the null hypothesis” and conclude that there is a significant difference in foraging time between noisy and quiet areas. On the flip side, if we had set a stricter threshold (e.g., 0.01), we would “fail to reject the null hypothesis” and conclude that there was no significant difference in foraging time between noisy and quiet areas. The reason we never “accept the null” is that observations consistent with the null are not positive evidence for it: they could just as well be consistent with a different alternative hypothesis that we have not yet tested.
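In code, this whole comparison takes only a few lines. Below is a minimal sketch using scipy's Welch t-test on simulated foraging times; all numbers are invented for illustration and are not real gray whale data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical foraging times (minutes per dive); these numbers are
# invented for illustration, not real gray whale data.
quiet = rng.normal(loc=12.0, scale=3.0, size=30)   # quiet-area whales
noisy = rng.normal(loc=10.0, scale=3.0, size=30)   # noisy-area whales

# Welch's t-test: compares group means without assuming equal variances.
t_score, p_value = stats.ttest_ind(noisy, quiet, equal_var=False)

alpha = 0.05  # significance threshold, chosen before seeing the data
print(f"t = {t_score:.2f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject the null: foraging time differs between regimes.")
else:
    print("Fail to reject the null: no detectable difference.")
```

Note that the threshold `alpha` is fixed before the test is run, mirroring the point above about choosing the threshold before looking at results.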

In this example, the use of p-values helps the researchers quantify the strength of evidence for their hypothesis and determine whether the observed differences in gray whale behavior are likely to be meaningful or merely due to chance.

### The Debate

Despite their widespread use, p-values have been met with criticism. First, because p-values are so ubiquitous, it is easy to calculate them without enough critical thinking about their interpretation. That thinking should include an understanding of what is biologically relevant, and should avoid the trap of binary language like “significant” versus “non-significant” instead of examining the uncertainty of the results directly. Another common misconception is that a p-value measures the probability that the null hypothesis is true. As useful as that would be, in reality a p-value only describes the probability of the observed data under the null. It is also common to conflate the significance or magnitude of the p-value with effect size (the strength of the relationship between the variables): you can obtain a small p-value for an effect that is not large or meaningful, especially with a large sample size. Sample size is therefore an important metric to report. Larger samples generally mean more precise estimates, higher statistical power, greater generalizability, and better prospects for replication.
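To see how sample size can inflate significance, here is a minimal sketch (with invented numbers) in which a trivially small difference in simulated foraging times, about 0.03 standard deviations, produces a tiny p-value simply because the sample is enormous, while the same difference is invisible in a small sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Tiny effect: means differ by 0.1 minute against a spread of 3 minutes.
# With a huge sample, even this trivial difference yields a small p-value.
big_a = rng.normal(12.0, 3.0, 200_000)
big_b = rng.normal(12.1, 3.0, 200_000)
p_big = stats.ttest_ind(big_a, big_b).pvalue

# The same effect with a small sample is indistinguishable from noise.
small_a = rng.normal(12.0, 3.0, 20)
small_b = rng.normal(12.1, 3.0, 20)
p_small = stats.ttest_ind(small_a, small_b).pvalue

print(f"n=200,000 per group: p = {p_big:.4g}")
print(f"n=20 per group:      p = {p_small:.4g}")
```

The “significant” result in the large sample says nothing about whether a 0.1-minute difference matters biologically, which is exactly why effect size needs to be reported alongside the p-value.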

Furthermore, in studies that require multiple comparisons (i.e., multiple statistical analyses within a single study), there is an increased likelihood of observing false positives, because each test introduces a fresh chance of obtaining a significant result through random variability alone. In p-value language, a “false positive” is calling something significant (below your p-value threshold) when it actually is not, and a “false negative” is calling something non-significant (above the threshold) when it actually is. If no adjustment is made for the increased risk of false positives across multiple comparisons, a study can reach inaccurate conclusions of significance.
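A quick simulation makes the multiple-comparisons problem concrete: if we run 20 t-tests on data where the null hypothesis is true for every test, some tests may still come out “significant” by chance, and a Bonferroni correction (dividing the threshold by the number of tests) reins this in. This sketch uses invented data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, alpha = 20, 0.05

# Draw both "groups" from the SAME distribution, so the null is true
# for every test; any significant result is a false positive.
p_values = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(n_tests)
])

print(f"False positives at alpha={alpha}: {(p_values < alpha).sum()}")

# Bonferroni correction: divide the threshold by the number of tests.
bonferroni = alpha / n_tests
print(f"False positives after Bonferroni: {(p_values < bonferroni).sum()}")
```

With 20 independent tests at a 0.05 threshold, the chance of at least one false positive is about 1 − 0.95²⁰ ≈ 64%, which is why an adjustment is needed.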

In our example using foraging time in gray whales, we didn’t consider the context of our findings. To make this a more reliable study, we have to consider factors like the number of whales tagged (sample size!), the magnitude of noise near the tagged whales, other variables in the environment (e.g. prey availability) that could affect our results, and the ecological significance in the difference in foraging time that was found. To make robust conclusions, we need to carefully build hypotheses and study designs that will answer the questions we seek. We must then carefully choose the statistical tests that we use and explore how our data align with the assumptions that these tests make. It’s essential to contextualize our results within the bounds of our study design and broader ecological system. Finally, performing sensitivity analyses (e.g. running the same tests multiple times on slightly different datasets) ensures that our results are stable over a variety of different model parameters and assumptions.
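One simple way to sketch the sensitivity-analysis idea is to rerun the same test on bootstrap resamples of the data and check how stable the conclusion is across those slightly different datasets. The numbers here are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical foraging times (minutes); invented for illustration.
quiet = rng.normal(12.0, 3.0, 30)
noisy = rng.normal(10.0, 3.0, 30)

# Sensitivity check: rerun the t-test on many bootstrap resamples and
# see how much the p-value (and the conclusion) moves around.
p_values = []
for _ in range(1000):
    q = rng.choice(quiet, size=quiet.size, replace=True)
    n = rng.choice(noisy, size=noisy.size, replace=True)
    p_values.append(stats.ttest_ind(n, q, equal_var=False).pvalue)

p_values = np.array(p_values)
print(f"Fraction of resamples with p < 0.05: {(p_values < 0.05).mean():.2f}")
```

If a “significant” result only appears in a small fraction of resamples, that is a warning that the conclusion is fragile rather than robust.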

In the real world, there have been many studies done on the effects of noise pollution on baleen whale behavior that incorporate multiple sources of variance and bias to get robust results that show behavioral responses and physiological consequences to anthropogenic sound stressors (Melcón et al. 2012, Blair et al. 2016, Gailey et al. 2022, Lemos et al. 2022).

### Moving Beyond P-values

There has been growing interest in reassessing the role of p-values in scientific inference and publishing. Scientists appreciate p-values because they provide one clear numeric threshold to determine significance of their results. However, the reality is more complicated than this binary approach. We have to explore the uncertainty around these estimates and test statistics (e.g. t-score) and what they represent ecologically. One avenue to explore might be focusing more on effect sizes and confidence intervals as more informative measures of the magnitude and precision of observed effects. There has also been a shift towards using Bayesian methods, which allow for the incorporation of prior knowledge and a more nuanced quantification of uncertainty.
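As a concrete illustration of reporting effect sizes and confidence intervals rather than a p-value alone, here is a minimal sketch (invented numbers) computing Cohen’s d and a 95% confidence interval for the mean difference in foraging time; the simple pooled degrees of freedom are an approximation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
quiet = rng.normal(12.0, 3.0, 30)   # hypothetical foraging minutes
noisy = rng.normal(10.0, 3.0, 30)

diff = noisy.mean() - quiet.mean()

# Cohen's d: mean difference scaled by the pooled standard deviation,
# an effect size in units of "how many SDs apart are the groups?"
pooled_sd = np.sqrt((quiet.var(ddof=1) + noisy.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd

# 95% confidence interval for the mean difference.
se = np.sqrt(quiet.var(ddof=1)/quiet.size + noisy.var(ddof=1)/noisy.size)
df = quiet.size + noisy.size - 2  # simple approximation of the Welch df
t_crit = stats.t.ppf(0.975, df)
ci = (diff - t_crit*se, diff + t_crit*se)

print(f"Effect size d = {cohens_d:.2f}")
print(f"95% CI for difference = ({ci[0]:.2f}, {ci[1]:.2f}) minutes")
```

Unlike a bare p-value, the interval communicates both the magnitude of the difference in minutes and how precisely it was estimated.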

Bayesian methods in particular are a leading alternative to p-values because instead of asking how likely our observations are given a null hypothesis, we get a direct probability of a hypothesis given our data. For example, we can use a Bayes factor for our noisy vs. quiet gray whale behavioral t-test (Johnson et al. 2023). A Bayes factor measures the likelihood of the data under each hypothesis separately (instead of assuming the null hypothesis is true), so if we calculated a Bayes factor of 3 for the alternative hypothesis (HA), we could say directly that decreased foraging time in noisy areas is 3 times more likely than no difference between the noisy and quiet groups. That is just one example of Bayesian methods at work. The GEMM Lab uses Bayesian methods in many projects, from Lisa’s spatial capture-recapture models (link to blog) and Dawn’s blue whale abundance estimates (Barlow et al. 2018) to quantifying uncertainty in drone photogrammetry data collection in KC’s body size models (link to blog).
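Johnson et al. (2023) develop Bayes factor functions formally; as a rough, self-contained illustration only (my own sketch, not the method from that paper), here is a BIC-based approximation to a Bayes factor comparing a shared-mean model (H0) against a separate-means model (HA) on invented foraging-time data:

```python
import numpy as np

rng = np.random.default_rng(3)
quiet = rng.normal(12.0, 3.0, 30)   # hypothetical foraging minutes
noisy = rng.normal(10.0, 3.0, 30)

data = np.concatenate([quiet, noisy])
n = data.size

def gaussian_loglik(x, mu, sigma):
    """Log-likelihood of data x under Normal(mu, sigma)."""
    return np.sum(-0.5*np.log(2*np.pi*sigma**2) - (x - mu)**2/(2*sigma**2))

# Null model H0: one shared mean (2 parameters: mu, sigma).
mu0, sd0 = data.mean(), data.std(ddof=0)
bic_null = 2*np.log(n) - 2*gaussian_loglik(data, mu0, sd0)

# Alternative model HA: separate group means (3 params: mu1, mu2, sigma).
resid = np.concatenate([quiet - quiet.mean(), noisy - noisy.mean()])
sd1 = resid.std(ddof=0)
ll_alt = (gaussian_loglik(quiet, quiet.mean(), sd1)
          + gaussian_loglik(noisy, noisy.mean(), sd1))
bic_alt = 3*np.log(n) - 2*ll_alt

# BF > 1 favors the alternative (different means); BF around 3 would mean
# the data are roughly 3x more likely under HA than under H0.
bayes_factor = np.exp((bic_null - bic_alt)/2)
print(f"Approximate Bayes factor (HA vs H0): {bayes_factor:.2f}")
```

The output is a direct statement about the relative support for the two hypotheses, rather than a tail probability computed under the assumption that the null is true.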

Ultimately, the debate surrounding p-values highlights the necessity of nuanced and transparent approaches to statistical inference in scientific research. Rather than relying solely on arbitrary thresholds, researchers can consider the context, relevance, and robustness of their findings. From justifying our significance thresholds to directly describing parameters based on probability, we have increasingly powerful tools to improve the methodological rigor of our studies.

#### References

Agathokleous, E., 2022. Environmental pollution impacts: Are p values over-valued? Science of The Total Environment 850, 157807. https://doi.org/10.1016/j.scitotenv.2022.157807

Barlow, D.R., Torres, L.G., Hodge, K.B., Steel, D., Baker, C.S., Chandler, T.E., Bott, N., Constantine, R., Double, M.C., Gill, P., Glasgow, D., Hamner, R.M., Lilley, C., Ogle, M., Olson, P.A., Peters, C., Stockin, K.A., Tessaglia-Hymes, C.T., Klinck, H., 2018. Documentation of a New Zealand blue whale population based on multiple lines of evidence. Endangered Species Research 36, 27–40. https://doi.org/10.3354/esr00891

Blair, H.B., Merchant, N.D., Friedlaender, A.S., Wiley, D.N., Parks, S.E., 2016. Evidence for ship noise impacts on humpback whale foraging behaviour. Biol Lett 12, 20160005. https://doi.org/10.1098/rsbl.2016.0005

Brophy, C., 2015. Should ecologists be banned from using p-values? Journal of Ecology Blog. URL https://jecologyblog.com/2015/03/06/should-ecologists-be-banned-from-using-p-values/ (accessed 4.19.24).

Castilho, L.B., Prado, P.I., 2021. Towards a pragmatic use of statistics in ecology. PeerJ 9, e12090. https://doi.org/10.7717/peerj.12090

Gailey, G., Sychenko, O., Zykov, M., Rutenko, A., Blanchard, A., Melton, R.H., 2022. Western gray whale behavioral response to seismic surveys during their foraging season. Environ Monit Assess 194, 740. https://doi.org/10.1007/s10661-022-10023-w

Halsey, L.G., 2019. The reign of the p-value is over: what alternative analyses could we employ to fill the power vacuum? Biology Letters 15, 20190174. https://doi.org/10.1098/rsbl.2019.0174

Johnson, V.E., Pramanik, S., Shudde, R., 2023. Bayes factor functions for reporting outcomes of hypothesis tests. Proceedings of the National Academy of Sciences 120, e2217331120. https://doi.org/10.1073/pnas.2217331120

Lemos, L.S., Haxel, J.H., Olsen, A., Burnett, J.D., Smith, A., Chandler, T.E., Nieukirk, S.L., Larson, S.E., Hunt, K.E., Torres, L.G., 2022. Effects of vessel traffic and ocean noise on gray whale stress hormones. Sci Rep 12, 18580. https://doi.org/10.1038/s41598-022-14510-5

Lu, Y., Belitskaya-Levy, I., 2015. The debate about p-values. Shanghai Arch Psychiatry 27, 381–385. https://doi.org/10.11919/j.issn.1002-0829.216027

Melcón, M.L., Cummins, A.J., Kerosky, S.M., Roche, L.K., Wiggins, S.M., Hildebrand, J.A., 2012. Blue Whales Respond to Anthropogenic Noise. PLOS ONE 7, e32681. https://doi.org/10.1371/journal.pone.0032681

Murtaugh, P.A., 2014. In defense of P values. Ecology 95, 611–617. https://doi.org/10.1890/13-0590.1

Vidgen, B., Yasseri, T., 2016. P-Values: Misunderstood and Misused. Front. Phys. 4. https://doi.org/10.3389/fphy.2016.00006