VIEWPOINT

Vol. 139 No. 1631 | DOI: 10.26635/6965.7313

Addressing significant inequity


Health research in Aotearoa New Zealand shows persistent inequities. Within the context of national health surveys, there has been recognition that for Māori this is at least in part due to sample size disparities between cohorts, which did not give “equal explanatory power”. Equal explanatory power was defined as treating “Māori statistical needs as having equal status with those of the total New Zealand population”.1 It is achieved by equal sample sizes for Māori and non-Māori and has been framed as a right for tangata whenua and a Te Tiriti o Waitangi/the Treaty of Waitangi obligation. The authors also identified equal explanatory power’s corollary, “equal analytical power”: the power of definition, explanation and meaning.

We believe the right for equal explanatory power and equal analytical power applies to all health research within New Zealand and for all cohorts based on sex, gender, ethnicity and age. Some journals have called for action to address demographic inequity by reducing enrolment imbalances or conducting sub-group analysis.2 The recognition of sex- and ethnicity-based differences in pathophysiology, biochemistry and clinical outcomes highlights the importance of the inequity issue.3,4

Researchers often face small or undercounted minority sub-groups,5 and equal explanatory power will not always be possible. Here we show that some common sub-group or small sample statistical analysis practices risk increasing inequities and provide practical steps to reduce them.

How small sample sizes lead to inequity

Most healthcare studies use sampling methods that may yield samples, and hence p-values, that fail to reflect the true difference between groups in a population. This includes exclusion criteria that systematically omit older people, pregnant women or other minority groups. Even if the cohorts were true random samples, they may misrepresent the overall population by chance, so any difference observed may be erroneous. This is more likely for small cohorts and sub-groups. Small cohorts are therefore less likely to represent the underlying population than large cohorts (i.e., less likely to be externally valid), and their findings may erroneously support conclusions that lead to either over- or under-investigation or treatment of the sub-group involved.

The smaller the sample size, the less likely a rare event (e.g., an adverse reaction) is to be observed. For example, for a sub-group comprising 20% of the overall sample, the expected number of rare events would be five times smaller than for the overall sample. The rare event rate in the small sub-group may truly be much higher than in all patients (e.g., two to three times), but it will still be more likely to be missed. In some cases, the size of a sub-group is determined by the chosen analytical methods, particularly where continuous variables are clustered into groups. Analysis by age group, or by discretising a measurement such as systolic blood pressure or a biomarker, is common, seemingly to aid interpretation of results. Such discretisation is effectively extreme rounding. It causes loss of information, resulting in less power to detect real relationships and an increased probability of false positives.6 Where appropriate, discretisation should be avoided and continuous variables treated as continuous.
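The rare-event arithmetic above can be sketched in a few lines; the 1-in-200 event rate and the cohort sizes of 500 and 100 are purely illustrative assumptions, not values from any study:

```python
def prob_at_least_one(event_rate: float, n: int) -> float:
    """Probability of observing at least one event among n patients,
    assuming independent events at a fixed per-patient rate."""
    return 1 - (1 - event_rate) ** n

# A hypothetical 1-in-200 adverse reaction, an overall cohort of 500,
# and a 20% sub-group of 100 patients.
rate = 1 / 200
print(round(prob_at_least_one(rate, 500), 2))      # ~0.92 in the overall cohort
print(round(prob_at_least_one(rate, 100), 2))      # ~0.39 in the sub-group
# Even if the true sub-group rate were three times higher, the event
# would still be more visible in the larger overall cohort:
print(round(prob_at_least_one(3 * rate, 100), 2))  # ~0.78
```
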

Even where the estimated effect size is the same in a sub-group as in all patients, the p-value will be larger and the confidence interval wider in the sub-group. The commonly applied notion of statistical significance, usually defined as p<0.05, along with the misinterpretation of the meaning of p-values, can lead to inequity. A common example is when a “statistically significant” association between two variables in the overall study population is re-analysed in a sub-group (e.g., an Indigenous sub-population), giving a p-value >0.05, which results in a conclusion of “no association”. Using a p-value threshold of 0.05 as a prompt for interpretation of a sub-group analysis rests on two common mistakes. The first is to forget that it is the null hypothesis that is being tested. The assumption of the null hypothesis is that there is “no association”. Therefore, one can never conclude “no association” but merely recognise the p-value, along with the associated effect size, as quantifying the evidence against the null hypothesis. A correct interpretation is that the p-value is the probability of observing the effect size, or a greater one, if the null hypothesis is true and given all the assumptions of the statistical methods used. The second mistake is assuming that the arbitrary threshold of 0.05 represents a clinically significant difference, which results in the erroneous conclusion that there is “no difference”. This dichotomisation simply does not make sense. Indeed, the difference between p-values of, say, 0.055 and 0.045 could well come down to one or two patients when the sample size is small. The same mistakes are made when 95% confidence intervals are interpreted by whether they cross the value representing the null (e.g., a hazard ratio of 1, or a difference in proportions of 0). A concomitant error is interpreting p-values slightly above 0.05 as demonstrating a “trend”; they do nothing of the sort.7
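To illustrate how an identical estimated effect yields a larger p-value in a smaller sub-group, here is a minimal sketch using a normal-approximation two-proportion z-test; the event rates (10% vs 20%) and the sample sizes are illustrative assumptions:

```python
import math
from statistics import NormalDist

def two_prop_p(p1: float, p2: float, n1: int, n2: int) -> float:
    """Two-sided p-value for a two-proportion z-test (normal approximation)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(z))

# The same estimated effect (10% vs 20% event rates) in a large cohort
# and in a sub-group a tenth of the size:
print(two_prop_p(0.10, 0.20, 500, 500) < 0.001)  # True
print(two_prop_p(0.10, 0.20, 50, 50) > 0.05)     # True: same effect, p > 0.05
```

Concluding “no association” in the sub-group here would be exactly the mistake described above: the effect estimate is identical; only the precision differs.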

Good practices for handling small sample sizes

There are no magic statistical bullets by which to handle small sample sizes, but there are better practices in study design and analysis that could at least reduce the possibility of drawing erroneous conclusions and introducing or perpetuating inequities and perhaps provide meaningful explanatory power where equal explanatory power is not possible. We have outlined some examples in Table 1.

View Table 1, Box 1.

Collecting more data to achieve the ideal of samples with equal explanatory power may not always be possible, although co-design methods8 to improve sample sizes for specific sub-groups may help overall numbers and/or ensure the sub-group is truly represented. This may include “over-sampling”, in which the targeted sub-group is represented beyond its proportion in the underlying population. Another method to improve explanatory power may be to use one-tailed instead of two-tailed tests.9 Often, we are interested in only one side of the null hypothesis—“did the intervention help?”—rather than “did it help or hinder?”. For a given sample size, using a one-tailed test instead of a two-tailed test is the equivalent of having a roughly 20% larger sample (at the same alpha and power for the sample size calculation). However, this is not a panacea, and there will be situations where two-sided testing remains appropriate.
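The approximate sample-size saving from one-tailed testing can be checked from the standard normal-approximation power calculation, in which the required sample size is proportional to (z_alpha + z_beta)²; this sketch assumes the conventional α=0.05 and 80% power:

```python
from statistics import NormalDist

def n_factor(alpha: float, power: float, two_tailed: bool) -> float:
    """The (z_alpha + z_beta)^2 factor that required sample size is
    proportional to in a normal-approximation power calculation."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_tailed else z(1 - alpha)
    return (z_alpha + z(power)) ** 2

two = n_factor(0.05, 0.80, two_tailed=True)
one = n_factor(0.05, 0.80, two_tailed=False)
print(round(two / one, 2))  # 1.27: a two-tailed design needs ~27% more
                            # patients, i.e., the one-tailed test behaves
                            # like a roughly 20-27% larger sample
```
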

Consideration should be made at a study design stage of whether there is any biological or sociological plausibility for a difference between sub-groups. If so, this becomes a priority and should drive recruitment. On the other hand, if there is no a priori plausibility and the size of sub-groups is small, then one is only justified in doing a sub-group analysis if a very large difference in effect size would have major implications, as only a very large effect size difference could be detected. This does not preclude reporting on raw data from sub-groups that may aid future meta-analyses from multiple studies.

Perhaps the most important consideration for researchers, reviewers and editors is not to use the concept of statistical significance at all. While an α of 0.05 (usually offered with no justification!) along with a power (1−β) and a clinically meaningful effect size is often used in power calculations, this does not mean one must define p<0.05 as “statistically significant”. Indeed, as many have noted, this has led to false interpretations of results,10 along with “p-hacking” and the failure to publish when p>0.05.11 Statistics is a young science and, as such, it changes. Because of the prevalence of the arbitrary p=0.05 threshold, in 2016 the American Statistical Association released a statement on p-values that included: “The widespread use of ‘statistical significance’ (generally interpreted as ‘p≤0.05’) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.”12 New Zealand health statisticians have since argued that “We should be challenging ourselves to write and interpret results from studies without using the words ‘statistically’ or ‘significant’.”13 Leading members who formulated the American Statistical Association’s statement have taken the additional step of recommending abandonment of the term “statistical significance”.14 We believe that if New Zealand researchers and journals were to take that step, it would inhibit the perpetuation of some inequities and potentially prevent the emergence of new ones. This will require researchers to invest some time into better understanding and presenting the meaning of p-values and confidence intervals, particularly to avoid misinterpretations.15

It is common to use analysis techniques that adjust for covariates. Such multivariable analysis is good practice. However, if sample sizes are small with few events, then there is a danger of overfitting. A common rule of thumb for binary outcomes is that 10 to 15 events per variable are needed. That is, if the sub-group has only 19 events, then one should not include any covariates in the model and should stick to a simple univariate analysis. There are, though, more sophisticated methods being developed for determining the minimum number of events per variable for a variety of statistical models.16
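The events-per-variable rule of thumb amounts to simple integer arithmetic; this sketch uses the 10-events-per-variable figure (the stricter end of the 10–15 range) and treats the exposure itself as one of the counted variables:

```python
def max_model_variables(n_events: int, events_per_variable: int = 10) -> int:
    """Rule-of-thumb cap on the total number of variables (exposure plus
    covariates) in a binary-outcome model: at least `events_per_variable`
    outcome events are needed for each variable included."""
    return n_events // events_per_variable

print(max_model_variables(19))   # 1: only the exposure itself, no covariates
print(max_model_variables(150))  # 15
```
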

Two practices to aid interpretation and presentation of results may reduce the risk of unintended conclusions being drawn from them. First, always present p-values in relation to the effect size, and preferably present the effect size with a confidence interval along with an interpretation of that interval. Thus, instead of stating (erroneously) “there was no difference between A and B, p=0.06”, state “the difference between A and B was 7% (95% CI [confidence interval]: −1% to 15%), indicating a plausible reduction in <the outcome> of 1% but equally plausible an increase in <the outcome> of 15%”. It is then helpful to the reader to indicate if either of these plausible values is clinically meaningful. Second, if p-values are to be used, and in particular if they are to be compared, then the concept of s-values (binary surprisals) may aid researchers’ interpretation of results.17 An s-value is simply -log2(p-value), so for p=0.06, s=4.1. Its beauty is that it can be interpreted as answering “how surprised would I be, if a coin were fair, to see s heads in a row?”, and, unlike p-values, the difference between two s-values is consistently indicative of the difference in evidence against the null hypothesis (Box 1).
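An s-value is straightforward to compute; in this sketch the p-values are chosen arbitrarily for illustration:

```python
import math

def s_value(p: float) -> float:
    """Binary surprisal: s = -log2(p), the number of consecutive heads from
    a fair coin that would be roughly as surprising as this p-value."""
    return -math.log2(p)

print(round(s_value(0.06), 1))   # 4.1: like ~4 heads in a row
print(round(s_value(0.05), 1))   # 4.3
print(round(s_value(0.005), 1))  # 7.6: each halving of p adds one coin flip
```
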

In summary, significant inequity in health research exists, but this inequity could be reduced by adopting better interpretation of statistics, study design and analysis, which includes intentional decision making as to whether sub-group analyses should be performed or not.

Achieving equity in health research requires sub-groups to have meaningful, if not equal, explanatory power, ideally through similar sample sizes. Obtaining equal sample sizes, though, is often not possible. Small sub-group sizes increase the risk of false conclusions being drawn, which may reinforce inequities if results are misinterpreted (e.g., saying there is a difference between study arms when there is not and, conversely, saying there is no difference when there is). Here we provide examples of common pitfalls and potential considerations to guide researchers, reviewers and editors when analysing and interpreting sub-group data. We propose that researchers focus on presenting effect sizes and confidence intervals rather than statistical significance.

Authors

Professor John W Pickering, PhD: Christchurch Heart Institute, Department of Medicine, University of Otago Christchurch; Department of Emergency Medicine, Christchurch Hospital, Christchurch, New Zealand.

Associate Professor Anna P Pilbrow, PhD: Christchurch Heart Institute, Department of Medicine, University of Otago Christchurch, Christchurch, New Zealand.

Dr Allamanda Fa’atoese, PhD: Christchurch Heart Institute, Department of Medicine, University of Otago Christchurch, Christchurch, New Zealand.

Dr Laura Joyce, MBChB: Department of Surgery and Critical Care, University of Otago Christchurch, Christchurch, New Zealand; Department of Emergency Medicine, Christchurch Hospital, Christchurch, New Zealand.

Acknowledgements

Contributions: All authors contributed to both concept and writing.

Correspondence

John W Pickering, PhD: Christchurch Heart Institute, Department of Medicine, University of Otago Christchurch; Department of Emergency Medicine, Christchurch Hospital, Christchurch, New Zealand.

Correspondence email

john.pickering@otago.ac.nz

Competing interests

Nil.

1)       Te Rōpū Rangahau Hauora a Eru Pōmare, Wellington School of Medicine and Health Sciences, University of Otago. Mana Whakamārama - Equal Explanatory Power: Māori and non-Māori sample size in national health surveys [Internet]. Wellington, New Zealand: Te Rōpū Rangahau Hauora a Eru Pōmare, Wellington School of Medicine and Health Sciences, University of Otago; 2002 [cited 2025 Nov 11]. Available from: https://www.fmhs.auckland.ac.nz/assets/fmhs/Te Kupenga Hauora Māori/docs/Equalexplanatorypower.pdf

2)       Weber EJ, Body R. Sex and gender reporting in scientific papers now strongly recommended by the Emergency Medicine Journal. Emerg Med J. 2025 Jan 21;42(2):80-81. doi: 10.1136/emermed-2024-214743.

3)       Pearson AG, Pearson JF, Lewis LK, et al. Lower NT-proBNP plasma concentrations in Pacific peoples with heart failure. ESC Heart Fail. 2025 Aug;12(4):2976-2984. doi: 10.1002/ehf2.15314.

4)       Rubini Giménez M, Koechlin L, López-Ayala P, et al. Clinical implications of sex-specific upper reference limits for high-sensitivity cardiac troponin I in myocardial infarction diagnosis. Rev Esp Cardiol (Engl Ed). 2025 Dec;78(12):1064-1075. English, Spanish. doi: 10.1016/j.rec.2025.05.003.

5)       Harris R, Paine SJ, Atkinson J, et al. We still don't count: the under-counting and under-representation of Māori in health and disability sector data. N Z Med J. 2022 Dec 16;135(1567):54-78. doi: 10.26635/6965.5849.

6)       Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med. 2006 Jan 15;25(1):127-41. doi: 10.1002/sim.2331.

7)       Nead KT, Wehner MR, Mitra N. The Use of “Trend” Statements to Describe Statistically Nonsignificant Results in the Oncology Literature. JAMA Oncol. 2018;4(12):1778-1779. doi: 10.1001/jamaoncol.2018.4524.

8)       Goodwin DWT, Boulton A, Stayner C, Mann J. Authentic co-design: an essential prerequisite to health equity in Aotearoa New Zealand. J R Soc N Z. 2025 Apr 2;55(6):2600-2614. doi: 10.1080/03036758.2025.2480207.

9)       Willink R. One-tailed hypothesis tests in dental research: more bite. N Z Dent J. 2023 Mar;119:2-6.

10)    Hemming K, Javid I, Taljaard M. A review of high impact journals found that misinterpretation of non-statistically significant results from randomized trials was common. J Clin Epidemiol. 2022 May;145:112-120. doi: 10.1016/j.jclinepi.2022.01.014.

11)    Head ML, Holman L, Lanfear R, et al. The extent and consequences of p-hacking in science. PLoS Biol. 2015 Mar 13;13(3):e1002106. doi: 10.1371/journal.pbio.1002106.

12)    Wasserstein RL, Lazar NA. The ASA’s Statement on p-Values: Context, Process, and Purpose. The American Statistician. 2016 Mar 2;70(2):129-133. doi: 10.1080/00031305.2016.1154108.

13)    Cameron C, Turner R, Samaranayaka A. Understanding confidence intervals and why they are so important. The New Zealand Medical Student Journal. 2021;0(33):42-43. doi: 10.57129/AGAG5939.

14)    Wasserstein RL, Schirm AL, Lazar NA. Moving to a World Beyond “p < 0.05”. The American Statistician. 2019;73(sup1):1-19. doi: 10.1080/00031305.2019.1583913.

15)    Greenland S, Senn SJ, Rothman KJ, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016 Apr;31(4):337-50. doi: 10.1007/s10654-016-0149-3.

16)    Riley RD, Snell KIE, Ensor J, et al. Minimum sample size for developing a multivariable prediction model: Part I - Continuous outcomes. Stat Med. 2019 Mar 30;38(7):1262-1275. doi: 10.1002/sim.7993.

17)    Rafi Z, Greenland S. Semantic and cognitive tools to aid statistical science: replace confidence and significance by compatibility and surprise. BMC Med Res Methodol. 2020 Sep 30;20(1):244. doi: 10.1186/s12874-020-01105-9.