Statistical Reporting Guidelines for Academic Research

General Principles for Reporting Statistical Results in Academic Manuscripts

Accurate and transparent reporting of statistics is foundational to scholarly integrity: it enables research findings to be communicated, interpreted, reproduced, and trusted.

Introduction

Accurate and transparent statistical reporting is a foundation of scholarly integrity. It supports the communication and interpretation of research findings so that they can be reproduced and trusted. In recent years, journals in medical and methodological research have issued guidelines aimed at greater clarity in the reporting of both descriptive and inferential statistics. This article reviews what we consider best practices for statistical reporting, drawing on recent literature in clinical, biomedical, and life sciences research (Boutron et al., 2008; Concato et al., 2000).

Precision in Descriptive Statistics

1. Appropriate Precision

Results should be reported with a degree of precision consistent with the accuracy of the measurement; reporting many decimal places implies precision the data do not have. For instance, mean age is usually best reported to the nearest year, unless the measurement instrument justifies finer precision (Boutron et al., 2008). Context should also guide how many decimal places are appropriate.
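To illustrate, a minimal Python sketch (with hypothetical ages recorded in whole years) shows how reporting precision should track measurement precision:

```python
# Hypothetical ages recorded in whole years; the mean should be
# reported at (or near) the precision of the measurement itself.
ages = [34, 41, 29, 56, 48, 37]

mean_age = sum(ages) / len(ages)
print(f"Mean age = {mean_age:.0f} years")    # "Mean age = 41 years"
print(f"Misleading: {mean_age:.4f} years")   # implies precision the data lack
```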

2. Sample Sizes and Proportions

Transparency is important when reporting total and subgroup sample sizes. For example, researchers can report n = 150 for the full sample and state the size of each subgroup. When reporting percentages, raw counts should accompany the percentages (Guillemin et al., 2019; Concato et al., 2000). This improves interpretability and enables accurate cross-study comparison.
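As a sketch, the following Python snippet (with hypothetical counts) formats subgroup results so raw counts always accompany percentages:

```python
# Hypothetical total and subgroup counts; prints "n = count (percent)".
total = 150
subgroups = {"Treatment": 74, "Control": 76}

for name, n in subgroups.items():
    print(f"{name}: n = {n} ({100 * n / total:.1f}%)")
# Treatment: n = 74 (49.3%)
# Control: n = 76 (50.7%)
```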

Choosing Summary Statistics

1. Normal vs. Non-Normal Distributions

Summary statistics should reflect the underlying data distribution. Normally distributed variables are best described by the mean and standard deviation (SD); non-normal or skewed data are best described by the median, interquartile range (IQR), and range endpoints (e.g., “median = 4.0; IQR = 2.0–6.5; range = 1–10”) (Stang et al., 2022; Boutron et al., 2008). Omitting range endpoints can obscure the shape of the distribution.
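A minimal Python sketch, using SciPy and hypothetical skewed data, shows one way to let a normality check guide the choice of summary statistics (visual checks and sample size should inform the decision as well):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=1.0, sigma=0.8, size=200)  # hypothetical skewed data

# Shapiro-Wilk as a rough normality screen.
_, p_normal = stats.shapiro(x)

if p_normal > 0.05:
    # Approximately normal: mean and SD are appropriate.
    print(f"mean = {x.mean():.1f}; SD = {x.std(ddof=1):.1f}")
else:
    # Skewed: report median, IQR, and range endpoints.
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    print(f"median = {med:.1f}; IQR = {q1:.1f}-{q3:.1f}; "
          f"range = {x.min():.1f}-{x.max():.1f}")
```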

2. Avoiding Misuse of SEM in Descriptions

The standard error of the mean (SEM) measures the precision of an estimate, not the variability of the data. Best practice is to use the SD (or the IQR if the data are non-normal) to depict variability (Boutron et al., 2008; Guillemin et al., 2019); the SEM should be reserved for inferential contexts, since using it descriptively understates variability (Concato et al., 2000).
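The distinction is easy to demonstrate. In this sketch with hypothetical data, the SEM is smaller than the SD by a factor of √n, which is why using it descriptively understates spread:

```python
import numpy as np

x = np.array([4.1, 5.3, 6.0, 4.8, 5.5, 6.2, 5.0, 4.4])  # hypothetical values

sd = x.std(ddof=1)          # variability of the data -> use in descriptions
sem = sd / np.sqrt(len(x))  # precision of the mean -> inferential use only

print(f"SD  = {sd:.2f}")    # describes the spread of observations
print(f"SEM = {sem:.2f}")   # shrinks with n; not a measure of spread
```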

Tables and Figures

1. Tables

Tables are best used to report precise numerical values, including means, SDs, medians, IQRs, sample sizes, and statistical test results, and should be displayed in a simple, readable form (Boutron et al., 2008; Huser et al., 2019).

2. Figures

Graphs (e.g., histograms, boxplots, and Kaplan–Meier curves) complement tables: they allow readers to assess trends and assumptions (e.g., normality, homoscedasticity). Figures should avoid misleading elements such as truncated axes and 3D effects (Johnston & Hauser, 2014; Boutron et al., 2008), and should use consistent labeling and legible design (e.g., visible scales, clearly distinguished line styles) (Huser et al., 2019).
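As an illustration, a minimal matplotlib sketch (hypothetical data) draws a boxplot with labeled, non-truncated axes:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
groups = [rng.normal(5, 1.5, 40), rng.normal(6, 2.0, 40)]  # hypothetical

fig, ax = plt.subplots()
ax.boxplot(groups)
ax.set_xticks([1, 2])
ax.set_xticklabels(["Control", "Treatment"])
ax.set_ylabel("Outcome score")
ax.set_ylim(bottom=0)  # start the axis at zero to avoid visual exaggeration
ax.set_title("Outcome by group (n = 40 per arm)")
fig.savefig("boxplot.png", dpi=300)
```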

Reporting Inferential Statistics

1. Exact p-Values and Confidence Intervals (CIs)

Inferential findings should report exact p-values (e.g., p = 0.012) rather than vague categories such as p < 0.05 (Johnston & Hauser, 2014; Guillemin et al., 2019). Report 95% CIs alongside p-values so that results can be interpreted for both statistical significance and clinical importance (Huser et al., 2019).
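A minimal sketch with SciPy (version ≥ 1.10, which added confidence_interval() on the t-test result) and hypothetical data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(5.0, 1.5, 50)  # hypothetical group A
b = rng.normal(5.8, 1.5, 50)  # hypothetical group B

res = stats.ttest_ind(a, b)
ci = res.confidence_interval(confidence_level=0.95)  # CI for the mean difference

print(f"p = {res.pvalue:.3f}")  # exact p-value, not "p < 0.05"
print(f"95% CI for the difference: {ci.low:.2f} to {ci.high:.2f}")
```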

2. Effect Sizes and Test Statistics

Report test statistics and effect sizes together, for example: t(98) = 2.45, p = 0.016, Cohen’s d = 0.49. This provides full transparency for each result (Boutron et al., 2008). Effect sizes convey magnitude and therefore address the limits of interpreting p-values alone (Concato et al., 2000).
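SciPy does not compute Cohen’s d directly, so this sketch includes a small pooled-SD helper (a standard formula, applied here to hypothetical data):

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d for two independent groups, using the pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(3)
a = rng.normal(5.5, 1.5, 50)  # hypothetical group A
b = rng.normal(5.0, 1.5, 50)  # hypothetical group B

t, p = stats.ttest_ind(a, b)
df = len(a) + len(b) - 2
print(f"t({df}) = {t:.2f}, p = {p:.3f}, Cohen's d = {cohens_d(a, b):.2f}")
```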

3. Degrees of Freedom

Reporting degrees of freedom promotes reproducibility (Johnston & Hauser, 2014; Boutron et al., 2008). Results of t-tests should be stated in the form “t(df) = value”; the degrees of freedom indicate the sample size on which the statistic is based.

4. Multiple Comparisons and Alpha Thresholds

For research involving multiple comparisons, the family-wise error rate must be controlled, with corrections such as Bonferroni or Holm, or by controlling the false-discovery rate (Concato et al., 2000; Guillemin et al., 2019). Authors should state whether tests were one- or two-tailed and specify the alpha threshold, most commonly α = 0.05 (Huser et al., 2019).
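A minimal sketch of these corrections using the multipletests function from statsmodels (hypothetical raw p-values):

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.012, 0.034, 0.041, 0.20]  # hypothetical raw p-values

# Bonferroni and Holm control the family-wise error rate;
# "fdr_bh" (Benjamini-Hochberg) controls the false-discovery rate.
for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, [f"{p:.3f}" for p in p_adj])
```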

Statistical Software and Reproducibility

Transparency and reproducibility are enhanced (at least in principle) when authors report the statistical software and version used (e.g., R 4.2.2, SPSS v29) and publicly share their analysis workflow through a community repository (e.g., OSF or GitHub). Neuroscience journals have been proactive about open science, even advocating that authors publish their code so that others can replicate their findings (Huser et al., 2019). In line with the recommendations above, JAMA’s guidelines likewise suggest that authors provide code where possible (Concato et al., 2000).
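In Python, a minimal sketch of capturing the environment for the methods section (R users would report sessionInfo() output analogously):

```python
import sys
import numpy
import scipy

# Record the interpreter and key package versions for the manuscript.
print(f"Python {sys.version.split()[0]}")
print(f"NumPy {numpy.__version__}, SciPy {scipy.__version__}")
```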

Justifying Statistical Tests and Checking Assumptions

Reports should document the justification for the chosen test procedures, the checking of their assumptions (e.g., Shapiro–Wilk for normality, Levene’s test for homogeneity of variance), and any data transformations (e.g., log-transformation of a skewed variable) (Johnston & Hauser, 2014; Guillemin et al., 2019). Failure to report these aspects calls the validity of the findings into question (Boutron et al., 2008).
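A minimal SciPy sketch with hypothetical skewed data, checking normality and variance homogeneity and documenting a log-transformation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
a = rng.lognormal(1.0, 0.6, 60)  # hypothetical skewed outcome, group A
b = rng.lognormal(1.2, 0.6, 60)  # hypothetical skewed outcome, group B

print("Shapiro-Wilk p (normality, A):", round(stats.shapiro(a).pvalue, 4))
print("Levene p (equal variances):  ", round(stats.levene(a, b).pvalue, 4))

# If normality fails, a log-transformation may restore it; the
# transformation itself must be reported in the manuscript.
print("Shapiro-Wilk p after log-transform:",
      round(stats.shapiro(np.log(a)).pvalue, 4))
```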

Handling Missing Data

As with statistical tests, it is important to be clear and explicit about the proportion of missing data, the methods used to handle it (e.g., multiple imputation, complete-case analysis), and any potential for resulting bias (Huser et al., 2019; Concato et al., 2000). Reports that omit this information are less credible.
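A minimal pandas sketch (hypothetical data) that reports missingness per variable before a strategy is chosen; complete-case analysis is shown only as the simplest option, and its potential bias should be discussed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({  # hypothetical dataset with missing values
    "age":   [34, 41, np.nan, 56, 48, np.nan],
    "score": [5.1, np.nan, 6.0, 4.8, 5.5, 6.2],
})

# Report the proportion missing per variable.
print((df.isna().mean() * 100).round(1))

complete = df.dropna()  # complete-case analysis; note potential bias
print(f"Complete cases: n = {len(complete)} of {len(df)}")
```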

Sample Size Determination and Power Analysis

Adequately powered studies document their sample size calculation, citing the α threshold, the planned effect size, and the intended statistical power (typically 80%) (Guillemin et al., 2019; Johnston & Hauser, 2014). Underpowered studies tend to produce exaggerated effect sizes and poor replicability (Huser et al., 2019).
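A minimal statsmodels sketch solving for the required sample size per group under conventional assumptions (the planned d = 0.5 is hypothetical):

```python
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,  # planned Cohen's d (hypothetical)
    alpha=0.05,       # two-sided significance threshold
    power=0.80,       # conventional 80% power
)
print(f"Required n per group: {n_per_group:.0f}")  # ~64
```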

Protocol Preregistration

Preregistering the study protocol and analysis plan (e.g., on ClinicalTrials.gov or the OSF) specifies the methods and measured variables in advance and reduces biases associated with selective reporting (Boutron et al., 2008; Nosek et al., 2021). It is acceptable to deviate from a preregistered approach, but any deviations should be described and justified.

Ethical Reporting and Inclusion

Many current guidelines recommend reporting participants’ demographic characteristics, including sex, age group, and ethnicity, and presenting subgroup characteristics in a separate table to support subgroup analysis (Boutron et al., 2008). Such transparency enhances generalizability and guards against over-generalization.

Summary Checklist for Manuscript Reporting

To promote rigorous reporting and publication, researchers should consider the following actions:

  • Report the total sample size and all subgroup sizes.
  • Report raw counts alongside any values presented as percentages.
  • Summarize data as mean/SD or median/IQR, whichever reflects the distribution.
  • Avoid using the SEM in descriptive statistics.
  • Present exact p-values and confidence intervals.
  • Report test statistics in full, including degrees of freedom and effect sizes.
  • State whether and how tests were adjusted for multiple testing.
  • Report the software used and make analytical code publicly available where possible.
  • Justify the chosen tests and any transformations, and describe how assumptions were checked.
  • Describe how missing data were handled (e.g., excluded or imputed).
  • Report any a priori power analyses and their results.
  • Cite the preregistration and describe any deviations from it.
  • Report demographic characteristics and, where appropriate, subgroup analyses.

Conclusion

Adherence to accurate and precise reporting standards strengthens scientific credibility and reproducibility. We have compiled these guidelines, drawn from recent BMJ, The Journal of Bone & Joint Surgery, BMC Medical Ethics, JAMIA, Annals of the New York Academy of Sciences, and Journal of Neuroscience Research publications, as a practical framework for sound statistical reporting. When researchers write their manuscripts with clarity, accuracy, and transparency, they enhance the impact and reliability of their findings.

References

1. Boutron, I., Moher, D., Altman, D. G., Schulz, K. F., & Ravaud, P. (2008). Extending the CONSORT statement to randomized trials of nonpharmacologic treatment: Explanation and elaboration. BMJ, 329(7471), 883. https://www.bmj.com/content/329/7471/883.full

2. Concato, J., Shah, N., & Horwitz, R. I. (2000). Randomized, controlled trials, observational studies, and the hierarchy of research designs. The Journal of Bone & Joint Surgery, 87(Supplement_2), 2–7. https://journals.lww.com/jbjsjournal/fulltext/2009/05003/Analysis_of_Observational_Studies__A_Guide_to.9.aspx

3. Guillemin, M., Gillam, L., Rosenthal, D., & Bolitho, A. (2019). Researcher discomfort in qualitative research: acknowledging emotional challenges in research with vulnerable populations. BMC Medical Ethics, 20, Article 39. https://link.springer.com/article/10.1186/s12910-019-0359-9

4. Huser, V., Cimino, J. J., & Lai, A. M. (2019). Desiderata for computable representations of electronic health records-driven phenotype algorithms. Journal of the American Medical Informatics Association, 26(3), 185–195. https://academic.oup.com/jamia/article-abstract/26/3/185/5301680

5. Ioannidis, J. P. A. (2018). The proposal to lower p value thresholds to .005. Annals of the New York Academy of Sciences, 1429(1), 1–10. https://nyaspubs.onlinelibrary.wiley.com/doi/abs/10.1111/nyas.13325

6. Johnston, M., & Hauser, S. L. (2014). Reporting standards for preclinical and clinical research in neuroscience. Journal of Neuroscience Research, 92(9), 1150–1152. https://onlinelibrary.wiley.com/doi/full/10.1002/jnr.24340

7. Shane, E., Burr, D., Abrahamsen, B., Adler, R. A., Brown, T. D., Cheung, A. M., … & Watts, N. B. (2019). Atypical subtrochanteric and diaphyseal femoral fractures: Second report of a task force of the American Society for Bone and Mineral Research. Journal of Bone and Mineral Research, 34(11), 1985–2012. https://academic.oup.com/jbmr/article-abstract/34/11/1981/7606045