Choosing Exploratory, Predictive, or Causal Multivariable Models in Biomedical Research: A Practical Methodological Guide
DOI: https://doi.org/10.6000/1929-6029.2026.15.16
Keywords: Models, Statistical; Regression Analysis; Causality; Confounding Factors, Epidemiologic; Calibration; Decision Support Techniques
Abstract
Background: Multivariable regression is widely used in biomedical research, but models built for different purposes are often treated as if they were interchangeable. Conflating these purposes leads to inappropriate choices in variable handling, covariate adjustment, model evaluation, and interpretation.
Objective: To provide a practical guide for clinicians, biomedical researchers, and collaborating statisticians on how to choose and report multivariable models according to whether the aim is exploratory, predictive, or causal.
Methods: We prepared a narrative methodological tutorial using a targeted search of PubMed, Scopus, and Google Scholar, together with key textbooks and reporting guidance (STROBE, TRIPOD+AI, and PROBAST+AI). We prioritized seminal papers and recent methodological references (2021-2025) on variable prespecification, continuous predictors, validation, calibration, and causal diagrams. Illustrative examples are simulated and are used only for didactic purposes.
Results: The first step is to state the analytic objective explicitly. Exploratory models describe adjusted associations and generate hypotheses. Predictive models aim to estimate individual risk and therefore require attention to discrimination, calibration, and both internal and external validation. Causal models aim to estimate an effect and should rely on temporality, substantive knowledge, and directed acyclic graphs (DAGs) to define adjustment sets. Across objectives, arbitrary dichotomization and univariable screening are discouraged. Continuous predictors should usually be kept on their original scale, with flexible functions such as restricted cubic splines when nonlinearity is plausible. Penalization is generally preferable to stepwise procedures when overfitting is a concern in prediction, whereas full theory-based models are often preferable in causal analyses.
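As a rough illustration of why predictive models need both discrimination and calibration to be checked, the following sketch simulates a binary outcome and compares two sets of predicted risks that rank patients identically. Everything here is an assumption made for illustration (the simulated data, the helper names `c_statistic` and `observed_expected_ratio`, and the square-root distortion used to inflate risks); it is not taken from the article's own examples.

```python
import numpy as np


def c_statistic(y, p):
    """Share of event/non-event pairs in which the event has the higher
    predicted risk (ties count as one half)."""
    y = np.asarray(y)
    p = np.asarray(p)
    events = p[y == 1]
    non_events = p[y == 0]
    diff = events[:, None] - non_events[None, :]
    concordant = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(concordant / diff.size)


def observed_expected_ratio(y, p):
    """Observed event proportion divided by mean predicted risk
    (a crude check of calibration-in-the-large)."""
    return float(np.mean(y) / np.mean(p))


# Simulated cohort (hypothetical): one predictor, logistic outcome model.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
true_risk = 1.0 / (1.0 + np.exp(-(-1.0 + 1.0 * x)))
y = rng.binomial(1, true_risk)

# Two sets of predictions with identical ranking of patients.
well_calibrated = true_risk
overestimated = np.sqrt(true_risk)  # monotone distortion that inflates risks

auc_good = c_statistic(y, well_calibrated)
auc_bad = c_statistic(y, overestimated)
oe_good = observed_expected_ratio(y, well_calibrated)
oe_bad = observed_expected_ratio(y, overestimated)

print(f"C-statistic: {auc_good:.3f} vs {auc_bad:.3f}")
print(f"O/E ratio:   {oe_good:.3f} vs {oe_bad:.3f}")
```

Because the distortion preserves the ordering of patients, the C-statistic is unchanged, while the observed/expected ratio falls well below 1 for the inflated predictions. A model can therefore discriminate well and still give misleading absolute risks, which is why the two properties are assessed separately.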
Conclusions: The research question should determine the model, not the reverse. A practical workflow is to define the objective first, prespecify candidate variables, choose a functional form that preserves information, and evaluate the model with objective-specific criteria. Clear separation of exploratory, predictive, and causal aims improves transparency, interpretability, and clinical usefulness.
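The advice to choose a functional form that preserves information can be made concrete with a small simulation. The sketch below is an illustrative assumption, not the article's own example: it generates a normally distributed predictor with a truly linear effect, then compares the variance explained when the predictor is kept continuous with the variance explained after a median split. The variable names and the effect size of 0.5 are hypothetical.

```python
import numpy as np


def r_squared(predictor, outcome):
    """R^2 of a simple least-squares line of outcome on predictor,
    computed as the squared Pearson correlation."""
    r = np.corrcoef(predictor, outcome)[0, 1]
    return float(r ** 2)


# Simulated data (hypothetical): outcome is linear in a continuous predictor.
rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

# Arbitrary dichotomization at the sample median ("high" vs "low").
x_split = (x > np.median(x)).astype(float)

r2_continuous = r_squared(x, y)
r2_dichotomized = r_squared(x_split, y)
retained = r2_dichotomized / r2_continuous

print(f"R^2 with continuous predictor:   {r2_continuous:.3f}")
print(f"R^2 with median-split predictor: {r2_dichotomized:.3f}")
print(f"Share of explained variance retained: {retained:.2f}")
```

For a normally distributed predictor with a linear effect, theory puts the retained share at about 2/π (roughly 64%), so the median split discards about a third of the explained variance in this setting. The loss would differ under other distributions, cut points, or nonlinear relationships.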
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.