Performance of the Classical Model in Feature Selection Across Varying Database Sizes of Healthcare Data
DOI: https://doi.org/10.6000/1929-6029.2024.13.21

Keywords: Models, regression analysis, machine learning, data interpretation, variable selection

Abstract
Machine learning is increasingly being applied to medical research, particularly in selecting variables for predictive modelling. By identifying relevant variables, researchers can improve model accuracy and reliability, leading to better clinical decisions and reduced overfitting. Efficient use of resources and the validity of medical research findings depend on selecting the right variables. However, few studies compare the performance of classical and modern feature selection methods on health datasets, highlighting the need for a critical evaluation to choose the most suitable approach. We analysed the performance of six variable selection methods: stepwise, forward, and backward selection using p-values and AIC, as well as LASSO and Elastic Net. Health-related surveillance data on behaviors, health status, and medical service usage were used across ten databases, with sizes ranging from 10% to 100% of the original data while maintaining consistent outcome proportions. Varying database sizes were used to assess their impact on prediction models, as size can significantly influence accuracy, overfitting, generalizability, statistical power, reliability of parameter estimates, computational complexity, and variable selection. The stepwise and backward AIC models showed the highest accuracy, with an Area under the ROC Curve (AUC) of 0.889. Despite their sparsity, the LASSO and Elastic Net models also performed well. The study also found that binary variables were weighted as more important by the LASSO and Elastic Net models. Importantly, the importance of variables remained consistent across database sizes. The study shows no major variation in either the model fit metric or the number of selected variables for the stepwise and backward p-value models, irrespective of database size. The LASSO and Elastic Net models surpassed the other models across all database sizes, and did so with fewer variables.
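The abstract contrasts AIC-driven stepwise selection with the penalised LASSO/Elastic Net approach. As a minimal sketch of the two selection mechanisms (all numbers are illustrative assumptions, not values from the study), AIC-based backward elimination compares nested models by penalised log-likelihood, while the LASSO drops variables by soft-thresholding their coefficients to exactly zero:

```python
import math

def aic(log_likelihood: float, k: int) -> float:
    """Akaike Information Criterion: 2k - 2*ln(L); lower is better."""
    return 2 * k - 2 * log_likelihood

def soft_threshold(z: float, lam: float) -> float:
    """LASSO coordinate update: shrinks a coefficient toward zero and
    sets it exactly to zero when |z| <= lam (variable excluded)."""
    return math.copysign(max(abs(z) - lam, 0.0), z)

# Hypothetical log-likelihoods for two nested logistic models:
# a full model with 10 predictors and a reduced model with 9
# (k counts predictors plus the intercept)
ll_full, ll_reduced = -420.0, -421.5
aic_full = aic(ll_full, k=11)        # 862.0
aic_reduced = aic(ll_reduced, k=10)  # 863.0

# Backward AIC elimination drops a variable only if AIC does not worsen
print(aic_reduced <= aic_full)       # False -> keep the variable

# Soft-thresholding: a weak coefficient is zeroed, a strong one shrunk
print(soft_threshold(0.3, 0.5))      # 0.0 -> variable excluded
print(soft_threshold(1.2, 0.5))      # 0.7 -> variable kept, shrunk
```

This also illustrates why the penalised models can match the stepwise models with fewer variables: shrinkage performs estimation and selection in one step, rather than refitting a sequence of nested models.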
License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Policy for Journals/Articles with Open Access
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are permitted and encouraged to post links to their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.
Policy for Journals/Manuscripts with Paid Access
Authors who publish with this journal agree to the following terms:
- The publisher retains copyright.
- Authors are permitted and encouraged to post links to their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.