Performance of the Classical Model in Feature Selection Across Varying Database Sizes of Healthcare Data
DOI: https://doi.org/10.6000/1929-6029.2024.13.21

Keywords: Models, regression analysis, machine learning, data interpretation, variable selection

Abstract
Machine learning is increasingly being applied to medical research, particularly in selecting variables for predictive modelling. By identifying relevant variables, researchers can improve model accuracy and reliability, leading to better clinical decisions and reduced overfitting. Efficient use of resources and the validity of medical research findings depend on selecting the right variables. However, few studies compare the performance of classical and modern feature selection methods on health datasets, highlighting the need for a critical evaluation to choose the most suitable approach. We analysed the performance of six variable selection methods: stepwise, forward, and backward selection using p-values and AIC, as well as LASSO and Elastic Net. Health-related surveillance data on behaviors, health status, and medical service usage were used across ten databases, with sizes ranging from 10% to 100% of the original data while maintaining consistent outcome proportions. Varying database sizes were used to assess their impact on prediction models, as size can significantly influence accuracy, overfitting, generalizability, statistical power, reliability of parameter estimates, computational complexity, and variable selection. The stepwise and backward AIC models showed the highest accuracy, with an Area under the ROC Curve (AUC) of 0.889. Despite their sparsity, the LASSO and Elastic Net models also performed well. The study also found that binary variables were weighted as more important by the LASSO and Elastic Net models. Importantly, the importance of variables remained consistent across database sizes. The study shows no major variation in either the model fit metric or the number of selected variables for the stepwise and backward p-value models, irrespective of database size. The LASSO and Elastic Net models surpassed the other models across all database sizes, and did so with fewer variables.
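The abstract contrasts AIC-driven stepwise selection with the penalised LASSO/Elastic Net approach. As a minimal sketch of the two selection mechanisms (all numbers are illustrative assumptions, not values from the study), AIC-based backward elimination compares nested models by penalised log-likelihood, while the LASSO drops variables by soft-thresholding their coefficients to exactly zero:

```python
import math

def aic(log_likelihood: float, k: int) -> float:
    """Akaike Information Criterion: 2k - 2*ln(L); lower is better."""
    return 2 * k - 2 * log_likelihood

def soft_threshold(z: float, lam: float) -> float:
    """LASSO coordinate update: shrinks a coefficient toward zero and
    sets it exactly to zero when |z| <= lam (variable excluded)."""
    return math.copysign(max(abs(z) - lam, 0.0), z)

# Hypothetical log-likelihoods for two nested logistic models:
# a full model with 10 predictors and a reduced model with 9
# (k counts predictors plus the intercept)
ll_full, ll_reduced = -420.0, -421.5
aic_full = aic(ll_full, k=11)        # 862.0
aic_reduced = aic(ll_reduced, k=10)  # 863.0

# Backward AIC elimination drops a variable only if AIC does not worsen
print(aic_reduced <= aic_full)       # False -> keep the variable

# Soft-thresholding: a weak coefficient is zeroed, a strong one shrunk
print(soft_threshold(0.3, 0.5))      # 0.0 -> variable excluded
print(soft_threshold(1.2, 0.5))      # 0.7 -> variable kept, shrunk
```

This also illustrates why the penalised models can match the stepwise models with fewer variables: shrinkage performs estimation and selection in one step, rather than refitting a sequence of nested models.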
License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Policy for Journals/Articles with Open Access
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are permitted and encouraged to post links to their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.
Policy for Journals/Manuscripts with Paid Access
Authors who publish with this journal agree to the following terms:
- The publisher retains copyright.
- Authors are permitted and encouraged to post links to their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.