Multiple Imputation of Missing Race and Ethnicity in CDC COVID-19 Case-Level Surveillance Data
Keywords: Multiple Imputation, Missing Data, Race and Ethnicity, Health Equity
The COVID-19 pandemic has placed a disproportionate burden on racial and ethnic minority groups, but incomplete surveillance data limit understanding of these disparities. CDC’s case-based surveillance system contains most COVID-19 cases in the United States. The data analyzed in this paper contain COVID-19 cases with case-level information through September 25, 2020, which represent 70.9% of all COVID-19 cases reported to CDC during the period. Case-level surveillance data are used to investigate COVID-19 disparities by race/ethnicity, sex, and age. However, demographic information on race and ethnicity is missing for a substantial percentage of COVID-19 cases (e.g., 35.8% and 47.2% of cases analyzed were missing race and ethnicity information, respectively). Our goal in this study was to impute missing race and ethnicity to derive more accurate incidence and incidence rate ratio (IRR) estimates for different racial and ethnic groups, and to evaluate the imputation results against complete case analysis, which removes cases with missing race/ethnicity information from the analysis. Two multiple imputation (MI) models were developed: Model 1 imputes race using six binary race variables, and Model 2 imputes race as a composite multinomial variable. Our evaluation found that, compared with complete case analysis, MI reduced bias and improved coverage of incidence and IRR estimates for all race/ethnicity groups except the Non-Hispanic Multiple/other group. Our research highlights the importance of supplementing complete case analysis with additional methods of analysis to better describe racial and ethnic disparities. When race and ethnicity data are missing, multiple imputation may provide more accurate incidence and IRR estimates for monitoring these disparities, in tandem with efforts to improve the collection of race and ethnicity information for pandemic surveillance.
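To illustrate how estimates from several imputed datasets are typically combined, the following is a minimal Python sketch of Rubin's combining rules applied to a log incidence rate ratio. The abstract does not state which pooling procedure or software was used, so this is an assumption about the general MI workflow and not the authors' implementation. The function names `pool_rubin` and `pooled_irr` and all input numbers are hypothetical, and the interval uses a normal approximation in place of the t reference distribution usually paired with these rules.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates with Rubin's rules.

    estimates: point estimates, one per imputed dataset (e.g. log IRR)
    variances: the squared standard error of each estimate
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)      # pooled point estimate
    w_bar = mean(variances)      # average within-imputation variance
    b = variance(estimates)      # between-imputation variance (divisor m - 1)
    total = w_bar + (1 + 1 / m) * b
    return q_bar, total


def pooled_irr(log_irrs, variances, z=1.96):
    """Back-transform a pooled log IRR to the ratio scale.

    Uses a normal-approximation interval (a simplification).
    Returns (IRR, lower bound, upper bound).
    """
    q_bar, total = pool_rubin(log_irrs, variances)
    half_width = z * math.sqrt(total)
    return (math.exp(q_bar),
            math.exp(q_bar - half_width),
            math.exp(q_bar + half_width))


if __name__ == "__main__":
    # Hypothetical log IRRs and variances from five imputed datasets;
    # these are illustrative values, not results from the paper.
    log_irrs = [0.52, 0.55, 0.49, 0.53, 0.51]
    variances = [0.004, 0.004, 0.005, 0.004, 0.004]
    irr, lower, upper = pooled_irr(log_irrs, variances)
    print(f"pooled IRR = {irr:.2f} ({lower:.2f}, {upper:.2f})")
```

Pooling on the log scale and exponentiating afterwards keeps the interval asymmetric around the ratio, which is the usual convention for rate ratios. The total variance includes the between-imputation term, so the pooled interval is wider than one computed from any single imputed dataset.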
This work is licensed under a Creative Commons Attribution 4.0 International License.