Multiple Imputation of Missing Race and Ethnicity in CDC COVID-19 Case-Level Surveillance Data


  • Guangyu Zhang, CDC COVID-19 Response Team, Centers for Disease Control and Prevention, Atlanta, Georgia
  • Charles E. Rose, CDC COVID-19 Response Team, Centers for Disease Control and Prevention, Atlanta, Georgia
  • Yujia Zhang, CDC COVID-19 Response Team, Centers for Disease Control and Prevention, Atlanta, Georgia
  • Rui Li, Health Resources and Services Administration, Rockville, Maryland, USA
  • Florence C. Lee, CDC COVID-19 Response Team, Centers for Disease Control and Prevention, Atlanta, Georgia
  • Greta Massetti, CDC COVID-19 Response Team, Centers for Disease Control and Prevention, Atlanta, Georgia
  • Laura E. Adams, CDC COVID-19 Response Team, Centers for Disease Control and Prevention, Atlanta, Georgia



Multiple Imputation, Missing Data, Race and Ethnicity, Health Equity


The COVID-19 pandemic has placed a disproportionate burden on racial and ethnic minority groups, but incomplete surveillance data limit understanding of these disparities. CDC’s case-based surveillance system contains most COVID-19 cases in the United States. The data analyzed in this paper comprise COVID-19 cases with case-level information through September 25, 2020, which represent 70.9% of all COVID-19 cases reported to CDC during the period. Case-level surveillance data are used to investigate COVID-19 disparities by race/ethnicity, sex, and age. However, race and ethnicity information is missing for a substantial percentage of COVID-19 cases (35.8% and 47.2% of the cases analyzed were missing race and ethnicity information, respectively). Our goal in this study was to impute missing race and ethnicity to derive more accurate incidence and incidence rate ratio (IRR) estimates for different racial and ethnic groups, and to evaluate the imputation results against complete case analysis, which removes cases with missing race/ethnicity information from the analysis. Two multiple imputation (MI) models were developed. Model 1 imputes race using six binary race variables, and Model 2 imputes race as a composite multinomial variable. Our evaluation found that, compared with complete case analysis, MI reduced bias and improved coverage for incidence and IRR estimates for all race/ethnicity groups except the Non-Hispanic Multiple/other group. Our research highlights the importance of supplementing complete case analysis with additional methods of analysis to better describe racial and ethnic disparities. When race and ethnicity data are missing, multiple imputation may provide more accurate incidence and IRR estimates for monitoring these disparities, in tandem with efforts to improve the collection of race and ethnicity information for pandemic surveillance.
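
As an illustration of how estimates from multiply imputed datasets are typically combined, the following is a minimal Python sketch of Rubin's combining rules applied to an IRR. It is not the authors' implementation. It assumes the IRR is pooled on the log scale, uses a normal approximation for the interval instead of Rubin's t reference distribution, and the function name `pool_rubin` and all numeric inputs are hypothetical values chosen only for illustration.

```python
import math


def pool_rubin(estimates, variances):
    """Combine M point estimates and their variances with Rubin's rules.

    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = sum(estimates) / m                               # pooled point estimate
    w_bar = sum(variances) / m                               # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    total_var = w_bar + (1 + 1 / m) * b                      # total variance
    return q_bar, total_var


# Hypothetical log-IRR estimates and variances from M = 5 imputed datasets
# (illustrative numbers only, not results from the paper).
log_irr = [0.52, 0.55, 0.49, 0.53, 0.51]
var_log_irr = [0.0004, 0.0004, 0.0005, 0.0004, 0.0005]

pooled_log_irr, total_var = pool_rubin(log_irr, var_log_irr)
se = math.sqrt(total_var)

# Back-transform to the IRR scale; 1.96 is the normal-approximation quantile.
irr = math.exp(pooled_log_irr)
ci_low = math.exp(pooled_log_irr - 1.96 * se)
ci_high = math.exp(pooled_log_irr + 1.96 * se)
print(f"Pooled IRR: {irr:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

The total variance adds the between-imputation component to the average within-imputation variance, so the pooled interval reflects the extra uncertainty due to the missing race and ethnicity values. Complete case analysis carries no such component.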



How to Cite

Zhang, G., Rose, C. E., Zhang, Y., Li, R., Lee, F. C., Massetti, G., & Adams, L. E. (2022). Multiple Imputation of Missing Race and Ethnicity in CDC COVID-19 Case-Level Surveillance Data. International Journal of Statistics in Medical Research, 11, 1–11.



General Articles