Multiple Imputation of Missing Race and Ethnicity in CDC COVID-19 Case-Level Surveillance Data

The COVID-19 pandemic has resulted in a disproportionate burden on racial and ethnic minority groups, but incompleteness in surveillance data limits understanding of disparities. CDC’s case-based surveillance system contains case-level information on most COVID-19 cases in the United States. Data analyzed in this paper contain COVID-19 cases with case-level information through September 25, 2020, which represent 70.9% of all COVID-19 cases reported to CDC during the period. Case-level surveillance data are used to investigate COVID-19 disparities by race/ethnicity, sex, and age. However, demographic information on race and ethnicity is missing for a substantial percentage of COVID-19 cases (e.g., 35.8% and 47.2% of cases analyzed were missing race and ethnicity information, respectively). Our goal in this study was to impute missing race and ethnicity to derive more accurate incidence and incidence rate ratio (IRR) estimates for different racial and ethnic groups, and evaluate the results from imputation compared to complete case analysis, which involves removing cases with missing race/ethnicity information from the analysis. Two multiple imputation (MI) models were developed. Model 1 imputes race using six binary race variables, and Model 2 imputes race as a composite multinomial variable. Our evaluation found that compared with complete case analysis, MI reduced biases and improved coverage on incidence and IRR estimates for all race/ethnicity groups, except for the Non-Hispanic Multiple/other group. Our research highlights the importance of supplementing complete case analysis with additional methods of analysis to better describe racial and ethnic disparities. When race and ethnicity data are missing, multiple imputation may provide more accurate incidence and IRR estimates to monitor these disparities in tandem with efforts to improve the collection of race and ethnicity information for pandemic surveillance.

This section contains detailed information on the setup of the evaluation study, using Evaluation 1 (low percent of missingness generated from IA propensity models) as an example. Evaluation 2 follows the same procedure, except using the PA data to develop the propensity models. The procedures are described below.
(1) Step 1: Develop the propensity model for missing race using a logistic regression model with IA data: logit(P(race is missing)) = beta0 + X1*beta1. Covariates considered include age, sex, and 18 county-level variables (Table 1); the final model contains only significant predictors (P <= 0.05). Here X1 denotes the vector of variables included in the final model, beta0 is the intercept, and beta1 is the vector of parameter estimates.
(2) Step 2: Develop the propensity model for missing ethnicity using a logistic regression model with IA data: logit(P(ethnicity is missing)) = gamma0 + X2*gamma1. Covariates considered include age, sex, and 18 county-level variables; the final model contains only significant predictors (P <= 0.05). Here X2 denotes the vector of variables included in the final model, gamma0 is the intercept, and gamma1 is the vector of parameter estimates.
(3) Step 3: Calculate the propensity of missing race for the target population. Applying the race propensity model from Step 1 to the MN/UT data, the propensity of missing race (P1) for each person in the MN/UT data is: P1 = exp(beta0 + X1*beta1) / (1 + exp(beta0 + X1*beta1)), where X1 are the observed values from the MN/UT data, and beta0 and beta1 are the parameter estimates from Step 1.
(4) Step 4: Calculate the propensity of missing ethnicity for the target population. Applying the ethnicity propensity model from Step 2 to the MN/UT data, the propensity of missing ethnicity (P2) for each person is: P2 = exp(gamma0 + X2*gamma1) / (1 + exp(gamma0 + X2*gamma1)), where X2 are the observed values from the MN/UT data, and gamma0 and gamma1 are the parameter estimates from Step 2.
(5) Step 5: Generate two random numbers, U1 and U2. For each person in the MN/UT data, generate two random numbers from a Uniform(0, 1) distribution.
(6) Step 6: Generate missing data on race in the MN/UT data. Compare P1 to U1 to decide whether a person has a missing value on race: if U1 < P1, race is set to missing, so that each person's race is missing with probability P1.
(7) Step 7: Generate missing data on ethnicity in the MN/UT data. Compare P2 to U2 to decide whether a person has a missing value on ethnicity: if U2 < P2, ethnicity is set to missing, so that each person's ethnicity is missing with probability P2.
(9) Step 9: Apply the imputation model to the 100 replicates generated from Step 8 with missing race and ethnicity. For each replicate, 10 imputations were conducted.
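The mechanics of applying a fitted propensity model and drawing missingness indicators can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's code: the coefficients, the two covariates, and the toy population are hypothetical; both models share one covariate vector for brevity (in the study X1 and X2 may differ); a value is treated as missing when the uniform draw falls below the propensity, so the probability of missingness equals the propensity; and the 100 replicates are assumed to come from repeating the random draws.

```python
import math
import random

def inv_logit(z):
    """Inverse logit, equal to exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))

def propensity(x, intercept, coefs):
    """Apply a fitted propensity model to one person's covariates."""
    return inv_logit(intercept + sum(b * v for b, v in zip(coefs, x)))

def make_replicate(people, race_model, eth_model, rng):
    """Draw U1 and U2 for each person and blank out race/ethnicity."""
    replicate = []
    for person in people:
        p1 = propensity(person["x"], *race_model)
        p2 = propensity(person["x"], *eth_model)
        u1, u2 = rng.random(), rng.random()
        replicate.append({
            "x": person["x"],
            # Assumption: a draw below the propensity marks the value missing.
            "race": None if u1 < p1 else person["race"],
            "ethnicity": None if u2 < p2 else person["ethnicity"],
        })
    return replicate

# Hypothetical fitted models: (intercept, [coefficient for age, coefficient for sex]).
race_model = (-1.5, [0.02, 0.3])
eth_model = (-1.2, [0.01, 0.4])

# Hypothetical target population with fully observed race and ethnicity.
rng = random.Random(2020)
people = [
    {"x": [rng.uniform(0, 90), rng.randint(0, 1)], "race": "White", "ethnicity": "No"}
    for _ in range(2000)
]

# Assumed replication scheme: repeat the random draws 100 times.
replicates = [make_replicate(people, race_model, eth_model, rng) for _ in range(100)]
race_missing_rate = sum(
    row["race"] is None for rep in replicates for row in rep
) / (len(replicates) * len(people))
print(f"average share with missing race: {race_missing_rate:.3f}")
```

Because missingness is generated from known propensities, the average share of missing values across replicates should sit close to the mean propensity in the target population, which gives a quick sanity check on the setup.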
Part 2. Sensitivity analysis on missing data mechanism: not missing at random (NMAR) missingness
The evaluation study described in Section 3 assumes that the missing data mechanism is missing at random, where the missingness of race and ethnicity depends on the observed covariates. However, it is possible that not all variables related to the missingness are included in the propensity models, because limited individual-level information is available in the CRF data. Moreover, it is possible that the missingness of race and ethnicity still depends on race and ethnicity after controlling for all possible covariates, i.e., not missing at random (NMAR) missingness.
In this section, we repeat the evaluation study described in Section 3, with two indicator variables included in the propensity models to generate not missing at random missing data. Specifically, in Evaluation 3 (NMAR, low missingness), we include a Hispanic/Latino indicator variable (denoted HISP-IND, with HISP-IND = 1 if a person answered "Yes" to the Hispanic ethnicity question, and HISP-IND = 0 if a person answered "No") in the IA ethnicity propensity model, with a coefficient of 0.5; and we include a race indicator variable (denoted White-IND, with White-IND = 1 if a person chose "White" as race, and White-IND = 0 otherwise) in the IA race propensity model, with a coefficient of 0.5. In Evaluation 4 (NMAR, high missingness), we add these two variables to the PA ethnicity propensity model and race propensity model, respectively. With these propensity models, in Evaluation 3, on average, 19.1% of subjects have missing data on race, 23.7% have missing data on ethnicity, and 34.3% have missing values on combined race/ethnicity; in Evaluation 4, on average, 51.1% of subjects have missing data on race, 55.3% have missing data on ethnicity, and 68.3% have missing values on combined race/ethnicity. Evaluation 3 results are shown in Table S3.
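To see what a coefficient of 0.5 on an indicator does to a propensity, the short Python sketch below adds 0.5 to the linear predictor of a given missing-at-random propensity and converts back to a probability. The baseline propensities and the function names are hypothetical and chosen only for illustration. On the odds scale, a coefficient of 0.5 multiplies the odds of missingness by exp(0.5), roughly 1.65, for people with the indicator equal to 1.

```python
import math

def inv_logit(z):
    """Inverse logit, equal to exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def nmar_propensity(mar_propensity, indicator, coef=0.5):
    """Add coef * indicator to the linear predictor of a MAR propensity."""
    return inv_logit(logit(mar_propensity) + coef * indicator)

# Hypothetical baseline propensities, shifted for a person with indicator = 1.
for p in (0.1, 0.2, 0.5):
    shifted = nmar_propensity(p, indicator=1)
    print(f"MAR propensity {p:.2f} -> NMAR propensity {shifted:.3f}")
```

People with the indicator equal to 0 keep their original propensity, so the added term makes missingness depend directly on the value that is being set to missing.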
After including HISP-IND and White-IND in the IA propensity models, more Hispanic/Latino and non-Hispanic (NH) White subjects have missing race/ethnicity information. Based on the complete case analysis, the incidence for Hispanic/Latino is 23.77 per 1,000, compared to 26.92 per 1,000 in Evaluation 1; the incidence for NH White is 4.74 per 1,000, compared to 5.09 per 1,000 in Evaluation 1. Both imputation models reduce biases and improve coverage for all racial/ethnic groups except the NH Multiple/other group: imputation Model 1 over-imputes this group, and imputation Model 2, although it yields smaller biases, yields a coverage of 0.57, which is below the nominal 95% level. Excluding the NH Multiple/other group, biases using imputation Model 1 range from -2.49 (Hispanic/Latino) to 2.14 (NH NHPI) per 1,000, and biases using imputation Model 2 range from -3.03 (Hispanic/Latino) to 6.85 (NH NHPI) per 1,000, whereas the complete case analysis yields biases ranging from -21.55 (NH NHPI) to -2.43 (NH White) per 1,000; coverage rates are one for both imputation models, compared with a coverage rate of zero for the complete case analysis. Accordingly, both imputation models yield smaller biases and better coverage rates for the IRR estimates, except for the NH Multiple/other group.
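Bias and coverage of this kind can be computed from per-replicate results as sketched below. The sketch rests on assumptions that the text does not confirm: the 10 imputations within each replicate are combined with Rubin's rules, the 95% interval uses a normal quantile (1.96) rather than a t reference distribution, bias is the mean pooled estimate minus the true value, and coverage is the share of replicates whose interval contains the true value. All numbers are hypothetical.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine estimates from m imputed data sets with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation (sample) variance
    total = within + (1 + 1 / m) * between
    return q_bar, total

def evaluate(pooled, truth, z=1.96):
    """Return (bias, coverage) over replicates of (estimate, variance) pairs."""
    bias = mean(est for est, _ in pooled) - truth
    covered = [abs(est - truth) <= z * math.sqrt(var) for est, var in pooled]
    return bias, sum(covered) / len(covered)

# Hypothetical incidence estimates (per 1,000) and variances:
# two replicates with three imputations each, to keep the example small.
replicate_imputations = [
    ([10.1, 9.8, 10.3], [0.04, 0.05, 0.04]),
    ([9.5, 9.9, 9.7], [0.05, 0.05, 0.06]),
]
truth = 10.0  # hypothetical true incidence per 1,000

pooled = [pool_rubin(est, var) for est, var in replicate_imputations]
bias, coverage = evaluate(pooled, truth)
print(f"bias = {bias:.3f} per 1,000, coverage = {coverage:.2f}")
```

Under this definition, a coverage of one means every replicate's interval contained the true value, and a coverage of zero means none did.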

Evaluation 4 includes HISP-IND and White-IND in the PA propensity models. The incidence for Hispanic/Latino is 17.32 per 1,000, compared to 19.63 per 1,000 in Evaluation 2; the incidence for NH White is 1.90 per 1,000, compared to 2.25 per 1,000 in Evaluation 2. Complete case analysis yields incidence estimates with biases ranging from -27.53 (NH NHPI) to -4.85 (NH AIAN) per 1,000 and relative biases ranging from -85.58% (NH Black) to -36.94% (NH AIAN) (Table S4). Imputation Model 1 yields incidence estimates with smaller biases than the complete case analysis for all groups except the NH Multiple/other group. Excluding the NH Multiple/other group, biases based on imputation Model 1 range from -3.57 (Hispanic/Latino) to 5.71 (NH Asian) per 1,000, relative biases range from -20.87% (NH AIAN) to 46.31% (NH Asian), and coverage rates are above 90%. Imputation Model 2 yields incidence estimates with smaller biases than the complete case analysis for all groups; however, it yields a relatively larger bias for the NH NHPI group and low coverage for both the NH Multiple/other and NH NHPI groups. Similar patterns are seen for the IRR estimates.
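The quantities reported here (incidence per 1,000, bias, relative bias, and IRR) are related by simple arithmetic, illustrated in the Python sketch below. All counts are hypothetical, the choice of a reference group for the IRR is an assumption, and the sketch assumes that complete case analysis drops cases with missing race/ethnicity from the numerator while the population denominator stays fixed.

```python
def incidence_per_1000(cases, population):
    """Incidence expressed as cases per 1,000 population."""
    return 1000.0 * cases / population

def relative_bias_pct(estimate, truth):
    """Bias as a percentage of the true value."""
    return 100.0 * (estimate - truth) / truth

def irr(incidence, reference_incidence):
    """Incidence rate ratio of a group relative to a reference group."""
    return incidence / reference_incidence

# Hypothetical group: 500 true cases, 300 of them with race/ethnicity observed.
true_inc = incidence_per_1000(500, 50_000)  # 10.0 per 1,000
cc_inc = incidence_per_1000(300, 50_000)    # 6.0 per 1,000 (complete case)

# Hypothetical reference group: 700 true cases, 450 of them observed.
ref_true_inc = incidence_per_1000(700, 100_000)  # 7.0 per 1,000
ref_cc_inc = incidence_per_1000(450, 100_000)    # 4.5 per 1,000

bias = cc_inc - true_inc
rel_bias = relative_bias_pct(cc_inc, true_inc)
true_irr = irr(true_inc, ref_true_inc)
cc_irr = irr(cc_inc, ref_cc_inc)

print(f"bias = {bias:.2f} per 1,000, relative bias = {rel_bias:.1f}%")
print(f"true IRR = {true_irr:.3f}, complete case IRR = {cc_irr:.3f}")
```

In this toy setting the complete case incidence is biased downward in both groups, and the IRR is biased as well because the share of cases lost differs between the group and the reference group.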
In general, imputation Model 1 reduces biases for all race/ethnicity groups except the NH Multiple/other group, which it tends to over-impute. For the remaining race/ethnicity groups, imputation Model 1 consistently reduces the biases of complete case analysis and improves coverage. Imputation Model 2, although it does not over-impute the NH Multiple/other group, yields estimates with poor coverage for this group, and its biases for the remaining race/ethnicity groups are slightly larger than those of imputation Model 1.