Supplementing Missing Self-Reported Race Data with a Probability Distribution in Logistic Regression Models

Abstract: Race is often included as an independent variable in health services research, especially in recent studies of racial and ethnic disparities in health care. Although self-reported information on race exists in large electronic health records (EHR) data, these data are sometimes missing. Recently, the Bayesian Improved Surname Geocoding (BISG) method has been used to estimate the probability distribution of race categories for those with missing information on race. The BISG estimated probability distribution has been used in reporting health care measures but not in statistical modeling with dichotomous events as outcomes. We propose two approaches to accommodate the available probability distribution of an independent categorical variable (e.g., race) in logistic regression models: 1) a direct substitution approach and 2) a partial information maximum likelihood estimator (PIMLE). In examining the association between race and up-to-date immunization status of children by three years of age in an integrated health care organization, 11.3% of 14,903 children had missing self-reported race information but had a BISG estimated probability distribution for the six race/ethnicity categories. We employed the direct substitution approach and the PIMLE approach to analyze the under vaccination data. Both approaches included all observations and thus yielded smaller standard errors of estimated coefficients compared to the complete data analyses. Our simulation study showed that the direct substitution approach and PIMLE yielded nearly unbiased coefficient estimates and preserved efficiency when the missing rate of the independent categorical variable was up to 30%.


INTRODUCTION
Demographic characteristics such as race are important independent variables (risk factors or predictors) in health services research. Self-reported information on race is considered superior to other sources such as observed race [1]. Self-reported information on race often exists in large electronic health records (EHR) data; however, these data are sometimes missing. Reasons for missing race data include the patient declining to report this information and the provider failing to obtain or document it.
Recent focus on racial and ethnic disparities in health care has encouraged health care organizations to increase their efforts to fill in missing race and ethnicity data [2,3]. While ideally race would be self-reported by all patients, other approaches such as the recent Bayesian Improved Surname Geocoding (BISG) method have been developed to compensate for missing information on race [4,5]. BISG uses a Bayesian approach to combine racial/ethnic data from last names and geographic units to calculate the probability distribution of race categories for a given individual whose self-reported race/ethnicity is missing in the EHR. In 2008, Kaiser Permanente, an integrated health care organization, began applying the BISG algorithm to geocoded member addresses to link to Census Bureau data that describe the racial/ethnic composition of census block groups [6]. Based on members' address and surname analysis using the Census Bureau's list of more than 150,000 surnames and their association with race/ethnicity, an individual's probability distribution is estimated for the following six standardized mutually exclusive racial/ethnic categories: Asian or Pacific Islander, Black or African American, Hispanic or Latino, American Indian or Alaska Native, Multiracial, and White. In this paper, we use A, B, H, N, M and W to denote these six categories, respectively. Adjaye-Gbewonyo et al. [6] validated the classification of race/ethnicity based on the BISG and concluded that BISG may be useful for classifying race/ethnicity of health plan members when needed for health care studies.
They also showed that sensitivity and specificity of classification varied by race/ethnic group: using a cutoff of 0.5, sensitivity for A, B, H, N, M and W was 64.4, 71.8, 71.0, 0.0, 0.3, and 85.2 percent, respectively; specificity was 99.6, 91.1, 99.0, 100.0, 100.0, and 76.8 percent, respectively. How to use the newly available probability distribution information directly, without first classifying race/ethnicity, together with existing self-reported race information in analyzing EHR data remains of great interest in health care research.
*Address correspondence to this author at the Kaiser Permanente Colorado, Institute for Health Research, Denver Highlands, 10065 E. Harvard Ave., Suite 300, Denver CO 80231, USA; Tel: (303) 614-1200; Fax: (303)-614-1225; E-mail: stan.xu@kp.org
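The Bayesian combination of surname and geographic information can be illustrated with a minimal sketch. It assumes one common formulation, in which a surname-based prior P(race | surname) is updated by a geography-based likelihood P(block group | race); the text above states only that the two sources are combined in a Bayesian way, so the exact form used in production may differ. The function names and all numeric inputs are hypothetical, and the cutoff rule treats a probability at or above 0.5 as a classification, which is also an assumption.

```python
# Category codes follow the paper: A, B, H, N, M, W.
CATEGORIES = ["A", "B", "H", "N", "M", "W"]


def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine a surname-based prior with a geography-based likelihood (Bayes' rule).

    Both arguments are dicts keyed by category code. This is an assumed
    formulation for illustration, not the production BISG algorithm.
    """
    unnormalized = {
        k: p_race_given_surname[k] * p_geo_given_race[k] for k in CATEGORIES
    }
    total = sum(unnormalized.values())
    if total <= 0:
        raise ValueError("prior and likelihood have no overlapping support")
    return {k: v / total for k, v in unnormalized.items()}


def classify(posterior, cutoff=0.5):
    """Return the most probable category if it reaches the cutoff, else None."""
    best = max(posterior, key=posterior.get)
    return best if posterior[best] >= cutoff else None


# Hypothetical inputs for one member (not taken from the paper).
surname_prior = {"A": 0.05, "B": 0.10, "H": 0.60, "N": 0.01, "M": 0.04, "W": 0.20}
geo_likelihood = {"A": 0.10, "B": 0.10, "H": 0.30, "N": 0.10, "M": 0.10, "W": 0.20}
posterior = bisg_posterior(surname_prior, geo_likelihood)
print(posterior, classify(posterior))
```

The approaches proposed in this paper use the full posterior distribution as a model input rather than the single label returned by a cutoff rule.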
Analyzing outcomes when covariates (e.g., race) are missing is challenging. Multiple imputation is commonly used to impute missing values of important independent variables such as race in analyzing outcomes [7][8][9]. In a recent study examining the association between initial antihyperglycemic therapy and patient-level baseline characteristics, to enable inclusion of patients for whom data on race were missing (17.4%), the authors employed multiple imputation, imputing each missing value five times using site-specific distributions of race [9,10]. With the availability of the BISG estimated race probability distribution, other approaches may be more appropriate and more efficient. Recently, McCaffrey and Elliott [11] suggested a direct substitution approach and a partial information maximum likelihood estimator approach for a linear regression model in which a binary independent variable is missing but its probability distribution is available.
This paper focuses on a dichotomous dependent variable (e.g., up-to-date immunization status of children) and a categorical independent variable (i.e., race). We propose logistic regression models that use 1) the direct substitution approach and 2) a partial information maximum likelihood estimator when the missing independent variable is categorical (e.g., race). We then demonstrate these two methods in analyzing up-to-date immunization status of children by three years of age in an integrated health care organization. We also conduct a simulation study to evaluate the bias and efficiency of these two approaches.

Statistical Models with a Categorical Variable as an Independent Variable without Missing
Suppose a categorical independent variable $z$ has $K$ levels. Let $y_i$ be the dependent variable for the $i$th subject, where $i = 1$ to $n$. For example, $y_i = 1$ if the immunization status of a child is up-to-date and $y_i = 0$ if it is not. Also let

$$\mu_{1(i)} = \alpha_r + \sum_{k \neq r} I_{iz}\,\alpha_k + x_i \beta,$$

where $\alpha_r$ is the coefficient for the reference category of the categorical independent variable, $\alpha_k$ is the coefficient for category $k$ ($k \neq r$), $k = 1$ to $K-1$, $I_{iz}$ is the indicator variable for the categorical independent variable, equal to 1 if $z = k$ and equal to zero if $z \neq k$, $x_i$ is a row vector of covariates (not including $z$), and $\beta$ is a column vector of corresponding coefficients. Then the probability of $y_i = 1$ can be written as

$$p_{1(i)} = \frac{\exp(\mu_{1(i)})}{1 + \exp(\mu_{1(i)})}$$

in a logistic regression model. The following log likelihood for the $i$th subject can be obtained:

$$ll_{1(i)} = y_i \log(p_{1(i)}) + (1 - y_i)\log(1 - p_{1(i)}) \qquad (1)$$

In analyzing a binary outcome with independent categorical variables, a categorical independent variable appears in the analytic dataset as a single variable. Statistical software (e.g., SAS) creates a design matrix when initiating analytic procedures so that the parameters in equation (1) can be estimated accordingly [12]. For example, if a classification variable $z$ has $K$ levels, then its main effect has $(K-1)$ degrees of freedom, and the design matrix has $(K-1)$ columns that correspond to the $(K-1)$ non-reference levels of $z$. The overall log likelihood across subjects can then be maximized to obtain the maximum likelihood estimates (MLE).
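The design-matrix coding and the per-subject log likelihood in equation (1) can be sketched as follows. This is an illustration in Python rather than the SAS procedures the paper relies on; the function names, the choice of White as reference, and the coefficient containers are assumptions made for the example.

```python
import math

# Six categories from the paper; White (W) is taken as the reference here.
LEVELS = ["A", "B", "H", "N", "M", "W"]
REFERENCE = "W"
NON_REF = [k for k in LEVELS if k != REFERENCE]  # the K - 1 design columns


def indicator_row(z):
    """Return the K - 1 indicator columns for an observed category z."""
    return [1.0 if z == k else 0.0 for k in NON_REF]


def log_lik_observed(y, z, x, alpha_r, alpha, beta):
    """Per-subject log likelihood (1) for a subject with observed category z.

    alpha maps each non-reference category to its coefficient; x and beta are
    equal-length sequences of covariates and their coefficients.
    """
    mu = alpha_r
    mu += sum(i_k * alpha[k] for i_k, k in zip(indicator_row(z), NON_REF))
    mu += sum(x_j * b_j for x_j, b_j in zip(x, beta))
    # y*log(p) + (1-y)*log(1-p) with p = exp(mu)/(1+exp(mu)) is algebraically
    # y*mu - log(1+exp(mu)); this form avoids overflow for large |mu|.
    softplus = max(mu, 0.0) + math.log1p(math.exp(-abs(mu)))
    return y * mu - softplus
```

Summing `log_lik_observed` over subjects gives the overall log likelihood that a numerical optimizer would maximize.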

A Direct Substitution Approach when some Values of a Categorical Independent Variable are Missing but Supplemented with Probability Distribution
We propose statistical models for binary outcomes in which some values of the independent categorical variable are missing but supplemented with a probability distribution. Let $d_k$ denote the probability for category level $k$, where $\sum_{k=1}^{K} d_k = 1$ for each individual. In analyzing data with a probability distribution for missing values of a categorical variable, a design matrix similar to the one without missing information must be created. For demonstration purposes, let the $r$th category be the reference group, $r \in \{1, \ldots, K\}$. For the $i$th individual with a missing value for the categorical variable, we let $\delta_i = \sum_{k \neq r} d_{ik}\,\alpha_k$, where $\alpha_k$ is the coefficient for the $k$th category and $k \neq r$ in order to avoid a rank-deficient design matrix. Let $\mu_{2(i)} = \alpha_r + \delta_i + x_i \beta$, where the definitions of $\alpha_r$, $x_i$ and $\beta$ remain the same as in (1); then

$$p_{2(i)} = \frac{\exp(\mu_{2(i)})}{1 + \exp(\mu_{2(i)})}.$$

The following log likelihood can be obtained for an individual with missing self-reported race:

$$ll_{2(i)} = y_i \log(p_{2(i)}) + (1 - y_i)\log(1 - p_{2(i)}) \qquad (2)$$

For those with self-reported race information available, log likelihood values can be obtained as in (1). The overall log likelihood across individuals can then be calculated and maximized to obtain MLEs of the $\alpha$s. This approach is similar to the direct substitution method proposed by McCaffrey and Elliott [11] for a linear model, which uses predicted probabilities for a dichotomous independent variable rather than the actual variable.
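As a rough illustration of the direct substitution approach, the sketch below fills the K - 1 design columns with 0/1 indicators when race is observed and with the category probabilities when it is missing, then evaluates the usual logistic log likelihood. It is written in Python for illustration; the names `race_columns` and `log_lik_direct`, and the choice of White as reference, are assumptions and do not come from the paper's SAS appendix.

```python
import math

LEVELS = ["A", "B", "H", "N", "M", "W"]
REFERENCE = "W"  # assumed reference category for this sketch
NON_REF = [k for k in LEVELS if k != REFERENCE]


def race_columns(z, d=None):
    """K - 1 design columns: indicators if z is observed, probabilities if missing."""
    if z is not None:
        return [1.0 if z == k else 0.0 for k in NON_REF]
    if d is None or abs(sum(d[k] for k in LEVELS) - 1.0) > 1e-8:
        raise ValueError("a missing category needs probabilities that sum to 1")
    # The reference column is dropped, mirroring the dummy coding of observed z.
    return [d[k] for k in NON_REF]


def log_lik_direct(y, z, d, x, alpha_r, alpha, beta):
    """Per-subject log likelihood under direct substitution.

    z is the observed category or None; d is the probability distribution over
    LEVELS, used only when z is None.
    """
    cols = race_columns(z, d)
    mu = alpha_r + sum(c * alpha[k] for c, k in zip(cols, NON_REF))
    mu += sum(x_j * b_j for x_j, b_j in zip(x, beta))
    softplus = max(mu, 0.0) + math.log1p(math.exp(-abs(mu)))  # log(1 + exp(mu))
    return y * mu - softplus
```

When the probability distribution puts all of its mass on one category, the substituted row is identical to the indicator row, so the two cases share one likelihood function.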

A Partial Information Maximum Likelihood Estimator when some Values of a Categorical Independent Variable are Missing but Supplemented with Probability Distribution
McCaffrey and Elliott [11] also proposed a partial information maximum likelihood estimator (PIMLE) for the linear model with a dichotomous independent variable. In this paper, we derive the PIMLE for logistic regressions with categorical independent variables. Let $p_{ik}$ represent the probability of the outcome ($y = 1$) if an individual belongs to the $k$th category,

$$p_{ik} = \frac{\exp(\mu_{ik})}{1 + \exp(\mu_{ik})},$$

where $\mu_{ik} = \alpha_r + \alpha_k + x_i \beta$ for $k \neq r$ and $\mu_{ir} = \alpha_r + x_i \beta$ for the reference category. The log partial information likelihood for subject $i$ with a missing categorical variable is

$$ll_{3(i)} = \log\left( \sum_{k=1}^{K} d_{ik}\, p_{ik}^{\,y_i} (1 - p_{ik})^{1 - y_i} \right) \qquad (3)$$

For those with race information available, log likelihood values can be obtained as in (1). The overall log likelihood across individuals can then be calculated and maximized to obtain MLEs of the $\alpha$s. MLEs from models (2) and (3) can be obtained in SAS PROC NLMIXED by supplying the log likelihood through the general(ll) specification.
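The partial information likelihood for a subject with missing race can be sketched as a mixture of the category-specific Bernoulli likelihoods, weighted by that subject's category probabilities. The code below is an illustrative Python version under that reading, not the paper's SAS program; the function names are hypothetical, White is assumed to be the reference category, and the sum is evaluated on the log scale for numerical stability.

```python
import math

LEVELS = ["A", "B", "H", "N", "M", "W"]
REFERENCE = "W"  # assumed reference; it has no separate alpha coefficient


def log_bernoulli(y, mu):
    """log of p^y * (1-p)^(1-y) with p = exp(mu) / (1 + exp(mu))."""
    softplus = max(mu, 0.0) + math.log1p(math.exp(-abs(mu)))
    return y * mu - softplus


def log_lik_pimle(y, d, x, alpha_r, alpha, beta):
    """Log partial information likelihood for one subject with missing category.

    d maps every level to its probability; alpha maps non-reference levels to
    their coefficients.
    """
    xb = sum(x_j * b_j for x_j, b_j in zip(x, beta))
    terms = []
    for k in LEVELS:
        if d[k] <= 0.0:
            continue  # a zero-probability category contributes nothing
        mu_k = alpha_r + alpha.get(k, 0.0) + xb  # alpha term is 0 for the reference
        terms.append(math.log(d[k]) + log_bernoulli(y, mu_k))
    if not terms:
        raise ValueError("at least one category needs positive probability")
    top = max(terms)  # log-sum-exp over the mixture components
    return top + math.log(sum(math.exp(t - top) for t in terms))
```

Unlike direct substitution, which averages the linear predictors before applying the logistic function, this form averages the category-specific likelihoods, so the two approaches generally give different values unless the distribution is concentrated on a single category.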

AN EXAMPLE
Under vaccination of young children is a public health challenge in the US and worldwide [13]. Recent outbreaks of vaccine-preventable diseases such as measles are an apparent result of under vaccination in some communities in the US [14,15]. Examining the association between race and childhood immunization is of interest to vaccination researchers and policy makers [16]. In examining the association between race and up-to-date vaccination status by three years of age, 14 We conducted three analyses of the under vaccination data: a) a complete data analysis, which included only those individuals who had self-reported race; a conventional logistic regression model was employed to obtain odds ratios (OR) with White as the reference category; b) an entire population analysis with the direct substitution approach, which included those with self-reported race and those without self-reported race but with a probability distribution over the six categories; c) an entire population analysis with the PIMLE approach, which included the same two groups. For analyses b) and c), models (2) and (3) in Section 2 were fit. SAS programs for fitting these two models are provided in Appendix A.
In the complete data analyses, the point estimate of the OR for the Native race category was very large and its 95% confidence interval was extremely wide, indicating estimation instability due to the low prevalence of the Native race in the population (Table 2). With inclusion of those with missing self-reported race, both the direct substitution and the PIMLE yielded stable estimates of ORs and confidence intervals for the Native race category. Compared with the results from the complete data analyses, the direct substitution and PIMLE yielded comparable point estimates of ORs for the other race categories (Black, Asian Pacific, Hispanic and Multiracial) with White as the reference group, although ORs from both direct substitution and PIMLE were slightly lower for Black, Asian Pacific and Hispanic. In general, the 95% confidence intervals of ORs from the direct substitution and PIMLE were narrower than those from the complete data analyses. This is consistent with the fact that the direct substitution and PIMLE approaches included all subjects in the analyses.

SIMULATION STUDY AND RESULTS
We also conducted a simulation study to evaluate the performance of the direct substitution method and PIMLE using the complete data set (N = 13,217). We used the simulation strategy of Xu et al. [18]. Briefly, while keeping the covariates (gender and race variables) in the complete data set, we used the coefficients from the complete data analyses of the Kaiser under vaccination data to simulate the outcome (up-to-date vaccination) based on the following probabilistic model:

$$\mathrm{prob}(y_i = 1 \mid \hat{\alpha}_r, \hat{\alpha}_k, \hat{\beta}) = \frac{\exp\left(\hat{\alpha}_r + \sum_{k \neq r} I_{iz}\,\hat{\alpha}_k + \mathrm{gender}\,\hat{\beta}\right)}{1 + \exp\left(\hat{\alpha}_r + \sum_{k \neq r} I_{iz}\,\hat{\alpha}_k + \mathrm{gender}\,\hat{\beta}\right)}$$

where $y_i = 1$ if a child's immunization status is up-to-date and $y_i = 0$ if not; $I_{iz}$ is an indicator variable for race categories, with $I_{iz} = 1$ if $z = k$ and $I_{iz} = 0$ otherwise, and White is the reference; the estimated intercept is $\hat{\alpha}_r = 2.318$; and $\hat{\alpha}_k$ are the estimated coefficients for race categories, with $\hat{\alpha}_A = 0.677$, $\hat{\alpha}_B = 0.270$, $\hat{\alpha}_H = 0.182$, $\hat{\alpha}_N = 11.212$, and $\hat{\alpha}_M = 0.332$, indicating that White children are more likely to be under vaccinated in this population. The coefficient for gender (gender = 1 if male) is $\hat{\beta} = 0.028$. Note that exponentiating these estimated coefficients gives the ORs from the complete data analyses in Table 2.
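The outcome-generating step of the simulation can be sketched as below. The coefficients are those quoted in the paragraph above, with the Black coefficient read as 0.270; the function names, the record format, and the fixed seed are assumptions made for this illustration, and the code is Python rather than the software used in the study.

```python
import math
import random

# Coefficients quoted in the text; White (W) is the reference category.
ALPHA_R = 2.318
ALPHA = {"A": 0.677, "B": 0.270, "H": 0.182, "N": 11.212, "M": 0.332, "W": 0.0}
BETA_GENDER = 0.028  # gender = 1 if male


def prob_up_to_date(race, male):
    """Probability that y = 1 (up-to-date) for a child of the given race and gender."""
    mu = ALPHA_R + ALPHA[race] + BETA_GENDER * (1 if male else 0)
    return 1.0 / (1.0 + math.exp(-mu))


def simulate_outcomes(records, seed=0):
    """Draw one Bernoulli outcome per (race, male) record.

    The seed is fixed only to make the illustration reproducible.
    """
    rng = random.Random(seed)
    return [1 if rng.random() < prob_up_to_date(race, male) else 0
            for race, male in records]


# Hypothetical covariate records, not the study data.
records = [("W", False), ("W", True), ("A", False), ("N", True), ("B", False)]
outcomes = simulate_outcomes(records)
```

In the study design described above, the covariates are held fixed and only the outcome is redrawn, so each Monte Carlo sample would call a function like `simulate_outcomes` on the same records.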
For each Monte Carlo sample, we randomly assigned missing self-reported race values in the complete dataset, in which both self-reported race and the BISG estimated probability distribution of race categories were available. For each rate of missing self-reported race, 5000 random samples were generated from the complete dataset by randomly assigning missing self-reported race. We then conducted four analyses: 1) an analysis which excluded those individuals with missing self-reported race; 2) the direct substitution approach; 3) the PIMLE approach; and 4) multiple imputation. For the multiple imputation approach, each missing race was imputed five times using the distribution of race in the complete dataset as in Raebel et al. [9]; five separate models were fit; then coefficient estimates and their standard errors were pooled from these five models [9,10]. Six rates of missing self-reported race were evaluated: 0% (without missing race information), 10%, 20%, 30%, 50% and 70%. The analytic results without missing race information (footnotes in Table 3) served as gold standards for comparing the results from the four analyses. For convenient comparison among these analytic approaches, we report mean coefficients and mean standard errors of coefficients instead of ORs and their confidence intervals to evaluate the bias and efficiency of these methods. Table 3 shows the mean coefficients (mean standard errors) in the simulation study. As expected, when observations with missing race were excluded, the mean standard errors of estimated coefficients increased for gender and all categories of race while the mean coefficients remained nearly unbiased. The direct substitution approach produced the same coefficients and standard errors for intercept and gender as those without race information missing because no observations were lost.
It yielded nearly unbiased coefficients and standard errors similar to the gold standard for all race categories for rates of missing self-reported race up to 30%. When the rates of missing self-reported race increased to 50% and 70%, the coefficients remained unbiased but their standard errors were slightly overestimated except for the Native category.
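Two mechanical steps of this design, assigning missingness at a fixed rate and pooling results across imputed datasets, can be sketched as follows. The text states only that estimates and standard errors were pooled across the five models; the sketch assumes the standard Rubin's rules for that pooling, and it assumes missingness is assigned completely at random to an exact fraction of records. All function names are hypothetical.

```python
import math
import random


def assign_missing(races, rate, rng):
    """Set a random fraction `rate` of race values to None (missing)."""
    n_missing = round(rate * len(races))
    chosen = set(rng.sample(range(len(races)), n_missing))
    return [None if i in chosen else r for i, r in enumerate(races)]


def impute_from_marginal(races, marginal, rng):
    """Fill each missing value with a draw from the marginal race distribution."""
    levels = list(marginal)
    weights = [marginal[k] for k in levels]
    return [r if r is not None else rng.choices(levels, weights=weights)[0]
            for r in races]


def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputations (Rubin's rules, assumed here)."""
    m = len(estimates)
    q_bar = sum(estimates) / m
    within = sum(se ** 2 for se in std_errors) / m
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, math.sqrt(total)
```

Because `impute_from_marginal` ignores each individual's own probability distribution, it illustrates the kind of imputation described above, which uses only the overall distribution of race in the complete dataset.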
Similar to the direct substitution approach, the PIMLE approach produced the same coefficients and standard errors for intercept and gender because no observations were lost. With increasing rates of missing self-reported race, the PIMLE approach yielded underestimated coefficients for all categories while the estimates of standard errors remained consistent with those without race information missing. When 50% of self-reported race values were missing, the estimated coefficient for Black decreased 40% (from 0.28 to 0.20); for Asian Pacific, the estimated coefficient decreased 17.1% (from 0.7 to 0.58); for Hispanic, the estimated coefficient decreased 16.7% (from 0.18 to 0.15); for Multiracial, the estimated coefficient decreased 7.8% (from 0.34 to 0.28).
The results using the multiple imputation approach are also reported in Table 3. While the standard errors of coefficients changed only slightly as the rate of missing race increased, the estimated coefficients decreased substantially. Compared to the PIMLE approach, the multiple imputation approach underestimated the coefficients to a greater degree.

DISCUSSION
We proposed two approaches to accommodate the available probability distribution of an independent categorical variable (e.g., race) in logistic regression models when some values of the independent categorical variable are missing: 1) the direct substitution approach and 2) a partial information maximum likelihood estimator. These two methods included all observations and thus yielded smaller standard errors of estimated coefficients in analyzing up-to-date immunization status of children by three years of age. Our simulation study showed that, when the missing rate of the independent categorical variable was up to 30%, the direct substitution approach and PIMLE yielded coefficient estimates and standard errors similar to those without race missing. For a given missing rate of race, the multiple imputation approach yielded the most biased coefficient estimates because it used only the raw probabilities of race categories in the complete dataset.
When the missing rate was 50% or higher, the direct substitution approach produced greater standard errors for the categorical variable's coefficients and thus the efficiency decreased. However, the standard errors were still smaller than those from the analyses that excluded observations with missing values. The PIMLE approach underestimated the categorical variable's coefficients when the missing rate was 50% or higher. Thus the direct substitution approach is preferred when the missing rate is 50% or higher.
There are some limitations in this study. First, the sample size in both the example application and the simulation study is large. For a large dataset, the impact of different missing rates of the independent categorical variable may not be dramatic, especially on the standard errors of coefficients. The performance of these two approaches for small and medium-sized datasets is unknown. A missing rate of 30% may result in substantial bias in coefficient estimation and lower efficiency (larger standard errors) for a small or medium-sized dataset. Second, we used the estimated coefficients of race categories from the KPCO example to simulate the outcome; thus the performance may differ when the effects of race categories on a dichotomous outcome differ.
These two newly proposed methods are simpler and easier to implement than multiple imputation, with the SAS code provided in Appendix A. Our simulation showed that both the direct substitution and PIMLE improved efficiency by accommodating the available BISG estimated probability distribution and yielded nearly unbiased coefficient estimates when the rate of missing values in the categorical variable was not high (e.g., less than 30%), while multiple imputation using the raw distribution of race categories yielded biased coefficient estimates.