Enriched-Data Problems and Essential Non-Identifiability

Geert Molenberghs; Edmund Njeru Njagi; Michael G. Kenward; Geert Verbeke

doi:10.6000/1929-6029.2012.01.01.02

Authors

Geert Molenberghs I-BioStat, Universiteit Hasselt, B-3590 Diepenbeek, Belgium
Edmund Njeru Njagi I-BioStat, Universiteit Hasselt, B-3590 Diepenbeek, Belgium
Michael G. Kenward Medical Statistics Unit, London School of Hygiene and Tropical Medicine, London WC1E7HT, UK
Geert Verbeke I-BioStat, Universiteit Hasselt, B-3590 Diepenbeek, Belgium

Keywords:

Compound-symmetry, Empirical Bayes, Enriched data, Exponential random effects, Gamma random effects, Linear mixed model, Missing at random, Missing completely at random, Non-future dependence, Pattern-mixture model, Selection model, Shared-parameter model

Abstract

There are two principal ways in which statistical models extend beyond the data available. First, the data may be coarsened, that is, what is actually observed is less detailed than what was planned, owing to, for example, attrition, censoring, grouping, or a combination of these. Second, the data may be augmented, that is, the observed data are hypothetically but conveniently supplemented with structures such as random effects, latent variables, latent classes, or component membership in mixture distributions. These two settings together will be referred to as enriched data. Reasons for modelling enriched data include the incorporation of substantive information, such as the need for predictions, advantages in interpretation, and mathematical and computational convenience. The fitting of models for enriched data combines evidence arising from empirical data with non-verifiable model components, i.e., components that are purely assumption driven. This has important implications for the interpretation of statistical analyses in such settings. While these issues are widely known, their exploration and discussion are somewhat scattered. The user should be fully aware of the potential dangers and pitfalls that follow from this. Therefore, we provide a unified framework for enriched data and show in general that to any given model an entire class of models can be assigned, with all of its members producing the same fit to the observed data but arbitrary regarding the unobservable parts of the enriched data. The implications of this are explored for several specific settings, namely those of latent classes, finite mixtures, factor analysis, random-effects models, and incomplete data. The results are applied to a range of relevant examples.
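As an informal illustration of the central claim, and not an example taken from the paper, consider a count outcome. One model augments the data with a gamma-distributed random effect b and assumes Y given b is Poisson with mean b. A second model assumes a negative binomial distribution for Y directly, with no latent structure at all. The sketch below, which uses only the Python standard library, checks numerically that both give the same observed-data probabilities. The function names, the parameter values (shape 2, scale 1.5), and the integration grid are assumptions chosen for illustration.

```python
import math


def neg_binomial_pmf(k, shape, scale):
    """Negative binomial pmf for Y, written without any latent variable."""
    p = 1.0 / (1.0 + scale)
    log_pmf = (
        math.lgamma(k + shape) - math.lgamma(shape) - math.lgamma(k + 1)
        + shape * math.log(p) + k * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def gamma_poisson_pmf(k, shape, scale, upper=60.0, steps=60_000):
    """Marginal pmf of Y when Y | b ~ Poisson(b) and b ~ Gamma(shape, scale).

    The latent b is integrated out with a midpoint rule on (0, upper);
    the grid is an illustrative choice, adequate for the values used here.
    """
    h = upper / steps
    total = 0.0
    for i in range(steps):
        b = (i + 0.5) * h
        log_poisson = k * math.log(b) - b - math.lgamma(k + 1)
        log_gamma = (
            (shape - 1.0) * math.log(b) - b / scale
            - math.lgamma(shape) - shape * math.log(scale)
        )
        total += math.exp(log_poisson + log_gamma)
    return total * h


# Hypothetical parameter values, chosen only for the comparison.
shape, scale = 2.0, 1.5
max_gap = max(
    abs(gamma_poisson_pmf(k, shape, scale) - neg_binomial_pmf(k, shape, scale))
    for k in range(6)
)
print(f"largest difference over k = 0..5: {max_gap:.2e}")
```

The two models agree on every observable probability up to integration error, so no amount of observed counts can favour one over the other. They nevertheless differ entirely in what they say about the unobserved part: the first implies a distribution for b given the data, while the second contains no b to predict.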
