From: Against reflexive recalibration: towards a causal framework for addressing miscalibration
| Dataset shift domain (cause of miscalibration) | Specifics (variables affected) | Real-world example | Investigation of miscalibration |
|---|---|---|---|
| Difference in clinical practice | Admission policies, threshold for surgery, medications prescribed, pathology grading | van den Boogaard et al. [18] | A model predicting delirium in ICU patients had poor calibration for participants in a multinational observational study. The model's overestimation of the risk of developing delirium could be explained by differences in ICU admission policies and treatments, specifically sedation protocols. Varied sedation practices affect the level and duration of sedation, influencing the likelihood and severity of delirium and therefore the model's performance. |
| | | Rauh et al. [14] | A model predicting 7-year risk of chronic cardiometabolic diseases had poor calibration for participants in AusDiab, a population-based cross-sectional study. The model overestimated disease rates because it was developed with data from 1989 to 2005, whereas the AusDiab study was conducted from 2004 to 2012, a period of increased use of antihypertensives and statins. |
| | | | PCPTRC, a model predicting risk of prostate cancer, had poor calibration for patients in both the North American and European cohorts of the Prostate Biopsy Collaborative Group (PBCG). The PCPTRC model's underestimation of risk may be explained by the switch in clinical practice from six-core to 12-core biopsy procedures. Additionally, changes in how pathologists grade prostate cancer have increased the prevalence of high-grade disease in contemporary cohorts. |
| Difference in human behavior | Diet, exercise, clinician skill | DeFilippis et al. [20] | AHA-ACC-ASCVD, a model for predicting risk of cardiovascular events, had poor calibration in MESA, a multicenter prospective community-based epidemiologic study with a sex-balanced, multiethnic cohort. The model's overestimation of atherosclerotic cardiovascular disease risk may be explained by differences in salt and trans fat intake between participants in AHA-ACC-ASCVD's decades-old development data and MESA's modern cohort. |
| Difference in data collection techniques | Family history | Ankerst et al. [17] | PCPTRC, a model predicting risk of prostate cancer, had poor calibration for patients in both the North American and European cohorts of the Prostate Biopsy Collaborative Group (PBCG). The PCPTRC model's underestimation of risk may be explained by its development on a screening trial in which family history was a required data element for all participants. In contrast, the PBCG model, which was well calibrated for both cohorts, was developed using data from clinical records. Clinical records might only include family history of a disease if it was more aggressive (e.g., cancer in a family member might only be noted if it led to death). This difference in family history collection would yield different odds ratios for family history and thus different predicted risks. |
| Difference in clinical application of the model | Case mix, patient demographics, model setting | Ashburner et al. [21] | CHARGE-AF, a model predicting atrial fibrillation (AF) risk, had poor calibration for participants in a medical record-based study at a tertiary hospital. The model's underestimation of AF risk can be explained by its development in a community setting, where AF incidence is lower, and its application in an academic setting, whose population has a higher underlying risk of developing AF. |
| | | Das et al. [22] | A model for predicting 30-day mortality in patients with acute myocardial infarction was trained and externally validated on patients enrolled in a randomized controlled trial. When this model was deployed on a community-based cohort, risk was underestimated. This was attributed to differences in patient demographics: compared to the general population, patients in cardiovascular RCTs tend to be younger and male, undergo more revascularization, and have fewer comorbid conditions. |
| | | Steyerberg et al. [23] | A model predicting indolent prostate cancer that was developed on a cohort in a clinical setting underestimated risks of indolent cancer for a cohort in a screening setting. Patients presenting clinically generally do so for some reason, and therefore have a higher risk of more aggressive disease. |
| Difference in nomenclature | Definitions, medical coding/billing | Rauh et al. [14] | A model predicting 7-year risk of chronic cardiometabolic diseases had poor calibration for participants in AusDiab, a population-based cross-sectional study. The model overestimated disease rates because its development data defined cardiovascular disease differently from the AusDiab study. |
| Difference in predictor and outcome relationships | Age, BMI, cholesterol, dementia | Licher et al. [24] | Two models predicting dementia were found to underestimate low risks and overestimate high risks at external validation, even after recalibration by refitting the intercept. The miscalibration was attributed to the original models' development on a younger cohort than the validation cohort, a critical difference because the associations between dementia and predictors such as BMI and cholesterol vary with age. |
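Several rows above describe systematic over- or underestimation that reflexive recalibration would patch by refitting the model's intercept ("calibration-in-the-large"), the simplest form of the refitting mentioned in Licher et al. [24]. A minimal numpy sketch of that procedure is below; the simulated data and function names are illustrative, not taken from any of the cited studies:

```python
import numpy as np

def refit_intercept(p, y, n_iter=25):
    """Intercept-only logistic recalibration ("calibration-in-the-large").

    Keeps the model's ranking of patients intact: every predicted
    probability p is shifted on the logit scale by a single constant
    delta, chosen by maximum likelihood (Newton's method) so that the
    mean recalibrated probability equals the observed event rate.
    """
    lp = np.log(p / (1 - p))  # logits of the original predictions
    delta = 0.0
    for _ in range(n_iter):
        q = 1 / (1 + np.exp(-(lp + delta)))  # current recalibrated probs
        grad = np.sum(y - q)                 # d log-likelihood / d delta
        hess = -np.sum(q * (1 - q))          # second derivative
        delta -= grad / hess                 # Newton update
    return delta

# Simulated example: a model that systematically overestimates risk,
# as in the Rauh et al. [14] and DeFilippis et al. [20] rows.
rng = np.random.default_rng(0)
true_p = rng.uniform(0.05, 0.4, size=5000)        # true event risks
y = rng.binomial(1, true_p)                       # observed outcomes
overest = np.clip(true_p * 1.8, 0.01, 0.99)       # miscalibrated predictions

delta = refit_intercept(overest, y)               # negative: shifts risks down
recal = 1 / (1 + np.exp(-(np.log(overest / (1 - overest)) + delta)))
```

After the shift, the mean recalibrated prediction matches the observed event rate, but slope miscalibration of the kind Licher et al. [24] report (low risks underestimated, high risks overestimated) survives by construction, which is one reason intercept refitting alone can be insufficient.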