Table 1 Examples of investigations into the root causes of miscalibration identified during external validation. These examples were identified through a literature search of PubMed and Google Scholar covering papers published on or before February 4, 2024. We focused on studies that explicitly reported calibration metrics or provided detailed discussions of miscalibration in clinical prediction models. The selection aimed to encompass diverse scenarios, such as biological differences, temporal shifts, and institutional variations, to offer a comprehensive perspective on the factors influencing calibration.

From: Against reflexive recalibration: towards a causal framework for addressing miscalibration

| Dataset shift domain (cause of miscalibration) | Specific variables affected | Real-world example | Investigation of miscalibration |
| --- | --- | --- | --- |
| Difference in clinical practice | Admission policies, threshold for surgery, medications prescribed, pathology grading | van den Boogaard et al. [18] | A model predicting delirium in ICU patients had poor calibration for participants in a multinational observational study. The model's overestimation of the risk of developing delirium could be explained by differences in ICU admission policies and treatments, specifically sedation protocols. Varied sedation practices affect the level and duration of sedation, influencing the likelihood and severity of delirium and thereby the model's performance. |
| | | Rauh et al. [14] | A model predicting 7-year risk of chronic cardiometabolic diseases had poor calibration for participants in AusDiab, a population-based cross-sectional study. The model overestimated disease rates because it was developed with data from 1989 to 2005, whereas the AusDiab study was conducted from 2004 to 2012, a period of increased use of antihypertensives and statins. |
| | | Ankerst et al. [17], Vickers [19] | PCPTRC, a model predicting risk of prostate cancer, had poor calibration for patients in both the North American and European cohorts of the Prostate Biopsy Collaborative Group (PBCG). The PCPTRC model's underestimation of risk may be due to the switch in clinical practice from six-core to 12-core biopsy procedures. Additionally, changes in how pathologists grade prostate cancer have increased the prevalence of high-grade disease in contemporary cohorts. |
| Difference in human behavior | Diet, exercise, clinician skill | DeFilippis et al. [20] | AHA-ACC-ASCVD, a model for predicting risk of cardiovascular events, had poor calibration in MESA, a multicenter, prospective, community-based epidemiologic study of a sex-balanced, multiethnic cohort. The model's overestimation of atherosclerotic cardiovascular disease risk may be explained by differences in salt and trans fat intake between participants in the model's decades-old development data and MESA's modern cohort. |
| Difference in data collection techniques | Family history | Ankerst et al. [17] | PCPTRC, a model predicting risk of prostate cancer, had poor calibration for patients in both the North American and European cohorts of the PBCG. The PCPTRC model's underestimation of risk may be because it was built on a screening trial in which family history was a required data element for all participants. In contrast, the PBCG model, which was well calibrated for both cohorts, was developed using data from clinical records. Clinical records might note a family history of disease only when it was more aggressive (e.g., cancer in a family member might be recorded only if it led to death). This difference in how family history is collected would lead to different odds ratios for family history and thus to differences in predicted risk (illustrated numerically after this table). |
| Difference in clinical application of the model | Case mix, patient demographics, model setting | Ashburner et al. [21] | CHARGE-AF, a model predicting atrial fibrillation (AF) risk, had poor calibration for participants in a medical record-based study at a tertiary hospital. The model's underestimation of AF risk can be explained by the fact that it was developed in a community setting, with a lower incidence of AF, and applied in an academic setting, where the population has a higher underlying risk of developing AF. |
| | | Das et al. [22] | A model for predicting 30-day mortality in patients with acute myocardial infarction was trained and externally validated on patients enrolled in a randomized controlled trial. When this model was deployed on a community-based cohort, risk was underestimated. This was attributed to differences in patient demographics: compared with the general population, patients in cardiovascular RCTs tend to be younger and male, to undergo more revascularization, and to have fewer comorbid conditions. |
| | | Steyerberg et al. [23] | A model predicting indolent prostate cancer that was developed on a cohort in a clinical setting underestimated risks of indolent cancer for a cohort in a screening setting. Patients presenting clinically generally do so for a reason and therefore have a higher risk of more aggressive disease. |
| Difference in nomenclature | Definitions, medical coding/billing | Rauh et al. [14] | A model predicting 7-year risk of chronic cardiometabolic diseases had poor calibration for participants in AusDiab, a population-based cross-sectional study. The model overestimated disease rates because its development data defined cardiovascular disease differently from the AusDiab study. |
| Difference in predictor and outcome relationships | Age, BMI, cholesterol, dementia | Licher et al. [24] | Two models predicting dementia were found to underestimate low risks and overestimate high risks during external validation, even after recalibration by refitting the intercept (sketched after this table). The miscalibration was attributed to the models' development on a younger cohort than the validation cohort, which matters because the associations between dementia and predictors such as BMI and cholesterol vary with age. |
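To make the family-history example concrete: with purely hypothetical numbers, the sketch below shows how the same patient receives a different predicted risk when two otherwise identical logistic models assign different odds ratios to family history. The intercept, odds ratios, and helper function are illustrative assumptions, not values from the cited studies.

```python
# Hypothetical illustration (all numbers made up): a different odds ratio for
# family history yields a different predicted risk for the same patient.
import math

def predicted_risk(intercept: float, log_or_fh: float, family_history: int) -> float:
    """Logistic model with a single binary family-history predictor."""
    linear_predictor = intercept + log_or_fh * family_history
    return 1 / (1 + math.exp(-linear_predictor))

# Data where family history is collected for everyone might yield, say, OR = 1.3;
# clinical records capturing only aggressive family histories might yield OR = 2.5.
for odds_ratio in (1.3, 2.5):
    risk = predicted_risk(intercept=-2.0, log_or_fh=math.log(odds_ratio), family_history=1)
    print(f"OR = {odds_ratio}: predicted risk = {risk:.3f}")  # 0.150 vs. 0.253
```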
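For the Licher et al. row, "recalibration by refitting the intercept" refers to logistic calibration in which only the intercept is re-estimated on the validation data while the calibration slope is held at 1. The sketch below, assuming Python with numpy and statsmodels and simulated data, shows one common way to do this, using the original model's linear predictor as an offset; the cohort data and variable names are illustrative, not the authors' code.

```python
# Minimal sketch of intercept-only recalibration (logistic calibration with
# the slope fixed at 1). Cohort data here are simulated for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Stand-ins for an external validation cohort: the original model's predicted
# risks, and outcomes generated so that the model overestimates risk.
p_pred = rng.uniform(0.05, 0.60, size=2000)
y = rng.binomial(1, 0.7 * p_pred)

# The original model's linear predictor (logit of its predicted risk).
logit_pred = np.log(p_pred / (1 - p_pred))

# Refit only the intercept: logit P(y = 1) = alpha + 1 * logit_pred.
# Supplying logit_pred as an offset fixes the slope at exactly 1.
fit = sm.GLM(y, np.ones((len(y), 1)),
             family=sm.families.Binomial(),
             offset=logit_pred).fit()
alpha = float(fit.params[0])
print(f"estimated intercept update: {alpha:.3f}")  # negative here, shifting risks down

# Recalibrated risks: every prediction moves by alpha on the logit scale. This
# corrects overall over- or underestimation, but because the slope stays 1 it
# cannot fix the pattern Licher et al. report (low risks underestimated, high
# risks overestimated), which requires updating the slope as well.
p_recal = 1 / (1 + np.exp(-(alpha + logit_pred)))
```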