Taking Our Multi-Modal Discovery Engine to the Next Level – Meta-Analysis Part 1: Imputation

Recap

In a previous article, we introduced our revolutionary multi-modal risk biomarker discovery engine.

Now, we are excited to share how we continually optimize and derive the best parameters for each use case. This month, we’re diving into the critical technique of imputation.

Introduction: What is Imputation and Why is it Important?

Healthcare and biological data come with inherent missingness. This can significantly reduce the number of samples available for training powerful models, preventing us from utilizing the full potential of our datasets. Imputation is a statistical technique that mitigates this issue by estimating and filling in missing data points, thereby increasing the power and accuracy of our analyses.
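As a minimal illustration of the idea, the sketch below fills missing values in a toy table (not UK Biobank data) using scikit-learn's SimpleImputer: the mean for a continuous column and the mode for a categorical one. All column names and values here are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative toy data: one continuous and one categorical column with gaps.
df = pd.DataFrame({
    "bmi": [22.5, np.nan, 30.1, 27.4],
    "smoker": ["no", "yes", np.nan, "no"],
})

# Mean imputation for continuous values, mode (most frequent) for categorical.
df["bmi"] = SimpleImputer(strategy="mean").fit_transform(df[["bmi"]]).ravel()
df["smoker"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["smoker"]]).ravel()

print(df)  # no NaNs remain: bmi gap -> column mean, smoker gap -> "no"
```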

To test the impact different imputation algorithms can have, we applied our discovery engine to the UK Biobank, a dataset of 500k UK-based volunteers whose health has been followed since 2006 and which can be used to train, for example, disease risk prediction models. We conducted an experiment on some of the most common and impactful diseases (Type 2 diabetes; lung, breast and prostate cancer; asthma; Parkinson’s disease; Alzheimer’s disease; acute myocardial infarction; chronic ischaemic heart disease; heart failure; and cerebrovascular diseases) to compare the impact imputation can have in our platform.

Hypothesis & Experiment

We hypothesize that machine learning can play a crucial role in improving imputation techniques for healthcare data. To test this, we conducted an experiment with the following steps:

  1. Simulate Missing Data: We intentionally simulated various levels of missing data within a subset of the UK Biobank dataset.
  2. Run Imputation Algorithms: We then applied four different imputation algorithms to impute the missing data.
    • Standard Mean + Mode: Replaces missing continuous values with the mean of observed data, and missing categorical values with the mode (most frequent category).
    • KNN + Mode: Uses K-Nearest Neighbours to impute continuous variables by averaging the most similar samples based on non-missing features. 
      • Categorical variables are filled using the most common class among those neighbours.
    • MICE + Mode: Performs Multiple Imputation by Chained Equations (MICE) with 10 iterations:
      • Treat each incomplete variable as a dependent variable in a regression model using other variables as predictors.
      • Iteratively update missing values with predictions, repeating until convergence. 
      • Continuous variables use a Bayesian Ridge Regressor, while categorical variables use mode imputation.
    • CatBoost: A model-based method using the CatBoost gradient boosting algorithm. In a single iteration, for each variable with missing data:
      • Treat it as the target.
      • Train a CatBoost model using all other variables as predictors.
      • Predict and fill in missing entries.
      • Works seamlessly for both continuous and categorical variables.
  3. Compare Results: We then compared the outcomes of each imputation algorithm against the reference dataset (“NoImputation”), from which data had been removed to simulate missingness, evaluating:
    • Imputation accuracy across data modalities (baseline = questionnaire-based, clinical = clinical measures and standard blood biomarkers, proteomics = example of complex omics data)
    • Overall model performance (e.g., risk prediction C-index)
    • Run time
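The experimental loop above can be sketched as follows, with synthetic correlated data standing in for UK Biobank and scikit-learn imputers standing in for our production implementations (SimpleImputer for Mean, KNNImputer for KNN, and IterativeImputer with a BayesianRidge estimator approximating the MICE step). All sizes and parameters are illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)

# 1. Reference dataset ("NoImputation"): synthetic correlated continuous features.
n, p = 500, 5
latent = rng.normal(size=(n, 1))
X_ref = latent + 0.5 * rng.normal(size=(n, p))

# 2. Simulate missingness: mask 20% of entries completely at random (MCAR).
mask = rng.random(X_ref.shape) < 0.2
X_missing = X_ref.copy()
X_missing[mask] = np.nan

# 3. Impute with each algorithm and score accuracy against the reference.
imputers = {
    "Mean": SimpleImputer(strategy="mean"),
    "KNN": KNNImputer(n_neighbors=5),
    "MICE": IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0),
}
rmse = {}
for name, imp in imputers.items():
    X_imp = imp.fit_transform(X_missing)
    # RMSE only on the entries that were artificially removed.
    rmse[name] = float(np.sqrt(np.mean((X_imp[mask] - X_ref[mask]) ** 2)))
    print(f"{name}: RMSE on masked entries = {rmse[name]:.3f}")
```

Because the synthetic features are correlated, model-based imputation (the MICE-style imputer) recovers the masked entries more accurately than the column mean, mirroring the pattern we report below.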

Results

We observed promising results, illustrated in the plots below, which show averages across the 11 diseases.

We demonstrated the synergistic value of combining complex modalities: models containing features across clinical, proteomics and baseline demographic parameters outperformed simple single-category models for 7 of the 11 diseases evaluated.

Comparing imputation algorithms, we observe that MICE + Mode and CatBoost perform best at re-imputing the artificially introduced missingness relative to the “NoImputation” reference dataset.

Rerunning the pipeline with Mean + Mode imputation

Rerunning the pipeline with Mean + Mode imputation (the fastest algorithm) led to significant improvements in model performance: up to an 82% larger cohort size and a 16% higher C-index. This is illustrated under StartWithMeanModeImputation in the figure.
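The C-index used as our performance metric is the probability that, for a randomly chosen comparable pair of subjects, the model assigns the higher risk score to the subject who experiences the event earlier. A minimal pairwise implementation of Harrell's C-index (simplified: ties in event time are skipped) is sketched below; the toy numbers are invented for illustration.

```python
import itertools

def c_index(times, events, risks):
    """Harrell's concordance index.

    A pair (i, j) is comparable if the subject with the shorter time had an
    observed event (events[i] == 1; 0 means censored). The pair is concordant
    if that subject also has the higher predicted risk; risk ties count 0.5.
    """
    concordant, comparable = 0.0, 0
    for i, j in itertools.combinations(range(len(times)), 2):
        if times[j] < times[i]:
            i, j = j, i                      # ensure i has the shorter time
        if times[i] == times[j] or not events[i]:
            continue                         # pair is not comparable
        comparable += 1
        if risks[i] > risks[j]:
            concordant += 1.0
        elif risks[i] == risks[j]:
            concordant += 0.5
    return concordant / comparable

# Toy example: higher risk scores should align with earlier events.
times  = [2.0, 5.0, 7.0, 9.0]
events = [1, 1, 0, 1]                        # 0 = censored
risks  = [0.9, 0.6, 0.7, 0.2]
print(c_index(times, events, risks))         # → 0.8
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why even a few points of improvement is meaningful for risk prediction.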

Figure 2: Average model performance comparison (C-index).

 

Figure 3: Total Runtime of discovery engine per imputation algorithm configuration.

Conclusion

Our preliminary results indicate that employing machine learning-driven imputation significantly enhances the power and accuracy of our biomarker discovery engine by effectively addressing missing data in healthcare datasets. This allows us to leverage the full potential of available data, leading to more robust and reliable risk prediction models.

 

Limitations

It is important to acknowledge that different data types exhibit different missingness mechanisms, which are not always random. Our current research focuses on general missingness patterns, and we will continue to explore more nuanced scenarios.

 

Looking Ahead

We are continuously working to refine our imputation strategies and to optimise each component of the pipeline so that optimal settings can be chosen for each run.

Our platform will enable clinicians and biopharma R&D to tailor biomarker panel composition based on the real-world economic and operational constraints of diagnostics, companion tests or adaptive trial designs.