Revolutionising Disease Prediction: Multi-Modal Risk Biomarker Discovery Engine
What Are Risk Biomarkers?
Doctors and scientists use biomarkers to understand who is at risk of disease and which treatments work best for different people. Risk biomarkers are measurable indicators that can predict an individual’s likelihood of developing a specific disease before traditional symptoms appear. Unlike diagnostic tests that confirm disease after onset, risk biomarkers enable proactive healthcare by identifying at-risk individuals during the pre-symptomatic phase.
These biomarkers can range from genetic variants and protein levels to clinical measurements and imaging features. The power lies not just in individual markers, but in how different types of biological data work together to create a comprehensive risk profile for a disease. There is a tradeoff between the accuracy of biomarkers and their real-world implementation cost and complexity, and this tradeoff dictates which markers are adopted by healthcare systems and go on to benefit patients. Improvements in risk biomarker discovery can therefore lead to better ways of preventing and treating diseases for a wider range of people.
Our Multi-Modal Approach
Our products tackle the slow process of biomarker discovery. Biomarker discovery used to take years. Now, it can take days using our Innovate Platform. We developed an engine to perform unbiased and robust biomarker discovery for diseases. We analyse large global health datasets so that people from different ethnic backgrounds and populations can benefit from these discoveries, which is critical to achieving health equity.
Our biomarker discovery engine identifies biomarkers at a large scale by harnessing multiple data modalities from multiple cohorts. We use over 2.3 million sample records across diverse populations, examining more than 1,500 disease codes and over 5,000 phenotypes. Together, the different data types give a holistic view of human health: baseline clinical and demographic variables (e.g. age, sex, BMI, smoking, socioeconomic status); genetics (e.g. Polygenic Risk Scores, individual genetic variants); routine clinical biomarkers (e.g. cholesterol, HbA1c); EHR data; wearables; and other omics variables (e.g. proteomics, epigenetics). Furthermore, we plan to integrate different imaging data types (e.g. MRI, ECG, CT) in the coming months. This multi-modal integration enables our AI/ML methods to discover previously hidden patterns and relationships that single-data-type approaches might miss. Such an approach leads to more robust and clinically relevant biomarker discoveries and makes the tradeoff between cost, accessibility and real-world performance of these biomarkers explicit, so that they can be deployed in future preventative and precision medicine solutions.
Validation and Interpretability
Our computational engine leverages advanced artificial intelligence and machine learning (AI/ML) algorithms that optimise biomarker discovery across multiple dimensions. The system trains models on several cohorts that include people with diverse genetic backgrounds and from different environments, so that our tools are built to work well across populations. Using different data types and cohorts also lets us assess how traditional clinical markers perform compared to cutting-edge omics data.
Model interpretability is central to our system. We identify the most informative features for each disease using several feature importance methods (e.g. split gain, SHapley Additive exPlanations, Local Interpretable Model-agnostic Explanations). Through systematic analysis, our biomarker discovery engine determines the optimal tradeoff between number of features, cost, data modality usage, and model performance, ensuring that biomarker identification is both scientifically robust and economically viable. All findings are validated across multiple biobank cohorts, providing confidence in the reproducibility and generalisability of discovered biomarkers.
This approach allows us to answer critical questions such as: Does adding proteomic data improve prediction over clinical markers alone? How do polygenic risk scores (PRS) compare to traditional biomarkers? Which combination provides the most actionable insights?
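One way to picture the tradeoff between cost and performance is as a Pareto front: a modality combination is worth keeping only if no other combination is both cheaper and at least as accurate. The Python sketch below is a minimal illustration under stated assumptions, not a description of how the engine works internally. The `Candidate` class, the `pareto_front` function, the cost units and all numbers in the example are hypothetical placeholders, not results from the platform.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """A hypothetical model built from one combination of data modalities."""
    name: str
    cost: float      # assumed per-person measurement cost, arbitrary units
    c_index: float   # assumed validation performance


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates that no other candidate dominates.

    A candidate is dominated when another one costs no more, performs
    at least as well, and is strictly better on at least one of the two.
    """
    front = []
    for c in candidates:
        dominated = any(
            o.cost <= c.cost
            and o.c_index >= c.c_index
            and (o.cost < c.cost or o.c_index > c.c_index)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.cost)


# Placeholder values for illustration only; these are NOT platform results.
examples = [
    Candidate("baseline", cost=1.0, c_index=0.70),
    Candidate("baseline + clinical biomarkers", cost=2.0, c_index=0.78),
    Candidate("baseline + PRS", cost=5.0, c_index=0.75),
    Candidate("baseline + clinical biomarkers + proteomics", cost=10.0, c_index=0.85),
]

for candidate in pareto_front(examples):
    print(candidate.name, candidate.cost, candidate.c_index)
```

With these placeholder values, the "baseline + PRS" option drops out because a cheaper option performs better, while the other three remain as defensible choices at different budgets.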
To date, our biomarker discovery engine has produced thousands of biomarker assets, with hundreds of those performing well enough to be considered for use in clinical practice. In the next section, we showcase our results for one of these assets.
Case Study: Type 1 Diabetes Mellitus Care
Type 1 diabetes is an autoimmune disease where the body’s immune system attacks and destroys the insulin-producing beta cells in the pancreas. Type 1 diabetes typically develops suddenly and requires lifelong insulin therapy. The global impact is profound: 8.4 million people worldwide live with this condition, yet tragically, 35,000 undiagnosed individuals die within 12 months of symptomatic onset each year. With cases projected to nearly double by 2040, early detection has never been more critical [ref].
Type 1 diabetes exemplifies the urgent need for better risk prediction, as the current diagnostic pathway reveals critical gaps in care delivery. The absence of routine screening for at-risk individuals means that patients typically present at the end of the prodrome, after developing acute symptoms, often in critical situations. Alarmingly, 50% of children present with life-threatening diabetic ketoacidosis (DKA), a dangerous condition that can lead to coma or death if not treated immediately. Current diagnosis relies entirely on glucose measurements that are triggered only after obvious disease symptoms appear, representing a fundamentally reactive rather than proactive approach to care.
This late-stage detection creates significant missed opportunities for both patients and healthcare systems. Up to 1 year of prodromal phase goes completely undetected by current testing, during which the autoimmune destruction of pancreatic beta cells is actively occurring but producing no obvious symptoms. Furthermore, all patients currently receive identical treatment pathways despite growing evidence for distinct disease subtypes that may respond differently to various therapeutic approaches. These early intervention opportunities, if captured, could dramatically reduce the risk of acute complications and potentially slow or prevent the complete destruction of insulin-producing cells.
Our multi-modal biomarker approach has identified a biomarker asset for Type 1 diabetes that could change the diagnosis and prevention of this condition. The metric used to quantify model performance is the concordance index (c-index), which indicates how well the model ranks individuals by their risk of developing the disease within 10 years. A value of 0.5 means the model is doing no better than flipping a coin (pure chance), while 1.0 means it ranks every pair of individuals correctly. Anything above 0.7 is generally considered good in medical prediction models. For example, PSA (Prostate-Specific Antigen) was associated with prostate cancer mortality with a c-index between 0.628 and 0.862 based on PSA levels [ref], showing that even widely used clinical tools often fall within this range of performance. As shown in Figure 1, our model achieves a c-index of 0.894 when including baseline clinical and demographic variables, genetics, proteomics, clinical biomarkers, and EHR data.

The best-performing model combines baseline variables with clinical biomarkers, proteomics and polygenic risk scores. Each added data modality improves performance for predicting Type 1 diabetes. The top predictive features reveal fascinating biological insights, ranging from clinical biomarkers such as HbA1c, glucose, cholesterol, and LDL to novel protein biomarkers such as GDF15, REN, and many others.
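To make the c-index used above more concrete, here is a minimal Python sketch of one common formulation, Harrell's concordance index for follow-up data with censoring. This is an assumption chosen for illustration and not necessarily the exact estimator behind the reported figures. The function name, argument names and toy values are hypothetical.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float],
    events: Sequence[int],
    risk_scores: Sequence[float],
) -> float:
    """Harrell's c-index: share of comparable pairs ranked correctly.

    times       : follow-up time until diagnosis or end of observation
    events      : 1 if the disease was observed, 0 if follow-up was censored
    risk_scores : model output, higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A pair is only comparable if the earlier time is an observed event.
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier case was given the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Toy data: four people, the third one is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]

print(concordance_index(times, events, [0.9, 0.7, 0.4, 0.1]))  # 1.0, perfect ranking
print(concordance_index(times, events, [0.5, 0.5, 0.5, 0.5]))  # 0.5, no better than chance
```

The sketch ignores pairs with identical follow-up times and runs in quadratic time, which is fine for illustration but not for cohort-scale data.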
Hurdle’s Innovate Platform has the potential to transform patient care. By enabling earlier detection during the prodromal phase, our biomarker approach could dramatically reduce DKA risk through timely intervention, potentially saving thousands of lives annually. Beyond early detection, the model’s ability to identify disease subtypes opens the door to personalised treatment approaches, moving away from the current one-size-fits-all paradigm. Perhaps most exciting is the prevention opportunity for high-risk individuals. Identifying those developing Type 1 diabetes before any symptoms appear would allow earlier intervention that could delay or even prevent disease onset entirely.
The Bigger Picture: Scaling Across Diseases
The Type 1 diabetes case study demonstrates the potential of our approach for analysing thousands of conditions and diseases. Our platform’s ability to integrate diverse data types and identify optimal biomarker combinations positions it to transform risk prediction across cardiovascular diseases, neurological disorders, cancer subtypes, autoimmune conditions, women’s health, metabolic syndromes and many other therapeutic areas.
Limitations
While our results demonstrate significant promise, we acknowledge several important limitations. Some of these have been addressed, and others remain under exploration. The first relates to data representation and generalisability, as biobank populations may not comprehensively represent all demographic groups or diseases. Although some datasets are large and diverse, they may have limited statistical power for rarer diseases or subpopulations. This risks creating models that underperform in healthcare systems with different population demographics, diagnostic approaches, or disease patterns. Additionally, accessing comprehensive datasets remains challenging due to long application processes, data availability constraints and the limited supply of open-source health data. We encourage regulatory bodies and companies to consider how open-source health data initiatives could be streamlined and scaled, and we are committed to supporting these efforts with our expertise. We are also keen to support research that could expand coverage for rare diseases.
Technical and regulatory challenges also require systematic solutions. Integrating multi-modal data presents significant quality control challenges due to inconsistencies in data resolution, standardisation, and timing across clinical records, omics data, and electronic health records. We are addressing these challenges through preprocessing pipelines that harmonise features and impute missing data where necessary, and by adopting appropriate modelling strategies. Furthermore, the critical transition from research biomarkers to regulated, market-ready markers requires robust business infrastructure and regulatory expertise.
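As a rough illustration of what harmonising and imputing can involve, the Python sketch below converts glucose readings recorded in different units to a single unit and fills missing values with the cohort median while keeping a flag for what was filled. The function names, the choice of median imputation and the sample values are assumptions made for this example and do not describe the actual pipeline.

```python
from statistics import median
from typing import List, Optional, Tuple

# Glucose has a molar mass of about 180.16 g/mol, so 1 mmol/L is about 18.016 mg/dL.
MGDL_PER_MMOLL_GLUCOSE = 18.016


def harmonise_glucose(value: Optional[float], unit: str) -> Optional[float]:
    """Convert a glucose reading to mmol/L; missing readings stay missing."""
    if value is None:
        return None
    if unit == "mmol/L":
        return value
    if unit == "mg/dL":
        return value / MGDL_PER_MMOLL_GLUCOSE
    raise ValueError(f"unsupported glucose unit: {unit}")


def impute_median(values: List[Optional[float]]) -> Tuple[List[float], List[bool]]:
    """Fill missing values with the median of the observed ones.

    Also returns a missingness flag per value, so that a downstream model
    can still tell measured values apart from imputed ones.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a feature with no observed values")
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return imputed, was_missing


# Hypothetical readings from two sources that record glucose differently.
raw = [(5.0, "mmol/L"), (None, "mmol/L"), (126.112, "mg/dL"), (6.0, "mmol/L")]
harmonised = [harmonise_glucose(value, unit) for value, unit in raw]
glucose_mmoll, glucose_missing = impute_median(harmonised)
```

Median imputation is only one simple option. Whether a missing value should be imputed at all depends on why it is missing, which is one reason to keep the flag alongside the filled value.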
Looking Ahead
Our platform demonstrates the real-world value of multi-modal data in accelerating and refining biomarker discovery. By integrating diverse data types, from clinical markers and genetics to proteomics and EHR-derived signals, we’ve built a foundation for predictive, inclusive, and personalised healthcare. The next phase is all about impact: translating these discoveries into regulated medical devices. We are actively working to validate and transition biomarker assets into real-world healthcare settings, prioritising interpretability, scalability, and accessibility. We continue to evolve our models by incorporating new data cohorts and exploring novel machine learning techniques. This journey is not just about uncovering biomarkers, but about redefining how diseases are predicted, diagnosed, and ultimately prevented. If you’re looking to streamline your biomarker discovery, accelerate clinical validation, or build robust risk models, our team of experts would love to hear from you and explore how we can help you achieve your goals. For pharmaceutical and biotech companies, we also offer licensing of biomarker assets and risk-score-based patient stratification to enhance drug development pipelines and support regulatory submissions.
About the Author

Constantin Petrescu
Constantin Cezar Petrescu holds a PhD in Computer Science, specialising in Program Analysis and Data-Driven Analysis. He has extensive experience in machine learning, program analysis, and information security. At Hurdle, Constantin contributes to the development of innovative AI-powered health solutions.