Abstract
Background: Pancreatic cancer (PC) is typically diagnosed at an advanced stage but is often preceded by new-onset diabetes mellitus (NODM), which provides a window for early detection. We sought to develop and validate an interpretable machine-learning model integrated with multi-omics profiling to identify early biomarkers of NODM-associated PC.
Methods: In a population-based cohort, individuals with NODM-associated PC and NODM without PC were identified and randomly divided (70:30) into training and validation sets after feature selection. Eight machine learning (ML) classifiers were compared using fivefold cross-validation, and model performance was evaluated in terms of discrimination, calibration, and decision curve-based clinical utility. Interpretability was evaluated using Shapley additive explanations (SHAP) analyses. To explore mechanisms, Olink proteomic and metabolomic profiles were analyzed under both clinical classifications and model-defined risk strata.
Results: Categorical boosting achieved the best performance in the independent validation set (AUROC = 0.844). The NODM cohort was stratified into high-risk (n = 2,362) and low-risk (n = 5,030) groups, and internal validation together with SHAP analyses demonstrated consistent model performance and identified clinically interpretable predictors. Proteomic and metabolomic analyses under clinical and risk-based grouping identified 39 overlapping differentially expressed proteins and 145 overlapping metabolites, with enrichment across 11 shared KEGG pathways. Cross-platform validation highlighted PLTP, CRTAC1, and ITGAV as serum biomarkers with strong potential for early NODM-PC detection.
Conclusions: We developed an interpretable ML framework centered on NODM that enables practical risk stratification for early PC detection and, supported by multi-omics profiling, provides a pathway of ML-based triage followed by biomarker confirmation for earlier detection and diagnosis.</p>