Abstract
Multiple imputation is well established for handling missing data, yet its use in high-dimensional genetic datasets remains limited. Using pharmacokinetic tuberculosis simulations and SNP data from the 1000 Genomes Project, we compared machine learning (ML) and traditional approaches (e.g., mean imputation and complete-case analysis) for imputation and covariate selection. We developed a multiple imputation framework that incorporates genotype probabilities, imputation uncertainty (INFO score), and missingness percentages. Dimensionality reduction enabled scalable random forest and penalized regression for covariate selection. In simulations, only multiple imputation achieved adequate coverage (the percentage of 95% confidence intervals containing the true value), exceeding the 90% nominal threshold. For example, in imputation-server analyses, coverage improved from 0% with single imputation to as high as 94% under 10% missingness. Applied to clinical warfarin datasets (War-PATH, n = 548; IWPC, n = 316) and the UK Biobank (n = 500 and 1000), multiple imputation recovered known pharmacogenomic associations (CYP2C9*8/*9/*11; VKORC1 -1639G>A), reduced false positives, and detected signals missed by single imputation (e.g., the genome-wide significant rs4697699 at the SLC2A9 locus). Computational costs were modest: 10 imputations added only ~1.25 minutes to the 22.7 minutes required by single imputation on the Michigan Imputation Server. For SNP selection, penalized regression performed best in the high-effect scenario (F1 = 0.897 ± 0.091), while GWAS followed by random forest performed best in the low-effect scenario (F1 = 0.657 ± 0.110). These findings show that multiple imputation improves reliability and discovery in high-dimensional pharmacogenomics, with ML offering promising but inconsistent benefits for SNP selection. However, generalizability beyond the studied datasets and computational scalability to larger, biobank-scale analyses remain important limitations that warrant further investigation.
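For readers unfamiliar with the mechanics behind the coverage results above: multiple imputation fits the same model on each of the m imputed datasets and combines the results with Rubin's rules, and coverage is then the share of simulation replicates whose pooled 95% confidence interval contains the true effect. The following minimal Python sketch illustrates only the standard pooling step; it is not the authors' pipeline, and the function name, toy estimates, and m = 10 setting are purely illustrative.

```python
# Illustrative sketch (not the paper's code): pooling one regression
# coefficient across m imputed datasets using Rubin's rules.
import numpy as np

def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and squared standard errors.

    estimates, variances: 1-D arrays of length m (one entry per imputation).
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    q_bar = np.mean(estimates)          # pooled point estimate
    u_bar = np.mean(variances)          # within-imputation variance
    b = np.var(estimates, ddof=1)       # between-imputation variance
    t = u_bar + (1 + 1 / m) * b         # total variance (Rubin's rules)
    return q_bar, np.sqrt(t)

# Toy usage: hypothetical coefficients and SE^2 from the same model
# refit on m = 10 imputed genotype datasets.
rng = np.random.default_rng(0)
est = 0.8 + 0.05 * rng.standard_normal(10)
var = np.full(10, 0.04 ** 2)
q, se = pool_rubin(est, var)
lo, hi = q - 1.96 * se, q + 1.96 * se
print(f"pooled beta = {q:.3f}, SE = {se:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Because the between-imputation term b widens the pooled interval when the m analyses disagree, pooled intervals are generally wider and better calibrated than a single imputation's interval, which is consistent with the coverage gap (0% vs. up to 94%) reported above.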