Abstract
Penalized regression methods are widely used for variant selection and polygenic risk score (PRS) analysis in disease genome-wide association studies (GWASs). However, existing penalized regression-based PRS methods often neglect genotype-environment interaction (GEI) and struggle with high-dimensional GWAS data. To overcome these challenges, we propose a novel machine learning-based PRS method, the Genotype-Environment interaction-based Polygenic Risk Score (GEiPRS). GEiPRS simultaneously models genotype (G) and GEI effects and efficiently handles high-dimensional GWAS data in variant selection, PRS construction, and prediction. We develop a novel algorithm, Group ITerative LAsso with Batch Screening (GITLABS), to efficiently compute iterative Group Lasso (GL) or Sparse Group Lasso (SGL) solutions for variant selection in GEiPRS, which makes high-dimensional variant selection and PRS construction computationally feasible. GITLABS consists of three steps: screening variants using strong rules, fitting a GL/SGL model with the selected variants, and checking the validity of the model solutions based on safe rules. Extensive simulations show that GEiPRS outperforms existing PRS methods in terms of GEI-PRS association P-values, prediction accuracy, subgroup risk stratification, and computational efficiency. We further apply GEiPRS to large-scale UK Biobank GWAS data for three pairs of quantitative traits and environment variables; the results demonstrate superior performance of GEiPRS over existing PRS methods and support the main conclusions from our simulations.
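To make the three-step screen, fit, check loop concrete, the following is a minimal illustrative sketch in Python, not the GITLABS implementation itself. It rests on several assumptions that the abstract does not state: each variant j forms a group of two columns (G_j and G_j × E), the phenotype is centered and the columns are standardized, the objective is (1/2n)‖y − Xβ‖² + λ Σ_g √2 ‖β_g‖₂ (plain GL, no SGL term), the model is fitted by proximal gradient descent, and the validity check is a Karush-Kuhn-Tucker (KKT) check on the discarded groups, used here as a stand-in for the checking step. The function names `group_scores`, `fit_active`, and `screen_fit_check` are hypothetical.

```python
import numpy as np

W = np.sqrt(2.0)  # group weight sqrt(p_g); every group has 2 columns (assumption)

def group_scores(X, r):
    """Return ||X_g^T r||_2 / n for each variant group g = (G_j, G_j*E)."""
    return np.linalg.norm((X.T @ r / len(r)).reshape(-1, 2), axis=1)

def fit_active(X, y, active, lam, n_iter=2000):
    """Proximal-gradient group lasso restricted to the active variant groups."""
    n, p = X.shape
    beta = np.zeros(p)
    if active.size == 0:
        return beta
    cols = np.stack([2 * active, 2 * active + 1], axis=1).ravel()
    Xa = X[:, cols]
    step = n / np.linalg.norm(Xa, 2) ** 2      # 1 / Lipschitz constant
    thr = step * lam * W
    b = np.zeros(cols.size)
    for _ in range(n_iter):
        z = (b + step * Xa.T @ (y - Xa @ b) / n).reshape(-1, 2)
        norms = np.linalg.norm(z, axis=1, keepdims=True)
        # Group soft-thresholding: shrink each group's norm by thr, floor at 0.
        b = (np.maximum(0.0, 1.0 - thr / np.maximum(norms, 1e-12)) * z).ravel()
    beta[cols] = b
    return beta

def screen_fit_check(X, y, lams):
    """Screen -> fit -> check along a decreasing sequence of penalties."""
    beta = np.zeros(X.shape[1])
    lam_prev = group_scores(X, y).max() / W    # smallest lam giving beta = 0
    path = []
    for lam in lams:
        # Step 1: sequential strong rule on the previous residual;
        # groups that were already nonzero are always kept.
        keep = group_scores(X, y - X @ beta) >= W * (2 * lam - lam_prev)
        keep |= np.any(beta.reshape(-1, 2) != 0, axis=1)
        while True:
            # Step 2: fit the group lasso on the surviving groups only.
            beta = fit_active(X, y, np.flatnonzero(keep), lam)
            # Step 3: KKT check on discarded groups; re-admit any violators.
            scores = group_scores(X, y - X @ beta)
            viol = ~keep & (scores > W * lam * (1 + 1e-6))
            if not viol.any():
                break
            keep |= viol
        path.append(beta.copy())
        lam_prev = lam
    return path

# Synthetic usage: 30 variants, one continuous environment variable.
rng = np.random.default_rng(0)
n, m = 200, 30
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)
E = rng.normal(size=n)
X = np.empty((n, 2 * m))
X[:, 0::2], X[:, 1::2] = G, G * E[:, None]
X = (X - X.mean(0)) / X.std(0)
y = X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n)   # variant 0: G and GEI effect
y -= y.mean()
lam_max = group_scores(X, y).max() / W
lams = lam_max * np.array([1.05, 0.8, 0.6, 0.45, 0.3])
path = screen_fit_check(X, y, lams)
```

Because the strong rule is a heuristic that can occasionally discard a truly active group, the check in step 3 is what keeps the restricted fit consistent with the full problem; the screening threshold only becomes informative when consecutive penalties are reasonably close, which is why the usage above takes moderate steps along the penalty grid.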