Genomics ⏵ Imputation
Description
Imputed genotype and phased haplotype values. Genotypes were imputed into the dataset using computationally efficient methods combined with the Haplotype Reference Consortium (HRC) and UK10K haplotype resources. This increased the number of testable variants over 100-fold to ~96 million variants, which are stored in the compressed and indexed BGEN v1.2 format. The imputed genotypes are aligned to the + strand of the reference and the positions are in GRCh37 coordinates.
The fields listed here are indicators; adding them to an Application basket allows researchers to download the corresponding underlying information from the UKB repository. This information includes:
- Imputation (2.6TB)
- Haplotypes (0.06TB)
The lists of SNPs in the imputed datasets can be downloaded from each field's Resources tab on a per-chromosome basis, or as combined tar archives in Resource 1965 and Resource 1671. The information scores and minor allele frequency data for the imputed genotypes (computed with QCTOOL) can also be downloaded in Resource 1967.
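As a rough illustration of how the information scores and minor allele frequencies might be used, the Python sketch below keeps only variants that pass a minimum information score and a minimum minor allele frequency. The eight-column, tab-separated, header-less layout is an assumption about the Resource 1967 files and is not stated on this page; the function name `filter_variants`, the thresholds and the example rows are hypothetical. Check the column order against the downloaded files before relying on it.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class VariantStats(NamedTuple):
    rsid: str
    position: int
    allele1: str
    allele2: str
    maf: float
    info: float


def filter_variants(
    handle: TextIO, min_info: float = 0.3, min_maf: float = 0.001
) -> Iterator[VariantStats]:
    """Yield variants whose info score and MAF meet the given thresholds."""
    # ASSUMED column order (not confirmed by this page):
    # alternate_id, rsid, position, allele1, allele2, maf, minor_allele, info
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 8:
            continue  # skip short or blank lines
        try:
            position = int(row[2])
            maf = float(row[5])
            info = float(row[7])
        except ValueError:
            continue  # skip rows with non-numeric values (e.g. a header)
        if info >= min_info and maf >= min_maf:
            yield VariantStats(row[1], position, row[3], row[4], maf, info)


# Made-up rows for illustration only; these are not real UK Biobank records.
example = io.StringIO(
    "1:1000_A_G\trs0000001\t1000\tA\tG\t0.25\tG\t0.98\n"
    "1:2000_C_T\trs0000002\t2000\tC\tT\t0.0002\tT\t0.91\n"
    "1:3000_G_T\trs0000003\t3000\tG\tT\t0.10\tT\t0.12\n"
)
kept = list(filter_variants(example))
print([v.rsid for v in kept])
```

In real use the `io.StringIO` object would be replaced by an open file handle for one per-chromosome file. The thresholds are placeholders and should be chosen to suit the analysis.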
Questions about using the imputed genotypes can be directed to a special UK Biobank mailing list, which can be joined at https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=UKB-GENETICS.
See Resource 530 for additional information, such as quality control details, and for a citable reference for publications.
Please note: a problem with version 2 of the UK Biobank imputed data was identified in 2017. Please ensure you are using version 3 of the imputed data, which was first released in 2018. Version 2 of the non-imputed data continues to be correct and current.
The problem with version 2 of the imputed data was as follows:
The genetic data was imputed using two different reference panels. The Haplotype Reference Consortium (HRC) panel was used wherever possible, but for SNPs not in that reference panel the UK10K + 1000 Genomes panel was used. The problem arose in the second set of imputed data from the UK10K + 1000 Genomes panel. The genotypes at these SNPs were imputed correctly, but they were not recorded as having the correct genome position in the files.
The imputed data from the HRC panel was not affected and had the correct positions. This was approximately 40M sites and included the majority of the common SNPs, i.e. the sites most likely to show genetic associations. These sites could be identified using the publicly available HRC site list at http://www.haplotype-reference-consortium.org/site.
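The check described above, separating sites present in the HRC site list from those imputed with the UK10K + 1000 Genomes panel, could be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure UK Biobank used: it assumes the site list is tab-separated with chromosome, position, ID, reference allele and alternate allele as its first five columns and a header line starting with `#`. The helper names `load_hrc_sites` and `split_by_panel` and all example rows are hypothetical.

```python
import io
from typing import Iterable, List, Set, TextIO, Tuple

# (chromosome, position, allele A, allele B)
Site = Tuple[str, int, str, str]


def load_hrc_sites(handle: TextIO) -> Set[Site]:
    """Read (chrom, pos, ref, alt) keys from a site list.

    ASSUMED layout: tab-separated, first five columns are
    CHROM, POS, ID, REF, ALT; header/comment lines start with '#'.
    """
    sites: Set[Site] = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        sites.add((chrom, int(pos), ref, alt))
    return sites


def split_by_panel(
    variants: Iterable[Site], hrc_sites: Set[Site]
) -> Tuple[List[Site], List[Site]]:
    """Split variants into those found in the HRC site list and the rest.

    A variant matches if chromosome and position agree and the two
    alleles agree in either order.
    """
    in_hrc: List[Site] = []
    not_in_hrc: List[Site] = []
    for chrom, pos, a1, a2 in variants:
        found = (chrom, pos, a1, a2) in hrc_sites or (chrom, pos, a2, a1) in hrc_sites
        (in_hrc if found else not_in_hrc).append((chrom, pos, a1, a2))
    return in_hrc, not_in_hrc


# Made-up rows for illustration only; these are not real HRC sites.
hrc_text = io.StringIO(
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t1000\trs0000001\tA\tG\n"
    "1\t3000\t.\tG\tT\n"
)
variants = [("1", 1000, "A", "G"), ("1", 2000, "C", "T"), ("1", 3000, "T", "G")]

hrc_sites = load_hrc_sites(hrc_text)
in_hrc, not_in_hrc = split_by_panel(variants, hrc_sites)
print(len(in_hrc), "variants in the HRC list;", len(not_in_hrc), "not in it")
```

Matching on position and alleles rather than on rsID is a design choice made here because identifiers can be missing or differ between resources. Both the variant list and the site list must use the same coordinate build (GRCh37 for the imputed data) for the comparison to be meaningful.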