Exome sequences - Genomics
Description
The first tranche of UK Biobank whole exome sequencing (WES) data was made available for 50,000 UK Biobank participants in March 2019, and data for an additional 150,000 participants was made available in October 2020.
The first 50k sample set prioritized individuals with whole-body MRI data, enhanced baseline measurements, hospital episode statistics (HES), and/or linked primary care records. Additionally, one disease area was selected for enrichment: individuals admitted to hospital with a primary diagnosis of asthma (ICD10 J45 or J46). The 50k sequenced set included 194 parent-offspring pairs, 613 full-sibling pairs, 26 trios, 1 monozygotic twin pair, and 195 second-degree genetically determined relationships.
Exomes were captured with the IDT xGen Exome Research Panel v1.0 including supplemental probes. The basic design targets 39Mbp of the human genome (19,396 genes). Multiplexed samples were sequenced with dual-indexed 75x75bp paired-end reads on the Illumina NovaSeq 6000 platform using S2 flow cells. On average, coverage exceeds 20X at 94.6% of targeted bases in each sample. Complete sequencing protocols are described in detail by the summary manuscript (add link when available).
This manuscript also fully describes the "SPB" primary and secondary analysis pipeline that converts raw sequencing data to a quality-controlled set of population variation. The SPB pipeline first converted all raw sequencing data to FASTQs according to Illumina NovaSeq best practices and aligned those reads to the GRCh38 reference genome with BWA-MEM to generate a CRAM file for each sample. After read-duplicate marking, SNVs and indels were called with WeCall (Genomics plc), generating a gVCF per sample. These gVCFs were joint genotyped using GLnexus (https://www.biorxiv.org/content/10.1101/572347v1) to create a single, unfiltered project-level VCF (pVCF).
Genotype depth filters (SNV DP>7, indel DP>10) were applied first. Variant site filters were then applied, requiring at least one variant genotype that passes an allele balance filter (heterozygous SNV AB>0.15, heterozygous indel AB>0.20). The result is a second, 'filtered' pVCF. A total of 4,735,722 variants were identified within targeted regions, and 9,693,536 variants were identified across all covered bases, including the 100bp regions flanking the capture targets.
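The two-stage SPB filtering described above (a per-genotype depth filter, then a per-site allele balance requirement) can be illustrated with a minimal Python sketch. This is not the SPB implementation: the `Genotype` class, the function names and the in-memory representation are hypothetical and chosen only for illustration. The sketch also makes two assumptions that the text does not state: the indel allele balance threshold is read as a lower bound (AB > 0.20), mirroring the SNV rule, and a homozygous variant genotype that survives the depth filter is treated as satisfying the site filter, since the AB thresholds are stated for heterozygous genotypes only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Genotype:
    """One sample's call at one site (hypothetical container)."""
    alt_copies: int        # 0, 1 or 2 copies of the alternate allele
    depth: int             # DP: read depth at the site
    allele_balance: float  # AB: fraction of reads supporting the alternate allele


# Thresholds taken from the text; both comparisons are strict (">").
MIN_DEPTH = {"snv": 7, "indel": 10}
# Assumption: the indel AB threshold is a lower bound, like the SNV one.
MIN_HET_AB = {"snv": 0.15, "indel": 0.20}


def apply_depth_filter(genotypes: List[Genotype], kind: str) -> List[Optional[Genotype]]:
    """Stage 1: set genotypes failing the depth threshold to missing (None)."""
    return [g if g.depth > MIN_DEPTH[kind] else None for g in genotypes]


def site_passes(genotypes: List[Genotype], kind: str) -> bool:
    """Stage 2: keep the site if at least one surviving variant genotype qualifies."""
    for g in apply_depth_filter(genotypes, kind):
        if g is None or g.alt_copies == 0:
            continue  # missing or homozygous reference: cannot support the site
        if g.alt_copies == 2:
            return True  # assumption: AB is only checked for heterozygotes
        if g.allele_balance > MIN_HET_AB[kind]:
            return True
    return False
```

For example, `site_passes([Genotype(1, 8, 0.16)], "snv")` keeps the site, whereas the same genotype at depth 7 is set to missing by the depth filter and the site is dropped.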
To maximize data utility and ease of use, an additional "Functionally Equivalent" (FE) pVCF was generated from the FASTQs, following the primary analysis protocol described in the 2018 manuscript (PMID: 30279509). The data were then subjected to GATK 3.0 variant calling and hard filtering, which removed variants with an inbreeding coefficient < -0.03 or without at least one variant genotype of DP≥10, GQ≥20 and, if heterozygous, AB≥0.20.
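The FE hard-filter rule can be written as a single predicate per site. The sketch below is only an illustration of the rule as stated above, not the GATK-based implementation; the `Call` class and the function names are hypothetical, and the input is assumed to be a precomputed inbreeding coefficient plus a list of per-sample calls.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Call:
    """One sample's genotype call at a site (hypothetical container)."""
    alt_copies: int  # 0, 1 or 2 copies of the alternate allele
    dp: int          # DP: read depth
    gq: int          # GQ: genotype quality
    ab: float        # AB: fraction of reads supporting the alternate allele


def call_supports_site(call: Call) -> bool:
    """True if this is a variant genotype with DP>=10, GQ>=20 and, if het, AB>=0.20."""
    if call.alt_copies == 0:
        return False  # not a variant genotype
    if call.dp < 10 or call.gq < 20:
        return False
    if call.alt_copies == 1 and call.ab < 0.20:
        return False  # the AB requirement applies to heterozygotes only
    return True


def keep_site(inbreeding_coeff: float, calls: List[Call]) -> bool:
    """Apply the FE hard filter: drop sites with inbreeding coefficient < -0.03
    or with no supporting variant genotype."""
    if inbreeding_coeff < -0.03:
        return False
    return any(call_supports_site(c) for c in calls)
```

Note that the thresholds here are inclusive (≥), unlike the strict inequalities used in the SPB depth and allele balance filters.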
Note that the original release of the SPB variant data (Fields 23170-23174) was found to contain errors and has been replaced by corrected values in Fields 23175-23179.
Both FE and SPB results were replaced by values from an improved and unified pipeline (OQFE) in Q4/2020 when the second tranche of data was released.