Exome sequences - Genomics
Description
The first tranche of UK Biobank whole exome sequencing (WES) data is now available for ~50,000 participants.
To ensure equality of access, the individual-level data are currently embargoed so that all researchers have an opportunity to download the PLINK-formatted data. The VCF files will be released by early April, followed by the CRAM files. Researchers who already have access to UK Biobank genetic data do NOT have to submit new baskets to request exome data; the Access team will do this for them automatically.
This sample set prioritizes individuals with whole-body MRI imaging data, enhanced baseline measurements, hospital episode statistics (HES), and/or linked primary care records. Additionally, one disease area was selected for enrichment: individuals admitted to hospital with a primary diagnosis of asthma (ICD-10 J45 or J46). The sequenced set includes 194 parent-offspring pairs (including 26 trios), 613 full-sibling pairs, 1 monozygotic twin pair, and 195 genetically determined second-degree relationships.
Exomes were captured with the IDT xGen Exome Research Panel v1.0, including supplemental probes. The basic design targets 39 Mbp of the human genome (19,396 genes). Multiplexed samples were sequenced with dual-indexed 75x75 bp paired-end reads on the Illumina NovaSeq 6000 platform using S2 flow cells. In each sample and among targeted bases, coverage exceeds 20X at 94.6% of sites on average. Complete sequencing protocols are described in detail in the summary manuscript (add link when available).
This manuscript also fully describes the "SPB" primary and secondary analysis pipeline, which converts raw sequencing data to a quality-controlled set of population variation. The SPB pipeline first converted all raw sequencing data to FASTQs according to Illumina NovaSeq best practices and aligned the reads to the GRCh38 reference genome with BWA-MEM to generate a CRAM file for each sample. After read-duplicate marking, SNVs and indels were called with WeCall (Genomics PLC), generating a gVCF per sample. These gVCFs were jointly genotyped using GLnexus (https://www.biorxiv.org/content/10.1101/572347v1) to create a single, unfiltered project-level VCF (pVCF).
Genotype depth filters (SNV DP>7, indel DP>10) were applied first. Variant site filters were then applied, requiring at least one variant genotype that passes an allele balance filter (heterozygous SNV AB>0.15, heterozygous indel AB>0.20), resulting in a second 'filtered' pVCF. A total of 4,735,722 variants were identified within targeted regions, and 9,693,536 variants were identified across all covered bases, including the 100 bp regions flanking the capture targets.
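The two-stage filter used to produce the 'filtered' pVCF can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SPB pipeline's code: the `Genotype` record and the function names are hypothetical, the indel allele balance threshold is assumed to be a lower bound (AB>0.20) like the SNV one, and homozygous variant genotypes are assumed to satisfy the site filter without an allele balance check.

```python
from typing import Iterable, NamedTuple


class Genotype(NamedTuple):
    """Hypothetical per-sample genotype record (not a real pipeline type)."""
    alt_copies: int        # 0 = reference, 1 = heterozygous, 2 = homozygous variant
    depth: int             # DP: read depth at the site
    allele_balance: float  # AB: fraction of reads supporting the variant allele


# Thresholds quoted in the text. Treating the indel AB threshold as a
# lower bound is an assumption.
MIN_DEPTH = {"snv": 7, "indel": 10}        # genotype kept only if DP > value
MIN_HET_AB = {"snv": 0.15, "indel": 0.20}  # heterozygote passes only if AB > value


def passes_depth(gt: Genotype, kind: str) -> bool:
    """Stage 1: genotype-level depth filter (SNV DP>7, indel DP>10)."""
    return gt.depth > MIN_DEPTH[kind]


def site_passes(genotypes: Iterable[Genotype], kind: str) -> bool:
    """Stage 2: keep the site if at least one variant genotype survives."""
    for gt in genotypes:
        if gt.alt_copies == 0 or not passes_depth(gt, kind):
            continue  # reference genotypes and low-depth genotypes do not count
        if gt.alt_copies == 2:
            return True  # assumption: AB is only checked for heterozygotes
        if gt.allele_balance > MIN_HET_AB[kind]:
            return True
    return False
```

For example, `site_passes([Genotype(1, 8, 0.16)], "snv")` keeps the site, while the same genotype at depth 7 does not, because the depth comparison is strict.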
To maximize data utility and ease of use, an additional "Functionally Equivalent" (FE) pVCF was generated from the FASTQs, following the primary analysis protocol described in the 2018 manuscript (PMID: 30279509). The resulting data were then subjected to GATK 3.0 variant calling and hard filtering, which removed variants with an inbreeding coefficient < -0.03 or without at least one variant genotype with DP≥10, GQ≥20 and, if heterozygous, AB≥0.20.
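The FE hard filter can be expressed as a site-level predicate. The sketch below is a minimal illustration of the rule as stated in the text, not the GATK implementation; the `FEGenotype` record and the function names are hypothetical, and AB is assumed to mean the fraction of reads supporting the variant allele.

```python
from typing import Iterable, NamedTuple


class FEGenotype(NamedTuple):
    """Hypothetical per-sample genotype record for the FE call set."""
    alt_copies: int        # 0 = reference, 1 = heterozygous, 2 = homozygous variant
    depth: int             # DP
    genotype_quality: int  # GQ
    allele_balance: float  # AB (assumed: fraction of reads supporting the variant)


def fe_genotype_ok(gt: FEGenotype) -> bool:
    """True if this is a variant genotype with DP>=10, GQ>=20 and, if het, AB>=0.20."""
    if gt.alt_copies == 0:
        return False
    if gt.depth < 10 or gt.genotype_quality < 20:
        return False
    if gt.alt_copies == 1 and gt.allele_balance < 0.20:
        return False
    return True


def fe_site_passes(inbreeding_coeff: float, genotypes: Iterable[FEGenotype]) -> bool:
    """Drop sites with inbreeding coefficient < -0.03 or no qualifying genotype."""
    if inbreeding_coeff < -0.03:
        return False
    return any(fe_genotype_ok(gt) for gt in genotypes)
```

Note that all thresholds here are inclusive (≥), whereas the SPB depth and allele balance filters are written with strict inequalities.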