Category 170

Notes

The first 50k release prioritized individuals with whole-body MRI imaging data, enhanced baseline measurements, hospital episode statistics (HES), and/or linked primary care records. Additionally, one disease area was selected for enrichment: individuals admitted to hospital with a primary diagnosis of asthma (ICD10 J45 or J46). With the further 150k samples, the 200k release includes 1,135 parent-offspring pairs, 3,855 full-sibling pairs, including 101 trios, 27 monozygotic twin pairs and 7,461 genetically determined second-degree relationships.

Exomes were captured with the IDT xGen Exome Research Panel v1.0 including supplemental probes. The basic design targets 39 Mbp of the human genome (19,396 genes). Multiplexed samples were sequenced with dual-indexed 75x75 bp paired-end reads on the Illumina NovaSeq 6000 platform using S2 (initial 50k release) and S4 flow cells (all subsequent samples). A different IDT v1.0 oligo lot was used in the initial 50k sequencing than was used in the sequencing of all subsequent samples. Inclusion of this information as a covariate in downstream analyses is recommended. In each sample and among targeted bases, coverage exceeds 20X at 95.2% of sites on average. Complete sequencing protocols are described in detail by the summary manuscript (https://pubmed.ncbi.nlm.nih.gov/33087929/).
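The oligo-lot recommendation above can be sketched as a binary covariate. This is a minimal illustration, not an official procedure: the function name and sample identifiers are hypothetical, and how membership of the initial 50k release is determined for each sample is left open here.

```python
from typing import Dict, Iterable


def oligo_lot_indicator(
    sample_ids: Iterable[str], initial_50k_ids: Iterable[str]
) -> Dict[str, int]:
    """Return 1 for samples sequenced in the initial 50k release, else 0.

    The indicator stands in for the IDT v1.0 oligo lot, which differed
    between the initial 50k release and all subsequent samples.
    """
    initial = set(initial_50k_ids)
    return {sid: int(sid in initial) for sid in sample_ids}


# Hypothetical sample identifiers, for illustration only.
covariate = oligo_lot_indicator(["s1", "s2", "s3"], ["s2"])
print(covariate)
```

The resulting 0/1 column can then be added alongside the other covariates in a downstream association model.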

Primary and secondary analysis for the UKB 200k release was performed with an updated Functional Equivalence (FE) protocol that retains original quality scores in the CRAM files (referred to as the OQFE protocol, https://www.medrxiv.org/content/10.1101/2020.11.02.20222232v1). The OQFE protocol aligns and duplicate-marks all raw sequencing data (FASTQs) to the full GRCh38 reference in an alt-aware manner as described in the original FE manuscript (https://pubmed.ncbi.nlm.nih.gov/30279509/). The OQFE CRAMs were then called for small variants with DeepVariant to generate per-sample gVCFs. These gVCFs were aggregated and joint-genotyped with GLnexus (https://www.biorxiv.org/content/10.1101/572347v1) to create a single multi-sample VCF (pVCF) for all UKB 200k samples. PLINK files were derived directly from this pVCF. Please note: to ensure that the UKB 200k data supports a broad range of analyses, no variant- or sample-level filters were pre-applied to the pVCF or PLINK files. The publicly released pVCF is the direct output of GLnexus, from which the PLINK files are generated. The pVCF contains allele-read depths and genotype qualities for all genotypes from which variant- and sample-level QC metrics can be calculated and to which analysis-specific filters can be applied. Examples of such filtering are described in the UKB 200K preprint (https://www.medrxiv.org/content/10.1101/2020.11.02.20222232v1).
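Because the pVCF is unfiltered, QC metrics have to be derived from the per-genotype values. The sketch below shows one way to read a single sample column and compute an allele balance from the allele-read depths. The FORMAT layout `GT:DP:AD:GQ` and the example values are assumptions for illustration; check the header of the actual pVCF before relying on them.

```python
from typing import Dict, Optional


def parse_sample_field(format_keys: str, sample_value: str) -> Dict[str, str]:
    """Pair the colon-separated FORMAT keys with one sample's values."""
    return dict(zip(format_keys.split(":"), sample_value.split(":")))


def allele_balance(ad: str) -> Optional[float]:
    """Fraction of reads supporting non-reference alleles.

    `ad` is a comma-separated allele-depth string (reference first).
    Returns None when the depths are missing or sum to zero.
    """
    try:
        depths = [int(x) for x in ad.split(",")]
    except ValueError:
        return None
    total = sum(depths)
    if total == 0:
        return None
    return sum(depths[1:]) / total


# Hypothetical genotype: heterozygous, 12 reference reads and 8 alternate reads.
fields = parse_sample_field("GT:DP:AD:GQ", "0/1:20:12,8:45")
ab = allele_balance(fields["AD"])
print(fields["GT"], fields["GQ"], ab)
```

Analysis-specific filters can then be expressed as thresholds on values such as `DP`, `GQ` and the computed allele balance.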

Please note that the OQFE protocol differs from both previous UKB 50k releases, SPB and FE, which are described below for reference. All UKB 200k samples were processed from FASTQ with the OQFE docker (https://hub.docker.com/r/dnanexus/oqfe). Further details are provided in the WES FAQ at https://www.ukbiobank.ac.uk/media/cfulxh52/uk-biobank-exome-release-faq_v9-december-2020.pdf

In the original protocol, the SPB pipeline first converted all raw sequencing data to FASTQs according to Illumina NovaSeq best practices and aligned those reads to the GRCh38 reference genome with BWA-mem to generate a CRAM file for each sample. After read-duplicate marking, SNVs and indels were called with WeCall (GenomicsPLC), generating a gVCF per sample. These gVCFs were joint-genotyped using GLnexus (https://www.biorxiv.org/content/10.1101/572347v1) to create a single, unfiltered project-level VCF (pVCF). Genotype depth filters (SNV DP≥7, indel DP≥10) were applied prior to variant site filters requiring at least one variant genotype passing an allele balance filter (heterozygous SNV AB>0.15, heterozygous indel AB>0.20), resulting in a second 'filtered' pVCF.
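The SPB filtering logic can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the `Genotype` type and function names are hypothetical, the indel allele balance threshold is read as AB>0.20, and homozygous variant genotypes are assumed to pass the allele balance filter because the text only gives thresholds for heterozygotes.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Genotype:
    alt_alleles: int  # number of non-reference alleles (0, 1 or 2)
    dp: int           # read depth at the site
    ab: float         # fraction of reads supporting the alternate allele


def passes_depth(gt: Genotype, is_indel: bool) -> bool:
    """Genotype depth filter: SNV DP>=7, indel DP>=10."""
    return gt.dp >= (10 if is_indel else 7)


def passes_allele_balance(gt: Genotype, is_indel: bool) -> bool:
    """Allele balance filter for variant genotypes."""
    if gt.alt_alleles == 0:
        return False  # not a variant genotype
    if gt.alt_alleles == 2:
        return True   # assumption: homozygous variants are not AB-filtered
    return gt.ab > (0.20 if is_indel else 0.15)


def keep_site(genotypes: Iterable[Genotype], is_indel: bool) -> bool:
    """Keep a site if at least one variant genotype passes both filters."""
    return any(
        passes_depth(g, is_indel) and passes_allele_balance(g, is_indel)
        for g in genotypes
    )


# Hypothetical heterozygous genotype with depth 8 and allele balance 0.30.
het = Genotype(alt_alleles=1, dp=8, ab=0.30)
print(keep_site([het], is_indel=False), keep_site([het], is_indel=True))
```

The same genotype supports an SNV site but not an indel site, because the indel depth threshold is stricter.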

To maximize data utility and ease of use, an additional "Functionally Equivalent" (FE) pVCF was generated from FASTQs, following the primary analysis protocol described in the 2018 manuscript (PMID: 30279509). It was then subjected to GATK 3.0 variant calling and hard filtering, which removed variants with an inbreeding coefficient < -0.03 and variants without at least one variant genotype of DP≥10, GQ≥20 and, if heterozygous, AB≥0.20.
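The FE hard filter described above can be written as a single site-level predicate. The sketch below is illustrative only: the `VariantGenotype` type and function names are hypothetical, and the input is assumed to contain only non-reference genotypes.

```python
from typing import Iterable, NamedTuple


class VariantGenotype(NamedTuple):
    is_het: bool  # True for heterozygous, False for homozygous variant
    dp: int       # read depth
    gq: int       # genotype quality
    ab: float     # fraction of reads supporting the alternate allele


def genotype_supports_site(g: VariantGenotype) -> bool:
    """DP>=10, GQ>=20 and, if heterozygous, AB>=0.20."""
    if g.dp < 10 or g.gq < 20:
        return False
    return g.ab >= 0.20 if g.is_het else True


def keep_site_fe(
    inbreeding_coeff: float, variant_genotypes: Iterable[VariantGenotype]
) -> bool:
    """Drop sites with inbreeding coefficient < -0.03 or no supporting genotype."""
    if inbreeding_coeff < -0.03:
        return False
    return any(genotype_supports_site(g) for g in variant_genotypes)


# Hypothetical heterozygous genotype that meets all three thresholds.
good_het = VariantGenotype(is_het=True, dp=12, gq=30, ab=0.25)
print(keep_site_fe(0.0, [good_het]), keep_site_fe(-0.05, [good_het]))
```

A site with a well-supported genotype is still removed when its inbreeding coefficient falls below -0.03, which is why the coefficient is checked first.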

2 Sub-Categories

Category ID	Description	Items
172	Alternative exome processing	4
171	Previous exome releases	21

8 Data-Fields

Field ID	Description
32050	Release tranche
23157	Population level exome OQFE variants, pVCF format - Final exome release	‡
23158	Population level exome OQFE variants, PLINK format - Final exome release	‡
23159	Population level exome OQFE variants, BGEN format - Final exome release	‡
23141	Exome OQFE variant call files (VCFs)	‡
23142	Exome OQFE variant call file (VCF) indices	‡
23143	Exome OQFE CRAM files	‡
23144	Exome OQFE CRAM indices	‡

7 fields marked ‡ are blob/bulk.

1 Parent Category

Category ID	Description	Items
100314	Genomics	+275

7 Resources

Preview	Name	Res ID
	Exome sequencing and characterization of coding variation	572347
	Pre-processed version of genomic reference used in Whole Exome Sequencing by FE pipeline	1000
	Pre-processed version of genomic reference used in Whole Exome Sequencing by SPB pipeline	838
	Target region used by the WES capture experiment (BED file)	3803
	Target region used by the WES capture experiment (BED file) - obsolete	3801
	Target region where at least one alternative contig is derived (BED file)	3802
	Targets within the initial 50k FE data release affected by non-alt aware mapping (BED file)	1911
