Accessing Genetic Data within UK Biobank

The genetic dataset held by UK Biobank is too large to be distributed as part of a standard phenotype dataset. Instead, the ukbgene client has been developed to allow Approved researchers to download elements of it piecemeal to their local systems from secure online repositories outside the main UK Biobank showcase system. This guide explains how to use ukbgene.

To use the UK Biobank secure online repository services, a researcher must:

  1. be a validated UK Biobank researcher;
  2. be part of an Approved Application;
  3. have been issued a standard dataset together with the associated password credentials;
  4. have included the desired genetic fields in an approved Basket.

This webpage details the means by which large-scale genetic data held by UK Biobank can be accessed and manipulated once access has been approved.

  1. Preparation
  2. Notices
  3. Data Groups
  4. Authentication
  5. Fetching data
  6. Fetching relatedness
  7. File versioning
  8. Standard usages

1. Preparation

Following approval of a research application, researchers will be sent a 32-character MD5 Checksum and a 64-character password. The next step is to acquire the ukbgene utility from the Downloads section of the Showcase website.

Some of the UKB utilities are supplied pre-compiled for both MS-Windows and Linux systems. The MS-Windows utilities have the suffix .exe; however, the explanations given in this guide omit this for generality. All the utility programs are command-line tools, so Windows versions are best run from a Command Prompt window, and Linux versions directly from a Terminal.

The repository consists of a pair of mirrored systems each connected to the UK JANET network by independent links. The system names are:

To access genetic data from a remote computer, the system that the download utility is running on must be able to make HTTP (port 80) connections to at least one, and preferably both, of the repository systems. If this is not possible then researchers should contact their local IT team to resolve the issue.
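The connectivity requirement can be checked before starting a long download with a small TCP probe. The sketch below is illustrative only and is not part of the UK Biobank tooling: the function name can_connect is our own, and the caller must supply the actual repository system names.

```python
import socket


def can_connect(host: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```

Calling can_connect(name) once for each repository system shows whether at least one of them is reachable on port 80 from the machine that will run the download utility.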

2. Notices

Before downloading any data, Researchers are reminded that:

Note also that while it is possible to run multiple downloads in parallel, to provide fair usage the system will not permit a single Application to run more than 10 simultaneously.
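When scripting parallel downloads, the limit of 10 can be respected on the client side by capping the size of the worker pool. The following is a minimal sketch under our own assumptions: run_capped and MAX_PARALLEL are hypothetical names, and each job is any zero-argument callable (for example one that launches a single download).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")

MAX_PARALLEL = 10  # fair-usage limit per Application stated above


def run_capped(jobs: Iterable[Callable[[], T]], workers: int = 4) -> List[T]:
    """Run zero-argument jobs in a thread pool of at most MAX_PARALLEL workers.

    Results are returned in the order in which the jobs were supplied.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    pool_size = min(workers, MAX_PARALLEL)  # never exceed the limit
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = [pool.submit(job) for job in jobs]
        return [future.result() for future in futures]
```

The cap only applies within one process. Downloads started elsewhere under the same Application still count towards the limit.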

3. Data Groups

The genetic data in UK Biobank can be grouped into 3 types according to how malleable and aggregated it is:
  1. Static meta-data and aggregated results;
  2. Anonymised results from sample analysis;
  3. Link files mapping pseudo-ids to other participant information.

Static data includes datasets such as Marker QC (quality control) information. This data is not affected by the withdrawal of individual participants. Similarly, aggregated research output such as GWAS results is not affected by the withdrawal of participants after the computations have been made.

Anonymised data includes information such as Calls and Imputed values related to samples. When a participant withdraws consent, a small fraction of these datasets is invalidated and should not be used for analysis; however, this subset cannot be identified from the information within the file.

Link data includes information such as Fam files, which allow the pseudo-IDs assigned to individual participants to be linked to particular subsets (e.g. a column) of the sample results files. When a participant withdraws their consent from UK Biobank, their pseudo-ID information is removed from these datasets at the earliest opportunity.

A summary of the file types and groups is given in the table below:

Data type           | Group  | Filename(s)                        | How to obtain
Calls BED           | Anon   | ukb_cal_chrN_vZ.bed                | EGA or ukbgene cal
Calls BIM           | Anon   | ukb_snp_chrN_vZ.bim                | Resource 1963, ukb_snp_bim.tar
Calls FAM           | Link   | ukbA_cal_chrN_vZ_sP.fam            | ukbgene cal -m
Marker-QC           | Static | ukb_snp_qc.txt                     | Resource 1955, ukb_snp_qc.txt
Sample-QC           | Anon   | ukb_sqc_vZ.txt                     | EGA or standard fields in Category 100313
Relatedness         | Link   | ukbA_rel_sP.txt                    | ukbgene rel
Imputation BGEN     | Anon   | ukb_imp_chrN_vZ.bgen               | EGA or ukbgene imp
Imputation BGI      | Anon   | ukb_bgi_chrN_vZ.bgi                | Resource 1965, ukb_imp_bgi.tar
Imputation MAF+info | Anon   | ukb_mfi_chrN_vZ.txt                | Resource 1967, ukb_imp_mfi.tar
Imputation sample   | Link   | ukbA_imp_chrN_vZ_sP.sample         | ukbgene imp -m
Haplotypes BGEN     | Anon   | ukb_hap_chrN_vZ.bgen               | ukbgene hap
Haplotypes BGI      | Anon   | ukb_hbg_chrN_vZ.bgi                | Resource 1671, ukb_hap_bgi.tar
HLA Imputation      | Anon   | ukb_hla_vZ.txt                     | EGA or Field 22182
Intensity           | Anon   | ukb_int_chrN_vZ.bin                | EGA or ukbgene int
Confidences         | Anon   | ukb_con_chrN_vZ.txt                | EGA or ukbgene con
CNV log2r           | Anon   | ukb_l2r_chrN_vZ.txt                | EGA or ukbgene l2r
CNV baf             | Anon   | ukb_baf_chrN_vZ.txt                | EGA or ukbgene baf
SNP-posterior       | Static | ukb_snp_posterior_chrN.bin         | Resource 1817, ukb_snp_posterior.tar
SNP-posterior X BIM | Static | ukb_snp_posterior_chrX_haploid.bim | Resource 1817, ukb_snp_posterior.tar
Batch               | Static | ukb_snp_posterior.batch            | Resource 1968, ukb_snp_posterior.batch
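The filename patterns in the table can be split into their parts programmatically, which is useful when checking which version of a file is on disk. The sketch below is our own illustration, not a UK Biobank tool. It assumes, following the illustration in the File versioning section, that A is the Application ID, N the chromosome, Z the dataset version and P the participant count. It handles only names with a three-letter type code (cal, imp, rel and so on); parse_name is a hypothetical helper name.

```python
import re
from typing import Dict, Optional

_PATTERN = re.compile(
    r"^ukb(?P<application>\d*)"        # A: present only in Link file names
    r"_(?P<kind>[a-z0-9]{3})"          # three-letter type code, e.g. cal, imp, l2r
    r"(?:_chr(?P<chrom>[0-9A-Z]+))?"   # N: 1..22, X, Y, XY or MT
    r"(?:_v(?P<version>\d+))?"         # Z: dataset version
    r"(?:_s(?P<samples>\d+))?"         # P: participant count
    r"\.(?P<ext>[a-z]+)$"
)


def parse_name(filename: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a ukb*_... filename into its parts, or return None if it does not fit."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    # An empty Application ID means the file is not application-specific.
    fields["application"] = fields["application"] or None
    return fields


print(parse_name("ukb99_cal_chr1_v2_s6.fam"))
```

Comparing the version field of a Link file with that of the Anonymised file it accompanies is one simple way to detect a mismatched pair.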

4. Authentication

To access the repository it is necessary to prove one's identity to the system using a keyfile. See Resource 667 for detailed information on this.

5. Fetching data

Most genetic data has been divided into per-chromosome datasets and is stored in two (or more) distinct parts:

To download the anonymous results files using ukbgene, enter the following at the command line:
 ukbgene typename -cchrom [flags]
where typename is the type of data being retrieved, selected from the list:

typename | type of data to be retrieved      | format | link format
cal      | genotype calls                    | bed    | fam
con      | genotype confidences              | txt    | fam
int      | genotype intensities              | bin    | fam
baf      | genotype CNV b-allele frequencies | txt    | fam
l2r      | genotype CNV log2ratios           | txt    | fam
imp      | imputation                        | bgen   | sample
hap      | haplotypes                        | bgen   | sample

and chrom is the chromosome 1,2,...,22,X,Y,XY or MT. Additional person/sample-independent elements of a dataset (e.g. index files or QC) are not regarded as confidential and may be downloaded directly from the UKB Showcase Resource areas.

A full list of the available flags can be obtained by running ukbgene without any parameters. Particularly important are:

5.1 Examples

Hence, to fetch the Anonymous genotype calls for Chromosome 17 enter
 ukbgene cal -c17
which will produce a bed-format file.

To fetch the Link file associated with that Anonymous dataset add the -m parameter to the command line, hence

 ukbgene cal -c17 -m
will fetch the corresponding fam-format file.
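For scripted use, the command lines shown above can be assembled and validated before they are run. The sketch below is a hypothetical helper (ukbgene_args is our own name) that uses only the typenames, chromosome labels and the -c and -m parameters described in this guide.

```python
from typing import List

# Values taken from the typename table and chromosome list in this section.
TYPENAMES = {"cal", "con", "int", "baf", "l2r", "imp", "hap"}
CHROMS = {str(n) for n in range(1, 23)} | {"X", "Y", "XY", "MT"}


def ukbgene_args(typename: str, chrom: str, link: bool = False) -> List[str]:
    """Build the argument list for one ukbgene download.

    Set link=True to request the Link file (-m) instead of the Anonymous file.
    """
    if typename not in TYPENAMES:
        raise ValueError(f"unknown typename: {typename!r}")
    if chrom not in CHROMS:
        raise ValueError(f"unknown chromosome: {chrom!r}")
    args = ["ukbgene", typename, f"-c{chrom}"]
    if link:
        args.append("-m")
    return args


print(" ".join(ukbgene_args("cal", "17")))             # ukbgene cal -c17
print(" ".join(ukbgene_args("cal", "17", link=True)))  # ukbgene cal -c17 -m
```

The returned list can be passed to subprocess.run(args, check=True), assuming ukbgene is on the search path and authentication has been set up as described in the Authentication section.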

5.2 Duplication

Note that many of the Link files have the same contents for different anonymous files and hence only a single instance needs to be downloaded. Specifically:

See Standard Usages for help on working with this.

6. Fetching relatedness

The genotype information allows one to infer which participants within UK Biobank are related, and how. To retrieve this information run ukbgene with the "rel" parameter thus:
 ukbgene rel
This will produce a 5-column file giving a pairwise listing of related individual pseudo-IDs accompanied by the values:
  1. HetHet : the fraction of markers for which the pair both have a heterozygous genotype;
  2. IBS0 : the fraction of markers for which the pair shares zero alleles;
  3. Kinship : estimate of the kinship coefficient for the pair based on the set of markers used in the kinship inference.
In any pair where one or more of the participants has withdrawn, both pseudo-IDs are replaced by negative numbers.
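A reader for this file might look like the following sketch. It rests on assumptions that should be checked against the actual download: that the file is whitespace-delimited, that the columns appear in the order pseudo-ID, pseudo-ID, HetHet, IBS0, Kinship, and that any header line starts with a non-numeric field. RelatedPair and read_related_pairs are our own names.

```python
from typing import Iterable, List, NamedTuple


class RelatedPair(NamedTuple):
    id1: int
    id2: int
    het_het: float
    ibs0: float
    kinship: float


def read_related_pairs(lines: Iterable[str]) -> List[RelatedPair]:
    """Parse relatedness rows, dropping pairs that involve withdrawn participants."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) != 5:
            continue  # blank or malformed line
        try:
            id1, id2 = int(parts[0]), int(parts[1])
        except ValueError:
            continue  # header line, if present
        if id1 < 0 or id2 < 0:
            continue  # negative IDs mark a pair with a withdrawn participant
        pairs.append(RelatedPair(id1, id2, *map(float, parts[2:])))
    return pairs
```

Passing an open file object, for example read_related_pairs(open(path)), works because the function only iterates over lines.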

7. File versioning

UK Biobank is a large study involving over 500,000 members of the general UK population. As a result of its size and composition, it regularly encounters issues which are rare or absent in more tightly focussed studies involving only a few hundreds or thousands of participants. In particular, in most years several participants decide to withdraw their consent to being in the study completely, which means that UK Biobank and all Researchers using it have a legal duty (as detailed in the MTA signed when an Application is approved) to desist from doing any analysis work on individual-level data concerning them.

When a participant withdraws consent, UK Biobank immediately flags this in the central databases. The Link files are dynamically generated for each Researcher at the time of download and respond to this change immediately, substituting negative dummy-IDs for the pseudo-ID of any withdrawn participants. Using these new Link files for analysis work instantly removes any connection to withdrawn elements in the Anonymised data.

At manageable intervals UK Biobank will regenerate the Anonymised files to purge any accumulating unusable entries - at which point Researchers will also need new Link files due to a change in the number of rows/columns in the Anonymised files. Notices will be sent to all Researchers registered for genetic data in advance of such a purge being made.

Note that there will generally be a lower number of participants present in the imputation-derived Anonymous and Link files compared to the genotype-related ones.

7.1 File versioning illustration

To illustrate how the file versioning is performed, consider an initial Fam (Link) file which works with a Confidence (Anonymised) file, the latter being chosen for clarity because of its plain-text format. To simplify further, imagine the Version 2 dataset contained only 6 participants and 3 SNPs on Chromosome 1, in which case the initial release (for Application 99) might be:

ukb99_cal_chr1_v2_s6.fam:
3298462 3298462 0 0 2 Batch_b001
8029816 8029816 0 0 1 Batch_b007
2874520 2874520 0 0 1 UKBiLEVEAX_b11
9023752 9023752 0 0 1 UKBiLEVEAX_b11
3679861 3679861 0 0 2 Batch_b032
7397822 7397822 0 0 2 Batch_b024

ukb_con_chr1_v2.txt:
0.0011 0.0012 0.0013 0.0014 0.0015 0.0016
0.0021 0.0022 0.0023 0.0024 0.0025 0.0026
0.0031 0.0032 0.0033 0.0034 0.0035 0.0036

If the participant with pseudo-ID 3679861 withdraws then the Fam file contents and name ("s6" becoming "s5") would change to:

ukb99_cal_chr1_v2_s5.fam:
3298462 3298462 0 0 2 Batch_b001
8029816 8029816 0 0 1 Batch_b007
2874520 2874520 0 0 1 UKBiLEVEAX_b11
9023752 9023752 0 0 1 UKBiLEVEAX_b11
-1 -1 0 0 0 redacted
7397822 7397822 0 0 2 Batch_b024

ukb_con_chr1_v2.txt:
0.0011 0.0012 0.0013 0.0014 0.0015 0.0016
0.0021 0.0022 0.0023 0.0024 0.0025 0.0026
0.0031 0.0032 0.0033 0.0034 0.0035 0.0036

If the participant with pseudo-ID 2874520 also withdraws then the Fam file would change to:

ukb99_cal_chr1_v2_s4.fam:
3298462 3298462 0 0 2 Batch_b001
8029816 8029816 0 0 1 Batch_b007
-1 -1 0 0 0 redacted
9023752 9023752 0 0 1 UKBiLEVEAX_b11
-2 -2 0 0 0 redacted
7397822 7397822 0 0 2 Batch_b024

ukb_con_chr1_v2.txt:
0.0011 0.0012 0.0013 0.0014 0.0015 0.0016
0.0021 0.0022 0.0023 0.0024 0.0025 0.0026
0.0031 0.0032 0.0033 0.0034 0.0035 0.0036

If the Anonymised confidences file is then purged and regenerated (moving from Version 2 to 3), both the files and their names would alter to become:

ukb99_cal_chr1_v3_s4.fam:
3298462 3298462 0 0 2 Batch_b001
8029816 8029816 0 0 1 Batch_b007
9023752 9023752 0 0 1 UKBiLEVEAX_b11
7397822 7397822 0 0 2 Batch_b024

ukb_con_chr1_v3.txt:
0.0011 0.0012 0.0014 0.0016
0.0021 0.0022 0.0024 0.0026
0.0031 0.0032 0.0034 0.0036

If the participant with pseudo-ID 8029816 subsequently withdraws then the Fam file would change to:

ukb99_cal_chr1_v3_s3.fam:
3298462 3298462 0 0 2 Batch_b001
-1 -1 0 0 0 redacted
9023752 9023752 0 0 1 UKBiLEVEAX_b11
7397822 7397822 0 0 2 Batch_b024

ukb_con_chr1_v3.txt:
0.0011 0.0012 0.0014 0.0016
0.0021 0.0022 0.0024 0.0026
0.0031 0.0032 0.0034 0.0036
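The way a Link file with redacted rows masks columns of the Anonymised file can be sketched in a few lines. The example below is our own illustration using the toy data from this section at the point where two participants have withdrawn (Version 2, s4). It assumes that row i of the Fam file corresponds to column i of the confidences file, as in the illustration; kept_indices and select_columns are hypothetical helper names.

```python
from typing import Iterable, List


def kept_indices(fam_lines: Iterable[str]) -> List[int]:
    """Positions of Fam rows whose pseudo-ID (first column) is non-negative."""
    keep = []
    index = 0
    for line in fam_lines:
        parts = line.split()
        if not parts:
            continue  # ignore blank lines without advancing the row index
        if int(parts[0]) >= 0:
            keep.append(index)
        index += 1
    return keep


def select_columns(data_lines: Iterable[str], keep: List[int]) -> List[str]:
    """Keep only the listed columns of each whitespace-delimited row."""
    rows = []
    for line in data_lines:
        parts = line.split()
        if parts:
            rows.append(" ".join(parts[i] for i in keep))
    return rows


fam_v2_s4 = """\
3298462 3298462 0 0 2 Batch_b001
8029816 8029816 0 0 1 Batch_b007
-1 -1 0 0 0 redacted
9023752 9023752 0 0 1 UKBiLEVEAX_b11
-2 -2 0 0 0 redacted
7397822 7397822 0 0 2 Batch_b024
"""
con_v2 = """\
0.0011 0.0012 0.0013 0.0014 0.0015 0.0016
0.0021 0.0022 0.0023 0.0024 0.0025 0.0026
0.0031 0.0032 0.0033 0.0034 0.0035 0.0036
"""

keep = kept_indices(fam_v2_s4.splitlines())
purged = select_columns(con_v2.splitlines(), keep)
print(keep)  # [0, 1, 3, 5]
for row in purged:
    print(row)
```

The rows printed match the Version 3 confidences file shown above, which is the sense in which using an up-to-date Link file removes withdrawn participants from an analysis before the Anonymised files are regenerated.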

8. Standard usages

This section details some commonly encountered conundrums with analysis pipelines and suggests workarounds for them.

Common Link files
Many of the Anonymous files share the same Link files. It is possible to download a separate Link file for every Anonymous file; however, this process would have to be repeated whenever a participant withdrew. There are various ways of making this more efficient, starting with downloading only a pair of Link files (for instance ukbA_cal_chr1_vZ_sP.fam and ukbA_imp_chr1_vZ_sP.sample) thus:

Using any of these methods only the original two Link files need to be re-downloaded when a participant withdraws.
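One possible way to reuse a single downloaded Link file under several names is to create symbolic links to it. This is a sketch of our own rather than a documented UK Biobank method. The alias naming template and the function name point_aliases are hypothetical, and the caller decides which chromosomes genuinely share the Link file. Symbolic links may need extra privileges on MS-Windows.

```python
from pathlib import Path
from typing import Iterable, List


def point_aliases(master: Path, alias_template: str, chroms: Iterable[str]) -> List[Path]:
    """Create or refresh one symlink per chromosome, each pointing at master.

    alias_template must contain "{chrom}", e.g. "cal_chr{chrom}.fam".
    """
    if not master.is_file():
        raise FileNotFoundError(master)
    created = []
    for chrom in chroms:
        alias = master.parent / alias_template.format(chrom=chrom)
        if alias.is_symlink():
            alias.unlink()  # drop the alias left by an earlier download
        elif alias.exists():
            raise FileExistsError(f"refusing to replace real file: {alias}")
        alias.symlink_to(master.name)  # relative link within the same directory
        created.append(alias)
    return created
```

After a participant withdraws, re-downloading the Link file and calling point_aliases again with the new file repoints every alias, so analysis scripts that use the alias names need no changes.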

Shared datasets
Because of the large size of the Anonymous files, UK Biobank has agreed that, with permission, Researchers from multiple approved Applications within a unit may share a common copy of them. However, this can create problems, as some analysis programs assume specific names and/or locations for their input files and it is likely that there will be more than one set of Link files in use simultaneously. Possible remedies include:

