Accessing Genetic Data within UK Biobank

The genetic datasets held by UK Biobank are too large to be distributed as part of a standard phenotype dataset. Instead, the gfetch client has been developed to allow Approved researchers to download elements of them piecemeal to their local systems from secure online repositories outside the main UK Biobank showcase system. This guide explains how to use gfetch.

To use the UK Biobank secure online repository services a researcher must

  1. be a validated UK Biobank researcher;
  2. be part of an Approved Application;
  3. have been issued a standard dataset together with the associated password credentials;
  4. have included the desired genetic fields in an approved Basket.

This webpage details the means by which large-scale genetic data held by UK Biobank can be accessed and manipulated once access has been approved.

  1. Preparation
  2. Notices
  3. Data Groups
  4. Authentication
  5. Fetching Data
  6. Fetching relatedness
  7. File versioning
  8. Standard usages

1. Preparation

Following approval of a research application, researchers will be sent a 32-character MD5 Checksum and a 64-character password. The next step is to acquire the gfetch utility from the Downloads section of the Showcase website.
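As a minimal sketch, assuming the MD5 Checksum refers to a file that has already been downloaded to the local system, the comparison could be made in Python as follows. The function names are illustrative and are not part of any UK Biobank utility.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that very large files are never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare the file's MD5 with the 32-character checksum supplied.

    Surrounding whitespace and letter case in `expected` are ignored.
    """
    return md5_of_file(path) == expected.strip().lower()
```

A mismatch would normally suggest an incomplete or corrupted download rather than a problem with the credentials.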

Some of the UKB utilities are supplied pre-compiled for both MS-Windows and Linux systems. The MS-Windows utilities have the suffix .exe; however, the explanations given in this guide omit this for generality. All the utility programs are command-line, so Windows versions are best run from a Command Prompt window, and Linux versions are best run directly from a Terminal.

The repository consists of a pair of mirrored systems each connected to the UK JANET network by independent links. The system names are:

To access genetic data from a remote computer the system that the download utility is running on must be able to make http (Port 80) connections to at least one, and preferably both, of the repository systems. If this is not possible then researchers should contact their local IT team to resolve the issue.
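Before involving a local IT team it may help to confirm whether a Port 80 connection can be opened at all. The sketch below uses only the Python standard library; the function name is illustrative, and the host name passed to it must be one of the repository system names, which are not reproduced here.

```python
import socket


def can_connect(host, port=80, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and attempts the connection;
        # any failure (DNS, refusal, timeout) surfaces as an OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling can_connect with each repository system name in turn shows whether at least one, and preferably both, can be reached. A successful TCP connection does not guarantee that an intermediate proxy will pass the traffic, so this is a first check only.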

2. Notices

Before downloading any data, Researchers are reminded that:

3. Data Groups

The genetic data in UK Biobank can be grouped into 3 types according to how malleable and aggregated it is:
  1. Static meta-data and aggregated results;
  2. Anonymised results from sample analysis;
  3. Link files mapping pseudo-ids to other participant information.
Static data includes such datasets as Marker QC (quality control) information. This data is not affected by the withdrawal of individual participants. Similarly, aggregated research output such as GWAS results is not affected by the withdrawal of participants after the computations have been made.

Anonymised data includes such information as Calls and Imputed values related to samples. When a participant withdraws consent a small fraction of these datasets is invalidated and should not be used for analysis; however, this subset cannot be identified from the information within the file.

Link data includes such information as Fam files which allow the pseudo-IDs assigned to individual participants to be linked to particular subsets (e.g. a column) of the sample results files. When a participant withdraws their consent from UK Biobank their pseudo-ID information is removed from these datasets at the earliest opportunity.

4. Authentication

To access the repository it is necessary to prove one's identity to the system using a keyfile. See Resource 667 for detailed information on this.

5. Fetching data

Most genetic data has been divided into per-chromosome datasets for convenience of downloading and use. Some is stored as paired files with one containing anonymised data and the other the Application-specific identifiers (for instance genotype calls), while in others the actual data is customised dynamically. The data for some chromosomes/formats is sufficiently large that it has been broken up into a number of separate files for a chromosome and these are indexed by a 'block' counter, which begins at 0 for each chromosome.

To download the results files using gfetch, enter the following at the command line:

 gfetch  field_id -cchrom [flags]
where field_id is the ID of the field as given in the Showcase and chrom is the chromosome 1,2,...,22,X,Y,XY or MT. Additional person/sample-independent elements of a dataset (e.g. index files or QC) are not regarded as confidential and may be downloaded directly from the UKB Showcase Resource areas.

A full list of the available flags can be obtained by running gfetch without any parameters. Particularly important are:

Downloaded files will have names which reflect the parameters used to acquire them (and thus their contents), generally
 ukbF_cC_bB_vV.typ
where F is the field ID, C the chromosome, B the block and V the version, with the names of Link files also containing a final "_s" followed by the number of participants listed in them.
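When many files have been downloaded it can be useful to recover the parameters from a file name. The following Python sketch parses the general pattern given above, treating the "_s" element as optional. The regular expression and the names GeneticFileName and parse_name are assumptions made for illustration and may not cover every file that gfetch produces.

```python
import re
from typing import NamedTuple, Optional

# ukbF_cC_bB_vV.typ, with an optional _sS element for Link files.
_NAME = re.compile(
    r"^ukb(?P<field>\d+)_c(?P<chrom>[0-9A-Za-z]+)_b(?P<block>\d+)"
    r"_v(?P<version>\d+)(?:_s(?P<participants>\d+))?\.(?P<typ>\w+)$"
)


class GeneticFileName(NamedTuple):
    field: int
    chrom: str                   # kept as text: may be X, Y, XY or MT
    block: int
    version: int
    participants: Optional[int]  # present only in Link file names
    typ: str


def parse_name(name):
    """Split a downloaded file name into its component parameters."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised file name: {name!r}")
    count = match.group("participants")
    return GeneticFileName(
        field=int(match.group("field")),
        chrom=match.group("chrom"),
        block=int(match.group("block")),
        version=int(match.group("version")),
        participants=int(count) if count is not None else None,
        typ=match.group("typ"),
    )
```

Comparing the version element of a Link file with that of its Anonymised counterpart is a simple way to detect a mismatched pair.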

5.1 Examples

To fetch the Anonymous genotype calls (Field 22418) for Chromosome 17 enter
 gfetch 22418 -c17
which will produce a bed-format file. To fetch the Link file associated with that Anonymous dataset add the -m parameter to the command line, hence
 gfetch 22418 -c17 -m
will fetch the corresponding fam-format file.

To fetch the 3rd block (block counter 2, since counting begins at 0) of the OQFE pVCF exome data (Field 23156) for Chromosome 9 enter

 gfetch 23156 -c9 -b2
If data has been divided into blocks then a Resource attached to the relevant field (837 for field 23156) will indicate how many blocks there are for each chromosome.
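For pipelines that fetch many chromosomes or blocks, the command lines shown above can be assembled programmatically. The Python sketch below uses only the -c, -b and -m parameters described in this guide; the helper name gfetch_args is illustrative, and any authentication parameters (see Section 4) are deliberately left out.

```python
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "XY", "MT"]


def gfetch_args(field_id, chrom, block=None, link=False):
    """Build the argument list for one gfetch invocation."""
    chrom = str(chrom)
    if chrom not in CHROMOSOMES:
        raise ValueError(f"unknown chromosome: {chrom!r}")
    args = ["gfetch", str(field_id), f"-c{chrom}"]
    if block is not None:
        # Block counters begin at 0, so the 3rd block is block=2.
        args.append(f"-b{block}")
    if link:
        # -m requests the Link file rather than the Anonymised data.
        args.append("-m")
    return args
```

Each list can be passed to subprocess.run from the Python standard library, assuming gfetch is on the search path and the required keyfile is in place.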

5.2 Duplication

Note that many of the Link files have the same contents for different anonymous files and hence only a single instance needs to be downloaded. Specifically: See Standard Usages for help on working with this.

6. Fetching relatedness

The genotype information allows one to infer which participants within UK Biobank are related, and how closely. To retrieve this information run gfetch with the "rel" parameter thus:
 gfetch rel
This will produce a 5 column file giving a pairwise listing of related individual pseudo-IDs accompanied by the values:
  1. HetHet : the fraction of markers for which the pair both have a heterozygous genotype;
  2. IBS0 : the fraction of markers for which the pair shares zero alleles;
  3. Kinship : estimate of the kinship coefficient for the pair based on the set of markers used in the kinship inference.
In any pair where one or more of the participants has withdrawn, both pseudo-IDs are replaced by negative numbers.
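Pairs involving withdrawn participants can therefore be recognised by their negative IDs and skipped. The Python sketch below assumes the file is whitespace-separated with the two pseudo-ID columns first, followed by HetHet, IBS0 and Kinship in the order listed above, and that any header line is non-numeric; these layout details are assumptions and should be checked against the downloaded file.

```python
def read_related_pairs(lines):
    """Return (id1, id2, hethet, ibs0, kinship) tuples for usable pairs."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) != 5:
            continue  # blank or unexpected line
        try:
            id1, id2 = int(parts[0]), int(parts[1])
            hethet, ibs0, kinship = (float(value) for value in parts[2:])
        except ValueError:
            continue  # header or otherwise non-numeric line
        if id1 < 0 or id2 < 0:
            continue  # pair involves a withdrawn participant
        pairs.append((id1, id2, hethet, ibs0, kinship))
    return pairs
```

The function accepts any iterable of text lines, so an open file handle can be passed to it directly.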

7. File versioning

UK Biobank is a large study involving over 500,000 members of the general UK population. As a result of its size and composition it regularly encounters issues which are rare or absent in more tightly focussed studies involving only a few hundreds or thousands of participants. In particular, in most years a small number of participants decide to completely withdraw their consent to being in the study which means that UK Biobank and all Researchers using their data have a legal duty (as detailed in the MTA signed when an Application is approved) to desist from doing any analysis work on individual-level data concerning them.

When a participant withdraws consent, UK Biobank immediately flags this in the central databases. The Link files are dynamically generated for each Researcher at the time of download and respond to this change immediately, substituting negative dummy-IDs for the pseudo-ID of any withdrawn participants. Using these new Link files for analysis work instantly removes any connection to withdrawn elements in the Anonymised data.

At manageable intervals UK Biobank will regenerate the Anonymised files to purge any accumulating unusable entries - at which point Researchers will also need new Link files due to a change in the number of rows/columns in the Anonymised files. Notices will be sent to all Researchers registered for genetic data in advance of such a purge being made.

Note that there will generally be different numbers of participants present in the various types of genetic files due to different samples being used to generate them and subsequent processing and quality control.

7.1 File versioning illustration

To illustrate how the file versioning is performed consider an initial Fam (Link) file which works with a Confidence (Anonymised) file, choosing the latter for clarity because of the plain-text format. To simplify further, imagine that the Version 2 dataset contained only 6 participants and 3 SNPs on Chromosome 1, in which case the initial release (for Application 99) might be:

ukb22419_c1_b0_v2_s6.fam:
3298462 3298462 0 0 2 Batch_b001
8029816 8029816 0 0 1 Batch_b007
2874520 2874520 0 0 1 UKBiLEVEAX_b11
9023752 9023752 0 0 1 UKBiLEVEAX_b11
3679861 3679861 0 0 2 Batch_b032
7397822 3679861 0 0 2 Batch_b024

ukb22419_c1_b0_v2.txt:
0.0011 0.0012 0.0013 0.0014 0.0015 0.0016
0.0021 0.0022 0.0023 0.0024 0.0025 0.0026
0.0031 0.0032 0.0033 0.0034 0.0035 0.0036

If the participant with pseudo-ID 3679861 withdraws then the Fam file contents and name ("s6" becoming "s5") would change to:

ukb22419_c1_b0_v2_s5.fam:
3298462 3298462 0 0 2 Batch_b001
8029816 8029816 0 0 1 Batch_b007
2874520 2874520 0 0 1 UKBiLEVEAX_b11
9023752 9023752 0 0 1 UKBiLEVEAX_b11
-1 -1 0 0 0 redacted
7397822 3679861 0 0 2 Batch_b024

ukb22419_c1_b0_v2.txt:
0.0011 0.0012 0.0013 0.0014 0.0015 0.0016
0.0021 0.0022 0.0023 0.0024 0.0025 0.0026
0.0031 0.0032 0.0033 0.0034 0.0035 0.0036

If the participant with pseudo-ID 2874520 also withdraws then the Fam file would change to:

ukb22419_c1_b0_v2_s4.fam:
3298462 3298462 0 0 2 Batch_b001
8029816 8029816 0 0 1 Batch_b007
-1 -1 0 0 0 redacted
9023752 9023752 0 0 1 UKBiLEVEAX_b11
-2 -2 0 0 0 redacted
7397822 3679861 0 0 2 Batch_b024

ukb22419_c1_b0_v2.txt:
0.0011 0.0012 0.0013 0.0014 0.0015 0.0016
0.0021 0.0022 0.0023 0.0024 0.0025 0.0026
0.0031 0.0032 0.0033 0.0034 0.0035 0.0036

If the Anonymised confidences file is then purged and regenerated (moving from Version 2 to 3), both the files and their names would alter to become:

ukb22419_c1_b0_v3_s4.fam:
3298462 3298462 0 0 2 Batch_b001
8029816 8029816 0 0 1 Batch_b007
9023752 9023752 0 0 1 UKBiLEVEAX_b11
7397822 3679861 0 0 2 Batch_b024

ukb22419_c1_b0_v3.txt:
0.0011 0.0012 0.0014 0.0016
0.0021 0.0022 0.0024 0.0026
0.0031 0.0032 0.0034 0.0036

If the participant with pseudo-ID 8029816 subsequently withdraws then the Fam file would change to:

ukb22419_c1_b0_v3_s3.fam:
3298462 3298462 0 0 2 Batch_b001
-1 -1 0 0 0 redacted
9023752 9023752 0 0 1 UKBiLEVEAX_b11
7397822 3679861 0 0 2 Batch_b024

ukb22419_c1_b0_v3.txt:
0.0011 0.0012 0.0014 0.0016
0.0021 0.0022 0.0024 0.0026
0.0031 0.0032 0.0034 0.0036
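The illustration also suggests how an analysis can exclude withdrawn participants between purges: each row of the Fam file corresponds to one column of the Confidence file, so columns whose Fam row carries a negative dummy-ID are dropped. The Python sketch below applies this to plain-text files of the kind shown above. The function names are illustrative, and the approach does not apply as written to binary formats such as bed.

```python
def active_positions(fam_lines):
    """Return the 0-based row positions whose first ID is not negative."""
    rows = [line for line in fam_lines if line.strip()]
    keep = []
    for position, line in enumerate(rows):
        first_id = int(line.split()[0])
        if first_id >= 0:  # negative dummy-IDs mark withdrawn participants
            keep.append(position)
    return keep


def drop_withdrawn_columns(confidence_lines, keep):
    """Keep only the columns listed in `keep` from each non-blank line."""
    result = []
    for line in confidence_lines:
        values = line.split()
        if values:
            result.append(" ".join(values[i] for i in keep))
    return result
```

Applied to the "s4" Fam file and the Version 2 Confidence file above, the retained columns are the same as those in the regenerated Version 3 Confidence file.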

8. Standard usages

This section details some commonly encountered conundrums with analysis pipelines and suggests workarounds for them.

Common Link files
Many of the Anonymous files share the same Link files. It is possible to download a separate Link file for every Anonymous file; however, this process would have to be repeated whenever a participant withdrew. There are various ways of making this process more efficient, starting with downloading only a small number of Link files (for instance ukb22418_c1_b0_vZ_sP.fam for the genotype data) thus:

Using any of these methods only the original two Link files need to be re-downloaded when a participant withdraws.

Shared datasets
Because of the large size of the Anonymous files UK Biobank has agreed that, with permission, Researchers from multiple approved Applications within a unit may share a common copy of them. However this can create problems as some analysis programs assume specific names and/or locations for their input files and it is likely that there will be more than one set of Link files in use simultaneously. Possible remedies include:

Generally, however, the use of shared datasets is discouraged and will be deprecated as UKB develops its own online access platforms.