This resource has been superseded by Resources 668 and 669.

Accessing Genetic Data within UK Biobank

The genetic dataset held by UK Biobank is too large to be distributed as part of a standard phenotype dataset. Instead, the ukbgene client has been developed to allow Approved researchers to download elements of it piecemeal to their local systems from secure online repositories outside the main UK Biobank showcase system. This guide explains how to use ukbgene.

To use the UK Biobank secure online repository services a researcher must

be a validated UK Biobank researcher;
be part of an Approved Application;
have been issued a standard dataset together with the associated password credentials;
have included the desired genetic fields in an approved Basket.

This webpage details the means by which large-scale genetic data held by UK Biobank can be accessed and manipulated once access has been approved.

Preparation
Notices
Data Groups
Authentication
Fetching data
Fetching relatedness
File versioning
Standard usages

1. Preparation

Following approval of a research application, researchers will be sent a 32-character MD5 Checksum and a 64-character password. The next step is to acquire the ukbgene utility from the Downloads section of the Showcase website.

Some of the UKB utilities are supplied pre-compiled for both MS-Windows and Linux systems. The MS-Windows utilities have the suffix .exe; the explanations given in this guide omit it for generality. All the utility programs are command-line tools, so Windows versions are best run from a Command Prompt window and Linux versions directly from a Terminal.

The repository consists of a pair of mirrored systems each connected to the UK JANET network by independent links. The system names are:

biota.ndph.ox.ac.uk
chest.ndph.ox.ac.uk

To access genetic data, the computer running the download utility must be able to make HTTP (port 80) connections to at least one, and preferably both, of the repository systems. If this is not possible, researchers should contact their local IT team to resolve the issue.
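A quick connectivity check can be scripted before attempting a download. The sketch below assumes that curl is installed; the function names probe and count_reachable are illustrative and are not part of any UK Biobank utility. Any HTTP response, including an error page, is treated as evidence that a port 80 connection can be made.

```shell
# Repository systems named in this guide.
MIRRORS="biota.ndph.ox.ac.uk chest.ndph.ox.ac.uk"

# probe HOST: succeed if an HTTP (port 80) connection to HOST can be made.
# Assumption: curl is available; the response body is discarded.
probe() {
    curl --silent --output /dev/null --max-time 10 "http://$1/"
}

# count_reachable: report each mirror on stderr and print how many answered.
count_reachable() {
    n=0
    for host in $MIRRORS; do
        if probe "$host"; then
            echo "reachable:   $host" >&2
            n=$((n + 1))
        else
            echo "unreachable: $host" >&2
        fi
    done
    echo "$n"
}
```

Running count_reachable prints the number of mirrors that answered. A result of 0 suggests the local firewall or proxy settings need attention before ukbgene can be used.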

2. Notices

Before downloading any data, Researchers are reminded that:

All access attempts, whether successful or denied, are logged and monitored with the IP address recorded.
UK Biobank does not back up user-generated data, and researchers are responsible for ensuring the safety and integrity of the information they produce.
Any data exported outside of the UK Biobank systems must be protected by strong (e.g. AES256) encryption when not actively in use.
The volume of data available in the repository is subject to gradual change and may not match the list supplied when an application is processed. These changes are due to participant withdrawals (which require the removal of data) and the incremental addition of new data for continuing participants.

Note also that while it is possible to run multiple downloads in parallel, to provide fair usage the system will not permit a single Application to run more than 10 simultaneously.
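One way to stay under the limit is to let xargs cap the number of concurrent processes. The following is a sketch only: it relies on the -P and -I options found in GNU and BSD xargs, the function name fetch_parallel and the value of MAX_JOBS are illustrative choices, and by default it prints the ukbgene commands rather than running them.

```shell
# Command run for each chromosome. The default only prints the command
# (dry run); set UKBGENE=ukbgene in the environment to download for real.
UKBGENE="${UKBGENE:-echo ukbgene}"

# Maximum simultaneous downloads. This guide states a limit of 10 per
# Application; 4 is an arbitrary, more conservative choice.
MAX_JOBS=4

# fetch_parallel TYPENAME CHROM...: start one download per chromosome,
# never more than MAX_JOBS at a time.
fetch_parallel() {
    typename=$1
    shift
    printf '%s\n' "$@" | xargs -P "$MAX_JOBS" -I CHR $UKBGENE "$typename" -cCHR
}

fetch_parallel cal 20 21 22
```

Because the jobs run concurrently, the order in which they finish (and in which the dry run prints its lines) is not guaranteed.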

3. Data Groups

The genetic data in UK Biobank can be grouped into 3 types according to how malleable and aggregated it is:

Static meta-data and aggregated results;
Anonymised results from sample analysis;
Link files mapping pseudo-ids to other participant information.

Static data includes such datasets as Marker QC (quality control) information. This data is not affected by the withdrawal of individual participants. Similarly, aggregated research output such as GWAS results is not affected by the withdrawal of participants after the computations have been made.

Anonymised data includes such information as Calls and Imputed values related to samples. When a participant withdraws consent, a small fraction of these datasets is invalidated and should not be used for analysis. This subset cannot, however, be identified from the information within the file.

Link data includes such information as Fam files which allow the pseudo-IDs assigned to individual participants to be linked to particular subsets (e.g. a column) of the sample results files. When a participant withdraws their consent from UK Biobank their pseudo-ID information is removed from these datasets at the earliest opportunity.

A summary of the file types and groups is given in the table below:

Data type	Group	Filename(s)	How to obtain
Calls BED	Anon	ukb_cal_chrN_vZ.bed	ukbgene cal
Calls BIM	Anon	ukb_snp_chrN_vZ.bim	Resource 1963, ukb_snp_bim.tar
Calls FAM	Link	ukbA_cal_chrN_vZ_sP.fam	ukbgene cal -m
Marker-QC	Static	ukb_snp_qc.txt	Resource 1955, ukb_snp_qc.txt
Sample-QC	Anon	ukb_sqc_vZ.txt	standard fields in Category 100313
Relatedness	Link	ukbA_rel_sP.txt	ukbgene rel
Imputation BGEN	Anon	ukb_imp_chrN_vZ.bgen	ukbgene imp
Imputation BGI	Anon	ukb_bgi_chrN_vZ.bgi	Resource 1965, ukb_imp_bgi.tar
Imputation MAF+info	Anon	ukb_mfi_chrN_vZ.txt	Resource 1967, ukb_imp_mfi.tar
Imputation sample	Link	ukbA_imp_chrN_vZ_sP.sample	ukbgene imp -m
Haplotypes BGEN	Anon	ukb_hap_chrN_vZ.bgen	ukbgene hap
Haplotypes BGI	Anon	ukb_hbg_chrN_vZ.bgi	Resource 1671, ukb_hap_bgi.tar
HLA Imputation	Anon	ukb_hla_vZ.txt	Field 22182
Intensity	Anon	ukb_int_chrN_vZ.bin	ukbgene int
Confidences	Anon	ukb_con_chrN_vZ.txt	ukbgene con
CNV log2r	Anon	ukb_l2r_chrN_vZ.txt	ukbgene l2r
CNV baf	Anon	ukb_baf_chrN_vZ.txt	ukbgene baf
SNP-posterior	Static	ukb_snp_posterior_chrN.bin	Resource 1817, ukb_snp_posterior.tar
SNP-posterior X BIM	Static	ukb_snp_posterior_chrX_haploid.bim	Resource 1817, ukb_snp_posterior.tar
Batch	Static	ukb_snp_posterior.batch	Resource 1968, ukb_snp_posterior.batch

References to ukbgene refer to a utility program that will shortly be made available on the Showcase website.
In file names:
- A = application ID (integer);
- N = chromosome = 1,...,22,X,Y,XY,MT;
- Z = version of dataset (currently 2 for all files);
- P = number of linked samples (i.e. currently consenting participants) in dataset.
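The naming scheme can be expressed as a few small shell helpers, which is convenient when scripts need to predict or parse file names. This is a minimal sketch; the function names link_name, anon_name and samples_in_name are hypothetical, and the values used (Application 99, 6 samples) are taken from the illustration later in this guide.

```shell
# link_name APP TYPE CHROM VERSION SAMPLES EXT
# Build a Link file name of the form ukbA_TYPE_chrN_vZ_sP.EXT.
link_name() {
    printf 'ukb%s_%s_chr%s_v%s_s%s.%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# anon_name TYPE CHROM VERSION EXT
# Build an Anonymised file name of the form ukb_TYPE_chrN_vZ.EXT.
anon_name() {
    printf 'ukb_%s_chr%s_v%s.%s\n' "$1" "$2" "$3" "$4"
}

# samples_in_name FILE: print P, the sample count encoded in a Link file name.
samples_in_name() {
    printf '%s\n' "$1" | sed -n 's/.*_s\([0-9][0-9]*\)\.[a-z]*$/\1/p'
}

link_name 99 cal 1 2 6 fam        # ukb99_cal_chr1_v2_s6.fam
anon_name imp 22 2 bgen           # ukb_imp_chr22_v2.bgen
samples_in_name ukb99_rel_s6.txt  # 6
```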

4. Authentication

To access the repository it is necessary to prove one's identity to the system using a keyfile. See Resource 667 for detailed information on this.

5. Fetching data

Most genetic data has been divided into per-chromosome datasets and is stored in two (or more) distinct parts:

An anonymous file of sample analysis results, for instance genotype calls;
A link file which maps the anonymous results to the pseudonym participant IDs supplied in the standard phenotype download.

To download the anonymous results files using ukbgene, enter the following at the command line:

 ukbgene typename -cchrom [flags]

where typename is the type of data being retrieved, selected from the list:

typename	type of data to be retrieved	format	link format
cal	genotype calls	bed	fam
con	genotype confidences	txt	fam
int	genotype intensities	bin	fam
baf	genotype CNV b-allele frequencies	txt	fam
l2r	genotype CNV log2ratios	txt	fam
imp	imputation	bgen	sample
hap	haplotypes	bgen	sample

and chrom is the chromosome 1,2,...,22,X,Y,XY or MT. Additional person/sample-independent elements of a dataset (e.g. index files or QC) are not regarded as confidential and may be downloaded directly from the UKB Showcase Resource areas.

A full list of the available flags can be obtained by running ukbgene without any parameters. Particularly important are:

-m will produce the Link file associated with the Anonymised dataset;
-v will produce extra diagnostic information in case of problems.

5.1 Examples

Hence, to fetch the Anonymous genotype calls for Chromosome 17 enter

 ukbgene cal -c17

which will produce a bed-format file.

To fetch the Link file associated with that Anonymous dataset add the -m parameter to the command line, hence

 ukbgene cal -c17 -m

will fetch the corresponding fam-format file.
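The same commands can be repeated for every chromosome with a simple loop. The sketch below uses only the typename, -c and -m arguments described in this guide; the function name fetch_all is illustrative, and by default the commands are printed rather than executed. Not every data type is necessarily available for every chromosome label, so the list may need trimming.

```shell
# Chromosome labels accepted by -c according to this guide.
CHROMS="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y XY MT"

# Dry run by default: each command is printed instead of executed.
# Set UKBGENE=ukbgene in the environment to perform the downloads.
UKBGENE="${UKBGENE:-echo ukbgene}"

# fetch_all TYPENAME [FLAGS...]: run ukbgene once per chromosome,
# stopping at the first failure.
fetch_all() {
    typename=$1
    shift
    for chrom in $CHROMS; do
        $UKBGENE "$typename" "-c$chrom" "$@" || return 1
    done
}

fetch_all cal
```

Calling fetch_all cal -m would request the Link file for each chromosome in the same way.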

5.2 Duplication

Note that many of the Link files have the same contents for different anonymous files and hence only a single instance needs to be downloaded. Specifically:

The fam file is identical for all chromosomes and genotype data formats.
The sample file is identical for chromosomes 1-22 in the imputed data.

See Standard Usages for help on working with this.
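Where several Link files have already been downloaded, the claim that they are duplicates can be confirmed locally with cmp before the extra copies are discarded. This is a sketch; the function name same_link_files is hypothetical.

```shell
# same_link_files REFERENCE OTHER...: succeed only if every OTHER file is
# byte-for-byte identical to REFERENCE; name the first file that differs.
same_link_files() {
    ref=$1
    shift
    for f in "$@"; do
        cmp -s "$ref" "$f" || { echo "differs: $f" >&2; return 1; }
    done
}
```

For example, same_link_files followed by the Chromosome 1 fam file and then the fam files for the other chromosomes returns success when a single copy is sufficient.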

6. Fetching relatedness

The genotype information makes it possible to infer which participants within UK Biobank are related, and how closely. To retrieve this information, run ukbgene with the "rel" parameter thus:

 ukbgene rel

This will produce a five-column file giving a pairwise listing of the pseudo-IDs of related individuals, accompanied by the values:

HetHet : the fraction of markers for which the pair both have a heterozygous genotype;
IBS0 : the fraction of markers for which the pair shares zero alleles;
Kinship : estimate of the kinship coefficient for the pair, based on the set of markers used in the kinship inference.

In any pair where one or more of the participants has withdrawn, both pseudo-IDs are replaced by negative numbers.
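Such pairs can be dropped before analysis with a one-line awk filter. The sketch below assumes the file is whitespace-separated with the two pseudo-IDs in the first two columns; the function name drop_withdrawn_pairs is illustrative. A header line, if the file has one, is passed through unchanged because it does not begin with a minus sign.

```shell
# drop_withdrawn_pairs FILE: print the relatedness listing without any pair
# whose pseudo-IDs have been replaced by negative numbers.
# Assumption: columns are whitespace-separated and the two IDs come first.
drop_withdrawn_pairs() {
    awk '$1 !~ /^-/ && $2 !~ /^-/' "$1"
}
```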

7. File versioning

UK Biobank is a large study involving over 500,000 members of the general UK population. As a result of its size and composition it regularly encounters issues which are rare or absent in more tightly focussed studies involving only a few hundred or a few thousand participants. In particular, in most years several participants decide to withdraw their consent to being in the study completely. UK Biobank and all Researchers using it then have a legal duty (as detailed in the MTA signed when an Application is approved) to desist from any analysis work on individual-level data concerning those participants.

When a participant withdraws consent, UK Biobank immediately flags this in the central databases. The Link files are dynamically generated for each Researcher at the time of download and respond to this change immediately, substituting negative dummy-IDs for the pseudo-ID of any withdrawn participants. Using these new Link files for analysis work instantly removes any connection to withdrawn elements in the Anonymised data.
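The positions affected by withdrawals can be read straight from a freshly downloaded fam file, since the dummy-IDs are the only negative values in the first column. The following awk sketch assumes a whitespace-separated fam file with the pseudo-ID in column 1; the function names redacted_rows and usable_count are hypothetical.

```shell
# redacted_rows FAM: print the 1-based row numbers whose pseudo-ID has been
# replaced by a negative dummy-ID. Each such row marks a sample position in
# the Anonymised file that must not be used.
redacted_rows() {
    awk '$1 < 0 { print NR }' "$1"
}

# usable_count FAM: print how many rows still carry a real pseudo-ID.
usable_count() {
    awk '$1 >= 0 { n++ } END { print n + 0 }' "$1"
}
```

The value printed by usable_count is the number of linked samples, so it would be expected to agree with the sP part of the Link file name.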

At manageable intervals UK Biobank will regenerate the Anonymised files to purge the accumulated unusable entries. At that point Researchers will also need new Link files, because the number of rows/columns in the Anonymised files will have changed. Notices will be sent to all Researchers registered for genetic data in advance of such a purge being made.

Note that there will generally be a lower number of participants present in the imputation-derived Anonymous and Link files compared to the genotype-related ones.

7.1 File versioning illustration

To illustrate how the file versioning is performed, consider an initial Fam (Link) file which works with a Confidence (Anonymised) file; the latter is chosen for clarity because of its plain-text format. To simplify further, imagine the Version 2 dataset contained only 6 participants and 3 SNPs on Chromosome 1, in which case the initial release (for Application 99) might be:

ukb99_cal_chr1_v2_s6.fam

3298462 3298462 0 0 2 Batch_b001
8029816 8029816 0 0 1 Batch_b007
2874520 2874520 0 0 1 UKBiLEVEAX_b11
9023752 9023752 0 0 1 UKBiLEVEAX_b11
3679861 3679861 0 0 2 Batch_b032
7397822 7397822 0 0 2 Batch_b024

ukb_con_chr1_v2.txt

0.0011 0.0012 0.0013 0.0014 0.0015 0.0016
0.0021 0.0022 0.0023 0.0024 0.0025 0.0026
0.0031 0.0032 0.0033 0.0034 0.0035 0.0036

If the participant with pseudo-ID 3679861 withdraws then the Fam file contents and name ("s6" becoming "s5") would change to:

ukb99_cal_chr1_v2_s5.fam

3298462 3298462 0 0 2 Batch_b001
8029816 8029816 0 0 1 Batch_b007
2874520 2874520 0 0 1 UKBiLEVEAX_b11
9023752 9023752 0 0 1 UKBiLEVEAX_b11
-1 -1 0 0 0 redacted
7397822 7397822 0 0 2 Batch_b024

ukb_con_chr1_v2.txt

0.0011 0.0012 0.0013 0.0014 0.0015 0.0016
0.0021 0.0022 0.0023 0.0024 0.0025 0.0026
0.0031 0.0032 0.0033 0.0034 0.0035 0.0036

If the participant with pseudo-ID 2874520 also withdraws then the Fam file would change to:

ukb99_cal_chr1_v2_s4.fam

3298462 3298462 0 0 2 Batch_b001
8029816 8029816 0 0 1 Batch_b007
-1 -1 0 0 0 redacted
9023752 9023752 0 0 1 UKBiLEVEAX_b11
-2 -2 0 0 0 redacted
7397822 7397822 0 0 2 Batch_b024

ukb_con_chr1_v2.txt

0.0011 0.0012 0.0013 0.0014 0.0015 0.0016
0.0021 0.0022 0.0023 0.0024 0.0025 0.0026
0.0031 0.0032 0.0033 0.0034 0.0035 0.0036

If the Anonymised confidences file is then purged and regenerated (moving from Version 2 to 3), both the files and their names would alter to become:

ukb99_cal_chr1_v3_s4.fam

3298462 3298462 0 0 2 Batch_b001
8029816 8029816 0 0 1 Batch_b007
9023752 9023752 0 0 1 UKBiLEVEAX_b11
7397822 7397822 0 0 2 Batch_b024

ukb_con_chr1_v3.txt

0.0011 0.0012 0.0014 0.0016
0.0021 0.0022 0.0024 0.0026
0.0031 0.0032 0.0034 0.0036

If the participant with pseudo-ID 8029816 subsequently withdraws then the Fam file would change to:

ukb99_cal_chr1_v3_s3.fam

3298462 3298462 0 0 2 Batch_b001
-1 -1 0 0 0 redacted
9023752 9023752 0 0 1 UKBiLEVEAX_b11
7397822 7397822 0 0 2 Batch_b024

ukb_con_chr1_v3.txt

0.0011 0.0012 0.0014 0.0016
0.0021 0.0022 0.0024 0.0026
0.0031 0.0032 0.0034 0.0036
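In the illustration each row of the Fam file corresponds to one column of the Confidence file, so a Link file from before a purge no longer lines up with the regenerated Anonymised file. The sketch below checks this for the simplified layout used in the illustration only; real files may be laid out differently, and the function names are hypothetical.

```shell
# fam_rows FAM: number of rows (sample positions) in the Link file.
fam_rows() {
    awk 'END { print NR }' "$1"
}

# con_columns CON: number of whitespace-separated values on the first line.
con_columns() {
    awk 'NR == 1 { print NF; exit }' "$1"
}

# layout_matches FAM CON: succeed if the Link file has exactly one row per
# column of the Anonymised text file, as in the illustration.
layout_matches() {
    [ "$(fam_rows "$1")" = "$(con_columns "$2")" ]
}
```

Applied to the illustration, a Version 3 Fam file (4 rows) matches the Version 3 Confidence file (4 columns) but not the Version 2 one (6 columns).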

8. Standard usages

This section details some commonly encountered conundrums with analysis pipelines and suggests workarounds for them.

Common Link files
Many of the Anonymous files share the same Link files. It is possible to download a separate Link file for every Anonymous file; however, this process would have to be repeated whenever a participant withdrew. There are various ways of making this more efficient, all starting with downloading only a pair of Link files (for instance ukb_cal_chr1_vZ_sP.fam and ukb_imp_chr1_vZ_sP.sample):

Create multiple symlinks to act as the other Link files. For instance
```
ln   -s   ukb_cal_chr1_vZ_sP.fam   ukb_cal_chr2_vZ_sP.fam
```
sets up an alias whereby the Chromosome 1 file 'looks like' the Chromosome 2 file.
Set up a script to physically copy the initial file into any other names required.
Some analysis programs have optional parameters allowing the Anonymous and Link files to be specified separately.

Using any of these methods only the original two Link files need to be re-downloaded when a participant withdraws.
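The symlink approach can be automated for a whole set of chromosomes. The following is a sketch under stated assumptions: the function name alias_link_files is hypothetical, the source file name must contain "_chr1_", and the caller chooses which chromosome labels to alias (for the imputation sample file, only chromosomes 2 to 22 share the Chromosome 1 contents).

```shell
# alias_link_files SOURCE CHROM...
# For each CHROM, create a symlink next to SOURCE whose name has "_chr1_"
# replaced by "_chrCHROM_". Existing links of that name are replaced.
alias_link_files() {
    dir=$(dirname "$1")
    base=$(basename "$1")
    shift
    case $base in
        *_chr1_*) ;;
        *) echo "expected _chr1_ in file name: $base" >&2; return 1 ;;
    esac
    (
        # Work inside the directory so the relative links resolve correctly.
        cd "$dir" || exit 1
        for chrom in "$@"; do
            alias=$(printf '%s\n' "$base" | sed "s/_chr1_/_chr${chrom}_/")
            ln -sf "$base" "$alias" || exit 1
        done
    )
}
```

When a participant withdraws, replacing the single real file is enough; the aliases continue to point at it.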

Shared datasets
Because of the large size of the Anonymous files, UK Biobank has agreed that, with permission, Researchers from multiple approved Applications within a unit may share a common copy of them. However, this can create problems: some analysis programs assume specific names and/or locations for their input files, and it is likely that more than one set of Link files will be in use simultaneously. Possible remedies include:

Use symlinks to create multiple virtual copies of the Anonymous files in the same apparent location as each set of Link files.
Some analysis programs have optional parameters which allow the names and locations of their multiple input files to be specified independently.
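The first remedy might be sketched as follows. The function name link_shared and the directory layout (one shared directory of Anonymous files, one working directory per Application holding its own Link files) are assumptions made for illustration.

```shell
# link_shared SHARED_DIR APP_DIR
# Make every regular file in SHARED_DIR appear inside APP_DIR as a symlink.
# Anything already present in APP_DIR (e.g. Link files) is left untouched.
link_shared() {
    shared=$(cd "$1" && pwd) || return 1   # absolute path, so links stay valid
    mkdir -p "$2" || return 1
    for f in "$shared"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        [ -e "$2/$name" ] || ln -s "$f" "$2/$name" || return 1
    done
}
```

Each Application's directory then contains its own Link files alongside virtual copies of the shared Anonymous files, without duplicating the large data on disk.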
