
Accessing UK Biobank data guide

The following document provides guidance on how to download the different types of UK Biobank data:

Data access guide

Please see the Understanding UK Biobank page for information about the data available, which also includes further information and reports about linkage data (Cancer, Hospital Inpatient, Death).

Data Dictionaries and Encodings

Data Dictionary of Showcase fields:
List of Data-Codings (including those for the Data Portal):

The comma-separated values (csv) versions of the above are easier to view in packages such as Excel. However, because the field descriptions and coding meanings sometimes contain commas, some of the fields are enclosed in quotes to indicate that such a comma is not a separator. The tab-separated values (tsv) versions are therefore provided as an alternative.
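As a minimal sketch of why the quoting matters, the Python snippet below parses a small csv extract with the standard csv module and compares it with a naive split on commas. The column names and values are hypothetical and are not taken from the real dictionary files.

```python
import csv
import io

# Hypothetical dictionary extract: column names and values are made up for illustration.
csv_text = (
    "id,description\n"
    "1,Plain description\n"
    '2,"Description, with a comma"\n'
)

# Splitting on commas ignores the quoting and breaks the last row into three pieces.
naive = csv_text.splitlines()[2].split(",")

# csv.reader honours the quotes and returns two fields per row.
rows = list(csv.reader(io.StringIO(csv_text)))

# The tsv form of the same data needs no quoting because the delimiter is a tab.
tsv_text = "id\tdescription\n1\tPlain description\n2\tDescription, with a comma\n"
tsv_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

print(len(naive))   # 3 pieces from the naive split
print(rows[2])      # 2 fields, comma preserved inside the description
print(tsv_rows[2])  # same 2 fields from the tsv form
```

Any csv-aware reader should behave like csv.reader here; the point is only that the csv versions should not be split on raw commas.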

UKB Synthetic Dataset

A synthetic version of the UK Biobank dataset has been created to allow large scale system testing using data which is comparable in size and constitution to the real dataset.

Further details regarding the methods used to create this resource, and links to file downloads, can be found on the synthetic dataset page.

Size of the Core Dataset

The core dataset consists of the categories shown in the Quick Start section of Showcase when a basket is first created:

As an illustration of the potential size of a downloaded UK Biobank main dataset, the sizes of the various files generated from the core dataset are given in the table below:

File type   ukbconv option   Dataset size   File extension
ukb         -                16.6 GB        .enc_ukb
tsv*        txt              18.3 GB        .txt
csv         csv              36.0 GB        .csv
R           r                34.6 GB        .tab
SAS         sas              65.4 GB        .sd2
Stata       stata            65.6 GB        .raw

In addition, the R .tab file is accompanied by a 511 KB .r script, the SAS .sd2 file by a 1.3 MB .sas script, and the Stata .raw file by a 501 KB .do script and a 961 KB .dct file.

Note that the large difference in size between the tsv .txt file and the (also tab-separated) R .tab file is due to empty fields being represented by the empty string in the former and by NA in the latter. Similarly, all fields are quoted in the .csv file, with empty fields appearing as "", which accounts for its additional size compared to the .txt file.
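The effect of these representations on file size can be illustrated with a small sketch. The Python function below serialises a single hypothetical row (the values are invented, with None standing for an empty field) in each of the three styles described above and counts the bytes. It is an illustration of the principle only, not a reproduction of how ukbconv writes its output.

```python
def encode_row(values, style):
    """Serialise one row in the given style; None marks an empty field."""
    if style == "txt":
        # Tab-separated, empty fields written as the empty string.
        cells = ["" if v is None else str(v) for v in values]
        return "\t".join(cells) + "\n"
    if style == "tab":
        # Tab-separated, empty fields written as NA.
        cells = ["NA" if v is None else str(v) for v in values]
        return "\t".join(cells) + "\n"
    if style == "csv":
        # Comma-separated, every field quoted, empty fields written as "".
        cells = ['"' + ("" if v is None else str(v)) + '"' for v in values]
        return ",".join(cells) + "\n"
    raise ValueError(f"unknown style: {style}")


# Hypothetical row: an identifier, one populated field and three empty ones.
row = [1000001, None, None, 3, None]

sizes = {s: len(encode_row(row, s).encode("utf-8")) for s in ("txt", "tab", "csv")}
print(sizes)  # {'txt': 13, 'tab': 19, 'csv': 23}
```

Because most fields are empty for most participants, the two extra characters per empty field (NA or a pair of quotes) add up across the whole dataset.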

Information about the sizes of bulk data items such as MRI images can be found in Section 8.4 of the "Accessing data guide" above. That guide also includes links to documents providing information about the size of the Genotype data (Section 4.1) and the Exome data (Section 4.3).

* Please note that there is currently a glitch in the tsv version of the converted file: every row except the first starts with an extra tab, which throws off the column alignment. If you intend to use this option, you will need the technical know-how to manipulate the file to correct the problem. An alternative is to use the R option instead, as that also produces a tab-separated file, and to disregard the accompanying R script. (Note, however, that empty fields will appear as NA rather than the empty string in the resulting file.) We are currently looking into correcting this problem.
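One possible way to correct the extra leading tab is sketched below in Python. It assumes the glitch is exactly as described above (a single spurious tab at the start of every row after the header); the function names and the file names in the usage note are hypothetical. The file is processed line by line, so a multi-gigabyte file does not need to fit in memory.

```python
def strip_leading_tab(lines):
    """Yield each line, dropping one leading tab from every row after the header."""
    for index, line in enumerate(lines):
        if index > 0 and line.startswith("\t"):
            # Remove only the first character, keeping all genuine separators.
            yield line[1:]
        else:
            yield line


def fix_tsv_file(source_path, target_path):
    """Stream source_path to target_path with the leading tabs removed."""
    with open(source_path, "r", encoding="utf-8", newline="") as source, \
         open(target_path, "w", encoding="utf-8", newline="") as target:
        target.writelines(strip_leading_tab(source))


# Small in-memory check using invented column names and values.
sample = ["eid\tfield_a\n", "\t1000001\t3\n", "\t1000002\t\n"]
fixed = list(strip_leading_tab(sample))
print(fixed)
```

To apply this to a converted file, a call such as fix_tsv_file("dataset.txt", "dataset_fixed.txt") would write a corrected copy and leave the original untouched. It is worth checking afterwards that every row has the same number of columns as the header.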
