Using UK Biobank Data

Accessing your dataset

Table of contents

Introduction
Helper programs
Encoding file
Downloading the main dataset
Decrypting the file
Converting the file
Using the output files
Additional documents

1. Introduction

1.1

This guide is intended for Researchers who have had an Application for access to UK Biobank approved and have received an email containing a 32-character MD5 Checksum and a 64-character Password.

Please note that:
  • only the Applicant PI is sent the MD5 checksum and password when a dataset is released;
  • on the website, the data collection hyperlink is only visible to the Applicant PI and collaborators with "edit rights";
  • if edit rights need to be assigned to other collaborators, the Applicant PI should email the Access Team.

1.2

Standard data and additional documents can be downloaded directly from the UK Biobank website using the MD5 Checksum and password. The process of accessing bulk data (large complex items, e.g. eye images, ECG fitness test results) is described on this page: Accessing Complex Data within UK Biobank.

1.3

All standard data downloads must be decrypted and converted into a suitable format before they can be used. Helper programs to perform these tasks are provided.

1.4

Standard data and helper files are located in the Downloads section of the Showcase website. Please note that your dataset is only visible when you are logged in via the Access Management System.

1.5

Standard data can be downloaded multiple times, without limit. The MD5 checksum and password are only sent to the Applicant PI but anyone with edit access can download the dataset.

2. Helper programs

2.1

There are 3 helper programs required for decrypting and converting the data:

  • ukb_md5
  • ukb_unpack
  • ukb_conv

These are provided in the File Handlers tab in the Downloads section of the Showcase website (http://biobank.ctsu.ox.ac.uk/showcase/download.cgi), as detailed in Figure 1.

Fig. 1

2.2

The helper programs are supplied in two separate formats for compatibility with Windows or Linux operating systems. Download the helper programs one at a time by selecting the operating system you are using. This will open a new page, where the download can be found (Figure 2).

Fig. 2

2.3

Right-click the blue download link (i.e. ukb_md5, ukb_conv, or ukb_unpack) as shown in Figure 2, and select Save as.... To simplify the next steps of the process, we recommend that you save the 3 helper programs in the same file directory.

3. Encoding file

3.1

As part of the conversion process, the converter program ukb_conv will need a file called encoding.ukb. This file is used to assign coded definitions to variables in the dataset, and can be used on both Windows and Linux systems.

3.2

encoding.ukb is provided in the Miscellaneous Utilities tab in the Downloads section of the Showcase website, as detailed in Figure 3. We recommend that you save encoding.ukb along with the helper programs in the same directory.

Fig. 3

3.3

As shown in Figure 4, you should now have a directory with 4 files:

  • ukb_md5
  • ukb_conv
  • ukb_unpack
  • encoding.ukb

Fig. 4

4. Downloading the main dataset (standard data)

4.1

To download your dataset, you must first log in to Showcase via the Access Management System, then select the Downloads option. Your dataset is stored in the Dataset tab, as shown in Figure 5.

Fig. 5

4.2

Click on the ID number for the dataset you wish to download; this will take you to the authentication screen, where you will be asked to enter your 32-character MD5 Checksum (Figure 6). You should have received the MD5 Checksum (a long series of letters and numbers) via email. Paste it in the box and click on Generate; this will open a new page with a link to your dataset (Figure 7).

Fig. 6
Fig. 7

4.3

Once on the Download screen, click on the Fetch button to download the encrypted dataset. Save your dataset in the same file directory as the 4 other files that you have already downloaded.

5. Decrypting the file

5.1

You should now have 5 files in your directory, as shown in Figure 8. Note that the number in the ukbXXXX.enc file name will be specific to your dataset: it should be your application number minus the final digit, which acts as a checksum.

Fig. 8

5.2

You are now ready to begin the decryption process. This is done from a command-line interface.

If you are using Windows:

  • Windows XP: go to Start > All Programs > Accessories > Command Prompt
  • Windows Vista: go to Start > type cmd in the Search bar, and click on Command Prompt once it has appeared
  • Windows 7: go to Start > All Programs > Accessories > Command Prompt
  • Windows 8: go to Start > type cmd in the Search bar, and click on Command Prompt once it has appeared

For any version of Windows, if the Command Prompt does not appear after following the steps above, press Windows+R (the Windows key is located between the Ctrl and Alt keys on your keyboard). This will open a small window named "Run". Type cmd in the "Open:" space, then click OK. This will open a Command Prompt window.

If you are using Linux:

  • Open a standard Linux terminal.

5.3

Once it is opened, the Command Prompt window should display only a bit of text at the top, and then a blinking cursor preceded by a directory address on your computer (by default, this should be C:\Users\YourName), as shown in Figure 9.

Fig. 9

5.4

The next step is to navigate to the directory to which you previously downloaded all helper files, the encoding file and your dataset. To do this, type cd followed by the path that you wish to navigate to, relative to the current folder.

In our example, we downloaded the files in a directory named Biobank, which is located in the home directory for the user edouardm.

All we need to do is type cd Biobank and press Enter to navigate to the Biobank directory, as shown in Figure 10.

Fig. 10

5.5

Note that you can also use cd followed by two dots (cd ..) to go back to the parent directory, as shown in Figure 11.

Fig. 11

5.6

Use the cd command to navigate to the chosen directory. Once you are in the right directory, you can use the dir command (ls in a Linux terminal) to list all the files in the current directory (Figure 12). This allows you to check that you are indeed in the right place: the command should display the names of the 5 files that you previously downloaded.
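The same check can also be scripted. The following is a minimal Python sketch, not part of the UK Biobank tools: the function name missing_files is illustrative, and matching the helper programs by file stem (so that a name such as ukb_md5.exe would also count) is an assumption about how the files may be named on your system.

```python
from pathlib import Path

# Names taken from this guide. Helpers are matched by stem because the
# downloaded programs may carry an extension on some systems (an assumption).
HELPERS = ("ukb_md5", "ukb_unpack", "ukb_conv")
ENCODING_FILE = "encoding.ukb"


def missing_files(directory):
    """Return the names of the expected files that are not in `directory`."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    names = {p.name for p in files}
    stems = {p.stem for p in files}
    missing = [helper for helper in HELPERS if helper not in stems]
    if ENCODING_FILE not in names:
        missing.append(ENCODING_FILE)
    # The encrypted dataset is named ukbXXXX.enc, where XXXX varies.
    if not any(p.name.startswith("ukb") and p.suffix == ".enc" for p in files):
        missing.append("ukbXXXX.enc")
    return missing
```

Calling missing_files(".") from the directory that holds your downloads returns an empty list when all 5 files are present.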

Fig. 12

5.7

Before decrypting the file, we recommend that you verify the integrity of the file that you have downloaded. The program ukb_md5 is provided for this purpose. In the Command Prompt, type ukb_md5 ukbXXXX.enc, replacing ukbXXXX.enc with the name of your dataset file, and press Enter (Figure 13).

Fig. 13

5.8

After a few seconds of processing, this command should display the MD5 Checksum of the dataset. Its value should be the same as the MD5 Checksum supplied via email. If the value is different, the file should be deleted and the data downloaded afresh.
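If you would like an independent cross-check of the value reported by ukb_md5, the sketch below computes a checksum in Python. It assumes that the emailed value is a standard MD5 digest of the downloaded file's contents, which this guide does not state explicitly. The names file_md5 and checksum_matches are illustrative and are not part of the UK Biobank tools.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 digest of a file as 32 lowercase hex characters."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that a large .enc file need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected):
    """Compare the file's MD5 with the emailed value, ignoring case and
    surrounding whitespace picked up when copying from the email."""
    return file_md5(path) == expected.strip().lower()
```

For example, checksum_matches("ukbXXXX.enc", "<value from your email>") returns True only when the two values agree. If it returns False, delete the file and download it again, as described above.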

5.9

Your dataset is supplied in a compressed encrypted format. The ukb_unpack program decrypts and uncompresses the downloaded file into a custom UK Biobank format. To use the program, type ukb_unpack ukbXXXX.enc keyvalue in the Command Prompt, replacing:

  • ukbXXXX.enc with the right name of your dataset file;
  • keyvalue with the 64-character Password from the notification email. Note that the usual keyboard shortcut to paste data (Ctrl+V) does not work in the Command Prompt. You can right-click anywhere in the black space and choose Paste in the drop-down menu to achieve the same effect.

Each .enc file has a different key that can be found as an attachment to the automated email notifying the researcher of the release of the dataset. The key files are not interchangeable, even between datasets released for the same project.

After having pressed Enter, the Command Prompt will unpack the file. This could take a few minutes - the remaining time is displayed by the Command Prompt, as shown in Figure 14.

Fig. 14

5.10

This process will create a new file in your directory, named ukbXXXX.enc_ukb.
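If you prefer to script the unpacking step, the command described in 5.9 can be assembled as follows. This is a sketch only: the function names are hypothetical, the 64-character check simply mirrors the Password length stated in this guide, and it is an assumption that ukb_unpack signals failure through a non-zero exit status.

```python
import subprocess


def unpack_command(enc_file, key, program="ukb_unpack"):
    """Build the argument list for: ukb_unpack ukbXXXX.enc keyvalue."""
    key = key.strip()
    if len(key) != 64:
        raise ValueError(
            f"expected a 64-character Password, got {len(key)} characters"
        )
    return [program, enc_file, key]


def unpacked_name(enc_file):
    """Name of the file created by unpacking, e.g. ukbXXXX.enc_ukb."""
    return enc_file + "_ukb"


def run_unpack(enc_file, key, program="ukb_unpack"):
    """Run the helper program and return the expected output file name.

    subprocess raises CalledProcessError if the program exits with a
    non-zero status.
    """
    subprocess.run(unpack_command(enc_file, key, program), check=True)
    return unpacked_name(enc_file)
```

On Linux, a program stored in the current directory usually has to be given as ./ukb_unpack, which is why the program name is a parameter.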

6. Converting the file

6.1

The result of the unpacking process is a dataset in a custom UK Biobank format. The ukb_conv program transforms this dataset into various standard formats ready for use, as detailed in Table 1. The original file remains intact so the converter may be used multiple times without limit, to generate different outputs.

Note that there is no SPSS option available at the moment. If you would like to use SPSS to open your dataset, the best option is to use the csv format first, and then import the data into SPSS.

Tab. 1

Format  Description
csv     Simple comma-separated-variable output, all fields double-quoted. Suitable for import into packages such as Microsoft Excel.
sas     Data is converted to a format suitable for import into the SAS package.
stata   Data is converted to a format suitable for import into the Stata package.
r       Data is converted to a format suitable for import into the R package.
docs    Rather than output the data itself, this option generates an html file describing the data, listing the names and types of each field.
bulk    A list is created suitable for use with the ukbfetch bulk-access utility.

6.2

To convert your dataset, type ukb_conv ukbXXXX.enc_ukb format, where:

  • ukbXXXX.enc_ukb is the name of the decrypted file that you obtained after step 5.9;
  • format is one of the options presented in the first column of Table 1, depending on the desired output.

Optional parameters can be used to specify alternative encoding, subsets of fields to include or exclude, or the desired name of the output file. Please see below for a description of these optional parameters.

Fig. 15

6.3

Depending on the number of variables and participants in the dataset, as well as the speed of the local system, the conversion could take a considerable amount of time (possibly hours). Once complete, your dataset will appear in the file directory that you originally specified.

Optional parameters for conversion

Optional parameters can be applied to the conversion process using the Command Prompt. They are detailed in the table below. The options work by specifying the flag (i.e. -e, -i, -s, etc.), followed immediately by the file-ID defining the parameters:

Command: ukb_conv ukbXXXX.enc_ukb format [options]

Flag  Meaning
-e    Specify an alternative file from which to extract encoding information
-i    Specify a subset of fields to include in the output
-o    Specify an alternative name for the output file
-s    Specify a single field (only) to include in the output
-x    Specify a subset of fields to exclude from the output
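As an illustration of how the command line fits together, the sketch below assembles a ukb_conv call from the formats in Table 1 and the flags listed above. The function name conv_command is hypothetical. Attaching each value directly to its flag with no space (for example -isubset.txt) is our reading of "followed immediately by" and should be treated as an assumption.

```python
FORMATS = ("csv", "sas", "stata", "r", "docs", "bulk")
FLAGS = ("e", "i", "o", "s", "x")


def conv_command(unpacked_file, fmt, program="ukb_conv", **options):
    """Build the argument list for: ukb_conv ukbXXXX.enc_ukb format [options]."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format {fmt!r}; expected one of {FORMATS}")
    command = [program, unpacked_file, fmt]
    for flag, value in options.items():
        if flag not in FLAGS:
            raise ValueError(f"unknown flag -{flag}")
        # Assumption: the value is attached to the flag with no space.
        command.append(f"-{flag}{value}")
    return command
```

For example, conv_command("ukbXXXX.enc_ukb", "csv", i="subset.txt", o="my_output") produces the arguments ukb_conv ukbXXXX.enc_ukb csv -isubset.txt -omy_output, where subset.txt and my_output are placeholder names.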

Flag -e

By default ukb_conv will look for the encoding file encoding.ukb, but by using the -e option, a different file can be used as the source for the coding definitions. If no encoding file is found, the conversion process will still work, but note that categorical variables will not be coded, and hence may be less meaningful.

Flags -i and -x

When selecting subsets of fields, using the options -i or -x, the file defining the parameters (file-ID) should be in text format (.txt), with one field-ID per row. To assist with preparing this file, the converter outputs a file named fields.ukb each time it is run, which lists all the available fields associated with the dataset. This can be edited to identify particular fields, which are to be included in the subset. The -s option for selecting a single data field works by specifying the field-ID immediately after the flag.

Note that running the converter twice, using the same subset file but with -i and -x on alternate runs, will split the dataset into two complementary parts.
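A subset file of this kind can be written by hand or generated. The sketch below writes one field-ID per row and then builds the two complementary command lines. The function names and the output names included and excluded are hypothetical. Giving each run its own -o name is our suggestion, on the assumption that two runs with the same output name would otherwise write to the same file.

```python
from pathlib import Path


def write_field_list(path, field_ids):
    """Write one field-ID per row, dropping duplicates but keeping order."""
    unique = list(dict.fromkeys(str(f).strip() for f in field_ids))
    Path(path).write_text("\n".join(unique) + "\n", encoding="ascii")
    return unique


def split_commands(unpacked_file, fmt, subset_file):
    """Two ukb_conv command lines that split a dataset into complementary
    parts: the listed fields (-i) and everything else (-x)."""
    return [
        ["ukb_conv", unpacked_file, fmt, f"-i{subset_file}", "-oincluded"],
        ["ukb_conv", unpacked_file, fmt, f"-x{subset_file}", "-oexcluded"],
    ]
```

The field-IDs that you pass to write_field_list should be taken from the fields.ukb file generated for your own dataset.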

7. Using the output files

Depending on the format chosen in step 6.2, the number of files generated by the converter will vary. Here is a description of how each of them can be used; for each format, the main file to be used is in bold letters.

Format: csv

ukbXXXX.csv

This file contains the data for all variables and participants, in a comma-separated format. It can be imported into any statistical package or spreadsheet application. However, the choice of "csv" as the conversion format means that none of the categorical values will be labelled in the dataset.
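Because the csv output can be very large, it may help to inspect the first few rows before importing the whole file. The following Python sketch does this with the standard csv module, which handles the double-quoted fields. The function name preview_csv is illustrative, and the UTF-8 text encoding is an assumption that this guide does not confirm.

```python
import csv


def preview_csv(path, rows=5, encoding="utf-8"):
    """Return the header row and the first few data rows of a csv file."""
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)  # copes with double-quoted fields
        header = next(reader)
        # Stop after `rows` data rows, or earlier if the file is shorter.
        sample = [row for _, row in zip(range(rows), reader)]
    return header, sample
```

For example, header, sample = preview_csv("ukbXXXX.csv") gives the column names in header and up to five rows of values in sample, all as text.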

fields.ukb

This file contains a list of the fields present in the dataset. It is used by the Command Prompt to extract the different variables.

ukbXXXX.log

This log file is simply used to summarise the result of the conversion process (date & time, name of the output file, application identifier, basket identifier, number of variables, time required to convert).

Format: sas

ukbXXXX.sd2

This is the actual file containing the data as a SAS Data Set. This file could potentially be imported directly into SAS, but none of the values will be coded.

ukbXXXX.sas

This file is the SAS program that should be opened and executed. It contains a list of commands that will import the dataset (as a dataset named WORK.LABELLED_LFVPWW) and recode all categorical variables.

fields.ukb

This file contains a list of the fields present in the dataset. It is used by the Command Prompt to extract the different variables.

ukbXXXX.log

This log file is simply used to summarise the result of the conversion process (date & time, name of the output file, application identifier, basket identifier, number of variables, time required to convert).

Format: stata

ukbXXXX.raw

This is the actual file containing the data. This file could potentially be imported directly into Stata, but none of the values will be coded.

ukbXXXX.do

This file is the one that should be opened and executed in Stata. It contains a list of commands that will import the dataset and recode all categorical variables.

ukbXXXX.dct

A dictionary of values used by ukbXXXX.do to format and label variables in the imported dataset.

fields.ukb

This file contains a list of the fields present in the dataset. It is used by the Command Prompt to extract the different variables.

ukbXXXX.log

This log file is simply used to summarise the result of the conversion process (date & time, name of the output file, application identifier, basket identifier, number of variables, time required to convert).

Format: r

ukbXXXX.tab

This is the actual file containing the data, in a tab-separated format. This file could potentially be imported directly into R, but none of the values will be coded.

ukbXXXX.R

This file is the one that should be opened and executed in R (or any other R environment, such as RStudio). It contains a list of commands that will import the dataset (as a data.frame named bd) and recode all categorical variables.

fields.ukb

This file contains a list of the fields present in the dataset. It is used by the Command Prompt to extract the different variables.

ukbXXXX.log

This log file is simply used to summarise the result of the conversion process (date & time, name of the output file, application identifier, basket identifier, number of variables, time required to convert).

Format: docs

ukbXXXX.html

This HTML file can be opened by any web browser (such as Internet Explorer, Firefox, Chrome, Safari). It describes the data and lists the name, type and count for each variable in the dataset.

fields.ukb

This file contains a list of the fields present in the dataset. It is used by the Command Prompt to extract the different variables.

ukbXXXX.log

This log file is simply used to summarise the result of the conversion process (date & time, name of the output file, application identifier, basket identifier, number of variables, time required to convert).

8. Additional documents

In some instances, additional documents are sent out to researchers alongside the standard dataset. These may include bespoke data-field customisation or bridging files to link two separate UK Biobank applications together.

Additional documents have their own unique MD5 checksum and password and are downloaded, validated and unpacked using the same methods described for standard data. After being unpacked, these files are ready to be used and do not need to be converted.
