Genotypes - Genomics
Genotype calls (based on genotyping array measurements) and related measurements. The genotypes are aligned to the + strand of the reference and positions are in GRCh37 coordinates.
The fields listed here are indicators; adding them to an Application basket allows researchers to download the corresponding underlying information from the UKB repository. This information includes:
- Calls (0.1TB)
- Confidences (2.9TB)
- Intensities (2.9TB)
- CNV B-allele frequencies (1.5TB)
- CNV log2ratios (2.3TB)
Marker quality control (QC) and Sample QC information, including population structure (see Relatedness in the Notes below), is also available. Intensity and SNP posteriors are available for cluster plotting. B-allele frequency and log2ratio are available for CNV analysis.
The lists of SNPs in the Genotype datasets can be downloaded from the Field's Resources tabs on a per-chromosome basis or as a combined tar in Resource 1963.
Researchers who need only a few specific genotyped SNPs (rather than the whole-chromosome datasets below) should list these SNP IDs in the "free-text" option section of their Application, and a custom dataset will be created for them. Please state whether the loci are listed as rsID or Affymetrix ID values.
See Resource 530 for additional information, including a citable reference for publications.
The genotype Calls are in binary PLINK format (.bed, .bim, .fam); see https://www.cog-genomics.org/plink/1.9/formats for details of the formats. The BIM file determines the order of markers in the calls and all of the other genotype data sets. The SNP_id is the rsid where it is available or the Affymetrix_SNP_id otherwise. The positions are in GRCh37 coordinates. The FAM file determines the order of samples in the calls and all of the other genotype data sets. The FAM file includes 'Batch' in the Phenotype field (6th column).
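Because the BIM and FAM files fix the marker and sample order for every other data set, it can help to load them first. The following is a minimal sketch, assuming the standard six-column PLINK 1.9 text layouts for .fam and .bim; the class and function names are illustrative and not part of any UK Biobank tool.

```python
from typing import Iterable, List, NamedTuple


class FamRow(NamedTuple):
    fid: str
    iid: str
    sex: str    # PLINK sex code, column 5
    batch: str  # column 6 (Phenotype), which holds 'Batch' in these files


class BimRow(NamedTuple):
    chrom: str
    snp_id: str  # rsid where available, Affymetrix SNP id otherwise
    pos: int     # base-pair position, GRCh37
    allele1: str
    allele2: str


def parse_fam(lines: Iterable[str]) -> List[FamRow]:
    """Parse .fam lines; the list index is the sample index used elsewhere."""
    rows = []
    for line in lines:
        f = line.split()
        if not f:
            continue  # skip blank lines
        rows.append(FamRow(f[0], f[1], f[4], f[5]))
    return rows


def parse_bim(lines: Iterable[str]) -> List[BimRow]:
    """Parse .bim lines; the list index is the marker index used elsewhere."""
    rows = []
    for line in lines:
        f = line.split()
        if not f:
            continue  # skip blank lines
        rows.append(BimRow(f[0], f[1], int(f[3]), f[4], f[5]))
    return rows
```

Both functions accept any iterable of lines, so an open file handle can be passed directly.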
The Confidence files contain the Affymetrix 'confidence' that a genotype belongs to the call cluster. These are plaintext files with space-separated columns. Values are in the range 0-1, with 0 being most confident. Missing values are represented by -1. The order of markers and samples is given by the BIM and FAM files.
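As an illustration of the conventions above (lower is more confident, -1 is missing), here is a minimal sketch for parsing one line of a Confidence file. It assumes each line holds one marker with one value per sample, which the text above does not state explicitly; the 0.1 cut-off is purely illustrative.

```python
from typing import List, Optional


def parse_confidence_line(line: str) -> List[Optional[float]]:
    """Parse one space-separated line; the missing code -1 becomes None."""
    values: List[Optional[float]] = []
    for token in line.split():
        x = float(token)
        values.append(None if x == -1 else x)
    return values


def confident_mask(row: List[Optional[float]], max_conf: float = 0.1) -> List[bool]:
    """True where a value is present and at or below max_conf.

    Lower values mean higher confidence. The default threshold is a
    hypothetical choice, not a UK Biobank recommendation.
    """
    return [v is not None and v <= max_conf for v in row]
```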
The CNV files contain the B-Allele-Frequency (baf) and Log2Ratio (log2r) transformed intensity values for performing CNV calling. There is a separate file for baf and log2r per chromosome. These are plaintext files with space-separated columns. The rows correspond to markers (ordered as the calls BIM file) and the columns correspond to samples (ordered as the calls FAM file). Missing values are represented by -1.
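Since each row is a marker and each column a sample, the values for a single sample can be read by streaming the file line by line, which avoids holding a whole chromosome in memory. A minimal sketch under that layout; the function name is illustrative.

```python
from typing import Iterable, Iterator, Optional


def sample_column(lines: Iterable[str], sample_index: int) -> Iterator[Optional[float]]:
    """Yield one sample's value for each marker, in BIM order.

    sample_index is the 0-based position of the sample in the FAM file.
    The missing code -1 is yielded as None.
    """
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        x = float(fields[sample_index])
        yield None if x == -1 else x
```

For example, `list(sample_column(open(path), 0))` would collect the first sample's values for every marker in the file, where `path` is a hypothetical local file name.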
The Intensity files contain the A,B intensity data measured by Affymetrix. The files are in a simple custom binary format. There are two intensity values A,B for each genotype, each represented as a 4-byte float. The set of A,B values for each marker are ordered consecutively by sample (analogous to a matrix with rows=SNPs and columns=Samples), e.g. SNP_1_SAMPLE_1_A SNP_1_SAMPLE_1_B SNP_1_SAMPLE_2_A SNP_1_SAMPLE_2_B ... SNP_1_SAMPLE_N_A SNP_1_SAMPLE_N_B SNP_2_SAMPLE_1_A SNP_2_SAMPLE_1_B ... Missing pairs of intensities are represented by -1 -1. The order of the markers and samples is given by the BIM and FAM files with the calls.
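The layout described above implies a fixed byte offset for every (marker, sample) pair, so single values can be read with a seek. The following is a minimal sketch, assuming little-endian floats and no file header; neither is stated above, so both should be checked against real data before use.

```python
import struct
from typing import BinaryIO, Optional, Tuple

BYTES_PER_PAIR = 8  # two 4-byte floats: A then B


def pair_offset(marker_index: int, sample_index: int, n_samples: int) -> int:
    """Byte offset of the A,B pair for 0-based marker and sample indices."""
    return (marker_index * n_samples + sample_index) * BYTES_PER_PAIR


def read_pair(fh: BinaryIO, marker_index: int, sample_index: int,
              n_samples: int) -> Optional[Tuple[float, float]]:
    """Return (A, B) for one genotype, or None if the pair is missing."""
    fh.seek(pair_offset(marker_index, sample_index, n_samples))
    raw = fh.read(BYTES_PER_PAIR)
    if len(raw) != BYTES_PER_PAIR:
        raise EOFError("intensity file is shorter than expected")
    # "<" selects little-endian byte order: an assumption, not from the text.
    a, b = struct.unpack("<2f", raw)
    if a == -1.0 and b == -1.0:
        return None  # missing pairs are stored as -1 -1
    return a, b
```

Here `n_samples` is the number of rows in the FAM file, and the marker and sample indices are positions in the BIM and FAM files respectively.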
Affymetrix transform the A,B values into 'contrast' and 'strength' for their calling algorithm. If the intensity data is to be used for making cluster plots it is strongly suggested that the transformed values are plotted. The ellipses described by the snp-posterior data are only compatible with the transformed intensity values.
- contrast (X) = log2(A/B)
- strength (Y) = log2(AB)/2
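The two formulas above can be applied to a raw A,B pair as follows. This is a minimal sketch; returning None for non-positive intensities (which includes the -1 -1 missing code) is a choice made here, not something the source specifies.

```python
import math
from typing import Optional, Tuple


def contrast_strength(a: float, b: float) -> Optional[Tuple[float, float]]:
    """Return (contrast, strength) for one A,B intensity pair.

    contrast (X) = log2(A/B)
    strength (Y) = log2(AB)/2
    """
    if a <= 0 or b <= 0:
        return None  # logarithm undefined; also covers the -1 -1 missing code
    contrast = math.log2(a / b)
    strength = math.log2(a * b) / 2
    return contrast, strength
```

For example, A=8 and B=2 give contrast log2(4) = 2 and strength log2(16)/2 = 2.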
Affymetrix called the genotypes in 106 batches of ~4,700 samples. To accurately examine the cluster plots, each batch should be plotted separately for a given marker. For chrX, males and females were called separately, so males and females should be plotted separately for chrX markers. Evoker (https://github.com/wtsi-medical-genomics/evoker) can be used to plot cluster plots, following the recommendations above, for the UK Biobank data.
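For custom plotting outside Evoker, the per-batch (and, for chrX, per-sex) split can be prepared by grouping sample indices using the FAM file. A minimal sketch, assuming the caller has already extracted (sex, batch) pairs from FAM columns 5 and 6 in file order; the function name is illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_sample_indices(fam_rows: Iterable[Tuple[str, str]],
                         split_by_sex: bool = False) -> Dict[tuple, List[int]]:
    """Group 0-based sample indices by batch, and also by sex if requested.

    fam_rows yields (sex, batch) pairs in FAM order. Set split_by_sex=True
    for chrX markers, where males and females were called separately.
    """
    groups: Dict[tuple, List[int]] = defaultdict(list)
    for index, (sex, batch) in enumerate(fam_rows):
        key = (batch, sex) if split_by_sex else (batch,)
        groups[key].append(index)
    return dict(groups)
```

Each group's index list can then be used to select the matching intensity pairs for one marker, giving one cluster plot per group.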
The Relatedness of individuals is obtained using the ukbgene utility (Resource 664) which generates a plaintext file with 5 space-separated columns:
- ID for participant 1 in related pair;
- ID for participant 2 in related pair;
- HetHet : fraction of markers for which the pair both have a heterozygous genotype;
- IBS0 : fraction of markers for which the pair shares zero alleles;
- Estimate of the kinship coefficient for this pair based on the set of markers used in the kinship inference;
where the HetHet, IBS0 and Kinship values were generated by the KING software.
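The five-column file described above can be parsed as follows. This is a minimal sketch: it skips any line whose numeric columns do not parse (for example a header row, if one is present, which is an assumption). The degree cut-offs in `king_degree` are the ones commonly quoted from the KING documentation; they are not stated in this document and are included for illustration only.

```python
from typing import Iterable, Iterator, NamedTuple


class RelatedPair(NamedTuple):
    id1: str
    id2: str
    het_het: float
    ibs0: float
    kinship: float


def parse_relatedness(lines: Iterable[str]) -> Iterator[RelatedPair]:
    """Yield one RelatedPair per valid 5-column line."""
    for line in lines:
        f = line.split()
        if len(f) != 5:
            continue
        try:
            het_het, ibs0, kinship = (float(x) for x in f[2:])
        except ValueError:
            continue  # non-numeric line, e.g. a header row if present
        yield RelatedPair(f[0], f[1], het_het, ibs0, kinship)


def king_degree(kinship: float) -> str:
    """Map a kinship coefficient to a relationship class.

    Cut-offs follow the KING documentation; treat them as a guide.
    """
    if kinship > 0.354:
        return "duplicate/MZ twin"
    if kinship > 0.177:
        return "1st-degree"
    if kinship > 0.0884:
        return "2nd-degree"
    if kinship > 0.0442:
        return "3rd-degree"
    return "unrelated"
```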