BGEN

BGEN in the UK Biobank

BGEN has been used for the release of imputed genotype probability data and phased haplotype data in the UK Biobank. This page contains technical details of the formats used.

Note: Questions about the UK Biobank genomics data releases should be directed to the UKB-GENETICS mailing list.

UK Biobank genotype and imputed data full release

The UK Biobank has released both imputed genotype and phased haplotype data for the full biobank cohort (487,409 individuals after QC, including the individuals from the interim release). This section contains details of what is found in these data. For full information on data processing, see Bycroft, Freeman, Petkova et al, "The UK Biobank resource with deep phenotyping and genomic data", Nature (2018).

Phased haplotype data

Phased haplotypes have been released in bgen files with names of the form ukb_hap_chr<chr>_v2.bgen, with corresponding .bgen.bgi index files. Here are details of what is found in these files:

A simple way to see the contents of these files is to convert them to VCF format using bgenix, for example with the command:

bgenix -g ukb_hap_chr10_v2.bgen -vcf

This output reflects the fact that the data are conceptually stored as four probabilities per individual per variant (i.e. the probability of each of the two alleles on each of the two haplotypes), and are directly convertible to phased genotype calls. See the BGEN format specification for full details of data storage.
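As an illustration of how four per-haplotype probabilities map onto a phased call, here is a minimal Python sketch. The function name `phased_call`, the 0.5 threshold, and the assumed ordering of the four values (haplotype 1 allele 1, haplotype 1 allele 2, haplotype 2 allele 1, haplotype 2 allele 2) are assumptions made for this example; check the BGEN format specification for the authoritative layout.

```python
from typing import Sequence


def phased_call(probs: Sequence[float], threshold: float = 0.5) -> str:
    """Turn four haplotype-allele probabilities into a phased call such as '0|1'.

    Assumed ordering (hypothetical, for illustration):
    (hap1 allele1, hap1 allele2, hap2 allele1, hap2 allele2).
    A haplotype whose best allele does not exceed `threshold` is reported as '.'.
    """
    if len(probs) != 4:
        raise ValueError("expected exactly four probabilities")
    calls = []
    for hap in (probs[0:2], probs[2:4]):
        # Index 0 is the first allele, index 1 the second allele.
        best = max(range(2), key=lambda i: hap[i])
        calls.append(str(best) if hap[best] > threshold else ".")
    return "|".join(calls)


if __name__ == "__main__":
    # First allele on haplotype 1, second allele on haplotype 2.
    print(phased_call([1.0, 0.0, 0.0, 1.0]))
```

For phased haplotype data the probabilities are typically exactly 0 or 1, so the threshold only matters if the same function is applied to uncertain values.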

A note on chromosome information: a processing issue means that these files have been encoded with blank chromosome information (instead, the chromosome is encoded in the variant ID field of the file). This has consequences for analysis using the bgen tools. Please see [[Using the UK Biobank full release index files]] for more information on this and a workaround.

Imputed genotype data

Imputed data files have been released in BGEN format files, with filenames of the form ukb_imp_chr<chr>_<version>.bgen, and corresponding index files. Here the version is either 'v2' (for the initial release of these data) or 'v3' (for the later release, which fixed a number of bugs in the initial release). Here are details of what is found in these files:

**Important:** For most purposes you should use the final 'v3' version of these files. Please read [[Using the UK Biobank full release index files]] if you are using the index files that UK Biobank supplied with the initial ('v2') version of these data.
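When scripting over these files, the naming pattern can be captured in a small helper. The following Python sketch builds an imputed-data filename from a chromosome label and a version, defaulting to 'v3'. The function name `ukb_imputed_bgen` is hypothetical, and the set of valid chromosome labels is not checked because it is not stated here.

```python
def ukb_imputed_bgen(chrom: str, version: str = "v3") -> str:
    """Return a filename of the form ukb_imp_chr<chr>_<version>.bgen.

    Only 'v2' (initial release) and 'v3' (later, corrected release) are accepted.
    """
    if version not in ("v2", "v3"):
        raise ValueError(f"unknown version: {version!r}")
    return f"ukb_imp_chr{chrom}_{version}.bgen"


if __name__ == "__main__":
    print(ukb_imputed_bgen("10"))
```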

UK Biobank genotype and imputed data interim release

In May 2015 the UK Biobank released imputed genotype data for 152,249 individuals, typed or imputed at 72,355,667 variants genome-wide. These data were released in BGEN v1.1 format. See the UK Biobank Data Showcase page for more information on these data.