BGEN: The bgenix index file format

This page documents the index file format used by bgenix. For more information on bgenix, see the bgenix documentation page.

Overview

bgenix index files are sqlite files, so they are easy to read or manipulate using popular programming languages or the sqlite3 command-line tool. (sqlite version 3.8.2 or above is needed to work with these files.)

For example, the command

sqlite3 myfile.bgen.bgi "SELECT * FROM Variant LIMIT 10"

will list the first few variants in the index.

This snippet uses the RSQLite package to load the same information into R:

library( RSQLite )
index = dbConnect( dbDriver( "SQLite" ), "myfile.bgen.bgi" )
variants = dbGetQuery( index, "SELECT * FROM Variant LIMIT 10" )

And this snippet does the same thing in Python:

import sqlite3
index = sqlite3.connect( "myfile.bgen.bgi" )
variants = index.execute( "SELECT * FROM Variant LIMIT 10" ).fetchall()

Index schema

The index is stored in a single table (called Variant by default). The schema of this table is as follows:

CREATE TABLE Variant (
  chromosome TEXT NOT NULL,
  position INT NOT NULL,
  rsid TEXT NOT NULL,
  number_of_alleles INT NOT NULL,
  allele1 TEXT NOT NULL,
  allele2 TEXT NULL,
  file_start_position INT NOT NULL,
  size_in_bytes INT NOT NULL,
  PRIMARY KEY (chromosome, position, rsid, allele1, allele2, file_start_position )
) WITHOUT ROWID;

By default this table is created using the "WITHOUT ROWID" option. This means that, unlike standard sqlite tables, the table does not have an extra, hidden rowid column. Instead, the table is stored on disk in sorted order, sorted lexicographically by the columns listed in the PRIMARY KEY.
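
Because rows are stored in primary-key order, a query that fixes the chromosome and bounds the position can be answered from a contiguous run of rows. The following is a minimal sketch of such a lookup, assuming the default table name Variant and the schema above; the helper name variants_in_range is hypothetical and not part of bgenix.

```python
import sqlite3


def variants_in_range(index_path, chromosome, start, end):
    """Return index rows with start <= position <= end on one chromosome.

    Hypothetical helper (not part of bgenix). Assumes the default table
    name `Variant` and the schema documented above.
    """
    query = (
        "SELECT chromosome, position, rsid, allele1, allele2, "
        "file_start_position, size_in_bytes "
        "FROM Variant "
        "WHERE chromosome = ? AND position BETWEEN ? AND ? "
        # Same column order as the PRIMARY KEY, i.e. the on-disk order.
        "ORDER BY chromosome, position, rsid, allele1, allele2, "
        "file_start_position"
    )
    connection = sqlite3.connect(index_path)
    try:
        return connection.execute(query, (chromosome, start, end)).fetchall()
    finally:
        connection.close()
```

The chromosome column is TEXT, so the value passed in has to match the stored string exactly (for example "01" and "1" are different keys).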

The index table stores the first two alleles of each variant in the index. Other alleles are not stored at the moment; bgenix currently does not make use of allele information.

The file_start_position and size_in_bytes columns specify the range of bytes within the indexed bgen file that contains the data for each variant. Implementations may seek to byte file_start_position in the bgen file and read size_in_bytes bytes from the file. The resulting data will then contain the "variant data" and "genotype data" blocks for the corresponding variant.
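
The seek-and-read step described above can be sketched as follows. This is an illustration only, assuming the default table name Variant; the helper name read_variant_bytes is hypothetical, and the returned bytes are still in the encoded bgen form, so they need to be decoded according to the bgen format before use.

```python
import sqlite3


def read_variant_bytes(bgen_path, index_path, rsid):
    """Return the raw bytes for the first index entry matching rsid.

    Hypothetical helper (not part of bgenix). Returns None if the index
    has no entry with this rsid. Assumes the default table name `Variant`.
    """
    connection = sqlite3.connect(index_path)
    try:
        row = connection.execute(
            "SELECT file_start_position, size_in_bytes FROM Variant "
            "WHERE rsid = ? LIMIT 1",
            (rsid,),
        ).fetchone()
    finally:
        connection.close()
    if row is None:
        return None

    start, size = row
    with open(bgen_path, "rb") as bgen_file:
        bgen_file.seek(start)         # jump to the start of the variant's data
        data = bgen_file.read(size)   # read exactly the indexed byte range
    if len(data) != size:
        raise IOError("bgen file is shorter than the index expects")
    return data
```

Note that rsid is not the leading column of the primary key, so this lookup may scan the whole table; looking up by chromosome and position first would make better use of the sort order.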

Metadata schema

Newer versions of bgenix additionally store metadata about the bgen file in a Metadata table. When loading an index, this information is used to verify that the index file matches the bgen file it is being used for. The schema of this table is:

CREATE TABLE Metadata (
  filename TEXT NOT NULL,
  file_size INT NOT NULL,
  last_write_time INT NOT NULL,
  first_1000_bytes BLOB NOT NULL,
  index_creation_time INT NOT NULL
);

The table has one row. The first three columns hold the name, size, and last write time of the bgen file corresponding to this index. The fourth column contains the first 1000 bytes (or fewer if the file is smaller) of the bgen file. The fifth column records when the index was created.
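
As an illustration of how the Metadata table could be used, the sketch below compares the stored file size and leading bytes against a bgen file. It is not a description of the checks bgenix itself performs: the helper name index_matches_bgen is hypothetical, and the sketch deliberately ignores filename and last_write_time because their exact encoding is not specified on this page.

```python
import os
import sqlite3


def index_matches_bgen(bgen_path, index_path):
    """Compare Metadata.file_size and first_1000_bytes with the bgen file.

    Hypothetical helper (not part of bgenix). Returns True or False when
    metadata is available, and None when the index has no Metadata table
    (older indexes) or the table is empty.
    """
    connection = sqlite3.connect(index_path)
    try:
        has_table = connection.execute(
            "SELECT COUNT(*) FROM sqlite_master "
            "WHERE type = 'table' AND name = 'Metadata'"
        ).fetchone()[0] > 0
        if not has_table:
            return None
        row = connection.execute(
            "SELECT file_size, first_1000_bytes FROM Metadata LIMIT 1"
        ).fetchone()
    finally:
        connection.close()
    if row is None:
        return None

    stored_size, stored_prefix = row
    with open(bgen_path, "rb") as bgen_file:
        actual_prefix = bgen_file.read(1000)  # fewer bytes if the file is small
    return (
        os.path.getsize(bgen_path) == stored_size
        and actual_prefix == bytes(stored_prefix)
    )
```

Returning None rather than False for a missing Metadata table lets callers distinguish "cannot verify" from "verified mismatch".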