BGEN

Documentation
Login

bgenix is a tool to create an index of variants in a bgen file and to use that index for efficient retrieval of data for specific variants or regions.

Quick start

To use bgenix with your bgen file (say myfile.bgen) you use the following process.

Here's a quick list of common bgenix options and what they do:

Command line or option What it does
-help Print help on the various options bgenix supports
-g Specify the bgen file to operate on.
-index Create an index file for the given bgen file. It will be named file.bgen.bgi
-incl-rsids Only output variants that have one of the given rsid(s).
-excl-rsids Only output variants that don't have the given rsid(s).
-incl-range Only output variants in one of the given ranges.
-excl-range Only output variants outside the given ranges.
-vcf Transcode data to VCF format.
-v11 Transcode data to BGEN v1.1 format.
-list Don't output genotype data, just list the variants in the index.

For example, assuming the index file already exists, the command

bgenix -g file.bgen -list -incl-range 11:3500000-6500000

will print a list of all variants in the given range, while

bgenix -g file.bgen -vcf -incl-range 11:3500000-6500000
will output a VCF file for that region.

Note that bgenix always writes its output to stdout. You'll therefore usually want to capture this by redirecting the output to a file like this:

bgenix -g file.bgen -incl-range 11:3500000-6500000 > output.bgen
or piping to another command lke this:
bgenix -g file.bgen -incl-range 11:3500000-6500000 | qctool -g - -filetype bgen -snp-stats -osnp stats.txt

See bgenix -help for a full list of supported options.


Detailed usage notes

Building an index

Use

bgenix -g myfile.bgen -index

to build an index file. This is typically pretty quick but might take a few minutes on a very large file.

The index file format

bgenix index files store the chromosome, position, alleles and identifer of each variant in the bgen file, along with an byte offset into the bgen file itself so that the variant can be quickly retrieved. The index file is a sqlite3 file, which means you can inspect (or alter) it using sqlite3.

For example you can get a list of variants:

sqlite3 myfile.bgen.bgi "SELECT * FROM Variant"

(But see another way to do this below.)

It's also easy to load this data into programming languages - for example using pandas.read_sql in python, or in R:

library( RSQLite )
connection = dbConnect( RSQLite::SQLite(), "myfile.bgen.bgi" )
index = dbGetQuery( connection, "SELECT * FROM Variant" )

The full index file format is described on the wiki page The bgenix index file format.

Note: For performance reasons bgenix uses "WITHOUT ROWID" tables to implement the index. This means you need sqlite3 version 3.8.2 or greater to inspect the file - otherwise you'll get a message like "Error: malformed database schema"". As an alternative, you can use the -with-rowid option when building the index, which will then be compatible with earlier versions of sqlite3.

Listing variants

One of the first things you might want to do after indexing is get a list of variants in the file (or perhaps in a particular region). If the -list option is given, bgenix will do this, returing a list of variants. For example, using the file complex.bgen included in the example/ folder in the bgen repository, the command:

bgenix -g example/complex.bgen -list

produces this output:

 # bgenix: started 2016-07-06 09:01:15
alternate_ids	rsid	chromosome	position	number_of_alleles	first_allele
.	V1	01	1	2	A	G
V2.1	V2	01	2	2	A	G
.	V3	01	3	2	A	G
.	M4	01	4	3	A	G,T
.	M5	01	5	2	A	G
.	M6	01	7	4	A	G,GT,GTT
.	M7	01	7	6	A	G,GT,GTT,GTTT,GTTTT
.	M8	01	8	7	A	G,GT,GTT,GTTT,GTTTT,GTTTTT
.	M9	01	9	8	A	G,GT,GTT,GTTT,GTTTT,GTTTTT,GTTTTTT
.	M10	01	10	2	A	G
 # bgenix: success, total 10 variants.

(We describe below another way to list variants - by querying the index directly using sqlite3.)

Outputting genotypes

By default genotype data is output in the same format as in the input. This makes bgenix fast as it doesn't have to do any processing of the data.

E.g. in the command

bgenix -f myfile.bgen -incl-range 1:0-10

bgenix simply outputs an appropriate BGEN header, and then copies bytes from the input file to the output.

Transcoding

bgenix can also transcode data to two other formats:

Currently BGEN v1.1 output is only supported when the input data is in a specific format, namely BGEN with 'layout=2' blocks, 8-bit probability encoding, and all samples are diploid.

Querying variants

bgenix can restrict the output based on chromosome and position, or by variant identifier. In general, a variant will be output if it satisfies at least one of the inclusion (-incl-*) options, and does not satisfy any of the exclusion (-excl-*) options passed on the command-line.

The relavant options are:

Syntax Notes Example(s)
-incl-rsids Only output variants that have one of the given rsid(s). -incl-rsids rs8176719 myids.txt
-excl-rsids Only output variants that don't have the given rsid(s) -excl-rsids rs8176719 myids.txt
-incl-range Only output variants in one of the given ranges. -incl-range 11:0-1000, -incl-range ranges.txt
-excl-range Only output variants outside the given ranges. -excl-range 11:0-1000, -excl-range ranges.txt

For convenience the above options below take either values directly on the command line, or filenames. If the argument is a valid filename the file will be opened and values (IDs or ranges) read from it. bgenix expects these files to contain a whitespace-separated list of IDs or chromosome ranges.

Ranges can either be specified by


Acknowledgements

bgenix is motivated by and in some respects designed to mimic tabix, the htslib tool for indexing tab-delimited files. The key functionality of bgenix is all implemented using the sqlite3 library. Thank you, sqlite authors!