QCTOOL

HPTEST
Login

HPTEST

hptest is a program for testing correlations between two sets of genotypes - for example, between genotypes of a host and of parasites infecting those hosts. To do this, it implements a binomial logistic regression model:

logodds( outcome genotype ) = beta * ( predictor genotype, covariates...)
in which beta is a vector of parameters that is to be optimised over (the regression coefficients). HPTEST finds the parameters maximising the corresponding likelihood. By default this includes an additional regularising prior.

For example, the following command:

hptest -outcome parasite.vcf -predictor host.vcf -s samples.sample -o test.csv
regresses each parasite variant (from parasite.vcf) on each host variant (from host.vcf). The genotype counts, maximum posterior estimates and other quantities are output to the output file (test.csv). Additional options allow you to add covariates (-covariates), to adjust the prior being used (-prior), or to filter the list of samples or variants included (e.g. -incl-samples, -excl-samples-where, or -incl-outcome-range - see the output of hptest -help for more).

IMPORTANT WARNING

HPTEST is still at a very early stage of development - if you use it, please treat with caution, and carefully sanity check any results.

Building HPTEST

HPTEST is currently included as part of the QCTOOL package:

http://www.bitbucket.org/gavinband/qctool/

To get HPTEST you need the development branch (named default). Download and compile QCTOOL according to the compilation instructions; if all goes well then the compiled application, called hptest-<version> will appear in build/release/. Running

./build/release/hptest-<version> -help

will print a list of options.

Source code for HPTEST can be found here.

Running HPTEST

In the most basic form hptest is run like this:

hptest -outcome outcome.vcf -predictor predictor.vcf -s samples.sample -o test.csv

where outcome.vcf and host.vcf contain genotype calls, and samples.sample contains sample information. All three files must represent the same set of individuals and they must be in the same order in each file. The -reorder option to QCTOOL can help to set this up.

For each predictor and each outcome variant this does the following:

HPTEST prints a wealth of information to the output file (and/or to a log file specified with the -log option). A basic run looks like this:

$ hptest_v2.1-dev -outcome parasite.bgen -predictor host.bgen -s samples.sample -o test.csv

Welcome to hptest
(version: 2.1-dev, revision 0aec315)

(C) 2009-2017 University of Oxford

Loaded data for 2000 samples.
    Host data:
                     (    100 snps)  "host.bgen (bgen v1.2; 2000 named samples; zlib compression)"
                     (total 100 snps in 1 sources).
  Number of samples: 2000

Parasite data:
                     (    100 snps)  "parasite.bgen (bgen v1.2; 2000 named samples; zlib compression)"
                     (total 100 snps in 1 sources).
  Number of samples: 2000

Models are:
- Model 1 ("null"): BinomialLogistic( 2000 samples ): (outcome=1) ~ baseline/outcome=1
- Model 2 ("gen"): BinomialLogistic( 2000 samples ): (outcome=1) ~ baseline/outcome=1 + add/outcome=1 + overdominance/outcome=1
  with priors:
            add/outcome=1 ~ logF( 2, 2 ).
  overdominance/outcome=1 ~ logF( 4, 4 ).

Model design for first test:
         outcome   baseline   add overdominance 
    0   2 0          1     1             1
    1   2 0          1     1             1
    2   1 1          1     0             0
    3   2 0          1     0             0
    4   2 0          1     0             0
    5   2 0          1     0             0
    6   1 1 ~        1     0             0
                .          .     .             .
                .          .     .             .
                .          .     .             .
 1996   2 0          1     1             1
 1997   2 0          1     0             0
 1998   2 0          1     0             0
 1999   2 0          1     2             0
.

Testing                                                     : [***                           ] (1009/10000,4.0s,251.5/s)

and, when it's finished, the output looks like this:

$ head -n 20 test.tsv
# Analysis: "qctool analysis"
#  started: 2018-09-17 12:05:29
# 
# Analysis properties:
#   -o test.tsv (user-supplied)
#   -outcome parasite.bgen (user-supplied)
#   -predictor host.bgen (user-supplied)
#   -s samples.sample (user-supplied)
# 
predictor:alternate_ids	predictor:rsid	predictor:chromosome	predictor:position	predictor:alleleA	predictor:alleleB	outcome:alternate_ids	outcome:rsid	outcome:chromosome	outcome:position	outcome:alleleA	outcome:alleleB	N	missing	outcome=0	outcome=1	outcome=2	predictor=0	predictor=1	predictor=2	minimum_outcome_count	minimum_predictor_count	null:converged	null:iterations	null:fit_time	null:ll	null:degrees_of_freedom	gen:converged	gen:iterations	gen:fit_time	gen:ll	gen:degrees_of_freedom	gen:beta_1:add/outcome=1	gen:beta_2:overdominance/outcome=1	gen:sd_1	gen:sd_2gen:cov_1,2	gen:log10_bf	gen:prior_mode_1	gen:se_1	gen:pvalue_1	gen:prior_mode_2	gen:se_2	gen:pvalue_2	comment
H1	H1	H1	1	A	G	P1	P1	P1	1	C	T	2000	0	1807	188	5	1353	570	77	198	724	1	5	0.000	-788.162	0	0.000	-758.162	2	-1.59959	0.217883	0.581511	0.594956	-0.318301	12.9617	0.5	0.453938	0.00042538	0	0.449248	0.627679	NA
H1	H1	H1	1	A	G	P2	P2	P1	2	C	T	2000	0	1501	464	35	1353	570	77	534	724	1	4	0.000	-1571.95	0	0.010	-1570.7	2	-0.161529	0.046432	0.13313	0.159226	-0.0160364	-0.52065	0.5	0.13157	0.219556	0	0.156789	0.767121	NA
H1	H1	H1	1	A	G	P3	P3	P1	3	C	T	2000	0	1348	595	57	1353	570	77	709	724	1	4	0.000	-1868.79	0	0.000	-1868.13	2	0.109628	-0.0552761	0.101883	0.126939	-0.0090345	-0.949343	0.5	0.101217	0.278767	0	0.125751	0.660251	NA
H1	H1	H1	1	A	G	P4	P4	P1	4	C	T	2000	0	1635	350	15	1353	570	77	380	724	1	5	0.000	-1255.82	0	0.000	-1254.19	2	-0.124965	0.298272	0.156286	0.181825	-0.0219801	-0.236598	0.5	0.153804	0.416506	0	0.178187	0.0941459	NA
H1	H1	H1	1	A	G	P5	P5	P1	5	C	T	2000	0	1633	337	30	1353	570	77	397	724	1	5	0.000	-1293.73	0	0.010	-1293.46	2	-0.0824695	0.124881	0.145987	0.174117	-0.0190367	-0.859796	0.5	0.14396	0.566736	0	0.170939	0.465048	NA
H1	H1	H1	1	A	G	P6	P6	P1	6	C	T	2000	0	1650	337	13	1353	570	77	363	724	1	5	0.000	-1217.08	0	0.000	-1215.62	2	0.195229	-0.26446	0.123722	0.161439	-0.0129166	-0.386484	0.5	0.122585	0.111251	0	0.159099	0.0964656	NA
H1	H1	H1	1	A	G	P7	P7	P1	7	C	T	2000	0	1476	490	34	1353	570	77	558	724	1	4	0.000	-1616.22	0	0.000	-1616	2	-0.0763321	0.0472307	0.1242	0.150044	-0.0137905	-1.00981	0.5	0.12295	0.534706	0	0.148026	0.749673	NA
H1	H1	H1	1	A	G	P8	P8	P1	8	C	T	2000	0	1947	53	0	1353	570	77	53	724	1	7	0.000	-281.806	0	0.000	-280.337	2	0.114724	0.375152	0.331261	0.381853	-0.0882789	0.393751	0.5	0.310174	0.711479	0	0.348422	0.281606	NA
H1	H1	H1	1	A	G	P9	P9	P1	9	C	T	2000	0	808	860	332	1353	570	77	1524	724	1	3	0.000	-2658.21	0	0.000	-2657.64	2	0.089402	-0.080947	0.0835841	0.103132	-0.00617601	-1.17635	0.5	0.0832098	0.282636	0	0.10249	0.429642	NA
H1	H1	H1	1	A	G	P10	P10	P1	10	C	T	2000	0	1902	95	3	1353	570	77	101	724	1	6	0.000	-471.286	0	0.010	-471.271	2	0.00663179	-0.0431763	0.250728	0.305619	-0.0533158	-0.454619	0.5	0.24093	0.97804	0	0.288551	0.881056	NA

A brief description of output columns is:

column description
predictor:alternate_ids Predictor variant id(s)
predictor:rsid Primary predictor variant id
predictor:chromosome Predictor chromosome
predictor:position Predictor base pair position
predictor:alleleA predictor 1st allele
predictor:alleleB predictor 2nd allele (effect estimates are for this allele)
outcome:alternate_ids Outcome variant id(s)
outcome:rsid Primary outcome variant id
outcome:chromosome outcome chromosome
outcome:position outcome base pair position
outcome:alleleA outcome 1st allele
outcome:alleleB outcome 2nd allele (effect estimates are for this allele)
N Total number of samples
missing Number of samples with missing data
outcome=0 Count of outcome=0 genotypes
outcome=1 Count of outcome=1 genotypes
outcome=2 Count of outcome=2 genotypes
predictor=0 Count of outcome=0 genotypes
predictor=1 Count of outcome=1 genotypes
predictor=2 Count of outcome=2 genotypes
minimumoutcomecount Minimum count of outcome alleleA or alleleB
minimumpredictorcount Minimum count of predictor alleleA or alleleB
null:converged Did the null model (i.e. the model with baseline and covariates but no predictor genotypes) converge? (1=yes, 0 = no)
null:iterations How many iterations the null model took to converge
null:fit_time How much time the null model took to converge
null:ll Null model log-likelihood
null:degreesoffreedom number of paramameters in null model
gen:converged Did the full model (here named 'gen') converge?
gen:iterations How many iterations?
gen:fit_time How long did it take?
gen:ll Full model log-likelihood
gen:degreesoffreedom Number of parameters in full model
gen:beta_1:add/outcome=1 Estimate of the first parameter (beta_1 = additive effect size) in full model
gen:beta_2:overdominance/outcome=1 Estimate of the second parameter (beta_2 = overdominance effect size) in full model
gen:sd_1 Estimated posterior standard deviation for 1st parameter
gen:sd_2 Estimated posterior standard deviation for 2nd parameter
gen:cov_1,2 Estimated posterior covariance between 1st and 2nd parameter
gen:log10_bf log10 Bayes factor for full model against null model
gen:priormode1 Mode of the prior on parameter 1 (usually 0)
gen:se_1 Approximate frequentist standard error for beta_1
gen:pvalue_1 Approximate P-value for beta_1
gen:priormode2 Approximate frequentist standard error for beta_2
gen:se_2 Approximate frequentist standard error for beta_2
gen:pvalue_2 Approximate marginal P-value for beta_2
comment Comments or notifications of errors. Will be NA if nothing unusual occurred.