DEploid¶
A software that deconvolutes mixed genomes with unknown proportions.
Synopsis¶
dEploid [ -vcf file ] [ -sample string ] [ -plafFromVcf ] [ -panel file ] … [ -o string ]
Example:¶
$ ./dEploid -vcf data/testData/PG0390-C.test.vcf.gz \
-sample PG0390-C -plafFromVcf \
-exclude data/testData/labStrains.test.exclude.txt.gz \
-panel data/testData/labStrains.test.panel.txt.gz \
-o test_run \
-vcfOut -z \
-best
For help and more examples, type dEploid -help
.
Description and workflow¶
dEploid
is designed for deconvoluting mixed genomes with unknown proportions. Traditional ‘phasing’ programs are limited to diploid organisms. Our method modifies Li and Stephen’s [Li2003] algorithm with Markov chain Monte Carlo (MCMC) approaches, and builds a generic framework that allows haplotype searches in a multiple infection setting.
dEploid
is primarily developed as part of the Pf3k project, from which this documentation will take examples from for demonstration. The Pf3k project is a global collaboration using the latest sequencing technologies to provide a high-resolution view of natural variation in the malaria parasite Plasmodium falciparum. Parasite DNA are extracted from patient blood sample, which often contains more than one parasite strain, with unknown proportions. DEploid
is used for deconvoluting mixed haplotypes, and reporting the mixture proportions from each sample.

DEploid-Bestpractice workflow¶
Enabled by the flag -best, Similar to the first version, the main workflow consist with two steps:
- Use dEploid on clonal samples, and build a reference panel.
- For the mixed samples, infer both proportions and haplotypes at the same time with the reference panel provided.
The following key actions are now hidden internally in step 2:
- Use high-quality sites and reference panel to learn the number of strains (k) in malaria mixed infection
- Fix k, infer strain proportion and within sample relatedness
- Fix k and proportion, use a lasso algorithm to select the effective reference panel.
Most of the parameters (set to default) have now been tuned (see DEploid-Bestpractices-github-actions for details).
DEploid-LASSO workflow (third version)¶
This version aims to improve the haplotype inference with reference panel of any size. It was quickly converted into the Best Practice workflow.
DEploid-IBD workflow (second version)¶
The previous workflow struggle for samples with high within-host relatedness. Hence we introduced new procedures and workflow:
- Use dEploid on clonal samples, and build a reference panel.
- Use the IBD method to infer the proportions without a reference panel.
- Tune the haplotype with the given reference panel with fixed strain proportions

Black boxes indicate the key deconvolution steps when our program DEploid is used. Boxes in blue and purple represent the input and output respectively at each step. Steps Deconvolution with IBD and Deconvolution with a reference panel can be combined by using the flag -ibd.
Caveat: need to run the program multiple times, because some models are harder than the others.
DEploid workflow (first version)¶
Our main workflow consist with two steps:
- Use dEploid on clonal samples, and build a reference panel.
- For the mixed samples, infer both proportions and haplotypes at the same time with the reference panel provided.
Caveat: need to run the program multiple times, because some models are harder than the others.
[Li2003] | Li, N. and M. Stephens (2003). Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165(4), 2213–2233. |
Installation¶
dEploid
is written in C++. We use autotools for compiling and installing.
Development Version From GitHub¶
You can also install dEploid
directly from the git repository (tar, zip). Here, you will need autoconf
, check whether this is already installed by running:
$ which autoconf
On Debian/Ubuntu based systems:
$ apt-get install build-essential autoconf autoconf-archive libcppunit-dev
On Mac OS:
$ port install automake autoconf autoconf-archive cppunit
Afterwards you can clone the code from the github repository,
$ git clone git@github.com:mcveanlab/DEploid.git
$ cd DEploid
and build the binary using
$ ./bootstrap
$ make
or install with make install
. This will also enable the manual and Tab auto-completion.
How it works?¶
Program parameters and options¶
Mostly used¶
- -best
- To enable the BestPractice.
- -vcf [file]
- File path of the isolate vcf. Assume all variants are
PASS
in theQUAL
column, the VCF file also reqires theAD
field. - -sample [sample ID]
- Sample ID.
- -plafFromVcf
- Extract allele frequency information for the VCF file.
Note
If sample is unspecified, dEploid
only take the first sample in the VCF file.
DEploid
DO NOT handle multi-allelic variants, nor indels. The FILTER
column will not be used.
- -panel [file]
- File path of the reference panel (tab-delimited plain text file), for example
CHROM | POS | 3D7 | Dd2 | Hb3 | 7G8 |
---|---|---|---|---|---|
Pf3D7_01_v3 | 93157 | 0 | 0 | 0 | 1 |
Pf3D7_01_v3 | 94422 | 0 | 0 | 0 | 1 |
Pf3D7_01_v3 | 94459 | 0 | 0 | 0 | 1 |
Pf3D7_01_v3 | 94487 | 0 | 0 | 0 | 1 |
- -exclude [file]
- File path of sites to be excluded (tab-delimited plain text file).
- -o [string]
- Specify the file name prefix of the output.
- -k [int]
- Number of strain (default value 5).
- -seed [int]
- Random seed.
- -nSample [int]
- Number of MCMC samples (default value 800).
- -rate [int]
- MCMC sample rate (default value 5).
- -burn [float]
- MCMC burn rate (default value 0.5).
- -ibd
- Use IBD segment to infer the proportion, then infer the haplotype (see Pf3k work-flow for more details).
- -painting [file]
- Paint the posterior probability of the given haplotypes.
- -inbreeding
- Calculate the inbreeding probabilities.
- -initialP [float …]
- Initialize proportions.
- -ibdPainting
- IBD painting, compute posterior probabilities of IBD configurations of given strain proportions. This option must be used with flags -initialP.
- -h , -help
- Help.
- -v , -version
- DEploid version.
- -vcfOut
- Save final halpotypes into a VCF file.
You may also try¶
- -ref [file]
- File path of reference allele count (tab-delimited plain text file).
Note
In early dEploid
versions (prior to v0.2-release), allele counts extracted from the vcf file are placed in two files, and parsed by flags -ref [file] and -alt [file]. Tab-delimited plain text for input. First and second columns record chromosome and position labels respectively. Third columns records the reference allele count or alternative allele count. For example,
CHROM | POS | PG0390.C |
---|---|---|
Pf3D7_01_v3 | 93157 | 85 |
Pf3D7_01_v3 | 94422 | 77 |
Pf3D7_01_v3 | 94459 | 90 |
Pf3D7_01_v3 | 94487 | 79 |
- -alt [file]
- File path of alternative allele count (tab-delimited plain text file).
CHROM | POS | PG0390.C |
---|---|---|
Pf3D7_01_v3 | 93157 | 0 |
Pf3D7_01_v3 | 94422 | 0 |
Pf3D7_01_v3 | 94459 | 0 |
Pf3D7_01_v3 | 94487 | 0 |
Warning
Flags -ref and -alt should not be used with -vcf.
- -plaf [file]
- File path of population level allele frequencies (tab-delimited plain text file), for example
CHROM | POS | PLAF |
---|---|---|
Pf3D7_01_v3 | 93157 | 0.0190612159917058 |
Pf3D7_01_v3 | 94422 | 0.135502358766423 |
Pf3D7_01_v3 | 94459 | 0.156294363760064 |
Pf3D7_01_v3 | 94487 | 0.143439298925837 |
- -noPanel
- Use population level allele frequency as prior.
Warning
Flags -panel and -noPanel should not be used together.
- -forbidUpdateProp
- Forbid MCMC moves to update proportions.
- -forbidUpdateSingle
- Forbid MCMC moves to update single haplotype.
- -forbidUpdatePair
- Forbid MCMC moves to update pair haplotypes.
- -exportPostProb
- Save the posterior probabilities of the final iteration of all strains.
- -miss [float]
- Miss copying probability.
- -recomb [float]
- Constant recombination probability.
- -p [int]
- Output precision (default value 8).
- -c [float]
- Specify scaling parameter c, which reflects how much data is available (default value 100.0).
- -G [float]
- Specify scaling parameter for genetic map (default value of 20.0).
- -sigma [float]
- Specify the variance parameter for proportion estimation (default value of 5.0).
- -ibdSigma [flat]
- Specify the variance parameter for proportion estimation when IBD method is used (default value of 20.0).
- -initialHap [file]
- Specify initial haplotypes of deconvolution.
R utilities¶
Flags -vcf, -plaf, -ref, -alt, -exclude, -o usage are the same as DEploid
. Additionally, we have the following flags:
- -dEprefix [string]
- Prefix of
DEploid
output. - -inbreeding
- Painting haplotype inbreeding posterior probabilities.
- -ADFieldIndex
- The index of
AD
field (2 by default). - -filter.threshold [float]
- Filtering threshold (0.995 by default).
- -filter.window [int]
- Filtering window (10 by default).
- Produce figures in pdf rather than png.
- -ibd
- Produce figures for IBD process.
- -ring
- Produce circular genome plots for WSAF and haplotype posterior painting probabilities.
Example of data exploration¶
Use our data exploration tools to investigate the data.
$ utilities/dataExplore.r -vcf data/exampleData/PG0390-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-o PG0390-C

- Figure on the top plot total allele counts across all markers. We use the threshold (red line) to identify markers with extremely high allele counts. Red crosses indicate markers that are filtered out.
- Figure on the left plots the alternative allele count against the reference allele count. As P. falciparum genomes are haploid, in clonal samples, one woule expect to see either alternative or reference allele at any sites. Heterozygous sites are indications of mixed infection.
- Figure in the middle is the histogram of the allele frequency within sample. Note that we exclude markers with WSAF strictly equal to 0s and 1s in the histogram.
- Figure on the right show allele frequency within sample, compare against the population average.
Making sense of the output¶
Output files¶
dEploid
outputs text files with user-specified prefix with flag -o.
Log file records dEploid
version, input file paths, parameter used and proportion estimates at the final iteration.
Log likelihood of the MCMC chain.
MCMC updates of the proportion estimates.
Haplotypes at the final iteration in plain text file.
When flag -vcfOut
is turned on, haplotypes are saved at the final iteration in VCF format.
When flag -exportPostProb
is turned on, posterior probabilities of the final iteration of strain [i].
DEploid-IBD¶
When “flag” -ibd
is used. ‘DEploid’ executes first learns the number of strain and their proportions with an identity by descent model (‘DEploid-IBD’). Then it fixes the number of strains and proportions and train the haplotypes, and train the haplotypes using the original DEploid algorithm (‘DEploid-classic’). The staged output are labelled with “.ibd” and “.classic” respectively, and followed by the prefix.
DEploid-BEST¶
When “flag” -best
is used. ‘DEploid-BEST’ executes the deconvolution algorithms in an optimised sequence to best report the number of strains, proportions and haplotypes. The program (‘DEploid-Lasso’) learns the number of strain with optimised reference panel; “.chooseK” is appended to the prefix for these output (NOTE: likelihood is not tracked in this case). It (‘DEploid-IBD’) then fixes the number of strains and tune the strain proportions with an identity by descent model; “.ibd” is appended to the prefix for these output. Finally, the program (‘DEploid-Lasso’) fixes the number of strains and proportions, and uses the optimised reference panel again to train and report the haplotypes; “.final” is appended to the prefix for these output. When -vcfOut
is applied, this will only be the final haplotypes.
Example of output interpretation¶
Example 1. Standard deconvolution output¶
$ ./dEploid -vcf data/exampleData/PG0390-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-noPanel -o PG0390-CNopanel -seed 1
$ utilities/interpretDEploid.r -vcf data/exampleData/PG0390-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-dEprefix PG0390-CNopanel \
-o PG0390-CNopanel -ring
interpretDEploidFigure.1
The top three figures are the same as figures show in :ref:data example <sec-eg>
, with a small addition of inferred WSAF marked in blue, in the top right figure.
- The bottom left figure show the relative proportion change history of the MCMC chain.
- The middle figure show the correlation between the expected and observed allele frequency in sample.
- The right figure shows changes in MCMC likelihood .
interpretDEploidFigure.2
This panel figure shows all allele frequencies within sample across all 14 chromosomes. Expected and observed WSAF are marked in blue and red respectively.
Example 2. Haplotype painting from a given panel¶
dEploid
can take its output haplotypes, and calculate the posterior probability of each deconvoluted strain with the reference panel. In this example, the reference panel includes four lab strains: 3D7 (red), Dd2 (dark orange), HB3 (orange) and 7G8 (yellow).
$ ./dEploid -vcf data/exampleData/PG0390-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-panel data/exampleData/labStrains.eg.panel.txt \
-o PG0390-CPanel -seed 1 -k 3
$ ./dEploid -vcf data/exampleData/PG0390-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-panel data/exampleData/labStrains.eg.panel.txt \
-o PG0390-CPanel \
-painting PG0390-CPanel.hap \
-initialP 0.8 0 0.2 -k 3
$ utilities/interpretDEploid.r -vcf data/exampleData/PG0390-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-dEprefix PG0390-CPanel \
-o PG0390-CPanel -ring
PG0390fwdBwdRing
Example 3. Deconvolution followed by IBD painting¶
In addition to lab mixed samples, here we show example of dEploid
deconvolute field sample PD0577-C.
PD0577inbreeding
Frequently asked questions¶
Data filtering¶
Data filtering is an important step for deconvolution.
utilities/dataExplore.r -vcf data/exampleData/PG0415-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-o PG0415-C
PG0415_data
We observe a small number of heterozygous sites with high coverage (marked as crosses above), which can potentially mislead our model to over-fit the data with additional strains.
./dEploid -vcf data/exampleData/PG0415-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-noPanel -o PG0415-CNopanel -seed 2
initialProp=$( cat PG0415-CNopanel.prop | tail -1 | sed -e "s/\t/ /g" )
./dEploid -vcf data/exampleData/PG0415-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-panel data/exampleData/labStrains.eg.panel.txt \
-o PG0415-CNopanel \
-initialP ${initialProp} \
-painting PG0415-CNopanel.hap
utilities/interpretDEploid.r -vcf data/exampleData/PG0415-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-dEprefix PG0415-CNopanel \
-o PG0415-CNopanel \
-ring
PG0415_noFilter
The data exploration utility utilities/dataExplore.r
identifies a list of potential outliers. After filtering, we correctly identify the number of strains and proportion.
./dEploid -vcf data/exampleData/PG0415-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-noPanel -o PG0415-CNopanel.filtered -seed 2 \
-exclude PG0415-CPotentialOutliers.txt
initialProp=$( cat PG0415-CNopanel.filtered.prop | tail -1 | sed -e "s/\t/ /g" )
./dEploid -vcf data/exampleData/PG0415-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-panel data/exampleData/labStrains.eg.panel.txt \
-exclude PG0415-CPotentialOutliers.txt \
-o PG0415-CNopanel.filtered \
-initialP ${initialProp} \
-painting PG0415-CNopanel.filtered.hap
utilities/interpretDEploid.r -vcf data/exampleData/PG0415-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-dEprefix PG0415-CNopanel.filtered \
-o PG0415-CNopanel.filtered \
-exclude PG0415-CPotentialOutliers.txt \
-ring
PG0415_filtered
Over-fitting¶
For extremely unbalanced samples, DEploid tends to over-fit the minor strain with an additional component. We recommend adjusting the value of sigma for the prior to improve inference. In this example PG0400-C is a mixture of lab strains 7G8 and HB3 with mixing proportions of 95/5%. The parameter sigma takes value of 5 by default, which over fits the minor strain (see example 1), and with proportions 0.0276862, 0.945509 and 0.0267463. Example 1 paints the deconvolved strains (proportions in increasing order towards the centre) to the reference panel. We resolve the over-fitting issue by rerun this example, and set sigma with value of 10, it correctly infer the proportions as 0.0313755 and 0.968599 (see example 2). Note that the radius are not in scale with strain proportions.
./dEploid -vcf data/exampleData/PG0400-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-panel data/exampleData/labStrains.eg.panel.txt \
-o PG0400-Csigma5 -seed 2 -sigma 5 \
-exclude exclude.txt
initialProp=$( cat PG0400-Csigma5.prop | tail -1 | sed -e "s/\t/ /g" )
./dEploid -vcf data/exampleData/PG0400-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-panel data/exampleData/labStrains.eg.panel.txt \
-exclude exclude.txt \
-o PG0400-Csigma5 \
-initialP ${initialProp} \
-painting PG0400-Csigma5.hap
utilities/interpretDEploid.r -vcf data/exampleData/PG0400-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-dEprefix PG0400-Csigma5 \
-o PG0400-Csigma5 \
-exclude exclude.txt \
-reverseRing -transformP
./dEploid -vcf data/exampleData/PG0400-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-panel data/exampleData/labStrains.eg.panel.txt \
-o PG0400-Csigma10 -seed 2 -sigma 10\
-exclude exclude.txt
initialProp=$( cat PG0400-Csigma10.prop | tail -1 | sed -e "s/\t/ /g" )
./dEploid -vcf data/exampleData/PG0400-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-panel data/exampleData/labStrains.eg.panel.txt \
-exclude exclude.txt \
-o PG0400-Csigma10 \
-initialP ${initialProp} \
-painting PG0400-Csigma10.hap
utilities/interpretDEploid.r -vcf data/exampleData/PG0400-C.eg.vcf.gz \
-plaf data/exampleData/labStrains.eg.PLAF.txt \
-dEprefix PG0400-Csigma10 \
-o PG0400-Csigma10 \
-exclude exclude.txt \
-reverseRing -transformP
Benchmark¶
Please refer to our paper Zhu et.al (2017) section 3 Validation and performance for benchmarking inference results on number of strains, proportions and haplotype quality.
For the enhanced version – DEploid-IBD, we compared our results against Zhu et.al (2017), and conducted more experiments and validations Zhu et.al (2019).
Reporting Bugs¶
If you encounter any problem when using dEploid
, please file a short bug report by using the issue tracker
on GitHub or email joe.zhu (at) well.ox.ac.uk.
Please include the output of dEploid -v
and the platform you are using dEploid
on in the report. If the problem occurs while executing dEploid
, please also include the command you are using and the random seed.
Thank you!
Citing DEploid¶
If you use dEploid
with the flag -ibd
, please cite the following paper:
Zhu, J. S., J. A. Hendry, J. Almagro-Garcia, R. D. Pearson, R. Amato, A. Miles, D. J. Weiss, T. C. D. Lucas, M. Nguyen, P. W. Gething, D. Kwiatkowski, G. McVean, and for the Pf3k Project. (2018) The origins and relatedness structure of mixed infections vary with local prevalence of P. falciparum malaria. biorxiv, doi: https://doi.org/10.1101/387266.
Bibtex record::
@article {Zhu387266,
author = {Zhu, Sha Joe and Hendry, Jason A. and Almagro-Garcia, Jacob and Pearson, Richard D. and Amato, Roberto and Miles, Alistair and Weiss, Daniel J. and Lucas, Tim C.D. and Nguyen, Michele and Gething, Peter W. and Kwiatkowski, Dominic and McVean, Gil and ,},
title = {The origins and relatedness structure of mixed infections vary with local prevalence of P. falciparum malaria},
year = {2018},
doi = {10.1101/387266},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2018/08/09/387266},
eprint = {https://www.biorxiv.org/content/early/2018/08/09/387266.full.pdf},
journal = {bioRxiv}
}
If you use dEploid
in your work, please cite the program:
Zhu, J. S. J. A. Garcia G. McVean. (2017) Deconvolution of multiple infections in Plasmodium falciparum from high throughput sequencing data. Bioinformatics btx530. doi: https://doi.org/10.1093/bioinformatics/btx530.
Bibtex record::
@article {Zhubtx530,
author = {Zhu, Sha Joe and Almagro-Garcia, Jacob and McVean, Gil},
title = {Deconvolution of multiple infections in {{\em Plasmodium falciparum}} from high throughput sequencing data},
year = {2017},
doi = {10.1093/bioinformatics/btx530},
URL = {https://doi.org/10.1093/bioinformatics/btx530},
journal = {Bioinformatics}
}