Quickstart
In this minimal example we show how to use gctree commands to rank parsimony trees from PHYLIP’s dnapars program. See the CLI and API docs for complete documentation of command options, included many not used here.
The data in this example were published in Tas et al. 2016. Visualizing Antibody Affinity Maturation in Germinal Centers. Science 351 (6277) and shown in Fig. 4 (lymph node 2, germinal center 1).
Input data
We start with a fasta alignment file.
The file ../example/150228_Clone_3-8.fasta
contains heavy chain V gene sequences from
germinal B cells sorted from a brainbow mouse using multicolor fate mapping.
$ tail -30 ../example/150228_Clone_3-8.fasta
>VIBM1S4HBJ
ggacctagcctcgtgaaaccttctcagactctgtccctcacctgttctgtcactggcgac
tccatcaccagtggttactggaactggatccggaagttcccagggaatagacttgagtac
atggggtacataagcttcagtggtagcacttactacaatccatctctcaaaagtcgaatc
tccatcactcgagacacatccaagaaccagtactacctgcagttgaattctgtgactact
gaggacacagccacatattactgt
>VIBM1S4HKJ
ggacctagcctcgtgaaaccttctcagactctgtccctcacctgttctgtcactggcgac
tccatcaccagtggttactggaactggatccggaagttcccagggaatagacttgagtac
atggggtacataagcttcagtggtagcacttactacaatccatctctcaaaagtcgaatc
tccatcactcgagacacatccaagaaccagtactacctgcagttgaattctgtgactact
gaggacacagccacatattactgt
>VIBM1S4HDJ
ggacctagcctcgtgaaaccttctcagactctgtccctcacctgttctgtcactggcgac
tccatcaccagtggttactggaactggatccggaagttcccagggaatagacttgagtac
atggggtacataagcttcagtggtagcacttactacaatccatctctcaaaagtcgaatc
tccatcactcgagacacatccaagaaccagtactacctgcagttgaattctgtgactact
gaggacacagccacatattactgt
>VIBM1S4HCJ
ggacctagcctcgtgaaaccttctcagactctgtccctcacctgttctgtcactggcgac
tccatcaccagtggttactggaactggatccggaagttcccagggaatagacttgagtac
atggggtacataagcttcagtggtagcacttactacaatccatctctcaaaagtcgaatc
tccatcactcgagacacatccaagaaccagtactacctgcagttgaattctgtgactact
gaggacacagccacatattactgt
>GL
ggacctagcctcgtgaaaccttctcagactctgtccctcacctgttctgtcactggcgac
tccatcaccagtggttactggaactggatccggaaattcccagggaataaacttgagtac
atggggtacataagctacagtggtagcacttactacaatccatctctcaaaagtcgaatc
tccatcactcgagacacatccaagaaccagtactacctgcagttgaattctgtgactact
gaggacacagccacatattactgt
In this file the sequence with id GL
is the naive germline sequence, and represents the root of the tree.
It does not refer to an observed sequence, but is included to outgroup root the tree!
This input fasta file should also contain all observed sequences with duplication (some of which may be identical to the root/outgroup sequence).
Deduplication and sequence abundances
First we deduplicate the sequences and convert to phylip alignment format, and also determine sequence abundances.
The deduplicate
command writes the phylip alignment of unique sequences to stdout (which we redirect to a file here).
The argument --root
indicates the root id.
The flag --id_abundances
can be used to indicate that integer sequence ids should be interepreted as abundances.
The argument --abundance_file
indicates that sequence abundance information should be written to the specified csv
file.
The argument --idmapfile
allows us to specify a filename for the output
file containing a mapping of new, unique sequence IDs to original sequence IDs
from the fasta file.
$ deduplicate ../example/150228_Clone_3-8.fasta --root GL --abundance_file abundances.csv --idmapfile idmap.txt > deduplicated.phylip
We now have files deduplicated.phylip
and abundances.csv
:
$ head deduplicated.phylip
43 264
GL GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq1 GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq2 GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq3 GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq4 GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq5 GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq6 GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq7 GGACCTAGCC TCGTGAAATC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq8 GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
$ head abundances.csv
GL,0
seq1,1
seq2,1
seq3,1
seq4,1
seq5,1
seq6,1
seq7,1
seq8,1
seq9,3
Parsimony trees
We use PHYLIP’s dnapars
program (see install docs page) to generate a set of maximum parsimony trees for the deduplicated sequences.
PHYLIP is an interactive command line tool, so we will automatically generate a config file to feed it input, instead of using it interactively.
We generate a config file for dnapars
based on our deduplicated alignment using the mkconfig
command.
$ mkconfig deduplicated.phylip dnapars > dnapars.cfg
Run dnapars
using this config file, redirecting stdout to a log file.
$ dnapars < dnapars.cfg > dnapars.log
You now have two new files, outfile
and outtree
.
Note: if you want to rerun the above dnapars
command, you must delete these two files first!
gctree
We’re now ready to run gctree infer
to use abundance data (in abundances.csv
) to rank the equally parsimonious trees (in outfile
).
We can use the optional argument --frame
to indicate the coding frame of the sequence start, so that amino acid substitutions can be annotated on our trees.
Note
If working in a headless environment, gctree infer
must be run with a tool
like XVFB to provide an X server for rendering the output trees.
Prepend the gctree command with xvfb-run -a
.
Alternatively, we have had success setting the following environment variables instead of using XVFB:
export QT_QPA_PLATFORM=offscreen
,
export XDG_RUNTIME_DIR=/tmp/runtime-runner
.
You may also want to tell matplotlib to use a headless backend with export MPLBACKEND=agg
.
$ gctree infer outfile abundances.csv --root GL --frame 1 --verbose
number of trees with integer branch lengths: 703
Branching process parameters to be used for ranking: (0.49618320692429585, 0.36445206621349757)
Ranking trees to maximize LogBPLikelihood then minimize Alleles
Stats for optimal trees:
tree LogBPLikelihood Alleles
1 -78.00393661 48
Degenerate ranking criteria: trimmed history DAG contains 3 unique trees, with 3 unique collapsed topologies.
A representative of each topology will be sampled randomly for rendering.
...
A large number of output files with the basename gctree.out.*
are also created.
The SVG image file gctree.out.inference.abundance_rank.svg
shows a distribution of genotype abundances in the original data:
Then there are files gctree.out.inference.1.svg
and gctree.out.inference.1.nk
containing an SVG tree image and newick tree file. If more than one parsimony tree is optimal, then up to ten optimal trees will be sampled randomly, and the corresponding files will be numbered arbitrarily.
For example here is the top ranked tree gctree.out.inference.1.svg
:
You will also see Python pickle files gctree.out.inference.[1,2,...].p
containing a gctree.CollapsedTree
object for each optimal tree, which can be loaded and manipulated via the API (e.g. plotted in various ways using gctree.CollapsedTree.render()
).
All parsimony trees found by dnapars, as well as branching process parameters
are saved in the file gctree.out.inference.parsimony_forest.p
, containing a gctree.CollapsedForest
object.
This file may be manipulated using gctree infer
, instead of providing
a dnapars outfile
.
Note
- Although described below, using context likelihood, mutability parsimony, or isotype parsimony
as ranking criteria is experimental, and has not yet been shown in a careful validation to improve tree inference. Only the default branching process likelihood is recommended for tree ranking!
Criteria other than branching process likelihoods can be used to break ties
between trees. Providing arguments --isotype_mapfile
and
--idmapfile
will allow trees to be ranked by isotype parsimony. Providing
arguments --mutability
and --substitution
allows trees to be ranked
according to a context-sensitive mutation model. By default, trees are ranked
lexicographically, first maximizing likelihood, then minimizing isotype
parsimony, and finally maximizing a context-based poisson likelihood, if such information is provided.
Ranking priorities can be adjusted using the argument --ranking_coeffs
.
For example, to find the optimal tree according to a linear combination of likelihood, isotype parsimony, mutabilities, and alleles:
$ gctree infer gctree.out.inference.parsimony_forest.p --frame 1 --idmap idmap.txt --isotype_mapfile ../example/isotypemap.txt --mutability ../HS5F_Mutability.csv --substitution ../HS5F_Substitution.csv --ranking_coeffs 1 0.1 0 --outbase newranking --summarize_forest --tree_stats --verbose
/usr/share/miniconda/envs/gctree/lib/python3.9/site-packages/gctree/branching_processes.py:1282: UserWarning: Higher values for LogContextLikelihood are generally better, but the provided ranking coefficient is positive, so trees with lower values will be preferred.
warnings.warn(
Loading provided parsimony forest. If forest has fit parameters, parameter fitting will be skipped.
Branching process parameters to be used for ranking: (0.49618320692429585, 0.36445206621349757)
Ranking trees to minimize a linear combination of -1(LogBPLikelihood) + 1.0(Isotype Pars.) + 0.1(LogContextLikelihood)
Stats for optimal trees:
tree LogBPLikelihood Isotype Pars. LogContextLikel Alleles treescore
1 -78.00393661 28 -511.8011816 48 54.82381845
The files HS5F_Mutability.csv
and HS5F_Substitution.csv
are a context
sensitive mutation model which can be downloaded from the Shazam Project <https://bitbucket.org/kleinstein/shazam/src/master/data-raw/>_.
By default, only the files listed above will be generated, with the optional argument --outbase
specifying how the output files should be named.
For detailed information about each tree used for ranking, as well as a pairplot like the one below comparing the highest ranked tree to all other ranked trees,use the argument --tree_stats
.
Sometimes ranked trees are too numerous, and generating the output of --tree_stats
would require too many resources. For a summary of the collection of trees used for ranking, the argument --summarize_forest
is provided. Most importantly, this option summarizes how much less likely the top ranked tree is, compared to the most likely tree being ranked, for example to validate coefficients passed to --ranking_coeffs
.
$ cat newranking.forest_summary.log
Overall LogBPLikelihood range -92.15528744 to -78.00393661.
Among trees with max LogBPLikelihood of: -78.00393661
Isotype Pars. range: 28 to 28
LogContextLikelihood range: -511.8011815783986 to -509.3519837478428
Alleles range: 48 to 48
range: 0 to 0
Overall Isotype Pars. range 23 to 28.
Among trees with min Isotype Pars. of: 23
LogBPLikelihood range: -90.94295234 to -87.98896459
LogContextLikelihood range: -520.4288242353949 to -515.8186160410505
Alleles range: 51 to 52
range: 0 to 0
Overall LogContextLikelihood range -524.6994201844342 to -508.85478662354575.
Among trees with max LogContextLikelihood of: -508.85478662354575
LogBPLikelihood range: -80.46148748 to -80.46148748
Isotype Pars. range: 27 to 27
Alleles range: 49 to 49
range: 0 to 0
Overall Alleles range 48 to 53.
Among trees with min Alleles of: 48
LogBPLikelihood range: -82.14344157 to -78.00393661
Isotype Pars. range: 26 to 28
LogContextLikelihood range: -516.8335781401432 to -509.3519837478428
range: 0 to 0
Overall range 0 to 0.
Among trees with min of: 0
LogBPLikelihood range: -92.15528744 to -78.00393661
Isotype Pars. range: 23 to 28
LogContextLikelihood range: -524.6994201844342 to -508.85478662354575
Alleles range: 48 to 53
Highest ranked tree: loss from best value:
tree LogBPLikelihood Isotype Pars. LogContextLikel Alleles
1 0.0 5 -2.946394955 0
isotype
If we would like to add observed isotype data to trees output by gctree
inference, we can now do so.
In addition to the outputs from gctree, a file mapping original IDs of observed
sequences to their observed isotypes (like example/isotypemap.txt
) is required.
$ isotype idmap.txt ../example/isotypemap.txt --trees gctree.out.inference.1.p newranking.inference.1.p --out_directory isotyped
name original node count isotype parsimony new node count
gctree.out.inference.1.p 48 28.0 50
newranking.inference.1.p 48 28.0 50
...
Trees originally output by gctree are re-rendered with revised labels and node colors corresponding to observed or inferred isotypes. For example, here is the top ranked tree above, with isotypes added:
A note about node names
The names associated with unobserved nodes (for example, in trees rendered with –idlabel) are arbitrarily chosen, but are guaranteed to correspond bijectively with unobserved sequences. However, if gctree output contains multiple trees, two unobserved nodes which share the same name but occur in different output trees will not in general possess the same unobserved sequence.
A note about ambiguous sequence data
Gctree can handle input sequences which contain ambiguous bases, but handling of such sequences is experimental. That is, there may be issues with both the implementation and the underlying methods.
If you use gctree to build trees from ambiguous sequence data, it’s probably a good idea to follow at least these guidelines:
Make sure that all sites are unambiguous in at least one of the input sequences. If a site contains an ambiguous base in all input sequences, gctree will choose an arbitrary base for that site. The bases chosen for such sites should not be interpreted as meaningful in any way.
Understand that for any observed sequence disambiguated by gctree, another choice of disambiguation for that sequence may exist, which results in a better tree with respect to the ranking criteria. The only guarantee is that disambiguations of observed sequences are maximally parsimonious given the tree topology in which they appear.
Gctree can handle ambiguous input sequences because dnapars
can accept
ambiguous input sequences. Each tree output by dnapars
is then
disambiguated. It is possible that multiple observed sequences may be
disambiguated in identical ways, in which case their corresponding leaf nodes,
and abundances, are merged. The deduplicated sequence ids corresponding to each
node in the final tree output by gctree are retained in the original_ids
node
attribute.
Here’s a bit more discussion about how much the disambiguated observed sequences can be trusted:
Why not to trust leaf disambiguation:
As mentioned above, if the same site(s) contain ambiguous bases in all sequences, the disambiguation is completely arbitrary at those sites, but could be mistakenly interpreted as informed. Also alluded to above, if multiple possible disambiguated leaf sequences exist for a particular dnapars topology, one will be chosen arbitrarily, even though another may be more plausible IRL, or with respect to the ranking criteria that gctree uses.
Why to trust leaf disambiguation:
Disambiguation is informed by sequence placement in the tree (by dnapars), which takes into account all that is known about the sequences. Disambiguation minimizes parsimony score, meaning that ambiguous sites will be filled in using bases in the most closely related sequence in which those sites are unambiguous.
Also, different trees, with possibly different disambiguated leaf sequences, are ranked according to all data (isotype, abundance, mutability) provided to gctree, and the disambiguated leaf sequences influence that ranking. This means that ranking may influence the chosen disambiguation of observed sequences to be more plausible.