Quickstart

In this minimal example we show how to use gctree commands to rank parsimony trees from PHYLIP’s dnapars program. See the CLI and API docs for complete documentation of command options, included many not used here.

The data in this example were published in Tas et al. 2016. Visualizing Antibody Affinity Maturation in Germinal Centers. Science 351 (6277) and shown in Fig. 4 (lymph node 2, germinal center 1).

Input data

We start with a fasta alignment file. The file ../example/150228_Clone_3-8.fasta contains heavy chain V gene sequences from germinal B cells sorted from a brainbow mouse using multicolor fate mapping.

$ tail -30 ../example/150228_Clone_3-8.fasta
>VIBM1S4HBJ
ggacctagcctcgtgaaaccttctcagactctgtccctcacctgttctgtcactggcgac
tccatcaccagtggttactggaactggatccggaagttcccagggaatagacttgagtac
atggggtacataagcttcagtggtagcacttactacaatccatctctcaaaagtcgaatc
tccatcactcgagacacatccaagaaccagtactacctgcagttgaattctgtgactact
gaggacacagccacatattactgt
>VIBM1S4HKJ
ggacctagcctcgtgaaaccttctcagactctgtccctcacctgttctgtcactggcgac
tccatcaccagtggttactggaactggatccggaagttcccagggaatagacttgagtac
atggggtacataagcttcagtggtagcacttactacaatccatctctcaaaagtcgaatc
tccatcactcgagacacatccaagaaccagtactacctgcagttgaattctgtgactact
gaggacacagccacatattactgt
>VIBM1S4HDJ
ggacctagcctcgtgaaaccttctcagactctgtccctcacctgttctgtcactggcgac
tccatcaccagtggttactggaactggatccggaagttcccagggaatagacttgagtac
atggggtacataagcttcagtggtagcacttactacaatccatctctcaaaagtcgaatc
tccatcactcgagacacatccaagaaccagtactacctgcagttgaattctgtgactact
gaggacacagccacatattactgt
>VIBM1S4HCJ
ggacctagcctcgtgaaaccttctcagactctgtccctcacctgttctgtcactggcgac
tccatcaccagtggttactggaactggatccggaagttcccagggaatagacttgagtac
atggggtacataagcttcagtggtagcacttactacaatccatctctcaaaagtcgaatc
tccatcactcgagacacatccaagaaccagtactacctgcagttgaattctgtgactact
gaggacacagccacatattactgt
>GL
ggacctagcctcgtgaaaccttctcagactctgtccctcacctgttctgtcactggcgac
tccatcaccagtggttactggaactggatccggaaattcccagggaataaacttgagtac
atggggtacataagctacagtggtagcacttactacaatccatctctcaaaagtcgaatc
tccatcactcgagacacatccaagaaccagtactacctgcagttgaattctgtgactact
gaggacacagccacatattactgt

In this file the sequence with id GL is the naive germline sequence, and represents the root of the tree. It does not refer to an observed sequence, but is included to outgroup root the tree! This input fasta file should also contain all observed sequences with duplication (some of which may be identical to the root/outgroup sequence).

Deduplication and sequence abundances

First we deduplicate the sequences and convert to phylip alignment format, and also determine sequence abundances. The deduplicate command writes the phylip alignment of unique sequences to stdout (which we redirect to a file here). The argument --root indicates the root id. The flag --id_abundances can be used to indicate that integer sequence ids should be interepreted as abundances. The argument --abundance_file indicates that sequence abundance information should be written to the specified csv file. The argument --idmapfile allows us to specify a filename for the output file containing a mapping of new, unique sequence IDs to original sequence IDs from the fasta file.

$ deduplicate ../example/150228_Clone_3-8.fasta --root GL --abundance_file abundances.csv --idmapfile idmap.txt > deduplicated.phylip

We now have files deduplicated.phylip and abundances.csv:

$ head deduplicated.phylip
 43 264
GL         GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq1       GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq2       GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq3       GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq4       GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq5       GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq6       GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq7       GGACCTAGCC TCGTGAAATC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq8       GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT

$ head abundances.csv
GL,0
seq1,1
seq2,1
seq3,1
seq4,1
seq5,1
seq6,1
seq7,1
seq8,1
seq9,3

Parsimony trees

We use PHYLIP’s dnapars program (see install docs page) to generate a set of maximum parsimony trees for the deduplicated sequences. PHYLIP is an interactive command line tool, so we will automatically generate a config file to feed it input, instead of using it interactively.

We generate a config file for dnapars based on our deduplicated alignment using the mkconfig command.

$ mkconfig deduplicated.phylip dnapars > dnapars.cfg

Run dnapars using this config file, redirecting stdout to a log file.

$ dnapars < dnapars.cfg > dnapars.log

You now have two new files, outfile and outtree. Note: if you want to rerun the above dnapars command, you must delete these two files first!

Gctree Ranking

We’re now ready to run gctree infer to use abundance data (in abundances.csv) to rank the equally parsimonious trees (in outfile). We can use the optional argument --frame to indicate the coding frame of the sequence start, so that amino acid substitutions can be annotated on our trees.

Note

If working in a headless environment, gctree infer must be run with a tool like XVFB to provide an X server for rendering the output trees. Prepend the gctree command with xvfb-run -a. Alternatively, we have had success setting the following environment variables instead of using XVFB: export QT_QPA_PLATFORM=offscreen, export XDG_RUNTIME_DIR=/tmp/runtime-runner. You may also want to tell matplotlib to use a headless backend with export MPLBACKEND=agg.

$ gctree infer outfile abundances.csv --root GL --frame 1 --verbose
number of trees with integer branch lengths: 703
Branching process parameters to be used for ranking: (np.float64(0.49618320692429585), np.float64(0.36445206621349757))
Ranking trees to minimize BPLikelihoodLogLoss then minimize Alleles
Degenerate ranking criteria: filtered forest contains 3 unique trees, with 3 unique collapsed topologies.
A representative of each topology will be sampled randomly for rendering.
  Tree  BPLikelihoodLogLoss  Alleles
     1            78.003937       48
...

A large number of output files with the basename gctree.out.* are also created. The SVG image file gctree.out.inference.abundance_rank.svg shows a distribution of genotype abundances in the original data:

_images/gctree.out.inference.abundance_rank.svg

Then there are files gctree.out.inference.1.svg and gctree.out.inference.1.nk containing an SVG tree image and newick tree file. If more than one parsimony tree is optimal, then up to ten optimal trees will be sampled randomly, and the corresponding files will be numbered arbitrarily. For example here is the top ranked tree gctree.out.inference.1.svg:

You will also see Python pickle files gctree.out.inference.[1,2,...].p containing a gctree.CollapsedTree object for each optimal tree, which can be loaded and manipulated via the API (e.g. plotted in various ways using gctree.CollapsedTree.render()).

All parsimony trees found by dnapars, as well as branching process parameters are saved in the file gctree.out.inference.parsimony_forest.p, containing a gctree.CollapsedForest object. This file may be manipulated using gctree infer, instead of providing a dnapars outfile.

Note

Although described below, using context likelihood log loss, mutability parsimony, or isotype parsimony as ranking criteria is experimental, and has not yet been shown in a careful validation to improve tree inference. Only the default branching process likelihood is recommended for tree ranking!

Criteria other than branching process likelihoods can be used to break ties between trees. Providing arguments --isotype_mapfile and --idmapfile will allow trees to be ranked by isotype parsimony. Providing arguments --mutability and --substitution allows trees to be ranked according to a context-sensitive Poisson model. By default, trees are ranked lexicographically, first minimizing branching process log loss (negative log-likelihood), then minimizing isotype parsimony, minimizing context-based poisson log loss, and finally minimizing number of alleles, if required information for all criteria is provided. Criteria for which required information is not provided will be skipped.

Ranking priorities can be adjusted using the argument --ranking_strategy. This argument accepts a string describing either an alternative lexicographic ordering, or a linear combination of criteria to be minimized. The following identifiers are used to specify available ranking criteria:

B - branching process log loss
I - isotype parsimony
C - context-based Poisson log loss
M - old mutability parsimony
A - number of alleles
R - sitewise reversions to naive sequence

An alternative lexicographic ordering can be specified using a comma-separated list of identifiers. For example by passing --ranking_strategy "C, B, R" to gctree infer, we will minimize context-based Poisson log loss, then branching process log-loss, and finally naive reversions. If for some reason the user wishes to maximize instead of minimizing a criterion, a negative coefficient can be provided, as in "C, B, -R". To compute the value of a criterion without using it for ranking, a coefficient of zero can be prepended to its identifier.

Alternatively, the --ranking_strategy option can be used to rank trees to minimize a linear combination of criteria. For example, to find the optimal tree according to a linear combination of branching process log loss, isotype parsimony, context-based Poisson log loss, and alleles:

$ gctree infer gctree.out.inference.parsimony_forest.p \
                             --frame 1 \
                             --idmap idmap.txt \
                             --isotype_mapfile ../example/isotypemap.txt \
                             --mutability ../HS5F_Mutability.csv \
                             --substitution ../HS5F_Substitution.csv \
                             --ranking_strategy "B + I + 0.1C + 0.01A" \
                             --outbase newranking --summarize_forest \
                             --tree_stats \
                             --verbose
Loading provided parsimony forest. If forest has fit parameters, parameter fitting will be skipped.
number of trees with integer branch lengths: 703
Branching process parameters to be used for ranking: (np.float64(0.49618320692429585), np.float64(0.36445206621349757))
Ranking trees to minimize a linear combination of 1(BPLikelihoodLogLoss) + 1(IsotypeParsimony) + 0.1(ContextLikelihoodLogLoss) + 0.01(Alleles)
  Tree  BPLikelihoodLogLoss  IsotypeParsimony  ContextLikelihoodLogLoss  Alleles   TreeScore
     1            78.003937                28                509.351984       48  157.419135

The files HS5F_Mutability.csv and HS5F_Substitution.csv are a context sensitive mutation model which can be downloaded from the Shazam Project, and are required here to compute context-based Poisson log loss.

By default, only the files listed above will be generated, with the optional argument --outbase specifying how the output files should be named.

For detailed information about each tree used for ranking, as well as a pairplot like the one below comparing the highest ranked tree to all other ranked trees, use the argument --tree_stats.

_images/newranking.tree_stats.pairplot.svg

Sometimes ranked trees are too numerous to generate the output of --tree_stats in a reasonable amount of time. For a summary of the collection of trees used for ranking, the option --summarize_forest is provided. Most importantly, this option summarizes how much less likely the top ranked tree is, compared to the most likely tree being ranked, for example to validate coefficients passed to --ranking_strategy.

$ cat newranking.forest_summary.log

Overall BPLikelihoodLogLoss range 78.00393661 to 92.15528744.
Among trees with min BPLikelihoodLogLoss of: 78.00393661
	IsotypeParsimony range: 28 to 28
	ContextLikelihoodLogLoss range: 509.35198374784284 to 511.8011815783985
	Alleles range: 48 to 48

Overall IsotypeParsimony range 23 to 28.
Among trees with min IsotypeParsimony of: 23
	BPLikelihoodLogLoss range: 87.98896459 to 90.94295234
	ContextLikelihoodLogLoss range: 515.8186160410505 to 520.4288242353949
	Alleles range: 51 to 52

Overall ContextLikelihoodLogLoss range 508.85478662354575 to 524.6994201844342.
Among trees with min ContextLikelihoodLogLoss of: 508.85478662354575
	BPLikelihoodLogLoss range: 80.46148748 to 80.46148748
	IsotypeParsimony range: 27 to 27
	Alleles range: 49 to 49

Overall Alleles range 48 to 53.
Among trees with min Alleles of: 48
	BPLikelihoodLogLoss range: 78.00393661 to 82.14344157
	IsotypeParsimony range: 26 to 28
	ContextLikelihoodLogLoss range: 509.35198374784284 to 516.8335781401432


  Tree  BPLikelihoodLogLoss  IsotypeParsimony  ContextLikelihoodLogLoss  Alleles
     1                  0.0                 5                  0.497197        0

Isotypes

If we would like to add observed isotype data to trees output by gctree inference, we can now do so. In addition to the outputs from gctree, a file mapping original IDs of observed sequences to their observed isotypes (like example/isotypemap.txt) is required.

$ isotype idmap.txt ../example/isotypemap.txt --trees gctree.out.inference.1.p newranking.inference.1.p --out_directory isotyped
name			 original node count	 isotype parsimony	 new node count
gctree.out.inference.1.p		 48			 28.0			 50
newranking.inference.1.p		 48			 28.0			 50
...

Trees originally output by gctree are re-rendered with revised labels and node colors corresponding to observed or inferred isotypes. For example, here is the top ranked tree above, with isotypes added:

_images/gctree.out.inference.1.p.isotype_parsimony.28.svg

A note about node names

The names associated with unobserved nodes (for example, in trees rendered with –idlabel) are arbitrarily chosen, but are guaranteed to correspond bijectively with unobserved sequences. However, if gctree output contains multiple trees, two unobserved nodes which share the same name but occur in different output trees will not in general possess the same unobserved sequence.

A note about ambiguous sequence data

Gctree can handle input sequences which contain ambiguous bases, but handling of such sequences is experimental. That is, there may be issues with both the implementation and the underlying methods.

If you use gctree to build trees from ambiguous sequence data, it’s probably a good idea to follow at least these guidelines:

Make sure that all sites are unambiguous in at least one of the input sequences. If a site contains an ambiguous base in all input sequences, gctree will choose an arbitrary base for that site. The bases chosen for such sites should not be interpreted as meaningful in any way.
Understand that for any observed sequence disambiguated by gctree, another choice of disambiguation for that sequence may exist, which results in a better tree with respect to the ranking criteria. The only guarantee is that disambiguations of observed sequences are maximally parsimonious given the tree topology in which they appear.

Gctree can handle ambiguous input sequences because dnapars can accept ambiguous input sequences. Each tree output by dnapars is then disambiguated. It is possible that multiple observed sequences may be disambiguated in identical ways, in which case their corresponding leaf nodes, and abundances, are merged. The deduplicated sequence ids corresponding to each node in the final tree output by gctree are retained in the original_ids node attribute.

Here’s a bit more discussion about how much the disambiguated observed sequences can be trusted:

Why not to trust leaf disambiguation:

As mentioned above, if the same site(s) contain ambiguous bases in all sequences, the disambiguation is completely arbitrary at those sites, but could be mistakenly interpreted as informed. Also alluded to above, if multiple possible disambiguated leaf sequences exist for a particular dnapars topology, one will be chosen arbitrarily, even though another may be more plausible IRL, or with respect to the ranking criteria that gctree uses.

Why to trust leaf disambiguation:

Disambiguation is informed by sequence placement in the tree (by dnapars), which takes into account all that is known about the sequences. Disambiguation minimizes parsimony score, meaning that ambiguous sites will be filled in using bases in the most closely related sequence in which those sites are unambiguous.

Also, different trees, with possibly different disambiguated leaf sequences, are ranked according to all data (isotype, abundance, mutability) provided to gctree, and the disambiguated leaf sequences influence that ranking. This means that ranking may influence the chosen disambiguation of observed sequences to be more plausible.