Quickstart

In this minimal example we show how to use gctree commands to rank parsimony trees from PHYLIP’s dnapars program. See the CLI and API docs for complete documentation of command options, included many not used here.

The data in this example were published in Tas et al. 2016. Visualizing Antibody Affinity Maturation in Germinal Centers. Science 351 (6277) and shown in Fig. 4 (lymph node 2, germinal center 1).

_images/gc1.png

Input data

We start with a fasta alignment file. The file ../example/150228_Clone_3-8.fasta contains heavy chain V gene sequences from germinal B cells sorted from a brainbow mouse using multicolor fate mapping.

$ tail -30 ../example/150228_Clone_3-8.fasta
>VIBM1S4HBJ
ggacctagcctcgtgaaaccttctcagactctgtccctcacctgttctgtcactggcgac
tccatcaccagtggttactggaactggatccggaagttcccagggaatagacttgagtac
atggggtacataagcttcagtggtagcacttactacaatccatctctcaaaagtcgaatc
tccatcactcgagacacatccaagaaccagtactacctgcagttgaattctgtgactact
gaggacacagccacatattactgt
>VIBM1S4HKJ
ggacctagcctcgtgaaaccttctcagactctgtccctcacctgttctgtcactggcgac
tccatcaccagtggttactggaactggatccggaagttcccagggaatagacttgagtac
atggggtacataagcttcagtggtagcacttactacaatccatctctcaaaagtcgaatc
tccatcactcgagacacatccaagaaccagtactacctgcagttgaattctgtgactact
gaggacacagccacatattactgt
>VIBM1S4HDJ
ggacctagcctcgtgaaaccttctcagactctgtccctcacctgttctgtcactggcgac
tccatcaccagtggttactggaactggatccggaagttcccagggaatagacttgagtac
atggggtacataagcttcagtggtagcacttactacaatccatctctcaaaagtcgaatc
tccatcactcgagacacatccaagaaccagtactacctgcagttgaattctgtgactact
gaggacacagccacatattactgt
>VIBM1S4HCJ
ggacctagcctcgtgaaaccttctcagactctgtccctcacctgttctgtcactggcgac
tccatcaccagtggttactggaactggatccggaagttcccagggaatagacttgagtac
atggggtacataagcttcagtggtagcacttactacaatccatctctcaaaagtcgaatc
tccatcactcgagacacatccaagaaccagtactacctgcagttgaattctgtgactact
gaggacacagccacatattactgt
>GL
ggacctagcctcgtgaaaccttctcagactctgtccctcacctgttctgtcactggcgac
tccatcaccagtggttactggaactggatccggaaattcccagggaataaacttgagtac
atggggtacataagctacagtggtagcacttactacaatccatctctcaaaagtcgaatc
tccatcactcgagacacatccaagaaccagtactacctgcagttgaattctgtgactact
gaggacacagccacatattactgt

In this file the sequence with id GL is the naive germline sequence, and represents the root of the tree. It does not refer to an observed sequence, but is included to outgroup root the tree! This input fasta file should also contain all observed sequences with duplication (some of which may be identical to the root/outgroup sequence).

Deduplication and sequence abundances

First we deduplicate the sequences and convert to phylip alignment format, and also determine sequence abundances. The deduplicate command writes the phylip alignment of unique sequences to stdout (which we redirect to a file here). The argument --root indicates the root id. The flag --id_abundances can be used to indicate that integer sequence ids should be interepreted as abundances. The argument --abundance_file indicates that sequence abundance information should be written to the specified csv file. The argument --idmapfile allows us to specify a filename for the output file containing a mapping of new, unique sequence IDs to original sequence IDs from the fasta file.

$ deduplicate ../example/150228_Clone_3-8.fasta --root GL --abundance_file abundances.csv --idmapfile idmap.txt > deduplicated.phylip

We now have files deduplicated.phylip and abundances.csv:

$ head deduplicated.phylip
 43 264
GL         GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq1       GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq2       GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq3       GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq4       GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq5       GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq6       GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq7       GGACCTAGCC TCGTGAAATC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
seq8       GGACCTAGCC TCGTGAAACC TTCTCAGACT CTGTCCCTCA CCTGTTCTGT
$ head abundances.csv
GL,0
seq1,1
seq2,1
seq3,1
seq4,1
seq5,1
seq6,1
seq7,1
seq8,1
seq9,3

Parsimony trees

We use PHYLIP’s dnapars program (see install docs page) to generate a set of maximum parsimony trees for the deduplicated sequences. PHYLIP is an interactive command line tool, so we will automatically generate a config file to feed it input, instead of using it interactively.

We generate a config file for dnapars based on our deduplicated alignment using the mkconfig command.

$ mkconfig deduplicated.phylip dnapars > dnapars.cfg

Run dnapars using this config file, redirecting stdout to a log file.

$ dnapars < dnapars.cfg > dnapars.log

You now have two new files, outfile and outtree. Note: if you want to rerun the above dnapars command, you must delete these two files first!

gctree

We’re now ready to run gctree infer to use abundance data (in abundances.csv) to rank the equally parsimonious trees (in outfile). We can use the optional argument --frame to indicate the coding frame of the sequence start, so that amino acid substitutions can be annotated on our trees.

Note

If working in a headless environment, gctree infer must be run with a tool like XVFB to provide an X server for rendering the output trees. Prepend the gctree command with xvfb-run -a. Alternatively, we have had success setting the following environment variables instead of using XVFB: export QT_QPA_PLATFORM=offscreen, export XDG_RUNTIME_DIR=/tmp/runtime-runner. You may also want to tell matplotlib to use a headless backend with export MPLBACKEND=agg.

$ gctree infer outfile abundances.csv --root GL --frame 1 --verbose
number of trees with integer branch lengths: 703
Branching process parameters to be used for ranking: (0.49618320692429585, 0.36445206621349757)
Ranking trees to maximize LogBPLikelihood then minimize Alleles

Stats for optimal trees:
tree     	LogBPLikelihood	Alleles        
1         	-78.00393661	48        
Degenerate ranking criteria: trimmed history DAG contains 3 unique trees, with 3 unique collapsed topologies.
A representative of each topology will be sampled randomly for rendering.
...

A large number of output files with the basename gctree.out.* are also created. The SVG image file gctree.out.inference.abundance_rank.svg shows a distribution of genotype abundances in the original data:

_images/gctree.out.inference.abundance_rank.svg

Then there are files gctree.out.inference.1.svg and gctree.out.inference.1.nk containing an SVG tree image and newick tree file. If more than one parsimony tree is optimal, then up to ten optimal trees will be sampled randomly, and the corresponding files will be numbered arbitrarily. For example here is the top ranked tree gctree.out.inference.1.svg:

_images/gctree.out.inference.1.svg

You will also see Python pickle files gctree.out.inference.[1,2,...].p containing a gctree.CollapsedTree object for each optimal tree, which can be loaded and manipulated via the API (e.g. plotted in various ways using gctree.CollapsedTree.render()).

All parsimony trees found by dnapars, as well as branching process parameters are saved in the file gctree.out.inference.parsimony_forest.p, containing a gctree.CollapsedForest object. This file may be manipulated using gctree infer, instead of providing a dnapars outfile.

Note

Although described below, using context likelihood, mutability parsimony, or isotype parsimony

as ranking criteria is experimental, and has not yet been shown in a careful validation to improve tree inference. Only the default branching process likelihood is recommended for tree ranking!

Criteria other than branching process likelihoods can be used to break ties between trees. Providing arguments --isotype_mapfile and --idmapfile will allow trees to be ranked by isotype parsimony. Providing arguments --mutability and --substitution allows trees to be ranked according to a context-sensitive mutation model. By default, trees are ranked lexicographically, first maximizing likelihood, then minimizing isotype parsimony, and finally maximizing a context-based poisson likelihood, if such information is provided. Ranking priorities can be adjusted using the argument --ranking_coeffs.

For example, to find the optimal tree according to a linear combination of likelihood, isotype parsimony, mutabilities, and alleles:

$ gctree infer gctree.out.inference.parsimony_forest.p --frame 1 --idmap idmap.txt --isotype_mapfile ../example/isotypemap.txt --mutability ../HS5F_Mutability.csv --substitution ../HS5F_Substitution.csv --ranking_coeffs 1 0.1 0 --outbase newranking --summarize_forest --tree_stats --verbose
/usr/share/miniconda/envs/gctree/lib/python3.9/site-packages/gctree/branching_processes.py:1282: UserWarning: Higher values for LogContextLikelihood are generally better, but the provided ranking coefficient is positive, so trees with lower values will be preferred.
  warnings.warn(
Loading provided parsimony forest. If forest has fit parameters, parameter fitting will be skipped.
Branching process parameters to be used for ranking: (0.49618320692429585, 0.36445206621349757)
Ranking trees to minimize a linear combination of -1(LogBPLikelihood) + 1.0(Isotype Pars.) + 0.1(LogContextLikelihood)

Stats for optimal trees:
tree     	LogBPLikelihood	Isotype Pars.  	LogContextLikel	Alleles        	treescore
1         	-78.00393661	28        	-511.8011816	48        	54.82381845

The files HS5F_Mutability.csv and HS5F_Substitution.csv are a context sensitive mutation model which can be downloaded from the Shazam Project <https://bitbucket.org/kleinstein/shazam/src/master/data-raw/>_.

By default, only the files listed above will be generated, with the optional argument --outbase specifying how the output files should be named.

_images/newranking.inference.1.svg

For detailed information about each tree used for ranking, as well as a pairplot like the one below comparing the highest ranked tree to all other ranked trees,use the argument --tree_stats.

newranking.tree_stats.pairplot.png

Sometimes ranked trees are too numerous, and generating the output of --tree_stats would require too many resources. For a summary of the collection of trees used for ranking, the argument --summarize_forest is provided. Most importantly, this option summarizes how much less likely the top ranked tree is, compared to the most likely tree being ranked, for example to validate coefficients passed to --ranking_coeffs.

$ cat newranking.forest_summary.log

Overall LogBPLikelihood range -92.15528744 to -78.00393661.
Among trees with max LogBPLikelihood of: -78.00393661
	Isotype Pars. range: 28 to 28
	LogContextLikelihood range: -511.8011815783986 to -509.3519837478428
	Alleles range: 48 to 48
	 range: 0 to 0

Overall Isotype Pars. range 23 to 28.
Among trees with min Isotype Pars. of: 23
	LogBPLikelihood range: -90.94295234 to -87.98896459
	LogContextLikelihood range: -520.4288242353949 to -515.8186160410505
	Alleles range: 51 to 52
	 range: 0 to 0

Overall LogContextLikelihood range -524.6994201844342 to -508.85478662354575.
Among trees with max LogContextLikelihood of: -508.85478662354575
	LogBPLikelihood range: -80.46148748 to -80.46148748
	Isotype Pars. range: 27 to 27
	Alleles range: 49 to 49
	 range: 0 to 0

Overall Alleles range 48 to 53.
Among trees with min Alleles of: 48
	LogBPLikelihood range: -82.14344157 to -78.00393661
	Isotype Pars. range: 26 to 28
	LogContextLikelihood range: -516.8335781401432 to -509.3519837478428
	 range: 0 to 0

Overall  range 0 to 0.
Among trees with min  of: 0
	LogBPLikelihood range: -92.15528744 to -78.00393661
	Isotype Pars. range: 23 to 28
	LogContextLikelihood range: -524.6994201844342 to -508.85478662354575
	Alleles range: 48 to 53



Highest ranked tree: loss from best value:
tree     	LogBPLikelihood	Isotype Pars.  	LogContextLikel	Alleles        
1         	       0.0	5         	-2.946394955	0

isotype

If we would like to add observed isotype data to trees output by gctree inference, we can now do so. In addition to the outputs from gctree, a file mapping original IDs of observed sequences to their observed isotypes (like example/isotypemap.txt) is required.

$ isotype idmap.txt ../example/isotypemap.txt --trees gctree.out.inference.1.p newranking.inference.1.p --out_directory isotyped
name			 original node count	 isotype parsimony	 new node count
gctree.out.inference.1.p		 48			 28.0			 50
newranking.inference.1.p		 48			 28.0			 50
...

Trees originally output by gctree are re-rendered with revised labels and node colors corresponding to observed or inferred isotypes. For example, here is the top ranked tree above, with isotypes added:

_images/gctree.out.inference.1.p.isotype_parsimony.28.svg

A note about node names

The names associated with unobserved nodes (for example, in trees rendered with –idlabel) are arbitrarily chosen, but are guaranteed to correspond bijectively with unobserved sequences. However, if gctree output contains multiple trees, two unobserved nodes which share the same name but occur in different output trees will not in general possess the same unobserved sequence.

A note about ambiguous sequence data

Gctree can handle input sequences which contain ambiguous bases, but handling of such sequences is experimental. That is, there may be issues with both the implementation and the underlying methods.

If you use gctree to build trees from ambiguous sequence data, it’s probably a good idea to follow at least these guidelines:

  • Make sure that all sites are unambiguous in at least one of the input sequences. If a site contains an ambiguous base in all input sequences, gctree will choose an arbitrary base for that site. The bases chosen for such sites should not be interpreted as meaningful in any way.

  • Understand that for any observed sequence disambiguated by gctree, another choice of disambiguation for that sequence may exist, which results in a better tree with respect to the ranking criteria. The only guarantee is that disambiguations of observed sequences are maximally parsimonious given the tree topology in which they appear.

Gctree can handle ambiguous input sequences because dnapars can accept ambiguous input sequences. Each tree output by dnapars is then disambiguated. It is possible that multiple observed sequences may be disambiguated in identical ways, in which case their corresponding leaf nodes, and abundances, are merged. The deduplicated sequence ids corresponding to each node in the final tree output by gctree are retained in the original_ids node attribute.

Here’s a bit more discussion about how much the disambiguated observed sequences can be trusted:

Why not to trust leaf disambiguation:

As mentioned above, if the same site(s) contain ambiguous bases in all sequences, the disambiguation is completely arbitrary at those sites, but could be mistakenly interpreted as informed. Also alluded to above, if multiple possible disambiguated leaf sequences exist for a particular dnapars topology, one will be chosen arbitrarily, even though another may be more plausible IRL, or with respect to the ranking criteria that gctree uses.

Why to trust leaf disambiguation:

Disambiguation is informed by sequence placement in the tree (by dnapars), which takes into account all that is known about the sequences. Disambiguation minimizes parsimony score, meaning that ambiguous sites will be filled in using bases in the most closely related sequence in which those sites are unambiguous.

Also, different trees, with possibly different disambiguated leaf sequences, are ranked according to all data (isotype, abundance, mutability) provided to gctree, and the disambiguated leaf sequences influence that ranking. This means that ranking may influence the chosen disambiguation of observed sequences to be more plausible.