gctree.branching_processes

This module contains classes for simulation and inference for a binary branching process with mutation in which the tree is collapsed to nodes that count the number of clonal leaves of each type.

Classes

`CollapsedForest`	A collection of trees.
`CollapsedTree`	A collapsed tree, modeled as an infinite type Galton-Watson process run to extinction.

Exceptions

RankingCriterionError

exception gctree.branching_processes.RankingCriterionError[source]

class gctree.branching_processes.CollapsedTree(tree=None, allow_repeats=False)[source]

A collapsed tree, modeled as an infinite type Galton-Watson process run to extinction.

tree: ete3.TreeNode object with abundance node features

Parameters:

tree (Optional[TreeNode]) – ete3 tree with abundance node features. If uncollapsed, it will be collapsed along branches with no mutations. Can be ommitted on initializaion, and later simulated. If a tree is provided, names of nodes with abundance 0 will not be preserved.
allow_repeats (bool) – tolerate the existence of nodes with the same genotype after collapse, e.g. in sister clades.

__init__(tree=None, allow_repeats=False)[source]

ll(p, q)[source]

Log likelihood of branching process parameters \((p, q)\) given tree topology \(T\) and genotype abundances \(A\).

\[\ell(p, q; T, A) = \log\mathbb{P}(T, A \mid p, q)\]

Parameters:

p (float64) – branching probability
q (float64) – mutation probability

Return type:

Tuple[float64, ndarray]

Returns:

Log likelihood \(\ell(p, q; T, A)\) and its gradient \(\nabla\ell(p, q; T, A)\)

mle(**kwargs)[source]

Maximum likelihood estimate of \((p, q)\).

\[(p, q) = \arg\max_{p,q\in [0,1]}\ell(p, q)\]

Parameters:: kwargs – keyword arguments passed along to the branching process likelihood CollapsedTree.ll()
Return type:: Tuple[float64, float64]
Returns:: Tuple \((p, q)\) with estimated branching probability and estimated mutation probability

simulate(p, q, root=True)[source]

Simulate a collapsed tree as an infinite type Galton-Watson process run to extintion, with branching probability \(p\) and mutation probability \(q\). Overwrites existing tree attribute.

Parameters:

p (float64) – branching probability
q (float64) – mutation probability
root (bool) – flag indicating simulation is being run from the root of the tree, so we should update tree attributes (should usually be True)

render(outfile, scale=None, branch_margin=0, node_size=None, idlabel=False, colormap=None, frame=None, position_map=None, chain_split=None, frame2=None, position_map2=None, show_support=False, show_nuc_muts=False)[source]

Render to tree image file.

Parameters:

outfile (str) – file name to render to, filetype inferred from suffix, .svg for color
scale (Optional[float]) – branch length scale in pixels (set automatically if None)
branch_margin (float) – additional leaf branch separation margin, in pixels, to scale tree width
node_size (Optional[float]) – size of nodes in pixels (set according to abundance if None)
idlabel (bool) – label nodes with seq ids, and write sequences of all nodes to a fasta file with same base name as outfile
colormap (Optional[Dict]) – dictionary mapping node names to color names or to dictionaries of color frequencies
frame (Optional[int]) – coding frame for annotating amino acid substitutions
position_map (Optional[List]) – mapping of position names for sequence indices, to be used with substitution annotations and the frame argument
chain_split (Optional[int]) – if sequences are a concatenation two gene sequences, this is the index at which the 2nd one starts (requires frame and frame2 arguments)
frame2 (Optional[int]) – coding frame for 2nd sequence when using chain_split
position_map2 (Optional[List]) – like position_map, but for 2nd sequence when using chain_split
show_support (bool) – annotate bootstrap support if available
show_nuc_muts (bool) – If True, annotate branches with nucleotide mutations. If False, and frame is provided, then branches will be annotated with amino acid mutations.

feature_colormap(feature, cmap='viridis', vmin=None, vmax=None, scale='linear', **kwargs)[source]

Generate a colormap based on a continuous tree feature.

Parameters:

feature (str) – feature name (all nodes in tree attribute must have this feature)
cmap (str) – any matplotlib color palette name
vmin (Optional[float]) – minimum value for colormap (default to minimum of the feature over the tree)
vmax (Optional[float]) – maximum value for colormap (default to maximum of the feature over the tree)
scale (str) – linear (default), log, or symlog (must also provide linthresh kwarg)
kwargs – additional keyword arguments for scale transformation

Return type:

Dict[str, str]

Returns:

Dictionary of node names to hex color strings, which may be used as the colormap in gctree.CollapsedTree.render()

write(file_name)[source]

Serialize to pickle file.

Parameters:: file_name (str) – file name (.p suffix recommended)

newick(file_name)[source]

Write to newick file.

Parameters:: file_name (str) – file name (.nk suffix recommended)

compare(tree2, method='identity')[source]

Compare this tree to the other tree.

Parameters:

tree2 (CollapsedTree) – another object of this type
method (str) – comparison type (identity, MRCA, or RF)

Return type:

Union[bool, float64]

Returns:

tree difference

support(bootstrap_trees_list, weights=None, compatibility=False)[source]

Compute support from a list of bootstrap CollapsedTree objects, and add to tree attibute.

Parameters:

bootstrap_trees_list (List[CollapsedTree]) – List of trees
weights (Optional[List[float64]]) – weights for each tree, perhaps for weighting parsimony degenerate trees
compatibility (bool) – counts trees that don’t disconfirm the split.

local_branching(tau=1, tau0=1, infinite_root_branch=True, nan_root_lbr=False)[source]

Add local branching statistics (Neher et al. 2014) as tree node features to the ETE tree attribute. After execution, all nodes will have new features LBI (local branching index) and LBR (local branching ratio, below Vs above the node)

Parameters:

tau – decay timescale for exponential filter
tau0 – effective branch length for branches with zero mutations
infinite_root_branch – calculate assuming the root node has an infinite branch
nan_root_lbr – replace the root LBR value with np.nan

class gctree.branching_processes.CollapsedForest(forest=None)[source]

A collection of trees.

We can intialize with a list of trees, each an instance of ete3.Tree or CollapsedTree, or we can simulate the forest later.

n_trees: number of trees in forest

parameters: fit branching process parameters, if mle has been run, otherwise None

Parameters:: forest (Optional[List[Union[CollapsedTree, TreeNode]]]) – list of ete3.Tree

__init__(forest=None)[source]

simulate(p, q, n_trees)[source]

Simulate a forest of collapsed trees. Overwrites existing forest attribute.

Parameters:

p (float64) – branching probability
q (float64) – mutation probability
n_trees (int) – number of trees

ll(p, q, marginal=False)[source]

Log likelihood of branching process parameters \((p, q)\) given tree topologies \(T_1, \dots, T_n\) and corresponding genotype abundances vectors \(A_1, \dots, A_n\) for each of \(n\) trees in the forest.

If marginal=False (the default), compute the joint log likelihood

\[\ell(p, q; T, A) = \sum_{i=1}^n\log\mathbb{P}(T_i, A_i \mid p, q),\]

otherwise compute the marginal log likelihood

\[\ell(p, q; T, A) = \log\left(\sum_{i=1}^n\mathbb{P}(T_i, A_i \mid p, q)\right).\]

Parameters:

p (float64) – branching probability
q (float64) – mutation probability
marginal (bool) – compute the marginal likelihood over trees, otherwise compute the joint likelihood of trees

Return type:

Tuple[float64, ndarray]

Returns:

Log branching process likelihood \(\ell(p, q; T, A)\) and its gradient \(\nabla\ell(p, q; T, A)\)

mle(**kwargs)[source]

Maximum likelihood estimate of \((p, q)\).

\[(p, q) = \arg\max_{p,q\in [0,1]}\ell(p, q)\]

Parameters:: kwargs – keyword arguments passed along to the branching process likelihood CollapsedForest.ll()
Return type:: Tuple[float64, float64]
Returns:: Tuple \((p, q)\) with estimated branching probability and estimated mutation probability

filter_trees(ranking_strategy=None, mutability_file=None, substitution_file=None, ignore_isotype=False, chain_split=None, verbose=False, outbase='gctree.out', summarize_forest=False, tree_stats=False, img_type='svg', ranking_coeffs=None, branching_process_ranking_coeff=-1, use_old_mut_parsimony=False)[source]

Filter trees according to specified criteria.

By default, the forest will be trimmed to maximize branching process likelihood, then minimize isotype parsimony, then maximize context-based Poisson likelihood, and finally minimize number of alleles. Any criteria for which the necessary arguments are not provided will be automatically ignored.

For other ranking strategies, see the ranking_strategy argument.

Parameters:

ranking_strategy (Optional[str]) – A string expression describing how to rank trees. See docs for command line argument –ranking_strategy for description.
mutability_file (Optional[str]) – A mutability model
substitution_file (Optional[str]) – A substitution model
ignore_isotype (bool) – Ignore isotype parsimony when ranking. By default, isotype information added with :meth:add_isotypes will be used to compute isotype parsimony, which is used in ranking.
chain_split (Optional[int]) – The index at which non-adjacent sequences are concatenated, for calculating context-based Poisson likelihood.
verbose (bool) – print information about trimming
outbase (str) – file name stem for a file with information for each tree in the DAG.
summarize_forest (bool) – whether to write a summary of the forest to file [outbase].forest_summary.log
tree_stats (bool) – whether to write stats for each tree in the forest to file [outbase].tree_stats.log
img_type (str) – format for output plots.
ranking_coeffs (Optional[Sequence[float]]) – (Deprecated. Use ranking_strategy instead) A list or tuple of coefficients for prioritizing tree weights. The order of coefficients is: isotype parsimony score, context poisson likelihood, and number of alleles. A coefficient of -1 will be applied to branching process likelihood by default, unless a different value is provided to the keyword argument branching_process_ranking_coeff. Trees are chosen to minimize this linear combination of tree weights, so weights for which larger values are more optimal (such as likelihoods) should have negative coefficients.
branching_process_ranking_coeff (float) – (Deprecated. Use ranking_strategy instead) Ranking coefficient to use for branching process likelihood. Value is ignored unless ranking_coeffs argument is provided.
use_old_mut_parsimony (bool) – (Deprecated. Use ranking_strategy instead) Whether to use the deprecated ‘mutability parsimony’ instead of context-based poisson likelihood (only applicable if mutability and substitution files are provided.

Return type:

CollapsedForest

Returns:

The trimmed forest, containing all optimal trees according to the specified criteria, and a tuple of data about the trees in that forest, with format (branching process likelihood, isotype parsimony, context-based Poisson likelihood, alleles).

likelihood_rankplot(outbase, p, q, img_type='svg')[source]: Save a rank plot of likelihoods to the file [outbase].inference.likelihood_rank.[img_type].

n_topologies()[source]

Count the number of topology classes, ignoring internal node sequences.

Return type:: int

iter_topology_classes()[source]

Sort trees by topology class.

Returns:

A generator of CollapsedForest objects, each containing trees with the same topology,: ignoring internal node labels. CollapsedForests will be yielded in reverse-order of the number of trees in each topology class, so that each CollapsedForest will contain at least as many trees as the one that follows.

add_isotypes(isotypemap=None, isotypemap_file=None, idmap=None, idmap_file=None, isotype_names=None)[source]: Adds isotype annotations, including inferred ancestral isotypes, to all nodes in stored trees.

sample_tree()[source]

Sample a random CollapsedTree from the forest.

Return type:: CollapsedTree