gctree.branching_processes

This module contains classes for simulation and inference for a binary branching process with mutation in which the tree is collapsed to nodes that count the number of clonal leaves of each type.

Classes

CollapsedForest

A collection of trees.

CollapsedTree

A collapsed tree, modeled as an infinite type Galton-Watson process run to extinction.

class gctree.branching_processes.CollapsedTree(tree=None, allow_repeats=False)[source]

A collapsed tree, modeled as an infinite type Galton-Watson process run to extinction.

tree

ete3.TreeNode object with abundance node features

Parameters:
  • tree (Optional[TreeNode]) – ete3 tree with abundance node features. If uncollapsed, it will be collapsed along branches with no mutations. Can be ommitted on initializaion, and later simulated. If a tree is provided, names of nodes with abundance 0 will not be preserved.

  • allow_repeats (bool) – tolerate the existence of nodes with the same genotype after collapse, e.g. in sister clades.

__init__(tree=None, allow_repeats=False)[source]
ll(p, q)[source]

Log likelihood of branching process parameters \((p, q)\) given tree topology \(T\) and genotype abundances \(A\).

\[\ell(p, q; T, A) = \log\mathbb{P}(T, A \mid p, q)\]
Parameters:
  • p (float64) – branching probability

  • q (float64) – mutation probability

Return type:

Tuple[float64, ndarray]

Returns:

Log likelihood \(\ell(p, q; T, A)\) and its gradient \(\nabla\ell(p, q; T, A)\)

mle(**kwargs)[source]

Maximum likelihood estimate of \((p, q)\).

\[(p, q) = \arg\max_{p,q\in [0,1]}\ell(p, q)\]
Parameters:

kwargs – keyword arguments passed along to the branching process likelihood CollapsedTree.ll()

Return type:

Tuple[float64, float64]

Returns:

Tuple \((p, q)\) with estimated branching probability and estimated mutation probability

simulate(p, q, root=True)[source]

Simulate a collapsed tree as an infinite type Galton-Watson process run to extintion, with branching probability \(p\) and mutation probability \(q\). Overwrites existing tree attribute.

Parameters:
  • p (float64) – branching probability

  • q (float64) – mutation probability

  • root (bool) – flag indicating simulation is being run from the root of the tree, so we should update tree attributes (should usually be True)

render(outfile, scale=None, branch_margin=0, node_size=None, idlabel=False, colormap=None, frame=None, position_map=None, chain_split=None, frame2=None, position_map2=None, show_support=False)[source]

Render to tree image file.

Parameters:
  • outfile (str) – file name to render to, filetype inferred from suffix, .svg for color

  • scale (Optional[float]) – branch length scale in pixels (set automatically if None)

  • branch_margin (float) – additional leaf branch separation margin, in pixels, to scale tree width

  • node_size (Optional[float]) – size of nodes in pixels (set according to abundance if None)

  • idlabel (bool) – label nodes with seq ids, and write sequences of all nodes to a fasta file with same base name as outfile

  • colormap (Optional[Dict]) – dictionary mapping node names to color names or to dictionaries of color frequencies

  • frame (Optional[int]) – coding frame for annotating amino acid substitutions

  • position_map (Optional[List]) – mapping of position names for sequence indices, to be used with substitution annotations and the frame argument

  • chain_split (Optional[int]) – if sequences are a concatenation two gene sequences, this is the index at which the 2nd one starts (requires frame and frame2 arguments)

  • frame2 (Optional[int]) – coding frame for 2nd sequence when using chain_split

  • position_map2 (Optional[List]) – like position_map, but for 2nd sequence when using chain_split

  • show_support (bool) – annotate bootstrap support if available

feature_colormap(feature, cmap='viridis', vmin=None, vmax=None, scale='linear', **kwargs)[source]

Generate a colormap based on a continuous tree feature.

Parameters:
  • feature (str) – feature name (all nodes in tree attribute must have this feature)

  • cmap (str) – any matplotlib color palette name

  • vmin (Optional[float]) – minimum value for colormap (default to minimum of the feature over the tree)

  • vmax (Optional[float]) – maximum value for colormap (default to maximum of the feature over the tree)

  • scale (str) – linear (default), log, or symlog (must also provide linthresh kwarg)

  • kwargs – additional keyword arguments for scale transformation

Return type:

Dict[str, str]

Returns:

Dictionary of node names to hex color strings, which may be used as the colormap in gctree.CollapsedTree.render()

write(file_name)[source]

Serialize to pickle file.

Parameters:

file_name (str) – file name (.p suffix recommended)

newick(file_name)[source]

Write to newick file.

Parameters:

file_name (str) – file name (.nk suffix recommended)

compare(tree2, method='identity')[source]

Compare this tree to the other tree.

Parameters:
  • tree2 (CollapsedTree) – another object of this type

  • method (str) – comparison type (identity, MRCA, or RF)

Return type:

Union[bool, float64]

Returns:

tree difference

support(bootstrap_trees_list, weights=None, compatibility=False)[source]

Compute support from a list of bootstrap CollapsedTree objects, and add to tree attibute.

Parameters:
  • bootstrap_trees_list (List[CollapsedTree]) – List of trees

  • weights (Optional[List[float64]]) – weights for each tree, perhaps for weighting parsimony degenerate trees

  • compatibility (bool) – counts trees that don’t disconfirm the split.

local_branching(tau=1, tau0=1, infinite_root_branch=True, nan_root_lbr=False)[source]

Add local branching statistics (Neher et al. 2014) as tree node features to the ETE tree attribute. After execution, all nodes will have new features LBI (local branching index) and LBR (local branching ratio, below Vs above the node)

Parameters:
  • tau – decay timescale for exponential filter

  • tau0 – effective branch length for branches with zero mutations

  • infinite_root_branch – calculate assuming the root node has an infinite branch

  • nan_root_lbr – replace the root LBR value with np.nan

class gctree.branching_processes.CollapsedForest(forest=None)[source]

A collection of trees.

We can intialize with a list of trees, each an instance of ete3.Tree or CollapsedTree, or we can simulate the forest later.

n_trees

number of trees in forest

parameters

fit branching process parameters, if mle has been run, otherwise None

Parameters:

forest (Optional[List[Union[CollapsedTree, TreeNode]]]) – list of ete3.Tree

__init__(forest=None)[source]
simulate(p, q, n_trees)[source]

Simulate a forest of collapsed trees. Overwrites existing forest attribute.

Parameters:
  • p (float64) – branching probability

  • q (float64) – mutation probability

  • n_trees (int) – number of trees

ll(p, q, marginal=False)[source]

Log likelihood of branching process parameters \((p, q)\) given tree topologies \(T_1, \dots, T_n\) and corresponding genotype abundances vectors \(A_1, \dots, A_n\) for each of \(n\) trees in the forest.

If marginal=False (the default), compute the joint log likelihood

\[\ell(p, q; T, A) = \sum_{i=1}^n\log\mathbb{P}(T_i, A_i \mid p, q),\]

otherwise compute the marginal log likelihood

\[\ell(p, q; T, A) = \log\left(\sum_{i=1}^n\mathbb{P}(T_i, A_i \mid p, q)\right).\]
Parameters:
  • p (float64) – branching probability

  • q (float64) – mutation probability

  • marginal (bool) – compute the marginal likelihood over trees, otherwise compute the joint likelihood of trees

Return type:

Tuple[float64, ndarray]

Returns:

Log branching process likelihood \(\ell(p, q; T, A)\) and its gradient \(\nabla\ell(p, q; T, A)\)

mle(**kwargs)[source]

Maximum likelihood estimate of \((p, q)\).

\[(p, q) = \arg\max_{p,q\in [0,1]}\ell(p, q)\]
Parameters:

kwargs – keyword arguments passed along to the branching process likelihood CollapsedForest.ll()

Return type:

Tuple[float64, float64]

Returns:

Tuple \((p, q)\) with estimated branching probability and estimated mutation probability

filter_trees(ranking_coeffs=None, mutability_file=None, substitution_file=None, ignore_isotype=False, chain_split=None, verbose=False, outbase='gctree.out', summarize_forest=False, tree_stats=False, branching_process_ranking_coeff=-1, use_old_mut_parsimony=False)[source]

Filter trees according to specified criteria.

Trim the forest to minimize a linear combination of branching process likelihood, isotype parsimony score, context/mutability-based Poisson likelihood, and number of alleles, with coefficients provided in the argument ``ranking_coeffs`, in that order.

Parameters:
  • ranking_coeffs (Optional[Sequence[float]]) – A list or tuple of coefficients for prioritizing tree weights. The order of coefficients is: isotype parsimony score, context poisson likelihood, and number of alleles. A coefficient of -1 will be applied to branching process likelihood by default, unless a different value is provided to the keyword argument branching_process_ranking_coeff. Trees are chosen to minimize this linear combination of tree weights, so weights for which larger values are more optimal (such as likelihoods) should have negative coefficients. If ranking_coeffs is not provided, trees will be ranked lexicographically by likelihood, then by other traits, in the same order.

  • mutability_file (Optional[str]) – A mutability model

  • substitution_file (Optional[str]) – A substitution model

  • ignore_isotype (bool) – Ignore isotype parsimony when ranking. By default, isotype information added with :meth:add_isotypes will be used to compute isotype parsimony, which is used in ranking.

  • chain_split (Optional[int]) – The index at which non-adjacent sequences are concatenated, for calculating context-based Poisson likelihood.

  • verbose (bool) – print information about trimming

  • outbase (str) – file name stem for a file with information for each tree in the DAG.

  • summarize_forest (bool) – whether to write a summary of the forest to file [outbase].forest_summary.log

  • tree_stats (bool) – whether to write stats for each tree in the forest to file [outbase].tree_stats.log

  • branching_process_ranking_coeff (float) – Ranking coefficient to use for branching process likelihood. Value is ignored unless ranking_coeffs argument is provided.

  • use_old_mut_parsimony (bool) – Whether to use the deprecated ‘mutability parsimony’ instead of context-based poisson likelihood (only applicable if mutability and substitution files are provided.

Return type:

CollapsedForest

Returns:

The trimmed forest, containing all optimal trees according to the specified criteria, and a tuple of data about the trees in that forest, with format (branching process likelihood, isotype parsimony, context-based Poisson likelihood, alleles).

likelihood_rankplot(outbase, p, q, img_type='svg')[source]

Save a rank plot of likelihoods to the file [outbase].inference.likelihood_rank.[img_type].

n_topologies()[source]

Count the number of topology classes, ignoring internal node sequences.

Return type:

int

iter_topology_classes()[source]

Sort trees by topology class.

Returns:

A generator of CollapsedForest objects, each containing trees with the same topology,

ignoring internal node labels. CollapsedForests will be yielded in reverse-order of the number of trees in each topology class, so that each CollapsedForest will contain at least as many trees as the one that follows.

add_isotypes(isotypemap=None, isotypemap_file=None, idmap=None, idmap_file=None, isotype_names=None)[source]

Adds isotype annotations, including inferred ancestral isotypes, to all nodes in stored trees.

sample_tree()[source]

Sample a random CollapsedTree from the forest.

Return type:

CollapsedTree