gctree.branching_processes.CollapsedForest

class gctree.branching_processes.CollapsedForest(forest=None)[source]

Bases: object

A collection of trees.

We can initialize with a list of trees, each an instance of ete3.Tree or CollapsedTree, or we can simulate the forest later.

n_trees

number of trees in forest

parameters

fitted branching process parameters if mle() has been run, otherwise None

Parameters:

forest (Optional[List[Union[CollapsedTree, TreeNode]]]) – list of trees, each an ete3.Tree or CollapsedTree

Methods

add_isotypes

Adds isotype annotations, including inferred ancestral isotypes, to all nodes in stored trees.

filter_trees

Filter trees according to specified criteria.

iter_topology_classes

Sort trees by topology class.

likelihood_rankplot

Save a rank plot of likelihoods to the file [outbase].inference.likelihood_rank.[img_type].

ll

Log likelihood of branching process parameters \((p, q)\) given tree topologies \(T_1, \dots, T_n\) and corresponding genotype abundance vectors \(A_1, \dots, A_n\) for each of \(n\) trees in the forest.

mle

Maximum likelihood estimate of \((p, q)\).

n_topologies

Count the number of topology classes, ignoring internal node sequences.

sample_tree

Sample a random CollapsedTree from the forest.

simulate

Simulate a forest of collapsed trees.

simulate(p, q, n_trees)[source]

Simulate a forest of collapsed trees. Overwrites existing forest attribute.

Parameters:
  • p (float64) – branching probability

  • q (float64) – mutation probability

  • n_trees (int) – number of trees

ll(p, q, marginal=False)[source]

Log likelihood of branching process parameters \((p, q)\) given tree topologies \(T_1, \dots, T_n\) and corresponding genotype abundance vectors \(A_1, \dots, A_n\) for each of \(n\) trees in the forest.

If marginal=False (the default), compute the joint log likelihood

\[\ell(p, q; T, A) = \sum_{i=1}^n\log\mathbb{P}(T_i, A_i \mid p, q),\]

otherwise compute the marginal log likelihood

\[\ell(p, q; T, A) = \log\left(\sum_{i=1}^n\mathbb{P}(T_i, A_i \mid p, q)\right).\]

Parameters:
  • p (float64) – branching probability

  • q (float64) – mutation probability

  • marginal (bool) – if True, compute the marginal log likelihood over trees; otherwise compute the joint log likelihood of trees

Return type:

Tuple[float64, ndarray]

Returns:

Log branching process likelihood \(\ell(p, q; T, A)\) and its gradient \(\nabla\ell(p, q; T, A)\)
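The difference between the two formulas can be illustrated without gctree itself. The following is a minimal sketch, assuming the per-tree log probabilities \(\log\mathbb{P}(T_i, A_i \mid p, q)\) are already available; the helper names and the numeric values are hypothetical and are not part of the gctree API.

```python
import math


def joint_log_likelihood(log_probs):
    """Joint form: sum_i log P(T_i, A_i | p, q)."""
    return sum(log_probs)


def marginal_log_likelihood(log_probs):
    """Marginal form: log(sum_i P(T_i, A_i | p, q)).

    Uses the log-sum-exp trick so that very small probabilities
    do not underflow to zero before the logarithm is taken.
    """
    m = max(log_probs)
    return m + math.log(sum(math.exp(lp - m) for lp in log_probs))


# Hypothetical per-tree probabilities, for illustration only.
tree_probs = [0.2, 0.05, 0.01]
log_probs = [math.log(x) for x in tree_probs]

joint = joint_log_likelihood(log_probs)        # log(0.2 * 0.05 * 0.01)
marginal = marginal_log_likelihood(log_probs)  # log(0.2 + 0.05 + 0.01)
```

The joint form multiplies the per-tree probabilities, while the marginal form adds them, so the marginal value is never smaller than the largest single per-tree log probability.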

mle(**kwargs)[source]

Maximum likelihood estimate of \((p, q)\).

\[(p, q) = \arg\max_{p,q\in [0,1]}\ell(p, q)\]

Parameters:

kwargs – keyword arguments passed along to the branching process likelihood CollapsedForest.ll()

Return type:

Tuple[float64, float64]

Returns:

Tuple \((p, q)\) with estimated branching probability and estimated mutation probability
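Because ll() returns both a value and a gradient, the maximization over \([0,1]^2\) can be pictured as a gradient-based search. The sketch below is an assumption-laden illustration, not gctree's optimizer: it runs plain projected gradient ascent on a hypothetical toy objective (toy_ll) that mimics the (value, gradient) return shape of ll(). The function names, step size, and iteration count are all invented for the example.

```python
import math


def toy_ll(p, q):
    """Hypothetical stand-in for a log likelihood; NOT the gctree likelihood.

    Returns (value, gradient), mirroring the return shape documented for ll().
    Its maximizer is known in closed form: p = 3/10, q = 2/10.
    """
    value = (3 * math.log(p) + 7 * math.log(1 - p)
             + 2 * math.log(q) + 8 * math.log(1 - q))
    grad = (3 / p - 7 / (1 - p), 2 / q - 8 / (1 - q))
    return value, grad


def maximize(ll, p=0.5, q=0.5, step=0.01, n_iter=1000, eps=1e-6):
    """Projected gradient ascent constrained to [eps, 1 - eps]^2."""
    for _ in range(n_iter):
        _, (grad_p, grad_q) = ll(p, q)
        # Take an ascent step, then clip back into the feasible box.
        p = min(max(p + step * grad_p, eps), 1 - eps)
        q = min(max(q + step * grad_q, eps), 1 - eps)
    return p, q


p_hat, q_hat = maximize(toy_ll)
```

Clipping to a small margin inside the unit square keeps the logarithms in the toy objective finite; a real optimizer would typically handle the bounds and the step size adaptively.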

filter_trees(ranking_strategy=None, mutability_file=None, substitution_file=None, ignore_isotype=False, chain_split=None, verbose=False, outbase='gctree.out', summarize_forest=False, tree_stats=False, img_type='svg', ranking_coeffs=None, branching_process_ranking_coeff=-1, use_old_mut_parsimony=False)[source]

Filter trees according to specified criteria.

By default, the forest will be trimmed to maximize branching process likelihood, then minimize isotype parsimony, then maximize context-based Poisson likelihood, and finally minimize the number of alleles. Any criterion for which the necessary arguments are not provided is ignored.

For other ranking strategies, see the ranking_strategy argument.
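The default ordering described above behaves like a lexicographic sort: later criteria only break ties left by earlier ones. A minimal sketch of that idea follows, assuming each tree has already been scored; the TreeScores record, the helper names, and the scores are hypothetical and are not gctree objects.

```python
from typing import NamedTuple


class TreeScores(NamedTuple):
    """Hypothetical per-tree scores, for illustration only."""
    name: str
    bp_loglik: float          # branching process log likelihood (maximize)
    isotype_parsimony: float  # (minimize)
    context_loglik: float     # context-based Poisson log likelihood (maximize)
    n_alleles: int            # (minimize)


def ranking_key(t):
    # Negate the quantities to be maximized so that a smaller key is better.
    return (-t.bp_loglik, t.isotype_parsimony, -t.context_loglik, t.n_alleles)


def best_trees(trees):
    """Keep every tree that ties for the lexicographically smallest key."""
    best = min(ranking_key(t) for t in trees)
    return [t for t in trees if ranking_key(t) == best]


trees = [
    TreeScores("a", -10.0, 3, -5.0, 4),
    TreeScores("b", -10.0, 2, -6.0, 4),
    TreeScores("c", -12.0, 1, -4.0, 3),
    TreeScores("d", -10.0, 2, -6.0, 4),
]
kept = [t.name for t in best_trees(trees)]
```

Tree "c" has the best isotype parsimony but is dropped first because its branching process likelihood is lower. Trees "b" and "d" tie on every criterion, so both are kept.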

Parameters:
  • ranking_strategy (Optional[str]) – A string expression describing how to rank trees. See docs for command line argument --ranking_strategy for a description.

  • mutability_file (Optional[str]) – A mutability model

  • substitution_file (Optional[str]) – A substitution model

  • ignore_isotype (bool) – Ignore isotype parsimony when ranking. By default, isotype information added with add_isotypes() will be used to compute isotype parsimony, which is used in ranking.

  • chain_split (Optional[int]) – The index at which non-adjacent sequences are concatenated, for calculating context-based Poisson likelihood.

  • verbose (bool) – print information about trimming

  • outbase (str) – file name stem for a file with information for each tree in the DAG.

  • summarize_forest (bool) – whether to write a summary of the forest to file [outbase].forest_summary.log

  • tree_stats (bool) – whether to write stats for each tree in the forest to file [outbase].tree_stats.log

  • img_type (str) – format for output plots.

  • ranking_coeffs (Optional[Sequence[float]]) – (Deprecated. Use ranking_strategy instead) A list or tuple of coefficients for prioritizing tree weights. The order of coefficients is: isotype parsimony score, context-based Poisson likelihood, and number of alleles. A coefficient of -1 will be applied to branching process likelihood by default, unless a different value is provided to the keyword argument branching_process_ranking_coeff. Trees are chosen to minimize this linear combination of tree weights, so weights for which larger values are better (such as likelihoods) should have negative coefficients.

  • branching_process_ranking_coeff (float) – (Deprecated. Use ranking_strategy instead) Ranking coefficient to use for branching process likelihood. Value is ignored unless ranking_coeffs argument is provided.

  • use_old_mut_parsimony (bool) – (Deprecated. Use ranking_strategy instead) Whether to use the deprecated ‘mutability parsimony’ instead of context-based Poisson likelihood (only applicable if mutability and substitution files are provided).

Return type:

CollapsedForest

Returns:

The trimmed forest, containing all optimal trees according to the specified criteria, and a tuple of data about the trees in that forest, with format (branching process likelihood, isotype parsimony, context-based Poisson likelihood, alleles).
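The deprecated ranking_coeffs behaviour amounts to minimizing a linear combination of the four tree weights. The sketch below shows that arithmetic under stated assumptions: the weight tuples follow the (branching process likelihood, isotype parsimony, context-based Poisson likelihood, alleles) order given above, while the combined_score helper, the tree names, and all numbers are hypothetical.

```python
def combined_score(weights, ranking_coeffs, branching_process_ranking_coeff=-1):
    """Linear combination of tree weights; smaller is better.

    weights: (bp likelihood, isotype parsimony, context likelihood, alleles)
    ranking_coeffs: (isotype coeff, context likelihood coeff, alleles coeff)
    """
    bp_loglik, isotype_parsimony, context_loglik, n_alleles = weights
    c_isotype, c_context, c_alleles = ranking_coeffs
    return (branching_process_ranking_coeff * bp_loglik
            + c_isotype * isotype_parsimony
            + c_context * context_loglik
            + c_alleles * n_alleles)


# Hypothetical weights for three trees, for illustration only.
forest_weights = {
    "a": (-10.0, 3, -5.0, 4),
    "b": (-10.0, 2, -6.0, 4),
    "c": (-12.0, 1, -4.0, 3),
}
# Likelihoods get negative coefficients because larger values are better.
coeffs = (1.0, -1.0, 0.5)

scores = {name: combined_score(w, coeffs) for name, w in forest_weights.items()}
best_score = min(scores.values())
best = [name for name, s in scores.items() if s == best_score]
```

With these coefficients, tree "c" wins despite its lower branching process likelihood, which shows how a linear combination can trade criteria off against each other rather than applying them strictly in order.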

likelihood_rankplot(outbase, p, q, img_type='svg')[source]

Save a rank plot of likelihoods to the file [outbase].inference.likelihood_rank.[img_type].

n_topologies()[source]

Count the number of topology classes, ignoring internal node sequences.

Return type:

int

iter_topology_classes()[source]

Sort trees by topology class.

Returns:

A generator of CollapsedForest objects, each containing trees with the same topology, ignoring internal node labels. CollapsedForests will be yielded in reverse-order of the number of trees in each topology class, so that each CollapsedForest will contain at least as many trees as the one that follows.
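Grouping by topology class and yielding the largest classes first can be sketched with plain Python data. This is an illustration only, not the gctree implementation: trees are modelled as hypothetical (label, children) tuples, and the sketch assumes that leaf labels are kept in the topology key while internal labels and child order are ignored.

```python
from collections import defaultdict


def topology_key(tree):
    """Canonical string for a tree, ignoring internal labels and child order."""
    label, children = tree
    if not children:
        return label  # assumption: leaf labels are part of the topology
    # Sorting the child keys makes the result independent of child order.
    return "(" + ",".join(sorted(topology_key(c) for c in children)) + ")"


def iter_topology_classes(trees):
    """Yield lists of trees sharing a topology, largest class first."""
    classes = defaultdict(list)
    for tree in trees:
        classes[topology_key(tree)].append(tree)
    yield from sorted(classes.values(), key=len, reverse=True)


t1 = ("x", [("A", []), ("y", [("B", []), ("C", [])])])
# Same topology as t1, with different internal labels and child order.
t2 = ("z", [("w", [("C", []), ("B", [])]), ("A", [])])
# Different topology: B is the outgroup here.
t3 = ("x", [("B", []), ("y", [("A", []), ("C", [])])])

class_sizes = [len(group) for group in iter_topology_classes([t3, t1, t2])]
n_topology_classes = len(class_sizes)
```

Counting the yielded groups gives the same kind of number that n_topologies() reports, under the assumptions stated above.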

add_isotypes(isotypemap=None, isotypemap_file=None, idmap=None, idmap_file=None, isotype_names=None)[source]

Adds isotype annotations, including inferred ancestral isotypes, to all nodes in stored trees.

sample_tree()[source]

Sample a random CollapsedTree from the forest.

Return type:

CollapsedTree