historydag.utils
Utility functions and classes for working with HistoryDag objects.
Module Attributes
Provides functions to count the number of nodes in trees. |
|
Provides functions to count the probabilities of histories in a DAG, according to the natural distribution induced by the DAG topology. |
Functions
|
A decorator for conveniently accessing a field in a label. |
|
A decorator for accessing label fields on a HistoryDagNode. |
|
Calculate the fraction of binary trees on total_leaves containing a particular clade containing clade_size leaves. |
|
The cartesian product of iterables in a list. |
Collapse nonleaf nodes that have the same sequence. |
|
Returns the number of binary topologies on n labeled leaves. |
|
|
Returns a generator on clade size, support pairs, for clades which would result from binary resolution of a node posessing child clade sets in node_child_clades. |
|
Provides functions to compute the number of edges in a history which do not appear in a reference HistoryDag. |
|
A decorator to make it easier to expand a Label by a certain field. |
|
Pretty prints a counter Normalizing counts using the number of samples, passed as the argument samples. |
|
A decorator to return a default value if any argument is a UANode. |
|
Return whether the provided tree is collapsed. |
|
Returns a generator on clade, support pairs, for clades which would result from binary resolution of a node posessing child clade sets in node_child_clades. |
|
Load a fasta file as a dictionary, with sequence ids as keys and sequences as values. |
|
A numerically stable implementation of logsumexp, similar to Scipy's. |
|
Provides functions to count newick strings. |
|
Provides functions to compute Robinson-Foulds (RF) distances of trees in a DAG, relative to a fixed reference tree. |
|
Return the downward-conditional edge probability of the edge from parent to child. |
|
Produce all subsets of iterable (as tuples of elements), with sizes starting at start_size and ending at end_size (inclusive), or the size of the passed iterable if end_size is None. |
|
Return product of elements of the input list. |
|
Load a fasta file as a generator which yields (sequence ID, sequence) pairs. |
|
Provides functions to compute the sum over all histories in the provided reference DAG, of rooted RF distances to those histories. |
Classes
|
Container for function keyword arguments to |
|
A subclass of |
|
A subclass of float, with arbitrary, mutable state. |
|
Container for |
|
A subclass of int, with arbitrary, mutable state. |
|
A subclass of string, with arbitrary, mutable state. |
|
Exceptions
- historydag.utils.access_nodefield_default(fieldname, default)[source]
A decorator for accessing label fields on a HistoryDagNode. Converts a function taking some label field’s values as positional arguments, to a function taking HistoryDagNodes as positional arguments.
- Parameters:
- Return type:
For example, instead of lambda n1, n2: default if n1.is_ua_node() or n2.is_ua_node() else func(n1.label.fieldname, n2.label.fieldname), this wrapper allows one to write access_nodefield_default(fieldname, default)(func).
- historydag.utils.access_field(fieldname)[source]
A decorator for conveniently accessing a field in a label.
To be used instead of something like lambda l1, l2: func(l1.fieldname, l2.fieldname). Instead just write access_field(fieldname)(func). Supports arbitrarily many positional arguments, which are all expected to be labels (namedtuples) with field fieldname.
- historydag.utils.ignore_uanode(default)[source]
A decorator to return a default value if any argument is a UANode.
For instance, to allow distance between two nodes to be zero if one is UANode
- historydag.utils.explode_label(labelfield)[source]
A decorator to make it easier to expand a Label by a certain field.
- Parameters:
labelfield (
str
) – the name of the field whose contents the wrapped function is expected to explode- Returns:
A decorator which converts a function which explodes a field value, into a function which explodes the whole label at that field.
- historydag.utils.cartesian_product(optionlist, accum=tuple())[source]
The cartesian product of iterables in a list.
Takes a list of functions which each return a fresh generator on options at that site, and returns a generator yielding tuples, which are elements of the cartesian product of the passed generators’ contents.
- historydag.utils.hist(c, samples=1)[source]
Pretty prints a counter Normalizing counts using the number of samples, passed as the argument samples.
- historydag.utils.is_collapsed(tree)[source]
Return whether the provided tree is collapsed.
Collapsed means that any edge whose target is not a leaf node connects nodes with different sequences.
- Return type:
- historydag.utils.collapse_adjacent_sequences(tree)[source]
Collapse nonleaf nodes that have the same sequence.
- Return type:
TreeNode
- class historydag.utils.AddFuncDict(initialdata, name=None, names=None)[source]
Container for function keyword arguments to
historydag.HistoryDag.weight_count()
. This is primarily useful because it allows instances to be added. Passing the result to weight_count as keyword arguments counts the weights jointly. Ahistorydag.utils.AddFuncDict
which may be passed as keyword arguments tohistorydag.HistoryDag.weight_count()
,historydag.HistoryDag.trim_optimal_weight()
, orhistorydag.HistoryDag.optimal_weight_annotate()
methods to trim or annotate ahistorydag.HistoryDag()
according to the weight that the contained functions implement.For example, dag.weight_count(**(parsimony_utils.hamming_distance_countfuncs + make_newickcountfuncs())) would return a Counter object in which the weights are tuples containing hamming parsimony and newickstrings.
- Parameters:
initialdata – A dictionary containing functions keyed by “start_func”, “edge_weight_func”, and “accum_func”. “start_func” specifies weight assigned to leaf HistoryDagNodes. “edge_weight_func” specifies weight assigned to an edge between two HistoryDagNodes, with the first argument the parent node, and the second argument the child node. “accum_func” specifies how to ‘add’ a list of weights. See
historydag.HistoryDag.weight_count()
for more details.name (
Optional
[str
]) – A string containing a name for the weight to be counted. If a tuple of weights will be returned, usenames
instead.names (
Optional
[Tuple
[str
]]) – A tuple of strings containing names for the weights to be counted, if a tuple of weights will be returned by passed functions. If only a single weight will be returned, usename
instead.
- linear_combination(coeffs, significant_digits=8)[source]
Convert an AddFuncDict implementing a tuple of weights to a linear combination of those weights.
This only works when the weights computed by the AddFuncDict use plain sum as their accum_func. Otherwise, although the resulting AddFuncDict may be usable without errors, its behavior is undefined.
- Parameters:
coeffs – The coefficients to be multiplied with each weight before summing.
significant_digits – To combat floating point errors, only this many digits after the decimal will be significant in comparisons between weights.
- Returns:
A new AddFuncDict object which computes the specified linear combination of weights.
- class historydag.utils.HistoryDagFilter(weight_funcs, optimal_func, ordering_name=None, eq_func=operator.eq)[source]
Container for
historydag.utils.AddFuncDict
and an optimality function optimal_func- Parameters:
weight_func – An
AddFuncDict
objectoptimal_func – A function that specifies how to choose the optimal result from weight_func. e.g. min or max
- historydag.utils.node_countfuncs = {'start_func': <function <lambda>>, 'edge_weight_func': <function <lambda>>, 'accum_func': <built-in function sum>}
Provides functions to count the number of nodes in trees.
For use with
historydag.HistoryDag.weight_count()
.
- historydag.utils.natural_edge_probability(parent, child)[source]
Return the downward-conditional edge probability of the edge from parent to child.
This is defined as 1/n, where n is the number of edges descending from the same child clade of
parent
as this edge.
- historydag.utils.log_natural_probability_funcs = {'start_func': <function <lambda>>, 'edge_weight_func': <function <lambda>>, 'accum_func': <built-in function sum>}
Provides functions to count the probabilities of histories in a DAG, according to the natural distribution induced by the DAG topology.
- historydag.utils.sum_rfdistance_funcs(reference_dag, rooted=True, one_sided=None, one_sided_coefficients=(1, 1))[source]
Provides functions to compute the sum over all histories in the provided reference DAG, of rooted RF distances to those histories.
- Parameters:
reference_dag (
HistoryDag
) – The reference DAG. The sum will be computed over all RF distances to histories in this DAGrooted (
bool
) – If False, use edges’ splits for RF distance computation. Otherwise, use the clade below each edge.one_sided (
Optional
[str
]) – May be ‘left’, ‘right’, or None. ‘left’ means that we count splits (or clades, in the rooted case) which are in the reference trees but not in the DAG tree, especially useful if trees in the DAG might be resolutions of multifurcating trees in the reference DAG. ‘right’ means that we count splits or clades in the DAG tree which are not in the reference trees, useful if the reference trees are possibly resolutions of multifurcating trees in the DAG. If not None, one_sided_coefficients are ignored.one_sided_coefficients (
Tuple
[float
,float
]) – coefficients for non-standard symmetric difference calculations (explained in notes below)
The reference DAG must have the same taxa as all the trees in the DAG on which these count functions are used. If this is not true, methods using the keyword arguments produced by this function may fail silently, returning values which mean nothing.
This function allows computation of sums of a Robinson-Foulds distance generalized by the coefficients
(s, t)
provided to theone_sided_coefficients
argument (or implicitly set by theone_sided
argument). Given a tree in the DAG with set of clades (or splits) A, and a tree in the reference DAG with set of clades B, this distance is given by:d_{s,t}(A, B) = s|B - A| + t|A - B|
Notice that when s and t are both 1, this is the symmetric difference of A and B, the standard RF distance.
For each tree A in a DAG, the AddFuncDict returned by this function computes the sum of this distance over all trees B in the reference DAG.
Note that when computing unrooted weights, the sums are over all rooted trees in the reference DAG, so a single unrooted tree contained twice in the reference DAG with different rootings will be counted twice.
Weights are represented by an IntState object and are shifted by a constant K, which is the sum of number of clades in each tree in the DAG.
- historydag.utils.make_rfdistance_countfuncs(ref_tree, rooted=False, one_sided=None, one_sided_coefficients=(1, 1))[source]
Provides functions to compute Robinson-Foulds (RF) distances of trees in a DAG, relative to a fixed reference tree.
We use
ete3.TreeNode.robinson_foulds()
as the reference implementation for unrooted RF distance.Rooted Robinson-Foulds is simply the cardinality of the symmetric difference of the clade sets of two trees, including the root clade. Since we include the root clade in this calculation, our rooted RF distance does not match the rooted
ete3.TreeNode.robinson_foulds()
implementation.- Parameters:
ref_tree (
HistoryDag
) – A tree with respect to which Robinson-Foulds distance will be computed.rooted (
bool
) – If False, use edges’ splits for RF distance computation. Otherwise, use the clade below each edge.one_sided (
Optional
[str
]) – May be ‘left’, ‘right’, or None. ‘left’ means that we count splits (or clades, in the rooted case) which are in the reference tree but not in the DAG tree, especially useful if trees in the DAG might be resolutions of a multifurcating reference. ‘right’ means that we count splits or clades in the DAG tree which are not in the reference tree, useful if the reference tree is possibly a resolution of multifurcating trees in the DAG. If not None, one_sided_coefficients are ignored.one_sided_coefficients (
Tuple
[float
,float
]) – coefficients for non-standard symmetric difference calculations (explained in notes below)
The reference tree must have the same taxa as all the trees in the DAG.
This calculation relies on the observation that the symmetric distance between the splits (or clades, in the rooted case) A in a tree in the DAG, and the splits (or clades) B in the reference tree, can be computed as:
|B ^ A| = |B - A| + |A - B| = |B| - |A n B| + |A - B|
As long as tree edges are in bijection with splits, this can be computed without constructing the set A by considering each edge’s split independently.
In order to accommodate multiple edges with the same split in a tree with root bifurcation, we keep track of the contribution of such edges separately.
One-sided RF distances are computed in this framework by introducing a pair of
one_sided_coefficients
(s, t)
, which affect how much weight is given to the right and left differences in the RF distance calculation:d_{s,t}(A, B) = s|B - A| + t|A - B| = s(|B| - |A n B|) + t|A - B|
When both
s
andt
are 1, we get the standard RF distance. Whens=1
andt=0
, then we have a one-sided “left” RF difference, counting the number of splits in the reference tree which are not in each DAG tree. Whenone_sided
is set to left, then these coefficients will be used, regardless of the values passed. Whens=0
andt=1
, then we have a one-sided “right” RF difference, counting the number of splits in each DAG tree which are not in the reference. Whenone_sided
is set to right, these coefficients will be used, regardless of the values passed.The weight type is a tuple wrapped in an IntState object. The first tuple value a is the contribution of edges which are not part of a root bifurcation, where edges whose splits are in B contribute -1, and edges whose splits are not in B contribute 1, and the second tuple value b is the contribution of the edges which are part of a root bifurcation. The value of the IntState is computed as a + sign(b) + |B|, which on the UA node of the hDAG gives RF distance.
- historydag.utils.make_newickcountfuncs(name_func=lambda n: ..., features=None, feature_funcs={}, internal_labels=True, collapse_leaves=False)[source]
Provides functions to count newick strings. For use with
historydag.HistoryDag.weight_count()
.Arguments are the same as for
historydag.HistoryDag.to_newick()
.
- historydag.utils.edge_difference_funcs(reference_dag, key=lambda n: ...)[source]
Provides functions to compute the number of edges in a history which do not appear in a reference HistoryDag.
This is useful for taking history-wise intersections of DAGs, or counting the number of histories which would appear in such an intersection.
- Parameters:
reference_dag (
HistoryDag
) – The reference DAG. These functions will count the number of edges in a history which do not appear in this DAG.- Returns:
utils.AddFuncDict
object for use with HistoryDag methods for trimming and weight counting/annotation.
- historydag.utils.prod(ls)[source]
Return product of elements of the input list.
if passed list is empty, returns 1.
- historydag.utils.logsumexp(ls)[source]
A numerically stable implementation of logsumexp, similar to Scipy’s.
- class historydag.utils.IntState(*args, **kwargs)[source]
A subclass of int, with arbitrary, mutable state.
State is provided to the constructor as the keyword argument
state
. All other arguments will be passed toint
constructor. Instances should be functionally indistinguishable fromint
.
- class historydag.utils.FloatState(*args, **kwargs)[source]
A subclass of float, with arbitrary, mutable state.
State is provided to the constructor as the keyword argument
state
. All other arguments will be passed tofloat
constructor. Instances should be functionally indistinguishable fromfloat
.
- class historydag.utils.DecimalState(*args, **kwargs)[source]
A subclass of
decimal.Decimal
, with arbitrary, mutable state.State is provided to the constructor as the keyword argument
state
. All other arguments will be passed toDecimal
constructor. Instances should be functionally indistinguishable fromDecimal
.
- class historydag.utils.StrState(*args, **kwargs)[source]
A subclass of string, with arbitrary, mutable state.
State is provided to the constructor as the keyword argument
state
. All other arguments will be passed tostr
constructor. Instances should be functionally indistinguishable fromstr
.
- historydag.utils.count_labeled_binary_topologies(n)[source]
Returns the number of binary topologies on n labeled leaves.
In these topologies, left and right branches are not distinguished, and internal nodes are not ranked.
- historydag.utils.powerset(iterable, start_size=0, end_size=None)[source]
Produce all subsets of iterable (as tuples of elements), with sizes starting at start_size and ending at end_size (inclusive), or the size of the passed iterable if end_size is None.
- historydag.utils.binary_support(clade_size, total_leaves, normalized=True)[source]
Calculate the fraction of binary trees on total_leaves containing a particular clade containing clade_size leaves.
If normalized is False, instead returns the number of binary topologies which would contain a particular clade of size clade_size.
- historydag.utils.count_resolved_clade_supports(n_child_clades, threshold=-1, min_size=1, max_size=None)[source]
Returns a generator on clade size, support pairs, for clades which would result from binary resolution of a node posessing child clade sets in node_child_clades.
Clade size means number of children of this node which are grouped below a node.
Summing over the first element of yielded tuples gives the number of elements which would be yielded by
iter_resolved_clade_supports()
provided with n_child_clades child clades and the same threshold value.- Parameters:
n_child_clades – The number of children of the multifurcating node
threshold – If a resolved node’s clade support value is below this threshold, that clade will not be counted.
min_size – The minimum size of a clade to be counted.
max_size – The (inclusive) maximum size of a clade to be counted. The maximum value is
len(node_child_clades)
, which is equivalent to the default value.
Note that by default, the root clade, including all leaves contained in node_child_clades, as well as all the clades contained in node_child_clades, are included and each have a support of 1. To exclude leaves, pass
min_size=2
.
- historydag.utils.iter_resolved_clade_supports(node_child_clades, threshold=-1, min_size=1, max_size=None)[source]
Returns a generator on clade, support pairs, for clades which would result from binary resolution of a node posessing child clade sets in node_child_clades. All clades with support > threshold are yielded, avoiding iteration through too many clades on large multifurcations.
- Parameters:
node_child_clades – A set of frozensets containing the clades of the children of a multifurcating node.
threshold – If a resolved node’s clade support value is below this threshold, that clade will not be yielded.
min_size – The minimum size of a clade to be yielded. Note this is NOT simply the size of the clade set, but rather the number of children of the multifurcating node which are below the resolved node corresponding to the clade.
max_size – The (inclusive) maximum size of a clade to be yielded. See min_size for a description of what size means. The maximum value is
len(node_child_clades)
, which is equivalent to the default value.
Note that by default, the root clade, including all leaves contained in node_child_clades, as well as all the clades contained in node_child_clades, are included and each have a support of 1. To exclude leaves, pass
min_size=2
.
- historydag.utils.read_fasta(fastapath, sequence_type=str)[source]
Load a fasta file as a generator which yields (sequence ID, sequence) pairs.
The function
sequence_type
will be called on each sequence as it is read from the fasta file, and the resulting object will be yielded as the second item in each sequence record pair.
- historydag.utils.load_fasta(fastapath, sequence_type=str)[source]
Load a fasta file as a dictionary, with sequence ids as keys and sequences as values.
The function
sequence_type
will be called on each sequence as it is read from the fasta file, and the returned objects will be the values in the resulting alignment dictionary.