historydag.parsimony_utils
This module provides tools for describing and computing parsimony and weighted parsimony, and for describing allowed characters and ambiguity codes.
AmbiguityMap
stores a collection of Characters
(for example the strings
'A'
, 'G'
, 'C'
, and 'T'
) that are considered valid states, as well as
a mapping between other Characters
(‘ambiguity codes’) and subsets of these allowed
characters. This type allows forward- and backward-lookup of ambiguity codes.
TransitionModel
describes transition weights between an ordered collection of
Characters
, and provides methods for computing weighted parsimony. Functions in
historydag.parsimony
accept TransitionModel
objects to customize the implementation
of Sankoff, for example. This class has two subclasses: UnitTransitionModel
,
which describes unit transition costs between non-identical Characters
,
and SitewiseTransitionModel
, which allows transition costs to depend on the
location in a sequence in which a transition occurs.
Module Attributes
The standard IUPAC nucleotide ambiguity map, in which 'N', '?', and '-' are all considered fully ambiguous. |
|
The standard IUPAC nucleotide ambiguity map, in which '-' is treated as a fifth base, '?' is fully ambiguous, and the ambiguity 'N' excludes '-'. |
|
A transition model for nucleotides using unit transition weights, and the standard IUPAC ambiguity codes, with base order |
|
A transition model for nucleotides using unit transition weights, and the standard IUPAC nucleotide ambiguity map, with '-' is treated as a fifth base, '?' fully ambiguous, and the ambiguity 'N' excluding only '-'. |
|
|
An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the sequences stored in the |
|
An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the sequences stored in the |
|
An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the compact genomes stored in the |
An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the compact genomes stored in the |
|
Provides functions to count hamming distance parsimony when leaf sequences may be ambiguous. |
|
Provides functions to count hamming distance parsimony when leaf sequences may be ambiguous. |
|
Provides functions to count hamming distance parsimony when sequences are stored as CompactGenomes. |
|
Provides functions to count hamming distance parsimony when leaf compact genomes may be ambiguous. |
Functions
|
An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the compact genomes stored in the |
An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the compact genomes stored in the |
|
|
An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the sequences stored in the |
|
An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the sequences stored in the |
Classes
|
Implements a bijection between a subset of the power set of an alphabet of bases, and an expanded alphabet of ambiguity codes. |
A subclass of frozendict for holding the reversed map in a |
|
|
A subclass of |
|
A class describing a collection of states, and the transition costs between them, for weighted parsimony. |
|
A subclass of |
- class historydag.parsimony_utils.AmbiguityMap(ambiguity_character_map, bases=None, reversed_defaults={})[source]
Implements a bijection between a subset of the power set of an alphabet of bases, and an expanded alphabet of ambiguity codes.
To look up the set of bases represented by an ambiguity code, use the object like a dictionary. To look up an ambiguity code representing a set of bases, use the
reversed
attribute.- Parameters:
ambiguity_character_map (
Mapping
[Any
,Iterable
[Any
]]) – A mapping from ambiguity codes to collections of bases represented by that code.bases (
Optional
[Iterable
[Any
]]) – A collection of bases. If not provided, this will be inferred from the ambiguity_character_map.reversed_defaults (
Mapping
[Iterable
[Any
],Any
]) – When ambiguity_character_map is not injective (i.e. multiple ambiguous characters map to the same set of bases) one might want to specify which character that set of bases maps to in the reversed map. If not provided, the choice will be made arbitrarily.
- is_ambiguous(sequence)[source]
Returns whether the provided sequence contains any non-base characters (such as ambiguity codes).
- Return type:
- sequence_resolutions(sequence)[source]
Iterates through possible disambiguations of sequence, recursively.
Recursion-depth-limited by number of ambiguity codes in sequence, not sequence length.
- get_sequence_resolution_func(field_name)[source]
Returns a function which takes a Label, and returns a generator on labels containing all possible resolutions of the sequence in that node’s label’s
field_name
attribute.- Return type:
Callable
[[Union
[NamedTuple
,UALabel
]],Generator
[Union
[NamedTuple
,UALabel
],None
,None
]]
- sequence_resolution_count(sequence)[source]
Count the number of possible sequence resolutions.
Equivalent to the number of items yielded by
sequence_resolutions()
.- Return type:
- class historydag.parsimony_utils.ReversedAmbiguityMap[source]
A subclass of frozendict for holding the reversed map in a
AmbiguityMap
instance.
- historydag.parsimony_utils.standard_nt_ambiguity_map = <historydag.parsimony_utils.AmbiguityMap object>
The standard IUPAC nucleotide ambiguity map, in which ‘N’, ‘?’, and ‘-’ are all considered fully ambiguous.
- historydag.parsimony_utils.standard_nt_ambiguity_map_gap_as_char = <historydag.parsimony_utils.AmbiguityMap object>
The standard IUPAC nucleotide ambiguity map, in which ‘-’ is treated as a fifth base, ‘?’ is fully ambiguous, and the ambiguity ‘N’ excludes ‘-‘.
- class historydag.parsimony_utils.TransitionModel(bases=tuple('AGCT'), transition_weights=None, ambiguity_map=None)[source]
A class describing a collection of states, and the transition costs between them, for weighted parsimony.
In addition to them methods defined below, there are also attributes which are created by the constructor, which should be ensured to be correct in any subclass of TransitionModel:
base_indices
is a dictionary mapping bases inself.bases
to their indicesmask_vectors
is a dictionary mapping ambiguity codes (including non-ambiguous bases)to vectors of floats which are 0 at indices compatible with the ambiguity code, and infinity otherwise.
bases
is a tuple recording the correct order of unambiguous bases
- Parameters:
bases (
Sequence
[Any
]) – An ordered collection of valid character statestransition_weights (
Optional
[Sequence
[Sequence
[float
]]]) – A matrix describing transition costs between bases, with index order (from_base, to_base)ambiguity_map (
Optional
[AmbiguityMap
]) – AnAmbiguityMap
object describing ambiguity codes. If not provided, the default mapping depends on the provided bases. If provided bases areAGCT
, thenstandard_nt_ambiguity_map
is used. If bases areAGCT-
, thenstandard_nt_ambiguity_map_gap_as_char
is used. Otherwise, it is assumed that no ambiguity codes are allowed. To override these defaults, provide anAmbiguityMap
explicitly.
- get_adjacency_array(seq_len)[source]
Return an adjacency array for a sequence of length
seq_len
.This is a matrix containing transition costs, with index order (site, from_base, to_base)
- Return type:
array
- get_ambiguity_from_tuple(tup)[source]
Retrieve an ambiguity code encoded by tup, which encodes a set of bases with a tuple of 0’s and 1’s.
For example, if
self.bases
is ‘AGCT’, then ``(0, 1, 1, 1)
would return A.- Return type:
- character_distance(parent_char, child_char, site=None)[source]
Return the transition cost from parent_char to child_char, two unambiguous characters.
keyword argument
site
is ignored in this base class.- Return type:
- weighted_hamming_distance(parent_seq, child_seq)[source]
Return the sum of sitewise transition costs, from parent_seq to child_seq.
Both parent_seq and child_seq are expected to be unambiguous.
- Return type:
- weighted_cg_hamming_distance(parent_cg, child_cg)[source]
Return the sum of sitewise transition costs, from parent_cg to child_cg.
Both parent_seq and child_seq are expected to be unambiguous, but sites where parent_cg and child_cg both match their reference sequence are ignored, so this method is not suitable for weighted parsimony for transition matrices that contain nonzero entries along the diagonal.
- Return type:
- min_character_mutation(parent_char, child_char, site=None, **kwargs)[source]
Allowing child_char to be ambiguous, returns the minimum possible transition weight between parent_char and child_char, preceded in a tuple by the unambiguous child character that achieves that weight.
Keyword argument
site
is ignored in this base class.- Return type:
- min_character_distance(parent_char, child_char, site=None)[source]
Allowing child_char to be ambiguous, returns the minimum possible transition weight between parent_char and child_char.
Keyword argument
site
is ignored in this base class.- Return type:
- min_weighted_hamming_distance(parent_seq, child_seq)[source]
Assuming the child_seq may contain ambiguous characters, returns the minimum possible transition weight between parent_seq and child_seq.
- Return type:
- min_weighted_cg_hamming_distance(parent_cg, child_cg)[source]
Return the sum of sitewise transition costs, from parent_cg to child_cg.
child_cg may contain ambiguous characters, and in this case those sites will contribute the minimum possible transition cost to the returned value.
Sites where parent_cg and child_cg both match their reference sequence are ignored, so this method is not suitable for weighted parsimony for transition matrices that contain nonzero entries along the diagonal.
- Return type:
- weighted_hamming_edge_weight(field_name)[source]
Returns a function for computing weighted hamming distance between two nodes’ sequences.
- Parameters:
field_name (
str
) – The name of the node label field which contains sequences.- Return type:
- Returns:
A function accepting two
HistoryDagNode
objectsn1
andn2
and returning a float: the transition cost fromn1.label.<field_name>
ton2.label.<field_name>
, or 0 if n1 is the UA node.
- weighted_cg_hamming_edge_weight(field_name, count_root_muts=True)[source]
Returns a function for computing weighted hamming distance between two nodes’ compact genomes.
- Parameters:
field_name (
str
) – The name of the node label field which contains compact genomes.count_root_muts (
bool
) – If True, the reference sequence of root nodes’ compact genomes is considered to be the ancestral sequence of each history, and mutations from the reference to the history root CG are counted.
- Return type:
- Returns:
A function accepting two
HistoryDagNode
objectsn1
andn2
and returning a float: the transition cost fromn1.label.<field_name>
ton2.label.<field_name>
, or the number of mutations between the reference and the CG of n2 if n1 is the UA node.
Sites where parent_cg and child_cg both match their reference sequence are ignored, so this method is not suitable for weighted parsimony for transition matrices that contain nonzero entries along the diagonal.
- min_weighted_hamming_edge_weight(field_name)[source]
Returns a function for computing weighted hamming distance between two nodes’ sequences.
If the child node is a leaf node, and its sequence contains ambiguities, the minimum possible transition cost from parent to child will be returned.
- Parameters:
field_name (
str
) – The name of the node label field which contains sequences.- Return type:
- Returns:
A function accepting two
HistoryDagNode
objectsn1
andn2
and returning a float: the transition cost fromn1.label.<field_name>
ton2.label.<field_name>
, or 0 if n1 is the UA node.
- min_weighted_cg_hamming_edge_weight(field_name, count_root_muts=True)[source]
Returns a function for computing weighted hamming distance between two nodes’ compact genomes.
If the child node is a leaf node, and its compact genome contains ambiguities, the minimum possible transition cost from parent to child will be returned by this function.
- Parameters:
field_name (
str
) – The name of the node label field which contains compact genomes.count_root_muts (
bool
) – If True, the reference sequence of root nodes’ compact genomes is considered to be the ancestral sequence of each history, and mutations from the reference to the history root CG are counted.
- Return type:
- Returns:
A function accepting two
HistoryDagNode
objectsn1
andn2
and returning a float: the transition cost fromn1.label.<field_name>
ton2.label.<field_name>
, or 0 if n1 is the UA node.
- get_weighted_cg_parsimony_countfuncs(field_name, leaf_ambiguities=False, count_root_muts=True, name='WeightedParsimony')[source]
Create a
historydag.utils.AddFuncDict
object for counting weighted parsimony in a HistoryDag with labels containing compact genomes.- Parameters:
field_name (
str
) – the label field name in which compact genomes can be foundleaf_ambiguities (
bool
) – if True, leaf compact genomes are permitted to contain ambiguity codescount_root_muts (
bool
) – If True, the reference sequence of root nodes’ compact genomes is considered to be the ancestral sequence of each history, and mutations from the reference to the history root CG are counted.name (
str
) – the name for the returned AddFuncDict object
- Return type:
- get_weighted_parsimony_countfuncs(field_name, leaf_ambiguities=False, name='WeightedParsimony')[source]
Create a
historydag.utils.AddFuncDict
object for counting weighted parsimony in a HistoryDag with labels containing sequences.
- class historydag.parsimony_utils.UnitTransitionModel(bases='AGCT', ambiguity_map=None)[source]
A subclass of
TransitionModel
to be used when all non-identical transitions have unit cost.- Parameters:
bases (
Sequence
[Any
]) – An ordered container of allowed character statesambiguity_map (
Optional
[AmbiguityMap
]) – AAmbiguityMap
object describing allowed ambiguity codes. If not provided, a default will be chosen by the same method as the constructor forTransitionModel
.
- character_distance(parent_char, child_char, site=None)[source]
Return the transition cost from parent_char to child_char, two unambiguous characters.
keyword argument
site
is ignored in this subclass.- Return type:
- min_character_mutation(parent_char, child_char, site=None, randomize=False, **kwargs)[source]
Allowing child_char to be ambiguous, returns the minimum possible transition weight between parent_char and child_char, preceded in a tuple by the unambiguous child character that achieves that weight.
Keyword argument
site
is ignored in this base class.- Return type:
- get_weighted_cg_parsimony_countfuncs(field_name, leaf_ambiguities=False, name='HammingParsimony')[source]
Create a
historydag.utils.AddFuncDict
object for counting parsimony in a HistoryDag with labels containing compact genomes.- Parameters:
field_name (
str
) – the label field name in which compact genomes can be foundleaf_ambiguities (
AmbiguityMap
) – if True, leaf compact genomes are permitted to contain ambiguity codesname (
str
) – the name for the returned AddFuncDict object
- Return type:
- get_weighted_parsimony_countfuncs(field_name, leaf_ambiguities=False, name='HammingParsimony')[source]
Create a
historydag.utils.AddFuncDict
object for counting weighted parsimony in a HistoryDag with labels containing sequences.- Parameters:
field_name (
str
) – the label field name in which sequences can be foundleaf_ambiguities (
AmbiguityMap
) – if True, leaf sequences are permitted to contain ambiguity codesname (
str
) – the name for the returned AddFuncDict object
- Return type:
- class historydag.parsimony_utils.SitewiseTransitionModel(bases='AGCT', transition_matrix=None, ambiguity_map=None)[source]
A subclass of
TransitionModel
to be used when transition costs depend on site.- Parameters:
bases (
Sequence
[Any
]) – An ordered container of allowed character statestransition_matrix (
Optional
[Sequence
[Sequence
[Sequence
[float
]]]]) – A matrix of transition costs between provided bases, in which index order is (site, from_base, to_base)ambiguity_map (
Optional
[AmbiguityMap
]) – AnAmbiguityMap
object describing ambiguous characters. If not provided a default will be chosen as in the constructor forTransitionModel
.
- character_distance(parent_char, child_char, site)[source]
Returns the transition cost from
parent_char
tochild_char
, assuming each character is at the one-basedsite
in their respective sequences.Both parent_char and child_char are expected to be unambiguous.
- Return type:
- min_character_mutation(parent_char, child_char, site=None, **kwargs)[source]
Allowing child_char to be ambiguous, returns the minimum possible transition weight between parent_char and child_char, preceded in a tuple by the unambiguous child character that achieves that weight.
Keyword argument
site
is ignored in this base class.- Return type:
- historydag.parsimony_utils.default_nt_transitions = <historydag.parsimony_utils.UnitTransitionModel object>
A transition model for nucleotides using unit transition weights, and the standard IUPAC ambiguity codes, with base order
AGCT
andN
,?
and-
treated as fully ambiguous characters.
- historydag.parsimony_utils.default_nt_gaps_transitions = <historydag.parsimony_utils.UnitTransitionModel object>
A transition model for nucleotides using unit transition weights, and the standard IUPAC nucleotide ambiguity map, with ‘-’ is treated as a fifth base, ‘?’ fully ambiguous, and the ambiguity ‘N’ excluding only ‘-‘.
- historydag.parsimony_utils.hamming_edge_weight(parent, child)
An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the sequences stored in the
sequence
attributes of their labels.
- historydag.parsimony_utils.hamming_edge_weight_ambiguous_leaves(parent, child)
An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the sequences stored in the
sequence
attributes of their labels.This edge weight function allows leaf nodes to have ambiguous sequences.
- historydag.parsimony_utils.hamming_cg_edge_weight(parent, child)
An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the compact genomes stored in the
compact_genome
attributes of their labels.
- historydag.parsimony_utils.hamming_cg_edge_weight_ambiguous_leaves(parent, child)
An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the compact genomes stored in the
compact_genome
attributes of their labels.This edge weight function allows leaf nodes to have ambiguous compact genomes.
- historydag.parsimony_utils.hamming_distance_countfuncs = {'start_func': <function TransitionModel.get_weighted_parsimony_countfuncs.<locals>.<lambda>>, 'edge_weight_func': <function TransitionModel.weighted_hamming_edge_weight.<locals>.edge_weight>, 'accum_func': <built-in function sum>}
Provides functions to count hamming distance parsimony when leaf sequences may be ambiguous.
For use with
historydag.AmbiguousLeafSequenceHistoryDag.weight_count()
.
- historydag.parsimony_utils.leaf_ambiguous_hamming_distance_countfuncs = {'start_func': <function TransitionModel.get_weighted_parsimony_countfuncs.<locals>.<lambda>>, 'edge_weight_func': <function TransitionModel.min_weighted_hamming_edge_weight.<locals>.edge_weight>, 'accum_func': <built-in function sum>}
Provides functions to count hamming distance parsimony when leaf sequences may be ambiguous.
For use with
historydag.AmbiguousLeafSequenceHistoryDag.weight_count()
.
- historydag.parsimony_utils.compact_genome_hamming_distance_countfuncs = {'start_func': <function TransitionModel.get_weighted_cg_parsimony_countfuncs.<locals>.<lambda>>, 'edge_weight_func': <function TransitionModel.weighted_cg_hamming_edge_weight.<locals>.edge_weight>, 'accum_func': <built-in function sum>}
Provides functions to count hamming distance parsimony when sequences are stored as CompactGenomes.
For use with
historydag.CGHistoryDag.weight_count()
.
- historydag.parsimony_utils.leaf_ambiguous_compact_genome_hamming_distance_countfuncs = {'start_func': <function TransitionModel.get_weighted_cg_parsimony_countfuncs.<locals>.<lambda>>, 'edge_weight_func': <function TransitionModel.min_weighted_cg_hamming_edge_weight.<locals>.edge_weight>, 'accum_func': <built-in function sum>}
Provides functions to count hamming distance parsimony when leaf compact genomes may be ambiguous.
For use with
historydag.AmbiguousLeafCGHistoryDag.weight_count()
.