historydag.parsimony_utils

This module provides tools for describing and computing parsimony and weighted parsimony, and for describing allowed characters and ambiguity codes.

AmbiguityMap stores a collection of Characters (for example the strings 'A', 'G', 'C', and 'T') that are considered valid states, as well as a mapping between other Characters (‘ambiguity codes’) and subsets of these allowed characters. This type allows forward- and backward-lookup of ambiguity codes.

TransitionModel describes transition weights between an ordered collection of Characters, and provides methods for computing weighted parsimony. Functions in historydag.parsimony accept TransitionModel objects to customize the implementation of Sankoff, for example. This class has two subclasses: UnitTransitionModel, which describes unit transition costs between non-identical Characters, and SitewiseTransitionModel, which allows transition costs to depend on the location in a sequence in which a transition occurs.

Module Attributes

standard_nt_ambiguity_map

The standard IUPAC nucleotide ambiguity map, in which 'N', '?', and '-' are all considered fully ambiguous.

standard_nt_ambiguity_map_gap_as_char

The standard IUPAC nucleotide ambiguity map, in which '-' is treated as a fifth base, '?' is fully ambiguous, and the ambiguity 'N' excludes '-'.

default_nt_transitions

A transition model for nucleotides using unit transition weights, and the standard IUPAC ambiguity codes, with base order AGCT and N, ? and - treated as fully ambiguous characters.

default_nt_gaps_transitions

A transition model for nucleotides using unit transition weights, and the standard IUPAC nucleotide ambiguity map, with '-' is treated as a fifth base, '?' fully ambiguous, and the ambiguity 'N' excluding only '-'.

hamming_edge_weight(parent, child)

An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the sequences stored in the sequence attributes of their labels.

hamming_edge_weight_ambiguous_leaves(parent, ...)

An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the sequences stored in the sequence attributes of their labels.

hamming_cg_edge_weight(parent, child)

An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the compact genomes stored in the compact_genome attributes of their labels.

hamming_cg_edge_weight_ambiguous_leaves(...)

An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the compact genomes stored in the compact_genome attributes of their labels.

hamming_distance_countfuncs

Provides functions to count hamming distance parsimony when leaf sequences may be ambiguous.

leaf_ambiguous_hamming_distance_countfuncs

Provides functions to count hamming distance parsimony when leaf sequences may be ambiguous.

compact_genome_hamming_distance_countfuncs

Provides functions to count hamming distance parsimony when sequences are stored as CompactGenomes.

leaf_ambiguous_compact_genome_hamming_distance_countfuncs

Provides functions to count hamming distance parsimony when leaf compact genomes may be ambiguous.

Functions

hamming_cg_edge_weight(parent, child)

An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the compact genomes stored in the compact_genome attributes of their labels.

hamming_cg_edge_weight_ambiguous_leaves(...)

An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the compact genomes stored in the compact_genome attributes of their labels.

hamming_edge_weight(parent, child)

An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the sequences stored in the sequence attributes of their labels.

hamming_edge_weight_ambiguous_leaves(parent, ...)

An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the sequences stored in the sequence attributes of their labels.

Classes

AmbiguityMap(ambiguity_character_map[, ...])

Implements a bijection between a subset of the power set of an alphabet of bases, and an expanded alphabet of ambiguity codes.

ReversedAmbiguityMap

A subclass of frozendict for holding the reversed map in a AmbiguityMap instance.

SitewiseTransitionModel([bases, ...])

A subclass of TransitionModel to be used when transition costs depend on site.

TransitionModel([bases, transition_weights, ...])

A class describing a collection of states, and the transition costs between them, for weighted parsimony.

UnitTransitionModel([bases, ambiguity_map])

A subclass of TransitionModel to be used when all non-identical transitions have unit cost.

class historydag.parsimony_utils.AmbiguityMap(ambiguity_character_map, bases=None, reversed_defaults={})[source]

Implements a bijection between a subset of the power set of an alphabet of bases, and an expanded alphabet of ambiguity codes.

To look up the set of bases represented by an ambiguity code, use the object like a dictionary. To look up an ambiguity code representing a set of bases, use the reversed attribute.

Parameters:
  • ambiguity_character_map (Mapping[Any, Iterable[Any]]) – A mapping from ambiguity codes to collections of bases represented by that code.

  • bases (Optional[Iterable[Any]]) – A collection of bases. If not provided, this will be inferred from the ambiguity_character_map.

  • reversed_defaults (Mapping[Iterable[Any], Any]) – When ambiguity_character_map is not injective (i.e. multiple ambiguous characters map to the same set of bases) one might want to specify which character that set of bases maps to in the reversed map. If not provided, the choice will be made arbitrarily.

__init__(ambiguity_character_map, bases=None, reversed_defaults={})[source]
is_ambiguous(sequence)[source]

Returns whether the provided sequence contains any non-base characters (such as ambiguity codes).

Return type:

bool

sequence_resolutions(sequence)[source]

Iterates through possible disambiguations of sequence, recursively.

Recursion-depth-limited by number of ambiguity codes in sequence, not sequence length.

Return type:

Generator[Sequence[Any], None, None]

get_sequence_resolution_func(field_name)[source]

Returns a function which takes a Label, and returns a generator on labels containing all possible resolutions of the sequence in that node’s label’s field_name attribute.

Return type:

Callable[[Union[NamedTuple, UALabel]], Generator[Union[NamedTuple, UALabel], None, None]]

sequence_resolution_count(sequence)[source]

Count the number of possible sequence resolutions.

Equivalent to the number of items yielded by sequence_resolutions().

Return type:

int

get_sequence_resolution_count_func(field_name)[source]

Returns a function taking a Label and returning the number of resolutions for the sequence held in the label’s field_name attribute.

Return type:

Callable[[Union[NamedTuple, UALabel]], int]

class historydag.parsimony_utils.ReversedAmbiguityMap[source]

A subclass of frozendict for holding the reversed map in a AmbiguityMap instance.

historydag.parsimony_utils.standard_nt_ambiguity_map = <historydag.parsimony_utils.AmbiguityMap object>

The standard IUPAC nucleotide ambiguity map, in which ‘N’, ‘?’, and ‘-’ are all considered fully ambiguous.

historydag.parsimony_utils.standard_nt_ambiguity_map_gap_as_char = <historydag.parsimony_utils.AmbiguityMap object>

The standard IUPAC nucleotide ambiguity map, in which ‘-’ is treated as a fifth base, ‘?’ is fully ambiguous, and the ambiguity ‘N’ excludes ‘-‘.

class historydag.parsimony_utils.TransitionModel(bases=tuple('AGCT'), transition_weights=None, ambiguity_map=None)[source]

A class describing a collection of states, and the transition costs between them, for weighted parsimony.

In addition to them methods defined below, there are also attributes which are created by the constructor, which should be ensured to be correct in any subclass of TransitionModel:

  • base_indices is a dictionary mapping bases in self.bases to their indices

  • mask_vectors is a dictionary mapping ambiguity codes (including non-ambiguous bases)

    to vectors of floats which are 0 at indices compatible with the ambiguity code, and infinity otherwise.

  • bases is a tuple recording the correct order of unambiguous bases

Parameters:
__init__(bases=tuple('AGCT'), transition_weights=None, ambiguity_map=None)[source]
get_adjacency_array(seq_len)[source]

Return an adjacency array for a sequence of length seq_len.

This is a matrix containing transition costs, with index order (site, from_base, to_base)

Return type:

array

get_ambiguity_from_tuple(tup)[source]

Retrieve an ambiguity code encoded by tup, which encodes a set of bases with a tuple of 0’s and 1’s.

For example, if self.bases is ‘AGCT’, then ``(0, 1, 1, 1) would return A.

Return type:

Any

character_distance(parent_char, child_char, site=None)[source]

Return the transition cost from parent_char to child_char, two unambiguous characters.

keyword argument site is ignored in this base class.

Return type:

float

weighted_hamming_distance(parent_seq, child_seq)[source]

Return the sum of sitewise transition costs, from parent_seq to child_seq.

Both parent_seq and child_seq are expected to be unambiguous.

Return type:

float

weighted_cg_hamming_distance(parent_cg, child_cg)[source]

Return the sum of sitewise transition costs, from parent_cg to child_cg.

Both parent_seq and child_seq are expected to be unambiguous, but sites where parent_cg and child_cg both match their reference sequence are ignored, so this method is not suitable for weighted parsimony for transition matrices that contain nonzero entries along the diagonal.

Return type:

float

min_character_mutation(parent_char, child_char, site=None, **kwargs)[source]

Allowing child_char to be ambiguous, returns the minimum possible transition weight between parent_char and child_char, preceded in a tuple by the unambiguous child character that achieves that weight.

Keyword argument site is ignored in this base class.

Return type:

float

min_character_distance(parent_char, child_char, site=None)[source]

Allowing child_char to be ambiguous, returns the minimum possible transition weight between parent_char and child_char.

Keyword argument site is ignored in this base class.

Return type:

float

min_weighted_hamming_distance(parent_seq, child_seq)[source]

Assuming the child_seq may contain ambiguous characters, returns the minimum possible transition weight between parent_seq and child_seq.

Return type:

float

min_weighted_cg_hamming_distance(parent_cg, child_cg)[source]

Return the sum of sitewise transition costs, from parent_cg to child_cg.

child_cg may contain ambiguous characters, and in this case those sites will contribute the minimum possible transition cost to the returned value.

Sites where parent_cg and child_cg both match their reference sequence are ignored, so this method is not suitable for weighted parsimony for transition matrices that contain nonzero entries along the diagonal.

Return type:

float

weighted_hamming_edge_weight(field_name)[source]

Returns a function for computing weighted hamming distance between two nodes’ sequences.

Parameters:

field_name (str) – The name of the node label field which contains sequences.

Return type:

Callable[[HistoryDagNode, HistoryDagNode], float]

Returns:

A function accepting two HistoryDagNode objects n1 and n2 and returning a float: the transition cost from n1.label.<field_name> to n2.label.<field_name>, or 0 if n1 is the UA node.

weighted_cg_hamming_edge_weight(field_name, count_root_muts=True)[source]

Returns a function for computing weighted hamming distance between two nodes’ compact genomes.

Parameters:
  • field_name (str) – The name of the node label field which contains compact genomes.

  • count_root_muts (bool) – If True, the reference sequence of root nodes’ compact genomes is considered to be the ancestral sequence of each history, and mutations from the reference to the history root CG are counted.

Return type:

Callable[[HistoryDagNode, HistoryDagNode], float]

Returns:

A function accepting two HistoryDagNode objects n1 and n2 and returning a float: the transition cost from n1.label.<field_name> to n2.label.<field_name>, or the number of mutations between the reference and the CG of n2 if n1 is the UA node.

Sites where parent_cg and child_cg both match their reference sequence are ignored, so this method is not suitable for weighted parsimony for transition matrices that contain nonzero entries along the diagonal.

min_weighted_hamming_edge_weight(field_name)[source]

Returns a function for computing weighted hamming distance between two nodes’ sequences.

If the child node is a leaf node, and its sequence contains ambiguities, the minimum possible transition cost from parent to child will be returned.

Parameters:

field_name (str) – The name of the node label field which contains sequences.

Return type:

Callable[[HistoryDagNode, HistoryDagNode], float]

Returns:

A function accepting two HistoryDagNode objects n1 and n2 and returning a float: the transition cost from n1.label.<field_name> to n2.label.<field_name>, or 0 if n1 is the UA node.

min_weighted_cg_hamming_edge_weight(field_name, count_root_muts=True)[source]

Returns a function for computing weighted hamming distance between two nodes’ compact genomes.

If the child node is a leaf node, and its compact genome contains ambiguities, the minimum possible transition cost from parent to child will be returned by this function.

Parameters:
  • field_name (str) – The name of the node label field which contains compact genomes.

  • count_root_muts (bool) – If True, the reference sequence of root nodes’ compact genomes is considered to be the ancestral sequence of each history, and mutations from the reference to the history root CG are counted.

Return type:

Callable[[HistoryDagNode, HistoryDagNode], float]

Returns:

A function accepting two HistoryDagNode objects n1 and n2 and returning a float: the transition cost from n1.label.<field_name> to n2.label.<field_name>, or 0 if n1 is the UA node.

get_weighted_cg_parsimony_countfuncs(field_name, leaf_ambiguities=False, count_root_muts=True, name='WeightedParsimony')[source]

Create a historydag.utils.AddFuncDict object for counting weighted parsimony in a HistoryDag with labels containing compact genomes.

Parameters:
  • field_name (str) – the label field name in which compact genomes can be found

  • leaf_ambiguities (bool) – if True, leaf compact genomes are permitted to contain ambiguity codes

  • count_root_muts (bool) – If True, the reference sequence of root nodes’ compact genomes is considered to be the ancestral sequence of each history, and mutations from the reference to the history root CG are counted.

  • name (str) – the name for the returned AddFuncDict object

Return type:

AddFuncDict

get_weighted_parsimony_countfuncs(field_name, leaf_ambiguities=False, name='WeightedParsimony')[source]

Create a historydag.utils.AddFuncDict object for counting weighted parsimony in a HistoryDag with labels containing sequences.

Parameters:
  • field_name (str) – the label field name in which sequences can be found

  • leaf_ambiguities (bool) – if True, leaf sequences are permitted to contain ambiguity codes

  • name (str) – the name for the returned AddFuncDict object

class historydag.parsimony_utils.UnitTransitionModel(bases='AGCT', ambiguity_map=None)[source]

A subclass of TransitionModel to be used when all non-identical transitions have unit cost.

Parameters:
__init__(bases='AGCT', ambiguity_map=None)[source]
character_distance(parent_char, child_char, site=None)[source]

Return the transition cost from parent_char to child_char, two unambiguous characters.

keyword argument site is ignored in this subclass.

Return type:

int

min_character_mutation(parent_char, child_char, site=None, randomize=False, **kwargs)[source]

Allowing child_char to be ambiguous, returns the minimum possible transition weight between parent_char and child_char, preceded in a tuple by the unambiguous child character that achieves that weight.

Keyword argument site is ignored in this base class.

Return type:

float

get_weighted_cg_parsimony_countfuncs(field_name, leaf_ambiguities=False, name='HammingParsimony')[source]

Create a historydag.utils.AddFuncDict object for counting parsimony in a HistoryDag with labels containing compact genomes.

Parameters:
  • field_name (str) – the label field name in which compact genomes can be found

  • leaf_ambiguities (AmbiguityMap) – if True, leaf compact genomes are permitted to contain ambiguity codes

  • name (str) – the name for the returned AddFuncDict object

Return type:

AddFuncDict

get_weighted_parsimony_countfuncs(field_name, leaf_ambiguities=False, name='HammingParsimony')[source]

Create a historydag.utils.AddFuncDict object for counting weighted parsimony in a HistoryDag with labels containing sequences.

Parameters:
  • field_name (str) – the label field name in which sequences can be found

  • leaf_ambiguities (AmbiguityMap) – if True, leaf sequences are permitted to contain ambiguity codes

  • name (str) – the name for the returned AddFuncDict object

Return type:

AddFuncDict

class historydag.parsimony_utils.SitewiseTransitionModel(bases='AGCT', transition_matrix=None, ambiguity_map=None)[source]

A subclass of TransitionModel to be used when transition costs depend on site.

Parameters:
__init__(bases='AGCT', transition_matrix=None, ambiguity_map=None)[source]
character_distance(parent_char, child_char, site)[source]

Returns the transition cost from parent_char to child_char, assuming each character is at the one-based site in their respective sequences.

Both parent_char and child_char are expected to be unambiguous.

Return type:

float

min_character_mutation(parent_char, child_char, site=None, **kwargs)[source]

Allowing child_char to be ambiguous, returns the minimum possible transition weight between parent_char and child_char, preceded in a tuple by the unambiguous child character that achieves that weight.

Keyword argument site is ignored in this base class.

Return type:

float

get_adjacency_array(seq_len)[source]

Return an adjacency array for a sequence of length seq_len.

This is a matrix containing transition costs, with index order (site, from_base, to_base)

historydag.parsimony_utils.default_nt_transitions = <historydag.parsimony_utils.UnitTransitionModel object>

A transition model for nucleotides using unit transition weights, and the standard IUPAC ambiguity codes, with base order AGCT and N, ? and - treated as fully ambiguous characters.

historydag.parsimony_utils.default_nt_gaps_transitions = <historydag.parsimony_utils.UnitTransitionModel object>

A transition model for nucleotides using unit transition weights, and the standard IUPAC nucleotide ambiguity map, with ‘-’ is treated as a fifth base, ‘?’ fully ambiguous, and the ambiguity ‘N’ excluding only ‘-‘.

historydag.parsimony_utils.hamming_edge_weight(parent, child)

An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the sequences stored in the sequence attributes of their labels.

historydag.parsimony_utils.hamming_edge_weight_ambiguous_leaves(parent, child)

An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the sequences stored in the sequence attributes of their labels.

This edge weight function allows leaf nodes to have ambiguous sequences.

historydag.parsimony_utils.hamming_cg_edge_weight(parent, child)

An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the compact genomes stored in the compact_genome attributes of their labels.

historydag.parsimony_utils.hamming_cg_edge_weight_ambiguous_leaves(parent, child)

An edge weight function accepting (parent, child) pairs of HistoryDagNodes, and returning the hamming distance between the compact genomes stored in the compact_genome attributes of their labels.

This edge weight function allows leaf nodes to have ambiguous compact genomes.

historydag.parsimony_utils.hamming_distance_countfuncs = {'start_func': <function TransitionModel.get_weighted_parsimony_countfuncs.<locals>.<lambda>>, 'edge_weight_func': <function TransitionModel.weighted_hamming_edge_weight.<locals>.edge_weight>, 'accum_func': <built-in function sum>}

Provides functions to count hamming distance parsimony when leaf sequences may be ambiguous.

For use with historydag.AmbiguousLeafSequenceHistoryDag.weight_count().

historydag.parsimony_utils.leaf_ambiguous_hamming_distance_countfuncs = {'start_func': <function TransitionModel.get_weighted_parsimony_countfuncs.<locals>.<lambda>>, 'edge_weight_func': <function TransitionModel.min_weighted_hamming_edge_weight.<locals>.edge_weight>, 'accum_func': <built-in function sum>}

Provides functions to count hamming distance parsimony when leaf sequences may be ambiguous.

For use with historydag.AmbiguousLeafSequenceHistoryDag.weight_count().

historydag.parsimony_utils.compact_genome_hamming_distance_countfuncs = {'start_func': <function TransitionModel.get_weighted_cg_parsimony_countfuncs.<locals>.<lambda>>, 'edge_weight_func': <function TransitionModel.weighted_cg_hamming_edge_weight.<locals>.edge_weight>, 'accum_func': <built-in function sum>}

Provides functions to count hamming distance parsimony when sequences are stored as CompactGenomes.

For use with historydag.CGHistoryDag.weight_count().

historydag.parsimony_utils.leaf_ambiguous_compact_genome_hamming_distance_countfuncs = {'start_func': <function TransitionModel.get_weighted_cg_parsimony_countfuncs.<locals>.<lambda>>, 'edge_weight_func': <function TransitionModel.min_weighted_cg_hamming_edge_weight.<locals>.edge_weight>, 'accum_func': <built-in function sum>}

Provides functions to count hamming distance parsimony when leaf compact genomes may be ambiguous.

For use with historydag.AmbiguousLeafCGHistoryDag.weight_count().