historydag.mutation_annotated_dag
This module allows the loading and manipulation of Larch mutation annotated DAG protobuf files.
The resulting history DAG contains labels with ‘compact genomes’, and a ‘refseq’ attribute describing a reference sequence and set of mutations relative to the reference.
Functions
- load_MAD_protobuf: Convert a Larch MAD protobuf to a CGLeafIDHistoryDag with compact genomes in the compact_genome label attribute.
- load_MAD_protobuf_file: Load a mutation annotated DAG protobuf file and return a CGHistoryDag.
- load_json_file: Load a Mutation Annotated DAG stored in a JSON file and return a CGHistoryDag.
- unflatten: Takes a dictionary like that returned by flatten, and returns a HistoryDag.
Classes
- AmbiguousLeafCGHistoryDag: A HistoryDag subclass with node labels containing compact genomes.
- CGHistoryDag: A HistoryDag subclass with node labels containing CompactGenome objects.
- HDagJSONEncoder: Constructor for JSONEncoder, with sensible defaults.
- NodeIDHistoryDag: A HistoryDag subclass with node labels containing string node_id fields.
- class historydag.mutation_annotated_dag.HDagJSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]
Constructor for JSONEncoder, with sensible defaults.
If skipkeys is false, then it is a TypeError to attempt encoding of keys that are not str, int, float or None. If skipkeys is True, such items are simply skipped.
If ensure_ascii is true, the output is guaranteed to be str objects with all incoming non-ASCII characters escaped. If ensure_ascii is false, the output can contain non-ASCII characters.
If check_circular is true, then lists, dicts, and custom encoded objects will be checked for circular references during encoding to prevent an infinite recursion (which would cause a RecursionError). Otherwise, no such check takes place.
If allow_nan is true, then NaN, Infinity, and -Infinity will be encoded as such. This behavior is not JSON specification compliant, but is consistent with most JavaScript based encoders and decoders. Otherwise, it will be a ValueError to encode such floats.
If sort_keys is true, then the output of dictionaries will be sorted by key; this is useful for regression tests to ensure that JSON serializations can be compared on a day-to-day basis.
If indent is a non-negative integer, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0 will only insert newlines. None is the most compact representation.
If specified, separators should be an (item_separator, key_separator) tuple. The default is (', ', ': ') if indent is None and (',', ': ') otherwise. To get the most compact JSON representation, you should specify (',', ':') to eliminate whitespace.
If specified, default is a function that gets called for objects that can’t otherwise be serialized. It should return a JSON encodable version of the object or raise a TypeError.
- default(obj)[source]
Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).
For example, to support arbitrary iterators, you could implement default like this:
    def default(self, o):
        try:
            iterable = iter(o)
        except TypeError:
            pass
        else:
            return list(iterable)
        # Let the base class default method raise the TypeError
        return JSONEncoder.default(self, o)
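As a minimal sketch of how the constructor options and the default hook described above fit together, the following uses only the standard library's json.JSONEncoder, whose parameters this entry documents. The SetEncoder class is hypothetical and is not part of historydag.

```python
import json


class SetEncoder(json.JSONEncoder):
    """Hypothetical encoder (not part of historydag) that serializes sets."""

    def default(self, o):
        if isinstance(o, (set, frozenset)):
            # Sort so that the serialized output is deterministic.
            return sorted(o)
        # Let the base class raise TypeError for unsupported types.
        return super().default(o)


data = {"b": {3, 1, 2}, "a": None}

# sort_keys orders dictionary keys; these separators remove all whitespace.
compact = json.dumps(data, cls=SetEncoder, sort_keys=True, separators=(",", ":"))
print(compact)  # {"a":null,"b":[1,2,3]}
```

Passing the class through the cls argument lets json.dumps construct the encoder with the given options, so the same keyword arguments listed above apply.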
- class historydag.mutation_annotated_dag.NodeIDHistoryDag(dagroot, attr={})[source]
A HistoryDag subclass with node labels containing string node_id fields.
For leaf nodes this string is a unique leaf identifier, and for internal nodes this is a string representation of an integer node ID.
- class historydag.mutation_annotated_dag.CGHistoryDag(dagroot, attr={})[source]
A HistoryDag subclass with node labels containing CompactGenome objects.
The constructor for this class requires that each node label contain a ‘compact_genome’ field, which is expected to hold a compact_genome.CompactGenome object.
A HistoryDag containing ‘sequence’ node label fields may be automatically converted to this subclass by calling the class method CGHistoryDag.from_dag(), providing the HistoryDag object to be converted, and the reference sequence to the keyword argument ‘reference’.
This subclass provides specialized methods for interfacing with Larch’s MADAG protobuf format.
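To make the idea of a compact genome concrete, here is a minimal sketch that records a sequence as its differences from a reference and then reconstructs it. The helper names to_compact and from_compact and the dictionary layout are assumptions made for illustration; they are not the compact_genome.CompactGenome API.

```python
def to_compact(sequence, reference):
    """Record only the sites where sequence differs from reference.

    Returns a dict mapping 1-indexed site -> (reference_base, new_base).
    """
    if len(sequence) != len(reference):
        raise ValueError("sequence and reference must have equal length")
    return {
        site: (ref_base, base)
        for site, (ref_base, base) in enumerate(zip(reference, sequence), start=1)
        if ref_base != base
    }


def from_compact(mutations, reference):
    """Apply the recorded mutations to the reference to recover the sequence."""
    bases = list(reference)
    for site, (ref_base, new_base) in mutations.items():
        if bases[site - 1] != ref_base:
            raise ValueError(f"reference mismatch at site {site}")
        bases[site - 1] = new_base
    return "".join(bases)


reference = "ACGTACGT"
sequence = "ACGAACGG"
cg = to_compact(sequence, reference)
print(cg)  # {4: ('T', 'A'), 8: ('T', 'G')}
```

Storing only the differences keeps labels small when most sites agree with the reference, which is the usual case for closely related genomes.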
- weight_counts_with_ambiguities(*args, **kwargs)[source]
Template method for counting tree weights in the DAG, with exploded labels. Like HistoryDag.weight_count(), but creates dictionaries of Counter objects at each node, keyed by possible sequences at that node. Analogous to HistoryDag.count_histories() with expand_func provided.
Weights must be hashable.
- Parameters:
start_func – A function which assigns a weight to each leaf node
edge_func – A function which assigns a weight to pairs of labels, with the parent node label the first argument. Must correctly handle the UA node label which is a UALabel instead of a namedtuple.
accum_func – A way to ‘add’ a list of weights together
expand_func – A function which takes a label and returns a list of labels, such as disambiguations of an ambiguous sequence.
- Returns:
A Counter keyed by weights. The total number of trees will be greater than count_histories(), as these are possible disambiguations of trees. These disambiguations may not be unique, but if two are the same, they come from different subtrees of the DAG.
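The role of expand_func and edge_func can be illustrated with a small standalone sketch. It works on plain strings rather than node labels, ignores the UA node, and covers only a subset of ambiguity codes; expand_sequence and hamming_distance are hypothetical helpers, not historydag functions.

```python
from collections import Counter
from itertools import product

# Subset of the IUPAC nucleotide ambiguity codes, enough for this illustration.
AMBIGUITY_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "N": "ACGT",
}


def expand_sequence(sequence):
    """Return every unambiguous sequence compatible with `sequence`."""
    choices = [AMBIGUITY_CODES[base] for base in sequence]
    return ["".join(combo) for combo in product(*choices)]


def hamming_distance(parent, child):
    """Count the sites at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(parent, child))


parent = "ACGT"
child = "ARGY"

# Tally the edge weight for each possible disambiguation of the child.
weights = Counter(hamming_distance(parent, s) for s in expand_sequence(child))
print(weights)
```

Here the ambiguous child expands to four sequences, so a single edge contributes four weighted possibilities, which is why the total count can exceed count_histories().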
- hamming_parsimony_count()[source]
See historydag.sequence_dag.SequenceHistoryDag.hamming_parsimony_count()
- to_protobuf(leaf_data_func=None, randomize_leaf_muts=False, transition_model=default_nt_transitions)[source]
Convert a DAG with compact genome data on each node, and unique leaf IDs on leaf nodes, to a MAD protobuf with mutation information on edges.
- Parameters:
leaf_data_func – a function taking a DAG node and returning a string to store in the protobuf node_name field condensed_leaves of leaf nodes. On leaf nodes, this data is appended after the unique leaf ID.
randomize_leaf_muts – If True, when leaf node sequences contain ambiguities and there are multiple choices, the mutations on pendant edges will be randomized.
transition_model – A historydag.parsimony_utils.TransitionModel() object, used to decide which bases to record on pendant edge mutations with ambiguous bases as targets.
Note that internal node IDs will be reassigned, even if internal nodes have node IDs in their label data.
- to_protobuf_file(filename, leaf_data_func=None, randomize_leaf_muts=False)[source]
Write this CGHistoryDag to a Mutation Annotated DAG protobuf for use with Larch.
- flatten(sort_compact_genomes=False)[source]
Return a dictionary containing four keys:
- refseq is a list containing the reference sequence id, and the reference sequence (the implied sequence on the UA node)
- compact_genome_list is a list of compact genomes, where each compact genome is a list of nested lists [seq_idx, [old_base, new_base]] where seq_idx is the (1-indexed) nucleotide sequence site. If sort_compact_genomes is True, compact genomes and compact_genome_list are sorted.
- node_list is a list of [label_idx, clade_list] pairs, where
  label_idx is the index of the node’s compact genome in compact_genome_list, and
  clade_list is a list of lists of compact_genome_list indices, encoding sets of child clades.
- edge_list is a list of triples [parent_idx, child_idx, clade_idx], where
  parent_idx is the index of the edge’s parent node in node_list,
  child_idx is the index of the edge’s child node in node_list, and
  clade_idx is the index of the clade in the parent node’s clade_list from which this edge descends.
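The layout described above can be sketched with a hand-written dictionary and a helper that resolves each edge to the sequences at its endpoints. The data is invented for illustration, the UA node is omitted for brevity, and node_sequence is a hypothetical helper rather than part of historydag.

```python
# A made-up flattened DAG: one internal node with two leaf children.
flat = {
    "refseq": ["ref1", "ACGT"],
    "compact_genome_list": [
        [],                    # index 0: identical to the reference
        [[2, ["C", "T"]]],     # index 1: site 2 changes C -> T
        [[4, ["T", "A"]]],     # index 2: site 4 changes T -> A
    ],
    "node_list": [
        [0, [[1], [2]]],       # internal node with two child clades
        [1, []],               # leaf
        [2, []],               # leaf
    ],
    "edge_list": [
        [0, 1, 0],             # node 0 -> node 1, via clade 0
        [0, 2, 1],             # node 0 -> node 2, via clade 1
    ],
}


def node_sequence(flat_dag, node_idx):
    """Rebuild the full sequence for a node from its compact genome."""
    _, reference = flat_dag["refseq"]
    label_idx, _ = flat_dag["node_list"][node_idx]
    bases = list(reference)
    for seq_idx, (old_base, new_base) in flat_dag["compact_genome_list"][label_idx]:
        # seq_idx is 1-indexed, so shift by one for Python indexing.
        if bases[seq_idx - 1] != old_base:
            raise ValueError(f"reference mismatch at site {seq_idx}")
        bases[seq_idx - 1] = new_base
    return "".join(bases)


# Resolve each edge to (parent sequence, child sequence, clade it descends from).
edges = [
    (node_sequence(flat, parent), node_sequence(flat, child), flat["node_list"][parent][1][clade])
    for parent, child, clade in flat["edge_list"]
]
print(edges)
```

Because every field is a plain list, string, or integer, a dictionary of this shape can be written to JSON directly.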
- test_equal(other)[source]
Deprecated test for whether two history DAGs are equal.
Compares sorted JSON representation. Only works when “compact_genome” is the only label field, on all nodes.
- class historydag.mutation_annotated_dag.AmbiguousLeafCGHistoryDag(dagroot, attr={})[source]
A HistoryDag subclass with node labels containing compact genomes.
The constructor for this class requires that each node label contain a ‘compact_genome’ field, which is expected to hold a compact_genome.CompactGenome object, which is expected to hold an unambiguous sequence if the node is internal. The sequence may contain ambiguities if the node is a leaf.
A HistoryDag containing ‘sequence’ node label fields may be automatically converted to this subclass by calling the class method CGHistoryDag.from_dag(), providing the HistoryDag object to be converted, and the reference sequence to the keyword argument ‘reference’.
- historydag.mutation_annotated_dag.load_json_file(filename)[source]
Load a Mutation Annotated DAG stored in a JSON file and return a CGHistoryDag.
- historydag.mutation_annotated_dag.unflatten(flat_dag)[source]
Takes a dictionary like that returned by flatten, and returns a HistoryDag.
- historydag.mutation_annotated_dag.load_MAD_protobuf(pbdata, compact_genomes=False, node_ids=True, leaf_cgs={}, ambiguity_map=standard_nt_ambiguity_map)[source]
Convert a Larch MAD protobuf to a CGLeafIDHistoryDag with compact genomes in the compact_genome label attribute.
- Parameters:
pbdata – loaded protobuf data object
compact_genomes – If True, returns a CGHistoryDag or AmbiguousLeafCGHistoryDag object, with labels containing node_id and compact_genome fields. If no leaf sequence data is provided, leaf compact genomes will be inferred from pendant edge mutations, and will include ambiguities when mutations on two pendant edges pointing to the same leaf would otherwise contradict. The node_id field on internal nodes will be None unless the node_ids argument is True. If compact_genomes is False, this function will return a NodeIDHistoryDag.
node_ids – If True, node IDs will be included on all nodes’ labels. If False, internal nodes’ node_id label fields will be None. Unique leaf sequence IDs are always included in the node_id label field of leaf nodes, to ensure that leaf node labels are unique.
leaf_cgs – (not implemented) A dictionary keyed by unique string leaf IDs containing CompactGenomes. Use compact_genome.read_alignment() to read an alignment from a file.
ambiguity_map – A historydag.parsimony_utils.AmbiguityMap() object to determine how conflicting pendant edge mutations are represented.
Note that if leaf sequences in the original alignment do not contain ambiguities, it is not necessary to provide alignment data; leaf sequences can be completely inferred without it.
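The way contradictory pendant edge mutations can be represented as ambiguities may be sketched as follows. This is an illustration using the standard IUPAC nucleotide codes, not the library's AmbiguityMap implementation; merge_leaf_sequences is a hypothetical helper.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases each one represents.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def merge_leaf_sequences(sequences):
    """Combine equal-length unambiguous sequences, site by site, into one
    sequence that uses an ambiguity code wherever the inputs disagree."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    return "".join(IUPAC_CODES[frozenset(column)] for column in zip(*sequences))


# Two pendant edges imply different bases at site 2 for the same leaf.
inferred = merge_leaf_sequences(["ACGT", "ATGT"])
print(inferred)  # AYGT
```

When all pendant edges agree, the merged sequence contains no ambiguity codes, which matches the note that alignment data is unnecessary for unambiguous leaves.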