historydag.parsimony

A module implementing Sankoff Algorithm.

Functions

build_dag_from_trees(trees)

Build a history DAG from trees containing a sequence attribute on all nodes.

build_tree(newickstring, fasta_map[, ...])

Load an ete tree from a newick string, and add 'sequence' attributes from fasta.

build_trees_from_files(newickfiles, ...)

Same as build_tree, but takes a list of filenames containing newick strings, and a filename for a fasta file, and returns a generator on trees.

disambiguate(tree[, compute_cvs, ...])

Randomly resolve ambiguous bases using a two-pass Sankoff Algorithm on subtrees of consecutive ambiguity codes.

disambiguate_history(history)

A rather ugly way to disambiguate a history, by converting to ete, using the disambiguate method, then converting back to HistoryDag.

make_weighted_hamming_count_funcs([...])

Returns an AddFuncDict for computing weighted parsimony.

parsimony_score(tree[, transition_model])

Returns the parsimony score of a (disambiguated) ete tree.

parsimony_scores_from_files(treefiles, ...)

Returns the parsimony scores of trees specified by newick files and a fasta file.

parsimony_scores_from_topologies(newicks, ...)

Returns a generator on parsimony scores of trees specified by newick strings and fasta.

remove_invariant_sites(fasta_map)

Eliminate invariant characters in a fasta alignment and record the 0-based indices of those variant sites.

replace_label_attr(original_label[, ...])

Generalizes :meth: _replace() for namedtuple datatype to replace multiple fields at once, and by string rather than as a keyword argument.

sankoff_downward(dag[, skip_ua_children, ...])

Assign sequences to internal nodes of dag using a weighted Sankoff algorithm by exploding all possible labelings associated to each internal node based on its subtrees.

sankoff_postorder_iter_accum(postorder_iter, ...)

This is a re-take of postorder_history_accum() that is altered so that it does not require a complete DAG, and simplified for the specific Sankoff algorithm.

sankoff_upward(node_list, seq_len[, ...])

Compute Sankoff cost vectors at nodes in a postorder traversal, and return best possible parsimony score of the tree.

treewise_sankoff_in_dag(dag[, cover_edges])

Perform tree-wise sankoff to compute labels for all nodes in the DAG.

historydag.parsimony.replace_label_attr(original_label, list_of_replacements={})[source]

Generalizes :meth: _replace() for namedtuple datatype to replace multiple fields at once, and by string rather than as a keyword argument.

Caveat: the keys of list_of_replacements dict should be existing fields in the namedtuple object for original_label

historydag.parsimony.make_weighted_hamming_count_funcs(transition_model=parsimony_utils.default_nt_transitions, allow_ambiguous_leaves=False, sequence_label='sequence')[source]

Returns an AddFuncDict for computing weighted parsimony.

The returned AddFuncDict is for use with HistoryDag objects whose nodes have unambiguous sequences stored in label attributes named sequence.

Args:

Returns:

AddFuncDict object for computing weighted parsimony.

historydag.parsimony.sankoff_postorder_iter_accum(postorder_iter, node_clade_function, child_node_function)[source]

This is a re-take of postorder_history_accum() that is altered so that it does not require a complete DAG, and simplified for the specific Sankoff algorithm.

Parameters:
  • postorder_iter – iterable of nodes to traverse.

  • node_clade_function – function to combine results for a given clade.

  • child_node_function – function that is applied to children for a given node.

historydag.parsimony.sankoff_upward(node_list, seq_len, sequence_attr_name='sequence', transition_model=parsimony_utils.default_nt_transitions, use_internal_node_sequences=False)[source]

Compute Sankoff cost vectors at nodes in a postorder traversal, and return best possible parsimony score of the tree.

Parameters:
  • node_list – An iterator for the set of nodes to compute Sankoff on. Can be of type ete3.TreeNode or of type HistoryDag, or else a list of nodes that respects postorder traversal. This last option is useful when using the Sankoff algorithm to compute node labels for a subset of the internal nodes of a history DAG, as only the nodes in the list will be relabelled. However it does assume that the remaining internal nodes are in fact equipped with a sequence.

  • seq_len – The length of sequence for each leaf node.

  • sequence_attr_name – the field name of the label attribute that is used to store sequences on nodes. Default value is set to ‘sequence’

  • transition_model – The type of parsimony_utils.TransitionModel() to use for Sankoff. See docstring for more information.

  • use_internal_node_sequences – (used when tree is of type ete3.TreeNode) If True, then compute the transition cost for sequences assigned to internal nodes.

historydag.parsimony.sankoff_downward(dag, skip_ua_children=False, partial_node_list=None, sequence_attr_name='sequence', compute_cvs=True, transition_model=parsimony_utils.default_nt_transitions, trim=True)[source]

Assign sequences to internal nodes of dag using a weighted Sankoff algorithm by exploding all possible labelings associated to each internal node based on its subtrees. This function alters the node labels and the topology of the DAG.

Parameters:
  • skip_ua_children – If True, all children of the UA node will remain unchanged.

  • partial_node_list – If None, all internal node sequences will be re-computed. If a list of nodes is provided, only sequences on nodes from that list will be updated. This list must be in post-order.

  • sequence_attr_name – The field name of the label attribute that is used to store sequences on nodes. Default value is set to ‘sequence’

  • compute_cvs – If true, compute upward sankoff cost vectors. If sankoff_upward was already run on the tree/dag, this may be skipped.

  • transition_model – The type of TransitionModel to use for Sankoff. See docstring for more information.

  • trim – If False, the history DAG will not be trimmed to express only maximally parsimonious histories after Sankoff.

historydag.parsimony.disambiguate(tree, compute_cvs=True, random_state=None, remove_cvs=False, adj_dist=False, transition_model=parsimony_utils.default_nt_transitions, min_ambiguities=False, use_internal_node_sequences=False)[source]

Randomly resolve ambiguous bases using a two-pass Sankoff Algorithm on subtrees of consecutive ambiguity codes.

Parameters:
  • compute_cvs – If true, compute upward sankoff cost vectors. If sankoff_upward was already run on the tree, this may be skipped.

  • random_state – A random module random state, returned by random.getstate(). Output from this function is otherwise deterministic.

  • remove_cvs – Remove sankoff cost vectors from tree nodes after disambiguation.

  • adj_dist – Recompute hamming parsimony distances on tree after disambiguation, and store them in dist node attributes.

  • transition_model – The type of parsimony_utils.TransitionModel() to use for Sankoff. See docstring for more information.

  • min_ambiguities – If True, leaves ambiguities in reconstructed sequences, expressing which bases are possible at each site in a maximally parsimonious disambiguation of the given topology. In the history DAG paper, this is known as a strictly min-weight ambiguous labeling. Otherwise, sequences are resolved in one possible way, and include no ambiguities.

  • use_internal_node_sequences – If True, then instead of overwriting internal node sequences, only disambiguate ambiguous bases in those sequences. For example, to fix a root sequence, this option must be True.

historydag.parsimony.build_tree(newickstring, fasta_map, newickformat=1, reference_id=None, reference_sequence=None, ignore_internal_sequences=False)[source]

Load an ete tree from a newick string, and add ‘sequence’ attributes from fasta.

If internal node sequences aren’t specified in the newick string and fasta data, internal node sequences will be fully ambiguous (contain repeated N’s).

Parameters:
  • newickstring – a newick string

  • fasta_map – a dictionary with sequence id keys matching node names in newickstring, and sequence values.

  • newickformat – the ete format identifier for the passed newick string. See ete docs

  • reference_id – key in fasta_map corresponding to the root sequence of the tree, if root sequence is fixed.

  • reference_sequence – fixed root sequence of the tree, if root sequence is fixed

  • ignore_internal_sequences – if True, sequences at non-leaf nodes specified in fasta_map

  • ignored (and newickstring will be)

  • ambiguous. (and all internal sequences will be fully)

historydag.parsimony.build_trees_from_files(newickfiles, fastafile, **kwargs)[source]

Same as build_tree, but takes a list of filenames containing newick strings, and a filename for a fasta file, and returns a generator on trees.

historydag.parsimony.parsimony_score(tree, transition_model=parsimony_utils.default_nt_transitions)[source]

Returns the parsimony score of a (disambiguated) ete tree.

Tree must have ‘sequence’ attributes on all nodes.

historydag.parsimony.remove_invariant_sites(fasta_map)[source]

Eliminate invariant characters in a fasta alignment and record the 0-based indices of those variant sites.

historydag.parsimony.parsimony_scores_from_topologies(newicks, fasta_map, gap_as_char=False, remove_invariants=False, **kwargs)[source]

Returns a generator on parsimony scores of trees specified by newick strings and fasta. additional keyword arguments are passed to build_tree.

Parameters:
  • newicks – newick strings of trees whose parsimony scores will be computed

  • fasta_map – fasta data as a dictionary, as output by load_fasta

  • gap_as_char – if True, gap characters - will be treated as a fifth character. Otherwise, they will be synonymous with N’s.

  • remove_invariants – if True, removes invariant sites from the provided fasta_map

historydag.parsimony.parsimony_scores_from_files(treefiles, fastafile, **kwargs)[source]

Returns the parsimony scores of trees specified by newick files and a fasta file.

Arguments match build_trees_from_files. Additional keyword arguments are passed to parsimony_scores_from_topologies.

historydag.parsimony.build_dag_from_trees(trees)[source]

Build a history DAG from trees containing a sequence attribute on all nodes.

unifurcations in the provided trees will be deleted.

historydag.parsimony.disambiguate_history(history)[source]

A rather ugly way to disambiguate a history, by converting to ete, using the disambiguate method, then converting back to HistoryDag.

historydag.parsimony.treewise_sankoff_in_dag(dag, cover_edges=False)[source]

Perform tree-wise sankoff to compute labels for all nodes in the DAG.