historydag.parsimony
A module implementing Sankoff Algorithm.
Functions
|
Build a history DAG from trees containing a sequence attribute on all nodes. |
|
Load an ete tree from a newick string, and add 'sequence' attributes from fasta. |
|
Same as build_tree, but takes a list of filenames containing newick strings, and a filename for a fasta file, and returns a generator on trees. |
|
Randomly resolve ambiguous bases using a two-pass Sankoff Algorithm on subtrees of consecutive ambiguity codes. |
|
A rather ugly way to disambiguate a history, by converting to ete, using the disambiguate method, then converting back to HistoryDag. |
Returns an |
|
|
Returns the parsimony score of a (disambiguated) ete tree. |
|
Returns the parsimony scores of trees specified by newick files and a fasta file. |
|
Returns a generator on parsimony scores of trees specified by newick strings and fasta. |
|
Eliminate invariant characters in a fasta alignment and record the 0-based indices of those variant sites. |
|
Generalizes :meth: |
|
Assign sequences to internal nodes of dag using a weighted Sankoff algorithm by exploding all possible labelings associated to each internal node based on its subtrees. |
|
This is a re-take of |
|
Compute Sankoff cost vectors at nodes in a postorder traversal, and return best possible parsimony score of the tree. |
|
Perform tree-wise sankoff to compute labels for all nodes in the DAG. |
- historydag.parsimony.replace_label_attr(original_label, list_of_replacements={})[source]
Generalizes :meth:
_replace()
for namedtuple datatype to replace multiple fields at once, and by string rather than as a keyword argument.Caveat: the keys of
list_of_replacements
dict should be existing fields in the namedtuple object fororiginal_label
- historydag.parsimony.make_weighted_hamming_count_funcs(transition_model=parsimony_utils.default_nt_transitions, allow_ambiguous_leaves=False, sequence_label='sequence')[source]
Returns an
AddFuncDict
for computing weighted parsimony.The returned
AddFuncDict
is for use withHistoryDag
objects whose nodes have unambiguous sequences stored in label attributes namedsequence
.Args:
- Returns:
AddFuncDict
object for computing weighted parsimony.
- historydag.parsimony.sankoff_postorder_iter_accum(postorder_iter, node_clade_function, child_node_function)[source]
This is a re-take of
postorder_history_accum()
that is altered so that it does not require a complete DAG, and simplified for the specific Sankoff algorithm.- Parameters:
postorder_iter – iterable of nodes to traverse.
node_clade_function – function to combine results for a given clade.
child_node_function – function that is applied to children for a given node.
- historydag.parsimony.sankoff_upward(node_list, seq_len, sequence_attr_name='sequence', transition_model=parsimony_utils.default_nt_transitions, use_internal_node_sequences=False)[source]
Compute Sankoff cost vectors at nodes in a postorder traversal, and return best possible parsimony score of the tree.
- Parameters:
node_list – An iterator for the set of nodes to compute Sankoff on. Can be of type
ete3.TreeNode
or of typeHistoryDag
, or else a list of nodes that respects postorder traversal. This last option is useful when using the Sankoff algorithm to compute node labels for a subset of the internal nodes of a history DAG, as only the nodes in the list will be relabelled. However it does assume that the remaining internal nodes are in fact equipped with a sequence.seq_len – The length of sequence for each leaf node.
sequence_attr_name – the field name of the label attribute that is used to store sequences on nodes. Default value is set to ‘sequence’
transition_model – The type of
parsimony_utils.TransitionModel()
to use for Sankoff. See docstring for more information.use_internal_node_sequences – (used when tree is of type
ete3.TreeNode
) If True, then compute the transition cost for sequences assigned to internal nodes.
- historydag.parsimony.sankoff_downward(dag, skip_ua_children=False, partial_node_list=None, sequence_attr_name='sequence', compute_cvs=True, transition_model=parsimony_utils.default_nt_transitions, trim=True)[source]
Assign sequences to internal nodes of dag using a weighted Sankoff algorithm by exploding all possible labelings associated to each internal node based on its subtrees. This function alters the node labels and the topology of the DAG.
- Parameters:
skip_ua_children – If True, all children of the UA node will remain unchanged.
partial_node_list – If None, all internal node sequences will be re-computed. If a list of nodes is provided, only sequences on nodes from that list will be updated. This list must be in post-order.
sequence_attr_name – The field name of the label attribute that is used to store sequences on nodes. Default value is set to ‘sequence’
compute_cvs – If true, compute upward sankoff cost vectors. If
sankoff_upward
was already run on the tree/dag, this may be skipped.transition_model – The type of
TransitionModel
to use for Sankoff. See docstring for more information.trim – If False, the history DAG will not be trimmed to express only maximally parsimonious histories after Sankoff.
- historydag.parsimony.disambiguate(tree, compute_cvs=True, random_state=None, remove_cvs=False, adj_dist=False, transition_model=parsimony_utils.default_nt_transitions, min_ambiguities=False, use_internal_node_sequences=False)[source]
Randomly resolve ambiguous bases using a two-pass Sankoff Algorithm on subtrees of consecutive ambiguity codes.
- Parameters:
compute_cvs – If true, compute upward sankoff cost vectors. If
sankoff_upward
was already run on the tree, this may be skipped.random_state – A
random
module random state, returned byrandom.getstate()
. Output from this function is otherwise deterministic.remove_cvs – Remove sankoff cost vectors from tree nodes after disambiguation.
adj_dist – Recompute hamming parsimony distances on tree after disambiguation, and store them in
dist
node attributes.transition_model – The type of
parsimony_utils.TransitionModel()
to use for Sankoff. See docstring for more information.min_ambiguities – If True, leaves ambiguities in reconstructed sequences, expressing which bases are possible at each site in a maximally parsimonious disambiguation of the given topology. In the history DAG paper, this is known as a strictly min-weight ambiguous labeling. Otherwise, sequences are resolved in one possible way, and include no ambiguities.
use_internal_node_sequences – If True, then instead of overwriting internal node sequences, only disambiguate ambiguous bases in those sequences. For example, to fix a root sequence, this option must be True.
- historydag.parsimony.build_tree(newickstring, fasta_map, newickformat=1, reference_id=None, reference_sequence=None, ignore_internal_sequences=False)[source]
Load an ete tree from a newick string, and add ‘sequence’ attributes from fasta.
If internal node sequences aren’t specified in the newick string and fasta data, internal node sequences will be fully ambiguous (contain repeated N’s).
- Parameters:
newickstring – a newick string
fasta_map – a dictionary with sequence id keys matching node names in newickstring, and sequence values.
newickformat – the ete format identifier for the passed newick string. See ete docs
reference_id – key in fasta_map corresponding to the root sequence of the tree, if root sequence is fixed.
reference_sequence – fixed root sequence of the tree, if root sequence is fixed
ignore_internal_sequences – if True, sequences at non-leaf nodes specified in fasta_map
ignored (and newickstring will be)
ambiguous. (and all internal sequences will be fully)
- historydag.parsimony.build_trees_from_files(newickfiles, fastafile, **kwargs)[source]
Same as build_tree, but takes a list of filenames containing newick strings, and a filename for a fasta file, and returns a generator on trees.
- historydag.parsimony.parsimony_score(tree, transition_model=parsimony_utils.default_nt_transitions)[source]
Returns the parsimony score of a (disambiguated) ete tree.
Tree must have ‘sequence’ attributes on all nodes.
- historydag.parsimony.remove_invariant_sites(fasta_map)[source]
Eliminate invariant characters in a fasta alignment and record the 0-based indices of those variant sites.
- historydag.parsimony.parsimony_scores_from_topologies(newicks, fasta_map, gap_as_char=False, remove_invariants=False, **kwargs)[source]
Returns a generator on parsimony scores of trees specified by newick strings and fasta. additional keyword arguments are passed to build_tree.
- Parameters:
newicks – newick strings of trees whose parsimony scores will be computed
fasta_map – fasta data as a dictionary, as output by
load_fasta
gap_as_char – if True, gap characters - will be treated as a fifth character. Otherwise, they will be synonymous with N’s.
remove_invariants – if True, removes invariant sites from the provided fasta_map
- historydag.parsimony.parsimony_scores_from_files(treefiles, fastafile, **kwargs)[source]
Returns the parsimony scores of trees specified by newick files and a fasta file.
Arguments match build_trees_from_files. Additional keyword arguments are passed to
parsimony_scores_from_topologies
.
- historydag.parsimony.build_dag_from_trees(trees)[source]
Build a history DAG from trees containing a sequence attribute on all nodes.
unifurcations in the provided trees will be deleted.