historydag.compact_genome
This module provides a CompactGenome class, intended as a convenient and compact representation of a nucleotide sequence as a collection of mutations relative to a reference sequence.
This object also provides methods to conveniently mutate CompactGenome objects according to a list of mutations, produce mutations defining the difference between two CompactGenome objects, and efficiently access the base at a site (or the entire sequence, as a string) implied by a CompactGenome.
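The idea of storing a sequence as mutations relative to a reference can be sketched in plain Python. This is only an illustration and not the historydag implementation: the dictionary layout (one-based site mapped to a (reference base, new base) pair) and the helper name `sequence_from_mutations` are assumptions made for this example.

```python
# Toy illustration only; the dict layout and function name are assumptions,
# not part of the historydag API.
def sequence_from_mutations(reference, mutations):
    """Rebuild the full sequence implied by a reference and a mutation dict.

    `mutations` maps a one-based site to a (reference_base, new_base) pair.
    """
    bases = list(reference)
    for site, (ref_base, new_base) in mutations.items():
        # The recorded reference base must agree with the reference sequence.
        if bases[site - 1] != ref_base:
            raise ValueError(f"reference mismatch at site {site}")
        bases[site - 1] = new_base
    return "".join(bases)


reference = "ACGTACGT"
mutations = {2: ("C", "T"), 7: ("G", "A")}
full_sequence = sequence_from_mutations(reference, mutations)
print(full_sequence)
```

Only the two differing sites are stored, which is what makes the representation compact when sequences are long and closely related.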
Functions
| ambiguous_cg_diff | Yields a minimal collection of mutations in the format (parent_nuc, child_nuc, one-based sequence_index) distinguishing two compact genomes, such that applying the resulting mutations to parent_cg would yield a compact genome compatible with the possibly ambiguous child_cg. |
| cg_diff | Yields mutations in the format (parent_nuc, child_nuc, one-based sequence_index) distinguishing two compact genomes, such that applying the resulting mutations to parent_cg would yield child_cg. |
| compact_genome_from_sequence | Create a CompactGenome from a sequence and a reference sequence. |
| read_alignment | Read a fasta or vcf alignment and return a dictionary mapping sequence ID strings to CompactGenomes. |
| reconcile_cgs | Returns a compact genome containing ambiguous bases, representing the least ambiguous sequence of which all provided cgs in cg_list are resolutions. |
| unpack_mut_string | Returns (one-based site, from_base, to_base) |
Classes
| CompactGenome | A collection of mutations relative to a reference sequence. |
- class historydag.compact_genome.CompactGenome(mutations, reference)
A collection of mutations relative to a reference sequence.
- Parameters:
- mutations_as_strings()
Return mutations as a tuple of strings of the format ‘<reference base><index><new base>’, sorted by index.
- mutate(mutstring, reverse=False)
Apply a mutstring such as ‘A110G’ to this compact genome.
In this example, A is the old base, G is the new base, and 110 is the 1-based index of the mutation in the sequence. Returns the new CompactGenome, and prints a warning if the old base doesn’t match the recorded old base in this compact genome.
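The behaviour described for a single mutstring can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: `toy_mutate` is a hypothetical helper, and the mutation dictionary (one-based site mapped to a (reference base, new base) pair) is an assumed layout.

```python
import re
import warnings


# Hypothetical helper for illustration; not the historydag implementation.
def toy_mutate(reference, mutations, mutstring, reverse=False):
    """Return a new mutation dict with a mutstring such as 'A110G' applied."""
    match = re.fullmatch(r"(\D)(\d+)(\D)", mutstring)
    if match is None:
        raise ValueError(f"not a valid mutstring: {mutstring!r}")
    old_base, site, new_base = match.group(1), int(match.group(2)), match.group(3)
    if reverse:
        # Applying a mutation in reverse swaps the roles of the two bases.
        old_base, new_base = new_base, old_base
    ref_base = reference[site - 1]
    current_base = mutations.get(site, (ref_base, ref_base))[1]
    if current_base != old_base:
        warnings.warn(f"expected {old_base} at site {site}, found {current_base}")
    result = dict(mutations)
    if new_base == ref_base:
        # Returning to the reference base means no mutation is recorded.
        result.pop(site, None)
    else:
        result[site] = (ref_base, new_base)
    return result


reference = "ACGT"
mutated = toy_mutate(reference, {}, "C2T")
restored = toy_mutate(reference, mutated, "C2T", reverse=True)
print(mutated, restored)
```

Returning a new dictionary instead of modifying the input mirrors the documented behaviour of returning a new CompactGenome.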
- apply_muts_raw(muts)
Apply the mutations from the sequence of tuples muts. Each tuple should contain (one-based site, from_base, to_base).
- apply_muts(muts, reverse=False, debug=False)
Apply a sequence of mutstrings like ‘A110G’ to this compact genome.
In this example, A is the old base, G is the new base, and 110 is the 1-based index of the mutation in the sequence. Returns the new CompactGenome, and prints a warning if the old base doesn’t match the recorded old base in this compact genome.
- Parameters:
muts (Sequence[str]) – The mutations to apply, in the order they should be applied
reverse (bool) – Apply the mutations in reverse, such as when the provided mutations describe how to achieve this CompactGenome from the desired CompactGenome. If True, the mutations in muts will also be applied in reversed order.
debug – If True, each mutation is applied individually by CompactGenome.apply_mut() and the from base is checked against the current recorded base at each site.
- Returns:
The new CompactGenome
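The effect of reverse on both the direction and the order of the mutations can be sketched on a plain string. This is an illustration only: `toy_apply_muts` is a hypothetical helper that works on full sequences, which is a simplification of what a CompactGenome stores.

```python
# Hypothetical helper for illustration; it works on full sequence strings
# and is not the historydag implementation.
def toy_apply_muts(sequence, muts, reverse=False):
    """Apply mutstrings like 'A110G' to a sequence, optionally in reverse."""
    bases = list(sequence)
    # With reverse=True the list is walked backwards as well.
    ordered = reversed(muts) if reverse else muts
    for mut in ordered:
        old_base, site, new_base = mut[0], int(mut[1:-1]), mut[-1]
        if reverse:
            old_base, new_base = new_base, old_base
        if bases[site - 1] != old_base:
            print(f"warning: expected {old_base} at site {site}")
        bases[site - 1] = new_base
    return "".join(bases)


muts = ["C2T", "T2G"]  # two successive changes at the same site
child = toy_apply_muts("ACGT", muts)
parent = toy_apply_muts(child, muts, reverse=True)
print(child, parent)
```

Because both mutations touch site 2, undoing them only works if the list is also traversed in reversed order, which is why the two behaviours are tied to one flag.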
- mask_sites(sites, one_based=True)
Remove any mutations on sites in sites, leaving the reference sequence unchanged.
- Parameters:
sites – A collection of sites to be masked
one_based – If True, the provided sites will be interpreted as one-based sites. Otherwise, they will be interpreted as 0-based sites.
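Masking can be pictured as dropping the recorded mutations at the given sites so that those sites fall back to the reference base. The sketch below assumes a mutation dictionary keyed by one-based site; `toy_mask_sites` is a hypothetical name, not the library method.

```python
# Hypothetical helper for illustration; the dict layout is an assumption.
def toy_mask_sites(mutations, sites, one_based=True):
    """Drop mutations at the given sites; the reference is left untouched."""
    # Internally sites are one-based, so zero-based input is shifted by one.
    offset = 0 if one_based else 1
    masked = {site + offset for site in sites}
    return {site: pair for site, pair in mutations.items() if site not in masked}


mutations = {2: ("C", "T"), 7: ("G", "A")}
masked_one_based = toy_mask_sites(mutations, [2])
masked_zero_based = toy_mask_sites(mutations, [6], one_based=False)
print(masked_one_based, masked_zero_based)
```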
- superset_sites(sites, new_reference, one_based=True)
Do the opposite of subset_sites, adjusting site indices from indices in a sequence of variant sites, to indices in a sequence containing all sites.
- Parameters:
sites – A sorted list of sites in the new_reference sequence which are represented by sites in the current compact genome’s reference sequence
new_reference – A new reference sequence
one_based – Whether the sites in sites are one-based
- subset_sites(sites, one_based=True, new_reference=None)
Remove all but those sites in sites, and adjust the reference sequence.
- Parameters:
sites – A collection of sites to be kept
one_based – If True, the provided sites will be interpreted as one-based sites. Otherwise, they will be interpreted as 0-based sites.
new_reference – If provided, this new reference sequence will be used instead of computing the new reference sequence directly.
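The index adjustment involved in subsetting, and its inverse as described for superset_sites, can be sketched as a round trip. Both helpers are hypothetical and take one-based sites only; the mutation dictionary layout is an assumption for this example.

```python
# Hypothetical helpers for illustration; not the historydag implementation.
def toy_subset_sites(reference, mutations, sites):
    """Keep only the given one-based sites and renumber them from 1."""
    kept = sorted(set(sites))
    new_index = {old: new for new, old in enumerate(kept, start=1)}
    new_reference = "".join(reference[old - 1] for old in kept)
    new_mutations = {
        new_index[old]: pair for old, pair in mutations.items() if old in new_index
    }
    return new_reference, new_mutations


def toy_superset_sites(mutations, sites):
    """Map indices in the subset back to indices in the full sequence."""
    return {sites[site - 1]: pair for site, pair in mutations.items()}


reference = "ACGTACGT"
mutations = {2: ("C", "T"), 7: ("G", "A")}
sites = [2, 5, 7]
sub_reference, sub_mutations = toy_subset_sites(reference, mutations, sites)
round_trip = toy_superset_sites(sub_mutations, sites)
print(sub_reference, sub_mutations, round_trip)
```

The round trip recovers the original mutations here because every mutated site is among the kept sites; mutations at dropped sites would be lost.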
- remove_sites(sites, one_based=True, new_reference=None)
Remove all sites in sites, and adjust the reference sequence.
- Parameters:
sites – A collection of sites to be removed
one_based – If True, the provided sites will be interpreted as one-based sites. Otherwise, they will be interpreted as 0-based sites.
new_reference – If provided, this new reference sequence will be used instead of computing the new reference sequence directly.
- historydag.compact_genome.unpack_mut_string(mut)
Returns (one-based site, from_base, to_base)
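As a rough sketch of the documented return format, a mutstring can be split by position. `toy_unpack_mut_string` is a hypothetical stand-in that assumes single-character bases, and it is not the library function itself.

```python
# Hypothetical stand-in for illustration; assumes single-character bases.
def toy_unpack_mut_string(mut):
    """Split a mutstring like 'A110G' into (one-based site, from_base, to_base)."""
    return int(mut[1:-1]), mut[0], mut[-1]


unpacked = toy_unpack_mut_string("A110G")
print(unpacked)
```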
- historydag.compact_genome.compact_genome_from_sequence(sequence, reference)
Create a CompactGenome from a sequence and a reference sequence.
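Building the compact form amounts to comparing the sequence with the reference site by site and keeping only the differences. The sketch below returns a plain dictionary instead of a CompactGenome; the helper name and the dictionary layout are assumptions made for illustration.

```python
# Hypothetical helper for illustration; returns a plain dict, not a CompactGenome.
def toy_mutations_from_sequence(sequence, reference):
    """Record every site where the sequence differs from the reference."""
    if len(sequence) != len(reference):
        raise ValueError("sequence and reference must have the same length")
    return {
        site: (ref_base, base)
        for site, (ref_base, base) in enumerate(zip(reference, sequence), start=1)
        if ref_base != base
    }


compact = toy_mutations_from_sequence("ATGTACAT", "ACGTACGT")
print(compact)
```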
- historydag.compact_genome.cg_diff(parent_cg, child_cg)
Yields mutations in the format (parent_nuc, child_nuc, one-based sequence_index) distinguishing two compact genomes, such that applying the resulting mutations to parent_cg would yield child_cg.
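The difference between two compact genomes sharing a reference can be sketched as a generator over the sites where either one records a mutation. `toy_cg_diff` is hypothetical, operates on assumed mutation dictionaries, and yields in sorted site order as a choice of this example; the source does not state an ordering.

```python
# Hypothetical helper for illustration; not the historydag implementation.
def toy_cg_diff(reference, parent_muts, child_muts):
    """Yield (parent_nuc, child_nuc, one-based site) where parent and child differ."""
    # Only sites mutated in at least one genome can differ.
    for site in sorted(parent_muts.keys() | child_muts.keys()):
        ref_base = reference[site - 1]
        parent_nuc = parent_muts.get(site, (ref_base, ref_base))[1]
        child_nuc = child_muts.get(site, (ref_base, ref_base))[1]
        if parent_nuc != child_nuc:
            yield parent_nuc, child_nuc, site


reference = "ACGTACGT"
parent_muts = {2: ("C", "T"), 3: ("G", "A")}
child_muts = {2: ("C", "G"), 7: ("G", "A")}
diff = list(toy_cg_diff(reference, parent_muts, child_muts))
print(diff)
```

Only sites mutated in at least one of the two genomes need to be inspected, so the cost depends on the number of mutations and not on the sequence length.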
- historydag.compact_genome.ambiguous_cg_diff(parent_cg, child_cg, transition_model=default_nt_transitions, randomize=False)
Yields a minimal collection of mutations in the format (parent_nuc, child_nuc, one-based sequence_index) distinguishing two compact genomes, such that applying the resulting mutations to parent_cg would yield a compact genome compatible with the possibly ambiguous child_cg.
If randomize is True, mutations will be randomized when there are multiple possible min-weight choices.
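A simplified picture of a diff against an ambiguous child: a site needs no mutation when the parent base is one of the bases the child's code allows. The sketch assumes unit cost for every substitution (so it ignores any transition model), an unambiguous parent, and full sequence strings; `toy_ambiguous_diff` and `IUPAC` are names introduced for this example only.

```python
import random

# Standard IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


# Hypothetical helper; assumes unit substitution costs and an unambiguous parent.
def toy_ambiguous_diff(parent, child, randomize=False):
    """Yield (parent_nuc, child_nuc, one-based site) for sites that must change."""
    for site, (parent_nuc, child_code) in enumerate(zip(parent, child), start=1):
        options = IUPAC[child_code]
        if parent_nuc in options:
            continue  # the parent base is already compatible with the child
        # With unit costs every allowed base is an equally cheap choice.
        child_nuc = random.choice(options) if randomize else options[0]
        yield parent_nuc, child_nuc, site


ambiguous_diff = list(toy_ambiguous_diff("ACGT", "ARGN"))
print(ambiguous_diff)
```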
- historydag.compact_genome.reconcile_cgs(cg_list, check_references=True, ambiguitymap=standard_nt_ambiguity_map)
Returns a compact genome containing ambiguous bases, representing the least ambiguous sequence of which all provided cgs in cg_list are resolutions. Also returns a flag indicating whether the resulting CG contains ambiguities.
If check_references is False, reference sequences will be assumed equal.
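The notion of a least ambiguous sequence covering several sequences can be sketched per site: take the union of the bases seen and look up the IUPAC code for that set. The sketch works on aligned full sequence strings and uses a hard-coded IUPAC table; `toy_reconcile` is a hypothetical helper and does not use the library's ambiguity map object.

```python
# Standard IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


# Hypothetical helper; operates on aligned strings, not CompactGenome objects.
def toy_reconcile(sequences):
    """Return (least ambiguous covering sequence, whether it has ambiguities)."""
    merged = []
    for column in zip(*sequences):
        # Union of every base allowed by any sequence at this site.
        allowed = frozenset().union(*(IUPAC[code] for code in column))
        merged.append(CODE_FOR[allowed])
    result = "".join(merged)
    has_ambiguity = any(len(IUPAC[code]) > 1 for code in result)
    return result, has_ambiguity


reconciled = toy_reconcile(["ACGT", "ACAT", "ACGT"])
print(reconciled)
```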
- historydag.compact_genome.read_alignment(alignment_file, reference_sequence=None)
Read a fasta or vcf alignment and return a dictionary mapping sequence ID strings to CompactGenomes.
- Parameters:
alignment_file – A file containing a fasta or vcf alignment. The file format is determined by the file extension, which is expected to be .fa, .fasta, or .vcf.
reference_sequence (Optional[Sequence[Any]]) – If a fasta file is provided, the first sequence in that file will be used as the compact genome reference sequence, unless one is explicitly provided to this keyword argument.
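The fasta case, including the fallback to the first record as reference, can be sketched on an in-memory string. This is an illustration under assumptions: `toy_read_fasta_alignment` is hypothetical, it parses text instead of a file path, it returns plain mutation dictionaries, and it keeps the reference record in the result, which may differ from the library's behaviour.

```python
# Hypothetical helper for illustration; parses fasta text, not a file path,
# and returns plain dicts in place of CompactGenome objects.
def toy_read_fasta_alignment(text, reference_sequence=None):
    """Map each sequence ID to its mutations relative to the reference."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequences may span several lines
    if reference_sequence is None:
        # Fall back to the first record, as described for fasta input.
        reference_sequence = next(iter(records.values()))
    return {
        seq_id: {
            site: (ref_base, base)
            for site, (ref_base, base) in enumerate(
                zip(reference_sequence, sequence), start=1
            )
            if ref_base != base
        }
        for seq_id, sequence in records.items()
    }


fasta_text = ">ref\nACGT\n>s1\nAC\nAT\n"
alignment = toy_read_fasta_alignment(fasta_text)
print(alignment)
```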