historydag.compact_genome

This module provides a CompactGenome class, intended as a convenient and compact representation of a nucleotide sequence as a collection of mutations relative to a reference sequence.

The CompactGenome class provides methods to mutate CompactGenome objects according to a list of mutations, and to efficiently access the base at a site (or the entire sequence, as a string) implied by a CompactGenome. Module-level functions produce the mutations that define the difference between two CompactGenome objects.

Functions

ambiguous_cg_diff(parent_cg, child_cg[, ...])

Yields a minimal collection of mutations in the format (parent_nuc, child_nuc, one-based sequence_index) distinguishing two compact genomes, such that applying the resulting mutations to parent_cg would yield a compact genome compatible with the possibly ambiguous child_cg.

cg_diff(parent_cg, child_cg)

Yields mutations in the format (parent_nuc, child_nuc, one-based sequence_index) distinguishing two compact genomes, such that applying the resulting mutations to parent_cg would yield child_cg.

compact_genome_from_sequence(sequence, reference)

Create a CompactGenome from a sequence and a reference sequence.

read_alignment(alignment_file[, ...])

Read a fasta or vcf alignment and return a dictionary mapping sequence ID strings to CompactGenomes.

reconcile_cgs(cg_list[, check_references, ...])

Returns a compact genome containing ambiguous bases, representing the least ambiguous sequence of which all provided cgs in cg_list are resolutions.

unpack_mut_string(mut)

Returns (one-based site, from_base, to_base)

Classes

CompactGenome(mutations, reference)

A collection of mutations relative to a reference sequence.

class historydag.compact_genome.CompactGenome(mutations, reference)[source]

A collection of mutations relative to a reference sequence.

Parameters:
  • mutations (Dict) – The difference between the reference and this sequence, expressed in a dictionary, in which keys are one-based sequence indices, and values are (reference base, new base) pairs.

  • reference (str) – The reference sequence
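
To make the mutation dictionary concrete, the following is a minimal sketch in plain Python that does not call historydag. The helper name sequence_from_mutations is hypothetical; it only illustrates how one-based keys and (reference base, new base) values determine a full sequence.

```python
def sequence_from_mutations(mutations, reference):
    """Rebuild a full sequence from {one-based site: (reference base, new base)}.

    Hypothetical helper for illustration; not part of historydag.
    """
    bases = list(reference)
    for site, (ref_base, new_base) in mutations.items():
        # Sites are one-based, so site 1 corresponds to reference[0].
        if bases[site - 1] != ref_base:
            raise ValueError(
                f"site {site}: expected {ref_base}, found {bases[site - 1]}"
            )
        bases[site - 1] = new_base
    return "".join(bases)


reference = "ACGTAC"
mutations = {2: ("C", "T"), 5: ("A", "G")}
print(sequence_from_mutations(mutations, reference))  # ATGTGC
```

An empty dictionary corresponds to a sequence identical to the reference.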

__init__(mutations, reference)[source]

get_site(site)[source]

Get the base at the provided (one-based) site index.

mutations_as_strings()[source]

Return mutations as a tuple of strings of the format ‘<reference base><index><new base>’, sorted by index.
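
The string format can be illustrated with a short sketch that works on a plain mutation dictionary rather than a CompactGenome. The function name mutation_strings is an assumption made for this example.

```python
def mutation_strings(mutations):
    """Format {one-based site: (reference base, new base)} as sorted mutstrings.

    Hypothetical helper for illustration; not part of historydag.
    """
    return tuple(
        f"{ref_base}{site}{new_base}"
        for site, (ref_base, new_base) in sorted(mutations.items())
    )


print(mutation_strings({5: ("A", "G"), 2: ("C", "T")}))  # ('C2T', 'A5G')
```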

mutate(mutstring, reverse=False)[source]

Apply a mutstring such as ‘A110G’ to this compact genome.

In this example, A is the old base, G is the new base, and 110 is the 1-based index of the mutation in the sequence. Returns the new CompactGenome, and prints a warning if the old base doesn’t match the recorded old base in this compact genome.

Parameters:
  • mutstring (str) – The mutation to apply

  • reverse (bool) – Apply the mutation in reverse, such as when the provided mutation describes how to achieve this CompactGenome from the desired CompactGenome.

Returns:

The new CompactGenome
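
The behavior described above, including the reverse flag, can be sketched on a plain mutation dictionary. This is an illustration under stated assumptions, not the library's implementation: the name apply_mutstring is hypothetical, and the sketch drops a site from the dictionary when it returns to the reference base.

```python
def apply_mutstring(mutations, reference, mutstring, reverse=False):
    """Return a new mutation dict after applying a mutstring such as 'A110G'.

    Hypothetical helper for illustration; not part of historydag.
    """
    old_base, site, new_base = mutstring[0], int(mutstring[1:-1]), mutstring[-1]
    if reverse:
        # The mutstring describes desired -> current, so swap its direction.
        old_base, new_base = new_base, old_base
    ref_base = reference[site - 1]
    current = mutations.get(site, (ref_base, ref_base))[1]
    if current != old_base:
        print(f"warning: site {site} holds {current}, but {old_base} was expected")
    updated = dict(mutations)
    if new_base == ref_base:
        updated.pop(site, None)  # back to the reference base: nothing to record
    else:
        updated[site] = (ref_base, new_base)
    return updated


reference = "ACGTAC"
step1 = apply_mutstring({}, reference, "C2T")
print(step1)  # {2: ('C', 'T')}
step2 = apply_mutstring(step1, reference, "C2T", reverse=True)
print(step2)  # {}
```

Applying the same mutstring forward and then in reverse returns to the starting state, which is the round trip the reverse parameter is meant to support.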

apply_muts_raw(muts)[source]

Apply the mutations from the sequence of tuples muts.

Each tuple should contain (one-based site, from_base, to_base)

apply_muts(muts, reverse=False, debug=False)[source]

Apply a sequence of mutstrings like ‘A110G’ to this compact genome.

In this example, A is the old base, G is the new base, and 110 is the 1-based index of the mutation in the sequence. Returns the new CompactGenome, and prints a warning if the old base doesn’t match the recorded old base in this compact genome.

Parameters:
  • muts (Sequence[str]) – The mutations to apply, in the order they should be applied

  • reverse (bool) – Apply the mutations in reverse, such as when the provided mutations describe how to achieve this CompactGenome from the desired CompactGenome. If True, the mutations in muts will also be applied in reversed order.

  • debug – If True, each mutation is applied individually by CompactGenome.mutate() and the from base is checked against the current recorded base at each site.

Returns:

The new CompactGenome
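
The ordering rule for reverse=True is easy to get wrong, so here is a minimal sketch on plain mutation dictionaries. The name apply_mutstrings is hypothetical, and the sketch omits the warning and debug checks.

```python
def apply_mutstrings(mutations, reference, muts, reverse=False):
    """Apply mutstrings in order, or undo them in reversed order if reverse=True.

    Hypothetical helper for illustration; not part of historydag.
    """
    ordered = reversed(muts) if reverse else muts
    current = dict(mutations)
    for mut in ordered:
        site = int(mut[1:-1])
        # In reverse mode the target base is the mutstring's 'from' base.
        new_base = mut[0] if reverse else mut[-1]
        ref_base = reference[site - 1]
        if new_base == ref_base:
            current.pop(site, None)
        else:
            current[site] = (ref_base, new_base)
    return current


reference = "ACGTAC"
muts = ["C2T", "T2G", "A5G"]
forward = apply_mutstrings({}, reference, muts)
print(forward)  # {2: ('C', 'G'), 5: ('A', 'G')}
print(apply_mutstrings(forward, reference, muts, reverse=True))  # {}
```

Because site 2 is mutated twice, undoing the list only works if the mutations are also visited last to first.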

to_sequence()[source]

Convert this CompactGenome to a full nucleotide sequence.

mask_sites(sites, one_based=True)[source]

Remove any mutations on sites in sites, leaving the reference sequence unchanged.

Parameters:
  • sites – A collection of sites to be masked

  • one_based – If True, the provided sites will be interpreted as one-based sites. Otherwise, they will be interpreted as 0-based sites.
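
The effect of one_based on masking can be sketched as follows. The helper mask_mutations is hypothetical and operates on a plain mutation dictionary with one-based keys.

```python
def mask_mutations(mutations, sites, one_based=True):
    """Drop mutations at the given sites; the reference is left untouched.

    Hypothetical helper for illustration; not part of historydag.
    """
    # Convert 0-based input to the one-based keys used by the dictionary.
    offset = 0 if one_based else 1
    masked = {site + offset for site in sites}
    return {site: pair for site, pair in mutations.items() if site not in masked}


mutations = {2: ("C", "T"), 5: ("A", "G")}
print(mask_mutations(mutations, [2]))                   # {5: ('A', 'G')}
print(mask_mutations(mutations, [4], one_based=False))  # {2: ('C', 'T')}
```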

superset_sites(sites, new_reference, one_based=True)[source]

Do the opposite of subset_sites: adjust site indices from indices in a sequence of variant sites to indices in a sequence containing all sites.

Parameters:
  • sites – A sorted list of sites in the new_reference sequence which are represented by sites in the current compact genome’s reference sequence

  • new_reference – A new reference sequence

  • one_based – Whether the sites in sites are one-based
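
One way to picture the index adjustment is the sketch below, which maps each site in the short (variant-only) coordinate system to its position in new_reference. The name superset_mutations and the consistency check against new_reference are assumptions of this example.

```python
def superset_mutations(mutations, sites, new_reference, one_based=True):
    """Re-index mutations from variant-site coordinates to full coordinates.

    Hypothetical helper for illustration; not part of historydag.
    """
    offset = 0 if one_based else 1
    expanded = {}
    for old_site, (ref_base, new_base) in mutations.items():
        # The i-th site of the short sequence lives at sites[i - 1].
        new_site = sites[old_site - 1] + offset
        if new_reference[new_site - 1] != ref_base:
            raise ValueError(f"reference mismatch at site {new_site}")
        expanded[new_site] = (ref_base, new_base)
    return expanded


short_mutations = {1: ("C", "T"), 2: ("A", "G")}
print(superset_mutations(short_mutations, [2, 5, 6], "ACGTAC"))
# {2: ('C', 'T'), 5: ('A', 'G')}
```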

subset_sites(sites, one_based=True, new_reference=None)[source]

Remove all but those sites in sites, and adjust the reference sequence.

Parameters:
  • sites – A collection of sites to be kept

  • one_based – If True, the provided sites will be interpreted as one-based sites. Otherwise, they will be interpreted as 0-based sites.

  • new_reference – If provided, this new reference sequence will be used instead of computing the new reference sequence directly.
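
As a rough illustration of what keeping a subset of sites involves, the sketch below re-indexes the kept sites and builds the shortened reference. The helper subset_mutations is hypothetical and always computes the new reference itself.

```python
def subset_mutations(mutations, reference, sites, one_based=True):
    """Keep only the given sites and renumber them 1..len(sites).

    Hypothetical helper for illustration; not part of historydag.
    """
    offset = 0 if one_based else 1
    kept = sorted({site + offset for site in sites})  # one-based, ascending
    new_index = {old: new for new, old in enumerate(kept, start=1)}
    new_reference = "".join(reference[old - 1] for old in kept)
    new_mutations = {
        new_index[old]: pair for old, pair in mutations.items() if old in new_index
    }
    return new_mutations, new_reference


reference = "ACGTAC"
mutations = {2: ("C", "T"), 5: ("A", "G")}
print(subset_mutations(mutations, reference, [2, 5, 6]))
# ({1: ('C', 'T'), 2: ('A', 'G')}, 'CAC')
```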

remove_sites(sites, one_based=True, new_reference=None)[source]

Remove all sites in sites, and adjust the reference sequence.

Parameters:
  • sites – A collection of sites to be removed

  • one_based – If True, the provided sites will be interpreted as one-based sites. Otherwise, they will be interpreted as 0-based sites.

  • new_reference – If provided, this new reference sequence will be used instead of computing the new reference sequence directly.

historydag.compact_genome.unpack_mut_string(mut)[source]

Returns (one-based site, from_base, to_base)
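
For readers unfamiliar with the mutstring format, a standalone parser that returns the same tuple layout might look like the sketch below. The name parse_mutstring is hypothetical, and the sketch assumes single-character bases.

```python
def parse_mutstring(mut):
    """Split a mutstring such as 'A110G' into (one-based site, from_base, to_base).

    Hypothetical helper for illustration; not part of historydag.
    """
    # First and last characters are bases; everything between is the site.
    return int(mut[1:-1]), mut[0], mut[-1]


print(parse_mutstring("A110G"))  # (110, 'A', 'G')
```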

historydag.compact_genome.compact_genome_from_sequence(sequence, reference)[source]

Create a CompactGenome from a sequence and a reference sequence.

Parameters:
  • sequence (str) – the sequence to be represented by a CompactGenome

  • reference (str) – the reference sequence for the CompactGenome
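
The conversion from a full sequence to a mutation dictionary can be sketched in a few lines. The name mutations_from_sequence is hypothetical, and the equal-length requirement is an assumption of this sketch.

```python
def mutations_from_sequence(sequence, reference):
    """Return {one-based site: (reference base, new base)} for differing sites.

    Hypothetical helper for illustration; not part of historydag.
    """
    if len(sequence) != len(reference):
        raise ValueError("sequence and reference must have the same length")
    return {
        site: (ref_base, base)
        for site, (ref_base, base) in enumerate(zip(reference, sequence), start=1)
        if ref_base != base
    }


print(mutations_from_sequence("ATGTGC", "ACGTAC"))
# {2: ('C', 'T'), 5: ('A', 'G')}
```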

historydag.compact_genome.cg_diff(parent_cg, child_cg)[source]

Yields mutations in the format (parent_nuc, child_nuc, one-based sequence_index) distinguishing two compact genomes, such that applying the resulting mutations to parent_cg would yield child_cg.
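
The tuple order here, with the site last, differs from the mutstring convention, so a small sketch may help. It compares two plain mutation dictionaries that share a reference; the name diff_mutation_dicts and the ascending site order are assumptions of this example.

```python
def diff_mutation_dicts(parent_muts, child_muts, reference):
    """Yield (parent_nuc, child_nuc, one-based site) where parent and child differ.

    Hypothetical helper for illustration; not part of historydag.
    """
    for site in sorted(set(parent_muts) | set(child_muts)):
        ref_base = reference[site - 1]
        # A site absent from a dictionary carries the reference base.
        parent_nuc = parent_muts.get(site, (ref_base, ref_base))[1]
        child_nuc = child_muts.get(site, (ref_base, ref_base))[1]
        if parent_nuc != child_nuc:
            yield parent_nuc, child_nuc, site


reference = "ACGTAC"
parent = {2: ("C", "T")}
child = {2: ("C", "G"), 5: ("A", "G")}
print(list(diff_mutation_dicts(parent, child, reference)))
# [('T', 'G', 2), ('A', 'G', 5)]
```

Only sites mutated in at least one of the two genomes need to be inspected, which is what makes the compact representation efficient for diffs.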

historydag.compact_genome.ambiguous_cg_diff(parent_cg, child_cg, transition_model=default_nt_transitions, randomize=False)[source]

Yields a minimal collection of mutations in the format (parent_nuc, child_nuc, one-based sequence_index) distinguishing two compact genomes, such that applying the resulting mutations to parent_cg would yield a compact genome compatible with the possibly ambiguous child_cg.

If randomize is True, mutations will be randomized when there are multiple possible min-weight choices.
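
The idea of a diff that is compatible with an ambiguous child can be sketched on full sequences. This example assumes every substitution has equal weight and that the parent is unambiguous, which is simpler than a general transition model; the name ambiguous_diff and the tie-breaking rule are hypothetical.

```python
import random

# IUPAC nucleotide codes and the bases each one may resolve to.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def ambiguous_diff(parent_seq, child_seq, randomize=False):
    """Yield (parent_nuc, child_nuc, one-based site) under equal-weight changes.

    Hypothetical helper for illustration; not part of historydag.
    """
    for site, (parent_nuc, code) in enumerate(zip(parent_seq, child_seq), start=1):
        options = AMBIGUITY[code]
        if parent_nuc in options:
            continue  # the parent base already resolves the child's code
        # All options cost the same here, so pick one at random or the first.
        child_nuc = random.choice(options) if randomize else options[0]
        yield parent_nuc, child_nuc, site


print(list(ambiguous_diff("ACGT", "ARGY")))  # [('C', 'A', 2)]
```

Site 4 produces no mutation because T is one of the bases allowed by Y, which is what keeps the collection minimal.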

historydag.compact_genome.reconcile_cgs(cg_list, check_references=True, ambiguitymap=standard_nt_ambiguity_map)[source]

Returns a compact genome containing ambiguous bases, representing the least ambiguous sequence of which all provided cgs in cg_list are resolutions. Also returns a flag indicating whether the resulting CG contains ambiguities.

If check_references is False, reference sequences will be assumed equal.
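
To illustrate what "least ambiguous sequence" means, the sketch below merges full sequences column by column using the IUPAC nucleotide codes. It works on strings rather than compact genomes, and the name reconcile_sequences is hypothetical.

```python
# IUPAC nucleotide codes and the bases each one may resolve to.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def reconcile_sequences(sequences):
    """Return (merged sequence, has_ambiguity) for equal-length sequences.

    Hypothetical helper for illustration; not part of historydag.
    """
    merged = []
    for column in zip(*sequences):
        # The smallest base set covering every input is the union of their sets.
        bases = frozenset().union(*(IUPAC[base] for base in column))
        merged.append(CODE_FOR[bases])
    result = "".join(merged)
    return result, any(code not in "ACGT" for code in result)


print(reconcile_sequences(["ACGT", "ATGT", "ACGA"]))  # ('AYGW', True)
```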

historydag.compact_genome.read_alignment(alignment_file, reference_sequence=None)[source]

Read a fasta or vcf alignment and return a dictionary mapping sequence ID strings to CompactGenomes.

Parameters:
  • alignment_file – A file containing a fasta or vcf alignment. The file format is determined by the file extension, which is expected to be .fa, .fasta, or .vcf.

  • reference_sequence (Optional[Sequence[Any]]) – If a fasta file is provided, the first sequence in that file will be used as the compact genome reference sequence, unless one is explicitly provided to this keyword argument.
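
The default choice of reference for fasta input can be sketched with a small standalone parser. This is not how historydag reads files; the function read_fasta_as_mutations is hypothetical, handles only fasta text, and returns plain mutation dictionaries instead of CompactGenome objects.

```python
import io


def read_fasta_as_mutations(handle, reference_sequence=None):
    """Map each sequence ID to {one-based site: (reference base, new base)}.

    Hypothetical helper for illustration; not part of historydag.
    """
    records, name = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    sequences = {seq_id: "".join(parts) for seq_id, parts in records.items()}
    if reference_sequence is None:
        # Fall back to the first sequence in the file as the reference.
        reference_sequence = next(iter(sequences.values()))
    return {
        seq_id: {
            site: (ref_base, base)
            for site, (ref_base, base) in enumerate(
                zip(reference_sequence, seq), start=1
            )
            if ref_base != base
        }
        for seq_id, seq in sequences.items()
    }


fasta = io.StringIO(">ref\nACGTAC\n>s1\nATGTAC\n")
print(read_fasta_as_mutations(fasta))  # {'ref': {}, 's1': {2: ('C', 'T')}}
```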