data

Defines Data objects for handling data from one or more deep mutational scanning (DMS) experiments under various conditions.

class multidms.data.Data(variants_df: DataFrame, reference: str, alphabet=('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'), collapse_identical_variants=False, condition_colors=('#0072B2', '#CC79A7', '#009E73', '#17BECF', '#BCDB22'), letter_suffixed_sites=False, assert_site_integrity=False, verbose=False, nb_workers=None, name=None)

Bases: object

Prepares and stores a one-hot encoding of variant substitution data. A single object of this type can be shared by multiple multidms.Model objects, so that various models can be fit efficiently to the same data.

Note

You can initialize a Data object with a pandas.DataFrame with a row for each variant sampled and annotations provided in the required columns:

  1. condition - Experimental condition from which a sample measurement was obtained.

  2. aa_substitutions - Defines each variant \(v\) as a string of substitutions (e.g., 'M3A K5G'). Note that while conditions may have differing wild types at a given site, site numbering must be shared across conditions: a given site number should refer to the same position in an alignment of the condition wild type sequences.

  3. func_score - The functional score computed from experimental measurements.

Parameters:
  • variants_df (pandas.DataFrame or None) – The variant level information from all experiments you wish to analyze. Should have columns named 'condition', 'aa_substitutions', and 'func_score'. See the class note for descriptions of each of the features.

  • reference (str) – Name of the condition that serves as the reference. Note that for model fitting this class converts all amino acid substitutions in non-reference conditions to be relative to the reference condition. For example, if the wild type amino acid at site 30 is an A in the reference condition and a G in a non-reference condition, then a G30Y mutation in the non-reference condition is recorded as an A30Y mutation relative to the reference. This way, each condition informs the exact same parameters, even at sites that differ in wild type amino acid. The converted variants are encoded in a binarymap.binarymap.BinaryMap object for each condition, where all sites that are non-identical to the reference are 1’s. For motivation, see the Model overview section in the multidms.Model class notes.

  • alphabet (array-like) – Allowed characters in mutation strings.

  • collapse_identical_variants ({'mean', 'median', False}) – If identical variants (same ‘aa_substitutions’) exist within an individual condition group in variants_df, collapse them by taking the mean or median of ‘func_score’, or (if False) do not collapse at all. Collapsing makes fitting faster, but is not recommended if you are bootstrapping.

  • condition_colors (array-like or dict) – Maps each condition to the color used for plotting. Either a dict keyed by each condition, or an array of colors that are sequentially assigned to the conditions.

  • letter_suffixed_sites (bool) – True if sites may carry a lowercase letter suffix (e.g., '214a'), False if all sites are plain integers.

  • assert_site_integrity (bool) – If True, will assert that all sites in the data frame have the same wild type amino acid, grouped by condition.

  • verbose (bool) – If True, will print progress bars.

  • nb_workers (int) – Number of workers to use for parallel operations. If None, will use all available CPUs.

  • name (str or None) – Name of the data object. If None, will be assigned a unique name based upon the number of data objects instantiated.
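
The effect of collapse_identical_variants can be sketched in plain Python. This is an illustrative stand-in rather than the library's implementation: the helper name collapse_variants and the (condition, aa_substitutions, func_score) tuple layout are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, median


def collapse_variants(rows, how="mean"):
    """Collapse rows that share (condition, aa_substitutions) into one row.

    Hypothetical helper: `rows` is an iterable of
    (condition, aa_substitutions, func_score) tuples.
    """
    aggregate = {"mean": mean, "median": median}[how]
    groups = defaultdict(list)
    for condition, subs, score in rows:
        # Variants are only pooled within the same condition.
        groups[(condition, subs)].append(score)
    return [
        (condition, subs, aggregate(scores))
        for (condition, subs), scores in groups.items()
    ]


rows = [
    ("a", "M1E", 2.0),
    ("a", "M1E", 3.0),   # identical variant within condition 'a'
    ("a", "G3R", -7.0),
    ("b", "M1E", 1.0),   # same substitution, different condition
]
collapsed = collapse_variants(rows, how="mean")
print(collapsed)
```

Here the two 'M1E' rows in condition 'a' are merged into a single row, while the 'M1E' row in condition 'b' is left alone.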

Example

Simple example with two conditions ('a' and 'b'):

>>> import pandas as pd
>>> import multidms
>>> func_score_data = {
...     'condition' : ["a","a","a","a", "b","b","b","b","b","b"],
...     'aa_substitutions' : [
...         'M1E', 'G3R', 'G3P', 'M1W', 'M1E',
...         'P3R', 'P3G', 'M1E P3G', 'M1E P3R', 'P2T'
...     ],
...     'func_score' : [2, -7, -0.5, 2.3, 1, -5, 0.4, 2.7, -2.7, 0.3],
... }
>>> func_score_df = pd.DataFrame(func_score_data)
>>> func_score_df
  condition aa_substitutions  func_score
0         a              M1E         2.0
1         a              G3R        -7.0
2         a              G3P        -0.5
3         a              M1W         2.3
4         b              M1E         1.0
5         b              P3R        -5.0
6         b              P3G         0.4
7         b          M1E P3G         2.7
8         b          M1E P3R        -2.7
9         b              P2T         0.3

Instantiate a Data object, allowing for stop codon variants and declaring condition “a” as the reference condition.

>>> data = multidms.Data(
...     func_score_df,
...     alphabet = multidms.AAS_WITHSTOP,
...     reference = "a",
... )

Note that this may take some time due to the string operations that must be performed when converting amino acid substitutions to be with respect to a reference wild type sequence.

Once the object has been instantiated, a few ‘static’ properties of the data are available. See the individual property docstrings for more information.

>>> data.reference
'a'
>>> data.conditions
('a', 'b')
>>> data.mutations
('M1E', 'M1W', 'G3P', 'G3R')
>>> data.site_map
   a  b
1  M  M
3  G  P
>>> data.mutations_df  
  mutation wts  sites muts  times_seen_a  times_seen_b
0      M1E   M      1    E             1             3
1      M1W   M      1    W             1             0
2      G3P   G      3    P             1             1
3      G3R   G      3    R             1             2
>>> data.variants_df  
  condition aa_substitutions  func_score var_wrt_ref
0         a              M1E         2.0         M1E
1         a              G3R        -7.0         G3R
2         a              G3P        -0.5         G3P
3         a              M1W         2.3         M1W
4         b              M1E         1.0     G3P M1E
5         b              P3R        -5.0         G3R
6         b              P3G         0.4
7         b          M1E P3G         2.7         M1E
8         b          M1E P3R        -2.7     G3R M1E

property name: str

The name of the data object.

property conditions: tuple

A tuple of all conditions.

property reference: str

The name of the reference condition.

property reference_index: int

The index of the reference condition.

property mutations: tuple

A tuple of all mutations, ordered by their index in the binarymap.

property mutations_df: DataFrame

A dataframe summarizing all single mutations.
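
As a rough illustration of the times_seen columns, the sketch below counts how many variants in each condition contain each mutation. It assumes that the counts are taken over the var_wrt_ref strings (substitutions relative to the reference); the helper name times_seen and the plain-dict layout are hypothetical and not part of multidms.

```python
from collections import Counter

# var_wrt_ref strings per condition, copied from the class example.
var_wrt_ref = {
    "a": ["M1E", "G3R", "G3P", "M1W"],
    "b": ["G3P M1E", "G3R", "", "M1E", "G3R M1E"],
}
mutations = ("M1E", "M1W", "G3P", "G3R")


def times_seen(variants):
    """Count the number of variants in which each single mutation occurs."""
    counts = Counter()
    for variant in variants:
        # A set guards against counting a mutation twice in one variant.
        counts.update(set(variant.split()))
    return counts


counts = {cond: times_seen(vs) for cond, vs in var_wrt_ref.items()}
# Counter returns 0 for mutations never observed in a condition.
summary = {mut: {cond: counts[cond][mut] for cond in counts} for mut in mutations}
for mut, by_condition in summary.items():
    print(mut, by_condition)
```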

property variants_df: DataFrame

A dataframe summarizing all variants in the training data.

property site_map: DataFrame

A dataframe indexed by site, with columns for all conditions giving the wild type amino acid at each site.
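
One way to picture the site map is as the wild type letters read off the substitution strings of each condition. The sketch below assumes that only sites observed in every condition are kept, which is consistent with site 2 (seen only in variant 'P2T' of condition 'b') being absent from the example outputs. The function infer_site_map and the regular expression are hypothetical, not the library's code.

```python
import re

# Hypothetical pattern: wild type letter, integer site, mutant letter.
MUT_RE = re.compile(r"([A-Z*])(\d+)([A-Z*])")


def infer_site_map(variants_by_condition):
    """Return {site: {condition: wild type}} for sites seen in every condition."""
    wts = {}
    for condition, variants in variants_by_condition.items():
        for variant in variants:
            for wt, site, _mut in MUT_RE.findall(variant):
                wts.setdefault(int(site), {})[condition] = wt
    n_conditions = len(variants_by_condition)
    return {
        site: by_condition
        for site, by_condition in sorted(wts.items())
        if len(by_condition) == n_conditions
    }


variants = {
    "a": ["M1E", "G3R", "G3P", "M1W"],
    "b": ["M1E", "P3R", "P3G", "M1E P3G", "M1E P3R", "P2T"],
}
site_map = infer_site_map(variants)
print(site_map)
```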

property non_identical_mutations: dict

A dictionary keyed by condition names with values being a string of all mutations that differ from the reference sequence.

property non_identical_sites: dict

A dictionary keyed by condition names with values being a pandas.DataFrame indexed by site, with columns for the reference and non-reference amino acid at each site that differs.

property bundle_idxs: dict

A dictionary keyed by condition names with values being the indices into the binarymap of the bundle (non_identical) mutations.

property reference_sequence_conditions: list

A list of conditions that have the same wild type sequence as the reference condition.

property training_data: dict

A dictionary with keys ‘X’ and ‘y’ for the training data.

property scaled_training_data: dict

A dictionary with keys ‘X’ and ‘y’ for the scaled training data.

property binarymaps: dict

A dictionary keyed by condition names with values being a BinaryMap object for each condition.
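
The encoding held for each condition can be pictured as a 0/1 matrix with one row per variant and one column per entry of data.mutations. The sketch below uses plain Python lists for illustration; a real BinaryMap object stores more than this, and the helper one_hot is a hypothetical name.

```python
def one_hot(variants, mutations):
    """Encode each space-delimited variant as a 0/1 row over `mutations`."""
    index = {mut: i for i, mut in enumerate(mutations)}
    rows = []
    for variant in variants:
        row = [0] * len(mutations)
        for mut in variant.split():
            row[index[mut]] = 1
        rows.append(row)
    return rows


mutations = ("M1E", "M1W", "G3P", "G3R")
# var_wrt_ref strings for condition 'b' in the class example.
X_b = one_hot(["G3P M1E", "G3R", "", "M1E", "G3R M1E"], mutations)
for row in X_b:
    print(row)
```

The empty string (a variant identical to the reference sequence) encodes to a row of zeros.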

property targets: dict

The functional scores for each variant in the training data.

property mutparser: MutationParser

The polyclonal.utils.MutationParser used to parse mutations.

property parse_mut: MutationParser

A function that splits a single amino acid substitution into wildtype, site, and mutation using the mutation parser.

property parse_muts: partial

A function that splits amino acid substitutions (a string of more than one) into wildtype, site, and mutation using the mutation parser.
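
The behavior of these two parsing functions can be approximated with a regular expression. The sketch below is not the polyclonal parser: the pattern, the error handling, and the return layout (three parallel tuples for a multi-substitution string) are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern: wild type letter, integer site, mutant letter.
MUT_RE = re.compile(r"^(?P<wt>[A-Z*])(?P<site>\d+)(?P<mut>[A-Z*])$")


def parse_mut(mutation):
    """Split one substitution such as 'M3A' into (wildtype, site, mutation)."""
    match = MUT_RE.match(mutation)
    if match is None:
        raise ValueError(f"invalid mutation: {mutation!r}")
    return match["wt"], int(match["site"]), match["mut"]


def parse_muts(aa_subs):
    """Split a space-delimited substitution string into three parallel tuples."""
    parsed = [parse_mut(token) for token in aa_subs.split()]
    if not parsed:
        return (), (), ()
    wts, sites, muts = zip(*parsed)
    return wts, sites, muts


print(parse_mut("M3A"))
print(parse_muts("M3A K5G"))
```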

property single_mut_encodings

A dictionary keyed by condition names with values being the one-hot encoding of all single mutations.

convert_subs_wrt_ref_seq(condition, aa_subs)

Convert amino acid substitutions to be with respect to the reference sequence.

Parameters:
  • condition (str) – The condition that the amino acid substitutions are relative to.

  • aa_subs (str) – A string of amino acid substitutions, relative to the condition sequence, to be converted.

Returns:

A string of amino acid substitutions relative to the reference sequence.

Return type:

str
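
The conversion can be reasoned about site by site: take the amino acid the variant actually carries (the mutant if the site is mutated, otherwise the condition wild type) and report it whenever it differs from the reference wild type. The sketch below follows that logic under stated assumptions: the function convert_wrt_ref, its arguments, and the plain-dict site maps are hypothetical, sites are plain integers, and substitutions at sites missing from the reference map are ignored. It orders its output by site, which may differ from the ordering the library returns.

```python
def convert_wrt_ref(aa_subs, cond_wts, ref_wts):
    """Re-express substitutions relative to the reference wild type.

    `cond_wts` and `ref_wts` map site -> wild type amino acid for the
    variant's own condition and for the reference condition.
    """
    # Amino acid the variant carries at each site it mutates.
    mutated = {int(token[1:-1]): token[-1] for token in aa_subs.split()}
    converted = []
    for site in sorted(ref_wts):
        # Fall back to the condition wild type at unmutated sites.
        amino_acid = mutated.get(site, cond_wts[site])
        if amino_acid != ref_wts[site]:
            converted.append(f"{ref_wts[site]}{site}{amino_acid}")
    return " ".join(converted)


ref_wts = {1: "M", 3: "G"}   # condition 'a' in the class example
cond_wts = {1: "M", 3: "P"}  # condition 'b' in the class example

for subs in ["M1E", "P3R", "P3G", "M1E P3G", "M1E P3R"]:
    print(repr(subs), "->", repr(convert_wrt_ref(subs, cond_wts, ref_wts)))
```

Note how 'P3G' in condition 'b' converts to an empty string: mutating site 3 to G restores the reference wild type, so nothing differs from the reference.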

plot_times_seen_hist(saveas=None, show=True, **kwargs)

Plot a histogram of the number of times each mutation was seen.

plot_func_score_boxplot(saveas=None, show=True, **kwargs)

Plot a boxplot of the functional scores for each condition.