data
Defines Data objects for handling data from one or more DMS experiments under various conditions.
- class multidms.data.Data(variants_df: DataFrame, reference: str, alphabet=('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'), collapse_identical_variants=False, condition_colors=('#0072B2', '#CC79A7', '#009E73', '#17BECF', '#BCDB22'), letter_suffixed_sites=False, assert_site_integrity=False, verbose=False, nb_workers=None, name=None)
Bases: object
Prep and store one-hot encoding of variant substitutions data. Individual objects of this type can be shared by multiple multidms.Model objects for efficiently fitting various models to the same data.
Note
You can initialize a Data object with a pandas.DataFrame with a row for each variant sampled and annotations provided in the required columns:
- condition - Experimental condition from which a sample measurement was obtained.
- aa_substitutions - Defines each variant \(v\) as a string of substitutions (e.g., 'M3A K5G'). Note that while conditions may have differing wild types at a given site, the sites between conditions should reference the same site when alignment is performed between condition wild types.
- func_score - The functional score computed from experimental measurements.
- Parameters:
variants_df (pandas.DataFrame or None) – The variant level information from all experiments you wish to analyze. Should have columns named 'condition', 'aa_substitutions', and 'func_score'. See the class note for descriptions of each of the features.
reference (str) – Name of the condition which annotates the reference variants. Note that for model fitting this class will convert all amino acid substitutions for non-reference condition groups to be relative to the reference condition. For example, if the wild type amino acid at site 30 is an A in the reference condition, and a G in a non-reference condition, then a G30Y mutation in the non-reference condition is recorded as an A30Y mutation relative to the reference. This way, each condition informs the exact same parameters, even at sites that differ in wild type amino acid. These are encoded in a binarymap.binarymap.BinaryMap object for each condition, where all sites that are non-identical to the reference are 1's. For motivation, see the Model overview section in the multidms.Model class notes.
alphabet (array-like) – Allowed characters in mutation strings.
collapse_identical_variants ({'mean', 'median', False}) – If identical variants in variants_df (same 'aa_substitutions') exist within individual condition groups, collapse them by taking the mean or median of 'func_score', or (if False) do not collapse at all. Collapsing makes fitting faster, but is not a good idea if you are doing bootstrapping.
condition_colors (array-like or dict) – Maps each condition to the color used for plotting. Either a dict keyed by each condition, or an array of colors that are sequentially assigned to the conditions.
letter_suffixed_sites (bool) – True if sites may carry lowercase letter suffixes (e.g., '214a'), False if all sites are sequential integers.
assert_site_integrity (bool) – If True, will assert that all sites in the data frame have the same wild type amino acid, grouped by condition.
verbose (bool) – If True, will print progress bars.
nb_workers (int) – Number of workers to use for parallel operations. If None, will use all available CPUs.
name (str or None) – Name of the data object. If None, will be assigned a unique name based upon the number of data objects instantiated.
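The effect of collapse_identical_variants can be illustrated with a minimal standard-library sketch. The helper collapse_identical and its tuple-based input are hypothetical and are not part of multidms; the sketch only assumes that rows sharing the same condition and 'aa_substitutions' are reduced to one row by the mean or median of 'func_score'.

```python
from collections import defaultdict
from statistics import mean, median


def collapse_identical(rows, how="mean"):
    """Collapse rows that share (condition, aa_substitutions).

    `rows` is a list of (condition, aa_substitutions, func_score) tuples.
    Hypothetical helper for illustration; not the multidms implementation.
    """
    if how not in ("mean", "median"):
        raise ValueError("how must be 'mean' or 'median'")
    reducer = mean if how == "mean" else median

    # Group scores by condition and substitution string.
    grouped = defaultdict(list)
    for condition, subs, score in rows:
        grouped[(condition, subs)].append(score)

    # Dicts keep insertion order, so groups appear in first-seen order.
    return [
        (condition, subs, reducer(scores))
        for (condition, subs), scores in grouped.items()
    ]


rows = [
    ("a", "M1E", 2.0),
    ("a", "M1E", 4.0),   # duplicate of the row above within condition 'a'
    ("a", "G3R", -7.0),
    ("b", "M1E", 1.0),   # same substitution, different condition: kept apart
]
collapsed = collapse_identical(rows, how="mean")
```

Because replicate measurements are merged into one value, resampling the collapsed rows no longer resamples the individual measurements, which is one way to read the warning about bootstrapping.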
Example
Simple example with two conditions ('a' and 'b'):
>>> import pandas as pd
>>> import multidms
>>> func_score_data = {
...     'condition' : ["a","a","a","a", "b","b","b","b","b","b"],
...     'aa_substitutions' : [
...         'M1E', 'G3R', 'G3P', 'M1W', 'M1E',
...         'P3R', 'P3G', 'M1E P3G', 'M1E P3R', 'P2T'
...     ],
...     'func_score' : [2, -7, -0.5, 2.3, 1, -5, 0.4, 2.7, -2.7, 0.3],
... }
>>> func_score_df = pd.DataFrame(func_score_data)
>>> func_score_df
  condition aa_substitutions  func_score
0         a              M1E         2.0
1         a              G3R        -7.0
2         a              G3P        -0.5
3         a              M1W         2.3
4         b              M1E         1.0
5         b              P3R        -5.0
6         b              P3G         0.4
7         b          M1E P3G         2.7
8         b          M1E P3R        -2.7
9         b              P2T         0.3
Instantiate a Data object, allowing for stop codon variants and declaring condition "a" as the reference condition.
>>> data = multidms.Data(
...     func_score_df,
...     alphabet = multidms.AAS_WITHSTOP,
...     reference = "a",
... )
Note this may take some time due to the string operations that must be performed when converting amino acid substitutions to be with respect to a reference wild type sequence.
After the object has finished being instantiated, we now have access to a few ‘static’ properties of our data. See individual property docstring for more information.
>>> data.reference
'a'
>>> data.conditions
('a', 'b')
>>> data.mutations
('M1E', 'M1W', 'G3P', 'G3R')
>>> data.site_map
   a  b
1  M  M
3  G  P
>>> data.mutations_df
  mutation wts  sites muts  times_seen_a  times_seen_b
0      M1E   M      1    E             1             3
1      M1W   M      1    W             1             0
2      G3P   G      3    P             1             4
3      G3R   G      3    R             1             2
>>> data.variants_df
  condition aa_substitutions  func_score var_wrt_ref
0         a              M1E         2.0         M1E
1         a              G3R        -7.0         G3R
2         a              G3P        -0.5         G3P
3         a              M1W         2.3         M1W
4         b              M1E         1.0     G3P M1E
5         b              P3R        -5.0         G3R
6         b              P3G         0.4
7         b          M1E P3G         2.7         M1E
8         b          M1E P3R        -2.7     G3R M1E
- property mutations: tuple
A tuple of all mutations in the order relative to their index into the binarymap.
- property site_map: DataFrame
A dataframe indexed by site, with columns for all conditions giving the wild type amino acid at each site.
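As a rough illustration of what such a site map contains, the sketch below infers the wild type at each site from substitution strings using only the standard library. The helper infer_site_map is hypothetical, and restricting the result to sites observed in every condition is an assumption made to mirror the example output above (where site 2, seen only in condition 'b', does not appear); it is not a statement of how multidms builds the map.

```python
import re

# Wild type letter, integer site, mutant letter (e.g. 'M1E'); '*' marks a stop.
SUB_RE = re.compile(r"^([A-Z*])(\d+)([A-Z*])$")


def infer_site_map(variants, conditions):
    """Map site -> {condition: wild type} from (condition, subs) pairs.

    Hypothetical helper for illustration; not the multidms implementation.
    """
    site_map = {}
    for condition, subs in variants:
        for sub in subs.split():
            match = SUB_RE.match(sub)
            if match is None:
                raise ValueError(f"cannot parse substitution {sub!r}")
            wt, site, _mut = match.groups()
            per_site = site_map.setdefault(int(site), {})
            # A site must report one wild type per condition.
            if per_site.setdefault(condition, wt) != wt:
                raise ValueError(f"conflicting wild types at site {site}")
    # Assumption: keep only sites observed in every condition.
    return {
        site: wts
        for site, wts in sorted(site_map.items())
        if all(c in wts for c in conditions)
    }


variants = [
    ("a", "M1E"), ("a", "G3R"), ("a", "G3P"), ("a", "M1W"),
    ("b", "M1E"), ("b", "P3R"), ("b", "P3G"),
    ("b", "M1E P3G"), ("b", "M1E P3R"), ("b", "P2T"),
]
site_map = infer_site_map(variants, conditions=("a", "b"))
```

The conflict check inside the loop is in the same spirit as the assert_site_integrity option, which asserts a consistent wild type per site within each condition.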
- property non_identical_mutations: dict
A dictionary keyed by condition names with values being a string of all mutations that differ from the reference sequence.
- property non_identical_sites: dict
A dictionary keyed by condition names with values being a
pandas.DataFrame
indexed by site, with columns for the reference and non-reference amino acid at each site that differs.
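The relationship between the site map, the non-identical sites, and the non-identical mutations can be sketched with plain dictionaries. The function names below are hypothetical, the site map is copied from the two-condition example above, and omitting the reference condition from the result is an assumption of this sketch rather than documented behavior.

```python
def non_identical_sites(site_map, reference):
    """Return {condition: {site: (reference wt, condition wt)}}.

    Only sites whose wild type differs from the reference are listed.
    Hypothetical helper for illustration; not the multidms implementation.
    """
    conditions = {c for wts in site_map.values() for c in wts}
    result = {}
    for condition in sorted(conditions - {reference}):
        result[condition] = {
            site: (wts[reference], wts[condition])
            for site, wts in sorted(site_map.items())
            if wts[condition] != wts[reference]
        }
    return result


def non_identical_mutations(site_map, reference):
    """Return {condition: 'G3P ...'}, the differences written as mutations."""
    return {
        condition: " ".join(f"{ref}{site}{alt}" for site, (ref, alt) in diffs.items())
        for condition, diffs in non_identical_sites(site_map, reference).items()
    }


# Site map from the example: site -> {condition: wild type}.
site_map = {1: {"a": "M", "b": "M"}, 3: {"a": "G", "b": "P"}}
sites = non_identical_sites(site_map, reference="a")
muts = non_identical_mutations(site_map, reference="a")
```

Here condition 'b' differs from the reference only at site 3, which corresponds to the G3P entry that appears in var_wrt_ref for 'b' variants that keep their own wild type at that site.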
- property bundle_idxs: dict
A dictionary keyed by condition names with values being the indices into the binarymap representing bundle (non_identical) mutations.
- property reference_sequence_conditions: list
A list of conditions that have the same wild type sequence as the reference condition.
- property scaled_training_data: dict
A dictionary with keys ‘X’ and ‘y’ for the scaled training data.
- property binarymaps: dict
A dictionary keyed by condition names with values being a
BinaryMap
object for each condition.
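To make the idea of a per-condition one-hot encoding concrete, here is a minimal dense sketch using plain lists. It is not the BinaryMap implementation; the helper one_hot is hypothetical, and the inputs are the mutations tuple and the var_wrt_ref strings for condition 'b' from the example above.

```python
def one_hot(var_wrt_ref, mutations):
    """Encode a space-separated substitution string as a 0/1 list.

    Column i is 1 when mutations[i] is present in the variant.
    Hypothetical helper for illustration; not the binarymap implementation.
    """
    index = {mutation: i for i, mutation in enumerate(mutations)}
    row = [0] * len(mutations)
    for sub in var_wrt_ref.split():
        row[index[sub]] = 1
    return row


# Column order follows the mutations tuple from the example.
mutations = ("M1E", "M1W", "G3P", "G3R")

# var_wrt_ref values for condition 'b' in the example variants_df.
variants_b = ["G3P M1E", "G3R", "", "M1E", "G3R M1E"]
X_b = [one_hot(variant, mutations) for variant in variants_b]
```

An empty string encodes to an all-zero row, which is how a variant identical to the reference sequence would look under this scheme.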
- property mutparser: MutationParser
The polyclonal.utils.MutationParser object used to parse mutations.
- property parse_mut: MutationParser
Returns a function that splits a single amino acid substitution into wildtype, site, and mutation using the mutation parser.
- property parse_muts: partial
A function that splits amino acid substitutions (a string of more than one) into wildtype, site, and mutation using the mutation parser.
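The kind of splitting these two properties describe can be sketched with a regular expression. The functions split_sub and split_subs are hypothetical stand-ins, not the polyclonal MutationParser; the pattern assumes single-letter amino acids (with '*' for stop), an integer site with an optional lowercase letter suffix, and sites returned as strings.

```python
import re

# Wild type, site (optionally letter-suffixed, e.g. '214a'), mutant.
SUB_RE = re.compile(r"^(?P<wt>[A-Z*])(?P<site>\d+[a-z]?)(?P<mut>[A-Z*])$")


def split_sub(sub):
    """Split one substitution such as 'M3A' into (wildtype, site, mutation).

    Hypothetical helper for illustration; not the polyclonal parser.
    """
    match = SUB_RE.match(sub)
    if match is None:
        raise ValueError(f"cannot parse substitution {sub!r}")
    return match["wt"], match["site"], match["mut"]


def split_subs(subs):
    """Split a space-separated string of substitutions into three lists."""
    wts, sites, muts = [], [], []
    for sub in subs.split():
        wt, site, mut = split_sub(sub)
        wts.append(wt)
        sites.append(site)
        muts.append(mut)
    return wts, sites, muts


single = split_sub("M3A")
multiple = split_subs("M3A K5G")
suffixed = split_sub("N214aK")
```

Keeping the site as a string is a deliberate choice in this sketch so that integer and letter-suffixed sites are handled uniformly.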
- property single_mut_encodings
A dictionary keyed by condition names with values being the one-hot encoding of all single mutations.
- convert_subs_wrt_ref_seq(condition, aa_subs)
Convert amino acid substitutions to be with respect to the reference sequence.
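A minimal sketch of this kind of conversion follows, assuming per-site wild types are known for both the reference and the condition. The helper convert_wrt_ref is hypothetical and is not the multidms method; it compares the residue a variant carries at every shared site against the reference wild type. Sorting the result alphabetically is done only to reproduce the ordering seen in the example var_wrt_ref column, and raising on sites absent from the shared site map is an assumption of this sketch.

```python
import re

SUB_RE = re.compile(r"^([A-Z*])(\d+)([A-Z*])$")


def convert_wrt_ref(aa_subs, ref_wts, cond_wts):
    """Rewrite a condition's substitutions relative to the reference.

    ref_wts and cond_wts map site -> wild type for the shared sites.
    Hypothetical helper for illustration; not the multidms implementation.
    """
    # Residue the variant carries at each explicitly substituted site.
    mutated = {}
    for sub in aa_subs.split():
        match = SUB_RE.match(sub)
        if match is None:
            raise ValueError(f"cannot parse substitution {sub!r}")
        _wt, site, mut = match.groups()
        if int(site) not in ref_wts:
            raise ValueError(f"site {site} is not shared with the reference")
        mutated[int(site)] = mut

    converted = []
    for site, ref_wt in ref_wts.items():
        # Unsubstituted sites carry the condition's own wild type.
        residue = mutated.get(site, cond_wts[site])
        if residue != ref_wt:
            converted.append(f"{ref_wt}{site}{residue}")
    # Alphabetical order, chosen to match the example output.
    return " ".join(sorted(converted))


ref_wts = {1: "M", 3: "G"}    # condition 'a' (reference)
cond_wts = {1: "M", 3: "P"}   # condition 'b'
converted = [
    convert_wrt_ref(subs, ref_wts, cond_wts)
    for subs in ["M1E", "P3R", "P3G", "M1E P3G", "M1E P3R"]
]
```

Note how 'P3G' becomes an empty string: mutating the condition's P at site 3 to G restores the reference wild type, so nothing differs from the reference.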
- plot_times_seen_hist(saveas=None, show=True, **kwargs)
Plot a histogram of the number of times each mutation was seen.
- plot_func_score_boxplot(saveas=None, show=True, **kwargs)
Plot a boxplot of the functional scores for each condition.