multidms package

multidms

multidms is a Python package for modeling deep mutational scanning data. In particular, it is designed to model data from more than one experiment, even if they don’t share the same wildtype amino acid sequence. It uses joint modeling to inform parameters across all experiments, while identifying experiment-specific mutation effects which differ.

Importing this package imports the following objects into the package namespace:

For a brief description of how the Model class composes, compiles, and optimizes the model parameters - as well as detailed code documentation for each of the equations described in the biophysical docs - see:

plot mostly contains code for interactive plotting at the moment.

It also imports the following alphabets:

  • AAS

  • AAS_WITHSTOP

  • AAS_WITHGAP

  • AAS_WITHSTOP_WITHGAP

class multidms.Data(variants_df: DataFrame, reference: str, alphabet=('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'), collapse_identical_variants=False, condition_colors=('#0072B2', '#CC79A7', '#009E73', '#17BECF', '#BCDB22'), letter_suffixed_sites=False, assert_site_integrity=False, verbose=False, nb_workers=None, name=None)

Bases: object

Prepare and store the one-hot encoding of variant substitution data. Individual objects of this type can be shared by multiple multidms.Model objects for efficiently fitting various models to the same data.

Note

You can initialize a Data object with a pandas.DataFrame with a row for each variant sampled and annotations provided in the required columns:

  1. condition - Experimental condition from which a sample measurement was obtained.

  2. aa_substitutions - Defines each variant \(v\) as a string of substitutions (e.g., 'M3A K5G'). Note that while conditions may have differing wild types at a given site, the sites between conditions should reference the same site when alignment is performed between condition wild types.

  3. func_score - The functional score computed from experimental measurements.

Parameters:
  • variants_df (pandas.DataFrame or None) – The variant level information from all experiments you wish to analyze. Should have columns named 'condition', 'aa_substitutions', and 'func_score'. See the class note for descriptions of each of the features.

  • reference (str) – Name of the condition which annotates the reference variants. Note that for model fitting this class will convert all amino acid substitutions for non-reference condition groups to be relative to the reference condition. For example, if the wild type amino acid at site 30 is an A in the reference condition and a G in a non-reference condition, then a G30Y mutation in the non-reference condition is recorded as an A30Y mutation relative to the reference. This way, each condition informs the exact same parameters, even at sites that differ in wild type amino acid. These are encoded in a binarymap.binarymap.BinaryMap object for each condition, where all sites that are non-identical to the reference are 1’s. For motivation, see the Model overview section in multidms.Model class notes.

  • alphabet (array-like) – Allowed characters in mutation strings.

  • collapse_identical_variants ({'mean', 'median', False}) – If identical variants (same ‘aa_substitutions’) exist within individual condition groups of variants_df, collapse them by taking the mean or median of ‘func_score’; if False, do not collapse at all. Collapsing will make fitting faster, but is not a good idea if you are doing bootstrapping.

  • condition_colors (array-like or dict) – Maps each condition to the color used for plotting. Either a dict keyed by each condition, or an array of colors that are sequentially assigned to the conditions.

  • letter_suffixed_sites (bool) – Set to True if site labels are not simple sequential integers (e.g. they carry letter suffixes such as '214a'); set to False if sites are sequential integers.

  • assert_site_integrity (bool) – If True, will assert that all sites in the data frame have the same wild type amino acid, grouped by condition.

  • verbose (bool) – If True, will print progress bars.

  • nb_workers (int) – Number of workers to use for parallel operations. If None, will use all available CPUs.

  • name (str or None) – Name of the data object. If None, will be assigned a unique name based upon the number of data objects instantiated.

Example

Simple example with two conditions ('a' and 'b')

>>> import pandas as pd
>>> import multidms
>>> func_score_data = {
...     'condition' : ["a","a","a","a", "b","b","b","b","b","b"],
...     'aa_substitutions' : [
...         'M1E', 'G3R', 'G3P', 'M1W', 'M1E',
...         'P3R', 'P3G', 'M1E P3G', 'M1E P3R', 'P2T'
...     ],
...     'func_score' : [2, -7, -0.5, 2.3, 1, -5, 0.4, 2.7, -2.7, 0.3],
... }
>>> func_score_df = pd.DataFrame(func_score_data)
>>> func_score_df  
condition aa_substitutions  func_score
0         a              M1E         2.0
1         a              G3R        -7.0
2         a              G3P        -0.5
3         a              M1W         2.3
4         b              M1E         1.0
5         b              P3R        -5.0
6         b              P3G         0.4
7         b          M1E P3G         2.7
8         b          M1E P3R        -2.7
9         b              P2T         0.3

Instantiate a Data object, allowing for stop codon variants and declaring condition “a” as the reference condition.

>>> data = multidms.Data(
...     func_score_df,
...     alphabet = multidms.AAS_WITHSTOP,
...     reference = "a",
... )  
...

Note this may take some time due to the string operations that must be performed when converting amino acid substitutions to be with respect to a reference wild type sequence.

After the object has finished being instantiated, we have access to a few ‘static’ properties of our data. See the individual property docstrings for more information.

>>> data.reference
'a'
>>> data.conditions
('a', 'b')
>>> data.mutations
('M1E', 'M1W', 'G3P', 'G3R')
>>> data.site_map  
a  b
1  M  M
3  G  P
>>> data.mutations_df  
  mutation wts  sites muts  times_seen_a  times_seen_b
0      M1E   M      1    E             1             3
1      M1W   M      1    W             1             0
2      G3P   G      3    P             1             4
3      G3R   G      3    R             1             2
>>> data.variants_df  
  condition aa_substitutions  func_score var_wrt_ref
0         a              M1E         2.0         M1E
1         a              G3R        -7.0         G3R
2         a              G3P        -0.5         G3P
3         a              M1W         2.3         M1W
4         b              M1E         1.0     G3P M1E
5         b              P3R        -5.0         G3R
6         b              P3G         0.4
7         b          M1E P3G         2.7         M1E
8         b          M1E P3R        -2.7     G3R M1E
property name: str

The name of the data object.

property conditions: tuple

A tuple of all conditions.

property reference: str

The name of the reference condition.

property reference_index: int

The index of the reference condition.

property mutations: tuple

A tuple of all mutations in the order relative to their index into the binarymap.

property mutations_df: DataFrame

A dataframe summarizing all single mutations.

property variants_df: DataFrame

A dataframe summarizing all variants in the training data.

property site_map: DataFrame

A dataframe indexed by site, with columns for all conditions giving the wild type amino acid at each site.

property non_identical_mutations: dict

A dictionary keyed by condition names with values being a string of all mutations that differ from the reference sequence.

property non_identical_sites: dict

A dictionary keyed by condition names with values being a pandas.DataFrame indexed by site, with columns for the reference and non-reference amino acid at each site that differs.

property bundle_idxs: dict

A dictionary keyed by condition names with values being the indices into the binarymap representing bundle (non-identical) mutations.

property reference_sequence_conditions: list

A list of conditions that have the same wild type sequence as the reference condition.

property training_data: dict

A dictionary with keys ‘X’ and ‘y’ for the training data.

property scaled_training_data: dict

A dictionary with keys ‘X’ and ‘y’ for the scaled training data.

property binarymaps: dict

A dictionary keyed by condition names with values being a BinaryMap object for each condition.

property targets: dict

The functional scores for each variant in the training data.

property mutparser: MutationParser

The polyclonal.utils.MutationParser object used to parse mutations.

property parse_mut: MutationParser

A function that splits a single amino acid substitution into wildtype, site, and mutation using the mutation parser.

property parse_muts: partial

A function that splits a string of one or more amino acid substitutions into wildtypes, sites, and mutations using the mutation parser.

property single_mut_encodings

A dictionary keyed by condition names with values being the one-hot encoding of all single mutations.

convert_subs_wrt_ref_seq(condition, aa_subs)

Convert amino acid substitutions to be with respect to the reference sequence.

Parameters:
  • condition (str) – The condition to which the amino acid substitutions are relative.

  • aa_subs (str) – A string of amino acid substitutions, relative to the condition sequence, to be converted.

Returns:

A string of amino acid substitutions relative to the reference sequence.

Return type:

str
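
For example, using the Data object constructed in the class example above (where reference condition 'a' has wild type G at site 3 and condition 'b' has P), substitutions observed in condition 'b' can be converted to reference-relative strings; the expected output here is taken from the var_wrt_ref column shown above:

>>> data.convert_subs_wrt_ref_seq("b", "M1E P3R")  
'G3R M1E'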

plot_times_seen_hist(saveas=None, show=True, **kwargs)

Plot a histogram of the number of times each mutation was seen.

plot_func_score_boxplot(saveas=None, show=True, **kwargs)

Plot a boxplot of the functional scores for each condition.

class multidms.Model(data: ~multidms.data.Data, epistatic_model=<function sigmoidal_global_epistasis>, output_activation=<function identity_activation>, PRNGKey=0, lower_bound=None, n_hidden_units=5, init_theta_scale=5.0, init_theta_bias=-5.0, init_beta_variance=0.0, name=None)

Bases: object

Represent one or more DMS experiments to obtain tuned parameters that provide insight into individual mutational effects and conditional shifts of those effects on all non-reference conditions. For more, see the biophysical model documentation.

Parameters:
  • data (multidms.Data) – A reference to the dataset which will define the parameters of the model to be fit.

  • epistatic_model (<class 'function'>) – A function which will transform the latent effects of mutations into a functional score. See the biophysical model documentation for more.

  • output_activation (<class 'function'>) – A function which will transform the output of the global epistasis function. Defaults to the identity function (no activation). See the biophysical model documentation

  • conditional_shifts (bool) – If True (default), initialize and fit the shift parameters for each non-reference condition. See the Model Description section for more. Defaults to True.

  • alpha_d (bool) – If True introduce a latent offset parameter for each condition. See the biophysical docs section for more. Defaults to True.

  • gamma_corrected (bool) – If True, introduce a ‘gamma’ parameter for each non-reference condition to account for differences between wild type behavior relative to its variants. This is essentially a bias added to the functional scores during fitting. See the Model Description section for more. Defaults to False.

  • PRNGKey (int) – The initial seed key for random parameters assigned to Betas and any other randomly initialized parameters.

  • init_beta_naught (float) – Initialize the latent offset parameter applied to all conditions. See the biophysical docs section for more.

  • init_theta_scale (float) – Initialize the scaling parameter \(\theta_{\text{scale}}\) of a two-parameter epistatic model (Sigmoid or Softplus).

  • init_theta_bias (float) – Initialize the bias parameter \(\theta_{\text{bias}}\) of a two parameter epistatic model (Sigmoid or Softplus).

  • init_beta_variance (float) – Beta parameters are initialized by sampling from a normal distribution. This parameter specifies the variance of the distribution being sampled.

  • n_hidden_units (int or None) – If using multidms.biophysical.nn_global_epistasis() as the epistatic model, this is the number of hidden units used in the transform.

  • lower_bound (float or None) – If using multidms.biophysical.softplus_activation() as the output activation, this is the lower bound of the softplus function.

  • name (str or None) – Name of the Model object. If None, will be assigned a unique name based upon the number of Model objects instantiated.

Example

To create a Model object, all you need is the respective Data object for parameter fitting.

>>> import multidms
>>> from tests.test_data import data
>>> model = multidms.Model(data)

Upon initialization, you will now have access to the underlying data and parameters.

>>> model.data.mutations
('M1E', 'M1W', 'G3P', 'G3R')
>>> model.data.conditions
('a', 'b')
>>> model.data.reference
'a'
>>> model.data.condition_colors
{'a': '#0072B2', 'b': '#CC79A7'}

The mutations_df and variants_df may of course also be accessed. First, we set pandas to display all rows and columns.

>>> import pandas as pd
>>> pd.set_option('display.max_rows', None)
>>> pd.set_option('display.max_columns', None)
>>> model.data.mutations_df  
  mutation wts  sites muts  times_seen_a  times_seen_b
0      M1E   M      1    E             1             3
1      M1W   M      1    W             1             0
2      G3P   G      3    P             1             4
3      G3R   G      3    R             1             2

However, if accessed directly through the Model object, you will get the same information, along with model/parameter specific features included. These are automatically updated each time you request the property.

>>> model.get_mutations_df()  
         wts  sites muts  times_seen_a  times_seen_b  beta_a  beta_b  shift_b  \
mutation
M1E        M      1    E             1             3     0.0     0.0      0.0
M1W        M      1    W             1             0     0.0    -0.0      0.0
G3P        G      3    P             1             4    -0.0    -0.0     -0.0
G3R        G      3    R             1             2    -0.0     0.0     -0.0

          predicted_func_score_a  predicted_func_score_b
mutation
M1E                          0.0                     0.0
M1W                          0.0                     0.0
G3P                          0.0                     0.0
G3R                          0.0                     0.0

Notice that the respective single mutation effects (beta_d), conditional shifts (shift_d), and predicted functional scores (predicted_func_score_d) of each mutation in the model are now easily accessible. Similarly, we can take a look at the variants_df for the model:

>>> model.get_variants_df()  
   condition aa_substitutions  func_score var_wrt_ref  predicted_latent  \
0         a              M1E         2.0         M1E               0.0
1         a              G3R        -7.0         G3R               0.0
2         a              G3P        -0.5         G3P               0.0
3         a              M1W         2.3         M1W               0.0
4         b              M1E         1.0     G3P M1E               0.0
5         b              P3R        -5.0         G3R               0.0
6         b              P3G         0.4                           0.0
7         b          M1E P3G         2.7         M1E               0.0
8         b          M1E P3R        -2.7     G3R M1E               0.0
   predicted_func_score
0                   0.0
1                   0.0
2                   0.0
3                   0.0
4                   0.0
5                   0.0
6                   0.0
7                   0.0
8                   0.0

We now have access to the predicted (and gamma corrected) functional scores given the model’s current parameters.

So far, these parameters, and the predictions resulting from them, have not been tuned to the dataset. Let’s take a look at the loss on the training dataset given our initialized parameters:

>>> model.loss
2.9370000000000003

Next, we fit the model with some chosen hyperparameters.

>>> model.fit(maxiter=10, lasso_shift=1e-5, warn_unconverged=False)
>>> model.loss
0.3483478119356665

The model tunes its parameters in place, and the subsequent call to retrieve the loss reflects our model’s loss given its updated parameters.

property name: str

The name of the model object.

property state: dict

The current state of the model.

property converged: bool

Whether the model tolerance threshold was passed on last fit.

property data: Data

multidms.Data Object this model references for fitting its parameters.

property model_components: frozendict

A frozendict which holds the individual components of the model as well as the objective and forward functions.

property convergence_trajectory_df

The state.error through each training iteration. Currently, this is reset each time the fit() method is called.

property params: dict

A copy of all current model parameters.

property loss: float

Compute the model loss on all experimental training data, without ridge or lasso penalties included.

property conditional_loss: float

Compute un-penalized loss individually for each condition.

property wildtype_df

Get a dataframe indexed by condition wildtype containing the prediction features for each.

get_variants_df(phenotype_as_effect=True)

Training data with model predictions for the latent and functional score phenotypes.

Parameters:

phenotype_as_effect (bool) – If True, phenotypes (both latent and functional score) are calculated as the difference between the predicted phenotype of a given variant and the respective experimental wildtype prediction. Otherwise, report the unmodified model prediction.

Returns:

A copy of the training data, self.data.variants_df, with the phenotypes added. Phenotypes are predicted based on the current state of the model.

Return type:

pandas.DataFrame

get_mutations_df(times_seen_threshold=0, phenotype_as_effect=True, return_split=True)

Mutation attributes and phenotypic effects based on the current state of the model.

Parameters:
  • times_seen_threshold (int, optional) – Only report mutations that have been seen at least this many times in each condition. Defaults to 0.

  • phenotype_as_effect (bool, optional) – if True, phenotypes are reported as the difference from the conditional wildtype prediction. Otherwise, report the unmodified model prediction.

  • return_split (bool, optional) – If True, return the split mutations as separate columns: ‘wts’, ‘sites’, and ‘muts’. Defaults to True.

Returns:

A copy of the mutations data, self.data.mutations_df, with the mutation column set as the index, and columns with the mutational attributes (e.g. betas, shifts) and conditional functional score effects added.

The columns are ordered as follows:

  • beta_a, beta_b, … : the latent effect of the mutation

  • shift_b, shift_c, … : the conditional shift of the mutation

  • predicted_func_score_a, predicted_func_score_b, … : the predicted functional score of the mutation

Return type:

pandas.DataFrame

get_df_loss(df, error_if_unknown=False, verbose=False, conditional=False)

Get the loss of the model on a given data frame.

Parameters:
  • df (pandas.DataFrame) – Data frame containing variants. Requirements are the same as those used to initialize the multidms.Data object - except the indices must be unique.

  • error_if_unknown (bool) – If some of the substitutions in a variant are not present in the model (not in AbstractEpistasis.binarymap) then by default we do not include those variants in the loss calculation. If True, raise an error.

  • verbose (bool) – If True, print the number of valid and invalid variants.

  • conditional (bool) – If True, return the loss for each condition as a dictionary. If False, return the total loss.

Returns:

The loss of the model on the given data frame.

Return type:

float or dict
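
As a brief sketch (assuming the model and func_score_df objects from the Model class example above), the loss for a held-out or training dataframe can be computed in total or per condition:

>>> total_loss = model.get_df_loss(func_score_df)
>>> per_condition_loss = model.get_df_loss(func_score_df, conditional=True)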

add_phenotypes_to_df(df, substitutions_col='aa_substitutions', condition_col='condition', latent_phenotype_col='predicted_latent', observed_phenotype_col='predicted_func_score', converted_substitutions_col='aa_subs_wrt_ref', overwrite_cols=False, unknown_as_nan=False, phenotype_as_effect=True)

Add predicted phenotypes to data frame of variants.

Parameters:
  • df (pandas.DataFrame) – Data frame containing variants. Requirements are the same as those used to initialize the multidms.Data object - except the indices must be unique.

  • substitutions_col (str) – Column in df giving variants as substitution strings with respect to a given variants condition. These will be converted to be with respect to the reference sequence prior to prediction. Defaults to ‘aa_substitutions’.

  • condition_col (str) – Column in df giving the condition from which a variant was observed. Values must exist in self.data.conditions; an error will be raised otherwise. Defaults to ‘condition’.

  • latent_phenotype_col (str) – Column added to df containing predicted latent phenotypes.

  • observed_phenotype_col (str) – Column added to df containing predicted observed phenotypes.

  • converted_substitutions_col (str or None) – Column added to df containing converted substitution strings for non-reference conditions that do not share a wildtype sequence with the reference.

  • overwrite_cols (bool) – If the specified latent or observed phenotype columns already exist in df, overwrite them; if False, raise an error.

  • unknown_as_nan (bool) – If some of the substitutions in a variant are not present in the model (not in AbstractEpistasis.binarymap) set the phenotypes to nan (not a number)? If False, raise an error.

  • phenotype_as_effect (bool) – If True, phenotypes (both latent and functional score) are calculated as the difference between the predicted phenotype of a given variant and the respective experimental wildtype prediction. Otherwise, report the unmodified model prediction.

Returns:

A copy of df with the phenotypes added. Phenotypes are predicted based on the current state of the model.

Return type:

pandas.DataFrame
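
A minimal usage sketch, again assuming the model and func_score_df objects from the Model class example above; the column names noted in the comments are the documented defaults:

>>> predicted_df = model.add_phenotypes_to_df(func_score_df)
>>> # predicted_df now carries the added columns 'predicted_latent',
>>> # 'predicted_func_score', and (for non-reference conditions) 'aa_subs_wrt_ref'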

mutation_site_summary_df(agg_func='mean', **kwargs)

Get all single mutational attributes from self._data updated with all model specific attributes, then aggregate all numerical columns by “sites”.

Parameters:
  • agg_func (str) – Aggregation function to use on the numerical columns. Defaults to “mean”.

  • **kwargs – Additional keyword arguments to pass to get_mutations_df.

Returns:

A summary of the mutation attributes aggregated by site.

Return type:

pandas.DataFrame

get_condition_params(condition=None)

Get the relevant parameters for a model prediction.

phenotype_fromsubs(aa_subs, condition=None)

Take a single string of substitutions which are not already converted with respect to the reference, convert them, then make a functional score prediction and return the result.

latent_fromsubs(aa_subs, condition=None)

Take a single string of substitutions which are not already converted with respect to the reference, convert them, then make a latent prediction and return the result.

phenotype_frombinary(X, condition=None)

Condition specific functional score prediction on X using the biophysical model given current model parameters.

Parameters:
  • X (jnp.array) – Binary encoded variants to make predictions on.

  • condition (str) – Condition to make predictions for. If None, use the reference condition.

latent_frombinary(X, condition=None)

Condition specific latent phenotype prediction on X using the biophysical model given current model parameters.

Parameters:
  • X (jnp.array) – Binary encoded variants to make predictions on.

  • condition (str) – Condition to make predictions for. If None, use the reference condition.

fit(scale_coeff_lasso_shift=1e-05, tol=0.0001, maxiter=1000, maxls=15, acceleration=True, lock_params={}, admm_niter=50, admm_tau=1.0, warn_unconverged=True, upper_bound_ge_scale='infer', convergence_trajectory_resolution=10, **kwargs)

Use jaxopt.ProximalGradient to optimize the model’s free parameters.

Parameters:
  • scale_coeff_lasso_shift (float) – L1 penalty coefficient applied to the shift parameters in beta_d. Defaults to 1e-5. This parameter is used to regularize the shift parameters in the model if there is more than one condition.

  • tol (float) – Tolerance for the optimization convergence criteria. Defaults to 1e-4.

  • maxiter (int) – Maximum number of iterations for the optimization. Defaults to 1000.

  • maxls (int) – Maximum number of iterations to perform during line search.

  • acceleration (bool) – If True, use FISTA acceleration. Defaults to True.

  • lock_params (dict) – Dictionary of parameters, and desired value to constrain them at during optimization. By default, no parameters are locked.

  • admm_niter (int) – Number of iterations to perform during the ADMM optimization. Defaults to 50. Note that in the case of single-condition models, this is set to zero, as the generalized lasso ADMM optimization is not used.

  • admm_tau (float) – ADMM step size. Defaults to 1.0.

  • warn_unconverged (bool) – If True, raise a warning if the optimization does not converge. Convergence is defined by whether the tolerance (tol) threshold was passed during the optimization process. Defaults to True.

  • upper_bound_ge_scale (float, None, or 'infer') – The positive upper bound of the theta scale parameter - negative values are not allowed. Passing None allows the scale of the sigmoid to be unconstrained. Passing the string literal ‘infer’ results in the scale being set to double the range of the training data. Defaults to ‘infer’.

  • convergence_trajectory_resolution (int) – The resolution of the loss and error trajectory recorded during optimization. Defaults to 10.

  • **kwargs (dict) – Additional keyword arguments passed to the objective function. See the multidms.biophysical.smooth_objective docstring for details on the other hyperparameters that may be supplied to regularize and otherwise modify the objective function being optimized.
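
For reference, a minimal fitting call sketching a few of the documented hyperparameters (the values shown are illustrative, not recommendations):

>>> model.fit(
...     scale_coeff_lasso_shift=1e-5,
...     tol=1e-4,
...     maxiter=1000,
...     warn_unconverged=False,
... )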

plot_pred_accuracy(hue=True, show=True, saveas=None, annotate_corr=True, ax=None, r=2, **kwargs)

Create a figure visualizing the correlation between the model-predicted functional scores of all variants in the training data and the ground-truth measurements.

plot_epistasis(hue=True, show=True, saveas=None, ax=None, sample=1.0, **kwargs)

Plot latent predictions against gamma corrected ground truth measurements of all samples in the training set.

plot_param_hist(param, show=True, saveas=False, times_seen_threshold=0, ax=None, **kwargs)

Plot the histogram of a parameter.

plot_param_heatmap(param, show=True, saveas=False, times_seen_threshold=0, ax=None, **kwargs)

Plot the heatmap of a parameter associated with specific sites and substitutions.

plot_shifts_by_site(condition, show=True, saveas=False, times_seen_threshold=0, agg_func='mean', ax=None, **kwargs)

Summarize shift parameter values by associated sites and conditions.

mut_param_heatmap(mut_param='shift', times_seen_threshold=0, phenotype_as_effect=True, **line_and_heat_kwargs)

Wrapper method for visualizing the shift plot. See multidms.plot.mut_shift_plot() for more.

class multidms.ModelCollection(fit_models)

Bases: object

A class for the comparison and visualization of multiple multidms.Model fits. The respective collection of training datasets for each fit must share the same reference sequence and conditions. Additionally, the inferred site maps must agree upon condition wildtypes for all shared sites.

The utility function multidms.model_collection.fit_models is used to fit the collection of models, and the resulting dataframe is passed to the constructor of this class.

Parameters:

fit_models (pandas.DataFrame) – A dataframe containing the fit attributes and pickled model objects as returned by multidms.model_collection.fit_models.

property site_map_union: DataFrame

The union of all site maps of all datasets used for fitting.

property conditions: list

The conditions (shared by each fitting dataset) used for fitting.

property reference: str

The reference condition (shared by each fitting dataset) used for fitting.

property shared_mutations: tuple

The mutations shared by each fitting dataset.

property all_mutations: tuple

The union of all mutations across the fitting datasets.

split_apply_combine_muts(groupby=('dataset_name', 'scale_coeff_lasso_shift'), aggregate_func='mean', inner_merge_dataset_muts=True, query=None, **kwargs)

Wrapper to split-apply-combine the set of mutational dataframes harbored by each of the fits in the collection.

Here, we group the collection of fits using attributes (columns in ModelCollection.fit_models) specified using the groupby parameter. Each of the individual fits within a group may then be filtered via **kwargs and aggregated via aggregate_func, before the function stacks all the groups back together into a tall-style dataframe. The resulting dataframe will have a multiindex with the mutation and the groupby attributes.

Parameters:
  • groupby (str or tuple of str or None, optional) – The attributes to group the fits by. If None, then group by all attributes except for the model, data, and step_loss attributes. The default is (“dataset_name”, “scale_coeff_lasso_shift”).

  • aggregate_func (str or callable, optional) – The function to aggregate the mutational dataframes within each group. The default is “mean”.

  • inner_merge_dataset_muts (bool, optional) – Whether to toss mutations which are _not_ shared across all datasets before aggregation of group mutation parameter values. The default is True.

  • query (str, optional) – The pandas query to apply to the ModelCollection.fit_models dataframe before splitting. The default is None.

  • **kwargs (dict) – Keyword arguments to pass to the multidms.Model.get_mutations_df() method (“phenotype_as_effect”, and “times_seen_threshold”) see the method docstring for details.

Returns:

A dataframe containing the aggregated mutational parameter values.

Return type:

pandas.DataFrame
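
A minimal sketch, assuming mc is a ModelCollection built from a multidms.fit_models() dataframe (see below); the keyword arguments shown are the documented defaults plus a times_seen filter passed through to get_mutations_df():

>>> agg_muts_df = mc.split_apply_combine_muts(
...     groupby=("dataset_name", "scale_coeff_lasso_shift"),
...     aggregate_func="mean",
...     times_seen_threshold=1,
... )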

add_validation_loss(test_data, overwrite=False)

Add validation loss to the fit collection dataframe.

Parameters:
  • test_data (pd.DataFrame or dict(str, pd.DataFrame)) – The testing dataframe to compute validation loss with respect to; must have columns “aa_substitutions”, “condition”, and “func_score”. If a dictionary is passed, there should be a key for each unique dataset_name factor in the self.fit_models dataframe, with the value being the respective testing dataframe.

  • overwrite (bool, optional) – Whether to overwrite the validation_loss column if it already exists. The default is False.

Returns:

The self.fit_models dataframe with the validation loss added.

Return type:

pd.DataFrame
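
A hedged sketch, where test_df is a hypothetical held-out dataframe with the required ‘aa_substitutions’, ‘condition’, and ‘func_score’ columns:

>>> fit_models_with_loss = mc.add_validation_loss(test_df)
>>> # or, keyed by dataset_name when the collection spans multiple datasets
>>> # (the dataset names below are placeholders):
>>> # mc.add_validation_loss({"replicate_1": test_df_1, "replicate_2": test_df_2})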

get_conditional_loss_df(query=None)

Return a long-form dataframe with columns “dataset_name”, “scale_coeff_lasso_shift”, “split” (“training” or “validation”), “loss” (the loss value), and “condition”.

Parameters:

query (str, optional) – The query to apply to the fit_models dataframe before formatting the loss dataframe. The default is None.

convergence_trajectory_df(query=None, id_vars=('dataset_name', 'scale_coeff_lasso_shift'))

Combine the convergence trajectory dataframes of all fits in the queried collection.

mut_param_heatmap(query=None, mut_param='shift', aggregate_func='mean', inner_merge_dataset_muts=True, times_seen_threshold=0, phenotype_as_effect=True, **kwargs)

Create a lineplot and heatmap altair chart across replicate datasets. This function optionally applies a given pandas.query on the fit_models dataframe that should result in a subset of fits which make sense to aggregate mutational data across, e.g. replicate datasets. It then computes the mean or median mutational parameter value (“beta”, “shift”, or “predicted_func_score”) across the remaining fits and creates an interactive altair chart.

Note that this will throw an error if the queried fits have more than one unique hyper-parameter besides “dataset_name”.

Parameters:
  • query (str) – The query to apply to the fit_models dataframe. This should be used to subset the fits to only those which make sense to aggregate mutational data across, e.g. replicate datasets. For example, if you have a collection of fits with different epistatic models, you may want to query for only those fits with the same epistatic model. e.g. query=”epistatic_model == ‘Sigmoid’”. For more on the query syntax, see the pandas.query documentation.

  • mut_param (str, optional) – The mutational parameter to plot. The default is “shift”. Must be one of “shift”, “predicted_func_score”, or “beta”.

  • aggregate_func (str, optional) – The function to aggregate the mutational parameter values between dataset fits. The default is “mean”.

  • inner_merge_dataset_muts (bool, optional) – Whether to toss mutations which are _not_ shared across all datasets before aggregation of group mutation parameter values. The default is True.

  • times_seen_threshold (int, optional) – The minimum number of times a mutation must be seen across all conditions within a single fit to be included in the aggregation. The default is 0.

  • phenotype_as_effect (bool, optional) – Passed to Model.get_mutations_df(), Only applies if mut_param=”predicted_func_score”.

  • **kwargs (dict) – Keyword arguments to pass to multidms.plot._lineplot_and_heatmap().

Returns:

A chart object which can be displayed in a jupyter notebook or saved to a file.

Return type:

altair.Chart
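
For instance (a sketch; the query string is the illustrative one from the description above, and mc is a ModelCollection as before):

>>> chart = mc.mut_param_heatmap(
...     query="epistatic_model == 'Sigmoid'",
...     mut_param="shift",
...     aggregate_func="mean",
...     times_seen_threshold=1,
... )
>>> # chart is an altair chart; display it in a notebook or save it to a file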

mut_param_traceplot(mutations, mut_param='shift', x='scale_coeff_lasso_shift', width_scalar=100, height_scalar=100, **kwargs)

Visualize mutation parameter values across the lasso penalty weights (by default) of a given subset of the mutations in the form of an altair.FacetChart. This is useful when you would like to confirm that a reported mutational parameter value carries through across the individual fits.

Returns:

A chart object which can be displayed in a jupyter notebook or saved to a file.

Return type:

altair.Chart

shift_sparsity(x='scale_coeff_lasso_shift', width_scalar=100, height_scalar=100, return_data=False, **kwargs)

Visualize shift parameter set sparsity across the lasso penalty weights (by default) in the form of an altair.FacetChart. We group the mutations according to their status as either a “stop” (e.g. A15*) or “nonsynonymous” (e.g. A15G) mutation before calculating the sparsity. This is because, in a sense, mutations to stop codons act as a false-positive rate: we expect their mutational effects to be equally deleterious in all experiments, and thus to have shift parameter values of zero.

Returns:

A chart object which can be displayed in a jupyter notebook or saved to a file. If return_data=True, then a tuple containing the chart and the underlying data will be returned.

Return type:

altair.Chart or Tuple(pd.DataFrame, altair.Chart)

mut_param_dataset_correlation(x='scale_coeff_lasso_shift', width_scalar=200, height=200, return_data=False, r=2, **kwargs)

Visualize the correlation between replicate datasets across the lasso penalty weights (by default) in the form of an altair.FacetChart. We compute the correlation of mutation parameters across each pair of datasets in the collection.

Parameters:
  • x (str, optional) – The parameter to plot on the x-axis. The default is “scale_coeff_lasso_shift”.

  • width_scalar (int, optional) – A scaling factor for the chart width. The default is 200.

  • height (int, optional) – The height of the chart. The default is 200.

  • return_data (bool, optional) – Whether to return the underlying data. The default is False.

  • r (int, optional) – The power to which the correlation coefficient is raised. May be either 1 for the Pearson correlation coefficient, or 2 for the coefficient of determination (r-squared). The default is 2.

  • **kwargs (dict) – The keyword arguments to pass to the multidms.model_collection.ModelCollection.split_apply_combine_muts() method. See the method docstring for details.

Returns:

A chart object which can be displayed in a jupyter notebook or saved to a file. If return_data=True, then a tuple containing the chart and the underlying data will be returned.

Return type:

altair.Chart or Tuple(altair.Chart, pd.DataFrame)

multidms.fit_models(params, n_threads=-1, failures='error')

Fit collection of multidms.model.Model models.

Enables fitting of multiple models simultaneously using multiple threads. Most commonly, this function is used to fit a set of models across combinations of replicate training datasets, and lasso coefficients for model selection and evaluation. The returned dataframe is meant to be passed into the multidms.model_collection.ModelCollection class for comparison and visualization.

Parameters:
  • params (dict) – Dictionary which defines the parameter space of all models you wish to run. Each value in the dictionary must be a list of values, even in the case of singletons. This function will compute all combinations of the parameter space and pass each combination to multidms.model_collection.fit_one_model() to be run in parallel; thus only key-value pairs which match its kwargs are allowed. See the docstring of multidms.model_collection.fit_one_model() for a description of the allowed parameters.

  • n_threads (int) – Number of threads (CPUs, cores) to use for fitting. Set to -1 to use all CPUs available.

  • failures ({"error", "tolerate"}) – What if fitting fails for a model? If “error” then raise an error, if “ignore” then just return None for models that failed optimization.

Returns:

Number of models that fit successfully, number of models that failed, and a dataframe which contains a row for each of the multidms.Model object references along with the parameters each was fit with for convenience. The dataframe is ultimately meant to be passed into the ModelCollection class for comparison and visualization.

Return type:

(n_fit, n_failed, fit_models)
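
A hedged end-to-end sketch tying fit_models to ModelCollection. The exact keys accepted in params are defined by fit_one_model() (the 'dataset' key below is an assumption; consult that docstring), while 'scale_coeff_lasso_shift' corresponds to the Model.fit() argument of the same name:

>>> import multidms
>>> params = {
...     "dataset": [data],  # assumed key name; see fit_one_model() for the exact kwargs
...     "scale_coeff_lasso_shift": [0.0, 1e-5, 1e-4],
... }
>>> n_fit, n_failed, fit_models_df = multidms.fit_models(params, n_threads=-1)
>>> mc = multidms.ModelCollection(fit_models_df)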

Submodules