Biophysical model

The goal of multidms is to jointly infer mutational effects across multiple deep mutational scanning (DMS) experiments, including how much each mutation’s effect differs between experiments. We refer to these differences as shifts. If the experiments were performed with different homologs of the same protein, a shift would indicate epistasis between the shifted mutation and the amino-acid mutations that separate the homologs. Or, if they were performed with the same wildtype protein under different selective conditions (e.g., selection for viral entry using different but related host receptors), shifts would indicate condition-specific effects. Ultimately, this model was designed to identify those shifts relative to a user-chosen reference condition, with feature selection via lasso (\(L_1\)) regularization [1] of the shift parameters described in the additive latent phenotype section below.

multidms is compatible with DMS data that have the following characteristics. First, the data must report a functional score (e.g., log enrichment ratio) for each variant from each DMS experiment, where a variant constitutes a unique gene sequence covering the entire mutagenized region. Thus, the deep-sequencing data must resolve complete haplotypes. Second, most mutations must be seen in multiple unique variants per experiment, as this number is the basis by which shifts are regularized (see below). This second requirement is often met in DMS libraries with multiple mutations per variant, as long as each mutation occurs in multiple genetic backgrounds, or in libraries with one mutation per variant, as long as each variant is uniquely barcoded and individual mutations are found in the background of multiple barcodes. Given input data from one or more experiments, multidms fits to all the data and provides: (1) mutational effects \(\beta\) across all experiments, (2) \(\Delta\), how effects are shifted between experiments, and (3) a model to predict the functional score of any given variant.

Note: We suggest reading the Otwinowski et al. 2018 [2] paper to understand the approach for modeling global epistasis before reading the rest of the documentation.

Joint model composition

At a high level, the model is a composition of three functions which describe the expected biophysical interactions underlying a given phenotype: (1) \(\phi\), an additive model describing a variant’s latent phenotype under a given condition, (2) \(g\), a global epistasis model shared by all conditions to disentangle the effects of multiple mutations on the same variant, and (3) \(t\), an optional final output activation function which can account for variant functional scores that were clipped at some lower bound prior to using multidms.

Concretely, the predicted phenotype for a given variant \(v\) under condition \(d\) is given by

\[\hat{y}_{v, d} = t_{\gamma_{d}}(g_{\theta}(\phi_{d, \beta, \Delta}(v)))\]

Where \(\gamma\), \(\theta\), \(\beta\), and \(\Delta\) are free parameters inferred from experimental observations during the fitting process. We describe the individual components and their associated parameters in more detail below.
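The composition can be sketched in plain Python. This is a toy illustration with made-up parameter values, not the package’s implementation; in particular, a simple hard clip stands in here for the optional activation \(t\):

```python
import numpy as np

def phi(variant, beta0, beta, delta_d):
    # additive latent phenotype: wildtype offset plus the summed
    # effect + shift of each mutation in the variant
    return beta0 + sum(beta[m] + delta_d[m] for m in variant)

def g(z, theta_scale, theta_bias):
    # sigmoidal global epistasis shared by all conditions
    return theta_scale / (1.0 + np.exp(-z)) + theta_bias

def t(z, lower_bound=None):
    # optional output activation; identity when no clipping is modeled
    return z if lower_bound is None else max(z, lower_bound)

# toy example: a two-mutation variant in a non-reference condition
beta = {"M1A": 1.0, "G2C": -0.5}   # shared mutation effects
delta = {"M1A": 0.0, "G2C": 0.25}  # shifts for this condition
z = phi(["M1A", "G2C"], beta0=0.0, beta=beta, delta_d=delta)  # 0.75
y_hat = t(g(z, theta_scale=4.0, theta_bias=-2.0))             # ~0.72
```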

The motivation behind defining an abstract model in terms of its components is (1) modularity for method testing and development, and (2) the ability to provide multiple options for model components before jit-compilation and fitting of the specified model architecture. While there is only a single option for the latent prediction, \(\phi\), we offer a few options for post-latent modeling, \(g\) and \(t\), that encompass the needs of differing research goals and experimental techniques. Generally speaking, the package defaults for these components should be sufficient for most purposes, in which case you can ignore the multidms.biophysical module altogether, as this functionality is hidden unless explicitly specified during the instantiation of a Model object.

For more about how multidms model components are defined and composed, as well as how the models work with the JAX infrastructure to fit parameters, we suggest you take a look at the multidms package biophysical docs.

Latent phenotype

For each mutation \(m\), the model defines a single mutation effect parameter, \(\beta_{m}\), shared by all conditions. Additionally, for each non-reference experiment \(d\) (the letter \(d\) is used as a mnemonic for DMS), the model defines a set of mutational shift parameters, \(\Delta_{d,m}\), that quantify the shift in a given mutation’s effect relative to the reference condition. Each mutation, for each non-reference condition, is associated with an independent shift parameter. For example, consider three total experimental conditions, \(d \in \{1, 2, 3\}\), where \(d = 1\) is the reference condition. The model then defines two sets of (non-reference) shift parameters, \(\Delta_{2, m}\) and \(\Delta_{3, m}\), that may be fit to non-zero values.

Concretely, the latent phenotype of any variant, \(v\), from the experimental condition, \(d\), is computed like so:

\[\phi_d(v) = \beta_0 + \alpha_d + \sum_{m \in v} (\beta_{m} + \Delta_{d, m})\]

Where:

  • \(\beta_0\) is a wildtype offset parameter applied to the latent prediction of all experiments.

  • \(\alpha_d\) is a set of bias parameters, optionally applied to the latent phenotype of all non-reference experiments. These parameters may capture the effect of the bundle of non-identical mutations which may exist between experiments.

  • \(\beta_{m}\) is the latent phenotypic effect of mutation \(m\), shared by all experiments (see the note below).

  • \(\Delta_{d, m}\) is the shift of the effect of mutation \(m\) in condition \(d\).

  • \(v\) is the set of all mutations relative to the reference wildtype sequence, including all potential non-identical wildtype mutations that separate condition \(d\) from the reference condition.

If a non-reference experiment has a different wildtype sequence from the reference (e.g., the wildtype sequences are homologs), then the multidms.Data object will encode non-reference genotypes relative to the reference wildtype. Thus, the summation term includes all mutations at non-identical sites that convert between the two sequences. Likewise, if a mutation occurs at a non-identical site, then the mutation is encoded relative to the reference. For example, consider a protein where site \(30\) is a Y in the non-reference experiment’s wildtype sequence and an A in the reference experiment’s wildtype sequence. If a variant from the non-reference experiment had a Y30G mutation, then the mutation would be defined as A30G in the summation term. This does not assume that Y30G has the same effect as A30G; it merely follows the strategy of defining all sequences relative to the reference, which is practical because it ensures that each experiment informs the exact same set of \(\beta_m\) and \(\Delta_{d,m}\) parameters.

If a variant from a non-reference experiment had a reversion mutation at a non-identical site (e.g., Y30A), then the mutation would not be included in the summation term for that variant, since the variant would have the reference amino-acid identity at that site. In comparison, A30Y would be included in the summation term of all variants from the non-reference experiment that lack a mutation at site 30.

If mutations at non-identical sites are not sampled in the DMS libraries (e.g., Y30A is missing from the non-reference experiment and A30Y is missing from the reference experiment), then \(\alpha_d\) can be used to capture the combined effects of these missing mutations. Otherwise, if all such mutations are sampled, \(\alpha_d\) can be locked at zero.
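The site-30 example can be sketched as a small re-encoding helper. This is a hypothetical illustration of the encoding logic, not the actual multidms.Data implementation:

```python
def encode_relative_to_reference(variant_muts, ref_wt, nonref_wt):
    """Encode a non-reference variant's mutations relative to the reference.

    variant_muts: {site: mutant amino acid} observed in the variant.
    ref_wt / nonref_wt: {site: wildtype amino acid} for each condition.
    """
    encoded = set()
    for site, ref_aa in ref_wt.items():
        # amino acid this variant actually carries at the site
        aa = variant_muts.get(site, nonref_wt[site])
        # reversions to the reference identity drop out here
        if aa != ref_aa:
            encoded.add(f"{ref_aa}{site}{aa}")
    return encoded

# site 30 is A in the reference wildtype, Y in the non-reference wildtype
ref_wt, nonref_wt = {30: "A"}, {30: "Y"}
encode_relative_to_reference({30: "G"}, ref_wt, nonref_wt)  # {'A30G'}
encode_relative_to_reference({30: "A"}, ref_wt, nonref_wt)  # set(): reversion
encode_relative_to_reference({}, ref_wt, nonref_wt)         # {'A30Y'}
```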

One aspect of this approach worth noting is that none of the experiment wildtype predictions are guaranteed to be zero-centered, due to the shift and offset parameters. To correct for this, multidms offers the ability (in its predictive functions) to report both functional scores and functional score effect values, where the effect is the difference between a given variant’s predicted functional score and that of the wildtype, \(\hat{y}^{\text{effect}}_{v, d} = \hat{y}_{v, d} - \hat{y}_{\text{WT}, d}\). The API documentation specifies how to achieve the desired behavior.

Global epistasis

Latent phenotypes as described above give rise to functional scores according to a global-epistasis function. If this function is non-linear, then the model allows mutations to non-additively affect functional scores, helping to account for global epistasis. This type of model is useful for inferring the effects of individual mutations in variants with more than one mutation. Below, we describe the available options for modeling global epistasis in the multidms infrastructure.

Note: when analyzing DMS libraries in which variants have a maximum of one mutation, it will not be possible for the model to learn the shape of global epistasis. For this case, we provide an identity global-epistasis function (described below) that assumes no global epistasis but still allows the user to take advantage of the rest of the multidms approach.

Sigmoidal model (default):

By default, the global-epistasis function assumes a sigmoidal relationship between a protein’s latent phenotype and its functional score measured in the experiment (e.g., log enrichment score). Using free parameters, the sigmoid can flexibly conform to an optimal shape informed by the data. Note that this function is independent of the experimental condition from which a variant is observed.

Given latent phenotype, \(\phi_d(v) = z\), let

\[g(z) = \frac{\theta_{\text{scale}}}{1 + e^{-z}} + \theta_{\text{bias}}\]

Where \(\theta_{\text{scale}}\) and \(\theta_{\text{bias}}\) are free parameters defining the range and lower bound of the sigmoid, respectively.
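A quick numerical check of the sigmoid’s behavior (a minimal numpy sketch, not the package’s implementation): the function saturates at \(\theta_{\text{bias}}\) as \(z \to -\infty\) and at \(\theta_{\text{bias}} + \theta_{\text{scale}}\) as \(z \to +\infty\).

```python
import numpy as np

def g(z, theta_scale, theta_bias):
    # sigmoidal global epistasis: range (theta_bias, theta_bias + theta_scale)
    return theta_scale / (1.0 + np.exp(-z)) + theta_bias

g(-20.0, 5.0, -3.0)  # ~ -3.0  (lower asymptote: theta_bias)
g(0.0, 5.0, -3.0)    # -0.5    (midpoint: theta_bias + theta_scale / 2)
g(20.0, 5.0, -3.0)   # ~ 2.0   (upper asymptote: theta_bias + theta_scale)
```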

Below is an interactive plot showing the effect of the sigmoidal global epistasis as a function of an adjustable \(\theta_{\text{scale}}\), and \(\theta_{\text{bias}}\):

import altair as alt
import numpy
import pandas as pd

df = pd.DataFrame({"latent": numpy.linspace(-5, 5, 50)})

slider_s = alt.binding_range(min=0.1, max=10)
var_s = alt.param(bind=slider_s, value=5, name="theta_scale")

slider_b = alt.binding_range(min=-10, max=5)
var_b = alt.param(bind=slider_b, value=0, name="theta_bias")

(
    alt.Chart(df)
    .transform_calculate(
        phenotype=(1 / (1 + alt.expr.exp(-1*alt.datum['latent'])))
        * var_s
        + var_b
    )
    .encode(
        x=alt.X("latent", title="latent phenotype", scale=alt.Scale(domain=[-5, 5])),
        y=alt.Y("phenotype:Q", title="predicted phenotype", scale=alt.Scale(domain=[-5, 5]))
    )
    .mark_line()
    .add_params(var_s, var_b)
)


Softplus model

This function is a log-transformed version of the sigmoid function above, with \(\theta_\text{scale}\) and \(\theta_\text{bias}\) parameters serving similar roles. The shape of this function mimics the Hill equation from the example above if the y-axis is instead the log of the fraction of protein molecules that are folded. Such a function can be more appropriate for modeling functional scores in log space. Note that this function has no natural lower bound, so it is recommended to pair this model with a user-defined lower bound, as described in the Optional truncation of predicted functional scores section below.

Given latent phenotype, \(\phi_d(v) = z\), let

\[g(z) = -\theta_\text{scale}\log\left(1+e^{-z}\right) + \theta_\text{bias}\]
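A numerical sketch of this function’s bounds (assuming numpy; not the package’s implementation): it approaches \(\theta_\text{bias}\) from below as \(z\) grows, and decreases roughly linearly, with no lower bound, as \(z\) decreases.

```python
import numpy as np

def g_softplus(z, theta_scale, theta_bias):
    # np.log1p(np.exp(-z)) is a numerically stable log(1 + e^{-z})
    return -theta_scale * np.log1p(np.exp(-z)) + theta_bias

g_softplus(20.0, 1.0, 2.0)   # ~ 2.0 (upper bound: theta_bias)
g_softplus(-10.0, 1.0, 2.0)  # ~ -8.0 (roughly theta_scale * z + theta_bias)
```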

Note: In the multidms API, this behavior is accomplished by setting the output_activation parameter in the constructor for Model to a pointer to the function multidms.biophysical.softplus_activation.

Below is an interactive plot showing the effect of the modified softplus function as a function of an adjustable \(\theta_\text{scale}\) scaling parameter, and lower bound, \(\theta_\text{bias}\):

import altair as alt
import numpy
import pandas as pd

df = pd.DataFrame({"latent": numpy.linspace(-5, 5, 50)})

slider_lsp = alt.binding_range(min=1, max=5)
alpha_scale = alt.param(bind=slider_lsp, value=1, name="lambda_sp")

slider_lb = alt.binding_range(min=0, max=5)
alpha_bias = alt.param(bind=slider_lb, value=2, name="lower_bound")

(
    alt.Chart(df)
    .transform_calculate(
        phenotype=alt.expr.log(1 + alt.expr.exp(-1*alt.datum['latent']))
        * (-1 * alpha_scale)
        + alpha_bias
    )
    .encode(
        x=alt.X("latent", title="global epistasis prediction (z')", scale=alt.Scale(domain=[-5, 5])),
        y=alt.Y("phenotype:Q", title="predicted phenotype", scale=alt.Scale(domain=[-5, 5]))
    )
    .mark_line()
    .add_params(alpha_scale, alpha_bias)
)

Single-layer neural network model:

If you prefer a less constrained shape for global epistasis, we also offer the ability to learn its shape using a single-layer neural network, sometimes referred to as a multi-layer perceptron. This is similar to the approach presented by Zhou et al. 2022.

For this option, the user defines a number of units in the singular hidden layer of the model. For each hidden unit, we introduce three parameters (two weights and a bias) to be inferred. All weights are clipped at zero to maintain assumptions of monotonicity in the resulting epistasis function shape. The network applies a sigmoid activation to each internal unit before a final transformation and addition of a constant gives us our predicted functional score.

Given latent phenotype, \(\phi_d(v) = z\), let

\[g(z) = b^{o}+ \sum_{i}^{n} \frac{w^{o}_{i}}{1 + e^{w^{l}_{i}*z + b^{l}_{i}}}\]

Where:

  • \(n\) is the number of units in the hidden layer.

  • \(w^{l}_{i}\) and \(w^{o}_{i}\) are free parameters representing latent and output transformations, respectively, associated with unit \(i\) in the hidden layer of the network.

  • \(b^{l}_{i}\) is a free parameter, as an added bias term to unit \(i\).

  • \(b^{o}\) is a constant, singular free parameter.
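A minimal numpy sketch of this network (an illustration of the equation above, not the multidms implementation; with the sign convention shown, clipping the weights at zero keeps each sigmoid unit, and hence the whole function, monotone in \(z\)):

```python
import numpy as np

def g_nn(z, w_latent, b_latent, w_output, b_output):
    # single hidden layer of sigmoid units; weights are clipped at zero
    # to keep the learned epistasis function monotone
    w_l = np.clip(np.asarray(w_latent), 0.0, None)
    w_o = np.clip(np.asarray(w_output), 0.0, None)
    units = 1.0 / (1.0 + np.exp(w_l * z + np.asarray(b_latent)))
    return b_output + np.sum(w_o * units)

# two hidden units (n = 2); each unit outputs 0.5 at z = 0 with zero bias
g_nn(0.0, w_latent=[1.0, 0.5], b_latent=[0.0, 0.0],
     w_output=[2.0, 2.0], b_output=1.0)  # 1 + 2*0.5 + 2*0.5 = 3.0
```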

Note: This is an advanced feature, and we advise against its use unless the other options are not sufficiently parameterized for particularly complex experimental conditions.

Note: In the multidms API, this behavior is accomplished by setting the output_activation parameter in the constructor for Model to a pointer to the function multidms.biophysical.single_layer_nn.

Identity (no epistasis):

Given latent phenotype, \(\phi_d(v) = z\), let

\[g(z) = z\]

In this functional form, there is no global epistasis: latent phenotypes are identical to functional scores. We recommend this functional form if none of the variants in the DMS experiment have more than one mutation, since multi-mutant variants are needed for the model to accurately infer a non-linear global-epistasis function. It can also be used as a baseline to determine if a non-linear global-epistasis function leads to better model fit.

Note: In the multidms API, this is accomplished by setting the output_activation parameter in the constructor for Model to a pointer to the function multidms.biophysical.identity_activation.

Optional normalization of observed functional scores

A common way to compute functional scores is with log enrichment ratios, where all scores from a given condition are normalized so that the wildtype sequence from that condition has a value of zero. If wildtype sequences differ between conditions, then log enrichment ratios may not be directly comparable between conditions, as they are normalized to different reference points. This breaks the assumption of our joint-modeling scheme: that all functional scores are directly comparable.

Ideally one or more of the same sequences would be included in the DMS library of each condition, so that all scores could be normalized to the same sequence. However, if there are no common sequences, then it may be possible to computationally estimate how to normalize scores so that they are more directly comparable.

To this end, the model includes an additional parameter \(\gamma_d\) for each non-reference condition that allows functional scores from that condition to be normalized as follows:

\[y_{v,d}^{\text{norm}} = y_{v,d} + \gamma_d\]

where \(\gamma_d\) for the reference condition is locked at zero. There is a theoretical basis for adding \(\gamma_d\) to \(y_{v,d}\) if functional scores are log enrichment ratios. As mentioned above, log enrichment ratios are normalized so that the wildtype sequence from a given experiment has a value of zero, according to the formula:

\[y_{v,d} = \log(E_{v,d}) - \log(E_{\text{wt},d})\]

Thus, adding \(\gamma_d\) to \(y_{v,d}\) is akin to renormalizing the log enrichment ratios so that a different sequence has a functional score of zero. In theory, for each non-reference condition, there is a \(\gamma_d\) that normalizes functional scores to be relative to the wildtype sequence of the reference condition. If these values are not experimentally measured, the model allows \(\gamma_d\) parameters to be fit during optimization, which assumes that the correct \(\gamma_d\) values will give the best model fit. Alternatively, \(\gamma_d\) values can be locked at zero if the user does not wish to implement this optional feature.

Optional truncation of predicted functional scores

By default, the output of the model is equal to the output of the chosen global epistasis model, \(\hat{y}_d(v) = g(\phi_d(v))\). However, we provide infrastructure to clip the predicted functional scores at some lower bound for users who may clip their input data. Here, we describe the motivation and approach for this.

The commonly reported fold-change metric for functional scores described above suffers from a limit-of-detection problem for deleterious mutations: once a mutation is sufficiently deleterious, further differences (e.g., between -3 and -5) become largely meaningless because the protein is already essentially nonfunctional. Thus, it is common for researchers to simply truncate (i.e., clip) the functional scores at some lower bound \(l\), below which observations are largely assumed to be outliers.

When fitting the model to these data, it may be desirable to truncate predicted functional scores in a similar way. This is especially relevant when allowing \(\gamma_d\) values to be non-zero. For instance, say the user has truncated observed functional scores at a lower bound of \(-3\) across all conditions. If \(\gamma_d\) is fit to \(-1.0\) for one of the non-reference conditions, then the new floor of normalized functional scores for that condition would be \(-4\), while the floor for the reference condition would still be \(-3\). In this case, the global-epistasis function could find itself in a pickle: if it allowed predictions to go below \(-3\), it could help model the floor of points in the non-reference condition, but hurt with modeling the floor of points in the reference condition. However, if it was able to truncate predicted scores for a given condition at the floor for that condition, then this tension would be relieved: predicted scores for the reference condition could be truncated at \(-3\), while (non-truncated) predicted scores for the non-reference condition could go as low as \(-4\).

To this end, the software package provides an option to truncate predicted functional scores which should be used if (and only if) the user has clipped the functional scores in their data to some lower bound. To enable this, the mathematical model passes predicted functional scores through an activation function, \(t\).

In essence, this is a modified softplus activation (\(\text{softplus}(x)=\log(1 + e^{x})\)) with a lower bound at \(l + \gamma_{d}\), as well as a ramping coefficient, \(\lambda_{\text{sp}}\).

Concretely, if we let \(z' = g(\phi_d(v))\), then the predicted functional score of our model is given by:

\[t(z') = \lambda_{\text{sp}}\log\left(1 + e^{\frac{z' - l}{\lambda_{\text{sp}}}}\right) + l\]

Functionally speaking, this truncates scores below a lower bound, while leaving scores above (mostly) unaltered. There is a small range of input values where the function smoothly transitions between a flat regime (where data is truncated) and a linear regime (where data is not truncated).
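A numerical sketch of this truncation (assuming numpy; np.logaddexp gives a numerically stable \(\log(1+e^{x})\), avoiding overflow for inputs far above the bound):

```python
import numpy as np

def t(z, lower_bound, lambda_sp=0.1):
    # lambda_sp * log(1 + exp((z - l) / lambda_sp)) + l, computed stably
    return lambda_sp * np.logaddexp(0.0, (z - lower_bound) / lambda_sp) + lower_bound

t(2.0, lower_bound=-3.5)    # ~ 2.0: scores above the bound pass through
t(-10.0, lower_bound=-3.5)  # ~ -3.5: scores below are clipped to the bound
```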

Note: We recommend leaving the \(\lambda_{\text{sp}}\) parameter at its default value of \(0.1\). This ensures a sharp transition between regimes, similar to a ReLU function, while retaining differentiability for gradient-based optimization. However, the option to adjust it is there in case the model will not converge.

Below is an interactive plot showing the effect of the modified softplus activation as a function of an adjustable \(\lambda_{sp}\) scaling parameter, and lower bound, \(l\):


import altair as alt
import numpy
import pandas as pd

df = pd.DataFrame({"latent": numpy.linspace(-10, 5, 100)})

slider_lsp = alt.binding_range(min=0.1, max=2)
var_lambda_sp = alt.param(bind=slider_lsp, value=1, name="lambda_sp")

slider_lb = alt.binding_range(min=-10, max=0)
var_lower_bound = alt.param(bind=slider_lb, value=-3.5, name="lower_bound")

(
    alt.Chart(df)
    .transform_calculate(
        phenotype=alt.expr.log(1 + alt.expr.exp((alt.datum['latent']-var_lower_bound)/var_lambda_sp))
        * var_lambda_sp
        + var_lower_bound
    )
    .encode(
        x=alt.X("latent", title="global epistasis prediction (z')", scale=alt.Scale(domain=[-10, 5])),
        y=alt.Y("phenotype:Q", title="predicted phenotype", scale=alt.Scale(domain=[-10, 5]))
    )
    .mark_line()
    .add_params(var_lambda_sp, var_lower_bound)
)

Fitting procedure

The multidms software package has a framework for fitting free parameters in the model given input DMS data. The mathematical model is coded in Python using the JAX library [3], allowing for autograd and XLA compilation for high-performance optimization. We approach our non-smooth optimization problem via proximal gradient descent in order to satisfy several simultaneous constraints, such as \(L_1\) regularization, optionally locking parameters, and clipping parameters at pre-specified lower bounds. Specifically, we use JAXopt to optimize parameters via full-batch proximal gradient descent, using the jaxopt.ProximalGradient class.

We now define notation and introduce our model more formally to describe the fitting procedure without ambiguity. Let \(M\in\mathbb{N}\) denote the number of distinct mutations, and represent a given variant \(v\subset\mathcal{M}\equiv\{1,\dots,M\}\) as an index set of the mutations it contains (\(v\) is in the set of subsets of the \(M\) mutations, i.e. \(v\in\mathcal{V}\equiv2^\mathcal{M}\), where \(2^\mathcal{M}\) denotes the power set of \(\mathcal{M}\)).

Consider a variant \(v\) as an indicator (one-hot) vector \(x_v\in\{0,1\}^M\) where \(\left[x_v\right]_i=1\) if \(i\in v\) and \(\left[x_v\right]_i=0\) otherwise. Let \(D\in\mathbb{N}\) be the number of experiments and write \(\mathbf{1}\) for the \(D\)-vector of ones (the letter \(D\) is used as a mnemonic for DMS). We introduce an additive latent phenotype model jointly for \(D\) experiments via a family of affine maps \(\phi_{(\beta_0,\beta,\Delta)}:\{0, 1\}^M\to\mathbb{R}^D\) defined by

\[\phi_{(\beta_0,\beta,\Delta)}(x) = \beta_0\mathbf{1} + (\mathbf{1}\beta^\intercal + \Delta) x, \quad x\in\{0, 1\}^M,\]

where the family is parameterized by intercept \(\beta_0\in\mathbb{R}\) and mutational effects \(\beta\in\mathbb{R}^M\) that are shared by all \(D\) output dimensions, and shift matrix \(\Delta\in\mathbb{R}^{D\times M}\). We require that the first row of \(\Delta\) is the zero \(M\)-vector, so that the reference experiment (indexed 1 WLOG) has no shifts, and \(\beta\) is then interpreted as the vector of mutational effects in the reference experiment, with the intercept \(\beta_0\) representing the latent phenotype of the wildtype sequence in the reference experiment.
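The affine map can be written directly in numpy (a sketch with arbitrary example values, not the package’s implementation):

```python
import numpy as np

M, D = 4, 3  # number of mutations and experiments
rng = np.random.default_rng(0)

beta0 = -1.0
beta = rng.normal(size=M)        # shared mutational effects
Delta = rng.normal(size=(D, M))  # shift matrix
Delta[0] = 0.0                   # reference row locked at zero

x = np.array([1, 0, 1, 0])  # one-hot variant carrying mutations 0 and 2

# phi(x) = beta0 * 1 + (1 beta^T + Delta) x: one latent value per experiment
latent = beta0 + (np.outer(np.ones(D), beta) + Delta) @ x  # shape (D,)
```

Because the reference row of \(\Delta\) is zero, the first entry of `latent` reduces to the purely additive reference-condition phenotype, \(\beta_0 + \beta_0' \) with no shifts, i.e. `beta0 + beta[0] + beta[2]` for this variant.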

Next, we introduce a global-epistasis function via a family of strictly monotone maps \(g_\theta:\mathbb{R}\to\mathbb{R}\) that we use to take latent phenotypes to predicted functional scores. This family is parameterized by \(\theta\in\mathbb{R}^r\) for some \(r\in\mathbb{N}\). By default, multidms will use the sigmoid function

\[g_\theta(z) = \theta_0 + \frac{\theta_1}{1+e^{-z}}, \quad z\in\mathbb{R},\]

with \(r=2\) parameters, which allows us to adapt the output range of the global-epistasis function (the interval \((\theta_0, \theta_0 + \theta_1)\)) to the range of our functional score data, but is otherwise a fixed link function (imposing a gauge on our latent phenotype model parameters). We finally compute the predicted functional score in experiment \(d\in\{1,\dots,D\}\) of a variant \(v\in\mathcal{V}\) with one-hot encoding \(x_v\in\{0, 1\}^M\) as

\[\hat{y}_d(x_v) = g_\theta\left(\left[\phi_{(\beta_0,\beta,\Delta)}(x_v)\right]_d\right).\]

Our data consists of sets of one-hot encoded variants and their associated functional scores from each of \(D\) experiments. Denote these as \(\mathcal{D}_d \subset \{0, 1\}^M\times\mathbb{R}\) for \(d=1,\dots,D\). We minimize an objective of the form:

\[f(\beta_0,\beta,\Delta,\theta) = \sum_{d=1}^D\sum_{(x, y)\in\mathcal{D}_d}\!\!\ell(y,\hat{y}_d(x)) + \lambda\|\Delta\|_{1,1},\]

where \(\ell:\mathbb{R}\times\mathbb{R}\to\mathbb{R}\) is a Huber loss function measuring the difference between a predicted and an observed functional score, \(\lambda\in\mathbb{R}\) is the lasso penalty weight, and \(\|\cdot\|_{1,1}\) denotes the entrywise \(L_1\) norm (not to be confused with the matrix 1-norm \(\|\cdot\|_1\)). Note that the parameters \((\beta_0,\beta,\Delta,\theta)\) appear in the loss function via the predicted functional score equation, but are suppressed in the objective for notational compactness. Note also that, by taking \(\lambda=0\), the loss term of the objective becomes separable over the \(D\) experiments, so marginal inference is recovered as a special case.
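The objective can be sketched in numpy (Huber loss plus an entrywise \(L_1\) penalty on \(\Delta\); a simplified illustration, not the fitted JAX implementation):

```python
import numpy as np

def huber(r, delta=1.0):
    # quadratic near zero, linear in the tails
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

def objective(data_by_condition, Delta, lam):
    """data_by_condition: list over experiments of (y, y_hat) array pairs."""
    loss = sum(np.sum(huber(y - y_hat)) for y, y_hat in data_by_condition)
    return loss + lam * np.abs(Delta).sum()  # entrywise L1 norm of Delta

data = [(np.array([1.0, 2.0]), np.array([1.0, 0.0]))]  # residuals 0 and 2
Delta = np.array([[0.0], [0.5]])
objective(data, Delta, lam=2.0)  # huber: 0 + 1.5; penalty: 2 * 0.5 -> 2.5
```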

For a general global epistasis function \(g_\theta\), the objective is non-convex. However, with the simple sigmoid function, it is bi-convex in \((\beta_0, \beta, \Delta)\) and \(\theta\). This can be seen by noting that, for fixed \(\theta\), the prediction model takes the form of a generalized linear model with a sigmoid link function, and for fixed \((\beta_0, \beta, \Delta)\), fitting \(\theta\) is a linear regression problem. We optimize the objective using the Nesterov-accelerated proximal gradient method with backtracking line search [4].
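The proximal step for the entrywise \(L_1\) penalty is soft-thresholding of \(\Delta\), with the reference row held at zero. Below is a numpy sketch of that operator (jaxopt.ProximalGradient accepts a user-supplied prox of this form; this is an illustration, not the package’s actual prox):

```python
import numpy as np

def prox_l1(Delta, lam, step_size):
    # soft-thresholding: the proximal operator of lam * ||Delta||_{1,1}
    shrunk = np.sign(Delta) * np.maximum(np.abs(Delta) - lam * step_size, 0.0)
    shrunk[0] = 0.0  # reference row of Delta stays locked at zero
    return shrunk

prox_l1(np.array([[0.7, -0.2], [0.5, -2.0]]), lam=1.0, step_size=1.0)
# small entries shrink to exactly zero; large entries shrink toward zero
```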

References