Normalize

A set of useful functions for normalizing enrichments in a phippery dataset.

phippery.normalize.counts_per_million(ds, inplace=True, new_table_name='cpm', per_sample=True, data_table='counts')[source]

Compute counts per million.

Concretely, given a Matrix of enrichments, \(X\), in the phippery dataset with shape P peptides and S samples, we compute the \(i^{th}\) peptide and \(j^{th}\) sample position like so:

\[\begin{split}\text{cpm}(X,i,j) = \left\{ \begin{array}{@{}ll@{}} X_{i,j}/\sum_{p\in P} X_{p,j} \times 10^6, & \text{if per_sample is True} \\[10pt] X_{i,j}/\sum_{p\in P}\sum_{s\in S} X_{p,s} \times 10^6, & \text{if per_sample is False} \end{array} \right.\end{split}\]

Parameters:

ds (xarray.DataSet) – The dataset you would like to fit to
per_sample (bool) – If True, compute counts per million separately for each sample. Otherwise, frequencies are computed as a ratio of each count to the sum of all counts in the dataset.
data_table (str) – The name of the enrichment layer you would like to fit mlxp to.
new_table_name (str) – The name of the new layer you would like to append to the dataset.
inplace (bool) – If True, then this function appends a dataArray to ds which is indexed with the same coordinate dimensions as ‘data_table’. If False, a copy of ds is returned with the appended dataArray

Returns:

If inplace is False, return a new DataSet object which has the enrichment values appended

Return type:

xarray.DataSet

Example

>>> ds.counts.values
array([[8, 8, 3, 1],
       [1, 3, 7, 4],
       [9, 0, 5, 1],
       [5, 2, 4, 4],
       [1, 4, 1, 3],
       [3, 4, 4, 0],
       [4, 8, 2, 7],
       [4, 3, 6, 5],
       [0, 2, 9, 1],
       [0, 5, 6, 5]])
>>> from phippery.normalize import counts_per_million
>>> counts_per_million(ds, new_table_name="cpm", inplace=True)
>>> ds.cpm.values
array([[228571.43, 205128.21,  63829.79,  32258.06],
       [ 28571.43,  76923.08, 148936.17, 129032.26],
       [257142.86,      0.  , 106382.98,  32258.06],
       [142857.14,  51282.05,  85106.38, 129032.26],
       [ 28571.43, 102564.1 ,  21276.6 ,  96774.19],
       [ 85714.29, 102564.1 ,  85106.38,      0.  ],
       [114285.71, 205128.21,  42553.19, 225806.45],
       [114285.71,  76923.08, 127659.57, 161290.32],
       [     0.  ,  51282.05, 191489.36,  32258.06],
       [     0.  , 128205.13, 127659.57, 161290.32]])

phippery.normalize.differential_selection_sample_groups(ds, sample_feature='library_batch', is_equal_to='batch_a', data_table='counts', aggregate_function=<function mean>, inplace=True, new_table_name='sample_group_differential_selection')[source]

This function computes differential selection between groups of samples.

Note

This function is still experimental.

Parameters:

ds (xarray.DataSet) – The dataset you would like to fit to
data_table (str) – The name of the enrichment layer you would like to fit mlxp to.
new_table_name (str) – The name of the new layer you would like to append to the dataset.
inplace (bool) – If True, then this function appends a dataArray to ds which is indexed with the same coordinate dimensions as ‘data_table’. If False, a copy of ds is returned with the appended dataArray

Returns:

If inplace is False, return a new DataSet object which has the enrichment values appended

Return type:

xarray.DataSet

phippery.normalize.differential_selection_wt_mut(ds, data_table='enrichment', scaled_by_wt=False, smoothing_flank_size=0, groupby=['Protein'], loc_column='Loc', is_wt_column='is_wt', inplace=True, new_table_name='wt_mutant_differential_selection', relu_bias=None, skip_samples={})[source]

A generalized function to compute differential selection of amino acid variants in relation to the wildtype sequence. The function computed log fold change between enrichments of a wildtype and mutation at any given site.

Concretely, given some site, \(s\) (defined by loc_column feature in the peptide table) in the enrichment matrix, \(X\), the differential selection of any mutation with enrichment, \(m\), at a site with wildtype enrichment, \(wt\), is \(\log_{2}(m/wt)\).

Note

This function is meant to be used specifically with phage-dms data where the peptide table includes a “loc” column, and an “aa_sub_col” which specifies the amino acid at that location.

Note

This calculation of differential selection is derived from the Bloom Lab’s formulated from https://jbloomlab.github.io/dms_tools2/diffsel.html#id5

Parameters:

ds (xarray.DataSet) – The dataset you would like to fit to
data_table (str) – The name of the enrichment layer you would like to fit mlxp to.
scaled_by_wt (bool) – A boolean flag indicating whether or not you would like to multiply the differential selection value by the wildtype enrichment
smoothing_flank_size (int) – This parameter should only be used if scaled_by_wt is true. By specifying an integer greater than 0, you are scaling the differential selection value by enrichment values of the wildtype peptides surrounding , in both directions, a given site. The integer specified here then determines how many peptides are used for the scaling in both directions.
groupby (list[str]) – This will specify which peptide feature groups such that site-mutation combinations are unique.
loc_column (str) – The peptide table feature which specifies the site that a particular peptide is centered at.
is_wt_column (str) – The column specifying which peptides are wildtype.
relu_bias (int) – If an integer is specified, then enrichment values less than 1 are replaced by the specified value before computing differential selection.
skip_samples (set) – sample id’s which you do not want to calculate the differential selection on, such as controls. This function has many nested loops, so avoid computing on unnecessary samples.
new_table_name (str) – The name of the new layer you would like to append to the dataset.
inplace (bool) – If True, then this function appends a dataArray to ds which is indexed with the same coordinate dimensions as ‘data_table’. If False, a copy of ds is returned with the appended dataArray

Returns:

If inplace is False, return a new DataSet object which has the enrichment values appended

Return type:

xarray.DataSet

phippery.normalize.enrichment(ds, lib_ds, data_table='counts', inplace=True, new_table_name='enrichment')[source]

This function computes fold enrichment in the same fashion as the standardized_enrichment, but does not subtract beads only controls

Parameters:

ds (xarray.DataSet) – The dataset you would like to fit to
lib_ds (xarray.DataSet) – The dataset of phage library control samples used in normalization
data_table (str) – The name of the enrichment layer you would like to fit mlxp to.
new_table_name (str) – The name of the new layer you would like to append to the dataset.
inplace (bool) – If True, then this function appends a dataArray to ds which is indexed with the same coordinate dimensions as ‘data_table’. If False, a copy of ds is returned with the appended dataArray

Returns:

If inplace is False, return a new DataSet object which has the enrichment values appended

Return type:

xarray.DataSet

phippery.normalize.rank_data(ds, data_table='counts', inplace=True, per_sample=False, new_table_name='rank')[source]

Compute the rank of specified enrichment layer. The rank is decending

Parameters:

ds (xarray.DataSet) – The dataset you would like to fit to
per_sample (bool) – If True, compute rank separately for each sample. Otherwise, frequencies are computed as a ratio of each count to the sum of all counts in the dataset.
data_table (str) – The name of the enrichment layer you would like to fit mlxp to.
new_table_name (str) – The name of the new layer you would like to append to the dataset.
inplace (bool) – If True, then this function appends a dataArray to ds which is indexed with the same coordinate dimensions as ‘data_table’. If False, a copy of ds is returned with the appended dataArray

Returns:

If inplace is False, return a new DataSet object which has the enrichment values appended

Return type:

xarray.DataSet

Example

>>> ds["counts"].values
array([[533, 734, 399, 588],
       [563, 947, 814, 156],
       [ 47, 705, 750, 685],
       [675, 118, 897, 290],
       [405, 880, 772, 570],
       [629, 961, 530,  63],
       [633, 931, 268, 115],
       [833, 290, 164, 184],
       [ 18, 704, 359,  33],
       [143, 486, 371, 415]])
>>> rank_data(ds, data_table="counts", per_sample=True)
>>> ds["rank"].values
array([[4, 5, 4, 8],
       [5, 8, 8, 3],
       [1, 4, 6, 9],
       [8, 0, 9, 5],
       [3, 6, 7, 7],
       [6, 9, 5, 1],
       [7, 7, 1, 2],
       [9, 1, 0, 4],
       [0, 3, 2, 0],
       [2, 2, 3, 6]])

phippery.normalize.size_factors(ds, inplace=True, data_table='counts', new_table_name='size_factors')[source]

Warning

This method is deprecated. It is currently maintained only for reproducibility of previous results.

Compute size factors from Anders and Huber 2010.

Concretely, given a Matrix of enrichments, \(X_{i,j}\), in the phippery dataset with shape (peptides, samples). Consider a pseudo-reference sample where the count of each peptide is the geometric mean of counts for that peptide across the samples in the dataset. For a sample to be normalized, calculate for each peptide the ratio of peptide count in the sample to the peptide count in the pseudo-reference sample. The median of this ratio, \(\hat S_{j}\), is the size factor to normalize all counts in sample \(j\),

\[\hat S_{j} = {median \atop i} \frac{X_{i,j}}{(\prod_{v=1}^{m}{X_{i,v}})^{1/m}}\]

Parameters:

ds (xarray.DataSet) – The dataset you would like to fit to
data_table (str) – The name of the enrichment layer you would like to fit mlxp to.
new_table_name (str) – The name of the new layer you would like to append to the dataset.
inplace (bool) – If True, then this function appends a dataArray to ds which is indexed with the same coordinate dimensions as ‘data_table’. If False, a copy of ds is returned with the appended dataArray

Returns:

If inplace is False, return a new DataSet object which has the enrichment values appended

Return type:

xarray.DataSet

phippery.normalize.standardized_enrichment(ds, lib_ds, beads_ds, data_table='counts', inplace=True, new_table_name='std_enrichment')[source]

Compute standardized enrichment of sample counts. This is the fold enrichment of each sample’s frequency compared to the library average frequency, minus the mock IP (beads only control) frequency.

Note

Psuedo counts and exact calculation are derived from the Bloom Lab’s formulated normalization heuristic for differential selection. See https://jbloomlab.github.io/dms_tools2/diffsel.html#id5

Parameters:

ds (xarray.DataSet) – The dataset you would like to fit to
lib_ds (xarray.DataSet) – The dataset of phage library control samples used in normalization
beads_ds (xarray.DataSet) – The dataset of beads only control samples used in normalization
data_table (str) – The name of the enrichment layer you would like to fit mlxp to.
new_table_name (str) – The name of the new layer you would like to append to the dataset.
inplace (bool) – If True, then this function appends a dataArray to ds which is indexed with the same coordinate dimensions as ‘data_table’. If False, a copy of ds is returned with the appended dataArray

Returns:

If inplace is False, return a new DataSet object which has the enrichment values appended

Return type:

xarray.DataSet

Example

Given a dataset, ds, with the following samples and counts across 10 peptides:

>>> phippery.get_annotation_table(ds, "sample")
sample_metadata  fastq_filename reference seq_dir sample_type
sample_id
0                sample_0.fastq      refa    expa  beads_only
1                sample_1.fastq      refa    expa  beads_only
2                sample_2.fastq      refa    expa     library
3                sample_3.fastq      refa    expa     library
4                sample_4.fastq      refa    expa          IP
5                sample_5.fastq      refa    expa          IP
6                sample_6.fastq      refa    expa          IP
7                sample_7.fastq      refa    expa          IP
8                sample_8.fastq      refa    expa          IP
9                sample_9.fastq      refa    expa          IP
>>> ds.counts
<xarray.DataArray 'counts' (peptide_id: 5, sample_id: 10)>
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
Coordinates:
  * sample_id   (sample_id) int64 0 1 2 3 4 5 6 7 8 9
  * peptide_id  (peptide_id) int64 0 1 2 3 4

First, separate the dataset into it’s various samples types using phippery.utils.ds_query()

>>> ip_ds = ds_query(ds, "sample_type == 'IP'")
>>> lib_ds = ds_query(ds, "sample_type == 'library'")
>>> mockip_ds = ds_query(ds, "sample_type == 'beads_only'")

We can then compute standardized fold enrichment like so:

>>> phippery.normalize.standardized_enrichment(ip_ds, lib_ds, mockip_ds)

which will modify the ip_ds dataset inplace to include a new table

>>> ip_ds.std_enrichment
<xarray.DataArray 'std_enrichment' (peptide_id: 5, sample_id: 6)>
array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])
Coordinates:
  * sample_id   (sample_id) int64 4 5 6 7 8 9
  * peptide_id  (peptide_id) int64 0 1 2 3 4

Note that we expect the result to be all zeros because a 1-to-1 fold enrichment for ip’s to library samples minus 1-to-1 beads to library

phippery.normalize.svd_aa_loc(ds, rank=1, data_table='enrichment', protein_name_column='Protein', location_col='Loc', aa_sub_col='aa_sub', inplace=True, new_table_name='svd_rr')[source]

Compute singular value decomposition rank reduction on the aa / loc matrix by pivoting before computing decomposition and re-shaping to add to the dataset.

Note

This function is meant to be used specifically with phage-dms data where the peptide table includes a “loc” column, and an “aa_sub_col” which specifies the amino acid at that location.

Parameters:

ds (xarray.DataSet) – The dataset you would like to fit to
data_table (str) – The name of the enrichment layer you would like to fit mlxp to.
protein_name_column (str) – The peptide table feature which specifies which protein a specific peptide derives from.
location_col (str) – The peptide table feature which specifies the site that a particular peptide is centered at.
aa_sub_col (str) – The peptide table feature which specifies the amino acid at a given site.

new_table_namestr: The name of the new layer you would like to append to the dataset.
inplacebool: If True, then this function appends a dataArray to ds which is indexed with the same coordinate dimensions as ‘data_table’. If False, a copy of ds is returned with the appended dataArray

Returns:: If inplace is False, return a new DataSet object which has the enrichment values appended
Return type:: xarray.DataSet

phippery.normalize.svd_rank_reduction(ds, rank=1, data_table='enrichment', inplace=True, new_table_name='svd_rr')[source]

This function computes the singular value decomposition, then recomputes the enrichment matrix up to the rank specified.

Concretely, given a Matrix of, \(X\) enrichments in the phippery dataset with shape (peptides, samples). We compute the decomposition \(X = USV^{T}\), then return the recomposition using the first rank eigenvectors and eigenvaluesvalues.

Parameters:

ds (xarray.DataSet) – The dataset you would like to fit to
rank (int) – The number of ranks to include in the reconstruction.
data_table (str) – The name of the enrichment layer you would like to fit mlxp to.
new_table_name (str) – The name of the new layer you would like to append to the dataset.
inplace (bool) – If True, then this function appends a dataArray to ds which is indexed with the same coordinate dimensions as ‘data_table’. If False, a copy of ds is returned with the appended dataArray

Returns:

If inplace is False, return a new DataSet object which has the enrichment values appended

Return type:

xarray.DataSet