Normalize
A set of useful functions for normalizing enrichments in a phippery dataset.
- phippery.normalize.counts_per_million(ds, inplace=True, new_table_name='cpm', per_sample=True, data_table='counts')[source]
Compute counts per million.
Concretely, given a Matrix of enrichments, \(X\), in the phippery dataset with shape P peptides and S samples, we compute the \(i^{th}\) peptide and \(j^{th}\) sample position like so:
\[\begin{split}\text{cpm}(X,i,j) = \left\{ \begin{array}{@{}ll@{}} X_{i,j}/\sum_{p\in P} X_{p,j} \times 10^6, & \text{if per_sample is True} \\[10pt] X_{i,j}/\sum_{p\in P}\sum_{s\in S} X_{p,s} \times 10^6, & \text{if per_sample is False} \end{array} \right.\end{split}\]- Parameters:
ds (xarray.DataSet) – The dataset you would like to fit to
per_sample (bool) – If True, compute counts per million separately for each sample. Otherwise, frequencies are computed as a ratio of each count to the sum of all counts in the dataset.
data_table (str) – The name of the enrichment layer you would like to fit mlxp to.
new_table_name (str) – The name of the new layer you would like to append to the dataset.
inplace (bool) – If True, then this function appends a dataArray to ds which is indexed with the same coordinate dimensions as ‘data_table’. If False, a copy of ds is returned with the appended dataArray
- Returns:
If inplace is False, return a new DataSet object which has the enrichment values appended
- Return type:
xarray.DataSet
Example
>>> ds.counts.values array([[8, 8, 3, 1], [1, 3, 7, 4], [9, 0, 5, 1], [5, 2, 4, 4], [1, 4, 1, 3], [3, 4, 4, 0], [4, 8, 2, 7], [4, 3, 6, 5], [0, 2, 9, 1], [0, 5, 6, 5]]) >>> from phippery.normalize import counts_per_million >>> counts_per_million(ds, new_table_name="cpm", inplace=True) >>> ds.cpm.values array([[228571.43, 205128.21, 63829.79, 32258.06], [ 28571.43, 76923.08, 148936.17, 129032.26], [257142.86, 0. , 106382.98, 32258.06], [142857.14, 51282.05, 85106.38, 129032.26], [ 28571.43, 102564.1 , 21276.6 , 96774.19], [ 85714.29, 102564.1 , 85106.38, 0. ], [114285.71, 205128.21, 42553.19, 225806.45], [114285.71, 76923.08, 127659.57, 161290.32], [ 0. , 51282.05, 191489.36, 32258.06], [ 0. , 128205.13, 127659.57, 161290.32]])
- phippery.normalize.differential_selection_sample_groups(ds, sample_feature='library_batch', is_equal_to='batch_a', data_table='counts', aggregate_function=<function mean>, inplace=True, new_table_name='sample_group_differential_selection')[source]
This function computes differential selection between groups of samples.
Note
This function is still experimental.
- Parameters:
ds (xarray.DataSet) – The dataset you would like to fit to
data_table (str) – The name of the enrichment layer you would like to fit mlxp to.
new_table_name (str) – The name of the new layer you would like to append to the dataset.
inplace (bool) – If True, then this function appends a dataArray to ds which is indexed with the same coordinate dimensions as ‘data_table’. If False, a copy of ds is returned with the appended dataArray
- Returns:
If inplace is False, return a new DataSet object which has the enrichment values appended
- Return type:
xarray.DataSet
- phippery.normalize.differential_selection_wt_mut(ds, data_table='enrichment', scaled_by_wt=False, smoothing_flank_size=0, groupby=['Protein'], loc_column='Loc', is_wt_column='is_wt', inplace=True, new_table_name='wt_mutant_differential_selection', relu_bias=None, skip_samples={})[source]
A generalized function to compute differential selection of amino acid variants in relation to the wildtype sequence. The function computed log fold change between enrichments of a wildtype and mutation at any given site.
Concretely, given some site, \(s\) (defined by loc_column feature in the peptide table) in the enrichment matrix, \(X\), the differential selection of any mutation with enrichment, \(m\), at a site with wildtype enrichment, \(wt\), is \(\log_{2}(m/wt)\).
Note
This function is meant to be used specifically with phage-dms data where the peptide table includes a “loc” column, and an “aa_sub_col” which specifies the amino acid at that location.
Note
This calculation of differential selection is derived from the Bloom Lab’s formulated from https://jbloomlab.github.io/dms_tools2/diffsel.html#id5
- Parameters:
ds (xarray.DataSet) – The dataset you would like to fit to
data_table (str) – The name of the enrichment layer you would like to fit mlxp to.
scaled_by_wt (bool) – A boolean flag indicating whether or not you would like to multiply the differential selection value by the wildtype enrichment
smoothing_flank_size (int) – This parameter should only be used if scaled_by_wt is true. By specifying an integer greater than 0, you are scaling the differential selection value by enrichment values of the wildtype peptides surrounding , in both directions, a given site. The integer specified here then determines how many peptides are used for the scaling in both directions.
groupby (list[str]) – This will specify which peptide feature groups such that site-mutation combinations are unique.
loc_column (str) – The peptide table feature which specifies the site that a particular peptide is centered at.
is_wt_column (str) – The column specifying which peptides are wildtype.
relu_bias (int) – If an integer is specified, then enrichment values less than 1 are replaced by the specified value before computing differential selection.
skip_samples (set) – sample id’s which you do not want to calculate the differential selection on, such as controls. This function has many nested loops, so avoid computing on unnecessary samples.
new_table_name (str) – The name of the new layer you would like to append to the dataset.
inplace (bool) – If True, then this function appends a dataArray to ds which is indexed with the same coordinate dimensions as ‘data_table’. If False, a copy of ds is returned with the appended dataArray
- Returns:
If inplace is False, return a new DataSet object which has the enrichment values appended
- Return type:
xarray.DataSet
- phippery.normalize.enrichment(ds, lib_ds, data_table='counts', inplace=True, new_table_name='enrichment')[source]
This function computes fold enrichment in the same fashion as the standardized_enrichment, but does not subtract beads only controls
- Parameters:
ds (xarray.DataSet) – The dataset you would like to fit to
lib_ds (xarray.DataSet) – The dataset of phage library control samples used in normalization
data_table (str) – The name of the enrichment layer you would like to fit mlxp to.
new_table_name (str) – The name of the new layer you would like to append to the dataset.
inplace (bool) – If True, then this function appends a dataArray to ds which is indexed with the same coordinate dimensions as ‘data_table’. If False, a copy of ds is returned with the appended dataArray
- Returns:
If inplace is False, return a new DataSet object which has the enrichment values appended
- Return type:
xarray.DataSet
- phippery.normalize.rank_data(ds, data_table='counts', inplace=True, per_sample=False, new_table_name='rank')[source]
Compute the rank of specified enrichment layer. The rank is descending
- Parameters:
ds (xarray.DataSet) – The dataset you would like to fit to
per_sample (bool) – If True, compute rank separately for each sample. Otherwise, frequencies are computed as a ratio of each count to the sum of all counts in the dataset.
data_table (str) – The name of the enrichment layer you would like to fit mlxp to.
new_table_name (str) – The name of the new layer you would like to append to the dataset.
inplace (bool) – If True, then this function appends a dataArray to ds which is indexed with the same coordinate dimensions as ‘data_table’. If False, a copy of ds is returned with the appended dataArray
- Returns:
If inplace is False, return a new DataSet object which has the enrichment values appended
- Return type:
xarray.DataSet
Example
>>> ds["counts"].values array([[533, 734, 399, 588], [563, 947, 814, 156], [ 47, 705, 750, 685], [675, 118, 897, 290], [405, 880, 772, 570], [629, 961, 530, 63], [633, 931, 268, 115], [833, 290, 164, 184], [ 18, 704, 359, 33], [143, 486, 371, 415]]) >>> rank_data(ds, data_table="counts", per_sample=True) >>> ds["rank"].values array([[4, 5, 4, 8], [5, 8, 8, 3], [1, 4, 6, 9], [8, 0, 9, 5], [3, 6, 7, 7], [6, 9, 5, 1], [7, 7, 1, 2], [9, 1, 0, 4], [0, 3, 2, 0], [2, 2, 3, 6]])
- phippery.normalize.size_factors(ds, inplace=True, data_table='counts', new_table_name='size_factors')[source]
Warning
This method is deprecated. It is currently maintained only for reproducibility of previous results.
Compute size factors from Anders and Huber 2010.
Concretely, given a Matrix of enrichments, \(X_{i,j}\), in the phippery dataset with shape (peptides, samples). Consider a pseudo-reference sample where the count of each peptide is the geometric mean of counts for that peptide across the samples in the dataset. For a sample to be normalized, calculate for each peptide the ratio of peptide count in the sample to the peptide count in the pseudo-reference sample. The median of this ratio, \(\hat S_{j}\), is the size factor to normalize all counts in sample \(j\),
\[\hat S_{j} = {median \atop i} \frac{X_{i,j}}{(\prod_{v=1}^{m}{X_{i,v}})^{1/m}}\]- Parameters:
ds (xarray.DataSet) – The dataset you would like to fit to
data_table (str) – The name of the enrichment layer you would like to fit mlxp to.
new_table_name (str) – The name of the new layer you would like to append to the dataset.
inplace (bool) – If True, then this function appends a dataArray to ds which is indexed with the same coordinate dimensions as ‘data_table’. If False, a copy of ds is returned with the appended dataArray
- Returns:
If inplace is False, return a new DataSet object which has the enrichment values appended
- Return type:
xarray.DataSet
- phippery.normalize.standardized_enrichment(ds, lib_ds, beads_ds, data_table='counts', inplace=True, new_table_name='std_enrichment')[source]
Compute standardized enrichment of sample counts. This is the fold enrichment of each sample’s frequency compared to the library average frequency, minus the mock IP (beads only control) frequency.
Note
Psuedo counts and exact calculation are derived from the Bloom Lab’s formulated normalization heuristic for differential selection. See https://jbloomlab.github.io/dms_tools2/diffsel.html#id5
- Parameters:
ds (xarray.DataSet) – The dataset you would like to fit to
lib_ds (xarray.DataSet) – The dataset of phage library control samples used in normalization
beads_ds (xarray.DataSet) – The dataset of beads only control samples used in normalization
data_table (str) – The name of the enrichment layer you would like to fit mlxp to.
new_table_name (str) – The name of the new layer you would like to append to the dataset.
inplace (bool) – If True, then this function appends a dataArray to ds which is indexed with the same coordinate dimensions as ‘data_table’. If False, a copy of ds is returned with the appended dataArray
- Returns:
If inplace is False, return a new DataSet object which has the enrichment values appended
- Return type:
xarray.DataSet
Example
Given a dataset,
ds
, with the following samples and counts across 10 peptides:>>> phippery.get_annotation_table(ds, "sample") sample_metadata fastq_filename reference seq_dir sample_type sample_id 0 sample_0.fastq refa expa beads_only 1 sample_1.fastq refa expa beads_only 2 sample_2.fastq refa expa library 3 sample_3.fastq refa expa library 4 sample_4.fastq refa expa IP 5 sample_5.fastq refa expa IP 6 sample_6.fastq refa expa IP 7 sample_7.fastq refa expa IP 8 sample_8.fastq refa expa IP 9 sample_9.fastq refa expa IP >>> ds.counts <xarray.DataArray 'counts' (peptide_id: 5, sample_id: 10)> array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]) Coordinates: * sample_id (sample_id) int64 0 1 2 3 4 5 6 7 8 9 * peptide_id (peptide_id) int64 0 1 2 3 4
First, separate the dataset into it’s various samples types using
phippery.utils.ds_query()
>>> ip_ds = ds_query(ds, "sample_type == 'IP'") >>> lib_ds = ds_query(ds, "sample_type == 'library'") >>> mockip_ds = ds_query(ds, "sample_type == 'beads_only'")
We can then compute standardized fold enrichment like so:
>>> phippery.normalize.standardized_enrichment(ip_ds, lib_ds, mockip_ds)
which will modify the
ip_ds
dataset inplace to include a new table>>> ip_ds.std_enrichment <xarray.DataArray 'std_enrichment' (peptide_id: 5, sample_id: 6)> array([[0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0.]]) Coordinates: * sample_id (sample_id) int64 4 5 6 7 8 9 * peptide_id (peptide_id) int64 0 1 2 3 4
Note that we expect the result to be all zeros because a 1-to-1 fold enrichment for ip’s to library samples minus 1-to-1 beads to library
- phippery.normalize.svd_aa_loc(ds, rank=1, data_table='enrichment', protein_name_column='Protein', location_col='Loc', aa_sub_col='aa_sub', inplace=True, new_table_name='svd_rr')[source]
Compute singular value decomposition rank reduction on the aa / loc matrix by pivoting before computing decomposition and re-shaping to add to the dataset.
Note
This function is meant to be used specifically with phage-dms data where the peptide table includes a “loc” column, and an “aa_sub_col” which specifies the amino acid at that location.
- Parameters:
ds (xarray.DataSet) – The dataset you would like to fit to
data_table (str) – The name of the enrichment layer you would like to fit mlxp to.
protein_name_column (str) – The peptide table feature which specifies which protein a specific peptide derives from.
location_col (str) – The peptide table feature which specifies the site that a particular peptide is centered at.
aa_sub_col (str) – The peptide table feature which specifies the amino acid at a given site.
- new_table_namestr
The name of the new layer you would like to append to the dataset.
- inplacebool
If True, then this function appends a dataArray to ds which is indexed with the same coordinate dimensions as ‘data_table’. If False, a copy of ds is returned with the appended dataArray
- Returns:
If inplace is False, return a new DataSet object which has the enrichment values appended
- Return type:
xarray.DataSet
- phippery.normalize.svd_rank_reduction(ds, rank=1, data_table='enrichment', inplace=True, new_table_name='svd_rr')[source]
This function computes the singular value decomposition, then recomputes the enrichment matrix up to the rank specified.
Concretely, given a Matrix of, \(X\) enrichments in the phippery dataset with shape (peptides, samples). We compute the decomposition \(X = USV^{T}\), then return the recomposition using the first
rank
eigenvectors and eigenvalues.- Parameters:
ds (xarray.DataSet) – The dataset you would like to fit to
rank (int) – The number of ranks to include in the reconstruction.
data_table (str) – The name of the enrichment layer you would like to fit mlxp to.
new_table_name (str) – The name of the new layer you would like to append to the dataset.
inplace (bool) – If True, then this function appends a dataArray to ds which is indexed with the same coordinate dimensions as ‘data_table’. If False, a copy of ds is returned with the appended dataArray
- Returns:
If inplace is False, return a new DataSet object which has the enrichment values appended
- Return type:
xarray.DataSet