Utils

Utilities for building, indexing, and manipulating the xarray dataset topology used by most phippery functions provided in this package

phippery.utils.add_enrichment_layer_from_array(ds, enrichment, new_table_name=None, inplace=True)[source]

Append an enrichment layer to the dataset.

Parameters:
  • ds (xarray.DataSet) – The phippery dataset to append to.

  • enrichment (np.array) – The enrichment matrix to append to the phippery dataset. The number of rows should match the length of ds.peptide_id, and the number of columns should match the length of ds.sample_id.

  • new_table_name (str) – What you would like to name the enrichment layer.

  • inplace (bool) – Determines whether to modify the passed dataset, or return an augmented copy

Returns:

The augmented phippery dataset copy is returned if inplace is False; otherwise, the passed dataset is modified in place and None is returned.

Return type:

None | xarray.DataSet
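
Example

A minimal sketch, assuming an existing phippery dataset ds; the cpm matrix below is a random placeholder standing in for a real transformation of the counts:

>>> import numpy as np
>>> from phippery.utils import add_enrichment_layer_from_array
>>> cpm = np.random.rand(len(ds.peptide_id), len(ds.sample_id))
>>> add_enrichment_layer_from_array(ds, cpm, new_table_name="cpm")
>>> "cpm" in ds.data_vars
True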

phippery.utils.collapse_groups(ds, by, collapse_dim='sample', agg_func=<function mean>, compute_pw_cc=False, **kwargs)[source]

Collapse an xarray dataset along one of the annotation axes by applying agg_func to the annotation groups defined by 'by'.

Parameters:
  • ds (xarray.DataSet) – The phippery dataset to collapse.

  • by (list) – The name(s) of the annotation feature(s) you would like to collapse by.

  • collapse_dim (str) – The dimension you'd like to collapse: "sample" or "peptide"

  • compute_pw_cc (bool) – Whether or not to compute the mean pairwise correlation of all values within any feature group that is being collapsed.

  • agg_func (function) – This function must take a one-dimensional array and aggregate all values to a single number, agg_func(list[float | int]) -> float | int

Returns:

The collapsed phippery dataset.

Return type:

xarray.DataSet

Example

>>> get_annotation_table(ds, dim="sample")
sample_metadata  fastq_filename reference seq_dir sample_type
sample_id
0                sample_0.fastq      refa    expa  beads_only
1                sample_1.fastq      refa    expa  beads_only
2                sample_2.fastq      refa    expa     library
3                sample_3.fastq      refa    expa     library
4                sample_4.fastq      refa    expa          IP
5                sample_5.fastq      refa    expa          IP
>>> ds["counts"].to_pandas()
sample_id   0  1  2  3  4  5
peptide_id
0           7  0  3  2  3  2
1           6  3  1  0  7  5
2           9  1  7  8  4  7
>>> mean_sample_type_ds = collapse_groups(ds, by=["sample_type"])
>>> get_annotation_table(mean_sample_type_ds, dim="sample")
sample_metadata reference seq_dir sample_type
sample_id
0                    refa    expa          IP
1                    refa    expa  beads_only
2                    refa    expa     library
>>> mean_sample_type_ds["counts"].to_pandas()
sample_id     0    1    2
peptide_id
0           2.5  3.5  2.5
1           6.0  4.5  0.5
2           5.5  5.0  7.5
phippery.utils.collapse_peptide_groups(*args, **kwargs)[source]

Wrapper around collapse_groups() for the peptide dimension (collapse_dim="peptide").

phippery.utils.collapse_sample_groups(*args, **kwargs)[source]

Wrapper around collapse_groups() for the sample dimension (collapse_dim="sample").

phippery.utils.collect_counts(counts)[source]

Merge individual TSV files for a set of samples into a counts matrix.

Parameters:

counts (list[str]) –

A list of filepaths, relative to the current working directory, to read in. Each filepath should point to a tab-separated file for a single sample containing two columns (without headers):

  1. peptide ids - the integer peptide identifiers

  2. enrichments - the respective enrichment value for each peptide id

Returns:

The merged enrichments with peptides as the index, and filenames as column names.

Return type:

pd.DataFrame
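
Example

A minimal sketch, assuming two hypothetical per-sample alignment output files, sample_0.tsv and sample_1.tsv, each containing the two headerless columns described above:

>>> from phippery.utils import collect_counts
>>> counts_df = collect_counts(["sample_0.tsv", "sample_1.tsv"])

The resulting dataframe is indexed by peptide id, with one column per input file.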

phippery.utils.dataset_from_csv(peptide_table_filename, sample_table_filename, counts_table_filename)[source]

Load a dataset from individual comma separated files containing the counts matrix, as well as sample and peptide annotation tables.

Note

This is the inverse operation of the to_wide_csv() utility function. Generally speaking, these functions are used for long-term storage in common formats when pickle-dumped binaries are not ideal. For now, this function only supports a single enrichment table, added to the dataset with the variable name "counts". If you would like to add other transformations of the enrichment table (e.g. cpm, mlxp, etc.), you can load the CSVs via pandas and add them to the dataset using the add_enrichment_layer_from_array function.

Parameters:
  • counts_table_filename (str) – The glob filepath to CSV file(s) containing the enrichments. All files should have a first column of indices that match the given peptide table index column, and a first row of column headers that match the index of the sample table.

  • peptide_table_filename (str) – The relative filepath to the peptide annotation table.

  • sample_table_filename (str) – The relative filepath to the sample annotation table.

Returns:

The combined tables in a phippery dataset.

Return type:

xarray.DataSet
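
Example

A minimal sketch, assuming the three CSV files below were previously written to the working directory (e.g. by to_wide_csv()); the filenames are hypothetical:

>>> from phippery.utils import dataset_from_csv
>>> ds = dataset_from_csv(
...     "expa_peptide_annotation_table.csv",
...     "expa_sample_annotation_table.csv",
...     "expa_counts.csv",
... )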

phippery.utils.ds_query(ds, query, dim='sample')[source]

Apply a sample or peptide query statement to the entire dataset.

Note

For more on pandas queries, see the pandas documentation

Parameters:
  • ds (xarray.DataSet) – The dataset you would like to query.

  • query (str) – pandas query expression

  • dim (str) – The dimension to apply the expression to: "sample" or "peptide"

Returns:

A reference to the dataset slice resulting from the given expression.

Return type:

xarray.DataSet

Example

>>> phippery.get_annotation_table(ds, "peptide")
peptide_metadata Oligo   virus
peptide_id
0                 ATCG    zika
1                 ATCG    zika
2                 ATCG    zika
3                 ATCG    zika
4                 ATCG  dengue
5                 ATCG  dengue
6                 ATCG  dengue
7                 ATCG  dengue
>>> zka_ds = ds_query(ds, "virus == 'zika'", dim="peptide")
>>> zka_ds["counts"].to_pandas()
sample_id     0    1    2    3    4    5    6    7    8    9
peptide_id
0           110  829  872  475  716  815  308  647  216  791
1           604  987  776  923  858  985  396  539   32  600
2           865  161  413  760  422  297  639  786  857  878
3           992  354  825  535  440  416  572  988  763  841
phippery.utils.dump(ds, path)[source]

Simple wrapper for dumping an xarray dataset to a pickle binary.

Parameters:
  • ds (xarray.DataSet) – The dataset you would like to write to disk.

  • path (str) – The relative path you would like to write to.

Return type:

None
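
Example

A minimal sketch, writing the dataset to a hypothetical path in the current working directory:

>>> from phippery.utils import dump
>>> dump(ds, "expa.phip")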

phippery.utils.get_annotation_table(ds, dim='sample')[source]

Return a copy of the sample or peptide annotation table after converting the data types by applying the pandas NaN heuristic.

Parameters:
  • ds (xarray.DataSet) – The dataset to extract an annotation from.

  • dim (str) – The annotation table to grab: “sample” or “peptide”.

Returns:

The annotation table.

Return type:

pd.DataFrame

phippery.utils.id_coordinate_from_query_df(ds, query_df)[source]

Given a dataframe with pandas query statements for both samples and peptides, return the relevant sample and peptide ids after applying the logical AND of all queries.

Parameters:
  • ds (xarray.DataSet) – The dataset you would like to query.

  • query_df (pd.DataFrame) –

    A dataframe which must have two columns (including headers):

    1. "dimension" - either "sample" or "peptide" to specify expression dimension

    2. "expression" - The pandas query expression to apply.

Returns:

A tuple of sample ids and peptide ids.

Return type:

list, list
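
Example

A minimal sketch, assuming the sample and peptide annotations shown in the examples above (a sample_type sample feature and a virus peptide feature):

>>> import pandas as pd
>>> from phippery.utils import id_coordinate_from_query_df
>>> query_df = pd.DataFrame({
...     "dimension": ["sample", "peptide"],
...     "expression": ["sample_type == 'IP'", "virus == 'zika'"],
... })
>>> sample_ids, peptide_ids = id_coordinate_from_query_df(ds, query_df)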

phippery.utils.id_query(ds, query, dim='sample')[source]

Apply a sample or peptide query statement to the entire dataset and retrieve the respective indices.

Parameters:
  • ds (xarray.DataSet) – The dataset you would like to query.

  • query (str) – pandas query expression

  • dim (str) – The dimension to apply the expression to: "sample" or "peptide"

Returns:

The list of integer identifiers that apply to a given expression for the respective dimension.

Return type:

list[int]
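
Example

A minimal sketch, assuming the sample annotation table from the collapse_groups example above, where samples 0 and 1 are beads_only:

>>> from phippery.utils import id_query
>>> beads_ids = id_query(ds, "sample_type == 'beads_only'", dim="sample")

For that sample table, this would return the ids of samples 0 and 1, which can then be used to index the dataset.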

phippery.utils.iter_groups(ds, by, dim='sample')[source]

This function returns an iterator yielding subsets of the provided dataset, grouped by items in the metadata of the specified dimension.

Parameters:
  • ds (xarray.DataSet) – The dataset to iterate over.

  • by (str) – The annotation feature to group the subsets by.

  • dim (str) – The dimension of the grouping feature: "sample" or "peptide"

Returns:

Returns subsets of the original dataset sliced by either sample or peptide table groups.

Return type:

generator

Example

>>> phippery.get_annotation_table(ds, "sample")
sample_metadata  fastq_filename reference seq_dir sample_type
sample_id
0                sample_0.fastq      refa    expa  beads_only
1                sample_1.fastq      refa    expa  beads_only
2                sample_2.fastq      refa    expa     library
3                sample_3.fastq      refa    expa     library
>>> ds["counts"].values
array([[458, 204, 897, 419],
       [599, 292, 436, 186],
       [ 75,  90, 978, 471],
       [872,  33, 108, 505],
       [206, 107, 981, 208]])
>>> sample_groups = iter_groups(ds, by="sample_type")
>>> for group, phip_dataset in sample_groups:
...     print(group)
...     print(phip_dataset["counts"].values)
...
beads_only
[[458 204]
 [599 292]
 [ 75  90]
 [872  33]
 [206 107]]
library
[[897 419]
 [436 186]
 [978 471]
 [108 505]
 [981 208]]
phippery.utils.load(path)[source]

Simple wrapper for loading an xarray dataset from a pickle binary.

Parameters:

path (str) – Relative path of binary phippery dataset

Returns:

phippery dataset

Return type:

xarray.DataSet
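
Example

A minimal sketch, reading back a dataset written by dump() to a hypothetical path:

>>> from phippery.utils import load
>>> ds = load("expa.phip")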

phippery.utils.stitch_dataset(counts, peptide_table, sample_table)[source]

Build a phippery xarray dataset from individual tables.

Note

If the counts matrix you're passing has shape (M x N) for M peptides and N samples, the sample table should have length N and the peptide table should have length M.

Parameters:
  • counts (numpy.ndarray) – The counts matrix for sample peptide enrichments.

  • sample_table (pd.DataFrame) – The sample annotations corresponding to the columns of the counts matrix.

  • peptide_table (pd.DataFrame) – The peptide annotations corresponding to the rows of the counts matrix.

Returns:

The formatted phippery xarray dataset used by most of the phippery functionality.

Return type:

xarray.DataSet
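
Example

A minimal sketch with a 3-peptide by 2-sample counts matrix; the annotation tables are assumed to be indexed by peptide_id and sample_id, as in the examples above:

>>> import numpy as np
>>> import pandas as pd
>>> from phippery.utils import stitch_dataset
>>> counts = np.array([[7, 0], [6, 3], [9, 1]])
>>> peptide_table = pd.DataFrame(
...     {"virus": ["zika", "zika", "dengue"]},
...     index=pd.Index(range(3), name="peptide_id"),
... )
>>> sample_table = pd.DataFrame(
...     {"sample_type": ["beads_only", "library"]},
...     index=pd.Index(range(2), name="sample_id"),
... )
>>> ds = stitch_dataset(counts, peptide_table, sample_table)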

phippery.utils.to_tall(ds: Dataset)[source]

Melt a phippery xarray dataset into a single long-format dataframe with one unique sample-peptide interaction per row. Ideal for ggplot-style plotting.

Parameters:

ds (xarray.DataSet) – The dataset to convert.

Returns:

The tall formatted dataset.

Return type:

pd.DataFrame

Example

>>> ds["counts"].to_pandas()
sample_id     0    1
peptide_id
0           453  393
1           456  532
2           609  145
>>> to_tall(ds)[["sample_id", "peptide_id", "counts"]]
  sample_id  peptide_id  counts
0         0           0     453
1         0           1     456
2         0           2     609
3         1           0     393
4         1           1     532
5         1           2     145
phippery.utils.to_wide(ds)[source]

Take a phippery dataset and split it into its separate components in a dictionary.

Parameters:

ds (xarray.DataSet) – The dataset to separate.

Returns:

The dictionary of annotation tables and enrichment matrices.

Return type:

dict
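
Example

A minimal sketch; the returned dictionary holds the annotation tables and one dataframe per enrichment layer:

>>> from phippery.utils import to_wide
>>> tables = to_wide(ds)
>>> list(tables.keys())  # annotation tables and enrichment layers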

phippery.utils.to_wide_csv(ds, file_prefix)[source]

Take a phippery dataset, split it into its separate components, and write each to a comma-separated file.

Note

This is the inverse operation of the dataset_from_csv() utility function. Generally speaking these functions are used for long term storage in common formats when pickle dumped binaries are not ideal.

Parameters:
  • ds (xarray.DataSet) – The dataset to write to disk.

  • file_prefix (str) – The file prefix, relative to the current working directory, for the written files.

Return type:

None
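
Example

A minimal sketch, writing the dataset components as CSV files using a hypothetical "expa" prefix in the current working directory:

>>> from phippery.utils import to_wide_csv
>>> to_wide_csv(ds, "expa")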

phippery.utils.yield_tall(ds: Dataset)[source]

For each sample, yield a tall DataFrame.
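
Example

A minimal sketch; since one dataframe is yielded per sample, the number of yielded frames matches the number of samples in the dataset:

>>> from phippery.utils import yield_tall
>>> tall_dfs = list(yield_tall(ds))
>>> len(tall_dfs) == len(ds.sample_id)
True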