Utils
Utilities for building, indexing, and manipulating the xarray dataset topology specific to most phippery functions provided in this package.
- phippery.utils.add_enrichment_layer_from_array(ds, enrichment, new_table_name=None, inplace=True)[source]
Append an enrichment layer to the dataset.
- Parameters:
ds (xarray.DataSet) – The phippery dataset to append to.
enrichment (np.array) – The enrichment matrix to append to the phippery dataset. The number of rows should be the same length as ds.peptide_id and the number of columns should be the same length as ds.sample_id
new_table_name (str) – What you would like to name the enrichment layer.
inplace (bool) – Determines whether to modify the passed dataset, or return an augmented copy
- Returns:
An augmented copy of the phippery dataset is returned if inplace is False; otherwise the passed dataset is modified in place and None is returned.
- Return type:
None | xarray.DataSet
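As a rough illustration of the shape requirement, the sketch below builds a counts-per-million matrix with NumPy for a toy counts matrix (peptides as rows, samples as columns). The toy values and the layer name "cpm" are assumptions made for illustration; the commented call at the end shows how such an array might then be passed to this function.

```python
import numpy as np

# Toy counts matrix: 3 peptides (rows) x 2 samples (columns). Values are made up.
counts = np.array([[10, 0],
                   [30, 5],
                   [60, 15]])

# Counts per million: scale each sample (column) so that it sums to 1e6.
cpm = counts / counts.sum(axis=0) * 1e6

# A new layer needs the same (n_peptides, n_samples) shape as the counts.
assert cpm.shape == counts.shape

# With a phippery dataset `ds` holding these counts, the layer could be added with:
# add_enrichment_layer_from_array(ds, cpm, new_table_name="cpm")
```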
- phippery.utils.collapse_groups(ds, by, collapse_dim='sample', agg_func=<function mean>, compute_pw_cc=False, **kwargs)[source]
Collapse an xarray dataset along one of the annotation axes by applying agg_func to the annotation groups defined by ‘by’.
- Parameters:
ds (xarray.DataSet) – The phippery dataset to collapse.
by (list) – The names of the annotation features you would like to collapse by.
collapse_dim (str) – The dimension you’d like to collapse: “sample” or “peptide”.
compute_pw_cc (bool) – Whether or not to compute the mean pairwise correlation of all values within any feature group that is being collapsed.
agg_func (function) – This function must take a one-dimensional array and aggregate all values to a single number, agg_func(list[float | int]) -> float | int
- Returns:
The collapsed phippery dataset.
- Return type:
xarray.DataSet
Example
>>> get_annotation_table(ds, dim="sample")
sample_metadata  fastq_filename reference seq_dir sample_type
sample_id
0                sample_0.fastq      refa    expa  beads_only
1                sample_1.fastq      refa    expa  beads_only
2                sample_2.fastq      refa    expa     library
3                sample_3.fastq      refa    expa     library
4                sample_4.fastq      refa    expa          IP
5                sample_5.fastq      refa    expa          IP
>>> ds["counts"].to_pandas()
sample_id   0  1  2  3  4  5
peptide_id
0           7  0  3  2  3  2
1           6  3  1  0  7  5
2           9  1  7  8  4  7
>>> mean_sample_type_ds = collapse_groups(ds, by=["sample_type"])
>>> get_annotation_table(mean_sample_type_ds, dim="sample")
sample_metadata reference seq_dir sample_type
sample_id
0                    refa    expa          IP
1                    refa    expa  beads_only
2                    refa    expa     library
>>> mean_sample_type_ds["counts"].to_pandas()
sample_id     0    1    2
peptide_id
0           2.5  3.5  2.5
1           6.0  4.5  0.5
2           5.5  5.0  7.5
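To make the aggregation in the example concrete, here is a minimal pandas-only sketch that reproduces the collapsed means from the counts shown above. It is not phippery's implementation: it skips the xarray dataset entirely and assumes the default mean aggregation over sample groups.

```python
import pandas as pd

# Counts from the example above: peptides as rows, samples as columns.
counts = pd.DataFrame(
    [[7, 0, 3, 2, 3, 2],
     [6, 3, 1, 0, 7, 5],
     [9, 1, 7, 8, 4, 7]],
    index=pd.Index([0, 1, 2], name="peptide_id"),
    columns=pd.Index(range(6), name="sample_id"),
)
# The annotation used for grouping, indexed by sample_id.
sample_type = pd.Series(
    ["beads_only", "beads_only", "library", "library", "IP", "IP"],
    index=counts.columns,
    name="sample_type",
)

# Group the samples (columns) by their annotation and average within each group.
collapsed = counts.T.groupby(sample_type).mean().T
print(collapsed)
```

The groups come out in sorted label order (IP, beads_only, library), which matches the order of the collapsed sample table in the example.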
- phippery.utils.collect_counts(counts)[source]
Merge the TSV files produced by individual sample alignments into a single counts matrix.
- Parameters:
counts (list[str]) –
A list of filepaths, relative to the current working directory, to read in. Each filepath should point to a tab-separated file for one sample that contains two columns (without headers):
peptide ids - the integer peptide identifiers
enrichments - the respective enrichments for any peptide id
- Returns:
The merged enrichments with peptides as the index, and filenames as column names.
- Return type:
pd.DataFrame
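The file layout described above can be sketched with pandas alone. The snippet writes two toy per-sample files to a temporary directory and merges them column-wise; the file names, the values, and the use of the base file name as column label are assumptions for illustration rather than a description of how collect_counts is implemented.

```python
import os
import tempfile

import pandas as pd

# Write two toy per-sample TSV files: peptide id, enrichment (no header row).
tmpdir = tempfile.mkdtemp()
paths = []
for name, rows in [("sample_0.tsv", [(0, 7), (1, 6)]),
                   ("sample_1.tsv", [(0, 0), (1, 3)])]:
    path = os.path.join(tmpdir, name)
    pd.DataFrame(rows).to_csv(path, sep="\t", header=False, index=False)
    paths.append(path)

# Read each file as a single column indexed by peptide id.
columns = []
for p in paths:
    col = pd.read_csv(
        p, sep="\t", header=None,
        names=["peptide_id", "enrichment"], index_col="peptide_id",
    )["enrichment"]
    columns.append(col.rename(os.path.basename(p)))

# Join the per-sample columns on the shared peptide index.
merged = pd.concat(columns, axis=1)
```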
- phippery.utils.dataset_from_csv(peptide_table_filename, sample_table_filename, counts_table_filename)[source]
Load a dataset from individual comma separated files containing the counts matrix, as well as sample and peptide annotation tables.
Note
This is the inverse operation of the to_wide_csv() utility function. Generally speaking, these functions are used for long-term storage in common formats when pickle-dumped binaries are not ideal. For now, this function supports only a single enrichment table, which is added to the dataset under the variable name “counts”. If you would like to add other transformations of the enrichment table (e.g. cpm, mlxp, etc.), you can load the CSVs via pandas and add them to the dataset using the add_enrichment_layer_from_array function.
- Parameters:
counts_table_filename (str) – The glob filepath to the csv file(s) containing the enrichments. In every file, the first column should hold indices that match the index column of the given peptide table, and the first row should hold column headers that match the index of the sample table.
peptide_table_filename (str) – The relative filepath to the peptide annotation table.
sample_table_filename (str) – The relative filepath to the sample annotation table.
- Returns:
The combined tables in a phippery dataset.
- Return type:
xarray.DataSet
- phippery.utils.ds_query(ds, query, dim='sample')[source]
Apply a sample or peptide query statement to the entire dataset.
Note
For more on pandas queries, see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
- Parameters:
ds (xarray.DataSet) – The dataset you would like to query.
query (str) – pandas query expression
dim (str) – The dimension to apply the expression to.
- Returns:
A reference to the dataset slice selected by the given expression.
- Return type:
xarray.DataSet
Example
>>> phippery.get_annotation_table(ds, "peptide")
peptide_metadata Oligo   virus
peptide_id
0                 ATCG    zika
1                 ATCG    zika
2                 ATCG    zika
3                 ATCG    zika
4                 ATCG  dengue
5                 ATCG  dengue
6                 ATCG  dengue
7                 ATCG  dengue
>>> zka_ds = ds_query(ds, "virus == 'zika'", dim="peptide")
>>> zka_ds["counts"].to_pandas()
sample_id     0    1    2    3    4    5    6    7    8    9
peptide_id
0           110  829  872  475  716  815  308  647  216  791
1           604  987  776  923  858  985  396  539   32  600
2           865  161  413  760  422  297  639  786  857  878
3           992  354  825  535  440  416  572  988  763  841
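The query semantics can be mimicked with plain pandas: evaluate the expression on the annotation table, then keep the matching rows of the counts. This is only a sketch of the idea with made-up counts, not phippery's own code, and it works on data frames rather than an xarray dataset.

```python
import numpy as np
import pandas as pd

# Peptide annotation table matching the example above.
peptide_table = pd.DataFrame(
    {"Oligo": ["ATCG"] * 8, "virus": ["zika"] * 4 + ["dengue"] * 4},
    index=pd.Index(range(8), name="peptide_id"),
)
# Toy counts (8 peptides x 3 samples); the values are arbitrary.
counts = pd.DataFrame(
    np.arange(24).reshape(8, 3),
    index=peptide_table.index,
    columns=pd.Index(range(3), name="sample_id"),
)

# Evaluate the query on the annotation table, then keep the matching rows.
zika_ids = peptide_table.query("virus == 'zika'").index
zika_counts = counts.loc[zika_ids]
```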
- phippery.utils.dump(ds, path)[source]
A simple wrapper for dumping xarray datasets to a pickle binary.
- Parameters:
ds (xarray.DataSet) – The dataset you would like to write to disk.
path (str) – The relative path you would like to write to.
- Return type:
None
- phippery.utils.get_annotation_table(ds, dim='sample')[source]
Return a copy of the requested annotation table (sample or peptide) after converting all of its data types by applying the pandas NaN heuristic.
- Parameters:
ds (xarray.DataSet) – The dataset to extract an annotation from.
dim (str) – The annotation table to grab: “sample” or “peptide”.
- Returns:
The annotation table.
- Return type:
pd.DataFrame
- phippery.utils.id_coordinate_from_query_df(ds, query_df)[source]
Given a dataframe with pandas query statements for both samples and peptides, return the relevant sample and peptide ids after applying the logical AND of all queries.
- Parameters:
ds (xarray.DataSet) – The dataset you would like to query.
query_df (pd.DataFrame) –
A dataframe that must have two columns (including headers):
“dimension” - either “sample” or “peptide” to specify the expression dimension
“expression” - The pandas query expression to apply.
- Returns:
tuple – A tuple of sample ids and peptide ids.
- Return type:
list, list
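The logical AND of several expressions can be sketched in pandas by parenthesising each expression and joining them with "&". The helper name ids_for and the toy tables below are hypothetical and exist only to illustrate the query_df layout described above; phippery may combine the queries differently.

```python
import pandas as pd

sample_table = pd.DataFrame(
    {"sample_type": ["beads_only", "library", "IP", "IP"],
     "reference": ["refa", "refa", "refa", "refb"]},
    index=pd.Index(range(4), name="sample_id"),
)
peptide_table = pd.DataFrame(
    {"virus": ["zika", "zika", "dengue"]},
    index=pd.Index(range(3), name="peptide_id"),
)
# One row per query: which dimension it targets and the expression itself.
query_df = pd.DataFrame(
    {"dimension": ["sample", "sample", "peptide"],
     "expression": ["sample_type == 'IP'", "reference == 'refa'", "virus == 'zika'"]}
)


def ids_for(table, expressions):
    """Hypothetical helper: ids of rows satisfying every expression."""
    if len(expressions) == 0:
        return list(table.index)
    # Joining parenthesised expressions with "&" takes their logical AND.
    combined = " & ".join(f"({e})" for e in expressions)
    return list(table.query(combined).index)


is_sample = query_df["dimension"] == "sample"
sample_ids = ids_for(sample_table, query_df.loc[is_sample, "expression"])
peptide_ids = ids_for(peptide_table, query_df.loc[~is_sample, "expression"])
```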
- phippery.utils.id_query(ds, query, dim='sample')[source]
Apply a sample or peptide query statement to the entire dataset and retrieve the respective indices.
Note
For more on pandas queries, see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
- Parameters:
ds (xarray.DataSet) – The dataset you would like to query.
query (str) – pandas query expression
dim (str) – The dimension to apply the expression to: “sample” or “peptide”.
- Returns:
The list of integer identifiers that apply to a given expression for the respective dimension.
- Return type:
list[int]
- phippery.utils.iter_groups(ds, by, dim='sample')[source]
This function returns an iterator yielding subsets of the provided dataset, grouped by items in the metadata of either of the dimensions specified.
- Parameters:
ds (xarray.DataSet) – The dataset to iterate over.
by (str) – The name of the annotation feature to group by.
dim (str) – The dimension to group along: “sample” or “peptide”.
- Returns:
Returns subsets of the original dataset sliced by either sample or peptide table groups.
- Return type:
generator
Example
>>> phippery.get_annotation_table(ds, "sample")
sample_metadata  fastq_filename reference seq_dir sample_type
sample_id
0                sample_0.fastq      refa    expa  beads_only
1                sample_1.fastq      refa    expa  beads_only
2                sample_2.fastq      refa    expa     library
3                sample_3.fastq      refa    expa     library
>>> ds["counts"].values
array([[458, 204, 897, 419],
       [599, 292, 436, 186],
       [ 75,  90, 978, 471],
       [872,  33, 108, 505],
       [206, 107, 981, 208]])
>>> sample_groups = iter_groups(ds, by="sample_type")
>>> for group, phip_dataset in sample_groups:
...     print(group)
...     print(phip_dataset["counts"].values)
...
beads_only
[[458 204]
 [599 292]
 [ 75  90]
 [872  33]
 [206 107]]
library
[[897 419]
 [436 186]
 [978 471]
 [108 505]
 [981 208]]
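The grouping in the example can be reproduced with a small generator over a NumPy counts matrix and a pandas annotation column. The function name iter_count_groups is hypothetical, and the sketch assumes that the sample ids 0..N-1 coincide with the column positions of the counts matrix.

```python
import numpy as np
import pandas as pd

# The grouping annotation from the example above, indexed by sample_id.
sample_type = pd.Series(
    ["beads_only", "beads_only", "library", "library"],
    index=pd.Index(range(4), name="sample_id"),
    name="sample_type",
)
counts = np.array([[458, 204, 897, 419],
                   [599, 292, 436, 186],
                   [75, 90, 978, 471],
                   [872, 33, 108, 505],
                   [206, 107, 981, 208]])


def iter_count_groups(counts, labels):
    """Hypothetical helper: yield (label, counts columns) per annotation value."""
    for group, members in labels.groupby(labels):
        # Assumes sample ids double as column positions in `counts`.
        yield group, counts[:, members.index.to_numpy()]


groups = dict(iter_count_groups(counts, sample_type))
```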
- phippery.utils.load(path)[source]
A simple wrapper for loading xarray datasets from a pickle binary.
- Parameters:
path (str) – Relative path of binary phippery dataset
- Returns:
phippery dataset
- Return type:
xarray.DataSet
- phippery.utils.stitch_dataset(counts, peptide_table, sample_table)[source]
Build a phippery xarray dataset from individual tables.
Note
If the counts matrix that you’re passing has the shape (M x N) for M peptides and N samples, the sample table should have length N and the peptide table should have length M.
- Parameters:
counts (numpy.ndarray) – The counts matrix for sample peptide enrichments.
sample_table (pd.DataFrame) – The sample annotations corresponding to the columns of the counts matrix.
peptide_table (pd.DataFrame) – The peptide annotations corresponding to the rows of the counts matrix.
- Returns:
The formatted phippery xarray dataset used by most of the phippery functionality.
- Return type:
xarray.DataSet
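Before stitching, it can help to confirm that the three inputs agree in shape. The check below is a minimal sketch with toy tables; the helper name shapes_agree is hypothetical and not part of phippery.

```python
import numpy as np
import pandas as pd

counts = np.zeros((3, 2))  # M = 3 peptides (rows), N = 2 samples (columns)
peptide_table = pd.DataFrame({"virus": ["zika", "zika", "dengue"]})
sample_table = pd.DataFrame({"sample_type": ["IP", "library"]})


def shapes_agree(counts, peptide_table, sample_table):
    """Hypothetical helper: True if counts is (n_peptides, n_samples)."""
    # Rows of counts pair with peptides; columns pair with samples.
    return counts.shape == (len(peptide_table), len(sample_table))


ok = shapes_agree(counts, peptide_table, sample_table)
```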
- phippery.utils.to_tall(ds: Dataset)[source]
Melt a phippery xarray dataset into a single long-formatted dataframe that has a unique sample peptide interaction on each row. Ideal for ggplotting.
- Parameters:
ds (xarray.DataSet) – The dataset to melt.
- Returns:
The tall formatted dataset.
- Return type:
pd.DataFrame
Example
>>> ds["counts"].to_pandas()
sample_id     0    1
peptide_id
0           453  393
1           456  532
2           609  145
>>> to_tall(ds)[["sample_id", "peptide_id", "counts"]]
   sample_id  peptide_id  counts
0          0           0     453
1          0           1     456
2          0           2     609
3          1           0     393
4          1           1     532
5          1           2     145
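The wide-to-tall reshaping of the counts in the example can be sketched with pandas alone. This covers only the counts layer and is not how to_tall itself is implemented.

```python
import pandas as pd

# Wide counts from the example above: peptides as rows, samples as columns.
counts = pd.DataFrame(
    [[453, 393], [456, 532], [609, 145]],
    index=pd.Index([0, 1, 2], name="peptide_id"),
    columns=pd.Index([0, 1], name="sample_id"),
)

# Unstacking a wide frame yields one value per (sample_id, peptide_id) pair;
# resetting the index turns both identifiers into ordinary columns.
tall = counts.unstack().rename("counts").reset_index()
```

Each row of the result is a single sample-peptide interaction, the layout that plotting libraries built around long-format data expect.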
- phippery.utils.to_wide(ds)[source]
Take a phippery dataset and split it into its separate components in a dictionary.
- Parameters:
ds (xarray.DataSet) – The dataset to separate.
- Returns:
The dictionary of annotation tables and enrichment matrices.
- Return type:
dict
- phippery.utils.to_wide_csv(ds, file_prefix)[source]
Take a phippery dataset, split it into its separate components, and write each into a comma-separated file.
Note
This is the inverse operation of the dataset_from_csv() utility function. Generally speaking, these functions are used for long-term storage in common formats when pickle-dumped binaries are not ideal.
- Parameters:
ds (xarray.DataSet) – The dataset to write to disk.
file_prefix (str) – The file prefix relative to the current working directory where the files should be written.
- Return type:
None