torchdms.data

Tools for handling data.

Functions

`check_onehot_encoding`	Asserts that the tensor onehot encoding we have in the Datasets is the same as one that we make ourselves from the strings.
`expand_substitutions_into_df`	Expand a Series of substitutions into a dataframe showing the wt_AA, the site, and the mut_AA.
`explode_binarymap_dataframe`	Make a dataframe that has one row for each mutation of every mutated variant, showing the wt_AA, the site, and the mut_AA.
`partition`	Partition the data as needed and build a SplitDataframe.
`prep_by_stratum_and_export`	Print number of training examples per stratum and test samples, run prepare(), and export to .pkl file with descriptive filename.
`summarize_binarymap_dataframe`

Classes

`BinaryMapDataset`	Binarymap dataset.
`SplitDataframe`	Dataframes for each of test, validation, and train.
`SplitDataset`	BinaryMapDatasets for each of test, validation, and train.

class torchdms.data.BinaryMapDataset(samples, targets, original_df, wtseq, target_names, alphabet)

Binarymap dataset.

This class organizes the information from the input dataset into a wrapper containing all relevant attributes for training and evaluation.

We also store the original dataframe as it may contain important metadata (such as target variance), but drop redundant columns that are already attributes.

__init__(samples, targets, original_df, wtseq, target_names, alphabet)

target_extrema(): Return a (min, max) tuple for the value of each target.
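As a rough illustration of what a per-target (min, max) computation looks like, here is a minimal sketch in plain Python. The function name `target_extrema_sketch` and the row-per-variant list layout are assumptions made for this example; the actual method operates on the dataset's own `targets` attribute and may return a different container.

```python
def target_extrema_sketch(targets):
    """Return one (min, max) tuple per target column.

    `targets` is assumed to be a list of rows, one row per variant
    and one entry per target.
    """
    # Transpose the rows so that each tuple holds all values of one target.
    columns = list(zip(*targets))
    return [(min(column), max(column)) for column in columns]


# Three variants measured against two targets (made-up numbers).
targets = [
    [0.5, -1.0],
    [1.5, 0.0],
    [-0.25, 2.0],
]
extrema = target_extrema_sketch(targets)
print(extrema)  # [(-0.25, 1.5), (-1.0, 2.0)]
```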

class torchdms.data.SplitDataframe(*, test_data, val_data, train_data_list)

Dataframes for each of test, validation, and train.

Train is partitioned into a list of dataframes according to the number of mutations.

__init__(*, test_data, val_data, train_data_list)

class torchdms.data.SplitDataset(*, test_data, val_data, train_data_list, description_string)

BinaryMapDatasets for each of test, validation, and train.

Train is partitioned into a list of BinaryMapDatasets according to the number of mutations.

__init__(*, test_data, val_data, train_data_list, description_string)

property labeled_splits: Returns an iterator on (label, split) pairs.
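The idea of a property that yields (label, split) pairs can be sketched with a small stand-in class. `SplitSketch` and the label strings (`"test"`, `"val"`, `"train_0"`, ...) are hypothetical and chosen only for illustration; the labels that `SplitDataset` actually produces are not documented here.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Tuple


@dataclass
class SplitSketch:
    """Hypothetical stand-in for a test/validation/train container."""

    test_data: Any
    val_data: Any
    train_data_list: List[Any] = field(default_factory=list)

    @property
    def labeled_splits(self) -> Iterator[Tuple[str, Any]]:
        """Yield (label, split) pairs; the label strings are assumptions."""
        yield "test", self.test_data
        yield "val", self.val_data
        # One entry per training stratum, numbered by list position.
        for index, train_data in enumerate(self.train_data_list):
            yield f"train_{index}", train_data


splits = SplitSketch(
    test_data=["t"], val_data=["v"], train_data_list=[["a"], ["b", "c"]]
)
labels = [label for label, _ in splits.labeled_splits]
print(labels)  # ['test', 'val', 'train_0', 'train_1']
```

Exposing the splits as labeled pairs lets calling code loop over every split uniformly, for example to evaluate a model on each one, without special-casing the list of training strata.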

torchdms.data.expand_substitutions_into_df(substitution_series): Expand a Series of substitutions into a dataframe showing the wt_AA, the site, and the mut_AA.
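The expansion can be sketched with pandas string extraction. This sketch assumes each substitution is written as wild-type residue, 1-based site, mutant residue (for example `A12G`); the helper name `expand_substitutions_sketch` and the regular expression are illustrative and not taken from the library.

```python
import pandas as pd


def expand_substitutions_sketch(substitution_series):
    """Split strings such as 'A12G' into wt_AA, site, and mut_AA columns."""
    # Assumed format: one letter (or '*'), an integer site, one letter (or '*').
    pattern = r"^(?P<wt_AA>[A-Z*])(?P<site>\d+)(?P<mut_AA>[A-Z*])$"
    expanded = substitution_series.str.extract(pattern)
    # Strings that do not match the pattern become NaN and make this cast fail.
    expanded["site"] = expanded["site"].astype(int)
    return expanded


subs = pd.Series(["A12G", "M1K", "T105*"])
expanded = expand_substitutions_sketch(subs)
print(expanded)
```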

torchdms.data.explode_binarymap_dataframe(in_df)

Make a dataframe that has one row for each mutation of every mutated variant, showing the wt_AA, the site, and the mut_AA.

Other information is duplicated as needed.
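A minimal sketch of this "one row per mutation" reshaping, using `DataFrame.explode`. The column names `aa_substitutions` and `func_score`, the space-separated substitution format, and the helper name `explode_variants_sketch` are assumptions made for the example.

```python
import pandas as pd


def explode_variants_sketch(in_df, column="aa_substitutions"):
    """Return one row per substitution, duplicating the other columns."""
    out = in_df.copy()
    # 'A12G M30K' -> ['A12G', 'M30K']; an empty string becomes an empty list.
    out["substitution"] = out[column].str.split()
    # Empty lists explode to NaN, so unmutated variants are dropped here.
    out = (
        out.explode("substitution")
        .dropna(subset=["substitution"])
        .reset_index(drop=True)
    )
    pattern = r"^(?P<wt_AA>[A-Z*])(?P<site>\d+)(?P<mut_AA>[A-Z*])$"
    parts = out["substitution"].str.extract(pattern)
    parts["site"] = parts["site"].astype(int)
    return pd.concat([out.drop(columns=["substitution"]), parts], axis=1)


df = pd.DataFrame(
    {
        "aa_substitutions": ["A12G", "A12G M30K", ""],
        "func_score": [0.1, -0.4, 0.0],
    }
)
exploded = explode_variants_sketch(df)
print(exploded)
```

The double mutant contributes two rows that share its `func_score`, which is the sense in which other information is duplicated as needed.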

torchdms.data.check_onehot_encoding(dataset): Asserts that the tensor onehot encoding we have in the Datasets is the same as one that we make ourselves from the strings.
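The check amounts to rebuilding each encoding from its string form and comparing it with the stored one. The sketch below uses plain lists instead of tensors and assumes a flattened, site-major layout with one slot per (site, alphabet letter) pair; that layout and both function names are assumptions for illustration, not the library's confirmed encoding.

```python
def onehot_from_substitutions(wtseq, substitutions, alphabet):
    """Build a flat one-hot vector (site-major) for one variant."""
    residues = list(wtseq)
    for sub in substitutions.split():
        wt_aa, site, mut_aa = sub[0], int(sub[1:-1]), sub[-1]
        if residues[site - 1] != wt_aa:
            raise ValueError(f"{sub!r} does not match the wild-type sequence")
        residues[site - 1] = mut_aa
    index = {aa: i for i, aa in enumerate(alphabet)}
    encoding = [0] * (len(residues) * len(alphabet))
    for position, aa in enumerate(residues):
        encoding[position * len(alphabet) + index[aa]] = 1
    return encoding


def check_onehot_sketch(stored_encodings, wtseq, substitution_strings, alphabet):
    """Assert that every stored encoding matches one rebuilt from its string."""
    for stored, subs in zip(stored_encodings, substitution_strings):
        expected = onehot_from_substitutions(wtseq, subs, alphabet)
        assert list(stored) == expected, f"one-hot mismatch for variant {subs!r}"


wild_type = onehot_from_substitutions("AC", "", "ACG")
mutant = onehot_from_substitutions("AC", "A1G", "ACG")
print(wild_type)  # [1, 0, 0, 0, 1, 0]
print(mutant)     # [0, 0, 1, 0, 1, 0]

# Passes silently because the stored vectors agree with the rebuilt ones.
check_onehot_sketch([wild_type, mutant], "AC", ["", "A1G"], "ACG")
```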

torchdms.data.partition(aa_func_scores, per_stratum_variants_for_test, skip_stratum_if_count_is_smaller_than, export_dataframe, partition_label, train_on_all_single_mutants=False)

Partition the data as needed and build a SplitDataframe.

A “stratum” is a slice of the data with a given number of mutations. We group training data sets into strata based on their number of mutations so that the data is presented to the neural network with an even proportion of each stratum.

Furthermore, we group data rows by unique variants and then split on those grouped items so that the same variant never appears in both train and test.
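The two ideas above, stratifying by mutation count and splitting on unique variants rather than on rows, can be sketched as follows. This is a simplified illustration, not the library's implementation: it omits the validation split and the export options, `partition_sketch` and `seed` are hypothetical names, and the column names `aa_substitutions` and `func_score` are assumed.

```python
import pandas as pd


def partition_sketch(df, per_stratum_variants_for_test,
                     skip_stratum_if_count_is_smaller_than, seed=0):
    """Return (test_df, list of per-stratum training dataframes)."""
    df = df.copy()
    # A stratum is identified by the number of substitutions in the variant.
    df["n_subs"] = df["aa_substitutions"].str.split().str.len()
    test_parts, train_parts = [], []
    for _, stratum in df.groupby("n_subs"):
        # Sample whole variants so replicate rows of a variant stay together.
        variants = stratum["aa_substitutions"].drop_duplicates()
        if len(variants) < skip_stratum_if_count_is_smaller_than:
            continue
        n_test = min(per_stratum_variants_for_test, len(variants))
        test_variants = variants.sample(n=n_test, random_state=seed)
        in_test = stratum["aa_substitutions"].isin(test_variants)
        test_parts.append(stratum[in_test])
        train_parts.append(stratum[~in_test])
    test_df = pd.concat(test_parts) if test_parts else df.iloc[0:0]
    return test_df, train_parts


scores = pd.DataFrame(
    {
        "aa_substitutions": ["A1G", "A1G", "C2T", "G3A", "T4C", "A1G C2T"],
        "func_score": [0.1, 0.2, -0.3, 0.0, 0.5, -1.0],
    }
)
test_df, train_parts = partition_sketch(scores, 1, 3)
test_variants = set(test_df["aa_substitutions"])
train_variants = set(train_parts[0]["aa_substitutions"])
print(test_variants, train_variants)
```

In this toy input the double-mutant stratum holds a single variant, which is below the threshold of 3, so it is skipped entirely. The two replicate rows of `A1G` always land on the same side of the split.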

torchdms.data.prep_by_stratum_and_export(split_df, wtseq, targets, out_prefix, description_string, partition_label): Print number of training examples per stratum and test samples, run prepare(), and export to .pkl file with descriptive filename.
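The report-then-export pattern can be sketched with the standard `pickle` module. The helper `export_sketch`, the `{out_prefix}_{partition_label}.pkl` naming scheme, and the dictionary that gets pickled are all hypothetical; the real function also runs `prepare()`, which is not reproduced here.

```python
import os
import pickle
import tempfile


def export_sketch(train_data_list, test_data, out_prefix, partition_label):
    """Print split sizes, pickle the splits, and return the output path."""
    for index, train_data in enumerate(train_data_list):
        print(f"training examples in stratum {index}: {len(train_data)}")
    print(f"test samples: {len(test_data)}")
    # Hypothetical naming scheme; the library's actual filename may differ.
    out_path = f"{out_prefix}_{partition_label}.pkl"
    with open(out_path, "wb") as handle:
        pickle.dump({"train": train_data_list, "test": test_data}, handle)
    return out_path


with tempfile.TemporaryDirectory() as tmpdir:
    path = export_sketch(
        [[1, 2, 3], [4, 5]], [6], os.path.join(tmpdir, "demo"), "run1"
    )
    # Read the file back to confirm that the round trip preserves the data.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)
```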