torchdms.data

Tools for handling data.

Functions

check_onehot_encoding

Asserts that the tensor onehot encoding we have in the Datasets is the same as one that we make ourselves from the strings.

expand_substitutions_into_df

Expand a Series of substitutions into a dataframe showing the wt_AA, the site, and the mut_AA.

explode_binarymap_dataframe

Make a dataframe that has one row for each mutation of every mutated variant, showing the wt_AA, the site, and the mut_AA.

partition

Partition the data as needed and build a SplitDataframe.

prep_by_stratum_and_export

Print number of training examples per stratum and test samples, run prepare(), and export to .pkl file with descriptive filename.

summarize_binarymap_dataframe

Classes

BinaryMapDataset

Binarymap dataset.

SplitDataframe

Dataframes for each of test, validation, and train.

SplitDataset

BinaryMapDatasets for each of test, validation, and train.

class torchdms.data.BinaryMapDataset(samples, targets, original_df, wtseq, target_names, alphabet)[source]

Binarymap dataset.

This class organizes the information from the input dataset into a wrapper containing all relevant attributes for training and evaluation.

We also store the original dataframe as it may contain important metadata (such as target variance), but drop redundant columns that are already attributes.

__init__(samples, targets, original_df, wtseq, target_names, alphabet)[source]

target_extrema()[source]

Return a (min, max) tuple for the value of each target.
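As a rough illustration of what a per-target (min, max) summary looks like, the sketch below computes one for a plain list of rows. The helper name `target_extrema_sketch` and the list-of-lists input are assumptions made for this example only; the actual method works on the targets stored in the dataset.

```python
def target_extrema_sketch(targets):
    """Return one (min, max) tuple per target column.

    `targets` is assumed to be a list of rows, one value per target.
    """
    # Transpose rows into columns so each target is summarized separately.
    columns = list(zip(*targets))
    return [(min(column), max(column)) for column in columns]


# Three samples, two targets each (made-up values).
rows = [[0.5, -1.0], [1.5, 2.0], [-0.25, 0.0]]
extrema = target_extrema_sketch(rows)
```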

class torchdms.data.SplitDataframe(*, test_data, val_data, train_data_list)[source]

Dataframes for each of test, validation, and train.

Train is partitioned into a list of dataframes according to the number of mutations.

__init__(*, test_data, val_data, train_data_list)[source]

class torchdms.data.SplitDataset(*, test_data, val_data, train_data_list, description_string)[source]

BinaryMapDatasets for each of test, validation, and train.

Train is partitioned into a list of BinaryMapDatasets according to the number of mutations.

__init__(*, test_data, val_data, train_data_list, description_string)[source]

property labeled_splits

Returns an iterator on (label, split) pairs.
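One way such a property could be written is sketched below. The class `SplitSketch` is a hypothetical stand-in, and both the label strings and the choice to yield each training stratum separately are assumptions for illustration, not the library's documented behavior.

```python
from dataclasses import dataclass
from typing import Any, Iterator, List, Tuple


@dataclass
class SplitSketch:
    """Hypothetical container holding test, validation, and train splits."""

    test_data: Any
    val_data: Any
    train_data_list: List[Any]

    @property
    def labeled_splits(self) -> Iterator[Tuple[str, Any]]:
        # Label strings are assumptions made for this sketch.
        yield "test", self.test_data
        yield "val", self.val_data
        for index, train_split in enumerate(self.train_data_list):
            yield f"train_{index}", train_split


split = SplitSketch(test_data=["t"], val_data=["v"], train_data_list=[["a"], ["b"]])
labels = [label for label, _ in split.labeled_splits]
```

Iterating over pairs like this lets evaluation code treat every split uniformly, for example when plotting or reporting one metric per split.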

torchdms.data.expand_substitutions_into_df(substitution_series)[source]

Expand a Series of substitutions into a dataframe showing the wt_AA, the site, and the mut_AA.
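The parsing step behind this expansion can be sketched in plain Python. The example assumes each substitution is written as a wildtype residue, a site number, and a mutant residue (for example `A12G`); that token format, the helper name `parse_substitution`, and the dict output in place of a dataframe are all assumptions of this sketch.

```python
import re

# Assumed token format: wildtype residue, site number, mutant residue.
SUBSTITUTION_PATTERN = re.compile(
    r"^(?P<wt_AA>[A-Z*])(?P<site>\d+)(?P<mut_AA>[A-Z*])$"
)


def parse_substitution(token):
    """Split one substitution token into wt_AA, site, and mut_AA."""
    match = SUBSTITUTION_PATTERN.match(token)
    if match is None:
        raise ValueError(f"Unrecognized substitution: {token!r}")
    return {
        "wt_AA": match["wt_AA"],
        "site": int(match["site"]),
        "mut_AA": match["mut_AA"],
    }


parsed = [parse_substitution(token) for token in ["A12G", "M1*"]]
```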

torchdms.data.explode_binarymap_dataframe(in_df)[source]

Make a dataframe that has one row for each mutation of every mutated variant, showing the wt_AA, the site, and the mut_AA.

Other information is duplicated as needed.
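The "one row per mutation, with other information duplicated" shape can be illustrated without pandas. The sketch below assumes each variant row stores its mutations as a space-separated string under a key named `aa_substitutions`; the key name, the token format, and the helper `explode_variants` are assumptions for illustration.

```python
def explode_variants(rows, substitutions_key="aa_substitutions"):
    """Return one row per mutation, copying the variant's other fields."""
    exploded = []
    for row in rows:
        # A wildtype row (empty string) has no mutations and yields no rows.
        for token in row[substitutions_key].split():
            new_row = dict(row)  # duplicate the variant-level information
            new_row["wt_AA"] = token[0]
            new_row["site"] = int(token[1:-1])
            new_row["mut_AA"] = token[-1]
            exploded.append(new_row)
    return exploded


rows = [
    {"aa_substitutions": "A1C G2H", "func_score": 0.5},
    {"aa_substitutions": "", "func_score": 0.0},
    {"aa_substitutions": "A1D", "func_score": -1.2},
]
exploded = explode_variants(rows)
```

In this example the double mutant contributes two rows that share the same `func_score`, and the wildtype row contributes none.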

torchdms.data.check_onehot_encoding(dataset)[source]

Asserts that the tensor onehot encoding we have in the Datasets is the same as one that we make ourselves from the strings.
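The idea of such a consistency check is to rebuild the encoding from the sequence string and compare it with the stored one. The sketch below assumes a flattened layout with one block of `len(alphabet)` entries per site, and uses plain lists instead of tensors; the layout, the list representation, and both helper names are assumptions of this example.

```python
def onehot_from_string(sequence, alphabet):
    """Build a flattened one-hot encoding: one alphabet-sized block per site."""
    encoding = [0] * (len(sequence) * len(alphabet))
    for site_index, residue in enumerate(sequence):
        encoding[site_index * len(alphabet) + alphabet.index(residue)] = 1
    return encoding


def check_encoding_sketch(stored_encoding, sequence, alphabet):
    """Assert that a stored encoding matches one rebuilt from the string."""
    expected = onehot_from_string(sequence, alphabet)
    assert list(stored_encoding) == expected, "stored encoding disagrees with the string"


alphabet = "ACG"
stored = [1, 0, 0, 0, 0, 1]  # "A" at the first site, "G" at the second
check_encoding_sketch(stored, "AG", alphabet)
```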

torchdms.data.partition(aa_func_scores, per_stratum_variants_for_test, skip_stratum_if_count_is_smaller_than, export_dataframe, partition_label, train_on_all_single_mutants=False)[source]

Partition the data as needed and build a SplitDataframe.

A “stratum” is a slice of the data with a given number of mutations. We group training data sets into strata based on their number of mutations so that the data is presented to the neural network with an even proportion of each.

Furthermore, we group data rows by unique variants and then split on those grouped items so that we don’t have the same variant showing up in train and test.
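The two ideas above, stratifying by mutation count and splitting on unique variants, can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: it omits the validation split, works on a list of dicts with an assumed `aa_substitutions` key, assumes that small strata are skipped based on their number of unique variants, and adds a `seed` argument that is not part of the documented signature.

```python
import random
from collections import defaultdict


def partition_sketch(rows, per_stratum_variants_for_test,
                     skip_stratum_if_count_is_smaller_than, seed=0):
    """Hold out whole variants per stratum; return (test_rows, train_by_stratum)."""
    rng = random.Random(seed)

    # Group rows by variant so replicates of one variant always stay together.
    rows_by_variant = defaultdict(list)
    for row in rows:
        rows_by_variant[row["aa_substitutions"]].append(row)

    # A stratum is the set of variants with a given number of mutations.
    variants_by_stratum = defaultdict(list)
    for variant in sorted(rows_by_variant):
        variants_by_stratum[len(variant.split())].append(variant)

    test_rows, train_rows_by_stratum = [], {}
    for n_mutations, variants in sorted(variants_by_stratum.items()):
        # Assumption: the threshold applies to the count of unique variants.
        if len(variants) < skip_stratum_if_count_is_smaller_than:
            continue
        n_test = min(per_stratum_variants_for_test, len(variants))
        held_out = set(rng.sample(variants, n_test))
        train_rows_by_stratum[n_mutations] = []
        for variant in variants:
            if variant in held_out:
                test_rows.extend(rows_by_variant[variant])
            else:
                train_rows_by_stratum[n_mutations].extend(rows_by_variant[variant])
    return test_rows, train_rows_by_stratum


rows = [
    {"aa_substitutions": "A1C", "func_score": 0.1},
    {"aa_substitutions": "A1C", "func_score": 0.2},  # replicate of the same variant
    {"aa_substitutions": "A1D", "func_score": -0.3},
    {"aa_substitutions": "A1C G2H", "func_score": 0.4},
    {"aa_substitutions": "A1D G2K", "func_score": 0.0},
    {"aa_substitutions": "A1C G2H T3S", "func_score": -1.0},
]
test_rows, train_rows_by_stratum = partition_sketch(rows, 1, 2)
```

Because the split is made on variants, not on rows, both replicates of `A1C` land on the same side of the split. The triple-mutant stratum holds a single variant and is skipped under the threshold of 2.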

torchdms.data.prep_by_stratum_and_export(split_df, wtseq, targets, out_prefix, description_string, partition_label)[source]

Print number of training examples per stratum and test samples, run prepare(), and export to .pkl file with descriptive filename.
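The reporting and export steps might look roughly like the sketch below. It leaves out the `prepare()` step entirely, and the filename pattern, the pickled dictionary layout, and the helper name `export_sketch` are hypothetical choices for this example; the library's actual naming and file contents may differ.

```python
import os
import pickle
import tempfile


def export_sketch(train_rows_by_stratum, test_rows, out_prefix, partition_label):
    """Print per-stratum counts, then pickle the split to a labeled file."""
    for n_mutations, stratum_rows in sorted(train_rows_by_stratum.items()):
        print(f"stratum {n_mutations}: {len(stratum_rows)} training examples")
    print(f"test: {len(test_rows)} samples")

    # Hypothetical filename pattern; the real naming scheme is not shown here.
    out_path = f"{out_prefix}_{partition_label}.pkl"
    with open(out_path, "wb") as handle:
        pickle.dump({"train": train_rows_by_stratum, "test": test_rows}, handle)
    return out_path


# Write into a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = export_sketch(
        {1: ["a", "b"], 2: ["c"]}, ["d"], os.path.join(tmp_dir, "demo"), "partition0"
    )
    with open(path, "rb") as handle:
        restored = pickle.load(handle)
```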