Overview of built-in datasets
[1]:
import mavenn
MAVE-NN comes with multiple built-in datasets for use in training or evaluating models. These datasets can be accessed by passing a dataset name to mavenn.load_example_dataset(). To get a list of valid dataset names, call this function without any arguments:
[2]:
mavenn.load_example_dataset()
Please enter a dataset name. Valid choices are:
"amyloid"
"gb1"
"mpsa"
"mpsa_replicate"
"nisthal"
"sortseq"
"tdp43"
Datasets are returned in the form of pandas dataframes. Common fields include:
'x': Assayed sequences, all of which are the same length.
'y': Values of continuous measurements (used to train GE models).
'ct_y': Read counts observed in bin number y, where y is an integer ranging from 0 to Y-1 (used to train MPA models).
'set': Indicates whether each observation was reserved for the 'training', 'validation', or 'test' set when inferring the corresponding example models provided with MAVE-NN.
Other fields are sometimes provided as well, e.g. the raw input and output counts used to compute measurement values.
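The fields described above can be put to work with ordinary pandas operations. The following is a minimal sketch, not part of MAVE-NN: the helper names split_by_set and count_columns are hypothetical, and toy_df is a made-up stand-in so that the example runs without MAVE-NN installed. It assumes the read-count columns are named 'ct_0', 'ct_1', and so on, following the 'ct_y' pattern described above.

```python
import pandas as pd

# Toy stand-in for a MAVE-NN dataset; all values are made up for illustration.
toy_df = pd.DataFrame({
    'set': ['training', 'training', 'validation', 'test'],
    'ct_0': [12, 0, 3, 7],
    'ct_1': [5, 9, 1, 2],
    'x': ['ACGT', 'ACGA', 'TCGT', 'ACCT'],
})


def split_by_set(data_df):
    """Split a dataframe into training, validation, and test frames
    according to the values in its 'set' column (hypothetical helper)."""
    return {
        name: data_df[data_df['set'] == name].reset_index(drop=True)
        for name in ('training', 'validation', 'test')
    }


def count_columns(data_df):
    """Return the names of columns of the form 'ct_<integer>'
    (hypothetical helper; assumes the 'ct_y' naming pattern)."""
    return [c for c in data_df.columns
            if c.startswith('ct_') and c[3:].isdigit()]


splits = split_by_set(toy_df)

# Total number of reads observed for each sequence, summed over all bins.
ct_cols = count_columns(toy_df)
total_counts = toy_df[ct_cols].sum(axis=1)
```

With MAVE-NN installed, the same helpers should apply to a real dataset, for example split_by_set(mavenn.load_example_dataset('gb1')), provided that dataset includes a 'set' column.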