Overview of built-in datasets
MAVE-NN comes with multiple built-in datasets for use in training or evaluating models. These datasets can be accessed by passing a dataset name to mavenn.load_example_dataset(). To get a list of valid dataset names, execute this command without any arguments:
Please enter a dataset name. Valid choices are: "amyloid" "gb1" "mpsa" "mpsa_replicate" "nisthal" "sortseq" "tdp43"
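As a minimal sketch, assuming MAVE-NN is installed and importable as mavenn, the helper below checks a name against the list printed above before loading the dataset. The helper load_builtin_dataset and the constant VALID_DATASETS are hypothetical names introduced for illustration; they are not part of MAVE-NN.

```python
# Dataset names copied from the message printed above.
VALID_DATASETS = ("amyloid", "gb1", "mpsa", "mpsa_replicate",
                  "nisthal", "sortseq", "tdp43")


def load_builtin_dataset(name):
    """Hypothetical helper: validate `name`, then load it with MAVE-NN."""
    if name not in VALID_DATASETS:
        raise ValueError(
            f"Unknown dataset {name!r}; valid choices are {VALID_DATASETS}"
        )
    # Imported lazily so the name check works even without MAVE-NN;
    # the load itself assumes MAVE-NN is installed.
    import mavenn
    return mavenn.load_example_dataset(name)
```

With MAVE-NN available, load_builtin_dataset("gb1") should return the corresponding dataset, while a misspelled name raises a ValueError before MAVE-NN is called.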
Datasets are returned in the form of pandas dataframes. Common fields include:
'x': Assayed sequences, all of which are the same length.
'y': Values of continuous measurements (used to train GE models).
'ct_y': Read counts observed in bin number y, where y is an integer ranging from 0 to Y-1 (used to train MPA models).
'set': Indicates whether each observation was reserved for the 'training', 'validation', or 'test' set when inferring the corresponding example models provided with MAVE-NN.
Other fields are sometimes provided as well, e.g. the raw input and output counts used to compute measurement values.
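The sketch below shows one way the 'set' field might be used to separate held-out test observations from the rest. It runs on a small hand-made dataframe whose values are invented for illustration, and it assumes the 'set' column takes the values 'training', 'validation', and 'test'. With MAVE-NN installed, the dataframe would instead come from mavenn.load_example_dataset().

```python
import pandas as pd

# Toy stand-in for a dataset with continuous measurements.
# All values here are made up for illustration only.
data_df = pd.DataFrame({
    "set": ["training", "validation", "test", "training"],
    "y": [0.12, -1.30, 0.85, -0.40],
    "x": ["ACGT", "ACGA", "TCGT", "ACCT"],
})

# Assayed sequences are expected to share a single length.
seq_lengths = data_df["x"].str.len().unique()
assert len(seq_lengths) == 1, "sequences differ in length"

# Boolean mask selecting the observations reserved for testing.
is_test = data_df["set"] == "test"
test_df = data_df[is_test].reset_index(drop=True)
trainval_df = data_df[~is_test].reset_index(drop=True)

print(f"training + validation N: {len(trainval_df)}")
print(f"test N: {len(test_df)}")
```

Keeping training and validation rows together in one dataframe is a choice made here for brevity; they can be separated the same way by comparing the 'set' column against 'validation'.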