Overview of built-in datasets
[1]:
import mavenn
MAVE-NN comes with multiple built-in datasets for use in training or evaluating models. These datasets can be accessed by passing a dataset name to mavenn.load_example_dataset(). To get a list of valid dataset names, execute this command without any arguments:
[2]:
mavenn.load_example_dataset()
Please enter a dataset name. Valid choices are:
"amyloid"
"gb1"
"mpsa"
"mpsa_replicate"
"nisthal"
"sortseq"
"tdp43"
Datasets are returned in the form of pandas dataframes. Common fields include:
'x': Assayed sequences, all of which are the same length.
'y': Values of continuous measurements (used to train GE models).
'ct_y': Read counts observed in bin number y, where y is an integer ranging from 0 to Y-1 (used to train MPA models).
'set': Indicates whether each observation was reserved for the 'training', 'validation', or 'test' set when inferring the corresponding example models provided with MAVE-NN.
Other fields are sometimes provided as well, e.g. the raw input and output counts used to compute measurement values.
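The fields described above can be handled with ordinary pandas operations. The following is a minimal sketch, not part of MAVE-NN itself: the two toy dataframes contain made-up values that only mimic the column layout, and split_by_set is a hypothetical helper introduced here for illustration. It shows how one might separate rows by the 'set' field, confirm that all sequences in 'x' share one length, and gather the 'ct_y' columns into a single array.

```python
import pandas as pd

# Toy stand-in for a dataset with continuous measurements.
# All values here are made up for illustration; they are not MAVE-NN data.
toy_ge_df = pd.DataFrame({
    "set": ["training", "training", "validation", "test"],
    "y": [0.5, -1.2, 0.3, 2.0],
    "x": ["ACGU", "ACGA", "UCGU", "ACCU"],
})


def split_by_set(df):
    """Hypothetical helper: map each 'set' label to its rows, dropping the 'set' column."""
    return {
        label: group.drop(columns="set").reset_index(drop=True)
        for label, group in df.groupby("set")
    }


splits = split_by_set(toy_ge_df)
print({label: len(part) for label, part in splits.items()})

# Check that every assayed sequence has the same length.
seq_lengths = toy_ge_df["x"].str.len().unique()
print(seq_lengths)

# Toy stand-in for a dataset with read counts in Y = 3 bins (made-up values).
toy_mpa_df = pd.DataFrame({
    "set": ["training", "validation", "test"],
    "ct_0": [10, 0, 3],
    "ct_1": [2, 5, 3],
    "ct_2": [0, 7, 1],
    "x": ["ACGU", "ACGA", "UCGU"],
})

# Collect the count columns 'ct_0' ... 'ct_{Y-1}' into a (rows x Y) array.
ct_cols = [col for col in toy_mpa_df.columns if col.startswith("ct_")]
counts = toy_mpa_df[ct_cols].to_numpy()
print(ct_cols, counts.shape)
```

The same operations should carry over to a dataframe returned by mavenn.load_example_dataset(), provided that dataset includes the fields being used; since not every dataset has every field, it is worth checking the dataframe's columns first.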