Overview of built-in datasets

[1]:
import mavenn

MAVE-NN comes with multiple built-in datasets for use in training or evaluating models. These datasets can be accessed by passing a dataset name to mavenn.load_example_dataset(). To get a list of valid dataset names, call this function without any arguments:

[2]:
mavenn.load_example_dataset()
Please enter a dataset name. Valid choices are:
"amyloid"
"gb1"
"mpsa"
"mpsa_replicate"
"nisthal"
"sortseq"
"tdp43"

Datasets are returned as pandas DataFrames. Common fields include:

  • 'x': Assayed sequences, all of which are the same length.

  • 'y': Values of continuous measurements (used to train GE models).

  • 'ct_y': Read counts observed in bin number y, where y is an integer ranging from 0 to Y-1 (used to train MPA models).

  • 'set': Indicates whether each observation was reserved for the 'training', 'validation', or 'test' set when inferring the corresponding example models provided with MAVE-NN.
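As a minimal sketch of how the 'set' field might be used, the snippet below splits a dataframe into training, validation, and test subsets. The small hand-built dataframe is a hypothetical stand-in for what mavenn.load_example_dataset() returns (its sequences and values are made up), and the helper name split_by_set is an assumption for illustration, not part of MAVE-NN.

```python
import pandas as pd

# Hypothetical stand-in for a MAVE-NN example dataset; all values are made up.
# With MAVE-NN installed, this could instead be a dataframe obtained from
# mavenn.load_example_dataset() with one of the valid dataset names.
data_df = pd.DataFrame({
    "set": ["training", "training", "validation", "test", "training"],
    "y": [0.1, -1.2, 0.7, 0.3, -0.4],
    "x": ["ACGT", "ACGA", "TCGT", "ACCT", "GCGT"],
})


def split_by_set(df):
    """Return (training, validation, test) dataframes based on the 'set' column."""
    parts = []
    for name in ("training", "validation", "test"):
        # Keep only rows assigned to this split and renumber them from zero.
        subset = df[df["set"] == name].reset_index(drop=True)
        parts.append(subset)
    return tuple(parts)


train_df, val_df, test_df = split_by_set(data_df)
print(len(train_df), len(val_df), len(test_df))
```

Filtering on the 'set' column, rather than re-splitting at random, keeps the partition consistent with the one described for the example models.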

Other fields are sometimes provided as well, e.g. the raw input and output counts used to compute measurement values.
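The following sketch shows one way to separate the count fields from any additional fields in such a dataframe. It assumes that count columns are named 'ct_' followed by the integer bin number (e.g. 'ct_0', 'ct_1'), as the 'ct_y' description suggests. The toy dataframe, its values, and the column name 'extra_field' are hypothetical and are not taken from any built-in dataset.

```python
import re

import pandas as pd

# Hypothetical MPA-style dataframe with Y = 3 bins; all values are made up.
data_df = pd.DataFrame({
    "set": ["training", "test"],
    "ct_0": [5, 0],
    "ct_1": [2, 7],
    "ct_2": [0, 3],
    "extra_field": [1.5, 2.5],  # hypothetical additional field
    "x": ["ACGT", "TCGA"],
})

# Assumption: count columns are named 'ct_' followed by the bin number.
ct_cols = [c for c in data_df.columns if re.fullmatch(r"ct_\d+", c)]
# Sort numerically so that, e.g., 'ct_10' would come after 'ct_9'.
ct_cols.sort(key=lambda c: int(c.split("_")[1]))

# Anything that is neither a common field nor a count column is "other".
common_fields = {"set", "x", "y"}
other_cols = [
    c for c in data_df.columns
    if c not in common_fields and c not in ct_cols
]

# Total read count per sequence, summed across all bins.
total_counts = data_df[ct_cols].sum(axis=1).tolist()

print(ct_cols)
print(other_cols)
print(total_counts)
```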