{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial 2: Protein DMS modeling using additive G-P maps\n", "\n", "This tutorial covers perhaps the simplest application of MAVE-NN: modeling deep mutational scanning (DMS) data using an additive genotype-phenotype (G-P) map together with a global epistasis (GE) measurement process. The code below walks through this process and can be used to train models similar to the following built-in models, which can be loaded with `mavenn.load_example_model()`:\n", "\n", "- `'amyloid_additive_ge'`\n", "- `'tdp43_additive_ge'`\n", "- `'gb1_additive_ge'`" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2021-12-16T00:28:24.191016Z", "start_time": "2021-12-16T00:28:22.614488Z" } }, "outputs": [], "source": [ "# Standard imports\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# Import MAVE-NN\n", "import mavenn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training\n", "\n", "First, we choose a dataset to model and load it as a pandas DataFrame using `mavenn.load_example_dataset()`. We then compute the length of the sequences in that dataset; this quantity is needed to define the architecture of our model." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2021-12-16T00:28:24.217141Z", "start_time": "2021-12-16T00:28:24.192066Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading dataset 'tdp43' \n", "Sequence length: 84 amino acids (+ stops)\n", "data_df:\n" ] }, { "data": { "text/html": [ "
|  | set | dist | y | dy | x |
|---|---|---|---|---|---|
| 0 | training | 1 | 0.032210 | 0.037438 | NNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 1 | training | 1 | -0.009898 | 0.038981 | TNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 2 | training | 1 | -0.010471 | 0.005176 | RNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 3 | training | 1 | 0.030803 | 0.005341 | SNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 4 | training | 1 | -0.054716 | 0.035752 | INSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| ... | ... | ... | ... | ... | ... |
| 57991 | training | 2 | -0.009706 | 0.035128 | GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 57992 | validation | 2 | -0.030744 | 0.029436 | GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 57993 | validation | 2 | -0.086802 | 0.033174 | GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 57994 | training | 2 | -0.049587 | 0.029130 | GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 57995 | training | 2 | -0.105390 | 0.031189 | GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |

57996 rows × 5 columns

|  | validation | dist | y | dy | x |
|---|---|---|---|---|---|
| 0 | False | 1 | 0.032210 | 0.037438 | NNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 1 | False | 1 | -0.009898 | 0.038981 | TNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 2 | False | 1 | -0.010471 | 0.005176 | RNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 3 | False | 1 | 0.030803 | 0.005341 | SNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 4 | False | 1 | -0.054716 | 0.035752 | INSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| ... | ... | ... | ... | ... | ... |
| 55160 | False | 2 | -0.009706 | 0.035128 | GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 55161 | True | 2 | -0.030744 | 0.029436 | GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 55162 | True | 2 | -0.086802 | 0.033174 | GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 55163 | False | 2 | -0.049587 | 0.029130 | GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 55164 | False | 2 | -0.105390 | 0.031189 | GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |

55165 rows × 5 columns
\n", "