{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial 3: Splicing MPRA modeling using multiple built-in G-P maps" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2021-12-16T01:07:19.140753Z", "start_time": "2021-12-16T01:07:17.016587Z" }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Standard imports\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "# Import MAVE-NN\n", "import mavenn\n", "\n", "# Import Logomaker for visualization\n", "import logomaker" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial we show how to train multiple models with different G-P maps on the same dataset. To this end we use the built-in `'mpsa'` dataset, which contains data from the splicing MPRA of Wong et al. (2018). Next we show how to compare the performance of these models, as in Figs. 5a-5d of Tareen et al. (2020). Finally, we demonstrate how to visualize the parameters of the `'pairwise'` G-P map trained on these data; similar visualizations are shown in Figs. 5e and 5f of Tareen et al. (2020).\n", "\n", "## Training multiple models\n", "\n", "The models that we train each have a global epistasis (GE) measurement process and one of four different types of G-P map: additive, neighbor, pairwise, or blackbox. The trained models are similar (though not identical) to the following built-in models, which can be loaded with `mavenn.load_example_model()`:\n", "\n", "- `'mpsa_additive_ge'`\n", "- `'mpsa_neighbor_ge'`\n", "- `'mpsa_pairwise_ge'`\n", "- `'mpsa_blackbox_ge'`\n", "\n", "First we load, split, and preview the built-in `'mpsa'` dataset. We also compute the length of sequences in this dataset." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2021-12-16T01:07:19.183751Z", "start_time": "2021-12-16T01:07:19.143355Z" }, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sequence length: 9 RNA nucleotides\n", "Training set : 18,469 observations ( 60.59%)\n", "Validation set : 5,936 observations ( 19.47%)\n", "Test set : 6,078 observations ( 19.94%)\n", "-------------------------------------------------\n", "Total dataset : 30,483 observations ( 100.00%)\n", "\n", "\n", "trainval_df:\n" ] }, { "data": { "text/html": [ "
|  | validation | tot_ct | ex_ct | y | x |
|---|---|---|---|---|---|
| 0 | False | 28 | 2 | 0.023406 | GGAGUGAUG |
| 1 | False | 193 | 15 | -0.074999 | UUCGCGCCA |
| 2 | False | 27 | 0 | -0.438475 | UAAGCUUUU |
| 3 | False | 130 | 2 | -0.631467 | AUGGUCGGG |
| 4 | False | 552 | 19 | -0.433012 | AGGGCAGGA |
| ... | ... | ... | ... | ... | ... |
| 24400 | False | 167 | 1467 | 1.950100 | GAGGUAAAU |
| 24401 | False | 682 | 17 | -0.570465 | AUCGCUAGA |
| 24402 | False | 190 | 17 | -0.017078 | CUGGUUGCA |
| 24403 | False | 154 | 10 | -0.140256 | CGCGCACAA |
| 24404 | False | 265 | 6 | -0.571100 | AUAGUCUAA |

24405 rows × 5 columns
\n", "