{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "dca40d3f",
   "metadata": {},
   "source": [
    "# mpsa datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "43acb29b",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-12T15:21:06.957930Z",
     "start_time": "2021-11-12T15:21:05.491772Z"
    }
   },
   "outputs": [],
   "source": [
    "# Standard imports\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "# Special imports\n",
    "import mavenn\n",
    "import os\n",
    "import urllib"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5336886a",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-11T17:26:47.608641Z",
     "start_time": "2021-11-11T17:26:47.392567Z"
    }
   },
   "source": [
    "## Summary"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "030ff69a",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-11T17:27:24.538136Z",
     "start_time": "2021-11-11T17:27:24.529622Z"
    }
   },
   "source": [
    "The massively parallel splicing assay (MPSA) dataset of Wong et al., 2018. The authors used 3-exon minigenes to assay how inclusion of the middle exon varies with the sequence of that exon's 5' splice site. Nearly all 5' splice site variants of the form NNN/GYNNNN were measured, where the slash demarcates the exon/intron boundary. The authors performed experiments on multiple replicates of multiple libraries in three different minigene contexts: *IKBKAP* exons 19-21, *SMN1* exons 6-8, and *BRCA2* exons 17-19. The dataset ``'mpsa'`` is from library 1 replicate 1 in the *BRCA2* context, while ``'mpsa_replicate'`` is from library 2 replicate 1 in the same context. \n",
    "\n",
    "In these dataframes, the ``'tot_ct'`` column reports the number of reads obtained for each splice site from total processed mRNA transcripts, the ``'ex_ct'`` column reports the number of reads obtained from processed mRNA transcripts containing the central exon, ``'y'`` is the $\\log_{10}$ percent-spliced-in (PSI) value measured for each sequence, and ``'x'`` is the variant 5' splice site. Note that some sequences have $y > 2.0$, corresponding to PSI > 100, due to experimental noise.\n",
    "\n",
    "**Names**: ``'mpsa'``, ``'mpsa_replicate'``\n",
    "\n",
    "**Reference**: Wong MS, Kinney JB, Krainer AR. Quantitative activity profile and context dependence of all human 5' splice sites. [Mol Cell. 71:1012-1026.e3 (2018).](https://doi.org/10.1016/j.molcel.2018.07.033)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "3072cf25",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-12T15:21:06.983473Z",
     "start_time": "2021-11-12T15:21:06.958981Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>set</th>\n",
       "      <th>tot_ct</th>\n",
       "      <th>ex_ct</th>\n",
       "      <th>y</th>\n",
       "      <th>x</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>training</td>\n",
       "      <td>28</td>\n",
       "      <td>2</td>\n",
       "      <td>0.023406</td>\n",
       "      <td>GGAGUGAUG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>test</td>\n",
       "      <td>315</td>\n",
       "      <td>7</td>\n",
       "      <td>-0.587914</td>\n",
       "      <td>AGUGUGCAA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>training</td>\n",
       "      <td>193</td>\n",
       "      <td>15</td>\n",
       "      <td>-0.074999</td>\n",
       "      <td>UUCGCGCCA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>training</td>\n",
       "      <td>27</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.438475</td>\n",
       "      <td>UAAGCUUUU</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>training</td>\n",
       "      <td>130</td>\n",
       "      <td>2</td>\n",
       "      <td>-0.631467</td>\n",
       "      <td>AUGGUCGGG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30478</th>\n",
       "      <td>training</td>\n",
       "      <td>190</td>\n",
       "      <td>17</td>\n",
       "      <td>-0.017078</td>\n",
       "      <td>CUGGUUGCA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30479</th>\n",
       "      <td>training</td>\n",
       "      <td>154</td>\n",
       "      <td>10</td>\n",
       "      <td>-0.140256</td>\n",
       "      <td>CGCGCACAA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30480</th>\n",
       "      <td>test</td>\n",
       "      <td>407</td>\n",
       "      <td>16</td>\n",
       "      <td>-0.371528</td>\n",
       "      <td>ACUGCUCAC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30481</th>\n",
       "      <td>training</td>\n",
       "      <td>265</td>\n",
       "      <td>6</td>\n",
       "      <td>-0.571100</td>\n",
       "      <td>AUAGUCUAA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30482</th>\n",
       "      <td>test</td>\n",
       "      <td>26</td>\n",
       "      <td>22</td>\n",
       "      <td>0.939047</td>\n",
       "      <td>GUGGUAACU</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>30483 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "            set  tot_ct  ex_ct         y          x\n",
       "0      training      28      2  0.023406  GGAGUGAUG\n",
       "1          test     315      7 -0.587914  AGUGUGCAA\n",
       "2      training     193     15 -0.074999  UUCGCGCCA\n",
       "3      training      27      0 -0.438475  UAAGCUUUU\n",
       "4      training     130      2 -0.631467  AUGGUCGGG\n",
       "...         ...     ...    ...       ...        ...\n",
       "30478  training     190     17 -0.017078  CUGGUUGCA\n",
       "30479  training     154     10 -0.140256  CGCGCACAA\n",
       "30480      test     407     16 -0.371528  ACUGCUCAC\n",
       "30481  training     265      6 -0.571100  AUAGUCUAA\n",
       "30482      test      26     22  0.939047  GUGGUAACU\n",
       "\n",
       "[30483 rows x 5 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mavenn.load_example_dataset('mpsa')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6fe018cb",
   "metadata": {},
   "source": [
    "## Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ecccf7df",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-11T17:34:28.129865Z",
     "start_time": "2021-11-11T17:34:28.124430Z"
    }
   },
   "source": [
    "The raw datasets for both `'mpsa'` and `'mpsa_replicate'` are available at [github.com/jbkinney/15_splicing/](https://github.com/jbkinney/15_splicing/). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "145e2a20",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-12T15:21:08.254891Z",
     "start_time": "2021-11-12T15:21:06.984520Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Creating dataset \"mpsa\"...\n",
      "Retrieving data from https://github.com/jbkinney/15_splicing/raw/master/for_mavenn/results.brca2_9nt_lib1_rep1.txt.gz...\n",
      "Done!\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>tot_ct</th>\n",
       "      <th>ex_ct</th>\n",
       "      <th>lib_ct</th>\n",
       "      <th>mis_ct</th>\n",
       "      <th>ss</th>\n",
       "      <th>bc</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>377</td>\n",
       "      <td>27</td>\n",
       "      <td>164</td>\n",
       "      <td>3</td>\n",
       "      <td>ACAGCGGGA</td>\n",
       "      <td>TTAGCTATCGGCTGACGTCT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>332</td>\n",
       "      <td>5</td>\n",
       "      <td>97</td>\n",
       "      <td>1</td>\n",
       "      <td>AGCGTGTAT</td>\n",
       "      <td>CCACCCAACGCGCCGTCAGT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>320</td>\n",
       "      <td>3286</td>\n",
       "      <td>46</td>\n",
       "      <td>1</td>\n",
       "      <td>CAGGTGAGA</td>\n",
       "      <td>TTGAGGTACACTGAACAGTC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>312</td>\n",
       "      <td>2248</td>\n",
       "      <td>87</td>\n",
       "      <td>1</td>\n",
       "      <td>CAGGTTAGA</td>\n",
       "      <td>ACCGATCTGCCACGGCGACC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>291</td>\n",
       "      <td>8</td>\n",
       "      <td>109</td>\n",
       "      <td>2</td>\n",
       "      <td>CAAGCCTTA</td>\n",
       "      <td>AGGGACCATCCAGTTCGCCT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>944960</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>14</td>\n",
       "      <td>0</td>\n",
       "      <td>ACCGCGATG</td>\n",
       "      <td>TGAAATTGACCCGAGCCTGC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>944961</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>14</td>\n",
       "      <td>1</td>\n",
       "      <td>AACGCCTCG</td>\n",
       "      <td>AACCAAAATACCTTGCGCTT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>944962</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>14</td>\n",
       "      <td>0</td>\n",
       "      <td>TACGCATCG</td>\n",
       "      <td>TACTCAGCCAATGGCGAACA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>944963</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>14</td>\n",
       "      <td>0</td>\n",
       "      <td>AAGGTCACG</td>\n",
       "      <td>CTATGCATCTACGCTTAATG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>944964</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>CCAGCGCCG</td>\n",
       "      <td>AAAAAAAAAAAAGATTTGTT</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>944965 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        tot_ct  ex_ct  lib_ct  mis_ct         ss                    bc\n",
       "0          377     27     164       3  ACAGCGGGA  TTAGCTATCGGCTGACGTCT\n",
       "1          332      5      97       1  AGCGTGTAT  CCACCCAACGCGCCGTCAGT\n",
       "2          320   3286      46       1  CAGGTGAGA  TTGAGGTACACTGAACAGTC\n",
       "3          312   2248      87       1  CAGGTTAGA  ACCGATCTGCCACGGCGACC\n",
       "4          291      8     109       2  CAAGCCTTA  AGGGACCATCCAGTTCGCCT\n",
       "...        ...    ...     ...     ...        ...                   ...\n",
       "944960       0      0      14       0  ACCGCGATG  TGAAATTGACCCGAGCCTGC\n",
       "944961       0      0      14       1  AACGCCTCG  AACCAAAATACCTTGCGCTT\n",
       "944962       0      0      14       0  TACGCATCG  TACTCAGCCAATGGCGAACA\n",
       "944963       0      0      14       0  AAGGTCACG  CTATGCATCTACGCTTAATG\n",
       "944964       0      0       2       0  CCAGCGCCG  AAAAAAAAAAAAGATTTGTT\n",
       "\n",
       "[944965 rows x 6 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Specify dataset\n",
    "dataset = 'mpsa' # use to make 'mpsa' dataset\n",
    "# dataset = 'mpsa_replicate' # uncomment to make 'mpsa_replicate' dataset instead\n",
    "print(f'Creating dataset \"{dataset}\"...')\n",
    "\n",
    "# Specify online directory\n",
    "url_dir = 'https://github.com/jbkinney/15_splicing/raw/master/for_mavenn/'\n",
    "\n",
    "# Specify raw data file and output file names\n",
    "if dataset=='mpsa':\n",
    "    raw_data_file = 'results.brca2_9nt_lib1_rep1.txt.gz'\n",
    "elif dataset=='mpsa_replicate':\n",
    "    raw_data_file = 'results.brca2_9nt_lib2_rep1.txt.gz'\n",
    "else:\n",
    "    assert False\n",
    "\n",
    "# Download raw datasert\n",
    "print(f'Retrieving data from {url_dir+raw_data_file}...')\n",
    "urllib.request.urlretrieve(url_dir+raw_data_file, raw_data_file)\n",
    "\n",
    "# Load raw datset\n",
    "raw_df = pd.read_csv(raw_data_file, sep='\\t', index_col=0, compression='gzip')\n",
    "\n",
    "# Delete raw dataset\n",
    "os.remove(raw_data_file)\n",
    "\n",
    "# Preview raw dataset\n",
    "print('Done!')\n",
    "raw_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5f643aa",
   "metadata": {},
   "source": [
    "Each raw dataset lists read counts for every cloned minigene plasmid. Note that each minigene contained one 5'ss (column `'ss'`) and one associated barcode (column `'bc'`). Each barcode is associated with a unique 5'ss, but each 5'ss will typically be associated with many different barcodes. In addition to `'tot_ct'` and `'ex_ct'`, two other read count quantities are listed: '`lib_ct`' is the number of reads supporting the indicated 5'ss-barcode association, while `'mis_ct'` is the number of reads indicating association of the barcode with other 5'ss that are not shown. Filtering for `'lib_ct'` >= 2 and `'lib_ct'` >=4x `'mis_ct'` was already performed by the authors, and we will ignore these columns in what follows.\n",
    "\n",
    "To format this dataset for use in MAVE-NN, we do the following:\n",
    "- Drop the '`lib_ct`' and `'mis_ct'` columns.\n",
    "- Drop the `'bc'` column and marginalize '`tot_ct`' and '`ex_ct`' by 5'ss sequence.\n",
    "- Rename the `'ss'` column to `'x'`. \n",
    "- Convert DNA 5'ss sequences to RNA sequences by replacing `'T'` with `'U'`.\n",
    "- Remove 5'ss with '`tot_ct'` < 10\n",
    "- Remove 5'ss with sequences that don't match `'NNNGYNNNN'`. \n",
    "- Compute `'y'`, which lists estimated log10 PSI values. \n",
    "- Assign each 5'ss to the training, validation, or test set.\n",
    "- Shuffle the rows of the final dataframe and reorder the columns for clarity.\n",
    "- Save the dataframe as a .csv.gz file.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "153b12e9",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-12T15:21:08.526377Z",
     "start_time": "2021-11-12T15:21:08.255702Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2309 ss removed for having tot_ct < 10\n",
      "7 ss with invalid sequences removed\n",
      "Dataset \"mpsa\" saved to mpsa_data.csv.gz\n",
      "Done!\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>set</th>\n",
       "      <th>tot_ct</th>\n",
       "      <th>ex_ct</th>\n",
       "      <th>y</th>\n",
       "      <th>x</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>training</td>\n",
       "      <td>28</td>\n",
       "      <td>2</td>\n",
       "      <td>0.023406</td>\n",
       "      <td>GGAGUGAUG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>test</td>\n",
       "      <td>315</td>\n",
       "      <td>7</td>\n",
       "      <td>-0.587914</td>\n",
       "      <td>AGUGUGCAA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>training</td>\n",
       "      <td>193</td>\n",
       "      <td>15</td>\n",
       "      <td>-0.074999</td>\n",
       "      <td>UUCGCGCCA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>training</td>\n",
       "      <td>27</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.438475</td>\n",
       "      <td>UAAGCUUUU</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>training</td>\n",
       "      <td>130</td>\n",
       "      <td>2</td>\n",
       "      <td>-0.631467</td>\n",
       "      <td>AUGGUCGGG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30478</th>\n",
       "      <td>training</td>\n",
       "      <td>190</td>\n",
       "      <td>17</td>\n",
       "      <td>-0.017078</td>\n",
       "      <td>CUGGUUGCA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30479</th>\n",
       "      <td>training</td>\n",
       "      <td>154</td>\n",
       "      <td>10</td>\n",
       "      <td>-0.140256</td>\n",
       "      <td>CGCGCACAA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30480</th>\n",
       "      <td>test</td>\n",
       "      <td>407</td>\n",
       "      <td>16</td>\n",
       "      <td>-0.371528</td>\n",
       "      <td>ACUGCUCAC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30481</th>\n",
       "      <td>training</td>\n",
       "      <td>265</td>\n",
       "      <td>6</td>\n",
       "      <td>-0.571100</td>\n",
       "      <td>AUAGUCUAA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30482</th>\n",
       "      <td>test</td>\n",
       "      <td>26</td>\n",
       "      <td>22</td>\n",
       "      <td>0.939047</td>\n",
       "      <td>GUGGUAACU</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>30483 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "            set  tot_ct  ex_ct         y          x\n",
       "0      training      28      2  0.023406  GGAGUGAUG\n",
       "1          test     315      7 -0.587914  AGUGUGCAA\n",
       "2      training     193     15 -0.074999  UUCGCGCCA\n",
       "3      training      27      0 -0.438475  UAAGCUUUU\n",
       "4      training     130      2 -0.631467  AUGGUCGGG\n",
       "...         ...     ...    ...       ...        ...\n",
       "30478  training     190     17 -0.017078  CUGGUUGCA\n",
       "30479  training     154     10 -0.140256  CGCGCACAA\n",
       "30480      test     407     16 -0.371528  ACUGCUCAC\n",
       "30481  training     265      6 -0.571100  AUAGUCUAA\n",
       "30482      test      26     22  0.939047  GUGGUAACU\n",
       "\n",
       "[30483 rows x 5 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Remove unwanted columns\n",
    "data_df = raw_df[['tot_ct','ex_ct','ss','bc']].copy()\n",
    "\n",
    "# Marginalize by splice site\n",
    "data_df = data_df.groupby('ss').sum().reset_index()\n",
    "\n",
    "# Rename columns\n",
    "data_df.rename(columns={'ss':'x'}, inplace=True)\n",
    "\n",
    "# Make sequences RNA\n",
    "data_df['x'] = [ss.replace('T','U') for ss in data_df['x']]\n",
    "\n",
    "# Remove ss with minimum tot_ct\n",
    "min_ct = 10\n",
    "ix = data_df['tot_ct'] >= min_ct\n",
    "data_df = data_df[ix].copy()\n",
    "print(f'{(~ix).sum()} ss removed for having tot_ct < {min_ct}')\n",
    "\n",
    "# Remove ss with invalid sequences\n",
    "ix = np.array([((x[3]=='G') and (x[4] in {'C','U'})) for x in data_df['x']])\n",
    "data_df = data_df[ix]\n",
    "print(f'{(~ix).sum()} ss with invalid sequences removed')\n",
    "\n",
    "# Get consensus i_cons and o_cons\n",
    "pseudoct = 1.0\n",
    "cons_seq = 'CAGGUAAGU'\n",
    "tmp_df = data_df.set_index('x')\n",
    "i_cons = tmp_df.loc[cons_seq,'tot_ct']\n",
    "o_cons = tmp_df.loc[cons_seq,'ex_ct']\n",
    "relative_psi_cons = (o_cons+pseudoct)/(i_cons+pseudoct)\n",
    "\n",
    "# Compute y\n",
    "i_n = data_df['tot_ct']\n",
    "o_n = data_df['ex_ct']\n",
    "relative_psi_n = (o_n+pseudoct)/(i_n+pseudoct)\n",
    "psi_n = 100*relative_psi_n/relative_psi_cons\n",
    "y_n = np.log10(psi_n)\n",
    "data_df['y'] = y_n\n",
    "\n",
    "# Assign data to training, validation, or test sets\n",
    "np.random.seed(0)\n",
    "data_df['set'] = np.random.choice(a=['training', 'validation', 'test'], \n",
    "                                   p=[.6,.2,.2], \n",
    "                                   size=len(data_df))\n",
    "\n",
    "# Shuffle data for extra safety\n",
    "data_df = data_df.sample(frac=1).reset_index(drop=True)\n",
    "\n",
    "# Order columns\n",
    "final_df = data_df[['set', 'tot_ct', 'ex_ct', 'y', 'x']].copy()\n",
    "\n",
    "# Save to file (uncomment to execute)\n",
    "out_file = f'{dataset}_data.csv.gz'\n",
    "final_df.to_csv(out_file, index=False, compression='gzip')\n",
    "print(f'Dataset \"{dataset}\" saved to {out_file}')\n",
    "\n",
    "# Preview final dataframe\n",
    "print('Done!')\n",
    "final_df\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}