{ "cells": [ { "cell_type": "markdown", "id": "dca40d3f", "metadata": {}, "source": [ "# mpsa datasets" ] }, { "cell_type": "code", "execution_count": 1, "id": "43acb29b", "metadata": { "ExecuteTime": { "end_time": "2021-11-12T15:21:06.957930Z", "start_time": "2021-11-12T15:21:05.491772Z" } }, "outputs": [], "source": [ "# Standard imports\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "# Special imports\n", "import mavenn\n", "import os\n", "import urllib" ] }, { "cell_type": "markdown", "id": "5336886a", "metadata": { "ExecuteTime": { "end_time": "2021-11-11T17:26:47.608641Z", "start_time": "2021-11-11T17:26:47.392567Z" } }, "source": [ "## Summary" ] }, { "cell_type": "markdown", "id": "030ff69a", "metadata": { "ExecuteTime": { "end_time": "2021-11-11T17:27:24.538136Z", "start_time": "2021-11-11T17:27:24.529622Z" } }, "source": [ "The massively parallel splicing assay (MPSA) dataset of Wong et al., 2018. The authors used 3-exon minigenes to assay how inclusion of the middle exon varies with the sequence of that exon's 5' splice site. Nearly all 5' splice site variants of the form NNN/GYNNNN were measured, where the slash demarcates the exon/intron boundary. The authors performed experiments on multiple replicates of multiple libraries in three different minigene contexts: *IKBKAP* exons 19-21, *SMN1* exons 6-8, and *BRCA2* exons 17-19. The dataset ``'mpsa'`` is from library 1 replicate 1 in the *BRCA2* context, while ``'mpsa_replicate'`` is from library 2 replicate 1 in the same context. \n", "\n", "In these dataframes, the ``'tot_ct'`` column reports the number of reads obtained for each splice site from total processed mRNA transcripts, the ``'ex_ct'`` column reports the number of reads obtained from processed mRNA transcripts containing the central exon, ``'y'`` is the $\\log_{10}$ percent-spliced-in (PSI) value measured for each sequence, and ``'x'`` is the variant 5' splice site. Note that some sequences have $y > 2.0$, corresponding to PSI > 100, due to experimental noise.\n", "\n", "**Names**: ``'mpsa'``, ``'mpsa_replicate'``\n", "\n", "**Reference**: Wong MS, Kinney JB, Krainer AR. Quantitative activity profile and context dependence of all human 5' splice sites. [Mol Cell. 71:1012-1026.e3 (2018).](https://doi.org/10.1016/j.molcel.2018.07.033)" ] }, { "cell_type": "code", "execution_count": 2, "id": "3072cf25", "metadata": { "ExecuteTime": { "end_time": "2021-11-12T15:21:06.983473Z", "start_time": "2021-11-12T15:21:06.958981Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
settot_ctex_ctyx
0training2820.023406GGAGUGAUG
1test3157-0.587914AGUGUGCAA
2training19315-0.074999UUCGCGCCA
3training270-0.438475UAAGCUUUU
4training1302-0.631467AUGGUCGGG
..................
30478training19017-0.017078CUGGUUGCA
30479training15410-0.140256CGCGCACAA
30480test40716-0.371528ACUGCUCAC
30481training2656-0.571100AUAGUCUAA
30482test26220.939047GUGGUAACU
\n", "

30483 rows × 5 columns

\n", "
" ], "text/plain": [ " set tot_ct ex_ct y x\n", "0 training 28 2 0.023406 GGAGUGAUG\n", "1 test 315 7 -0.587914 AGUGUGCAA\n", "2 training 193 15 -0.074999 UUCGCGCCA\n", "3 training 27 0 -0.438475 UAAGCUUUU\n", "4 training 130 2 -0.631467 AUGGUCGGG\n", "... ... ... ... ... ...\n", "30478 training 190 17 -0.017078 CUGGUUGCA\n", "30479 training 154 10 -0.140256 CGCGCACAA\n", "30480 test 407 16 -0.371528 ACUGCUCAC\n", "30481 training 265 6 -0.571100 AUAGUCUAA\n", "30482 test 26 22 0.939047 GUGGUAACU\n", "\n", "[30483 rows x 5 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mavenn.load_example_dataset('mpsa')" ] }, { "cell_type": "markdown", "id": "6fe018cb", "metadata": {}, "source": [ "## Preprocessing" ] }, { "cell_type": "markdown", "id": "ecccf7df", "metadata": { "ExecuteTime": { "end_time": "2021-11-11T17:34:28.129865Z", "start_time": "2021-11-11T17:34:28.124430Z" } }, "source": [ "The raw datasets for both `'mpsa'` and `'mpsa_replicate'` are available at [github.com/jbkinney/15_splicing/](https://github.com/jbkinney/15_splicing/). " ] }, { "cell_type": "code", "execution_count": 3, "id": "145e2a20", "metadata": { "ExecuteTime": { "end_time": "2021-11-12T15:21:08.254891Z", "start_time": "2021-11-12T15:21:06.984520Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Creating dataset \"mpsa\"...\n", "Retrieving data from https://github.com/jbkinney/15_splicing/raw/master/for_mavenn/results.brca2_9nt_lib1_rep1.txt.gz...\n", "Done!\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tot_ctex_ctlib_ctmis_ctssbc
0377271643ACAGCGGGATTAGCTATCGGCTGACGTCT
13325971AGCGTGTATCCACCCAACGCGCCGTCAGT
23203286461CAGGTGAGATTGAGGTACACTGAACAGTC
33122248871CAGGTTAGAACCGATCTGCCACGGCGACC
429181092CAAGCCTTAAGGGACCATCCAGTTCGCCT
.....................
94496000140ACCGCGATGTGAAATTGACCCGAGCCTGC
94496100141AACGCCTCGAACCAAAATACCTTGCGCTT
94496200140TACGCATCGTACTCAGCCAATGGCGAACA
94496300140AAGGTCACGCTATGCATCTACGCTTAATG
9449640020CCAGCGCCGAAAAAAAAAAAAGATTTGTT
\n", "

944965 rows × 6 columns

\n", "
" ], "text/plain": [ " tot_ct ex_ct lib_ct mis_ct ss bc\n", "0 377 27 164 3 ACAGCGGGA TTAGCTATCGGCTGACGTCT\n", "1 332 5 97 1 AGCGTGTAT CCACCCAACGCGCCGTCAGT\n", "2 320 3286 46 1 CAGGTGAGA TTGAGGTACACTGAACAGTC\n", "3 312 2248 87 1 CAGGTTAGA ACCGATCTGCCACGGCGACC\n", "4 291 8 109 2 CAAGCCTTA AGGGACCATCCAGTTCGCCT\n", "... ... ... ... ... ... ...\n", "944960 0 0 14 0 ACCGCGATG TGAAATTGACCCGAGCCTGC\n", "944961 0 0 14 1 AACGCCTCG AACCAAAATACCTTGCGCTT\n", "944962 0 0 14 0 TACGCATCG TACTCAGCCAATGGCGAACA\n", "944963 0 0 14 0 AAGGTCACG CTATGCATCTACGCTTAATG\n", "944964 0 0 2 0 CCAGCGCCG AAAAAAAAAAAAGATTTGTT\n", "\n", "[944965 rows x 6 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Specify dataset\n", "dataset = 'mpsa' # use to make 'mpsa' dataset\n", "# dataset = 'mpsa_replicate' # uncomment to make 'mpsa_replicate' dataset instead\n", "print(f'Creating dataset \"{dataset}\"...')\n", "\n", "# Specify online directory\n", "url_dir = 'https://github.com/jbkinney/15_splicing/raw/master/for_mavenn/'\n", "\n", "# Specify raw data file and output file names\n", "if dataset=='mpsa':\n", " raw_data_file = 'results.brca2_9nt_lib1_rep1.txt.gz'\n", "elif dataset=='mpsa_replicate':\n", " raw_data_file = 'results.brca2_9nt_lib2_rep1.txt.gz'\n", "else:\n", " assert False\n", "\n", "# Download raw datasert\n", "print(f'Retrieving data from {url_dir+raw_data_file}...')\n", "urllib.request.urlretrieve(url_dir+raw_data_file, raw_data_file)\n", "\n", "# Load raw datset\n", "raw_df = pd.read_csv(raw_data_file, sep='\\t', index_col=0, compression='gzip')\n", "\n", "# Delete raw dataset\n", "os.remove(raw_data_file)\n", "\n", "# Preview raw dataset\n", "print('Done!')\n", "raw_df" ] }, { "cell_type": "markdown", "id": "f5f643aa", "metadata": {}, "source": [ "Each raw dataset lists read counts for every cloned minigene plasmid. Note that each minigene contained one 5'ss (column `'ss'`) and one associated barcode (column `'bc'`). Each barcode is associated with a unique 5'ss, but each 5'ss will typically be associated with many different barcodes. In addition to `'tot_ct'` and `'ex_ct'`, two other read count quantities are listed: '`lib_ct`' is the number of reads supporting the indicated 5'ss-barcode association, while `'mis_ct'` is the number of reads indicating association of the barcode with other 5'ss that are not shown. Filtering for `'lib_ct'` >= 2 and `'lib_ct'` >=4x `'mis_ct'` was already performed by the authors, and we will ignore these columns in what follows.\n", "\n", "To format this dataset for use in MAVE-NN, we do the following:\n", "- Drop the '`lib_ct`' and `'mis_ct'` columns.\n", "- Drop the `'bc'` column and marginalize '`tot_ct`' and '`ex_ct`' by 5'ss sequence.\n", "- Rename the `'ss'` column to `'x'`. \n", "- Convert DNA 5'ss sequences to RNA sequences by replacing `'T'` with `'U'`.\n", "- Remove 5'ss with '`tot_ct'` < 10\n", "- Remove 5'ss with sequences that don't match `'NNNGYNNNN'`. \n", "- Compute `'y'`, which lists estimated log10 PSI values. \n", "- Assign each 5'ss to the training, validation, or test set.\n", "- Shuffle the rows of the final dataframe and reorder the columns for clarity.\n", "- Save the dataframe as a .csv.gz file.\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "153b12e9", "metadata": { "ExecuteTime": { "end_time": "2021-11-12T15:21:08.526377Z", "start_time": "2021-11-12T15:21:08.255702Z" }, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2309 ss removed for having tot_ct < 10\n", "7 ss with invalid sequences removed\n", "Dataset \"mpsa\" saved to mpsa_data.csv.gz\n", "Done!\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
settot_ctex_ctyx
0training2820.023406GGAGUGAUG
1test3157-0.587914AGUGUGCAA
2training19315-0.074999UUCGCGCCA
3training270-0.438475UAAGCUUUU
4training1302-0.631467AUGGUCGGG
..................
30478training19017-0.017078CUGGUUGCA
30479training15410-0.140256CGCGCACAA
30480test40716-0.371528ACUGCUCAC
30481training2656-0.571100AUAGUCUAA
30482test26220.939047GUGGUAACU
\n", "

30483 rows × 5 columns

\n", "
" ], "text/plain": [ " set tot_ct ex_ct y x\n", "0 training 28 2 0.023406 GGAGUGAUG\n", "1 test 315 7 -0.587914 AGUGUGCAA\n", "2 training 193 15 -0.074999 UUCGCGCCA\n", "3 training 27 0 -0.438475 UAAGCUUUU\n", "4 training 130 2 -0.631467 AUGGUCGGG\n", "... ... ... ... ... ...\n", "30478 training 190 17 -0.017078 CUGGUUGCA\n", "30479 training 154 10 -0.140256 CGCGCACAA\n", "30480 test 407 16 -0.371528 ACUGCUCAC\n", "30481 training 265 6 -0.571100 AUAGUCUAA\n", "30482 test 26 22 0.939047 GUGGUAACU\n", "\n", "[30483 rows x 5 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Remove unwanted columns\n", "data_df = raw_df[['tot_ct','ex_ct','ss','bc']].copy()\n", "\n", "# Marginalize by splice site\n", "data_df = data_df.groupby('ss').sum().reset_index()\n", "\n", "# Rename columns\n", "data_df.rename(columns={'ss':'x'}, inplace=True)\n", "\n", "# Make sequences RNA\n", "data_df['x'] = [ss.replace('T','U') for ss in data_df['x']]\n", "\n", "# Remove ss with minimum tot_ct\n", "min_ct = 10\n", "ix = data_df['tot_ct'] >= min_ct\n", "data_df = data_df[ix].copy()\n", "print(f'{(~ix).sum()} ss removed for having tot_ct < {min_ct}')\n", "\n", "# Remove ss with invalid sequences\n", "ix = np.array([((x[3]=='G') and (x[4] in {'C','U'})) for x in data_df['x']])\n", "data_df = data_df[ix]\n", "print(f'{(~ix).sum()} ss with invalid sequences removed')\n", "\n", "# Get consensus i_cons and o_cons\n", "pseudoct = 1.0\n", "cons_seq = 'CAGGUAAGU'\n", "tmp_df = data_df.set_index('x')\n", "i_cons = tmp_df.loc[cons_seq,'tot_ct']\n", "o_cons = tmp_df.loc[cons_seq,'ex_ct']\n", "relative_psi_cons = (o_cons+pseudoct)/(i_cons+pseudoct)\n", "\n", "# Compute y\n", "i_n = data_df['tot_ct']\n", "o_n = data_df['ex_ct']\n", "relative_psi_n = (o_n+pseudoct)/(i_n+pseudoct)\n", "psi_n = 100*relative_psi_n/relative_psi_cons\n", "y_n = np.log10(psi_n)\n", "data_df['y'] = y_n\n", "\n", "# Assign data to training, validation, or test sets\n", "np.random.seed(0)\n", "data_df['set'] = np.random.choice(a=['training', 'validation', 'test'], \n", " p=[.6,.2,.2], \n", " size=len(data_df))\n", "\n", "# Shuffle data for extra safety\n", "data_df = data_df.sample(frac=1).reset_index(drop=True)\n", "\n", "# Order columns\n", "final_df = data_df[['set', 'tot_ct', 'ex_ct', 'y', 'x']].copy()\n", "\n", "# Save to file (uncomment to execute)\n", "out_file = f'{dataset}_data.csv.gz'\n", "final_df.to_csv(out_file, index=False, compression='gzip')\n", "print(f'Dataset \"{dataset}\" saved to {out_file}')\n", "\n", "# Preview final dataframe\n", "print('Done!')\n", "final_df\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }