{ "cells": [ { "cell_type": "markdown", "id": "dca40d3f", "metadata": {}, "source": [ "# mpsa datasets" ] }, { "cell_type": "code", "execution_count": 1, "id": "43acb29b", "metadata": { "ExecuteTime": { "end_time": "2021-11-12T15:21:06.957930Z", "start_time": "2021-11-12T15:21:05.491772Z" } }, "outputs": [], "source": [ "# Standard imports\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "# Special imports\n", "import mavenn\n", "import os\n", "import urllib" ] }, { "cell_type": "markdown", "id": "5336886a", "metadata": { "ExecuteTime": { "end_time": "2021-11-11T17:26:47.608641Z", "start_time": "2021-11-11T17:26:47.392567Z" } }, "source": [ "## Summary" ] }, { "cell_type": "markdown", "id": "030ff69a", "metadata": { "ExecuteTime": { "end_time": "2021-11-11T17:27:24.538136Z", "start_time": "2021-11-11T17:27:24.529622Z" } }, "source": [ "The massively parallel splicing assay (MPSA) dataset of Wong et al., 2018. The authors used 3-exon minigenes to assay how inclusion of the middle exon varies with the sequence of that exon's 5' splice site. Nearly all 5' splice site variants of the form NNN/GYNNNN were measured, where the slash demarcates the exon/intron boundary. The authors performed experiments on multiple replicates of multiple libraries in three different minigene contexts: *IKBKAP* exons 19-21, *SMN1* exons 6-8, and *BRCA2* exons 17-19. The dataset ``'mpsa'`` is from library 1 replicate 1 in the *BRCA2* context, while ``'mpsa_replicate'`` is from library 2 replicate 1 in the same context. \n", "\n", "In these dataframes, the ``'tot_ct'`` column reports the number of reads obtained for each splice site from total processed mRNA transcripts, the ``'ex_ct'`` column reports the number of reads obtained from processed mRNA transcripts containing the central exon, ``'y'`` is the $\\log_{10}$ percent-spliced-in (PSI) value measured for each sequence, and ``'x'`` is the variant 5' splice site. 
Note that some sequences have $y > 2.0$, corresponding to PSI > 100%; such values arise from experimental noise.\n", "\n", "**Names**: ``'mpsa'``, ``'mpsa_replicate'``\n", "\n", "**Reference**: Wong MS, Kinney JB, Krainer AR. Quantitative activity profile and context dependence of all human 5' splice sites. [Mol Cell. 71:1012-1026.e3 (2018).](https://doi.org/10.1016/j.molcel.2018.07.033)" ] }, { "cell_type": "code", "execution_count": 2, "id": "3072cf25", "metadata": { "ExecuteTime": { "end_time": "2021-11-12T15:21:06.983473Z", "start_time": "2021-11-12T15:21:06.958981Z" } }, "outputs": [ { "data": { "text/html": [ "
|   | set | tot_ct | ex_ct | y | x |
|---|---|---|---|---|---|
| 0 | training | 28 | 2 | 0.023406 | GGAGUGAUG |
| 1 | test | 315 | 7 | -0.587914 | AGUGUGCAA |
| 2 | training | 193 | 15 | -0.074999 | UUCGCGCCA |
| 3 | training | 27 | 0 | -0.438475 | UAAGCUUUU |
| 4 | training | 130 | 2 | -0.631467 | AUGGUCGGG |
| ... | ... | ... | ... | ... | ... |
| 30478 | training | 190 | 17 | -0.017078 | CUGGUUGCA |
| 30479 | training | 154 | 10 | -0.140256 | CGCGCACAA |
| 30480 | test | 407 | 16 | -0.371528 | ACUGCUCAC |
| 30481 | training | 265 | 6 | -0.571100 | AUAGUCUAA |
| 30482 | test | 26 | 22 | 0.939047 | GUGGUAACU |

30483 rows × 5 columns
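
Because ``'y'`` is defined as the $\log_{10}$ of PSI, the two quantities can be converted with a single exponentiation. The following is a minimal sketch of that conversion, not code from the original notebook: the helper name `psi_from_y` is hypothetical (it is not part of mavenn), and the `(x, y)` pairs are copied by hand from the preview of the ``'mpsa'`` dataframe.

```python
# Hypothetical helper (not a mavenn function): invert y = log10(PSI).
def psi_from_y(y):
    """Return percent-spliced-in (PSI, in percent) for a log10-PSI value y."""
    return 10.0 ** y

# A few (x, y) pairs copied from the 'mpsa' dataframe preview.
preview = [
    ("GGAGUGAUG", 0.023406),
    ("AGUGUGCAA", -0.587914),
    ("GUGGUAACU", 0.939047),
]

for x, y in preview:
    print(f"{x}: y = {y:+.6f} -> PSI = {psi_from_y(y):.2f}%")

# y = 2.0 marks the PSI = 100% boundary; larger y values reflect noise.
print(f"y = 2.0 -> PSI = {psi_from_y(2.0):.1f}%")
```

On a full dataframe the same conversion would presumably be applied column-wise, for example with `10 ** df['y']`, assuming `df` holds the loaded dataset.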

|   | tot_ct | ex_ct | lib_ct | mis_ct | ss | bc |
|---|---|---|---|---|---|---|
| 0 | 377 | 27 | 164 | 3 | ACAGCGGGA | TTAGCTATCGGCTGACGTCT |
| 1 | 332 | 5 | 97 | 1 | AGCGTGTAT | CCACCCAACGCGCCGTCAGT |
| 2 | 320 | 3286 | 46 | 1 | CAGGTGAGA | TTGAGGTACACTGAACAGTC |
| 3 | 312 | 2248 | 87 | 1 | CAGGTTAGA | ACCGATCTGCCACGGCGACC |
| 4 | 291 | 8 | 109 | 2 | CAAGCCTTA | AGGGACCATCCAGTTCGCCT |
| ... | ... | ... | ... | ... | ... | ... |
| 944960 | 0 | 0 | 14 | 0 | ACCGCGATG | TGAAATTGACCCGAGCCTGC |
| 944961 | 0 | 0 | 14 | 1 | AACGCCTCG | AACCAAAATACCTTGCGCTT |
| 944962 | 0 | 0 | 14 | 0 | TACGCATCG | TACTCAGCCAATGGCGAACA |
| 944963 | 0 | 0 | 14 | 0 | AAGGTCACG | CTATGCATCTACGCTTAATG |
| 944964 | 0 | 0 | 2 | 0 | CCAGCGCCG | AAAAAAAAAAAAGATTTGTT |

944965 rows × 6 columns

|   | set | tot_ct | ex_ct | y | x |
|---|---|---|---|---|---|
| 0 | training | 28 | 2 | 0.023406 | GGAGUGAUG |
| 1 | test | 315 | 7 | -0.587914 | AGUGUGCAA |
| 2 | training | 193 | 15 | -0.074999 | UUCGCGCCA |
| 3 | training | 27 | 0 | -0.438475 | UAAGCUUUU |
| 4 | training | 130 | 2 | -0.631467 | AUGGUCGGG |
| ... | ... | ... | ... | ... | ... |
| 30478 | training | 190 | 17 | -0.017078 | CUGGUUGCA |
| 30479 | training | 154 | 10 | -0.140256 | CGCGCACAA |
| 30480 | test | 407 | 16 | -0.371528 | ACUGCUCAC |
| 30481 | training | 265 | 6 | -0.571100 | AUAGUCUAA |
| 30482 | test | 26 | 22 | 0.939047 | GUGGUAACU |

30483 rows × 5 columns