{ "cells": [ { "cell_type": "markdown", "id": "dca40d3f", "metadata": {}, "source": [ "# sortseq dataset" ] }, { "cell_type": "code", "execution_count": 5, "id": "43acb29b", "metadata": { "ExecuteTime": { "end_time": "2021-11-11T21:15:23.450749Z", "start_time": "2021-11-11T21:15:21.792191Z" } }, "outputs": [], "source": [ "# Standard imports\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# Special imports\n", "import mavenn\n", "import os\n", "import urllib.request" ] }, { "cell_type": "markdown", "id": "5c257c40", "metadata": { "ExecuteTime": { "end_time": "2021-11-11T17:26:47.608641Z", "start_time": "2021-11-11T17:26:47.392567Z" } }, "source": [ "## Summary" ] }, { "cell_type": "markdown", "id": "c7e0fed0", "metadata": { "ExecuteTime": { "end_time": "2021-11-11T17:27:24.538136Z", "start_time": "2021-11-11T17:27:24.529622Z" } }, "source": [ "The sort-seq MPRA data of Kinney et al., 2010. The authors used fluorescence-activated cell sorting, followed by deep sequencing, to assay gene expression levels from variant *lac* promoters in *E. coli*. The authors performed 6 different experiments, which varied in the region of the *lac* promoter that was mutagenized, the mutation rate used, the *E. coli* host strain, cellular growth conditions, and the number of bins into which cells were sorted. See Kinney et al., 2010 for more details.\n", "\n", "In this dataframe, the `'x'` column lists (unique) variant sequences, columns `'ct_0'` through `'ct_9'` list the read counts for each sequence observed in each of the 10 respective FACS bins, and the `'set'` column indicates whether each sequence is assigned to the training set, the validation set, or the test set.\n", "\n", "**Names**: ``'sortseq'``\n", "\n", "**Associated datasets**: ``'sortseq_rnap-wt'``, ``'sortseq_crp-wt'``, ``'sortseq_full-500'``, ``'sortseq_full-150'``, ``'sortseq_full-0'``\n", "\n", "**Reference**: Kinney J, Murugan A, Callan C, Cox E. 
Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. [Proc Natl Acad Sci USA. 107(20):9158-9163 (2010).](https://dx.doi.org/10.1073/pnas.1004290107)" ] }, { "cell_type": "code", "execution_count": 2, "id": "ba16bbe4", "metadata": { "ExecuteTime": { "end_time": "2021-11-11T21:15:23.496550Z", "start_time": "2021-11-11T21:15:23.451837Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " set ct_0 ct_1 ct_2 ct_3 ct_4 ct_5 ct_6 ct_7 ct_8 ct_9 \\\n", "0 training 0 1 0 0 0 0 0 0 0 0 \n", "1 test 0 0 0 0 0 0 0 0 1 0 \n", "2 test 0 0 0 0 0 0 1 0 0 0 \n", "3 training 0 0 0 0 0 0 0 0 0 1 \n", "4 training 0 0 0 0 0 0 0 0 0 1 \n", "... ... ... ... ... ... ... ... ... ... ... ... \n", "50513 validation 0 0 0 1 0 0 0 0 0 0 \n", "50514 test 0 0 0 0 0 0 0 0 1 0 \n", "50515 training 0 0 0 1 0 0 0 0 0 0 \n", "50516 training 1 0 0 0 0 0 0 0 0 0 \n", "50517 training 1 0 0 0 0 0 0 0 0 0 \n", "\n", " x \n", "0 AAAAAAAGTGAGTTAGCCAACTAATTAGGCACCGTACGCTTTATAG... \n", "1 AAAAAATCTGAGTTAGCTTACTCATTAGGCACCCCAGGCTTGACAC... \n", "2 AAAAAATCTGAGTTTGCTCACTCTATCGGCACCCCAGTCTTTACAC... \n", "3 AAAAAATGAGAGTTAGTTCACTCATTCGGCACCACAGGCTTTACAA... \n", "4 AAAAAATGGGTGTTAGCTCTATCATTAGGCACCCCCGGCTTTACAC... \n", "... ... \n", "50513 TTTTGCAGAGTGTCAGCCCACTCATTACGCACCGCAGCCGTTACAC... \n", "50514 TTTTTATGTGAGTTAGCTCACTCATTCGGCACCCTAGGCTTTACAC... \n", "50515 TTTTTATGTGAGTTTGCTCACTCATGTGGCACCTAAGGCTTTACGC... \n", "50516 TTTTTATGTGGGTTAGGTCGCGCATTAGGCACCGCAGGCTTTACCC... \n", "50517 TTTTTATGTGTGTTTACTCTCTCATTAGGCACTCCACGCTTTACAC... \n", "\n", "[50518 rows x 12 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mavenn.load_example_dataset('sortseq')" ] }, { "cell_type": "markdown", "id": "6fe018cb", "metadata": {}, "source": [ "## Preprocessing" ] }, { "cell_type": "markdown", "id": "51ae238c", "metadata": {}, "source": [ "The sort-seq MPRA dataset of Kinney et al., (2010) is available at https://github.com/jbkinney/09_sortseq/ in file `file_S2.txt.gz`. It is formatted as follows: the `'seq'` column lists (non-unique) variant 75 nt DNA sequences observed by high-throughput seuqencing, the `'experiment'` column lists which of the six reported experiments produced that sequence, and the `'bin'` column lists the FACS bin in which that sequence was observed. This dataframe is called `raw_df` in what follows." 
] }, { "cell_type": "code", "execution_count": 3, "id": "29aecf0a", "metadata": { "ExecuteTime": { "end_time": "2021-11-11T21:15:24.723828Z", "start_time": "2021-11-11T21:15:23.497489Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " experiment bin x\n", "0 crp-wt B0 AATTAAGGGCAGTTAACTCACCCATTAGGCACCCCAGGCTTTACAC...\n", "1 crp-wt B0 AATTAATATGAGTTTGCTCACCCATTAGGCACCCCAGGCTTTACAC...\n", "2 crp-wt B0 AATTAATAAGAGTTCACTCACTCATACGGCACCCCAGGCTTTACAC...\n", "3 crp-wt B0 AATTTATGTGCTTTACCTCACTGATTTGGCACCCCAGGCTTTACAC...\n", "4 crp-wt B0 AATTAAGGTGAGTTCGCTCGCTCATGAGGCACCCCAGGCTTTACAC..." ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Download datset\n", "url = 'https://github.com/jbkinney/09_sortseq/raw/master/file_S2.txt.gz'\n", "raw_data_file = 'file_S2.txt.gz'\n", "urllib.request.urlretrieve(url, raw_data_file)\n", "\n", "# Load raw dataset\n", "raw_df = pd.read_csv('file_S2.txt.gz', \n", " sep='\\t',\n", " header=None, \n", " names=['experiment','bin','x'], \n", " compression='gzip')\n", "\n", "# Delete raw dataset\n", "os.remove(raw_data_file)\n", "\n", "# Preview raw_df\n", "raw_df.head()" ] }, { "cell_type": "markdown", "id": "40811846", "metadata": {}, "source": [ "To reformat `'raw_df'` into the one provided with MAVE-NN, we first trim the dataframe to keep only rows corresponding to the `'full-wt'` experiment. We then rename each FACS bin `'BX'` to `'ct_X'` for X = 0, 1, ..., 9, and create a `'ct'` column filled with ones. The result is stored in a dataframe called `sub_df`.\n", "\n", "Next we use the `pivot()` and `groupby()` functions in Pandas to obtain a dataframe in which the `'seq'` column lists only unique sequences, each of the 10 possible `'ct_X'` values in the original `'bin'` column now label a separate column, and the values in these new columns report the number of times each sequence was observed in each FACS bin. The result is stored in a dataframe called `pivot_df`.\n", "\n", "Finally, we create a `'set'` column that randomly assigns each sequence to the training, test, or validation set (using a 60:20:20 split), then reorder the columns for clarity. The resulting dataframe is called `final_df`." 
] }, { "cell_type": "code", "execution_count": 4, "id": "31f260f4", "metadata": { "ExecuteTime": { "end_time": "2021-11-11T21:15:24.857703Z", "start_time": "2021-11-11T21:15:24.724567Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " set ct_0 ct_1 ct_2 ct_3 ct_4 ct_5 ct_6 ct_7 ct_8 ct_9 \\\n", "0 training 0 1 0 0 0 0 0 0 0 0 \n", "1 test 0 0 0 0 0 0 0 0 1 0 \n", "2 test 0 0 0 0 0 0 1 0 0 0 \n", "3 training 0 0 0 0 0 0 0 0 0 1 \n", "4 training 0 0 0 0 0 0 0 0 0 1 \n", "\n", " x \n", "0 AAAAAAAGTGAGTTAGCCAACTAATTAGGCACCGTACGCTTTATAG... \n", "1 AAAAAATCTGAGTTAGCTTACTCATTAGGCACCCCAGGCTTGACAC... \n", "2 AAAAAATCTGAGTTTGCTCACTCTATCGGCACCCCAGTCTTTACAC... \n", "3 AAAAAATGAGAGTTAGTTCACTCATTCGGCACCACAGGCTTTACAA... \n", "4 AAAAAATGGGTGTTAGCTCTATCATTAGGCACCCCCGGCTTTACAC... " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keep only data from the full-wt experiment\n", "ix = raw_df['experiment']=='full-wt'\n", "sub_df = raw_df[ix].copy().reset_index(drop=True)[['bin','x']]\n", "\n", "# Rename bins BX -> ct_X, where X = 0, 1, ..., 9\n", "sub_df['bin'] = [f'ct_{s[1:]}' for s in sub_df['bin']]\n", "\n", "# Add counts column\n", "sub_df['ct'] = 1\n", "\n", "# Pivot dataframe\n", "pivot_df = sub_df.pivot_table(index='x', values='ct', columns='bin').fillna(0).astype(int)\n", "pivot_df.columns.name = None\n", "\n", "# Groupby sequence\n", "pivot_df = pivot_df.groupby('x').sum()\n", "\n", "# Reindex dataframe\n", "pivot_df = pivot_df.reset_index()\n", "\n", "# Randomly assign sequences to training, validation, and test sets\n", "final_df = pivot_df.copy()\n", "np.random.seed(0)\n", "final_df['set'] = np.random.choice(a=['training','test','validation'], \n", " p=[.6,.2,.2], \n", " size=len(final_df))\n", "\n", "# Rearrange columns\n", "new_cols = ['set'] + list(final_df.columns[1:-1]) + ['x']\n", "final_df = final_df[new_cols]\n", "\n", "# Save to file (uncomment to execute)\n", "# final_df.to_csv('sortseq_data.csv.gz', index=False, compression='gzip')\n", "\n", "# Preview final_df\n", "final_df.head()" ] }, { "cell_type": "markdown", "id": "9c3f2c26", "metadata": {}, "source": [ "This final dataframe, `final_df`, has the same format as the 
`'sortseq'` dataset that comes with MAVE-NN. \n", "\n" ] }, { "cell_type": "markdown", "id": "902743b9", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }