{
"cells": [
{
"cell_type": "markdown",
"id": "dca40d3f",
"metadata": {},
"source": [
"# sortseq dataset"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "43acb29b",
"metadata": {
"ExecuteTime": {
"end_time": "2021-11-11T21:15:23.450749Z",
"start_time": "2021-11-11T21:15:21.792191Z"
}
},
"outputs": [],
"source": [
"# Standard imports\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Special imports\n",
"import mavenn\n",
"import os\n",
"import urllib"
]
},
{
"cell_type": "markdown",
"id": "5c257c40",
"metadata": {
"ExecuteTime": {
"end_time": "2021-11-11T17:26:47.608641Z",
"start_time": "2021-11-11T17:26:47.392567Z"
}
},
"source": [
"## Summary"
]
},
{
"cell_type": "markdown",
"id": "c7e0fed0",
"metadata": {
"ExecuteTime": {
"end_time": "2021-11-11T17:27:24.538136Z",
"start_time": "2021-11-11T17:27:24.529622Z"
}
},
"source": [
"This dataset contains the sort-seq MPRA data of Kinney et al., 2010. The authors used fluorescence-activated cell sorting, followed by deep sequencing, to assay gene expression levels from variant *lac* promoters in *E. coli*. The authors performed 6 different experiments, which varied in the region of the *lac* promoter that was mutagenized, the mutation rate used, the *E. coli* host strain, cellular growth conditions, and the number of bins into which cells were sorted. See Kinney et al., 2010 for more details.\n",
"\n",
"In this dataframe, the `'x'` column lists (unique) variant sequences, columns `'ct_0'` through `'ct_9'` list the number of read counts for each sequence observed in each of the 10 respective FACS bins, and the `'set'` column indicates whether each sequence is assigned to the training set, the validation set, or the test set.\n",
"\n",
"**Names**: ``'sortseq'``\n",
"\n",
"**Associated datasets**: ``'sortseq_rnap-wt'``, ``'sortseq_crp-wt'``, ``'sortseq_full-500'``, ``'sortseq_full-150'``, ``'sortseq_full-0'``\n",
"\n",
"**Reference**: Kinney J, Murugan A, Callan C, Cox E. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. [Proc Natl Acad Sci USA. 107(20):9158-9163 (2010).](https://dx.doi.org/10.1073/pnas.1004290107)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "ba16bbe4",
"metadata": {
"ExecuteTime": {
"end_time": "2021-11-11T21:15:23.496550Z",
"start_time": "2021-11-11T21:15:23.451837Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" set ct_0 ct_1 ct_2 ct_3 ct_4 ct_5 ct_6 ct_7 ct_8 ct_9 \\\n",
"0 training 0 1 0 0 0 0 0 0 0 0 \n",
"1 test 0 0 0 0 0 0 0 0 1 0 \n",
"2 test 0 0 0 0 0 0 1 0 0 0 \n",
"3 training 0 0 0 0 0 0 0 0 0 1 \n",
"4 training 0 0 0 0 0 0 0 0 0 1 \n",
"... ... ... ... ... ... ... ... ... ... ... ... \n",
"50513 validation 0 0 0 1 0 0 0 0 0 0 \n",
"50514 test 0 0 0 0 0 0 0 0 1 0 \n",
"50515 training 0 0 0 1 0 0 0 0 0 0 \n",
"50516 training 1 0 0 0 0 0 0 0 0 0 \n",
"50517 training 1 0 0 0 0 0 0 0 0 0 \n",
"\n",
" x \n",
"0 AAAAAAAGTGAGTTAGCCAACTAATTAGGCACCGTACGCTTTATAG... \n",
"1 AAAAAATCTGAGTTAGCTTACTCATTAGGCACCCCAGGCTTGACAC... \n",
"2 AAAAAATCTGAGTTTGCTCACTCTATCGGCACCCCAGTCTTTACAC... \n",
"3 AAAAAATGAGAGTTAGTTCACTCATTCGGCACCACAGGCTTTACAA... \n",
"4 AAAAAATGGGTGTTAGCTCTATCATTAGGCACCCCCGGCTTTACAC... \n",
"... ... \n",
"50513 TTTTGCAGAGTGTCAGCCCACTCATTACGCACCGCAGCCGTTACAC... \n",
"50514 TTTTTATGTGAGTTAGCTCACTCATTCGGCACCCTAGGCTTTACAC... \n",
"50515 TTTTTATGTGAGTTTGCTCACTCATGTGGCACCTAAGGCTTTACGC... \n",
"50516 TTTTTATGTGGGTTAGGTCGCGCATTAGGCACCGCAGGCTTTACCC... \n",
"50517 TTTTTATGTGTGTTTACTCTCTCATTAGGCACTCCACGCTTTACAC... \n",
"\n",
"[50518 rows x 12 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mavenn.load_example_dataset('sortseq')"
]
},
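{
"cell_type": "markdown",
"id": "7a2e4c10",
"metadata": {},
"source": [
"As a rough illustration of how these columns might be used, the sketch below builds a small toy dataframe with the same layout, splits it by the `'set'` column, and sums read counts across bins. The sequences and counts in `toy_df` are made up for illustration; with the real data, `toy_df` would be replaced by the dataframe returned by `mavenn.load_example_dataset('sortseq')`.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy dataframe in the same layout as the 'sortseq' dataset.\n",
"# Sequences and counts are invented for illustration only.\n",
"ct_cols = [f'ct_{i}' for i in range(10)]\n",
"toy_df = pd.DataFrame({\n",
"    'set': ['training', 'test', 'validation', 'training'],\n",
"    'x': ['AAAC', 'AAGC', 'ATAC', 'TTAC'],\n",
"})\n",
"for col in ct_cols:\n",
"    toy_df[col] = 0\n",
"toy_df.loc[0, 'ct_1'] = 1\n",
"toy_df.loc[1, 'ct_8'] = 2\n",
"toy_df.loc[2, 'ct_3'] = 1\n",
"toy_df.loc[3, 'ct_9'] = 3\n",
"\n",
"# Split rows by the 'set' column\n",
"train_df = toy_df[toy_df['set'] == 'training']\n",
"test_df = toy_df[toy_df['set'] == 'test']\n",
"\n",
"# Total reads per sequence, summed across the 10 FACS bins\n",
"total_ct = toy_df[ct_cols].sum(axis=1)\n",
"\n",
"# Total reads per FACS bin, summed across sequences\n",
"bin_totals = toy_df[ct_cols].sum(axis=0)\n",
"```"
]
},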
{
"cell_type": "markdown",
"id": "6fe018cb",
"metadata": {},
"source": [
"## Preprocessing"
]
},
{
"cell_type": "markdown",
"id": "51ae238c",
"metadata": {},
"source": [
"The sort-seq MPRA dataset of Kinney et al., (2010) is available at https://github.com/jbkinney/09_sortseq/ in file `file_S2.txt.gz`. It is formatted as follows: the `'seq'` column lists (non-unique) variant 75 nt DNA sequences observed by high-throughput seuqencing, the `'experiment'` column lists which of the six reported experiments produced that sequence, and the `'bin'` column lists the FACS bin in which that sequence was observed. This dataframe is called `raw_df` in what follows."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "29aecf0a",
"metadata": {
"ExecuteTime": {
"end_time": "2021-11-11T21:15:24.723828Z",
"start_time": "2021-11-11T21:15:23.497489Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" experiment bin x\n",
"0 crp-wt B0 AATTAAGGGCAGTTAACTCACCCATTAGGCACCCCAGGCTTTACAC...\n",
"1 crp-wt B0 AATTAATATGAGTTTGCTCACCCATTAGGCACCCCAGGCTTTACAC...\n",
"2 crp-wt B0 AATTAATAAGAGTTCACTCACTCATACGGCACCCCAGGCTTTACAC...\n",
"3 crp-wt B0 AATTTATGTGCTTTACCTCACTGATTTGGCACCCCAGGCTTTACAC...\n",
"4 crp-wt B0 AATTAAGGTGAGTTCGCTCGCTCATGAGGCACCCCAGGCTTTACAC..."
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Download dataset\n",
"url = 'https://github.com/jbkinney/09_sortseq/raw/master/file_S2.txt.gz'\n",
"raw_data_file = 'file_S2.txt.gz'\n",
"urllib.request.urlretrieve(url, raw_data_file)\n",
"\n",
"# Load raw dataset\n",
"raw_df = pd.read_csv(raw_data_file, \n",
" sep='\\t',\n",
" header=None, \n",
" names=['experiment','bin','x'], \n",
" compression='gzip')\n",
"\n",
"# Delete raw dataset\n",
"os.remove(raw_data_file)\n",
"\n",
"# Preview raw_df\n",
"raw_df.head()"
]
},
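{
"cell_type": "markdown",
"id": "b58d03f6",
"metadata": {},
"source": [
"Because `raw_df` stores one row per sequencing read, simple `groupby()` summaries give a quick overview of its contents. The sketch below uses a toy dataframe with the same three columns (its rows are invented for illustration) to count reads per experiment and bin, and unique sequences per experiment; the same calls could be applied to `raw_df`.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy long-format dataframe with the same three columns as raw_df.\n",
"# Rows are invented for illustration only.\n",
"toy_raw_df = pd.DataFrame({\n",
"    'experiment': ['crp-wt', 'crp-wt', 'full-wt', 'full-wt', 'full-wt'],\n",
"    'bin': ['B0', 'B1', 'B0', 'B0', 'B3'],\n",
"    'x': ['AATT', 'AATA', 'AATT', 'AATT', 'ACTT'],\n",
"})\n",
"\n",
"# Number of reads per (experiment, bin) pair\n",
"reads_per_bin = toy_raw_df.groupby(['experiment', 'bin']).size()\n",
"\n",
"# Number of distinct sequences seen in each experiment\n",
"unique_per_expt = toy_raw_df.groupby('experiment')['x'].nunique()\n",
"```"
]
},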
{
"cell_type": "markdown",
"id": "40811846",
"metadata": {},
"source": [
"To reformat `'raw_df'` into the one provided with MAVE-NN, we first trim the dataframe to keep only rows corresponding to the `'full-wt'` experiment. We then rename each FACS bin `'BX'` to `'ct_X'` for X = 0, 1, ..., 9, and create a `'ct'` column filled with ones. The result is stored in a dataframe called `sub_df`.\n",
"\n",
"Next we use the `pivot()` and `groupby()` functions in Pandas to obtain a dataframe in which the `'seq'` column lists only unique sequences, each of the 10 possible `'ct_X'` values in the original `'bin'` column now label a separate column, and the values in these new columns report the number of times each sequence was observed in each FACS bin. The result is stored in a dataframe called `pivot_df`.\n",
"\n",
"Finally, we create a `'set'` column that randomly assigns each sequence to the training, test, or validation set (using a 60:20:20 split), then reorder the columns for clarity. The resulting dataframe is called `final_df`."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "31f260f4",
"metadata": {
"ExecuteTime": {
"end_time": "2021-11-11T21:15:24.857703Z",
"start_time": "2021-11-11T21:15:24.724567Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" set ct_0 ct_1 ct_2 ct_3 ct_4 ct_5 ct_6 ct_7 ct_8 ct_9 \\\n",
"0 training 0 1 0 0 0 0 0 0 0 0 \n",
"1 test 0 0 0 0 0 0 0 0 1 0 \n",
"2 test 0 0 0 0 0 0 1 0 0 0 \n",
"3 training 0 0 0 0 0 0 0 0 0 1 \n",
"4 training 0 0 0 0 0 0 0 0 0 1 \n",
"\n",
" x \n",
"0 AAAAAAAGTGAGTTAGCCAACTAATTAGGCACCGTACGCTTTATAG... \n",
"1 AAAAAATCTGAGTTAGCTTACTCATTAGGCACCCCAGGCTTGACAC... \n",
"2 AAAAAATCTGAGTTTGCTCACTCTATCGGCACCCCAGTCTTTACAC... \n",
"3 AAAAAATGAGAGTTAGTTCACTCATTCGGCACCACAGGCTTTACAA... \n",
"4 AAAAAATGGGTGTTAGCTCTATCATTAGGCACCCCCGGCTTTACAC... "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Keep only data from the full-wt experiment\n",
"ix = raw_df['experiment']=='full-wt'\n",
"sub_df = raw_df[ix].copy().reset_index(drop=True)[['bin','x']]\n",
"\n",
"# Rename bins BX -> ct_X, where X = 0, 1, ..., 9\n",
"sub_df['bin'] = [f'ct_{s[1:]}' for s in sub_df['bin']]\n",
"\n",
"# Add counts column\n",
"sub_df['ct'] = 1\n",
"\n",
"# Pivot dataframe\n",
"pivot_df = sub_df.pivot_table(index='x', values='ct', columns='bin').fillna(0).astype(int)\n",
"pivot_df.columns.name = None\n",
"\n",
"# Groupby sequence\n",
"pivot_df = pivot_df.groupby('x').sum()\n",
"\n",
"# Reindex dataframe\n",
"pivot_df = pivot_df.reset_index()\n",
"\n",
"# Randomly assign sequences to training, validation, and test sets\n",
"final_df = pivot_df.copy()\n",
"np.random.seed(0)\n",
"final_df['set'] = np.random.choice(a=['training','test','validation'], \n",
" p=[.6,.2,.2], \n",
" size=len(final_df))\n",
"\n",
"# Rearrange columns\n",
"new_cols = ['set'] + list(final_df.columns[1:-1]) + ['x']\n",
"final_df = final_df[new_cols]\n",
"\n",
"# Save to file (uncomment to execute)\n",
"# final_df.to_csv('sortseq_data.csv.gz', index=False, compression='gzip')\n",
"\n",
"# Preview final_df\n",
"final_df.head()"
]
},
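{
"cell_type": "markdown",
"id": "c94f1a27",
"metadata": {},
"source": [
"The pivot step can be illustrated on a toy example. The sketch below uses made-up rows (not taken from the dataset) and passes `aggfunc='sum'` to `pivot_table()` so that repeated observations of a sequence in a bin are added up. Note that the default `aggfunc` is the mean, which for a column of ones returns 1 no matter how often a sequence was observed.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy long-format data: one row per sequencing read (invented for illustration)\n",
"toy_sub_df = pd.DataFrame({\n",
"    'bin': ['ct_0', 'ct_0', 'ct_3', 'ct_1'],\n",
"    'x': ['AATT', 'AATT', 'AATT', 'ACTT'],\n",
"})\n",
"toy_sub_df['ct'] = 1\n",
"\n",
"# One row per unique sequence, one column per bin, values = read counts\n",
"toy_pivot_df = toy_sub_df.pivot_table(index='x', columns='bin', values='ct',\n",
"                                      aggfunc='sum', fill_value=0)\n",
"toy_pivot_df.columns.name = None\n",
"toy_pivot_df = toy_pivot_df.reset_index()\n",
"```\n",
"\n",
"Here `'AATT'` appears twice in bin 0 and once in bin 3, so its row holds those counts, with zeros filled in for bins where a sequence was never seen."
]
},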
{
"cell_type": "markdown",
"id": "9c3f2c26",
"metadata": {},
"source": [
"This final dataframe, `final_df`, has the same format as the `'sortseq'` dataset that comes with MAVE-NN. \n",
"\n"
]
},
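{
"cell_type": "markdown",
"id": "d06b8e35",
"metadata": {},
"source": [
"A quick sanity check on the 60:20:20 split is to compare the observed fraction of sequences in each set with the requested probabilities. The sketch below is only an illustration: `n_seqs` is a hypothetical number of sequences, and a separate `RandomState` object is used so that the global NumPy seed is left untouched. With the real data, the same check could be run on `final_df['set']`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical number of sequences, chosen only for illustration\n",
"n_seqs = 10000\n",
"\n",
"# Draw set labels with the same probabilities as in the preprocessing step\n",
"rng_state = np.random.RandomState(0)\n",
"sets = rng_state.choice(a=['training', 'test', 'validation'],\n",
"                        p=[.6, .2, .2],\n",
"                        size=n_seqs)\n",
"\n",
"# Observed fraction of sequences assigned to each set\n",
"labels, counts = np.unique(sets, return_counts=True)\n",
"fractions = dict(zip(labels, counts / n_seqs))\n",
"```"
]
},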
{
"cell_type": "markdown",
"id": "902743b9",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.8"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 5
}