{ "cells": [ { "cell_type": "markdown", "id": "dca40d3f", "metadata": {}, "source": [ "# gb1 dataset" ] }, { "cell_type": "code", "execution_count": 1, "id": "43acb29b", "metadata": { "ExecuteTime": { "end_time": "2021-11-11T22:23:27.743043Z", "start_time": "2021-11-11T22:23:25.875662Z" } }, "outputs": [], "source": [ "# Standard imports\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# Special imports\n", "import mavenn" ] }, { "cell_type": "markdown", "id": "5336886a", "metadata": { "ExecuteTime": { "end_time": "2021-11-11T17:26:47.608641Z", "start_time": "2021-11-11T17:26:47.392567Z" } }, "source": [ "## Summary\n", "\n", "This is the deep mutational scanning (DMS) dataset from Olson et al., 2014. The authors used an mRNA display selection experiment to assay the binding of over half a million protein GB1 variants to IgG. These variants included all 1-point and 2-point mutations within the 55-residue GB1 sequence. Only the 2-point variants are included in this dataset.\n", "\n", "**Name:** ``'gb1'``\n", "\n", "**Reference:** Olson C, Wu N, Sun R. A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain. [Curr Biol 24(22):2643-2651 (2014).](https://pubmed.ncbi.nlm.nih.gov/25455030/)" ] }, { "cell_type": "code", "execution_count": 2, "id": "3072cf25", "metadata": { "ExecuteTime": { "end_time": "2021-11-11T22:23:28.280522Z", "start_time": "2021-11-11T22:23:27.744157Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>set</th><th>dist</th><th>input_ct</th><th>selected_ct</th><th>y</th><th>x</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>training</td><td>2</td><td>173</td><td>33</td><td>-3.145154</td><td>AAKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"    <tr><th>1</th><td>training</td><td>2</td><td>18</td><td>8</td><td>-1.867676</td><td>ACKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"    <tr><th>2</th><td>training</td><td>2</td><td>66</td><td>2</td><td>-5.270800</td><td>ADKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"    <tr><th>3</th><td>training</td><td>2</td><td>72</td><td>1</td><td>-5.979498</td><td>AEKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"    <tr><th>4</th><td>training</td><td>2</td><td>69</td><td>168</td><td>0.481923</td><td>AFKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>530732</th><td>training</td><td>2</td><td>462</td><td>139</td><td>-2.515259</td><td>QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"    <tr><th>530733</th><td>training</td><td>2</td><td>317</td><td>84</td><td>-2.693165</td><td>QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"    <tr><th>530734</th><td>training</td><td>2</td><td>335</td><td>77</td><td>-2.896589</td><td>QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"    <tr><th>530735</th><td>training</td><td>2</td><td>148</td><td>28</td><td>-3.150861</td><td>QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"    <tr><th>530736</th><td>training</td><td>2</td><td>95</td><td>16</td><td>-3.287173</td><td>QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>530737 rows × 6 columns</p>\n",
"</div>
" ], "text/plain": [ " set dist input_ct selected_ct y \\\n", "0 training 2 173 33 -3.145154 \n", "1 training 2 18 8 -1.867676 \n", "2 training 2 66 2 -5.270800 \n", "3 training 2 72 1 -5.979498 \n", "4 training 2 69 168 0.481923 \n", "... ... ... ... ... ... \n", "530732 training 2 462 139 -2.515259 \n", "530733 training 2 317 84 -2.693165 \n", "530734 training 2 335 77 -2.896589 \n", "530735 training 2 148 28 -3.150861 \n", "530736 training 2 95 16 -3.287173 \n", "\n", " x \n", "0 AAKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "1 ACKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "2 ADKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "3 AEKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "4 AFKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "... ... \n", "530732 QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "530733 QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "530734 QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "530735 QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "530736 QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "\n", "[530737 rows x 6 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mavenn.load_example_dataset('gb1')" ] }, { "cell_type": "markdown", "id": "6fe018cb", "metadata": {}, "source": [ "## Preprocessing\n", "\n", "First we load the double-mutation dataset published by Olson et al. (2014).\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "ecccf7df", "metadata": { "ExecuteTime": { "end_time": "2021-11-11T17:34:28.129865Z", "start_time": "2021-11-11T17:34:28.124430Z" }, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Mut1 WT amino acid</th><th>Mut1 Position</th><th>Mut1 Mutation</th><th>Mut2 WT amino acid</th><th>Mut2 Position</th><th>Mut2 Mutation</th><th>Input Count</th><th>Selection Count</th><th>Mut1 Fitness</th><th>Mut2 Fitness</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Q</td><td>2</td><td>A</td><td>Y</td><td>3</td><td>A</td><td>173</td><td>33</td><td>1.518</td><td>0.579</td></tr>\n",
"    <tr><th>1</th><td>Q</td><td>2</td><td>A</td><td>Y</td><td>3</td><td>C</td><td>18</td><td>8</td><td>1.518</td><td>0.616</td></tr>\n",
"    <tr><th>2</th><td>Q</td><td>2</td><td>A</td><td>Y</td><td>3</td><td>D</td><td>66</td><td>2</td><td>1.518</td><td>0.010</td></tr>\n",
"    <tr><th>3</th><td>Q</td><td>2</td><td>A</td><td>Y</td><td>3</td><td>E</td><td>72</td><td>1</td><td>1.518</td><td>0.009</td></tr>\n",
"    <tr><th>4</th><td>Q</td><td>2</td><td>A</td><td>Y</td><td>3</td><td>F</td><td>69</td><td>168</td><td>1.518</td><td>1.054</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>535912</th><td>E</td><td>56</td><td>Y</td><td>T</td><td>55</td><td>R</td><td>462</td><td>139</td><td>0.190</td><td>0.941</td></tr>\n",
"    <tr><th>535913</th><td>E</td><td>56</td><td>Y</td><td>T</td><td>55</td><td>S</td><td>317</td><td>84</td><td>0.190</td><td>0.840</td></tr>\n",
"    <tr><th>535914</th><td>E</td><td>56</td><td>Y</td><td>T</td><td>55</td><td>V</td><td>335</td><td>77</td><td>0.190</td><td>0.669</td></tr>\n",
"    <tr><th>535915</th><td>E</td><td>56</td><td>Y</td><td>T</td><td>55</td><td>W</td><td>148</td><td>28</td><td>0.190</td><td>0.798</td></tr>\n",
"    <tr><th>535916</th><td>E</td><td>56</td><td>Y</td><td>T</td><td>55</td><td>Y</td><td>95</td><td>16</td><td>0.190</td><td>0.663</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>535917 rows × 10 columns</p>\n",
"</div>
" ], "text/plain": [ " Mut1 WT amino acid Mut1 Position Mut1 Mutation Mut2 WT amino acid \\\n", "0 Q 2 A Y \n", "1 Q 2 A Y \n", "2 Q 2 A Y \n", "3 Q 2 A Y \n", "4 Q 2 A Y \n", "... ... ... ... ... \n", "535912 E 56 Y T \n", "535913 E 56 Y T \n", "535914 E 56 Y T \n", "535915 E 56 Y T \n", "535916 E 56 Y T \n", "\n", " Mut2 Position Mut2 Mutation Input Count Selection Count \\\n", "0 3 A 173 33 \n", "1 3 C 18 8 \n", "2 3 D 66 2 \n", "3 3 E 72 1 \n", "4 3 F 69 168 \n", "... ... ... ... ... \n", "535912 55 R 462 139 \n", "535913 55 S 317 84 \n", "535914 55 V 335 77 \n", "535915 55 W 148 28 \n", "535916 55 Y 95 16 \n", "\n", " Mut1 Fitness Mut2 Fitness \n", "0 1.518 0.579 \n", "1 1.518 0.616 \n", "2 1.518 0.010 \n", "3 1.518 0.009 \n", "4 1.518 1.054 \n", "... ... ... \n", "535912 0.190 0.941 \n", "535913 0.190 0.840 \n", "535914 0.190 0.669 \n", "535915 0.190 0.798 \n", "535916 0.190 0.663 \n", "\n", "[535917 rows x 10 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The dataset is available at this URL:\n", "# url = 'https://ars.els-cdn.com/content/image/1-s2.0-S0960982214012688-mmc2.xlsx'\n", "\n", "# We have downloaded this Excel file and reformatted it into a more parseable form\n", "raw_data_file = '../../mavenn/examples/datasets/raw/gb1_raw.xlsx'\n", "\n", "# Load data (takes a while)\n", "double_mut_df = pd.read_excel(raw_data_file, sheet_name='double_mutants')\n", "double_mut_df" ] }, { "cell_type": "markdown", "id": "e865c5e0", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Next we reconstruct the wild-type GB1 sequence." ] }, { "cell_type": "code", "execution_count": 4, "id": "c7dc8e80", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WT sequence: QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE\n" ] } ], "source": [ "# Get unique WT pos-aa associations, sorted by position\n", "wt_1_df = double_mut_df[['Mut1 Position', 
'Mut1 WT amino acid']].copy()\n", "wt_1_df.columns = ['pos','aa']\n", "wt_2_df = double_mut_df[['Mut2 Position', 'Mut2 WT amino acid']].copy()\n", "wt_2_df.columns = ['pos','aa']\n", "wt_seq_df = pd.concat([wt_1_df, wt_2_df], axis=0).drop_duplicates().sort_values(by='pos').reset_index(drop=True)\n", "\n", "# Confirm that each position maps to exactly one WT amino acid\n", "assert np.all(wt_seq_df['pos'].value_counts()==1)\n", "\n", "# Confirm that the positions run contiguously from 2 to L+1\n", "L = len(wt_seq_df)\n", "assert set(wt_seq_df['pos'].values) == set(range(2,L+2))\n", "\n", "# Compute wt_seq and confirm its identity\n", "wt_seq = ''.join(wt_seq_df['aa'])\n", "known_wt_seq = 'QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE'\n", "assert wt_seq == known_wt_seq\n", "\n", "# Print final wt sequence\n", "print(f'WT sequence: {wt_seq}')" ] }, { "cell_type": "markdown", "id": "adac1d2d", "metadata": { "ExecuteTime": { "end_time": "2021-11-11T22:23:57.928708Z", "start_time": "2021-11-11T22:23:57.115529Z" }, "pycharm": { "name": "#%% md\n" } }, "source": [ "Next we convert the list of mutations to an array `x` of variant sequences." 
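, "\n", "\n", "A minimal sketch of one way this conversion could be done, not necessarily the code used to build the dataset. It assumes the `Mut1`/`Mut2` column names from the double-mutant table and that positions start at 2, as the wild-type check confirms. The helper name `mutations_to_seqs` and the toy dataframe are hypothetical.

```python
import numpy as np
import pandas as pd

def mutations_to_seqs(df, wt_seq, first_pos=2):
    # Hypothetical helper: build one variant sequence per row of df.
    # Start from one copy of the wild-type sequence per row, as a character matrix.
    x_arr = np.tile(np.array(list(wt_seq)), (len(df), 1))
    rows = np.arange(len(df))
    for k in (1, 2):
        # Shift protein positions (which start at first_pos) to 0-based indices.
        idx = df[f'Mut{k} Position'].to_numpy() - first_pos
        # Overwrite the wild-type residue with the mutant residue in each row.
        x_arr[rows, idx] = df[f'Mut{k} Mutation'].to_numpy()
    # Join each row of characters back into a single sequence string.
    return np.array([''.join(r) for r in x_arr])

wt_seq = 'QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE'

# Toy input with the same columns as the double-mutant table (illustrative values).
toy_df = pd.DataFrame({
    'Mut1 Position': [2, 56],
    'Mut1 Mutation': ['A', 'Y'],
    'Mut2 Position': [3, 55],
    'Mut2 Mutation': ['C', 'R'],
})

x_toy = mutations_to_seqs(toy_df, wt_seq)
print(x_toy)
```

Editing a character matrix and joining it once per row avoids repeated string slicing, which matters for a table with more than half a million rows."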
] }, { "cell_type": "code", "execution_count": 5, "id": "7fc92598", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "array(['AAKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE',\n", " 'ACKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE',\n", " 'ADKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE', ...,\n", " 'QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVVY',\n", " 'QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVWY',\n", " 'QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVYY'],\n", " dtype='\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
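<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>set</th><th>dist</th><th>input_ct</th><th>selected_ct</th><th>y</th><th>x</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>training</td><td>2</td><td>173</td><td>33</td><td>-3.145154</td><td>AAKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"    <tr><th>1</th><td>training</td><td>2</td><td>18</td><td>8</td><td>-1.867676</td><td>ACKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"    <tr><th>2</th><td>training</td><td>2</td><td>66</td><td>2</td><td>-5.270800</td><td>ADKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"    <tr><th>3</th><td>training</td><td>2</td><td>72</td><td>1</td><td>-5.979498</td><td>AEKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"    <tr><th>4</th><td>training</td><td>2</td><td>69</td><td>168</td><td>0.481923</td><td>AFKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>530732</th><td>training</td><td>2</td><td>462</td><td>139</td><td>-2.515259</td><td>QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"    <tr><th>530733</th><td>training</td><td>2</td><td>317</td><td>84</td><td>-2.693165</td><td>QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"    <tr><th>530734</th><td>training</td><td>2</td><td>335</td><td>77</td><td>-2.896589</td><td>QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"    <tr><th>530735</th><td>training</td><td>2</td><td>148</td><td>28</td><td>-3.150861</td><td>QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"    <tr><th>530736</th><td>training</td><td>2</td><td>95</td><td>16</td><td>-3.287173</td><td>QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>530737 rows × 6 columns</p>\n",
"</div>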
\n", "" ], "text/plain": [ " set dist input_ct selected_ct y \\\n", "0 training 2 173 33 -3.145154 \n", "1 training 2 18 8 -1.867676 \n", "2 training 2 66 2 -5.270800 \n", "3 training 2 72 1 -5.979498 \n", "4 training 2 69 168 0.481923 \n", "... ... ... ... ... ... \n", "530732 training 2 462 139 -2.515259 \n", "530733 training 2 317 84 -2.693165 \n", "530734 training 2 335 77 -2.896589 \n", "530735 training 2 148 28 -3.150861 \n", "530736 training 2 95 16 -3.287173 \n", "\n", " x \n", "0 AAKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "1 ACKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "2 ADKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "3 AEKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "4 AFKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "... ... \n", "530732 QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "530733 QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "530734 QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "530735 QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... \n", "530736 QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... 
\n", "\n", "[530737 rows x 6 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Assemble into dataframe\n", "final_df = pd.DataFrame({'set':sets, 'dist':2, 'input_ct':in_ct, 'selected_ct':out_ct, 'y':y, 'x':x})\n", "\n", "# Keep only sequences with input_ct >= 10\n", "final_df = final_df[final_df['input_ct']>=10].reset_index(drop=True)\n", "\n", "# Save to file (uncomment to execute)\n", "# final_df.to_csv('gb1_data.csv.gz', index=False, compression='gzip')\n", "\n", "# Preview dataframe\n", "final_df" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }