{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true, "pycharm": { "name": "#%% md\n" } }, "source": [ "# nisthal dataset" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Standard imports\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# Special imports\n", "import mavenn\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "The DMS dataset from Nisthal et al. (2019). The authors used a high-throughput protein stability assay to measure folding energies for single-mutant variants of GB1. Column `'x'` lists variant GB1 sequences (positions 2-56). Column `'y'` lists the Gibbs free energy of folding (i.e., $\\Delta G_F$) in units of kcal/mol; lower energy values correspond to increased protein stability. Sequences are not divided into training, validation, and test sets because this dataset is only used for validation in Tareen et al. (2021).\n", "\n", "**Name:** ``'nisthal'``\n", "\n", "**Reference**: Nisthal A, Wang CY, Ary ML, Mayo SL. Protein stability engineering insights revealed by domain-wide comprehensive mutagenesis. [Proc Natl Acad Sci USA 116:16367–16377 (2019)](https://pubmed.ncbi.nlm.nih.gov/31371509/)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>x</th><th>name</th><th>y</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>AYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>T02A</td><td>0.4704</td></tr>\n",
"    <tr><th>1</th><td>DYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>T02D</td><td>0.5538</td></tr>\n",
"    <tr><th>2</th><td>EYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>T02E</td><td>-0.1299</td></tr>\n",
"    <tr><th>3</th><td>FYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>T02F</td><td>-0.3008</td></tr>\n",
"    <tr><th>4</th><td>GYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>T02G</td><td>0.6680</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>913</th><td>TYTLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>K04T</td><td>-0.4815</td></tr>\n",
"    <tr><th>914</th><td>TYVLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>K04V</td><td>0.2696</td></tr>\n",
"    <tr><th>915</th><td>TYYLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>K04Y</td><td>-0.8246</td></tr>\n",
"    <tr><th>916</th><td>VYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>T02V</td><td>-1.3090</td></tr>\n",
"    <tr><th>917</th><td>YYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>T02Y</td><td>-0.1476</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>918 rows × 3 columns</p>\n",
"</div>
" ], "text/plain": [ " x name y\n", "0 AYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02A 0.4704\n", "1 DYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02D 0.5538\n", "2 EYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02E -0.1299\n", "3 FYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02F -0.3008\n", "4 GYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02G 0.6680\n", ".. ... ... ...\n", "913 TYTLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... K04T -0.4815\n", "914 TYVLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... K04V 0.2696\n", "915 TYYLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... K04Y -0.8246\n", "916 VYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02V -1.3090\n", "917 YYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02Y -0.1476\n", "\n", "[918 rows x 3 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mavenn.load_example_dataset('nisthal')" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Preprocessing\n", "\n", "First we load and preview the raw dataset published by Nisthal et al. (2019)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Sequence</th><th>Description</th><th>Ligand</th><th>Data</th><th>Units</th><th>Assay/Protocol</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD...</td><td>M01A</td><td>NaN</td><td>NaN</td><td>kcal/mol</td><td>ddG(deepseq)_Olson</td></tr>\n",
"    <tr><th>1</th><td>ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD...</td><td>M01A</td><td>NaN</td><td>NaN</td><td>kcal/mol</td><td>ddG_lit_fromOlson</td></tr>\n",
"    <tr><th>2</th><td>ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD...</td><td>M01A</td><td>NaN</td><td>-1.777</td><td>kcal/mol·M</td><td>m-value</td></tr>\n",
"    <tr><th>3</th><td>ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD...</td><td>M01A</td><td>NaN</td><td>-0.635</td><td>kcal/mol</td><td>FullMin</td></tr>\n",
"    <tr><th>4</th><td>ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD...</td><td>M01A</td><td>NaN</td><td>-0.510</td><td>kcal/mol</td><td>Rosetta SomeMin_ddG</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>18856</th><td>YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD...</td><td>M01Y</td><td>NaN</td><td>0.512</td><td>kcal/mol</td><td>SD of dG(H2O)_mean</td></tr>\n",
"    <tr><th>18857</th><td>YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD...</td><td>M01Y</td><td>NaN</td><td>0.680</td><td>kcal/mol</td><td>ddG(mAvg)_mean</td></tr>\n",
"    <tr><th>18858</th><td>YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD...</td><td>M01Y</td><td>NaN</td><td>2.691</td><td>M (Molar)</td><td>Cm</td></tr>\n",
"    <tr><th>18859</th><td>YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD...</td><td>M01Y</td><td>NaN</td><td>4.519</td><td>kcal/mol</td><td>dG(H2O)_mean</td></tr>\n",
"    <tr><th>18860</th><td>YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD...</td><td>M01Y</td><td>NaN</td><td>4.630</td><td>kcal/mol</td><td>dG(mAvg)_mean</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>18861 rows × 6 columns</p>\n",
"</div>
" ], "text/plain": [ " Sequence Description Ligand \\\n", "0 ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01A NaN \n", "1 ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01A NaN \n", "2 ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01A NaN \n", "3 ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01A NaN \n", "4 ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01A NaN \n", "... ... ... ... \n", "18856 YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01Y NaN \n", "18857 YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01Y NaN \n", "18858 YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01Y NaN \n", "18859 YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01Y NaN \n", "18860 YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01Y NaN \n", "\n", " Data Units Assay/Protocol \n", "0 NaN kcal/mol ddG(deepseq)_Olson \n", "1 NaN kcal/mol ddG_lit_fromOlson \n", "2 -1.777 kcal/mol·M m-value \n", "3 -0.635 kcal/mol FullMin \n", "4 -0.510 kcal/mol Rosetta SomeMin_ddG \n", "... ... ... ... 
\n", "18856 0.512 kcal/mol SD of dG(H2O)_mean \n", "18857 0.680 kcal/mol ddG(mAvg)_mean \n", "18858 2.691 M (Molar) Cm \n", "18859 4.519 kcal/mol dG(H2O)_mean \n", "18860 4.630 kcal/mol dG(mAvg)_mean \n", "\n", "[18861 rows x 6 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_data_file = '../../mavenn/examples/datasets/raw/nisthal_raw.csv'\n", "raw_df = pd.read_csv(raw_data_file)\n", "raw_df" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Next we do the following:\n", "- Select rows that have the value `'ddG(mAvg)_mean'` in the `'Assay/Protocol'` column.\n", "- Keep only the desired columns, and give them shorter names.\n", "- Remove position 1 from variant sequences, then drop every sequence that appears more than once.\n", "- Flip the sign of measured folding energies.\n", "- Drop variants with $\\Delta G$ values of exactly $+4$ kcal/mol, as these were not precisely measured.\n", "- Save the dataframe if desired." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>x</th><th>name</th><th>y</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>AYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>T02A</td><td>0.4704</td></tr>\n",
"    <tr><th>1</th><td>DYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>T02D</td><td>0.5538</td></tr>\n",
"    <tr><th>2</th><td>EYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>T02E</td><td>-0.1299</td></tr>\n",
"    <tr><th>3</th><td>FYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>T02F</td><td>-0.3008</td></tr>\n",
"    <tr><th>4</th><td>GYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>T02G</td><td>0.6680</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>808</th><td>TYTLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>K04T</td><td>-0.4815</td></tr>\n",
"    <tr><th>809</th><td>TYVLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>K04V</td><td>0.2696</td></tr>\n",
"    <tr><th>810</th><td>TYYLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>K04Y</td><td>-0.8246</td></tr>\n",
"    <tr><th>811</th><td>VYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>T02V</td><td>-1.3090</td></tr>\n",
"    <tr><th>812</th><td>YYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...</td><td>T02Y</td><td>-0.1476</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>813 rows × 3 columns</p>\n",
"</div>
" ], "text/plain": [ " x name y\n", "0 AYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02A 0.4704\n", "1 DYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02D 0.5538\n", "2 EYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02E -0.1299\n", "3 FYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02F -0.3008\n", "4 GYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02G 0.6680\n", ".. ... ... ...\n", "808 TYTLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... K04T -0.4815\n", "809 TYVLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... K04V 0.2696\n", "810 TYYLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... K04Y -0.8246\n", "811 VYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02V -1.3090\n", "812 YYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02Y -0.1476\n", "\n", "[813 rows x 3 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select rows that have the value `'ddG(mAvg)_mean'` in the `'Assay/Protocol'` column.\n", "data_df = raw_df[raw_df['Assay/Protocol']=='ddG(mAvg)_mean'].copy()\n", "\n", "# Keep only the desired columns, and given them shorter names\n", "data_df.rename(columns={'Sequence':'x', 'Data': 'y', 'Description':'name'}, inplace=True)\n", "cols_to_keep = ['x', 'name', 'y']\n", "data_df = data_df[cols_to_keep]\n", "\n", "# Remove position 1 from variant sequences and drop duplicate sequences\n", "data_df['x'] = data_df['x'].str[1:]\n", "data_df.drop_duplicates(subset='x', keep=False, inplace=True)\n", "\n", "# Flip the sign of measured folding energies\n", "data_df['y'] = -data_df['y']\n", "\n", "# Drop variants with $\\Delta G$ of exactly $+4$ kcal/mol, as these were not precisely measured.\n", "ix = data_df['y']==4\n", "data_df = data_df[~ix]\n", "data_df.reset_index(inplace=True, drop=True)\n", "\n", "# Save to file (uncomment to execute)\n", "# data_df.to_csv('nisthal_data.csv.gz', index=False, compression='gzip')\n", "data_df" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% 
md\n" } }, "source": [ "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }