{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true, "pycharm": { "name": "#%% md\n" } }, "source": [ "# nisthal dataset" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Standard imports\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# Special imports\n", "import mavenn\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "The DMS dataset from Nisthal et al. (2019). The authors used a high-throughput protein stability assay to measure folding energies for single-mutant variants of GB1. Column `'x'` lists variant GB1 sequences (positions 2-56). Column `'y'` lists the Gibbs free energy of folding (i.e., $\\Delta G_F$) in units of kcal/mol; lower energy values correspond to increased protein stability. Sequences are not divided into training, validation, and test sets because this dataset is only used for validation in Tareen et al. (2021).\n", "\n", "**Name:** ``'nisthal'``\n", "\n", "**Reference:** Nisthal A, Wang CY, Ary ML, Mayo SL. Protein stability engineering insights revealed by domain-wide comprehensive mutagenesis. [Proc Natl Acad Sci USA 116:16367–16377 (2019)](https://pubmed.ncbi.nlm.nih.gov/31371509/)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
|  | x | name | y |
|---|---|---|---|
| 0 | AYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02A | 0.4704 |
| 1 | DYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02D | 0.5538 |
| 2 | EYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02E | -0.1299 |
| 3 | FYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02F | -0.3008 |
| 4 | GYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02G | 0.6680 |
| ... | ... | ... | ... |
| 913 | TYTLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | K04T | -0.4815 |
| 914 | TYVLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | K04V | 0.2696 |
| 915 | TYYLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | K04Y | -0.8246 |
| 916 | VYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02V | -1.3090 |
| 917 | YYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02Y | -0.1476 |

918 rows × 3 columns

|  | Sequence | Description | Ligand | Data | Units | Assay/Protocol |
|---|---|---|---|---|---|---|
| 0 | ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01A | NaN | NaN | kcal/mol | ddG(deepseq)_Olson |
| 1 | ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01A | NaN | NaN | kcal/mol | ddG_lit_fromOlson |
| 2 | ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01A | NaN | -1.777 | kcal/mol·M | m-value |
| 3 | ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01A | NaN | -0.635 | kcal/mol | FullMin |
| 4 | ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01A | NaN | -0.510 | kcal/mol | Rosetta SomeMin_ddG |
| ... | ... | ... | ... | ... | ... | ... |
| 18856 | YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01Y | NaN | 0.512 | kcal/mol | SD of dG(H2O)_mean |
| 18857 | YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01Y | NaN | 0.680 | kcal/mol | ddG(mAvg)_mean |
| 18858 | YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01Y | NaN | 2.691 | M (Molar) | Cm |
| 18859 | YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01Y | NaN | 4.519 | kcal/mol | dG(H2O)_mean |
| 18860 | YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01Y | NaN | 4.630 | kcal/mol | dG(mAvg)_mean |

18861 rows × 6 columns

|  | x | name | y |
|---|---|---|---|
| 0 | AYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02A | 0.4704 |
| 1 | DYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02D | 0.5538 |
| 2 | EYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02E | -0.1299 |
| 3 | FYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02F | -0.3008 |
| 4 | GYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02G | 0.6680 |
| ... | ... | ... | ... |
| 808 | TYTLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | K04T | -0.4815 |
| 809 | TYVLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | K04V | 0.2696 |
| 810 | TYYLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | K04Y | -0.8246 |
| 811 | VYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02V | -1.3090 |
| 812 | YYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02Y | -0.1476 |

813 rows × 3 columns