nisthal dataset
[1]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Special imports
import mavenn
Summary
The DMS dataset from Nisthal et al. (2019). The authors used a high-throughput protein stability assay to measure folding energies for single-mutant variants of GB1. Column 'x' lists variant GB1 sequences (positions 2-56). Column 'name' gives the mutation label for each variant (e.g., T02A). Column 'y' lists the Gibbs free energy of folding (i.e., \(\Delta G_F\)) in units of kcal/mol; lower energy values correspond to increased protein stability. Sequences are not divided into training, validation, and test sets because this dataset is used only for validation in Tareen et al. (2021).
Name: 'nisthal'
Reference: Nisthal A, Wang CY, Ary ML, Mayo SL. Protein stability engineering insights revealed by domain-wide comprehensive mutagenesis. Proc Natl Acad Sci USA 116:16367–16377 (2019).
[2]:
mavenn.load_example_dataset('nisthal')
[2]:
 | x | name | y |
---|---|---|---|
0 | AYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02A | 0.4704 |
1 | DYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02D | 0.5538 |
2 | EYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02E | -0.1299 |
3 | FYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02F | -0.3008 |
4 | GYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02G | 0.6680 |
... | ... | ... | ... |
913 | TYTLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | K04T | -0.4815 |
914 | TYVLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | K04V | 0.2696 |
915 | TYYLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | K04Y | -0.8246 |
916 | VYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02V | -1.3090 |
917 | YYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02Y | -0.1476 |
918 rows × 3 columns
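The 'name' labels in the table above appear to follow the common wild-type residue, position, mutant residue convention (e.g., T02A), with positions counted from 1 while column 'x' starts at position 2. The sketch below shows one way to parse such a label and check it against a sequence. The helpers `parse_mutation` and `mutant_matches` are hypothetical and not part of mavenn, and the sequences are only the leading residues of two rows shown above.

```python
import re

def parse_mutation(name):
    """Split a label such as 'T02A' into (wild-type residue, position, mutant residue)."""
    match = re.fullmatch(r'([A-Z])(\d+)([A-Z])', name)
    if match is None:
        raise ValueError(f'unrecognized mutation label: {name!r}')
    wt, pos, mut = match.groups()
    return wt, int(pos), mut

def mutant_matches(x, name, first_position=2):
    """Check that sequence x carries the mutant residue named in the label.

    first_position=2 is an assumption based on column 'x' covering positions 2-56.
    """
    _, pos, mut = parse_mutation(name)
    return x[pos - first_position] == mut

# Leading residues of rows 0 and 913 from the table above (truncated for brevity)
print(parse_mutation('T02A'))
print(mutant_matches('AYKLIL', 'T02A'))
print(mutant_matches('TYTLIL', 'K04T'))
```

A check like this could be applied to every row of the loaded dataframe to confirm that 'x' and 'name' agree.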
Preprocessing
First we load and preview the raw dataset published by Nisthal et al. (2019).
[3]:
raw_data_file = '../../mavenn/examples/datasets/raw/nisthal_raw.csv'
raw_df = pd.read_csv(raw_data_file)
raw_df
[3]:
 | Sequence | Description | Ligand | Data | Units | Assay/Protocol |
---|---|---|---|---|---|---|
0 | ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01A | NaN | NaN | kcal/mol | ddG(deepseq)_Olson |
1 | ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01A | NaN | NaN | kcal/mol | ddG_lit_fromOlson |
2 | ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01A | NaN | -1.777 | kcal/mol·M | m-value |
3 | ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01A | NaN | -0.635 | kcal/mol | FullMin |
4 | ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01A | NaN | -0.510 | kcal/mol | Rosetta SomeMin_ddG |
... | ... | ... | ... | ... | ... | ... |
18856 | YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01Y | NaN | 0.512 | kcal/mol | SD of dG(H2O)_mean |
18857 | YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01Y | NaN | 0.680 | kcal/mol | ddG(mAvg)_mean |
18858 | YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01Y | NaN | 2.691 | M (Molar) | Cm |
18859 | YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01Y | NaN | 4.519 | kcal/mol | dG(H2O)_mean |
18860 | YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... | M01Y | NaN | 4.630 | kcal/mol | dG(mAvg)_mean |
18861 rows × 6 columns
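The raw file is in long format: each variant appears on several rows, one per entry in the 'Assay/Protocol' column. As a minimal sketch of how one might inspect and filter such a table, the toy dataframe below reuses a handful of rows shown in the preview above; `toy_raw` is an illustrative stand-in, not the real file.

```python
import pandas as pd

# Toy stand-in built from a few rows of the preview above
toy_raw = pd.DataFrame({
    'Description': ['M01A', 'M01A', 'M01Y', 'M01Y', 'M01Y'],
    'Data': [-1.777, -0.635, 0.680, 2.691, 4.519],
    'Units': ['kcal/mol·M', 'kcal/mol', 'kcal/mol', 'M (Molar)', 'kcal/mol'],
    'Assay/Protocol': ['m-value', 'FullMin', 'ddG(mAvg)_mean', 'Cm', 'dG(H2O)_mean'],
})

# Count how many rows each assay/protocol contributes
assay_counts = toy_raw['Assay/Protocol'].value_counts()
print(assay_counts)

# Keep only the rows for a single measurement type
selected = toy_raw[toy_raw['Assay/Protocol'] == 'ddG(mAvg)_mean']
print(selected)
```

Running `value_counts()` on the real `raw_df` is a quick way to see which measurement types are available before choosing one.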
Next we do the following:
- Select rows that have the value 'ddG(mAvg)_mean' in the 'Assay/Protocol' column.
- Keep only the desired columns, and give them shorter names.
- Remove position 1 from variant sequences and drop duplicate sequences.
- Flip the sign of measured folding energies.
- Drop variants with \(\Delta G\) values of exactly \(+4\) kcal/mol, as these were not precisely measured.
- Save the dataframe if desired.
[5]:
# Select rows that have the value `'ddG(mAvg)_mean'` in the `'Assay/Protocol'` column.
data_df = raw_df[raw_df['Assay/Protocol']=='ddG(mAvg)_mean'].copy()
# Keep only the desired columns, and give them shorter names
data_df.rename(columns={'Sequence':'x', 'Data': 'y', 'Description':'name'}, inplace=True)
cols_to_keep = ['x', 'name', 'y']
data_df = data_df[cols_to_keep]
# Remove position 1 from variant sequences and drop duplicate sequences
# (keep=False removes every copy of a duplicated sequence, not just the repeats)
data_df['x'] = data_df['x'].str[1:]
data_df.drop_duplicates(subset='x', keep=False, inplace=True)
# Flip the sign of measured folding energies
data_df['y'] = -data_df['y']
# Drop variants with $\Delta G$ of exactly $+4$ kcal/mol, as these were not precisely measured.
ix = data_df['y']==4
data_df = data_df[~ix]
data_df.reset_index(inplace=True, drop=True)
# Save to file (uncomment to execute)
# data_df.to_csv('nisthal_data.csv.gz', index=False, compression='gzip')
data_df
[5]:
 | x | name | y |
---|---|---|---|
0 | AYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02A | 0.4704 |
1 | DYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02D | 0.5538 |
2 | EYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02E | -0.1299 |
3 | FYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02F | -0.3008 |
4 | GYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02G | 0.6680 |
... | ... | ... | ... |
808 | TYTLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | K04T | -0.4815 |
809 | TYVLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | K04V | 0.2696 |
810 | TYYLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | K04Y | -0.8246 |
811 | VYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02V | -1.3090 |
812 | YYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... | T02Y | -0.1476 |
813 rows × 3 columns
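The processed table starts at position 2 (T02A) even though the raw file contains position 1 variants such as M01A. This is consistent with how `drop_duplicates(subset='x', keep=False)` behaves: once position 1 is stripped, variants that differed only at position 1 become identical and all copies are removed. The sketch below illustrates this on toy data; the five-residue sequences are shortened, illustrative stand-ins rather than real rows.

```python
import pandas as pd

# Toy variants: two differ only at position 1, two differ at position 2
toy = pd.DataFrame({
    'x': ['ATYKL', 'YTYKL', 'MAYKL', 'MDYKL'],
    'name': ['M01A', 'M01Y', 'T02A', 'T02D'],
})

# Strip position 1, as in the preprocessing cell above
toy['x'] = toy['x'].str[1:]

# keep=False drops every sequence that occurs more than once
kept_none = toy.drop_duplicates(subset='x', keep=False)

# keep='first' would instead retain one copy of each duplicated sequence
kept_first = toy.drop_duplicates(subset='x', keep='first')

print(kept_none['name'].tolist())
print(kept_first['name'].tolist())
```

With `keep=False` only the position 2 variants survive, whereas `keep='first'` would leave one position 1 variant carrying a sequence that no longer contains its mutation.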