nisthal dataset

[1]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Special imports
import mavenn

Summary

The DMS dataset from Nisthal et al. (2019). The authors used a high-throughput protein stability assay to measure folding energies for single-mutant variants of GB1. Column 'x' lists variant GB1 sequences (positions 2-56). Column 'y' lists the Gibbs free energy of folding (i.e., \(\Delta G_F\)) in units of kcal/mol; lower energy values correspond to increased protein stability. Sequences are not divided into training, validation, and test sets because this dataset is used only for validation in Tareen et al. (2021).

Name: 'nisthal'

Reference: Nisthal A, Wang CY, Ary ML, Mayo SL. Protein stability engineering insights revealed by domain-wide comprehensive mutagenesis. Proc Natl Acad Sci USA 116:16367–16377 (2019).

[2]:
mavenn.load_example_dataset('nisthal')
[2]:
x name y
0 AYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02A 0.4704
1 DYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02D 0.5538
2 EYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02E -0.1299
3 FYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02F -0.3008
4 GYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02G 0.6680
... ... ... ...
913 TYTLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... K04T -0.4815
914 TYVLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... K04V 0.2696
915 TYYLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... K04Y -0.8246
916 VYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02V -1.3090
917 YYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02Y -0.1476

918 rows × 3 columns
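The labels in the 'name' column appear to follow the pattern wild-type residue, position, mutant residue (e.g., 'T02A'). As a minimal sketch, assuming that pattern holds for every row, the hypothetical helper `parse_mutation` below (not part of mavenn) splits a label into its parts and checks it against the sequence prefixes visible in the preview. Because 'x' starts at position 2, position p sits at string index p - 2.

```python
import re
import pandas as pd

# Hypothetical helper (not part of mavenn): split a label such as 'T02A'
# into wild-type residue, 1-based position, and mutant residue.
_MUT_RE = re.compile(r'^([A-Z])(\d+)([A-Z])$')

def parse_mutation(name):
    match = _MUT_RE.match(name)
    if match is None:
        raise ValueError(f"Unrecognized mutation label: {name!r}")
    wt, pos, mut = match.groups()
    return wt, int(pos), mut

# Sequence prefixes and labels copied from the preview above
# (only the first six residues of each sequence are used here).
preview = pd.DataFrame({
    'x_prefix': ['AYKLIL', 'TYTLIL', 'TYYLIL', 'VYKLIL'],
    'name': ['T02A', 'K04T', 'K04Y', 'T02V'],
})

# Assumption: the first character of 'x' corresponds to position 2.
FIRST_POSITION = 2

checks = []
for x_prefix, name in zip(preview['x_prefix'], preview['name']):
    wt, pos, mut = parse_mutation(name)
    # The residue found at the mutated position should be the mutant residue.
    checks.append(x_prefix[pos - FIRST_POSITION] == mut)

preview['label_matches_sequence'] = checks
print(preview)
```

The same loop could be run over the full dataframe returned by `mavenn.load_example_dataset('nisthal')` to confirm that every label agrees with its sequence.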

Preprocessing

First we load and preview the raw dataset published by Nisthal et al. (2019).

[3]:
raw_data_file = '../../mavenn/examples/datasets/raw/nisthal_raw.csv'
raw_df = pd.read_csv(raw_data_file)
raw_df
[3]:
Sequence Description Ligand Data Units Assay/Protocol
0 ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01A NaN NaN kcal/mol ddG(deepseq)_Olson
1 ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01A NaN NaN kcal/mol ddG_lit_fromOlson
2 ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01A NaN -1.777 kcal/mol·M m-value
3 ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01A NaN -0.635 kcal/mol FullMin
4 ATYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01A NaN -0.510 kcal/mol Rosetta SomeMin_ddG
... ... ... ... ... ... ...
18856 YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01Y NaN 0.512 kcal/mol SD of dG(H2O)_mean
18857 YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01Y NaN 0.680 kcal/mol ddG(mAvg)_mean
18858 YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01Y NaN 2.691 M (Molar) Cm
18859 YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01Y NaN 4.519 kcal/mol dG(H2O)_mean
18860 YTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYD... M01Y NaN 4.630 kcal/mol dG(mAvg)_mean

18861 rows × 6 columns
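The raw table is in long format: each variant occupies several rows, one per entry in the 'Assay/Protocol' column. As a small illustration, the sketch below rebuilds the last five preview rows by hand (the 'Sequence' and 'Ligand' columns are omitted) and pivots them so that each measurement type becomes its own column. The name `wide` is illustrative only.

```python
import pandas as pd

# The last five preview rows shown above, re-entered by hand.
preview = pd.DataFrame({
    'Description': ['M01Y'] * 5,
    'Data': [0.512, 0.680, 2.691, 4.519, 4.630],
    'Units': ['kcal/mol', 'kcal/mol', 'M (Molar)', 'kcal/mol', 'kcal/mol'],
    'Assay/Protocol': ['SD of dG(H2O)_mean', 'ddG(mAvg)_mean', 'Cm',
                       'dG(H2O)_mean', 'dG(mAvg)_mean'],
})

# Pivot from long to wide: one row per variant, one column per measurement.
wide = preview.pivot(index='Description', columns='Assay/Protocol', values='Data')
print(wide)

# The preprocessing below keeps only this measurement type.
print(wide['ddG(mAvg)_mean'])
```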

Next we do the following:

- Select rows that have the value 'ddG(mAvg)_mean' in the 'Assay/Protocol' column.
- Keep only the desired columns, and give them shorter names.
- Remove position 1 from variant sequences and drop duplicate sequences.
- Flip the sign of measured folding energies.
- Drop variants with \(\Delta G\) values of exactly \(+4\) kcal/mol, as these were not precisely measured.
- Save the dataframe if desired.

[5]:
# Select rows that have the value `'ddG(mAvg)_mean'` in the `'Assay/Protocol'` column.
data_df = raw_df[raw_df['Assay/Protocol']=='ddG(mAvg)_mean'].copy()

# Keep only the desired columns, and give them shorter names
data_df.rename(columns={'Sequence':'x', 'Data': 'y', 'Description':'name'}, inplace=True)
cols_to_keep = ['x', 'name', 'y']
data_df = data_df[cols_to_keep]

# Remove position 1 from variant sequences and drop duplicate sequences
data_df['x'] = data_df['x'].str[1:]
data_df.drop_duplicates(subset='x', keep=False, inplace=True)

# Flip the sign of measured folding energies
data_df['y'] = -data_df['y']

# Drop variants with $\Delta G$ of exactly $+4$ kcal/mol, as these were not precisely measured.
ix = data_df['y']==4
data_df = data_df[~ix]
data_df.reset_index(inplace=True, drop=True)

# Save to file (uncomment to execute)
# data_df.to_csv('nisthal_data.csv.gz', index=False, compression='gzip')
data_df
[5]:
x name y
0 AYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02A 0.4704
1 DYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02D 0.5538
2 EYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02E -0.1299
3 FYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02F -0.3008
4 GYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02G 0.6680
... ... ... ...
808 TYTLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... K04T -0.4815
809 TYVLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... K04V 0.2696
810 TYYLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... K04Y -0.8246
811 VYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02V -1.3090
812 YYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... T02Y -0.1476

813 rows × 3 columns
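To see what each step does, it can help to run the same operations on a tiny table. The sketch below wraps the steps in a hypothetical function `preprocess` (not part of mavenn) and applies it to made-up three-residue sequences; the toy values are invented for illustration and are not taken from the Nisthal data.

```python
import pandas as pd

def preprocess(raw_df):
    """Apply the preprocessing steps above to a raw-format dataframe."""
    df = raw_df[raw_df['Assay/Protocol'] == 'ddG(mAvg)_mean'].copy()
    df = df.rename(columns={'Sequence': 'x', 'Data': 'y', 'Description': 'name'})
    df = df[['x', 'name', 'y']]
    # Remove position 1, then drop every sequence that occurs more than once
    # (keep=False discards all copies, not just the repeats).
    df['x'] = df['x'].str[1:]
    df = df.drop_duplicates(subset='x', keep=False)
    # Flip the sign, then drop values of exactly +4.
    df['y'] = -df['y']
    df = df[df['y'] != 4]
    return df.reset_index(drop=True)

# Made-up toy data: wild type 'MTC', six raw rows.
toy_raw = pd.DataFrame({
    'Sequence': ['MAC', 'MAC', 'ATC', 'YTC', 'MDC', 'MEC'],
    'Description': ['T2A', 'T2A', 'M1A', 'M1Y', 'T2D', 'T2E'],
    'Data': [0.5, 4.6, 0.2, 0.7, -4.0, 0.13],
    'Assay/Protocol': ['ddG(mAvg)_mean', 'dG(mAvg)_mean', 'ddG(mAvg)_mean',
                       'ddG(mAvg)_mean', 'ddG(mAvg)_mean', 'ddG(mAvg)_mean'],
})

toy_df = preprocess(toy_raw)
print(toy_df)
```

In this toy table the second row is removed by the assay filter, the two position-1 variants collapse to the same trimmed sequence 'TC' and are both discarded, and 'T2D' is dropped because its sign-flipped value is exactly 4. Only 'T2A' and 'T2E' remain.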