amyloid dataset
[1]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Special imports
import mavenn
import os
import urllib.request
Summary
The deep mutational scanning (DMS) dataset of Seuma et al., 2021. Amyloid beta (A\(\beta\)) is a small protein suspected to play a significant role in Alzheimer’s disease. By mutating each position in the protein, Seuma et al. produced more than 14,000 variants of A\(\beta\) carrying single or double mutations. To quantify the impact of these mutations, they used an in vivo selection assay in yeast cells and measured how quickly each variant aggregated. This measurement is summarized in a variable called the nucleation score.
Names: 'amyloid'
Reference: Seuma M, Faure A, Badia M, Lehner B, Bolognesi B. The genetic landscape for amyloid beta fibril nucleation accurately discriminates familial Alzheimer’s disease mutations. eLife 10:e63364 (2021).
[2]:
mavenn.load_example_dataset('amyloid')
[2]:
| | set | dist | y | dy | x |
|---|---|---|---|---|---|
| 0 | training | 1 | -0.117352 | 0.387033 | KAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA |
| 1 | training | 1 | 0.352500 | 0.062247 | NAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA |
| 2 | training | 1 | -2.818013 | 1.068137 | TAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA |
| 3 | training | 1 | 0.121805 | 0.376764 | SAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA |
| 4 | training | 1 | -2.404340 | 0.278486 | IAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA |
| ... | ... | ... | ... | ... | ... |
| 16061 | training | 2 | -0.151502 | 0.389821 | DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVKV |
| 16062 | training | 2 | -1.360708 | 0.370517 | DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVLV |
| 16063 | training | 2 | -0.996816 | 0.346949 | DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVMV |
| 16064 | training | 2 | -3.238403 | 0.429008 | DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVTV |
| 16065 | training | 2 | -1.141457 | 0.365638 | DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVVV |
16066 rows × 5 columns
Preprocessing
The DMS dataset of single and double mutations in A\(\beta\) from Seuma et al. (2021) is publicly available in Excel format on the Gene Expression Omnibus server. It is formatted as follows:
Single-mutant sequences are in the '1 aa change' sheet. For these sequences, the Pos column lists the amino acid (aa) position that was mutated, and the Mut column lists the mutant aa residue.
Double-mutant sequences are in the '2 aa changes' sheet. For these sequences, the Pos1 and Pos2 columns list the first and second aa positions that were mutated, and the Mut1 and Mut2 columns list the mutant residues at those positions, respectively.
Both sheets report the nucleation scores from three replicates, together with the average of these scores (nscore), weighted by their uncertainties, and the uncertainty of that average (sigma).
[3]:
# Download dataset
url = 'https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE151147&format=file&file=GSE151147%5FMS%5FBL%5FBB%5Fprocessed%5Fdata%2Exlsx'
raw_data_file = 'Abeta_raw_data.xlsx'
urllib.request.urlretrieve(url, raw_data_file)
# Record wild-type sequence
wt_seq = 'DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA'
# Read single mutation sheet from raw data
single_mut_df = pd.read_excel(raw_data_file, sheet_name='1 aa change')
# Read double mutation sheet from raw data
double_mut_df = pd.read_excel(raw_data_file, sheet_name='2 aa changes')
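Re-running the cell above downloads the Excel file again each time. A minimal sketch of one way to avoid that is shown below. The helper name download_if_missing is hypothetical and is not part of MAVE-NN or of the original notebook; it only wraps the same urllib.request.urlretrieve call in an existence check.

```python
import os
import urllib.request


def download_if_missing(url, path):
    """Download url to path unless a file already exists there.

    Hypothetical convenience helper; not part of MAVE-NN.
    """
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path

# Intended use in this notebook (assumes url and raw_data_file as defined above):
# download_if_missing(url, raw_data_file)
```

Note that the existence check does not verify that a previously downloaded file is complete, so a partially downloaded file would need to be deleted by hand.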
[4]:
# Preview single-mutant data
single_mut_df.head()
[4]:
| | Pos | WT_AA | Mut | Nham_nt | Nham_aa | Nmut_codons | STOP | mean_count | nscore1 | sigma1 | nscore2 | sigma2 | nscore3 | sigma3 | nscore | sigma |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | D | K | 2 | 1 | 1 | False | 210.500000 | -0.280176 | 0.482820 | 0.175372 | 0.647374 | NaN | NaN | -0.117352 | 0.387033 |
| 1 | 1 | D | N | 2 | 1 | 1 | False | 28544.000000 | 0.388480 | 0.112041 | 0.306589 | 0.077314 | 0.785219 | 0.299795 | 0.352500 | 0.062247 |
| 2 | 1 | D | T | 2 | 1 | 1 | False | 97.000000 | NaN | NaN | -2.818013 | 1.068137 | NaN | NaN | -2.818013 | 1.068137 |
| 3 | 1 | D | S | 2 | 1 | 1 | False | 150.666667 | 0.003406 | 0.525670 | 0.180478 | 0.622756 | 0.448936 | 1.086370 | 0.121805 | 0.376764 |
| 4 | 1 | D | I | 2 | 1 | 1 | False | 334.333333 | -2.364750 | 0.373224 | -2.579152 | 0.482386 | -2.074932 | 0.839842 | -2.404340 | 0.278486 |
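The nscore and sigma columns previewed above appear consistent with an inverse-variance weighted average of the replicate scores. The sketch below illustrates that calculation under this assumption; it is not taken from Seuma et al., and the function name combine_replicates is hypothetical. Replicates with missing values (NaN) are skipped.

```python
import numpy as np


def combine_replicates(scores, sigmas):
    """Inverse-variance weighted mean of replicate scores and its uncertainty.

    Assumed weighting scheme (weights = 1 / sigma**2); hypothetical helper.
    """
    scores = np.asarray(scores, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    # Keep only replicates for which both the score and its uncertainty exist
    ok = ~np.isnan(scores) & ~np.isnan(sigmas)
    weights = 1.0 / sigmas[ok] ** 2
    mean = np.sum(weights * scores[ok]) / np.sum(weights)
    sigma = 1.0 / np.sqrt(np.sum(weights))
    return mean, sigma


# Replicate values copied from the first previewed single-mutant row (D1K)
nscore, sigma = combine_replicates(scores=[-0.280176, 0.175372, np.nan],
                                   sigmas=[0.482820, 0.647374, np.nan])
print(nscore, sigma)
```

For this row, the result agrees closely with the tabulated nscore and sigma values. Other rows may differ if the original analysis applied additional corrections.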
[5]:
# Preview double-mutant data
double_mut_df.head()
[5]:
| | Pos2 | Mut2 | Pos1 | Mut1 | WT_AA1 | WT_AA2 | Nham_nt | Nham_aa | Nmut_codons | STOP | mean_count | nscore1 | sigma1 | nscore2 | sigma2 | nscore3 | sigma3 | nscore | sigma |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | E | 1 | E | D | A | 2 | 2 | 2 | False | 78.500000 | 0.160562 | 0.878728 | -1.908344 | 0.999612 | NaN | NaN | -0.741292 | 0.659978 |
| 1 | 2 | E | 1 | G | D | A | 2 | 2 | 2 | False | 139.500000 | -0.461932 | 0.679144 | -0.616485 | 0.715070 | NaN | NaN | -0.535229 | 0.492438 |
| 2 | 2 | E | 1 | N | D | A | 2 | 2 | 2 | False | 146.000000 | 0.143146 | 0.530710 | -0.181673 | 0.855333 | NaN | NaN | 0.052856 | 0.450957 |
| 3 | 2 | E | 1 | V | D | A | 2 | 2 | 2 | False | 133.333333 | -0.526572 | 0.551242 | -1.427565 | 0.708833 | -0.423844 | 1.053086 | -0.801619 | 0.402165 |
| 4 | 2 | E | 1 | Y | D | A | 2 | 2 | 2 | False | 62.000000 | -0.288245 | 0.876578 | NaN | NaN | NaN | NaN | -0.288245 | 0.876578 |
To reformat single_mut_df and double_mut_df into the format of the dataset provided with MAVE-NN, we first need the full amino acid sequence corresponding to each variant. For each single mutant, we use the Pos and Mut columns to replace one aa in the wild-type sequence. For each double mutant, we use the Pos1, Pos2, Mut1, and Mut2 columns to replace two aa in the wild-type sequence. Positions in the spreadsheet are 1-based, so 1 is subtracted when indexing into the sequence. The resulting lists of single- and double-mutant sequences are called single_mut_list and double_mut_list, respectively. These lists are then concatenated (using np.hstack) into the variable x.
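As a small illustration of the substitution step, the sketch below applies one or more point mutations to the wild-type sequence. The helper name apply_mutations is hypothetical. The two example mutations are inferred from sequences displayed in the dataset preview (a K at position 1, and K and V at positions 41 and 42).

```python
# Wild-type sequence, as recorded in the notebook
wt_seq = 'DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA'


def apply_mutations(seq, mutations):
    """Return seq with each (position, residue) substitution applied.

    Positions are 1-based, matching the Pos, Pos1, and Pos2 columns.
    Hypothetical helper for illustration only.
    """
    chars = list(seq)
    for pos, aa in mutations:
        chars[pos - 1] = aa  # convert 1-based position to 0-based index
    return ''.join(chars)


single = apply_mutations(wt_seq, [(1, 'K')])
double = apply_mutations(wt_seq, [(41, 'K'), (42, 'V')])
print(single)
print(double)
```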
Next, we stack the single- and double-mutant
nucleation scores (nscore) in y,
score uncertainties (sigma) in dy, and
Hamming distances in dists.
Finally, we create a set column that randomly assigns each sequence to the training, validation, or test set (using a 90:5:5 split), then assemble all columns into a single dataframe called final_df.
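Because each sequence is assigned to a set independently at random, the realized split only approximates the target proportions. The sketch below, which assumes the same seed and probabilities as the notebook and uses the row count reported for the dataset, shows one way to inspect the realized fractions.

```python
import numpy as np
import pandas as pd

# Same seed and probabilities as in the notebook; 16066 is the reported row count
np.random.seed(0)
n_seqs = 16066
sets = np.random.choice(a=['training', 'validation', 'test'],
                        p=[.9, .05, .05],
                        size=n_seqs)

# Fraction of sequences that landed in each set
fractions = pd.Series(sets).value_counts(normalize=True)
print(fractions)
```

The fractions fluctuate slightly around 0.90, 0.05, and 0.05; an exact split would instead require shuffling the indices and slicing them at fixed boundaries.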
[6]:
# Introduce single mutations into wt sequence and append to a list
single_mut_list = []
for mut_pos, mut_char in zip(single_mut_df['Pos'].values,
                             single_mut_df['Mut'].values):
    mut_seq = list(wt_seq)
    mut_seq[mut_pos-1] = mut_char
    single_mut_list.append(''.join(mut_seq))
# Introduce double mutations into wt sequence and append to list
double_mut_list = []
for mut1_pos, mut1_char, mut2_pos, mut2_char in zip(double_mut_df['Pos1'].values,
                                                    double_mut_df['Mut1'].values,
                                                    double_mut_df['Pos2'].values,
                                                    double_mut_df['Mut2'].values):
    mut_seq = list(wt_seq)
    mut_seq[mut1_pos-1] = mut1_char
    mut_seq[mut2_pos-1] = mut2_char
    double_mut_list.append(''.join(mut_seq))
# Stack single-mutant and double-mutant sequences
x = np.hstack([single_mut_list,
               double_mut_list])
# Stack single-mutant and double-mutant nucleation scores
y = np.hstack([single_mut_df['nscore'].values,
               double_mut_df['nscore'].values])
# Stack single-mutant and double-mutant nucleation score uncertainties
dy = np.hstack([single_mut_df['sigma'].values,
                double_mut_df['sigma'].values])
# List hamming distances
dists = np.hstack([1*np.ones(len(single_mut_df)),
                   2*np.ones(len(double_mut_df))]).astype(int)
# Assign each sequence to training, validation, or test set
np.random.seed(0)
sets = np.random.choice(a=['training', 'validation', 'test'],
                        p=[.9,.05,.05],
                        size=len(x))
# Assemble into dataframe
final_df = pd.DataFrame({'set':sets, 'dist':dists, 'y':y, 'dy':dy, 'x':x})
# # Save to file (uncomment to execute)
# final_df.to_csv('amyloid_data.csv.gz', index=False, compression='gzip')
# Preview dataframe
final_df
[6]:
| | set | dist | y | dy | x |
|---|---|---|---|---|---|
| 0 | training | 1 | -0.117352 | 0.387033 | KAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA |
| 1 | training | 1 | 0.352500 | 0.062247 | NAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA |
| 2 | training | 1 | -2.818013 | 1.068137 | TAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA |
| 3 | training | 1 | 0.121805 | 0.376764 | SAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA |
| 4 | training | 1 | -2.404340 | 0.278486 | IAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA |
| ... | ... | ... | ... | ... | ... |
| 16061 | training | 2 | -0.151502 | 0.389821 | DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVKV |
| 16062 | training | 2 | -1.360708 | 0.370517 | DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVLV |
| 16063 | training | 2 | -0.996816 | 0.346949 | DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVMV |
| 16064 | training | 2 | -3.238403 | 0.429008 | DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVTV |
| 16065 | training | 2 | -1.141457 | 0.365638 | DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVVV |
16066 rows × 5 columns
This final dataframe, final_df, has the same format as the amyloid dataset that comes with MAVE-NN.
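One possible sanity check on the result is to recompute the Hamming distance of each sequence from the wild type and compare it with the dist column. The sketch below shows the idea on two sequences copied from the preview above; the function name hamming_distance is hypothetical.

```python
# Wild-type sequence, as recorded in the notebook
wt_seq = 'DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA'


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError('sequences must have the same length')
    return sum(c1 != c2 for c1, c2 in zip(a, b))


# One single mutant and one double mutant copied from the dataframe preview
examples = ['KAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA',
            'DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVKV']
dists_check = [hamming_distance(wt_seq, seq) for seq in examples]
print(dists_check)
```

Applied to the full dataframe, the same function could be mapped over final_df['x'] and the result compared element-wise with final_df['dist'].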