tdp43 dataset¶
[1]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Special imports
import mavenn
import os
import urllib
Summary¶
The deep mutagenesis dataset of Bolognesi et al., 2019. TAR DNA-binding protein 43 (TDP-43) is a heterogeneous nuclear ribonucleoprotein (hnRNP) that resides in the cell nucleus and plays a key role in regulating gene expression. Several neurodegenerative disorders have been associated with cytoplasmic aggregation of TDP-43, including amyotrophic lateral sclerosis (ALS), frontotemporal lobar degeneration (FTLD), and Alzheimer's, Parkinson's, and Huntington's diseases. Bolognesi et al. performed a comprehensive deep mutagenesis, using error-prone oligonucleotide synthesis to mutate the prion-like domain (PRD) of TDP-43, and reported toxicity as a function of 1266 single and 56730 double mutations.
Names: 'tdp43'
Reference: Bolognesi B, Faure AJ, Seuma M, Schmiedel JM, Tartaglia GG, Lehner B. The mutational landscape of a prion-like domain. Nat Commun 10:4162 (2019).
[2]:
mavenn.load_example_dataset('tdp43')
[2]:
| | set | dist | y | dy | x |
|---|---|---|---|---|---|
| 0 | training | 1 | 0.032210 | 0.037438 | NNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 1 | training | 1 | -0.009898 | 0.038981 | TNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 2 | training | 1 | -0.010471 | 0.005176 | RNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 3 | training | 1 | 0.030803 | 0.005341 | SNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 4 | training | 1 | -0.054716 | 0.035752 | INSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| ... | ... | ... | ... | ... | ... |
| 57991 | training | 2 | -0.009706 | 0.035128 | GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 57992 | validation | 2 | -0.030744 | 0.029436 | GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 57993 | validation | 2 | -0.086802 | 0.033174 | GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 57994 | training | 2 | -0.049587 | 0.029130 | GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
| 57995 | training | 2 | -0.105390 | 0.031189 | GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG... |
57996 rows × 5 columns
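The `set` column makes it straightforward to carve the dataset into training, validation, and test splits. A minimal sketch, using a small hypothetical dataframe in the same format rather than the full dataset:

```python
import pandas as pd

# Hypothetical miniature dataframe mimicking the tdp43 dataset format
data_df = pd.DataFrame({
    'set':  ['training', 'training', 'validation', 'test', 'training'],
    'dist': [1, 1, 2, 2, 1],
    'y':    [0.03, -0.01, -0.03, -0.09, 0.03],
    'dy':   [0.04, 0.04, 0.03, 0.03, 0.01],
    'x':    ['AAAA', 'CAAA', 'ACAA', 'AACA', 'AAAC'],
})

# Split according to the 'set' column
train_df = data_df[data_df['set'] == 'training'].reset_index(drop=True)
val_df   = data_df[data_df['set'] == 'validation'].reset_index(drop=True)
test_df  = data_df[data_df['set'] == 'test'].reset_index(drop=True)

print(len(train_df), len(val_df), len(test_df))  # 3 1 1
```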
Preprocessing¶
The deep mutagenesis dataset for single and double mutations in TDP-43 is publicly available (in Excel format) as Supplementary Data 3 in the supplementary information of the Bolognesi et al. paper.
It is formatted as follows:

- The wild-type sequence starts at absolute position 290.
- Single-mutant sequences are in the `1 AA change` sheet. For these sequences, the `Pos_abs` column lists the absolute position of the mutated amino acid (aa), and the `Mut` column gives the substituted residue.
- Double-mutant sequences are in the `2 AA change` sheet. For these sequences, the `Pos_abs1` and `Pos_abs2` columns list the absolute positions of the first and second mutated aa, and the `Mut1` and `Mut2` columns give the corresponding substituted residues.
- Both sheets report toxicity scores (measurements `y`) and corresponding uncertainties (`dy`).
- For single-mutant sequences, we will use the `toxicity` and `sigma` columns.
- For double-mutant sequences, we will use the corrected relative toxicity `toxicity_cond` and the corresponding corrected uncertainty `sigma_cond` (see the Methods section of the reference paper).
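Since `Pos_abs` is an absolute residue position starting at 290, `Pos_abs - 290` should index into the wild-type sequence and recover the `WT_AA` column. A minimal sanity-check sketch, using hypothetical rows rather than the downloaded sheet:

```python
import pandas as pd

# Wild-type PRD sequence (as recorded later in this notebook)
wt_seq = 'GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWGMMGMLASQQNQSGPSGNNQNQGNMQREPNQAFGSGNNS'

# Hypothetical rows mimicking the '1 AA change' sheet
single_mut_df = pd.DataFrame({
    'Pos_abs': [290, 291, 293],
    'WT_AA':   ['G', 'N', 'R'],
    'Mut':     ['N', 'T', 'K'],
})

# Absolute position 290 corresponds to index 0 in wt_seq
for pos, wt_aa in zip(single_mut_df['Pos_abs'], single_mut_df['WT_AA']):
    assert wt_seq[pos - 290] == wt_aa, f'Mismatch at position {pos}'
print('All WT_AA entries match the wild-type sequence')
```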
[5]:
# Download dataset
url = 'https://github.com/jbkinney/mavenn/blob/master/mavenn/examples/datasets/raw/tdp-43_raw.xlsx?raw=true'
raw_data_file = 'tdp-43_raw.xlsx'
urllib.request.urlretrieve(url, raw_data_file)
# Record wild-type sequence
wt_seq = 'GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWGMMGMLASQQNQSGPSGNNQNQGNMQREPNQAFGSGNNS'
# Read single mutation sheet from raw data
single_mut_df = pd.read_excel(raw_data_file, sheet_name='1 AA change')
# Read double mutation sheet from raw data
double_mut_df = pd.read_excel(raw_data_file, sheet_name='2 AA change')
# Delete raw dataset
os.remove(raw_data_file)
[6]:
# Preview single-mutant data
single_mut_df.head()
[6]:
| | Pos | WT_AA | Mut | Nmut_nt | Nmut_aa | Nmut_codons | STOP | mean_count | is.reads0 | sigma | toxicity | region | Pos_abs | mut_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | G | N | 2 | 1 | 1 | False | 22.000000 | True | 0.037438 | 0.032210 | 290 | 290 | G290N |
| 1 | 1 | G | T | 2 | 1 | 1 | False | 17.333333 | True | 0.038981 | -0.009898 | 290 | 290 | G290T |
| 2 | 1 | G | R | 2 | 1 | 1 | False | 3888.666667 | True | 0.005176 | -0.010471 | 290 | 290 | G290R |
| 3 | 1 | G | S | 2 | 1 | 1 | False | 3635.666667 | True | 0.005341 | 0.030803 | 290 | 290 | G290S |
| 4 | 1 | G | I | 2 | 1 | 1 | False | 21.666667 | True | 0.035752 | -0.054716 | 290 | 290 | G290I |
[7]:
# Preview double-mutant data
double_mut_df.head()
[7]:
| | Nmut_nt | Nmut_aa | Nmut_codons | STOP | mean_count | is.reads0 | Pos1 | Pos2 | WT_AA1 | WT_AA2 | ... | sigma_cond | toxicity1 | toxicity2 | toxicity_uncorr | toxicity_cond | region | Pos_abs1 | Pos_abs2 | mut_code1 | mut_code2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 2 | True | 16.333333 | True | 1 | 4 | G | R | ... | 0.020867 | 0.001282 | -0.174307 | -0.139949 | -0.169501 | 290 | 290 | 293 | G290A | R293* |
| 1 | 4 | 2 | 2 | True | 30.333333 | True | 1 | 4 | G | R | ... | 0.017555 | 0.007680 | -0.174307 | -0.206614 | -0.193387 | 290 | 290 | 293 | G290C | R293* |
| 2 | 2 | 2 | 2 | True | 43.333333 | True | 1 | 4 | G | R | ... | 0.017882 | 0.044342 | -0.174307 | -0.123376 | -0.142809 | 290 | 290 | 293 | G290D | R293* |
| 3 | 2 | 2 | 2 | True | 22.333333 | True | 1 | 4 | G | R | ... | 0.018913 | -0.010471 | -0.174307 | -0.136759 | -0.165018 | 290 | 290 | 293 | G290R | R293* |
| 4 | 2 | 2 | 2 | True | 29.333333 | True | 1 | 4 | G | R | ... | 0.021690 | 0.030803 | -0.174307 | -0.118746 | -0.153186 | 290 | 290 | 293 | G290S | R293* |
5 rows × 25 columns
To reformat `single_mut_df` and `double_mut_df` into the dataset provided with MAVE-NN, we first need the full amino-acid sequence corresponding to each mutation. For each record in the single-mutant dataset, we use the `Pos_abs` and `Mut` columns to substitute a single aa in the wild-type sequence. For the double mutants, we use `Pos_abs1`, `Pos_abs2`, `Mut1`, and `Mut2` to substitute two aa in the wild-type sequence. The resulting lists of single- and double-mutant sequences are called `single_mut_list` and `double_mut_list`, respectively. These lists are then concatenated into the `x` variable.
Next, we concatenate

- the single- and double-mutant toxicity scores `toxicity` and `toxicity_cond` into `y`,
- the score uncertainties `sigma` and `sigma_cond` into `dy`,
- the Hamming distances into `dist`.

Finally, we create a `set` column that randomly assigns each sequence to the training, validation, or test set (using a 90:5:5 split), then reorder the columns for clarity. The resulting dataframe is called `final_df`.
[ ]:
# Introduce single mutations into wt sequence and append to a list
single_mut_list = []
for mut_pos, mut_char in zip(single_mut_df['Pos_abs'].values,
single_mut_df['Mut'].values):
mut_seq = list(wt_seq)
mut_seq[mut_pos-290] = mut_char
single_mut_list.append(''.join(mut_seq))
# Introduce double mutations into wt sequence and append to list
double_mut_list = []
for mut1_pos, mut1_char, mut2_pos, mut2_char in zip(double_mut_df['Pos_abs1'].values,
double_mut_df['Mut1'].values,
double_mut_df['Pos_abs2'].values,
double_mut_df['Mut2'].values):
mut_seq = list(wt_seq)
mut_seq[mut1_pos-290] = mut1_char
mut_seq[mut2_pos-290] = mut2_char
double_mut_list.append(''.join(mut_seq))
# Stack single-mutant and double-mutant sequences
x = np.hstack([single_mut_list,
double_mut_list])
# Stack single-mutant and double-mutant toxicity scores
y = np.hstack([single_mut_df['toxicity'].values,
double_mut_df['toxicity_cond'].values])
# Stack single-mutant and double-mutant toxicity score uncertainties
dy = np.hstack([single_mut_df['sigma'].values,
double_mut_df['sigma_cond'].values])
# Record Hamming distances (1 for single mutants, 2 for double mutants)
dists = np.hstack([1*np.ones(len(single_mut_df)),
2*np.ones(len(double_mut_df))]).astype(int)
# Assign each sequence to training, validation, or test set
np.random.seed(0)
sets = np.random.choice(a=['training', 'validation', 'test'],
p=[.9,.05,.05],
size=len(x))
# Assemble into dataframe
final_df = pd.DataFrame({'set':sets, 'dist':dists, 'y':y, 'dy':dy, 'x':x})
# Save final dataframe to a compressed CSV file
final_df.to_csv('tdp43_data.csv.gz', index=False, compression='gzip')
# Preview dataframe
final_df
This final dataframe, `final_df`, has the same format as the `tdp43` dataset that comes with MAVE-NN.
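Because `final_df` is written with `compression='gzip'`, it can be read back with `pd.read_csv`, which infers gzip compression from the `.gz` extension. A small round-trip sketch, using a hypothetical miniature dataframe and a temporary directory rather than the real output file:

```python
import os
import tempfile
import pandas as pd

# Hypothetical miniature dataframe in the tdp43 format
final_df = pd.DataFrame({
    'set':  ['training', 'test'],
    'dist': [1, 2],
    'y':    [0.03, -0.09],
    'dy':   [0.04, 0.03],
    'x':    ['NNSR', 'NTSR'],
})

# Write and re-read the compressed CSV
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'tdp43_data.csv.gz')
    final_df.to_csv(path, index=False, compression='gzip')
    reloaded_df = pd.read_csv(path)  # compression inferred from '.gz'

# The round trip preserves both column order and values
assert list(reloaded_df.columns) == list(final_df.columns)
assert reloaded_df.equals(final_df)
```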