tdp43 dataset

[1]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Special imports
import mavenn
import os
import urllib.request

Summary

The deep mutagenesis dataset of Bolognesi et al., 2019. TAR DNA-binding protein 43 (TDP-43) is a heterogeneous nuclear ribonucleoprotein (hnRNP) found in the cell nucleus, where it plays a key role in regulating gene expression. Cytoplasmic aggregation of TDP-43 has been associated with several neurodegenerative disorders, including amyotrophic lateral sclerosis (ALS), frontotemporal lobar degeneration (FTLD), Alzheimer’s, Parkinson’s, and Huntington’s disease. Bolognesi et al. used error-prone oligonucleotide synthesis to comprehensively mutate the prion-like domain (PRD) of TDP-43 and reported toxicity as a function of 1266 single and 56730 double mutations.

Names: 'tdp43'

Reference: Bolognesi B, Faure AJ, Seuma M, Schmiedel JM, Tartaglia GG, Lehner B. The mutational landscape of a prion-like domain. Nat Commun 10:4162 (2019).

[2]:
mavenn.load_example_dataset('tdp43')
[2]:
set dist y dy x
0 training 1 0.032210 0.037438 NNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...
1 training 1 -0.009898 0.038981 TNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...
2 training 1 -0.010471 0.005176 RNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...
3 training 1 0.030803 0.005341 SNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...
4 training 1 -0.054716 0.035752 INSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...
... ... ... ... ... ...
57991 training 2 -0.009706 0.035128 GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...
57992 validation 2 -0.030744 0.029436 GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...
57993 validation 2 -0.086802 0.033174 GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...
57994 training 2 -0.049587 0.029130 GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...
57995 training 2 -0.105390 0.031189 GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...

57996 rows × 5 columns

Preprocessing

The deep mutagenesis dataset for single and double mutations in TDP-43 is publicly available (in Excel format) as Supplementary Data 3 in the supplementary information of Bolognesi et al.

It is formatted as follows:

  • The wild-type sequence starts at absolute position 290.

  • Single-mutant sequences are in the 1 AA change sheet. For these sequences, the Pos_abs column lists the absolute position of the mutated amino acid (aa), and the Mut column lists the residue it was mutated to.

  • Double-mutant sequences are in the 2 AA change sheet. For these sequences, the Pos_abs1 and Pos_abs2 columns list the absolute positions of the first and second mutated aa. The Mut1 and Mut2 columns list the mutant residues at positions 1 and 2, respectively.

  • Both sheets contain toxicity scores (measurements y) and corresponding uncertainties (dy).

    • For single-mutant sequences, we will use the toxicity and sigma columns.

    • For double-mutant sequences, we will use the corrected relative toxicity toxicity_cond and the corresponding corrected uncertainty sigma_cond (see the Methods section of the Reference paper).
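The mapping between absolute positions and indices into the wild-type sequence can be sketched as follows. This is only an illustration: parse_mut_code is a hypothetical helper, not part of MAVE-NN, and the assumed mut_code layout (wild-type residue, absolute position, mutant residue or '*' for a stop codon) is inferred from values such as G290N and R293* in the previews of the raw sheets.

```python
import re

# Wild-type PRD sequence as recorded in this notebook; its first residue
# sits at absolute position 290.
WT_SEQ = 'GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWGMMGMLASQQNQSGPSGNNQNQGNMQREPNQAFGSGNNS'
WT_START = 290

# Assumed mut_code layout: wild-type residue, absolute position, mutant
# residue (or '*' for a stop codon), e.g. 'G290N' or 'R293*'.
MUT_CODE_RE = re.compile(r'^([A-Z])(\d+)([A-Z*])$')


def parse_mut_code(mut_code, wt_seq=WT_SEQ, wt_start=WT_START):
    """Hypothetical helper: return (index, wt_aa, mut_aa) for one mut_code."""
    match = MUT_CODE_RE.match(mut_code)
    if match is None:
        raise ValueError(f'Unrecognized mut_code: {mut_code!r}')
    wt_aa = match.group(1)
    pos_abs = int(match.group(2))
    mut_aa = match.group(3)

    # Convert the absolute position to a 0-based index into wt_seq
    index = pos_abs - wt_start
    if not 0 <= index < len(wt_seq):
        raise ValueError(f'Position {pos_abs} lies outside the sequence')

    # Check that the stated wild-type residue agrees with wt_seq
    if wt_seq[index] != wt_aa:
        raise ValueError(f'{mut_code!r} expects {wt_aa} at position '
                         f'{pos_abs}, but wt_seq has {wt_seq[index]}')
    return index, wt_aa, mut_aa


print(parse_mut_code('G290N'))
print(parse_mut_code('R293*'))
```

A check like this can catch off-by-one errors in the position offset before any mutant sequences are built.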

[5]:
# Download dataset
url = 'https://github.com/jbkinney/mavenn/blob/master/mavenn/examples/datasets/raw/tdp-43_raw.xlsx?raw=true'
raw_data_file = 'tdp-43_raw.xlsx'
urllib.request.urlretrieve(url, raw_data_file)

# Record wild-type sequence
wt_seq = 'GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWGMMGMLASQQNQSGPSGNNQNQGNMQREPNQAFGSGNNS'

# Read single mutation sheet from raw data
single_mut_df = pd.read_excel(raw_data_file, sheet_name='1 AA change')

# Read double mutation sheet from raw data
double_mut_df = pd.read_excel(raw_data_file, sheet_name='2 AA change')

# Delete raw dataset
os.remove(raw_data_file)
[6]:
# Preview single-mutant data
single_mut_df.head()
[6]:
Pos WT_AA Mut Nmut_nt Nmut_aa Nmut_codons STOP mean_count is.reads0 sigma toxicity region Pos_abs mut_code
0 1 G N 2 1 1 False 22.000000 True 0.037438 0.032210 290 290 G290N
1 1 G T 2 1 1 False 17.333333 True 0.038981 -0.009898 290 290 G290T
2 1 G R 2 1 1 False 3888.666667 True 0.005176 -0.010471 290 290 G290R
3 1 G S 2 1 1 False 3635.666667 True 0.005341 0.030803 290 290 G290S
4 1 G I 2 1 1 False 21.666667 True 0.035752 -0.054716 290 290 G290I
[7]:
# Preview double-mutant data
double_mut_df.head()
[7]:
Nmut_nt Nmut_aa Nmut_codons STOP mean_count is.reads0 Pos1 Pos2 WT_AA1 WT_AA2 ... sigma_cond toxicity1 toxicity2 toxicity_uncorr toxicity_cond region Pos_abs1 Pos_abs2 mut_code1 mut_code2
0 2 2 2 True 16.333333 True 1 4 G R ... 0.020867 0.001282 -0.174307 -0.139949 -0.169501 290 290 293 G290A R293*
1 4 2 2 True 30.333333 True 1 4 G R ... 0.017555 0.007680 -0.174307 -0.206614 -0.193387 290 290 293 G290C R293*
2 2 2 2 True 43.333333 True 1 4 G R ... 0.017882 0.044342 -0.174307 -0.123376 -0.142809 290 290 293 G290D R293*
3 2 2 2 True 22.333333 True 1 4 G R ... 0.018913 -0.010471 -0.174307 -0.136759 -0.165018 290 290 293 G290R R293*
4 2 2 2 True 29.333333 True 1 4 G R ... 0.021690 0.030803 -0.174307 -0.118746 -0.153186 290 290 293 G290S R293*

5 rows × 25 columns

To reformat single_mut_df and double_mut_df into the format of the dataset provided with MAVE-NN, we first need the full amino acid sequence corresponding to each variant. For each record in the single-mutant dataset, we use the Pos_abs and Mut columns to replace a single aa in the wild-type sequence. For each record in the double-mutant dataset, we use Pos_abs1, Pos_abs2, Mut1, and Mut2 to replace two aa in the wild-type sequence. The resulting lists of single- and double-mutant sequences are called single_mut_list and double_mut_list, respectively. These lists are then concatenated (using np.hstack) into the x variable.

Next, we stack the single- and double-mutant values as follows:

  • toxicity scores toxicity and toxicity_cond in y

  • score uncertainties sigma and sigma_cond in dy

  • Hamming distances in dists

Finally, we create a set column that randomly assigns each sequence to the training, validation, or test set (using a 90:5:5 split), then order the columns for clarity. The resulting dataframe is called final_df.
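Because each sequence receives its label independently, the 90:5:5 split holds only approximately. The sketch below illustrates this using the standard-library random module instead of NumPy, so the labels it produces will not match those generated in this notebook. The function assign_sets is a hypothetical name introduced here for illustration.

```python
import random
from collections import Counter


def assign_sets(n, seed=0):
    """Hypothetical helper: draw n independent set labels with 90:5:5 odds."""
    rng = random.Random(seed)  # seeded generator, so the draw is reproducible
    return rng.choices(['training', 'validation', 'test'],
                       weights=[0.90, 0.05, 0.05],
                       k=n)


# Use the number of rows in the tdp43 dataset
sets = assign_sets(57996)
counts = Counter(sets)
for name in ['training', 'validation', 'test']:
    print(name, counts[name], round(counts[name] / len(sets), 4))
```

If exact set sizes were required, one alternative would be to shuffle the row indices and slice them at fixed boundaries; the notebook instead uses independent draws.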

[ ]:
# Introduce single mutations into wt sequence and append to a list
single_mut_list = []
for mut_pos, mut_char in zip(single_mut_df['Pos_abs'].values,
                             single_mut_df['Mut'].values):
    mut_seq = list(wt_seq)
    mut_seq[mut_pos-290] = mut_char
    single_mut_list.append(''.join(mut_seq))

# Introduce double mutations into wt sequence and append to list
double_mut_list = []
for mut1_pos, mut1_char, mut2_pos, mut2_char in zip(double_mut_df['Pos_abs1'].values,
                                                    double_mut_df['Mut1'].values,
                                                    double_mut_df['Pos_abs2'].values,
                                                    double_mut_df['Mut2'].values):
    mut_seq = list(wt_seq)
    mut_seq[mut1_pos-290] = mut1_char
    mut_seq[mut2_pos-290] = mut2_char
    double_mut_list.append(''.join(mut_seq))

# Stack single-mutant and double-mutant sequences
x = np.hstack([single_mut_list,
               double_mut_list])

# Stack single-mutant and double-mutant toxicity scores
y = np.hstack([single_mut_df['toxicity'].values,
               double_mut_df['toxicity_cond'].values])

# Stack single-mutant and double-mutant toxicity score uncertainties
dy = np.hstack([single_mut_df['sigma'].values,
                double_mut_df['sigma_cond'].values])

# List Hamming distances
dists = np.hstack([1*np.ones(len(single_mut_df)),
                   2*np.ones(len(double_mut_df))]).astype(int)

# Assign each sequence to training, validation, or test set
np.random.seed(0)
sets = np.random.choice(a=['training', 'validation', 'test'],
                        p=[.9,.05,.05],
                        size=len(x))

# Assemble into dataframe
final_df = pd.DataFrame({'set':sets, 'dist':dists, 'y':y, 'dy':dy, 'x':x})

# Save to file (uncomment to execute)
# final_df.to_csv('tdp43_data.csv.gz', index=False, compression='gzip')

# Preview dataframe
final_df

This final dataframe, final_df, has the same format as the tdp43 dataset that comes with MAVE-NN.
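As an optional sanity check, the dist column can be compared against Hamming distances computed directly from the sequences. The sketch below is self-contained and uses two hand-built variants; hamming_distance is a hypothetical helper, not a MAVE-NN function.

```python
def hamming_distance(seq_a, seq_b):
    """Count the positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError('Sequences must have the same length')
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Wild-type sequence as recorded earlier in this notebook
wt_seq = 'GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWGMMGMLASQQNQSGPSGNNQNQGNMQREPNQAFGSGNNS'

# Hand-built variants: G290N (single) and G290A + R293* (double)
single = 'N' + wt_seq[1:]
double = 'A' + wt_seq[1:3] + '*' + wt_seq[4:]

print(hamming_distance(wt_seq, single), hamming_distance(wt_seq, double))
# prints: 1 2
```

With final_df in memory, the same function could be applied to every row, for example with final_df['x'].apply(lambda s: hamming_distance(wt_seq, s)), and the result compared to final_df['dist'].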