tdp43 dataset

[1]:

# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Special imports
import mavenn
import os
import urllib.request

Summary

The deep mutagenesis dataset of Bolognesi et al., 2019. TAR DNA-binding protein 43 (TDP-43) is a heterogeneous nuclear ribonucleoprotein (hnRNP), located in the cell nucleus, that plays a key role in regulating gene expression. Several neurodegenerative disorders have been associated with cytoplasmic aggregation of TDP-43, including amyotrophic lateral sclerosis (ALS), frontotemporal lobar degeneration (FTLD), Alzheimer’s, Parkinson’s, and Huntington’s disease. Bolognesi et al. used error-prone oligonucleotide synthesis to comprehensively mutate the prion-like domain (PRD) of TDP-43 and reported toxicity as a function of 1,266 single and 56,730 double mutations.

Names: 'tdp43'

Reference: Bolognesi B, Faure AJ, Seuma M, Schmiedel JM, Tartaglia GG, Lehner B. The mutational landscape of a prion-like domain. Nat Commun 10:4162 (2019).

[2]:

mavenn.load_example_dataset('tdp43')

[2]:

	set	dist	y	dy	x
0	training	1	0.032210	0.037438	NNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...
1	training	1	-0.009898	0.038981	TNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...
2	training	1	-0.010471	0.005176	RNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...
3	training	1	0.030803	0.005341	SNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...
4	training	1	-0.054716	0.035752	INSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...
...	...	...	...	...	...
57991	training	2	-0.009706	0.035128	GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...
57992	validation	2	-0.030744	0.029436	GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...
57993	validation	2	-0.086802	0.033174	GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...
57994	training	2	-0.049587	0.029130	GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...
57995	training	2	-0.105390	0.031189	GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWG...

57996 rows × 5 columns

Preprocessing

The deep mutagenesis dataset for single and double mutations in TDP-43 is publicly available (in Excel format) as Supplementary Data 3 in the supplementary information of Bolognesi et al. (2019).

It is formatted as follows:

- The absolute starting position of the wild-type sequence is 290.
- Single-mutant sequences are in the 1 AA change sheet. For these sequences, the Pos_abs column lists the absolute position of the mutated amino acid (aa), and the Mut column lists the residue it was mutated to.
- Double-mutant sequences are in the 2 AA change sheet. For these sequences, the Pos_abs1 and Pos_abs2 columns list the absolute positions of the first and second mutated aa, and the Mut1 and Mut2 columns list the corresponding mutant residues.
- Both sheets contain toxicity scores (measurements y) and the corresponding uncertainties (dy).
  - For single-mutant sequences, we will use the toxicity and sigma columns.
  - For double-mutant sequences, we will use the corrected relative toxicity toxicity_cond and the corresponding corrected uncertainty sigma_cond (see the Methods section of the reference paper).
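The relationship between absolute positions and indices into the wild-type sequence can be sketched as follows. This is only an illustration, not part of the original preprocessing: the helper apply_mutation and the constant WT_START are hypothetical names, and the sketch assumes mutation codes have the form seen in the mut_code columns (for example G290N, or R293* for a stop codon).

```python
import re

# Absolute position of the first wild-type residue, as stated above
WT_START = 290

wt_seq = 'GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWGMMGMLASQQNQSGPSGNNQNQGNMQREPNQAFGSGNNS'


def apply_mutation(seq, mut_code, start=WT_START):
    """Apply a code such as 'G290N' to seq (hypothetical helper)."""
    match = re.fullmatch(r'([A-Z*])(\d+)([A-Z*])', mut_code)
    if match is None:
        raise ValueError(f'unrecognized mutation code: {mut_code}')
    wt_aa, pos_abs, mut_aa = match.group(1), int(match.group(2)), match.group(3)

    # Convert the absolute position to a 0-based index into seq
    idx = pos_abs - start
    if not 0 <= idx < len(seq):
        raise ValueError(f'position {pos_abs} lies outside the sequence')

    # Check that the code agrees with the residue actually present
    if seq[idx] != wt_aa:
        raise ValueError(f'expected {wt_aa} at {pos_abs}, found {seq[idx]}')

    return seq[:idx] + mut_aa + seq[idx + 1:]


# A single mutant, and a double mutant built by applying two codes in turn
single = apply_mutation(wt_seq, 'G290N')
double = apply_mutation(apply_mutation(wt_seq, 'G290A'), 'R293*')
```

Checking the wild-type residue against the code is optional, but it tends to catch off-by-one errors in the position offset early.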

[5]:

# Download dataset
url = 'https://github.com/jbkinney/mavenn/blob/master/mavenn/examples/datasets/raw/tdp-43_raw.xlsx?raw=true'
raw_data_file = 'tdp-43_raw.xlsx'
urllib.request.urlretrieve(url, raw_data_file)

# Record wild-type sequence
wt_seq = 'GNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWGMMGMLASQQNQSGPSGNNQNQGNMQREPNQAFGSGNNS'

# Read single mutation sheet from raw data
single_mut_df = pd.read_excel(raw_data_file, sheet_name='1 AA change')

# Read double mutation sheet from raw data
double_mut_df = pd.read_excel(raw_data_file, sheet_name='2 AA change')

# Delete raw dataset
os.remove(raw_data_file)

[6]:

# Preview single-mutant data
single_mut_df.head()

[6]:

	Pos	WT_AA	Mut	Nmut_nt	Nmut_aa	Nmut_codons	STOP	mean_count	is.reads0	sigma	toxicity	region	Pos_abs	mut_code
0	1	G	N	2	1	1	False	22.000000	True	0.037438	0.032210	290	290	G290N
1	1	G	T	2	1	1	False	17.333333	True	0.038981	-0.009898	290	290	G290T
2	1	G	R	2	1	1	False	3888.666667	True	0.005176	-0.010471	290	290	G290R
3	1	G	S	2	1	1	False	3635.666667	True	0.005341	0.030803	290	290	G290S
4	1	G	I	2	1	1	False	21.666667	True	0.035752	-0.054716	290	290	G290I

[7]:

# Preview double-mutant data
double_mut_df.head()

[7]:

	Nmut_nt	Nmut_aa	Nmut_codons	STOP	mean_count	is.reads0	Pos1	Pos2	WT_AA1	WT_AA2	...	sigma_cond	toxicity1	toxicity2	toxicity_uncorr	toxicity_cond	region	Pos_abs1	Pos_abs2	mut_code1	mut_code2
0	2	2	2	True	16.333333	True	1	4	G	R	...	0.020867	0.001282	-0.174307	-0.139949	-0.169501	290	290	293	G290A	R293*
1	4	2	2	True	30.333333	True	1	4	G	R	...	0.017555	0.007680	-0.174307	-0.206614	-0.193387	290	290	293	G290C	R293*
2	2	2	2	True	43.333333	True	1	4	G	R	...	0.017882	0.044342	-0.174307	-0.123376	-0.142809	290	290	293	G290D	R293*
3	2	2	2	True	22.333333	True	1	4	G	R	...	0.018913	-0.010471	-0.174307	-0.136759	-0.165018	290	290	293	G290R	R293*
4	2	2	2	True	29.333333	True	1	4	G	R	...	0.021690	0.030803	-0.174307	-0.118746	-0.153186	290	290	293	G290S	R293*

5 rows × 25 columns

To reformat single_mut_df and double_mut_df into the format of the dataset provided with MAVE-NN, we first need the full amino acid sequence corresponding to each variant. For each record in the single-mutant dataset, we use the Pos_abs and Mut columns to replace one aa in the wild-type sequence. For each record in the double-mutant dataset, we use Pos_abs1, Pos_abs2, Mut1, and Mut2 to replace two aa in the wild-type sequence. The resulting lists of single- and double-mutant sequences are called single_mut_list and double_mut_list, respectively. These lists are then concatenated into the x variable.

Next, we stack the single- and double-mutant

- toxicity scores toxicity and toxicity_cond in y
- score uncertainties sigma and sigma_cond in dy
- Hamming distances (1 for single mutants, 2 for double mutants) in dists

Finally, we create a set column that randomly assigns each sequence to the training, validation, or test set (using a 90:5:5 split), then reorder the columns for clarity. Because each sequence is assigned independently, the realized proportions are only approximately 90:5:5. The resulting dataframe is called final_df.

[ ]:

# Introduce single mutations into wt sequence and append to a list
single_mut_list = []
for mut_pos, mut_char in zip(single_mut_df['Pos_abs'].values,
                             single_mut_df['Mut'].values):
    mut_seq = list(wt_seq)
    mut_seq[mut_pos-290] = mut_char
    single_mut_list.append(''.join(mut_seq))

# Introduce double mutations into wt sequence and append to list
double_mut_list = []
for mut1_pos, mut1_char, mut2_pos, mut2_char in zip(double_mut_df['Pos_abs1'].values,
                                                    double_mut_df['Mut1'].values,
                                                    double_mut_df['Pos_abs2'].values,
                                                    double_mut_df['Mut2'].values):
    mut_seq = list(wt_seq)
    mut_seq[mut1_pos-290] = mut1_char
    mut_seq[mut2_pos-290] = mut2_char
    double_mut_list.append(''.join(mut_seq))

# Stack single-mutant and double-mutant sequences
x = np.hstack([single_mut_list,
               double_mut_list])

# Stack single-mutant and double-mutant toxicity scores
y = np.hstack([single_mut_df['toxicity'].values,
               double_mut_df['toxicity_cond'].values])

# Stack single-mutant and double-mutant toxicity score uncertainties
dy = np.hstack([single_mut_df['sigma'].values,
                double_mut_df['sigma_cond'].values])

# List Hamming distances from the wild-type sequence
dists = np.hstack([1*np.ones(len(single_mut_df)),
                   2*np.ones(len(double_mut_df))]).astype(int)

# Assign each sequence to training, validation, or test set
np.random.seed(0)
sets = np.random.choice(a=['training', 'validation', 'test'],
                        p=[.9,.05,.05],
                        size=len(x))

# Assemble into dataframe
final_df = pd.DataFrame({'set':sets, 'dist':dists, 'y':y, 'dy':dy, 'x':x})

# Save to file (uncomment to execute)
# final_df.to_csv('tdp43_data.csv.gz', index=False, compression='gzip')

# Preview dataframe
final_df

This final dataframe, final_df, has the same format as the tdp43 dataset that comes with MAVE-NN.
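One way to sanity-check the result is to confirm that each sequence differs from the wild type at as many positions as its dist value claims. The sketch below is illustrative only: hamming_distance is a hypothetical helper, and the short sequences are toy stand-ins built from the first ten wild-type residues, not rows taken from final_df.

```python
def hamming_distance(seq, ref):
    """Count the positions at which two equal-length sequences differ."""
    if len(seq) != len(ref):
        raise ValueError('sequences must have equal length')
    return sum(a != b for a, b in zip(seq, ref))


# Toy stand-ins: the first ten wild-type residues and two variants of them,
# each paired with the distance it is expected to have
wt_prefix = 'GNSRGGGAGL'
examples = {
    'NNSRGGGAGL': 1,  # G290N
    'ANS*GGGAGL': 2,  # G290A together with R293*
}

# Collect any variant whose computed distance disagrees with the expected one
mismatches = [seq for seq, dist in examples.items()
              if hamming_distance(seq, wt_prefix) != dist]
```

With the notebook's own variables, the same check could be written as (final_df['x'].apply(hamming_distance, ref=wt_seq) == final_df['dist']).all(), which would be expected to evaluate to True if every listed mutation actually changes the residue at its position.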