mpsa datasets
[1]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Special imports
import mavenn
import os
import urllib
Summary

The massively parallel splicing assay (MPSA) dataset of Wong et al., 2018. The authors used 3-exon minigenes to assay how inclusion of the middle exon varies with the sequence of that exon's 5' splice site. Nearly all 5' splice site variants of the form NNN/GYNNNN were measured, where the slash demarcates the exon/intron boundary. The authors performed experiments on multiple replicates of multiple libraries in three different minigene contexts: IKBKAP exons 19-21, SMN1 exons 6-8, and BRCA2 exons 17-19. The dataset 'mpsa' is from library 1 replicate 1 in the BRCA2 context, while 'mpsa_replicate' is from library 2 replicate 1 in the same context.
In these dataframes, the 'tot_ct' column reports the number of reads obtained for each splice site from total processed mRNA transcripts, the 'ex_ct' column reports the number of reads obtained from processed mRNA transcripts containing the central exon, 'y' is the \(\log_{10}\) percent-spliced-in (PSI) value measured for each sequence, and 'x' is the variant 5' splice site. Note that some sequences have \(y > 2.0\), corresponding to PSI > 100, due to experimental noise.
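To see how \(y\) relates to the raw counts, here is a minimal sketch of the PSI computation performed in the preprocessing section below: the pseudocount-adjusted inclusion ratio of each variant is normalized by that of the consensus site 'CAGGUAAGU'. The count values here are illustrative, not taken from the dataset; a variant whose inclusion ratio exceeds the consensus ratio yields PSI > 100 and thus \(y > 2\).

```python
import numpy as np

pseudocount = 1.0

# Illustrative consensus-site counts (not the actual dataset values)
cons_tot, cons_ex = 200, 150
ratio_cons = (cons_ex + pseudocount) / (cons_tot + pseudocount)

# A variant whose inclusion ratio exceeds that of the consensus
tot_ct, ex_ct = 26, 22
ratio = (ex_ct + pseudocount) / (tot_ct + pseudocount)

psi = 100 * ratio / ratio_cons  # percent spliced-in, relative to consensus
y = np.log10(psi)               # exceeds 2.0 because ratio > ratio_cons
```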
Names: 'mpsa', 'mpsa_replicate'
Reference: Wong MS, Kinney JB, Krainer AR. Quantitative activity profile and context dependence of all human 5’ splice sites. Mol Cell. 71:1012-1026.e3 (2018).
[2]:
mavenn.load_example_dataset('mpsa')
[2]:
| | set | tot_ct | ex_ct | y | x |
|---|---|---|---|---|---|
| 0 | training | 28 | 2 | 0.023406 | GGAGUGAUG |
| 1 | test | 315 | 7 | -0.587914 | AGUGUGCAA |
| 2 | training | 193 | 15 | -0.074999 | UUCGCGCCA |
| 3 | training | 27 | 0 | -0.438475 | UAAGCUUUU |
| 4 | training | 130 | 2 | -0.631467 | AUGGUCGGG |
| ... | ... | ... | ... | ... | ... |
| 30478 | training | 190 | 17 | -0.017078 | CUGGUUGCA |
| 30479 | training | 154 | 10 | -0.140256 | CGCGCACAA |
| 30480 | test | 407 | 16 | -0.371528 | ACUGCUCAC |
| 30481 | training | 265 | 6 | -0.571100 | AUAGUCUAA |
| 30482 | test | 26 | 22 | 0.939047 | GUGGUAACU |
30483 rows × 5 columns
Preprocessing
The raw datasets for both 'mpsa' and 'mpsa_replicate' are available at github.com/jbkinney/15_splicing/.
[3]:
# Specify dataset
dataset = 'mpsa' # use to make 'mpsa' dataset
# dataset = 'mpsa_replicate' # uncomment to make 'mpsa_replicate' dataset instead
print(f'Creating dataset "{dataset}"...')
# Specify online directory
url_dir = 'https://github.com/jbkinney/15_splicing/raw/master/for_mavenn/'
# Specify raw data file name
if dataset == 'mpsa':
    raw_data_file = 'results.brca2_9nt_lib1_rep1.txt.gz'
elif dataset == 'mpsa_replicate':
    raw_data_file = 'results.brca2_9nt_lib2_rep1.txt.gz'
else:
    assert False, f'Invalid dataset "{dataset}"'
# Download raw dataset
print(f'Retrieving data from {url_dir+raw_data_file}...')
urllib.request.urlretrieve(url_dir+raw_data_file, raw_data_file)
# Load raw dataset
raw_df = pd.read_csv(raw_data_file, sep='\t', index_col=0, compression='gzip')
# Delete raw dataset
os.remove(raw_data_file)
# Preview raw dataset
print('Done!')
raw_df
Creating dataset "mpsa"...
Retrieving data from https://github.com/jbkinney/15_splicing/raw/master/for_mavenn/results.brca2_9nt_lib1_rep1.txt.gz...
Done!
[3]:
| | tot_ct | ex_ct | lib_ct | mis_ct | ss | bc |
|---|---|---|---|---|---|---|
| 0 | 377 | 27 | 164 | 3 | ACAGCGGGA | TTAGCTATCGGCTGACGTCT |
| 1 | 332 | 5 | 97 | 1 | AGCGTGTAT | CCACCCAACGCGCCGTCAGT |
| 2 | 320 | 3286 | 46 | 1 | CAGGTGAGA | TTGAGGTACACTGAACAGTC |
| 3 | 312 | 2248 | 87 | 1 | CAGGTTAGA | ACCGATCTGCCACGGCGACC |
| 4 | 291 | 8 | 109 | 2 | CAAGCCTTA | AGGGACCATCCAGTTCGCCT |
| ... | ... | ... | ... | ... | ... | ... |
| 944960 | 0 | 0 | 14 | 0 | ACCGCGATG | TGAAATTGACCCGAGCCTGC |
| 944961 | 0 | 0 | 14 | 1 | AACGCCTCG | AACCAAAATACCTTGCGCTT |
| 944962 | 0 | 0 | 14 | 0 | TACGCATCG | TACTCAGCCAATGGCGAACA |
| 944963 | 0 | 0 | 14 | 0 | AAGGTCACG | CTATGCATCTACGCTTAATG |
| 944964 | 0 | 0 | 2 | 0 | CCAGCGCCG | AAAAAAAAAAAAGATTTGTT |
944965 rows × 6 columns
Each raw dataset lists read counts for every cloned minigene plasmid. Note that each minigene contained one 5'ss (column 'ss') and one associated barcode (column 'bc'). Each barcode is associated with a unique 5'ss, but each 5'ss will typically be associated with many different barcodes. In addition to 'tot_ct' and 'ex_ct', two other read count quantities are listed: 'lib_ct' is the number of reads supporting the indicated 5'ss-barcode association, while 'mis_ct' is the number of reads indicating association of the barcode with other 5'ss that are not shown. Filtering for 'lib_ct' >= 2 and 'lib_ct' >= 4 * 'mis_ct' was already performed by the authors, and we will ignore these columns in what follows.
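For reference, the authors' barcode-level filter can be sketched as follows. The dataframe here is a hypothetical stand-in, since the distributed files have already been filtered.

```python
import pandas as pd

# Hypothetical barcode-level counts (the distributed files are pre-filtered)
bc_df = pd.DataFrame({
    'lib_ct': [5, 1, 8, 3],
    'mis_ct': [1, 0, 3, 0],
})

# Keep barcodes with lib_ct >= 2 and lib_ct >= 4 * mis_ct
keep = (bc_df['lib_ct'] >= 2) & (bc_df['lib_ct'] >= 4 * bc_df['mis_ct'])
filtered_df = bc_df[keep]
```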
To format this dataset for use in MAVE-NN, we do the following:

- Drop the 'lib_ct' and 'mis_ct' columns.
- Drop the 'bc' column and marginalize 'tot_ct' and 'ex_ct' by 5'ss sequence.
- Rename the 'ss' column to 'x'.
- Convert DNA 5'ss sequences to RNA sequences by replacing 'T' with 'U'.
- Remove 5'ss with 'tot_ct' < 10.
- Remove 5'ss with sequences that don't match 'NNNGYNNNN'.
- Compute 'y', which lists estimated log10 PSI values.
- Assign each 5'ss to the training, validation, or test set.
- Shuffle the rows of the final dataframe and reorder the columns for clarity.
- Save the dataframe as a .csv.gz file.
[4]:
# Remove unwanted columns
data_df = raw_df[['tot_ct','ex_ct','ss','bc']].copy()
# Marginalize by splice site
data_df = data_df.groupby('ss').sum().reset_index()
# Rename columns
data_df.rename(columns={'ss':'x'}, inplace=True)
# Make sequences RNA
data_df['x'] = [ss.replace('T','U') for ss in data_df['x']]
# Remove ss with tot_ct below the minimum
min_ct = 10
ix = data_df['tot_ct'] >= min_ct
data_df = data_df[ix].copy()
print(f'{(~ix).sum()} ss removed for having tot_ct < {min_ct}')
# Remove ss with invalid sequences
ix = np.array([((x[3]=='G') and (x[4] in {'C','U'})) for x in data_df['x']])
data_df = data_df[ix]
print(f'{(~ix).sum()} ss with invalid sequences removed')
# Get consensus i_cons and o_cons
pseudoct = 1.0
cons_seq = 'CAGGUAAGU'
tmp_df = data_df.set_index('x')
i_cons = tmp_df.loc[cons_seq,'tot_ct']
o_cons = tmp_df.loc[cons_seq,'ex_ct']
relative_psi_cons = (o_cons+pseudoct)/(i_cons+pseudoct)
# Compute y
i_n = data_df['tot_ct']
o_n = data_df['ex_ct']
relative_psi_n = (o_n+pseudoct)/(i_n+pseudoct)
psi_n = 100*relative_psi_n/relative_psi_cons
y_n = np.log10(psi_n)
data_df['y'] = y_n
# Assign data to training, validation, or test sets
np.random.seed(0)
data_df['set'] = np.random.choice(a=['training', 'validation', 'test'],
                                  p=[.6, .2, .2],
                                  size=len(data_df))
# Shuffle data for extra safety
data_df = data_df.sample(frac=1).reset_index(drop=True)
# Order columns
final_df = data_df[['set', 'tot_ct', 'ex_ct', 'y', 'x']].copy()
# Save to file
out_file = f'{dataset}_data.csv.gz'
final_df.to_csv(out_file, index=False, compression='gzip')
print(f'Dataset "{dataset}" saved to {out_file}')
# Preview final dataframe
print('Done!')
final_df
2309 ss removed for having tot_ct < 10
7 ss with invalid sequences removed
Dataset "mpsa" saved to mpsa_data.csv.gz
Done!
[4]:
| | set | tot_ct | ex_ct | y | x |
|---|---|---|---|---|---|
| 0 | training | 28 | 2 | 0.023406 | GGAGUGAUG |
| 1 | test | 315 | 7 | -0.587914 | AGUGUGCAA |
| 2 | training | 193 | 15 | -0.074999 | UUCGCGCCA |
| 3 | training | 27 | 0 | -0.438475 | UAAGCUUUU |
| 4 | training | 130 | 2 | -0.631467 | AUGGUCGGG |
| ... | ... | ... | ... | ... | ... |
| 30478 | training | 190 | 17 | -0.017078 | CUGGUUGCA |
| 30479 | training | 154 | 10 | -0.140256 | CGCGCACAA |
| 30480 | test | 407 | 16 | -0.371528 | ACUGCUCAC |
| 30481 | training | 265 | 6 | -0.571100 | AUAGUCUAA |
| 30482 | test | 26 | 22 | 0.939047 | GUGGUAACU |
30483 rows × 5 columns
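When using the final dataframe downstream, the 'set' column can be used to partition rows into training, validation, and test splits with plain pandas selection. A minimal sketch with a small stand-in dataframe:

```python
import pandas as pd

# Small stand-in for final_df, with the same 'set' labels
df = pd.DataFrame({
    'set': ['training', 'test', 'training', 'validation'],
    'y':   [0.02, -0.59, -0.07, -0.44],
})

train_df = df[df['set'] == 'training']
val_df   = df[df['set'] == 'validation']
test_df  = df[df['set'] == 'test']
```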