mpsa datasets
[1]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Special imports
import mavenn
import os
import urllib
Summary

The massively parallel splicing assay (MPSA) dataset of Wong et al., 2018. The authors used 3-exon minigenes to assay how inclusion of the middle exon varies with the sequence of that exon's 5' splice site. Nearly all 5' splice site variants of the form NNN/GYNNNN were measured, where the slash demarcates the exon/intron boundary. The authors performed experiments on multiple replicates of multiple libraries in three different minigene contexts: IKBKAP exons 19-21, SMN1 exons 6-8, and BRCA2 exons 17-19. The dataset 'mpsa' is from library 1 replicate 1 in the BRCA2 context, while 'mpsa_replicate' is from library 2 replicate 1 in the same context.
In these dataframes, the 'tot_ct' column reports the number of reads obtained for each splice site from total processed mRNA transcripts, the 'ex_ct' column reports the number of reads obtained from processed mRNA transcripts containing the central exon, 'y' is the \(\log_{10}\) percent-spliced-in (PSI) value measured for each sequence, and 'x' is the variant 5' splice site. Note that some sequences have \(y > 2.0\), corresponding to PSI > 100, due to experimental noise.
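To see how \(y\) relates to the raw counts, here is a minimal sketch of the PSI computation performed in the preprocessing section below: the pseudocount-adjusted inclusion ratio of each variant is normalized by that of the consensus site 'CAGGUAAGU'. The count values here are illustrative, not taken from the dataset; a variant whose inclusion ratio exceeds the consensus ratio yields PSI > 100 and thus \(y > 2\).

```python
import numpy as np

pseudocount = 1.0

# Illustrative consensus-site counts (not the actual dataset values)
cons_tot, cons_ex = 200, 150
ratio_cons = (cons_ex + pseudocount) / (cons_tot + pseudocount)

# A variant whose inclusion ratio exceeds that of the consensus
tot_ct, ex_ct = 26, 22
ratio = (ex_ct + pseudocount) / (tot_ct + pseudocount)

psi = 100 * ratio / ratio_cons  # percent spliced-in, relative to consensus
y = np.log10(psi)               # exceeds 2.0 because ratio > ratio_cons
```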
Names: 'mpsa', 'mpsa_replicate'
Reference: Wong MS, Kinney JB, Krainer AR. Quantitative activity profile and context dependence of all human 5’ splice sites. Mol Cell. 71:1012-1026.e3 (2018).
[2]:
mavenn.load_example_dataset('mpsa')
[2]:
| | set | tot_ct | ex_ct | y | x |
|---|---|---|---|---|---|
| 0 | training | 28 | 2 | 0.023406 | GGAGUGAUG |
| 1 | test | 315 | 7 | -0.587914 | AGUGUGCAA |
| 2 | training | 193 | 15 | -0.074999 | UUCGCGCCA |
| 3 | training | 27 | 0 | -0.438475 | UAAGCUUUU |
| 4 | training | 130 | 2 | -0.631467 | AUGGUCGGG |
| ... | ... | ... | ... | ... | ... |
| 30478 | training | 190 | 17 | -0.017078 | CUGGUUGCA |
| 30479 | training | 154 | 10 | -0.140256 | CGCGCACAA |
| 30480 | test | 407 | 16 | -0.371528 | ACUGCUCAC |
| 30481 | training | 265 | 6 | -0.571100 | AUAGUCUAA |
| 30482 | test | 26 | 22 | 0.939047 | GUGGUAACU |
30483 rows × 5 columns
Preprocessing
The raw datasets for both 'mpsa' and 'mpsa_replicate' are available at github.com/jbkinney/15_splicing/.
[3]:
# Specify dataset
dataset = 'mpsa' # use to make 'mpsa' dataset
# dataset = 'mpsa_replicate' # uncomment to make 'mpsa_replicate' dataset instead
print(f'Creating dataset "{dataset}"...')
# Specify online directory
url_dir = 'https://github.com/jbkinney/15_splicing/raw/master/for_mavenn/'
# Specify raw data file name
if dataset == 'mpsa':
    raw_data_file = 'results.brca2_9nt_lib1_rep1.txt.gz'
elif dataset == 'mpsa_replicate':
    raw_data_file = 'results.brca2_9nt_lib2_rep1.txt.gz'
else:
    assert False, f'Invalid dataset "{dataset}"'
# Download raw dataset
print(f'Retrieving data from {url_dir+raw_data_file}...')
urllib.request.urlretrieve(url_dir+raw_data_file, raw_data_file)
# Load raw dataset
raw_df = pd.read_csv(raw_data_file, sep='\t', index_col=0, compression='gzip')
# Delete raw dataset
os.remove(raw_data_file)
# Preview raw dataset
print('Done!')
raw_df
Creating dataset "mpsa"...
Retrieving data from https://github.com/jbkinney/15_splicing/raw/master/for_mavenn/results.brca2_9nt_lib1_rep1.txt.gz...
Done!
[3]:
| | tot_ct | ex_ct | lib_ct | mis_ct | ss | bc |
|---|---|---|---|---|---|---|
| 0 | 377 | 27 | 164 | 3 | ACAGCGGGA | TTAGCTATCGGCTGACGTCT |
| 1 | 332 | 5 | 97 | 1 | AGCGTGTAT | CCACCCAACGCGCCGTCAGT |
| 2 | 320 | 3286 | 46 | 1 | CAGGTGAGA | TTGAGGTACACTGAACAGTC |
| 3 | 312 | 2248 | 87 | 1 | CAGGTTAGA | ACCGATCTGCCACGGCGACC |
| 4 | 291 | 8 | 109 | 2 | CAAGCCTTA | AGGGACCATCCAGTTCGCCT |
| ... | ... | ... | ... | ... | ... | ... |
| 944960 | 0 | 0 | 14 | 0 | ACCGCGATG | TGAAATTGACCCGAGCCTGC |
| 944961 | 0 | 0 | 14 | 1 | AACGCCTCG | AACCAAAATACCTTGCGCTT |
| 944962 | 0 | 0 | 14 | 0 | TACGCATCG | TACTCAGCCAATGGCGAACA |
| 944963 | 0 | 0 | 14 | 0 | AAGGTCACG | CTATGCATCTACGCTTAATG |
| 944964 | 0 | 0 | 2 | 0 | CCAGCGCCG | AAAAAAAAAAAAGATTTGTT |
944965 rows × 6 columns
Each raw dataset lists read counts for every cloned minigene plasmid. Note that each minigene contained one 5'ss (column 'ss') and one associated barcode (column 'bc'). Each barcode is associated with a unique 5'ss, but each 5'ss will typically be associated with many different barcodes. In addition to 'tot_ct' and 'ex_ct', two other read count quantities are listed: 'lib_ct' is the number of reads supporting the indicated 5'ss-barcode association, while 'mis_ct' is the number of reads indicating association of the barcode with other 5'ss that are not shown. Filtering for 'lib_ct' >= 2 and 'lib_ct' >= 4 * 'mis_ct' was already performed by the authors, and we will ignore these columns in what follows.
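For reference, the authors' barcode-level filter can be sketched as follows. The dataframe here is a hypothetical stand-in, since the distributed files have already been filtered.

```python
import pandas as pd

# Hypothetical barcode-level counts (the distributed files are pre-filtered)
bc_df = pd.DataFrame({
    'lib_ct': [5, 1, 8, 3],
    'mis_ct': [1, 0, 3, 0],
})

# Keep barcodes with lib_ct >= 2 and lib_ct >= 4 * mis_ct
keep = (bc_df['lib_ct'] >= 2) & (bc_df['lib_ct'] >= 4 * bc_df['mis_ct'])
filtered_df = bc_df[keep]
```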
To format this dataset for use in MAVE-NN, we do the following:

- Drop the 'lib_ct' and 'mis_ct' columns.
- Drop the 'bc' column and marginalize 'tot_ct' and 'ex_ct' by 5'ss sequence.
- Rename the 'ss' column to 'x'.
- Convert DNA 5'ss sequences to RNA sequences by replacing 'T' with 'U'.
- Remove 5'ss with 'tot_ct' < 10.
- Remove 5'ss with sequences that don't match 'NNNGYNNNN'.
- Compute 'y', which lists estimated log10 PSI values.
- Assign each 5'ss to the training, validation, or test set.
- Shuffle the rows of the final dataframe and reorder the columns for clarity.
- Save the dataframe as a .csv.gz file.
[4]:
# Remove unwanted columns
data_df = raw_df[['tot_ct','ex_ct','ss','bc']].copy()
# Marginalize by splice site
data_df = data_df.groupby('ss').sum().reset_index()
# Rename columns
data_df.rename(columns={'ss':'x'}, inplace=True)
# Make sequences RNA
data_df['x'] = [ss.replace('T','U') for ss in data_df['x']]
# Remove ss with tot_ct below the minimum
min_ct = 10
ix = data_df['tot_ct'] >= min_ct
data_df = data_df[ix].copy()
print(f'{(~ix).sum()} ss removed for having tot_ct < {min_ct}')
# Remove ss with invalid sequences
ix = np.array([((x[3]=='G') and (x[4] in {'C','U'})) for x in data_df['x']])
data_df = data_df[ix]
print(f'{(~ix).sum()} ss with invalid sequences removed')
# Get consensus i_cons and o_cons
pseudoct = 1.0
cons_seq = 'CAGGUAAGU'
tmp_df = data_df.set_index('x')
i_cons = tmp_df.loc[cons_seq,'tot_ct']
o_cons = tmp_df.loc[cons_seq,'ex_ct']
relative_psi_cons = (o_cons+pseudoct)/(i_cons+pseudoct)
# Compute y
i_n = data_df['tot_ct']
o_n = data_df['ex_ct']
relative_psi_n = (o_n+pseudoct)/(i_n+pseudoct)
psi_n = 100*relative_psi_n/relative_psi_cons
y_n = np.log10(psi_n)
data_df['y'] = y_n
# Assign data to training, validation, or test sets
np.random.seed(0)
data_df['set'] = np.random.choice(a=['training', 'validation', 'test'],
                                  p=[.6, .2, .2],
                                  size=len(data_df))
# Shuffle data for extra safety
data_df = data_df.sample(frac=1).reset_index(drop=True)
# Order columns
final_df = data_df[['set', 'tot_ct', 'ex_ct', 'y', 'x']].copy()
# Save to file
out_file = f'{dataset}_data.csv.gz'
final_df.to_csv(out_file, index=False, compression='gzip')
print(f'Dataset "{dataset}" saved to {out_file}')
# Preview final dataframe
print('Done!')
final_df
2309 ss removed for having tot_ct < 10
7 ss with invalid sequences removed
Dataset "mpsa" saved to mpsa_data.csv.gz
Done!
[4]:
| | set | tot_ct | ex_ct | y | x |
|---|---|---|---|---|---|
| 0 | training | 28 | 2 | 0.023406 | GGAGUGAUG |
| 1 | test | 315 | 7 | -0.587914 | AGUGUGCAA |
| 2 | training | 193 | 15 | -0.074999 | UUCGCGCCA |
| 3 | training | 27 | 0 | -0.438475 | UAAGCUUUU |
| 4 | training | 130 | 2 | -0.631467 | AUGGUCGGG |
| ... | ... | ... | ... | ... | ... |
| 30478 | training | 190 | 17 | -0.017078 | CUGGUUGCA |
| 30479 | training | 154 | 10 | -0.140256 | CGCGCACAA |
| 30480 | test | 407 | 16 | -0.371528 | ACUGCUCAC |
| 30481 | training | 265 | 6 | -0.571100 | AUAGUCUAA |
| 30482 | test | 26 | 22 | 0.939047 | GUGGUAACU |
30483 rows × 5 columns
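When using the final dataframe downstream, the 'set' column can be used to partition rows into training, validation, and test splits with plain pandas selection. A minimal sketch with a small stand-in dataframe:

```python
import pandas as pd

# Small stand-in for final_df, with the same 'set' labels
df = pd.DataFrame({
    'set': ['training', 'test', 'training', 'validation'],
    'y':   [0.02, -0.59, -0.07, -0.44],
})

train_df = df[df['set'] == 'training']
val_df   = df[df['set'] == 'validation']
test_df  = df[df['set'] == 'test']
```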