mpsa datasets

[1]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Special imports
import mavenn
import os
import urllib

Summary

The massively parallel splicing assay (MPSA) dataset of Wong et al., 2018. The authors used 3-exon minigenes to assay how inclusion of the middle exon varies with the sequence of that exon’s 5’ splice site. Nearly all 5’ splice site variants of the form NNN/GYNNNN were measured, where the slash demarcates the exon/intron boundary. The authors performed experiments on multiple replicates of multiple libraries in three different minigene contexts: IKBKAP exons 19-21, SMN1 exons 6-8, and BRCA2 exons 17-19. The dataset 'mpsa' is from library 1 replicate 1 in the BRCA2 context, while 'mpsa_replicate' is from library 2 replicate 1 in the same context.

In these dataframes, the 'tot_ct' column reports the number of reads obtained for each splice site from total processed mRNA transcripts, the 'ex_ct' column reports the number of reads obtained from processed mRNA transcripts containing the central exon, 'y' is the \(\log_{10}\) percent-spliced-in (PSI) value measured for each sequence, and 'x' is the variant 5’ splice site. Note that some sequences have \(y > 2.0\), corresponding to PSI > 100, due to experimental noise.
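Since 'y' is \(\log_{10}\) PSI, measured values can be mapped back to percent-spliced-in by exponentiating; a minimal sketch (toy 'y' values, not drawn from the dataset):

```python
import numpy as np

# 'y' is log10(PSI); invert with 10**y to recover PSI in percent.
# y = 2.0 corresponds to PSI = 100; y > 2.0 implies PSI > 100 (experimental noise).
y = np.array([2.0, 0.0, -1.0])
psi = 10 ** y
print(psi)  # PSI values of 100.0, 1.0, and 0.1 percent
```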

Names: 'mpsa', 'mpsa_replicate'

Reference: Wong MS, Kinney JB, Krainer AR. Quantitative activity profile and context dependence of all human 5’ splice sites. Mol Cell. 71:1012-1026.e3 (2018).

[2]:
mavenn.load_example_dataset('mpsa')
[2]:
set tot_ct ex_ct y x
0 training 28 2 0.023406 GGAGUGAUG
1 test 315 7 -0.587914 AGUGUGCAA
2 training 193 15 -0.074999 UUCGCGCCA
3 training 27 0 -0.438475 UAAGCUUUU
4 training 130 2 -0.631467 AUGGUCGGG
... ... ... ... ... ...
30478 training 190 17 -0.017078 CUGGUUGCA
30479 training 154 10 -0.140256 CGCGCACAA
30480 test 407 16 -0.371528 ACUGCUCAC
30481 training 265 6 -0.571100 AUAGUCUAA
30482 test 26 22 0.939047 GUGGUAACU

30483 rows × 5 columns

Preprocessing

The raw datasets for both 'mpsa' and 'mpsa_replicate' are available at github.com/jbkinney/15_splicing/.

[3]:
# Specify dataset
dataset = 'mpsa' # use to make 'mpsa' dataset
# dataset = 'mpsa_replicate' # uncomment to make 'mpsa_replicate' dataset instead
print(f'Creating dataset "{dataset}"...')

# Specify online directory
url_dir = 'https://github.com/jbkinney/15_splicing/raw/master/for_mavenn/'

# Specify raw data file and output file names
if dataset=='mpsa':
    raw_data_file = 'results.brca2_9nt_lib1_rep1.txt.gz'
elif dataset=='mpsa_replicate':
    raw_data_file = 'results.brca2_9nt_lib2_rep1.txt.gz'
else:
    assert False

# Download raw dataset
print(f'Retrieving data from {url_dir+raw_data_file}...')
urllib.request.urlretrieve(url_dir+raw_data_file, raw_data_file)

# Load raw dataset
raw_df = pd.read_csv(raw_data_file, sep='\t', index_col=0, compression='gzip')

# Delete raw dataset
os.remove(raw_data_file)

# Preview raw dataset
print('Done!')
raw_df
Creating dataset "mpsa"...
Retrieving data from https://github.com/jbkinney/15_splicing/raw/master/for_mavenn/results.brca2_9nt_lib1_rep1.txt.gz...
Done!
[3]:
tot_ct ex_ct lib_ct mis_ct ss bc
0 377 27 164 3 ACAGCGGGA TTAGCTATCGGCTGACGTCT
1 332 5 97 1 AGCGTGTAT CCACCCAACGCGCCGTCAGT
2 320 3286 46 1 CAGGTGAGA TTGAGGTACACTGAACAGTC
3 312 2248 87 1 CAGGTTAGA ACCGATCTGCCACGGCGACC
4 291 8 109 2 CAAGCCTTA AGGGACCATCCAGTTCGCCT
... ... ... ... ... ... ...
944960 0 0 14 0 ACCGCGATG TGAAATTGACCCGAGCCTGC
944961 0 0 14 1 AACGCCTCG AACCAAAATACCTTGCGCTT
944962 0 0 14 0 TACGCATCG TACTCAGCCAATGGCGAACA
944963 0 0 14 0 AAGGTCACG CTATGCATCTACGCTTAATG
944964 0 0 2 0 CCAGCGCCG AAAAAAAAAAAAGATTTGTT

944965 rows × 6 columns

Each raw dataset lists read counts for every cloned minigene plasmid. Note that each minigene contained one 5’ss (column 'ss') and one associated barcode (column 'bc'). Each barcode is associated with a unique 5’ss, but each 5’ss will typically be associated with many different barcodes. In addition to 'tot_ct' and 'ex_ct', two other read count quantities are listed: 'lib_ct' is the number of reads supporting the indicated 5’ss-barcode association, while 'mis_ct' is the number of reads indicating association of the barcode with other 5’ss that are not shown. The authors already filtered these data, keeping only barcodes with 'lib_ct' >= 2 and 'lib_ct' >= 4 * 'mis_ct', so we will ignore these columns in what follows.
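For reference, a filter of this kind can be sketched in pandas; the toy counts below are illustrative, not drawn from the actual dataset:

```python
import pandas as pd

# Toy rows mimicking the raw dataframe's barcode-level count columns
df = pd.DataFrame({
    'lib_ct': [164, 1, 14, 3],
    'mis_ct': [3, 0, 1, 2],
})

# Keep barcodes whose 5'ss association is well supported:
# at least 2 library reads, and at least 4x as many as misassociation reads
keep = (df['lib_ct'] >= 2) & (df['lib_ct'] >= 4 * df['mis_ct'])
filtered = df[keep]
print(len(filtered))  # 2 of the 4 toy barcodes pass
```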

To format this dataset for use in MAVE-NN, we do the following:

- Drop the 'lib_ct' and 'mis_ct' columns.
- Drop the 'bc' column and marginalize 'tot_ct' and 'ex_ct' by 5’ss sequence.
- Rename the 'ss' column to 'x'.
- Convert DNA 5’ss sequences to RNA sequences by replacing 'T' with 'U'.
- Remove 5’ss with 'tot_ct' < 10.
- Remove 5’ss with sequences that don’t match 'NNNGYNNNN'.
- Compute 'y', which lists estimated log10 PSI values.
- Assign each 5’ss to the training, validation, or test set.
- Shuffle the rows of the final dataframe and reorder the columns for clarity.
- Save the dataframe as a .csv.gz file.

[4]:
# Remove unwanted columns
data_df = raw_df[['tot_ct','ex_ct','ss','bc']].copy()

# Marginalize by splice site
data_df = data_df.groupby('ss').sum().reset_index()

# Rename columns
data_df.rename(columns={'ss':'x'}, inplace=True)

# Make sequences RNA
data_df['x'] = [ss.replace('T','U') for ss in data_df['x']]

# Remove ss with tot_ct below the minimum
min_ct = 10
ix = data_df['tot_ct'] >= min_ct
data_df = data_df[ix].copy()
print(f'{(~ix).sum()} ss removed for having tot_ct < {min_ct}')

# Remove ss with invalid sequences
ix = np.array([((x[3]=='G') and (x[4] in {'C','U'})) for x in data_df['x']])
data_df = data_df[ix]
print(f'{(~ix).sum()} ss with invalid sequences removed')

# Get total and exonic counts for the consensus 5'ss
pseudoct = 1.0
cons_seq = 'CAGGUAAGU'
tmp_df = data_df.set_index('x')
i_cons = tmp_df.loc[cons_seq,'tot_ct']
o_cons = tmp_df.loc[cons_seq,'ex_ct']
relative_psi_cons = (o_cons+pseudoct)/(i_cons+pseudoct)

# Compute y
i_n = data_df['tot_ct']
o_n = data_df['ex_ct']
relative_psi_n = (o_n+pseudoct)/(i_n+pseudoct)
psi_n = 100*relative_psi_n/relative_psi_cons
y_n = np.log10(psi_n)
data_df['y'] = y_n

# Assign data to training, validation, or test sets
np.random.seed(0)
data_df['set'] = np.random.choice(a=['training', 'validation', 'test'],
                                   p=[.6,.2,.2],
                                   size=len(data_df))

# Shuffle data for extra safety
data_df = data_df.sample(frac=1).reset_index(drop=True)

# Order columns
final_df = data_df[['set', 'tot_ct', 'ex_ct', 'y', 'x']].copy()

# Save to file
out_file = f'{dataset}_data.csv.gz'
final_df.to_csv(out_file, index=False, compression='gzip')
print(f'Dataset "{dataset}" saved to {out_file}')

# Preview final dataframe
print('Done!')
final_df


2309 ss removed for having tot_ct < 10
7 ss with invalid sequences removed
Dataset "mpsa" saved to mpsa_data.csv.gz
Done!
[4]:
set tot_ct ex_ct y x
0 training 28 2 0.023406 GGAGUGAUG
1 test 315 7 -0.587914 AGUGUGCAA
2 training 193 15 -0.074999 UUCGCGCCA
3 training 27 0 -0.438475 UAAGCUUUU
4 training 130 2 -0.631467 AUGGUCGGG
... ... ... ... ... ...
30478 training 190 17 -0.017078 CUGGUUGCA
30479 training 154 10 -0.140256 CGCGCACAA
30480 test 407 16 -0.371528 ACUGCUCAC
30481 training 265 6 -0.571100 AUAGUCUAA
30482 test 26 22 0.939047 GUGGUAACU

30483 rows × 5 columns
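When using this dataset for modeling, one typically splits rows on the 'set' column; a minimal sketch using a toy dataframe with the same columns (in practice, call mavenn.load_example_dataset('mpsa') instead):

```python
import pandas as pd

# Toy dataframe mimicking the 'set', 'y', and 'x' columns of the mpsa dataset
data_df = pd.DataFrame({
    'set': ['training', 'test', 'training', 'validation'],
    'y':   [0.02, -0.59, -0.07, -0.44],
    'x':   ['GGAGUGAUG', 'AGUGUGCAA', 'UUCGCGCCA', 'UAAGCUUUU'],
})

# Hold out the test set; keep training + validation rows for model fitting
trainval_df = data_df[data_df['set'] != 'test']
test_df = data_df[data_df['set'] == 'test']
print(len(trainval_df), len(test_df))  # 3 1
```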