sortseq dataset

[5]:
# Standard imports
import pandas as pd
import numpy as np

# Special imports
import mavenn
import os
import urllib.request

Summary

The sort-seq MPRA data of Kinney et al., 2010. The authors used fluorescence-activated cell sorting (FACS), followed by deep sequencing, to assay gene expression levels from variant lac promoters in E. coli. The authors performed six different experiments, which varied in the region of the lac promoter that was mutagenized, the mutation rate used, the E. coli host strain, cellular growth conditions, and the number of bins into which cells were sorted. See Kinney et al., 2010 for more details.

In this dataframe, the 'x' column lists (unique) variant sequences, columns 'ct_0' through 'ct_9' list the number of reads of each sequence observed in each of the 10 respective FACS bins, and the 'set' column indicates whether each sequence is assigned to the training set, the validation set, or the test set.

Names: 'sortseq'

Associated datasets: 'sortseq_rnap-wt', 'sortseq_crp-wt', 'sortseq_full-500', 'sortseq_full-150', 'sortseq_full-0'

Reference: Kinney J, Murugan A, Callan C, Cox E. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc Natl Acad Sci USA. 107(20):9158-9163 (2010).

[2]:
mavenn.load_example_dataset('sortseq')
[2]:
set ct_0 ct_1 ct_2 ct_3 ct_4 ct_5 ct_6 ct_7 ct_8 ct_9 x
0 test 0 0 0 0 0 0 0 0 1 0 GGCTTTACACTTTAAGCTGCCGCATCGTATGTTATGTGG
1 training 0 1 0 0 0 0 0 0 0 0 GGCTATACATTTTATGTTTCCGGGTCGTATTTTGTGTGG
2 training 0 0 0 0 0 0 0 0 1 0 GGCTTTACATTTTATGCTTCCTTCACGTATGTTGTGTCT
3 test 0 0 0 0 0 1 0 0 0 0 GGCATTACTCTTTGTGCTTCCGGCTCGTATGTTGTGTGG
4 test 0 0 0 0 0 0 0 1 0 0 GACTTTTCAATTTATGCTTTCAGTTGGTATGTTGTGTAG
... ... ... ... ... ... ... ... ... ... ... ... ...
45773 training 0 0 0 1 0 0 0 0 0 0 GGCTTTTCACTTTATGCTTCTGGCTCGTATGTTGTGTGG
45774 validation 2 0 0 0 0 0 0 0 0 0 GGTTTTACACTTTTTGCTTCCGGGCCAAATGTTGTGTGG
45775 training 0 0 1 0 0 0 0 0 0 0 GGCTCCACACATTATGCTTCCGGCTCGTCTGTTCGCTCG
45776 training 2 0 0 0 0 0 0 0 0 0 GGCTTTACACATTATGCTTCCGGCTCGTATGTTGTTTGG
45777 validation 0 0 0 0 2 0 0 0 0 0 GGCTTTACACTTTATGCTTCCGGCACGTTTGTTGTGTGG

45778 rows × 12 columns
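
The column layout described above can be used directly to split the data and tally reads. The following is a minimal sketch on a small made-up dataframe (only three 'ct_X' columns, invented sequences and counts) standing in for the output of mavenn.load_example_dataset('sortseq'); the variable names are illustrative assumptions, not part of MAVE-NN.

```python
import pandas as pd

# Toy stand-in for the 'sortseq' dataframe; sequences and counts are made up
data_df = pd.DataFrame({
    'set': ['training', 'test', 'validation', 'training'],
    'ct_0': [0, 1, 0, 2],
    'ct_1': [1, 0, 0, 0],
    'ct_2': [0, 0, 3, 1],
    'x': ['ACGT', 'ACGA', 'TCGT', 'ACCT'],
})

# Identify the count columns by their 'ct_' prefix
ct_cols = [c for c in data_df.columns if c.startswith('ct_')]

# Split rows according to the 'set' column
train_df = data_df[data_df['set'] == 'training']
val_df = data_df[data_df['set'] == 'validation']
test_df = data_df[data_df['set'] == 'test']

# Total reads for each sequence (row sums) and for each bin (column sums)
reads_per_seq = data_df[ct_cols].sum(axis=1)
reads_per_bin = data_df[ct_cols].sum(axis=0)

print(len(train_df), len(val_df), len(test_df))
print(reads_per_bin.to_dict())
```

Selecting the count columns by prefix rather than by position keeps the sketch independent of how many bins a given dataset has.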

Preprocessing

The sort-seq MPRA dataset of Kinney et al. (2010) is available at https://github.com/jbkinney/09_sortseq/ in the file file_S2.txt.gz. The file has three columns, which we name as follows when loading it: the 'experiment' column lists which of the six reported experiments produced each sequence, the 'bin' column lists the FACS bin in which that sequence was observed, and the 'x' column lists the (non-unique) 75 nt variant DNA sequences observed by high-throughput sequencing. This dataframe is called raw_df in what follows.

[3]:
# Download dataset
url = 'https://github.com/jbkinney/09_sortseq/raw/master/file_S2.txt.gz'
raw_data_file = 'file_S2.txt.gz'
urllib.request.urlretrieve(url, raw_data_file)

# Load raw dataset
raw_df = pd.read_csv(raw_data_file,
                     sep='\t',
                     header=None,
                     names=['experiment','bin','x'],
                     compression='gzip')

# Delete raw dataset
os.remove(raw_data_file)

# Preview raw_df
raw_df.head()
[3]:
experiment bin x
0 crp-wt B0 AATTAAGGGCAGTTAACTCACCCATTAGGCACCCCAGGCTTTACAC...
1 crp-wt B0 AATTAATATGAGTTTGCTCACCCATTAGGCACCCCAGGCTTTACAC...
2 crp-wt B0 AATTAATAAGAGTTCACTCACTCATACGGCACCCCAGGCTTTACAC...
3 crp-wt B0 AATTTATGTGCTTTACCTCACTGATTTGGCACCCCAGGCTTTACAC...
4 crp-wt B0 AATTAAGGTGAGTTCGCTCGCTCATGAGGCACCCCAGGCTTTACAC...
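
Before filtering, it can help to check how the reads are distributed across experiments and bins. The sketch below does this on a small made-up dataframe with the same three columns as raw_df; the rows, the short sequences, and the name raw_toy are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for raw_df: one row per sequenced read (values are made up)
raw_toy = pd.DataFrame({
    'experiment': ['crp-wt', 'crp-wt', 'full-wt', 'full-wt', 'full-wt'],
    'bin': ['B0', 'B1', 'B0', 'B3', 'B3'],
    'x': ['AATTA', 'AATTC', 'AATTA', 'AATTG', 'AATTG'],
})

# Number of reads contributed by each experiment
reads_per_experiment = raw_toy['experiment'].value_counts()

# Number of reads in each (experiment, bin) pair
reads_per_bin = raw_toy.groupby(['experiment', 'bin']).size()

# Distinct sequence lengths, to confirm all reads have the same length
seq_lengths = raw_toy['x'].str.len().unique()

print(reads_per_experiment.to_dict())
print(reads_per_bin.to_dict())
print(list(seq_lengths))
```

Applied to the real raw_df, the same three calls would show which experiment labels and bin labels are present before any rows are dropped.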

To reformat raw_df into the dataframe provided with MAVE-NN, we first trim it to keep only rows corresponding to the 'full-wt' experiment. We then rename each FACS bin 'BX' to 'ct_X' for X = 0, 1, …, 9, and create a 'ct' column filled with ones, so that each row represents a single read. The result is stored in a dataframe called sub_df.

Next we use the pivot() and groupby() functions in Pandas to obtain a dataframe in which the 'x' column lists only unique sequences, each of the 10 possible 'ct_X' values in the original 'bin' column now labels a separate column, and the values in these new columns report the number of times each sequence was observed in each FACS bin. The result is stored in a dataframe called pivot_df.
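
Note that pivot() raises a ValueError if the same (sequence, bin) pair occurs in more than one row. As an alternative sketch, not the procedure used to build the packaged dataset, pivot_table() with aggfunc='sum' sums such repeated rows instead. The toy dataframe and the names sub_toy and pivot_toy below are made up for illustration.

```python
import pandas as pd

# Toy stand-in for sub_df: 'ACGT' is observed twice in bin ct_0 (values are made up)
sub_toy = pd.DataFrame({
    'bin': ['ct_0', 'ct_0', 'ct_2', 'ct_1'],
    'x': ['ACGT', 'ACGT', 'ACGT', 'TTGA'],
})
sub_toy['ct'] = 1  # each row represents one read

# Sum the per-read counts for every (sequence, bin) pair; absent pairs become 0
pivot_toy = sub_toy.pivot_table(index='x', columns='bin', values='ct',
                                aggfunc='sum', fill_value=0)
pivot_toy.columns.name = None

# Make sure counts are integers, then move the sequences back into a column
pivot_toy = pivot_toy.astype(int).reset_index()

print(pivot_toy)
```

One caveat of this sketch: a bin that never appears in the 'bin' column gets no column in the result, so a dataset with empty bins would need its columns reindexed afterwards.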

Finally, we create a 'set' column that randomly assigns each sequence to the training, test, or validation set with probabilities 0.6, 0.2, and 0.2, respectively (giving an approximately 60:20:20 split), then reorder the columns for clarity. The resulting dataframe is called final_df.
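
Because each sequence is assigned independently, the realized set sizes only approximate the target proportions, while fixing the random seed makes the assignment reproducible. The sketch below illustrates both points on a made-up dataframe of 10,000 placeholder sequences; toy_df and the placeholder sequence names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy dataframe of placeholder sequences (not real data)
toy_df = pd.DataFrame({'x': [f'seq_{i}' for i in range(10000)]})

# Seed NumPy's global random number generator so the split is reproducible
np.random.seed(0)

# Draw one label per row with probabilities 0.6, 0.2, 0.2
toy_df['set'] = np.random.choice(a=['training', 'test', 'validation'],
                                 p=[.6, .2, .2],
                                 size=len(toy_df))

# Realized fraction of rows in each set; close to, but not exactly, the targets
fractions = toy_df['set'].value_counts(normalize=True)
print(fractions)
```

If exact set sizes were required, one could instead shuffle the row indices and slice them at fixed positions; the probabilistic draw shown here is the simpler option and matches the description above.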

[4]:
# Keep only data from the full-wt experiment
ix = raw_df['experiment']=='full-wt'
sub_df = raw_df[ix].copy().reset_index(drop=True)[['bin','x']]

# Rename bins BX -> ct_X, where X = 0, 1, ..., 9
sub_df['bin'] = [f'ct_{s[1:]}' for s in sub_df['bin']]

# Add counts column
sub_df['ct'] = 1

# Pivot dataframe
pivot_df = sub_df.pivot(index='x', values='ct', columns='bin').fillna(0).astype(int)
pivot_df.columns.name = None

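# Group by sequence and sum counts (pivot() already yields one row per
# sequence, so this leaves the counts unchanged)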
pivot_df = pivot_df.groupby('x').sum()

# Reindex dataframe
pivot_df = pivot_df.reset_index()

# Randomly assign sequences to training, validation, and test sets
final_df = pivot_df.copy()
np.random.seed(0)
final_df['set'] = np.random.choice(a=['training','test','validation'],
                                   p=[.6,.2,.2],
                                   size=len(final_df))

# Rearrange columns
new_cols = ['set'] + list(final_df.columns[1:-1]) + ['x']
final_df = final_df[new_cols]

# Save to file (uncomment to execute)
# final_df.to_csv('sortseq_data.csv.gz', index=False, compression='gzip')

# Preview final_df
final_df.head()
[4]:
set ct_0 ct_1 ct_2 ct_3 ct_4 ct_5 ct_6 ct_7 ct_8 ct_9 x
0 training 0 1 0 0 0 0 0 0 0 0 AAAAAAAGTGAGTTAGCCAACTAATTAGGCACCGTACGCTTTATAG...
1 test 0 0 0 0 0 0 0 0 1 0 AAAAAATCTGAGTTAGCTTACTCATTAGGCACCCCAGGCTTGACAC...
2 test 0 0 0 0 0 0 1 0 0 0 AAAAAATCTGAGTTTGCTCACTCTATCGGCACCCCAGTCTTTACAC...
3 training 0 0 0 0 0 0 0 0 0 1 AAAAAATGAGAGTTAGTTCACTCATTCGGCACCACAGGCTTTACAA...
4 training 0 0 0 0 0 0 0 0 0 1 AAAAAATGGGTGTTAGCTCTATCATTAGGCACCCCCGGCTTTACAC...

This final dataframe, final_df, has the same format as the 'sortseq' dataset that comes with MAVE-NN.
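
A quick way to confirm that a dataframe follows this layout is to check its columns and values programmatically. The helper below, has_sortseq_format, is a hypothetical function written for this illustration and is not part of MAVE-NN; it encodes only the conventions described on this page.

```python
import pandas as pd

def has_sortseq_format(df):
    """Hypothetical helper: check the 'set', 'ct_X', ..., 'x' column layout."""
    cols = list(df.columns)
    # Expect 'set' first, 'x' last, and at least one count column in between
    if len(cols) < 3 or cols[0] != 'set' or cols[-1] != 'x':
        return False
    ct_cols = cols[1:-1]
    if not all(c.startswith('ct_') for c in ct_cols):
        return False
    # Every row must belong to one of the three sets
    if not set(df['set']).issubset({'training', 'validation', 'test'}):
        return False
    # Counts must be non-negative integers
    if not all(pd.api.types.is_integer_dtype(df[c]) for c in ct_cols):
        return False
    if not bool((df[ct_cols] >= 0).all().all()):
        return False
    # Each sequence should appear in exactly one row
    return bool(df['x'].is_unique)

# Made-up examples: one valid layout and one with a duplicated sequence
good_df = pd.DataFrame({'set': ['training', 'test'],
                        'ct_0': [1, 0], 'ct_1': [0, 2],
                        'x': ['ACGT', 'ACGA']})
bad_df = good_df.copy()
bad_df['x'] = ['ACGT', 'ACGT']

print(has_sortseq_format(good_df), has_sortseq_format(bad_df))
```

Running the same check on final_df and on the output of mavenn.load_example_dataset('sortseq') would be one way to compare the two layouts.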