Tutorial 5: The sort-seq E. Coli lac promoter binding analysis using a custom biophysical G-P maps¶

[1]:

# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Special imports
import mavenn

The sort-seq MPRA data of Kinney et al., 2010. The authors in Ref. [1] used fluorescence-activated cell sorting, followed by deep sequencing, to assay gene expression levels from variant lac promoters in E. coli. The data is available in MAVE-nn load_example_dataset function and it is called 'sortseq'.

[2]:

# Choose dataset
data_name = 'sortseq'

print(f"Loading dataset '{data_name}' ")

# Load datset
data_df = mavenn.load_example_dataset(data_name)

# Get and report sequence length
L = len(data_df.loc[0, 'x'])
print(f'Sequence length: {L:d} amino acids')

# Preview dataset
data_df

Loading dataset 'sortseq'
Sequence length: 75 amino acids

[2]:

	set	ct_0	ct_1	ct_2	ct_3	ct_4	ct_5	ct_6	ct_7	ct_8	ct_9	x
0	training	0	1	0	0	0	0	0	0	0	0	AAAAAAAGTGAGTTAGCCAACTAATTAGGCACCGTACGCTTTATAG...
1	test	0	0	0	0	0	0	0	0	1	0	AAAAAATCTGAGTTAGCTTACTCATTAGGCACCCCAGGCTTGACAC...
2	test	0	0	0	0	0	0	1	0	0	0	AAAAAATCTGAGTTTGCTCACTCTATCGGCACCCCAGTCTTTACAC...
3	training	0	0	0	0	0	0	0	0	0	1	AAAAAATGAGAGTTAGTTCACTCATTCGGCACCACAGGCTTTACAA...
4	training	0	0	0	0	0	0	0	0	0	1	AAAAAATGGGTGTTAGCTCTATCATTAGGCACCCCCGGCTTTACAC...
...	...	...	...	...	...	...	...	...	...	...	...	...
50513	validation	0	0	0	1	0	0	0	0	0	0	TTTTGCAGAGTGTCAGCCCACTCATTACGCACCGCAGCCGTTACAC...
50514	test	0	0	0	0	0	0	0	0	1	0	TTTTTATGTGAGTTAGCTCACTCATTCGGCACCCTAGGCTTTACAC...
50515	training	0	0	0	1	0	0	0	0	0	0	TTTTTATGTGAGTTTGCTCACTCATGTGGCACCTAAGGCTTTACGC...
50516	training	1	0	0	0	0	0	0	0	0	0	TTTTTATGTGGGTTAGGTCGCGCATTAGGCACCGCAGGCTTTACCC...
50517	training	1	0	0	0	0	0	0	0	0	0	TTTTTATGTGTGTTTACTCTCTCATTAGGCACTCCACGCTTTACAC...

50518 rows × 12 columns

[3]:

# Split dataset
trainval_df, test_df = mavenn.split_dataset(data_df)

# Show dataset sizes
print(f'Train + val set size : {len(trainval_df):6,d} observations')
print(f'Test set size        : {len(test_df):6,d} observations')

# Preview trainval_df
trainval_df

Training set   :   30,516 observations (  60.41%)
Validation set :   10,067 observations (  19.93%)
Test set       :    9,935 observations (  19.67%)
-------------------------------------------------
Total dataset  :   50,518 observations ( 100.00%)

Train + val set size : 40,583 observations
Test set size        :  9,935 observations

[3]:

	validation	ct_0	ct_1	ct_2	ct_3	ct_4	ct_5	ct_6	ct_7	ct_8	ct_9	x
0	False	0	1	0	0	0	0	0	0	0	0	AAAAAAAGTGAGTTAGCCAACTAATTAGGCACCGTACGCTTTATAG...
1	False	0	0	0	0	0	0	0	0	0	1	AAAAAATGAGAGTTAGTTCACTCATTCGGCACCACAGGCTTTACAA...
2	False	0	0	0	0	0	0	0	0	0	1	AAAAAATGGGTGTTAGCTCTATCATTAGGCACCCCCGGCTTTACAC...
3	False	0	1	0	0	0	0	0	0	0	0	AAAAAATGTCAGTTAGCTGACTCATTAGGCACCCCTGGCTTTACGT...
4	True	0	0	0	0	0	0	1	0	0	0	AAAAAATGTGAGAAAGCTCACTCCTTTGGCACCGCAGGCTTTACAC...
...	...	...	...	...	...	...	...	...	...	...	...	...
40578	True	0	1	0	0	0	0	0	0	0	0	TTTTGATGTGGGTTTGCTCTCTCTTCAGGCACCCCACGCTTTACGC...
40579	True	0	0	0	1	0	0	0	0	0	0	TTTTGCAGAGTGTCAGCCCACTCATTACGCACCGCAGCCGTTACAC...
40580	False	0	0	0	1	0	0	0	0	0	0	TTTTTATGTGAGTTTGCTCACTCATGTGGCACCTAAGGCTTTACGC...
40581	False	1	0	0	0	0	0	0	0	0	0	TTTTTATGTGGGTTAGGTCGCGCATTAGGCACCGCAGGCTTTACCC...
40582	False	1	0	0	0	0	0	0	0	0	0	TTTTTATGTGTGTTTACTCTCTCATTAGGCACTCCACGCTTTACAC...

40583 rows × 12 columns

Training data are the count columns (10 columns) of the above dataset.

[4]:

# Get the length of the sequence
L = len(data_df['x'][0])
# Get the column index for the counts
y_cols = trainval_df.columns[1:-1]
# Get the number of count columns
len_y_cols = len(y_cols)

Training¶

A four-state thermodynamic model for transcriptional activation at E. coli lac promoter which proposed in Ref. [1] is trained here using MAVE-NN. Here, \(\Delta G_R\) and \(\Delta G_C\) are RNAP-DNA and CRP-DNA binding Gibbs free energies and CRP-RNAP interaction energy is represented by a scalar \(\Delta G_I\). The four-state thermodynamic model is summarized below:

microstates	Gibbs free energies	activity
free DNA	0	0
CRP-DNA binding	\(\Delta G_C\)	0
RNAP-DNA binding	\(\Delta G_R\)	1
CRP and RNAP both bounded to DNA and interact	\(\Delta G_C+ \Delta G_R + \Delta G_I\)	1

The rate of transcription tr_rate has the following form:

\[tr_{\text{rate}} = t_{sat} \frac {\exp(-\Delta G_R)+ \exp(-\Delta G_C-\Delta G_R-\Delta G_I)}{1+\exp(-\Delta G_C)+ \exp(-\Delta G_R)+ \exp(-\Delta G_C- \Delta G_R- \Delta G_I)}\]

in which \(t_{sat}\) is the transcription rate resulting from full RNAP occupancy.

Here, the \(\Delta G_C\) and \(\Delta G_R\) are trainable matrices (weights of the neural network) and \(\Delta G_I\) and \(t_{sat}\) are trainable scalars.

To fit the above thermodynamic models, we used the Custom G-P maps layer implemented in 'mavenn.src.layers.gpmap'. For the detailed discussion on how to use the MAVE-NN custom G-P maps layer, checkout the thermodynamic model for IgG binding by GB1 tutorial.

[5]:

from mavenn.src.layers.gpmap import GPMapLayer

# Tensorflow imports
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras.initializers import Constant

class ThermodynamicLayer(GPMapLayer):
    """
    Represents a four-stage thermodynamic model
    containing the states:
    1. free DNA
    2. CPR-DNA binding
    3. RNAP-DNA binding
    4. CPR and RNAP both bounded to DNA and interact
    """

    def __init__(self,
                 tf_start,
                 tf_end,
                 rnap_start,
                 rnap_end,
                 *args, **kwargs):
        """Construct layer instance."""

        # Call superclass
        super().__init__(*args, **kwargs)

        # set attributes
        self.tf_start = tf_start            # transcription factor starting position
        self.tf_end = tf_end                 # transcription factor ending position
        self.L_tf = tf_end - tf_start        # length of transcription factor
        self.rnap_start = rnap_start         # RNAP starting position
        self.rnap_end = rnap_end             # RNAP ending position
        self.L_rnap = rnap_end - rnap_start  # length of RNAP

        # define bias/chemical potential weight for TF/CRP energy
        self.theta_tf_0 = self.add_weight(name='theta_tf_0',
                                          shape=(1,),
                                          initializer=Constant(1.),
                                          trainable=True,
                                          regularizer=self.regularizer)

        # define bias/chemical potential weight for rnap energy
        self.theta_rnap_0 = self.add_weight(name='theta_rnap_0',
                                            shape=(1,),
                                            initializer=Constant(1.),
                                            trainable=True,
                                            regularizer=self.regularizer)

        # initialize the theta_tf
        theta_tf_shape = (1, self.L_tf, self.C)
        theta_tf_init = np.random.randn(*theta_tf_shape)/np.sqrt(self.L_tf)

        # define the weights of the layer corresponds to theta_tf
        self.theta_tf = self.add_weight(name='theta_tf',
                                        shape=theta_tf_shape,
                                        initializer=Constant(theta_tf_init),
                                        trainable=True,
                                        regularizer=self.regularizer)

        # define theta_rnap parameters
        theta_rnap_shape = (1, self.L_rnap, self.C)
        theta_rnap_init = np.random.randn(*theta_rnap_shape)/np.sqrt(self.L_rnap)

        # define the weights of the layer corresponds to theta_rnap
        self.theta_rnap = self.add_weight(name='theta_rnap',
                                          shape=theta_rnap_shape,
                                          initializer=Constant(theta_rnap_init),
                                          trainable=True,
                                          regularizer=self.regularizer)

        # define trainable real number G_I, representing interaction Gibbs energy
        self.theta_dG_I = self.add_weight(name='theta_dG_I',
                                   shape=(1,),
                                   initializer=Constant(-4),
                                   trainable=True,
                                   regularizer=self.regularizer)


    def call(self, x):
        """Process layer input and return output.

        x: (tensor)
            Input tensor that represents one-hot encoded
            sequence values.
        """

        # 1kT = 0.616 kcal/mol at body temperature
        kT = 0.616

        # extract locations of binding sites from entire lac-promoter sequence.
        # for transcription factor and rnap
        x_tf = x[:, self.C * self.tf_start:self.C * self.tf_end]
        x_rnap = x[:, self.C * self.rnap_start: self.C * self.rnap_end]

        # reshape according to tf and rnap lengths.
        x_tf = tf.reshape(x_tf, [-1, self.L_tf, self.C])
        x_rnap = tf.reshape(x_rnap, [-1, self.L_rnap, self.C])

        # compute delta G for crp binding
        G_C = self.theta_tf_0 + \
            tf.reshape(K.sum(self.theta_tf * x_tf, axis=[1, 2]),
                       shape=[-1, 1])

        # compute delta G for rnap binding
        G_R = self.theta_rnap_0 + \
            tf.reshape(K.sum(self.theta_rnap * x_rnap, axis=[1, 2]),
                       shape=[-1, 1])

        G_I = self.theta_dG_I

        # compute phi
        numerator_of_rate = K.exp(-G_R/kT) + K.exp(-(G_C+G_R+G_I)/kT)
        denom_of_rate = 1.0 + K.exp(-G_C/kT) + K.exp(-G_R/kT) + K.exp(-(G_C+G_R+G_I)/kT)
        phi = numerator_of_rate/denom_of_rate

        return phi

Training the Model¶

Here we train the model with MAVEN. Note that we initialize the Gibbs energies with random negative values to help the convergence of the training.

[6]:

# define custom gp_map parameters dictionary
gpmap_kwargs = {'tf_start': 1,  # starting position of the CRP
                'tf_end': 27,  # ending position of the CRP
                'rnap_start': 34,  # starting position of the RNAP
                'rnap_end': 75,   # ending position of the RNAP
                'L': L,
                'C': 4,
                'theta_regularization': 0.1}

# Create model
model = mavenn.Model(L=L,
                     Y=len_y_cols,
                     alphabet='dna',
                     regression_type='MPA',
                     gpmap_type='custom',
                     gpmap_kwargs=gpmap_kwargs,
                     custom_gpmap=ThermodynamicLayer);

# Set training data
model.set_data(x=trainval_df['x'],
               y=trainval_df[y_cols],
               validation_flags=trainval_df['validation'],
               shuffle=True);

# Fit model to data
model.fit(learning_rate=5e-4,
          epochs=2000,
          batch_size=100,
          early_stopping=True,
          early_stopping_patience=25,
          linear_initialization=False,
          verbose=False);

N = 40,583 observations set as training data.
Using 24.8% for validation.
Data shuffled.
Time to set data: 0.447 sec.

Training time: 71.6 seconds

[7]:

model.save('sortseq_thermodynamic_mpa')

Model saved to these files:
        sortseq_thermodynamic_mpa.pickle
        sortseq_thermodynamic_mpa.h5

[8]:

# Compute predictive information on test data
I_pred, dI_pred = model.I_predictive(x=test_df['x'], y=test_df[y_cols])
print(f'test_I_pred: {I_pred:.3f} +- {dI_pred:.3f} bits')

test_I_pred: 0.685 +- 0.011 bits

[9]:

# Create figure and axes for plotting
fig, ax = plt.subplots(1, 1, figsize=[5, 5])

# Plot I_var_train, the variational information on training data as a function of epoch
ax.plot(model.history['I_var'], label=r'I_var_train')

# Plot I_var_val, the variational information on validation data as a function of epoch
ax.plot(model.history['val_I_var'], label=r'val_I_var')

# Show I_pred_test, the predictive information of the final model on test data
ax.axhline(I_pred, color='C3', linestyle=':', label=r'test_I_pred')

# Style plot
ax.set_xlabel('epochs')
ax.set_ylabel('bits')
ax.set_title('Training history: variational information')
ax.legend()

plt.tight_layout()

../_images/tutorials_5_promoter_biophysical_gpmaps_13_0.png

[10]:

# Get the trained model parameters
# Retrieve G-P map parameter dict and view dict keys
param_dict = model.layer_gpmap.get_params()
param_dict.keys()

[10]:

dict_keys(['theta_tf_0', 'theta_rnap_0', 'theta_tf', 'theta_rnap', 'theta_dG_I'])

Authors in Ref. [1], reported they inferred CRP-RNAP interaction energy to be \(\Delta G_I = −3.26\) kcal∕mol. The MAVE-NN prediction is very similar to the reported value, while it is several order of magnitude faster than the method used in Ref. [1].

[11]:

delta_G_I = param_dict['theta_dG_I']  # Gibbs energy of Interaction (scalar)
print(f'CRP-RNAP interaction energy = {delta_G_I*0.62:.3f} k_cal/mol')

CRP-RNAP interaction energy = -1.549 k_cal/mol

In addition, we can represent the CRP-DNA \(\Delta G_C\) and RNAP-DNA \(\Delta G_R\) binding energies in the weight matrix form. The weight matrices can be represented by sequence logos. To do that, we used the a Python package Logomaker which is also developed in our research group and is freely available here. To represent the weight matrices in logo

we get the trained crp_weights and rnap_weights values,
we convert them to the pandas.DataFrame with column names being the nucleotide strings.

The pandas.DataFrame can easily imported in Logomaker. See the documentation of Logomaker for detailed description of the parameters one can pass to make logos.

[12]:

# import logomaker
import logomaker

# Get the \Delta G_C trained values (theta_tf)
crp_weights = param_dict['theta_tf']
# Get the \Delta G_R trained values (theta_rnap)
rnap_weights = param_dict['theta_rnap']
# Convert them to pandas dataframe
crp_df = pd.DataFrame(crp_weights, columns=model.alphabet)
rnap_df = pd.DataFrame(rnap_weights, columns=model.alphabet)


# plot logos
fig, axs = plt.subplots(1, 2, figsize=[12, 3])

# sequence logo for the CRP-DNA binding energy matrix
logo = logomaker.Logo(crp_df,  ax=axs[0], center_values=True)
axs[0].set_title('CRP-DNA ($\Delta G_C$)')
logo.style_spines(visible=False)


# sequence logo for the RNAP-DNA binding energy matrix
logo = logomaker.Logo(rnap_df, ax=axs[1], center_values=True)
axs[1].set_title('RNAP-DNA ($\Delta G_R$)')
logo.style_spines(visible=False)

plt.tight_layout()

../_images/tutorials_5_promoter_biophysical_gpmaps_18_0.png

References¶

Kinney J, Murugan A, Callan C, Cox E (2010). Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc Natl Acad Sci USA. 107(20):9158-9163.