API Reference

Tests

A suite of automated tests is provided to ensure proper software installation and execution.

mavenn.run_tests()

Run all MAVE-NN functional tests.
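
A minimal usage sketch:

    import mavenn

    # Run the full functional test suite; pass/fail results are printed to the console
    mavenn.run_tests()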

Examples

A variety of real-world datasets, pre-trained models, analysis demos, and tutorials can be accessed using the following functions.

mavenn.load_example_dataset(name=None)

Load example dataset provided with MAVE-NN.

Parameters
name: (str)

Name of example dataset. If None, a list of valid dataset names will be printed.

Returns
data_df: (pd.DataFrame)

Dataframe containing the example dataset.
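
A minimal usage sketch; the dataset name below is illustrative only, so call the function with no arguments first to see the valid names:

    import mavenn

    # With name=None, the valid dataset names are printed
    mavenn.load_example_dataset()

    # Load one of the listed datasets ('mpsa' is an assumed example name)
    data_df = mavenn.load_example_dataset(name='mpsa')
    print(data_df.head())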

mavenn.load_example_model(name=None)

Load an example model already inferred by MAVE-NN.

Parameters
name: (str, None)

Name of model to load. If None, a list of valid model names will be printed.

Returns
model: (mavenn.Model)

A pre-trained Model object.
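
A minimal usage sketch; the model name below is illustrative only, so call the function with no arguments first to see the valid names:

    import mavenn

    # With name=None, the valid model names are printed
    mavenn.load_example_model()

    # Load one of the listed models (the name here is an assumed example)
    model = mavenn.load_example_model(name='mpsa_ge_pairwise')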

mavenn.run_demo(name=None, print_code=False, print_names=True)

Perform a demonstration of MAVE-NN.

Parameters
name: (str, None)

Name of demo to run. If None, a list of valid demo names will be returned.

print_code: (bool)

If True, the text of the demo file will be printed along with the output from running this file. If False, only the demo output will be shown.

print_names: (bool)

If True and name=None, the names of all demos will be printed.

Returns
demo_names: (list, None)

List of demo names, returned if the user passes name=None. Otherwise None.
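
A minimal usage sketch:

    import mavenn

    # With name=None, the list of valid demo names is returned (and printed)
    demo_names = mavenn.run_demo()

    # Run the first listed demo, also printing its source code
    mavenn.run_demo(name=demo_names[0], print_code=True)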

mavenn.list_tutorials()

Reveal local directory where MAVE-NN tutorials are stored, as well as the names of available tutorial notebook files.

Load

MAVE-NN allows users to save and load trained models.

mavenn.load(filename, verbose=True)

Load a previously saved model.

Saved models are represented by two files having the same root and two different extensions, .pickle and .h5. The .pickle file contains model metadata, including all information needed to reconstruct the model’s architecture. The .h5 file contains the values of the trained neural network weights.

Parameters
filename: (str)

File directory and root. Do not include extensions.

verbose: (bool)

Whether to print feedback.

Returns
loaded_model: (mavenn.Model)

MAVE-NN model object.
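
A usage sketch, assuming a model was previously saved with Model.save('my_model'); the filename root is illustrative:

    import mavenn

    # Loads my_model.pickle and my_model.h5; note that no extension is given
    loaded_model = mavenn.load('my_model')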

Visualization

MAVE-NN provides the following two methods to facilitate the visualization of inferred genotype-phenotype maps.

mavenn.heatmap(values, alphabet, seq=None, seq_kwargs=None, ax=None, show_spines=False, cbar=True, cax=None, clim=None, clim_quantile=1, ccenter=None, cmap='coolwarm', cmap_size='5%', cmap_pad=0.1)

Draw a heatmap illustrating an L x C matrix of values, where L is sequence length and C is the alphabet size.

Parameters
values: (np.ndarray)

Array of shape (L,C) that contains values to plot.

alphabet: (str, np.ndarray)

Alphabet name 'dna', 'rna', or 'protein', or 1D array containing characters in the alphabet.

seq: (str, None)

The sequence to show, if any, using dots plotted on top of the heatmap. Must have length L and be comprised of characters in alphabet.

seq_kwargs: (dict)

Arguments to pass to Axes.scatter() when drawing dots to illustrate the characters in seq.

ax: (matplotlib.axes.Axes)

The Axes object on which the heatmap will be drawn. If None, one will be created. If specified, cbar=True, and cax=None, ax will be split in two to make room for a colorbar.

show_spines: (bool)

Whether to show spines around the edges of the heatmap.

cbar: (bool)

Whether to draw a colorbar next to the heatmap.

cax: (matplotlib.axes.Axes, None)

The Axes object on which the colorbar will be drawn, if requested. If None, one will be created by splitting ax in two according to cmap_size and cmap_pad.

clim: (list, None)

List of the form [cmin, cmax], specifying the maximum cmax and minimum cmin values spanned by the colormap. Overrides clim_quantile.

clim_quantile: (float)

Must be a float in the range [0,1]. clim will be automatically chosen to include this central quantile of values.

ccenter: (float)

Value at which to position the center of a diverging colormap. Setting ccenter=0 often makes sense.

cmap: (str, matplotlib.colors.Colormap)

Colormap to use.

cmap_size: (str)

Fraction of ax width to be used for the colorbar. For formatting requirements, see the documentation for mpl_toolkits.axes_grid1.make_axes_locatable().

cmap_pad: (float)

Space between colorbar and the shrunken heatmap Axes. For formatting requirements, see the documentation for mpl_toolkits.axes_grid1.make_axes_locatable().

Returns
ax: (matplotlib.axes.Axes)

Axes object containing the heatmap.

cb: (matplotlib.colorbar.Colorbar, None)

Colorbar object linked to ax, or None if no colorbar was drawn.
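
A usage sketch for visualizing the additive parameters of a trained model; the trained model object is assumed to already exist (e.g., from mavenn.load_example_model()):

    import mavenn
    import matplotlib.pyplot as plt

    # 'model' is assumed to be a trained additive model on protein sequences
    theta_dict = model.get_theta(gauge='empirical')

    fig, ax = plt.subplots(figsize=(10, 3))
    ax, cb = mavenn.heatmap(values=theta_dict['theta_lc'],   # shape (L,C)
                            alphabet='protein',
                            ccenter=0,
                            ax=ax)
    ax.set_xlabel('position')
    cb.set_label('additive effect')
    plt.show()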

mavenn.heatmap_pairwise(values, alphabet, seq=None, seq_kwargs=None, ax=None, gpmap_type='pairwise', show_position=False, position_size=None, position_pad=1, show_alphabet=True, alphabet_size=None, alphabet_pad=1, show_seplines=True, sepline_kwargs=None, xlim_pad=0.1, ylim_pad=0.1, cbar=True, cax=None, clim=None, clim_quantile=1, ccenter=0, cmap='coolwarm', cmap_size='5%', cmap_pad=0.1)

Draw a heatmap illustrating pairwise or neighbor values, e.g. representing model parameters, mutational effects, etc.

Note: The resulting plot has an aspect ratio of 1 and is scaled so that pixels have half-diagonal lengths given by half_pixel_diag = 1/(C*2), and blocks of characters have half-diagonal lengths given by half_block_diag = 1/2. This is done so that the horizontal distance between positions (as indicated by x-ticks) is 1.

Parameters
values: (np.array)

An array, shape (L,C,L,C), containing pairwise or neighbor values. Note that only values at coordinates [l1, c1, l2, c2] with l2 > l1 will be plotted. NaN values will not be plotted.

alphabet: (str, np.ndarray)

Alphabet name 'dna', 'rna', or 'protein', or 1D array containing characters in the alphabet.

seq: (str, None)

The sequence to show, if any, using dots plotted on top of the heatmap. Must have length L and be comprised of characters in alphabet.

seq_kwargs: (dict)

Arguments to pass to Axes.scatter() when drawing dots to illustrate the characters in seq.

ax: (matplotlib.axes.Axes)

The Axes object on which the heatmap will be drawn. If None, one will be created. If specified, cbar=True, and cax=None, ax will be split in two to make room for a colorbar.

gpmap_type: (str)

Determines how many pairwise parameters are plotted. Must be 'pairwise' or 'neighbor'. If 'pairwise', a triangular heatmap will be plotted. If 'neighbor', a heatmap resembling a string of diamonds will be plotted.

show_position: (bool)

Whether to annotate the heatmap with position labels.

position_size: (float)

Font size to use for position labels. Must be >= 0.

position_pad: (float)

Additional padding, in units of half_pixel_diag, used to space the position labels further from the heatmap.

show_alphabet: (bool)

Whether to annotate the heatmap with character labels.

alphabet_size: (float)

Font size to use for alphabet. Must be >= 0.

alphabet_pad: (float)

Additional padding, in units of half_pixel_diag, used to space the alphabet labels from the heatmap.

show_seplines: (bool)

Whether to draw lines separating character blocks for different position pairs.

sepline_kwargs: (dict)

Keywords to pass to Axes.plot() when drawing seplines.

xlim_pad: (float)

Additional padding to add (in absolute units) both left and right of the heatmap.

ylim_pad: (float)

Additional padding to add (in absolute units) both above and below the heatmap.

cbar: (bool)

Whether to draw a colorbar next to the heatmap.

cax: (matplotlib.axes.Axes, None)

The Axes object on which the colorbar will be drawn, if requested. If None, one will be created by splitting ax in two according to cmap_size and cmap_pad.

clim: (list, None)

List of the form [cmin, cmax], specifying the maximum cmax and minimum cmin values spanned by the colormap. Overrides clim_quantile.

clim_quantile: (float)

Must be a float in the range [0,1]. clim will be automatically chosen to include this central quantile of values.

ccenter: (float)

Value at which to position the center of a diverging colormap. Setting ccenter=0 often makes sense.

cmap: (str, matplotlib.colors.Colormap)

Colormap to use.

cmap_size: (str)

Fraction of ax width to be used for the colorbar. For formatting requirements, see the documentation for mpl_toolkits.axes_grid1.make_axes_locatable().

cmap_pad: (float)

Space between colorbar and the shrunken heatmap Axes. For formatting requirements, see the documentation for mpl_toolkits.axes_grid1.make_axes_locatable().

Returns
ax: (matplotlib.axes.Axes)

Axes object containing the heatmap.

cb: (matplotlib.colorbar.Colorbar, None)

Colorbar object linked to ax, or None if no colorbar was drawn.
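
A usage sketch for visualizing pairwise parameters; the trained pairwise model is assumed to already exist:

    import mavenn
    import matplotlib.pyplot as plt

    # 'model' is assumed to be a trained model with gpmap_type='pairwise'
    theta_dict = model.get_theta(gauge='empirical')

    fig, ax = plt.subplots(figsize=(10, 5))
    ax, cb = mavenn.heatmap_pairwise(values=theta_dict['theta_lclc'],  # shape (L,C,L,C)
                                     alphabet='dna',
                                     gpmap_type='pairwise',
                                     ax=ax)
    ax.set_xlabel('position pair')
    cb.set_label('pairwise effect')
    plt.show()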

Models

The mavenn.Model class represents all neural-network-based models inferred by MAVE-NN. A variety of class methods make it easy to:

  • define models,

  • fit models to data,

  • access model parameters and metadata,

  • save models,

  • evaluate models on new data.

In particular, these methods allow users to train and analyze models without prior knowledge of TensorFlow 2, the deep learning framework used by MAVE-NN as a backend.

class mavenn.Model(L, alphabet, regression_type, gpmap_type='additive', gpmap_kwargs={}, Y=2, ge_nonlinearity_type='nonlinear', ge_nonlinearity_monotonic=True, ge_nonlinearity_hidden_nodes=50, ge_noise_model_type='Gaussian', ge_heteroskedasticity_order=0, normalize_phi=True, mpa_hidden_nodes=50, theta_regularization=0.001, eta_regularization=0.1, ohe_batch_size=50000, custom_gpmap=None, initial_weights=None)

Represents a MAVE-NN model, which includes a genotype-phenotype (G-P) map as well as a measurement process. For global epistasis (GE) regression, set regression_type='GE'; for measurement process agnostic (MPA) regression, set regression_type='MPA'.

Parameters
L: (int)

Length of each training sequence. Must be >= 1.

alphabet: (str, np.ndarray)

Either the alphabet name ('dna', 'rna', or 'protein') or a 1D array of characters to be used as the alphabet.

regression_type: (str)

Type of regression implemented by the model. Choices are 'GE' (for a global epistasis model) and 'MPA' (for a measurement process agnostic model).

gpmap_type: (str)

Type of G-P map to infer. Choices are 'additive', 'neighbor', 'pairwise', and 'blackbox'.

gpmap_kwargs: (dict)

Additional keyword arguments used for specifying the G-P map.

Y: (int)

The number of discrete y bins to use when defining an MPA model. Must be >= 2. Has no effect on GE models.

ge_nonlinearity_monotonic: (boolean)

Whether to enforce a monotonicity constraint on the GE nonlinearity. Has no effect on MPA models.

ge_nonlinearity_hidden_nodes: (int)

Number of hidden nodes (i.e. sigmoidal contributions) to use when defining the nonlinearity component of a GE model. Has no effect on MPA models.

ge_noise_model_type: (str)

Noise model to use when defining a GE model. Choices are 'Gaussian', 'Cauchy', 'SkewedT', or 'Empirical'. Has no effect on MPA models.

ge_heteroskedasticity_order: (int)

In the GE model context, this represents the order of the polynomial(s) used to define noise model parameters as functions of yhat. The larger this is, the more heteroskedastic an inferred noise model is likely to be. Set to 0 to enforce a homoskedastic noise model. Has no effect on MPA models. Must be >= 0.

normalize_phi: (bool)

Whether to fix diffeomorphic modes after model training.

mpa_hidden_nodes: (int)

Number of hidden nodes (i.e. sigmoidal contributions) to use when defining the MPA measurement process. Must be >= 1.

theta_regularization: (float)

L2 regularization strength for G-P map parameters theta. Must be >= 0; use 0 for no regularization.

eta_regularization: (float)

L2 regularization strength for measurement process parameters eta. Must be >= 0; use 0 for no regularization.

ohe_batch_size: (int)

DISABLED. How many sequences to one-hot encode at a time when calling Model.set_data(). Typically, the larger this number is the quicker the encoding will happen. A number too large, however, may cause the computer’s memory to run out. Must be >= 1.

custom_gpmap: (GPMapLayer sub-class)

A user-provided custom G-P map: a subclass of GPMapLayer that defines the functionality of the x_to_phi layer.

initial_weights: (np.array)

NumPy array of weights used as the model's initial weights. Ignored if None.
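
A minimal construction sketch for a GE regression model; sequence length, alphabet, and hyperparameter values are illustrative:

    import mavenn

    # Additive G-P map with a heteroskedastic Gaussian noise model
    model = mavenn.Model(L=20,
                         alphabet='protein',
                         regression_type='GE',
                         gpmap_type='additive',
                         ge_noise_model_type='Gaussian',
                         ge_heteroskedasticity_order=2)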

Methods

I_predictive(x, y[, ct, knn, knn_fuzz, ...])

Estimate predictive information.

I_variational(x, y[, ct, knn_fuzz, uncertainty])

Estimate variational information.

bootstrap(data_df[, num_models, verbose, ...])

Sample plausible models using parametric bootstrapping.

fit([epochs, learning_rate, ...])

Infer values for model parameters.

get_nn()

Return the underlying TensorFlow neural network.

get_theta([gauge, p_lc, x_wt, unobserved_value])

Return parameters of the G-P map.

p_of_y_given_phi(y, phi[, paired])

Compute probabilities p( y | phi ).

p_of_y_given_x(y, x[, paired])

Compute probabilities p( y | x ).

p_of_y_given_yhat(y, yhat[, paired])

Compute probabilities p( y | yhat); GE models only.

phi_to_yhat(phi)

Compute yhat given phi; GE models only.

save(filename[, verbose])

Save model.

set_data(x, y[, dy, ct, validation_frac, ...])

Set training data.

simulate_dataset(template_df)

Generate a simulated dataset.

x_to_phi(x)

Compute phi given x.

x_to_yhat(x)

Compute yhat given x.

yhat_to_yq(yhat[, q, paired])

Compute quantiles of p( y | yhat); GE models only.

I_predictive(x, y, ct=None, knn=5, knn_fuzz=0.01, uncertainty=True, num_subsamples=25, use_LNC=False, alpha_LNC=0.5, verbose=False)

Estimate predictive information.

Predictive information, I_pred, is the mutual information I[ phi ; y] between latent phenotypes phi and measurements y. Unlike variational information, I_pred does not assume that the inferred measurement process p( y | phi ) is correct. I_pred is estimated using the k-nearest-neighbor methods from the NPEET package.

Parameters
x: (np.ndarray)

1D array of N sequences, each of length L.

y: (np.ndarray)

Array of measurements. For GE models, y must be a 1D array of N floats. For MPA models, y must be either a 1D or 2D array of nonnegative ints. If 1D, y must be of length N, and will be interpreted as listing bin numbers, i.e., 0, 1, ..., Y-1. If 2D, y must be of shape (N,Y), and will be interpreted as listing the observed counts for each of the N sequences in each of the Y bins.

ct: (np.ndarray, None)

Only used for MPA models when y is 1D. In this case, ct must be a 1D array, length N, of nonnegative integers, and represents the number of observations of each sequence in each bin. Use ct=None for GE models, as well as for MPA models when y is 2D.

knn: (int>0)

Number of nearest neighbors to use in the entropy estimators from the NPEET package.

knn_fuzz: (float>0)

Amount of noise to add to phi values before passing them to the KNN estimators. Specifically, Gaussian noise with standard deviation knn_fuzz * np.std(phi) is added to phi values. This is a hack and is not ideal, but is needed to get the KNN estimates to behave well on real MAVE data.

uncertainty: (bool)

Whether to estimate the uncertainty in I_pred. Substantially increases runtime if True.

num_subsamples: (int)

Number of subsamples to use when estimating the uncertainty in I_pred.

use_LNC: (bool)

Whether to use the Local Nonuniform Correction (LNC) of Gao et al., 2015 when computing I_pred for GE models. Substantially increases runtime if set to True.

alpha_LNC: (float in (0,1))

Value of alpha to use when computing the LNC correction. See Gao et al., 2015 for details. Used only for GE models.

verbose: (bool)

Whether to print results and execution time.

Returns
I_pred: (float)

Estimated predictive information, in bits.

dI_pred: (float)

Standard error for I_pred. Equals 0 if uncertainty=False.
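
A usage sketch; the trained model and held-out test arrays x_test and y_test are assumed to exist:

    # Estimate predictive information, with uncertainty, on held-out data
    I_pred, dI_pred = model.I_predictive(x=x_test, y=y_test, uncertainty=True)
    print(f'I_pred = {I_pred:.3f} +/- {dI_pred:.3f} bits')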

I_variational(x, y, ct=None, knn_fuzz=0.01, uncertainty=True)

Estimate variational information.

Variational information, I_var, is the mutual information I[ phi ; y] between latent phenotypes phi and measurements y under the assumption that the inferred measurement process p( y | phi ) is correct. I_var is an affine transformation of log likelihood and thus provides a useful metric during model training. When evaluated on test data, I_var also provides a lower bound to the predictive information I_pred, which does not assume that the inferred measurement process is correct. The difference I_pred - I_var thus quantifies the mismatch between the inferred measurement process and the true conditional distribution p( y | phi ).

Parameters
x: (np.ndarray)

1D array of N sequences, each of length L.

y: (np.ndarray)

Array of measurements. For GE models, y must be a 1D array of N floats. For MPA models, y must be either a 1D or 2D array of nonnegative ints. If 1D, y must be of length N, and will be interpreted as listing bin numbers, i.e., 0, 1, ..., Y-1. If 2D, y must be of shape (N,Y), and will be interpreted as listing the observed counts for each of the N sequences in each of the Y bins.

ct: (np.ndarray, None)

Only used for MPA models when y is 1D. In this case, ct must be a 1D array, length N, of nonnegative integers, and represents the number of observations of each sequence in each bin. Use ct=None for GE models, as well as for MPA models when y is 2D.

knn_fuzz: (float>0)

Amount of noise to add to y values before passing them to the KNN estimators. Specifically, Gaussian noise with standard deviation knn_fuzz * np.std(y) is added to y values. This is a hack and is not ideal, but is needed to get the KNN estimates to behave well on real MAVE data. Only used for GE regression models.

uncertainty: (bool)

Whether to estimate the uncertainty of I_var.

Returns
I_var: (float)

Estimated variational information, in bits.

dI_var: (float)

Standard error for I_var. Equals 0 if uncertainty=False.
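
A usage sketch; the trained model and held-out test arrays x_test and y_test are assumed to exist:

    # Estimate variational information on held-out test data
    I_var, dI_var = model.I_variational(x=x_test, y=y_test)

    # Comparing with I_predictive quantifies measurement-process mismatch
    I_pred, dI_pred = model.I_predictive(x=x_test, y=y_test)
    print(f'I_pred - I_var = {I_pred - I_var:.3f} bits')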

bootstrap(data_df, num_models=10, verbose=True, initialize_from_self=False, fit_kwargs={})

Sample plausible models using parametric bootstrapping.

Given a copy data_df of the initial dataset used to train/test the model, this function first simulates num_models datasets, each of which has the same sequences and corresponding training, validation, and test set designations as data_df, but simulated measurement values (either y column or ct_# column values) generated using self. One model having the same form as self is then fit to each dataset, and the list of resulting models is returned to the user.

Parameters
data_df: (pd.DataFrame)

The dataset used to fit the original model (i.e., self). Must have a column 'x' listing sequences, as well as a column 'set' whose entries are 'training', 'validation', or 'test'.

num_models: (int > 0)

Number of models to return.

verbose: (bool)

Whether to print feedback.

initialize_from_self: (bool)

Whether to initiate each bootstrapped model from the inferred parameters of self. WARNING: using this option can cause systematic underestimation of parameter uncertainty.

fit_kwargs: (dict)

Dictionary of keyword arguments. Entries will override the keyword arguments that were passed to self.fit() during initial model training, and which are used by default for training the simulation-inferred model here.

Returns
models: (list)

List of mavenn.Model objects.
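
A usage sketch; model and data_df, the dataset used to train it, are assumed to exist:

    import numpy as np

    # Fit 20 models to simulated replicates of data_df
    models = model.bootstrap(data_df=data_df, num_models=20)

    # Estimate parameter uncertainty, e.g., the spread of additive effects
    thetas = np.stack([m.get_theta()['theta_lc'] for m in models])
    theta_std = thetas.std(axis=0)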

fit(epochs=50, learning_rate=0.005, validation_split=0.2, verbose=True, early_stopping=True, early_stopping_patience=20, batch_size=50, linear_initialization=True, freeze_theta=False, callbacks=[], try_tqdm=True, optimizer='Adam', optimizer_kwargs={}, fit_kwargs={})

Infer values for model parameters.

Uses training algorithms from TensorFlow to learn model parameters. Before this is run, the training data must be set using Model.set_data().

Parameters
epochs: (int)

Maximum number of epochs to complete during model training. Must be >= 0.

learning_rate: (float)

Learning rate. Must be > 0.

validation_split: (float in [0,1])

Fraction of training data to reserve for validation.

verbose: (boolean)

Whether to show progress during training.

early_stopping: (bool)

Whether to use early stopping.

early_stopping_patience: (int)

Number of epochs to wait, after a minimum value of validation loss is observed, before terminating the model training process.

batch_size: (None, int)

Batch size to use for stochastic gradient descent and related algorithms. If None, a full-sized batch is used. Note that the negative log likelihood loss function used by MAVE-NN is extrinsic in batch_size, i.e., its value scales with the number of examples in each batch.

linear_initialization: (bool)

Whether to initialize the G-P map parameters using the results of a linear regression computation. Has no effect when gpmap_type='blackbox'.

freeze_theta: (bool)

Whether to set the weights of the G-P map layer to be non-trainable. Note that setting linear_initialization=True and freeze_theta=True will set theta to be initialized at the linear regression solution and then become frozen during training.

callbacks: (list)

Optional list of tf.keras.callbacks.Callback objects to use during training.

try_tqdm: (bool)

If True, mavenn will attempt to load the package tqdm and append TqdmCallback(verbose=0) to the callbacks list in order to improve the visual display of training progress. If users do not have tqdm installed, this will do nothing.

optimizer: (str)

Optimizer to use for training. Valid options include: 'SGD', 'RMSprop', 'Adam', 'Adadelta', 'Adagrad', 'Adamax', 'Nadam', 'Ftrl'.

optimizer_kwargs: (dict)

Additional keyword arguments to pass to the tf.keras.optimizers.Optimizer constructor.

fit_kwargs: (dict)

Additional keyword arguments to pass to tf.keras.Model.fit().

Returns
history: (tf.keras.callbacks.History)

Standard TensorFlow record of the training session.
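
A usage sketch; model, x_train, and y_train are assumed to exist, and the hyperparameter values are illustrative:

    # Training data must be set before fitting
    model.set_data(x=x_train, y=y_train, validation_frac=0.2)

    # Fit with early stopping; returns a standard tf.keras History object
    history = model.fit(epochs=200,
                        learning_rate=1e-3,
                        batch_size=100,
                        early_stopping=True,
                        early_stopping_patience=30)

    # Per-epoch training loss is recorded in history.history['loss']
    print(min(history.history['loss']))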

get_nn()

Return the underlying TensorFlow neural network.

Parameters
None
Returns
nn: (tf.keras.Model)

The backend TensorFlow model.

get_theta(gauge='empirical', p_lc=None, x_wt=None, unobserved_value=nan)

Return parameters of the G-P map.

This function returns a dict containing the parameters of the model's G-P map. Keys are of type str, values are of type np.ndarray. Relevant (key, value) pairs are: 'theta_0', constant term; 'theta_lc', additive effects in the form of a 2D array with shape (L,C); 'theta_lclc', pairwise effects in the form of a 4D array of shape (L,C,L,C); 'theta_bb', all parameters for gpmap_type='blackbox' models.

Importantly, this function gauge-fixes model parameters before returning them, i.e., it pins down non-identifiable degrees of freedom. Gauge fixing is performed using a hierarchical gauge, which maximizes the fraction of variance in phi explained by the lowest-order terms. Computing such variances requires assuming a probability distribution over sequence space, however, and using different distributions will result in different ways of fixing the gauge.

This function assumes that the distribution used to define the gauge factorizes across sequence positions, and can thus be represented by an L x C probability matrix p_lc that lists the probability of each character c at each position l.

An important special case is the wild-type gauge, in which p_lc is the one-hot encoding of a “wild-type” specific sequence x_wt. In this case, the constant parameter theta_0 is the value of phi for x_wt, additive parameters theta_lc represent the effect of single-point mutations away from x_wt, and so on.

Parameters
gauge: (str)

String specification of which gauge to use. Allowed values are: 'uniform', hierarchical gauge using a uniform sequence distribution over the characters at each position observed in the training set (unobserved characters are assigned probability 0); 'empirical', hierarchical gauge using an empirical distribution computed from the training data; 'consensus', wild-type gauge using the training data consensus sequence; 'user', gauge using either p_lc or x_wt supplied by the user; 'none', no gauge fixing.

p_lc: (None, array)

Custom probability matrix to use for hierarchical gauge fixing. Must be an np.ndarray of shape (L,C). If using this, also set gauge='user'.

x_wt: (str, None)

Custom wild-type sequence to use for wild-type gauge fixing. Must be a str of length L. If using this, also set gauge='user'.

unobserved_value: (float, None)

Value to use for parameters when no corresponding sequences were present in the training data. If None, these parameters will be left alone. Using np.nan can help when visualizing models using mavenn.heatmap() or mavenn.heatmap_pairwise().

Returns
theta: (dict)

Model parameters provided as a dict of numpy arrays.
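
A usage sketch contrasting the default hierarchical gauge with a user-specified wild-type gauge; model and the wild-type sequence wt_seq are assumed to exist:

    import numpy as np

    # Default: empirical hierarchical gauge
    theta_emp = model.get_theta(gauge='empirical')

    # Wild-type gauge relative to wt_seq; unobserved parameters set to NaN
    # so they appear blank in mavenn.heatmap()
    theta_wt = model.get_theta(gauge='user', x_wt=wt_seq,
                               unobserved_value=np.nan)
    print(theta_wt['theta_0'])         # phi of the wild-type sequence
    print(theta_wt['theta_lc'].shape)  # (L,C) single-mutation effects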

p_of_y_given_phi(y, phi, paired=False)

Compute probabilities p( y | phi ).

Parameters
y: (np.ndarray)

Measurement values. For GE models, must be an array of floats. For MPA models, must be an array of ints representing bin numbers.

phi: (np.ndarray)

Latent phenotype values, provided as an array of floats.

paired: (bool)

Whether values in y and phi should be treated as paired. If True, the probability of each value in y will be computed using the single paired value in phi. If False, the probability of each value in y will be computed against all values in phi.

Returns
p: (np.ndarray)

Probability of y given phi. If paired=True, p.shape will be equal to both y.shape and phi.shape. If paired=False, p.shape will be given by y.shape + phi.shape.
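
A sketch illustrating the shape conventions; model is assumed to exist, and the grids below are illustrative:

    import numpy as np

    y_grid = np.linspace(-2, 2, 50)
    phi_grid = np.linspace(-3, 3, 100)

    # paired=False: each value of y is evaluated against every value of phi
    p = model.p_of_y_given_phi(y=y_grid, phi=phi_grid, paired=False)
    print(p.shape)  # (50, 100) == y_grid.shape + phi_grid.shape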

p_of_y_given_x(y, x, paired=True)

Compute probabilities p( y | x ).

Parameters
y: (np.ndarray)

Measurement values. For GE models, must be an array of floats. For MPA models, must be an array of ints representing bin numbers.

x: (np.ndarray)

Sequences, provided as an array of strings, each of length L.

paired: (bool)

Whether values in y and x should be treated as paired. If True, the probability of each value in y will be computed using the single paired value in x. If False, the probability of each value in y will be computed against all values in x.

Returns
p: (np.ndarray)

Probability of y given x. If paired=True, p.shape will be equal to both y.shape and x.shape. If paired=False, p.shape will be given by y.shape + x.shape.

p_of_y_given_yhat(y, yhat, paired=False)

Compute probabilities p( y | yhat); GE models only.

Parameters
y: (np.ndarray)

Measurement values, provided as an array of floats.

yhat: (np.ndarray)

Observable values, provided as an array of floats.

paired: (bool)

Whether values in y and yhat should be treated as paired. If True, the probability of each value in y will be computed using the single paired value in yhat. If False, the probability of each value in y will be computed against all values in yhat.

Returns
p: (np.ndarray)

Probability of y given yhat. If paired=True, p.shape will be equal to both y.shape and yhat.shape. If paired=False, p.shape will be given by y.shape + yhat.shape.

phi_to_yhat(phi)

Compute yhat given phi; GE models only.

Parameters
phi: (array-like)

Latent phenotype values, provided as an np.ndarray of floats.

Returns
y_hat: (array-like)

Observable values in an np.ndarray the same shape as phi.

save(filename, verbose=True)

Save model.

Saved models are represented by two files having the same root and two different extensions, .pickle and .h5. The .pickle file contains model metadata, including all information needed to reconstruct the model’s architecture. The .h5 file contains the values of the trained neural network weights. Note that training data is not saved.

Parameters
filename: (str)

File directory and root. Do not include extensions.

verbose: (bool)

Whether to print feedback.

Returns
None
set_data(x, y, dy=None, ct=None, validation_frac=0.2, validation_flags=None, shuffle=True, knn_fuzz=0.01, verbose=True)

Set training data.

Prepares data for use during training, e.g. by shuffling and one-hot encoding training data sequences. Must be called before Model.fit().

Parameters
x: (np.ndarray)

1D array of N sequences, each of length L.

y: (np.ndarray)

Array of measurements. For GE models, y must be a 1D array of N floats. For MPA models, y must be either a 1D or 2D array of nonnegative ints. If 1D, y must be of length N, and will be interpreted as listing bin numbers, i.e., 0, 1, ..., Y-1. If 2D, y must be of shape (N,Y), and will be interpreted as listing the observed counts for each of the N sequences in each of the Y bins.

dy: (np.ndarray)

User-supplied error bars associated with continuous measurements, to be used as sigma in the Gaussian noise model.

ct: (np.ndarray, None)

Only used for MPA models when y is 1D. In this case, ct must be a 1D array, length N, of nonnegative integers, and represents the number of observations of each sequence in each bin. Use ct=None for GE models, as well as for MPA models when y is 2D.

validation_frac: (float)

Fraction of observations to use for the validation set. Is overridden when setting validation_flags. Must be in the range [0,1].

validation_flags: (np.ndarray, None)

1D array of N boolean values, with True indicating which observations should be reserved for the validation set. If None, the training and validation sets will be randomly assigned based on the value of validation_frac.

shuffle: (bool)

Whether to shuffle the observations, e.g., to ensure similar composition of the training and validation sets when validation_flags is not set.

knn_fuzz: (float>0)

Amount of noise to add to y values before passing them to the KNN estimator (for computing I_var during training). Specifically, Gaussian noise with standard deviation knn_fuzz * np.std(y) is added to y values. This is needed to mitigate errors caused by multiple observations of the same sequence. Only used for GE regression.

verbose: (bool)

Whether to provide printed feedback.

Returns
None
simulate_dataset(template_df)

Generate a simulated dataset.

Parameters
template_df: (pd.DataFrame)

Dataset off of which to base the simulated dataset. Specifically, the simulated dataset will have the same sequences and the same train/validation/test flags, but different values for 'y' (in the case of a GE regression model) or 'ct_#' (in the case of an MPA regression model).

Returns
simulated_df: (pd.DataFrame)

Simulated dataset in the form of a dataframe. Columns include 'set', 'phi', and 'x'. For GE models, additional columns 'yhat' and 'y' are added. For MPA models, multiple columns of the form 'ct_#' are added.
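
A usage sketch; model and data_df, the dataset used to train it, are assumed to exist:

    # Simulate a dataset with the same sequences and set assignments as data_df,
    # but with measurement values drawn from the trained model
    simulated_df = model.simulate_dataset(template_df=data_df)
    print(simulated_df.head())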

x_to_phi(x)

Compute phi given x.

Parameters
x: (np.ndarray)

Sequences, provided as an np.ndarray of strings, each of length L.

Returns
phi: (array-like of float)

Latent phenotype values, provided as floats within an np.ndarray the same shape as x.

x_to_yhat(x)

Compute yhat given x.

Parameters
x: (np.ndarray)

Sequences, provided as an np.ndarray of strings, each of length L.

Returns
yhat: (np.ndarray)

Observation values, provided as floats within an np.ndarray the same shape as x.

yhat_to_yq(yhat, q=[0.16, 0.84], paired=False)

Compute quantiles of p( y | yhat); GE models only.

Parameters
yhat: (np.ndarray)

Observable values, provided as an array of floats.

q: (np.ndarray)

Quantile specifications, provided as an array of floats in the range [0,1].

paired: (bool)

Whether values in yhat and q should be treated as paired. If True, quantiles will be computed using each value in yhat paired with the corresponding value in q. If False, the quantile for each value in yhat will be computed for every value in q.

Returns
yq: (array of floats)

Quantiles of p( y | yhat ). If paired=True, yq.shape will be equal to both yhat.shape and q.shape. If paired=False, yq.shape will be given by yhat.shape + q.shape.
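
A usage sketch computing a central 68% prediction interval for held-out sequences; model and x_test are assumed to exist:

    import numpy as np

    yhat = model.x_to_yhat(x_test)
    yq = model.yhat_to_yq(yhat=yhat, q=np.array([0.16, 0.84]))

    # With paired=False (the default), yq.shape == yhat.shape + q.shape
    lower, upper = yq[:, 0], yq[:, 1]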