API Reference
Tests
A suite of automated tests is provided to ensure proper software installation and execution.
- mavenn.run_tests(verbose=True)
Run the MAVE-NN test suite using pytest.
This function runs all tests in the mavenn/tests directory using pytest. It will print test results to stdout and return True if all tests pass, False otherwise.
- Parameters:
- verbose: (bool, optional)
If True, prints detailed test output. If False, prints minimal output. Default is True.
- Returns:
- bool
True if all tests pass, False otherwise
Examples
A variety of real-world datasets, pre-trained models, analysis demos, and tutorials can be accessed using the following functions.
- mavenn.load_example_dataset(name=None)
Load example dataset provided with MAVE-NN.
- Parameters:
- name: (str)
Name of example dataset. If None, a list of valid dataset names will be printed.
- Returns:
- data_df: (pd.DataFrame)
Dataframe containing the example dataset.
- mavenn.load_example_model(name=None)
Load an example model already inferred by MAVE-NN.
- Parameters:
- name: (str, None)
Name of model to load. If None, a list of valid model names will be printed.
- Returns:
- model: (mavenn.Model)
A pre-trained Model object.
- mavenn.run_demo(name=None, print_code=False, print_names=True)
Perform demonstration of MAVE-NN.
- Parameters:
- name: (str, None)
Name of demo to run. If None, a list of valid demo names will be returned.
- print_code: (bool)
If True, the text of the demo file will be printed along with the output from running this file. If False, only the demo output will be shown.
- print_names: (bool)
If True and name=None, the names of all demos will be printed.
- Returns:
- demo_names: (list, None)
List of demo names, returned if user passes name=None. Otherwise None.
- mavenn.list_tutorials()
Reveal local directory where MAVE-NN tutorials are stored, as well as the names of available tutorial notebook files.
Load
MAVE-NN allows users to save and load trained models.
- mavenn.load(filename, verbose=True)
Load a previously saved model.
Saved models are represented by two files having the same root and two different extensions, .pickle and .h5. The .pickle file contains model metadata, including all information needed to reconstruct the model’s architecture. The .h5 file contains the values of the trained neural network weights.
- Parameters:
- filename: (str)
File directory and root. Do not include extensions.
- verbose: (bool)
Whether to print feedback.
- Returns:
- loaded_model: (mavenn.Model)
MAVE-NN model object.
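The filename convention can be illustrated without loading anything. The sketch below only derives the two paths that a given root implies; the helper name saved_model_paths and the example root are hypothetical and are not part of MAVE-NN.

```python
from pathlib import Path

def saved_model_paths(filename):
    """Return the (.pickle, .h5) paths implied by a root given without extension."""
    root = str(Path(filename))
    # Append the extensions instead of using with_suffix(), so that a root
    # that itself contains a dot (e.g. 'run.v2') is not truncated.
    return Path(root + ".pickle"), Path(root + ".h5")

pickle_path, h5_path = saved_model_paths("results/my_model")
print(pickle_path.name, h5_path.name)
```

With such a root, mavenn.load('results/my_model') would expect both files to be present in the results directory.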
Visualization
MAVE-NN provides the following two methods to facilitate the visualization of inferred genotype-phenotype maps.
- mavenn.heatmap(values=None, alphabet=None, df=None, seq=None, seq_kwargs=None, ax=None, show_spines=False, cbar=True, cax=None, clim=None, clim_quantile=1, ccenter=None, cmap='coolwarm', cmap_size='5%', cmap_pad=0.1)
Draw a heatmap illustrating an L x C matrix of values, where L is sequence length and C is the alphabet size.
- Parameters:
- values: (np.ndarray, None)
Array of shape (L,C) that contains values to plot. Cannot be provided if df is provided.
- alphabet: (str, np.ndarray, None)
Alphabet name 'dna', 'rna', or 'protein', or 1D array containing characters in the alphabet. Cannot be provided if df is provided.
- df: (pd.DataFrame)
DataFrame of shape (L,C) that contains the values (df.values) and alphabet (df.columns) to use. If specified, values and alphabet cannot be provided.
- seq: (str, None)
The sequence to show, if any, using dots plotted on top of the heatmap. Must have length L and be composed of characters in alphabet.
- seq_kwargs: (dict)
Arguments to pass to Axes.scatter() when drawing dots to illustrate the characters in seq.
- ax: (matplotlib.axes.Axes)
The Axes object on which the heatmap will be drawn. If None, one will be created. If specified, cbar=True, and cax=None, ax will be split in two to make room for a colorbar.
- show_spines: (bool)
Whether to show spines around the edges of the heatmap.
- cbar: (bool)
Whether to draw a colorbar next to the heatmap.
- cax: (matplotlib.axes.Axes, None)
The Axes object on which the colorbar will be drawn, if requested. If None, one will be created by splitting ax in two according to cmap_size and cmap_pad.
- clim: (list, None)
List of the form [cmin, cmax], specifying the minimum cmin and maximum cmax values spanned by the colormap. Overrides clim_quantile.
- clim_quantile: (float)
Must be a float in the range [0,1]. clim will be automatically chosen to include this central quantile of values.
- ccenter: (float)
Value at which to position the center of a diverging colormap. Setting ccenter=0 often makes sense.
- cmap: (str, matplotlib.colors.Colormap)
Colormap to use.
- cmap_size: (str)
Fraction of ax width to be used for the colorbar. For formatting requirements, see the documentation for mpl_toolkits.axes_grid1.make_axes_locatable().
- cmap_pad: (float)
Space between colorbar and the shrunken heatmap Axes. For formatting requirements, see the documentation for mpl_toolkits.axes_grid1.make_axes_locatable().
- Returns:
- ax: (matplotlib.axes.Axes)
Axes object containing the heatmap.
- cb: (matplotlib.colorbar.Colorbar, None)
Colorbar object linked to ax, or None if no colorbar was drawn.
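One plausible reading of clim_quantile is sketched below in NumPy: the color limits are taken as the symmetric quantiles that enclose the central fraction of the finite values. This is an assumption about the behavior rather than MAVE-NN's code, and central_clim is a hypothetical helper.

```python
import numpy as np

def central_clim(values, clim_quantile=1.0):
    """Color limits enclosing the central `clim_quantile` fraction of values."""
    if not 0.0 <= clim_quantile <= 1.0:
        raise ValueError("clim_quantile must be in [0, 1]")
    finite = values[np.isfinite(values)]      # ignore NaN and inf entries
    lo_q = 0.5 * (1.0 - clim_quantile)        # probability mass cut from the lower tail
    hi_q = 1.0 - lo_q                         # upper quantile, symmetric to lo_q
    cmin, cmax = np.quantile(finite, [lo_q, hi_q])
    return [float(cmin), float(cmax)]

rng = np.random.default_rng(0)
values = rng.normal(size=(10, 4))             # an L x C matrix with L=10, C=4
full = central_clim(values, 1.0)              # spans the minimum to the maximum
trimmed = central_clim(values, 0.9)           # drops 5% of the mass from each tail
```

Limits computed this way could be passed explicitly through the clim argument, which overrides clim_quantile.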
- mavenn.heatmap_pairwise(values, alphabet, seq=None, seq_kwargs=None, ax=None, gpmap_type='pairwise', show_position=False, position_size=None, position_pad=1, show_alphabet=True, alphabet_size=None, alphabet_pad=1, show_seplines=True, sepline_kwargs=None, xlim_pad=0.1, ylim_pad=0.1, cbar=True, cax=None, clim=None, clim_quantile=1, ccenter=0, cmap='coolwarm', cmap_size='5%', cmap_pad=0.1)
Draw a heatmap illustrating pairwise or neighbor values, e.g. representing model parameters, mutational effects, etc.
Note: The resulting plot has aspect ratio of 1 and is scaled so that pixels have half-diagonal lengths given by half_pixel_diag = 1/(C*2), and blocks of characters have half-diagonal lengths given by half_block_diag = 1/2. This is done so that the horizontal distance between positions (as indicated by x-ticks) is 1.
- Parameters:
- values: (np.array)
An array, shape (L,C,L,C), containing pairwise or neighbor values. Note that only values at coordinates [l1, c1, l2, c2] with l2 > l1 will be plotted. NaN values will not be plotted.
- alphabet: (str, np.ndarray)
Alphabet name 'dna', 'rna', or 'protein', or 1D array containing characters in the alphabet.
- seq: (str, None)
The sequence to show, if any, using dots plotted on top of the heatmap. Must have length L and be composed of characters in alphabet.
- seq_kwargs: (dict)
Arguments to pass to Axes.scatter() when drawing dots to illustrate the characters in seq.
- ax: (matplotlib.axes.Axes)
The Axes object on which the heatmap will be drawn. If None, one will be created. If specified, cbar=True, and cax=None, ax will be split in two to make room for a colorbar.
- gpmap_type: (str)
Determines how many pairwise parameters are plotted. Must be 'pairwise' or 'neighbor'. If 'pairwise', a triangular heatmap will be plotted. If 'neighbor', a heatmap resembling a string of diamonds will be plotted.
- show_position: (bool)
Whether to annotate the heatmap with position labels.
- position_size: (float)
Font size to use for position labels. Must be >= 0.
- position_pad: (float)
Additional padding, in units of half_pixel_diag, used to space the position labels further from the heatmap.
- show_alphabet: (bool)
Whether to annotate the heatmap with character labels.
- alphabet_size: (float)
Font size to use for alphabet. Must be >= 0.
- alphabet_pad: (float)
Additional padding, in units of half_pixel_diag, used to space the alphabet labels from the heatmap.
- show_seplines: (bool)
Whether to draw lines separating character blocks for different position pairs.
- sepline_kwargs: (dict)
Keywords to pass to Axes.plot() when drawing seplines.
- xlim_pad: (float)
Additional padding to add (in absolute units) both left and right of the heatmap.
- ylim_pad: (float)
Additional padding to add (in absolute units) both above and below the heatmap.
- cbar: (bool)
Whether to draw a colorbar next to the heatmap.
- cax: (matplotlib.axes.Axes, None)
The Axes object on which the colorbar will be drawn, if requested. If None, one will be created by splitting ax in two according to cmap_size and cmap_pad.
- clim: (list, None)
List of the form [cmin, cmax], specifying the minimum cmin and maximum cmax values spanned by the colormap. Overrides clim_quantile.
- clim_quantile: (float)
Must be a float in the range [0,1]. clim will be automatically chosen to include this central quantile of values.
- ccenter: (float)
Value at which to position the center of a diverging colormap. Setting ccenter=0 often makes sense.
- cmap: (str, matplotlib.colors.Colormap)
Colormap to use.
- cmap_size: (str)
Fraction of ax width to be used for the colorbar. For formatting requirements, see the documentation for mpl_toolkits.axes_grid1.make_axes_locatable().
- cmap_pad: (float)
Space between colorbar and the shrunken heatmap Axes. For formatting requirements, see the documentation for mpl_toolkits.axes_grid1.make_axes_locatable().
- Returns:
- ax: (matplotlib.axes.Axes)
Axes object containing the heatmap.
- cb: (matplotlib.colorbar.Colorbar, None)
Colorbar object linked to ax, or None if no colorbar was drawn.
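The geometry in the note and the l2 > l1 rule can be checked with a few lines of NumPy. The sketch computes the stated half-diagonals and counts how many entries of an (L,C,L,C) array would be drawn for each gpmap_type. It assumes that 'neighbor' keeps only pairs with l2 = l1 + 1, and plotted_mask is a hypothetical helper.

```python
import numpy as np

def plotted_mask(L, C, gpmap_type="pairwise"):
    """Boolean (L,C,L,C) mask of the entries [l1,c1,l2,c2] that would be shown."""
    l1 = np.arange(L)[:, None, None, None]
    l2 = np.arange(L)[None, None, :, None]
    if gpmap_type == "pairwise":
        keep = l2 > l1                 # every ordered pair of distinct positions
    elif gpmap_type == "neighbor":
        keep = l2 == l1 + 1            # adjacent positions only (assumed)
    else:
        raise ValueError("gpmap_type must be 'pairwise' or 'neighbor'")
    return np.broadcast_to(keep, (L, C, L, C))

L, C = 5, 4
half_pixel_diag = 1 / (C * 2)          # 0.125 for a 4-letter alphabet
half_block_diag = 1 / 2                # so adjacent positions sit 1 unit apart

n_pairwise = int(plotted_mask(L, C, "pairwise").sum())   # L*(L-1)/2 * C**2
n_neighbor = int(plotted_mask(L, C, "neighbor").sum())   # (L-1) * C**2
```

Entries outside the mask could be set to NaN before plotting, since NaN values are not drawn.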
Models
The mavenn.Model class represents all neural-network-based models inferred by MAVE-NN. Its methods make it easy to:
- define models,
- fit models to data,
- access model parameters and metadata,
- save models, and
- evaluate models on new data.
In particular, these methods allow users to train and analyze models without prior knowledge of TensorFlow 2, the deep learning framework used by MAVE-NN as a backend.
- class mavenn.Model(L, alphabet, regression_type, gpmap_type='additive', gpmap_kwargs={}, Y=2, ge_nonlinearity_type='nonlinear', ge_nonlinearity_monotonic=True, ge_nonlinearity_hidden_nodes=50, ge_noise_model_type='Gaussian', ge_heteroskedasticity_order=0, normalize_phi=True, mpa_hidden_nodes=50, theta_regularization=0.01, eta_regularization=0.001, ohe_batch_size=50000, custom_gpmap=None, initial_weights=None)
Represents a MAVE-NN model, which includes a genotype-phenotype (G-P) map as well as a measurement process. For global epistasis (GE) regression, set regression_type='GE'; for measurement process agnostic (MPA) regression, set regression_type='MPA'.
- Parameters:
- L: (int)
Length of each training sequence. Must be >= 1.
- alphabet: (str, np.ndarray)
Either the alphabet name ('dna', 'rna', or 'protein') or a 1D array of characters to be used as the alphabet.
- regression_type: (str)
Type of regression implemented by the model. Choices are 'GE' (for a global epistasis model) and 'MPA' (for a measurement process agnostic model).
- gpmap_type: (str)
Type of G-P map to infer. Choices are 'additive', 'neighbor', 'pairwise', and 'blackbox'.
- gpmap_kwargs: (dict)
Additional keyword arguments used for specifying the G-P map.
- Y: (int)
The number of discrete y bins to use when defining an MPA model. Must be >= 2. Has no effect on GE models.
- ge_nonlinearity_type: (str)
Specifies the form of the GE nonlinearity. Options: 'linear', an affine transformation from phi to yhat; 'nonlinear', an arbitrary nonlinear map from phi to yhat.
- ge_nonlinearity_monotonic: (bool)
Whether to enforce a monotonicity constraint on the GE nonlinearity. Has no effect on MPA models.
- ge_nonlinearity_hidden_nodes: (int)
Number of hidden nodes (i.e. sigmoidal contributions) to use when defining the nonlinearity component of a GE model. Has no effect on MPA models.
- ge_noise_model_type: (str)
Noise model to use when defining a GE model. Choices are 'Gaussian', 'Cauchy', 'SkewedT', or 'Empirical'. Has no effect on MPA models.
- ge_heteroskedasticity_order: (int)
In the GE model context, this represents the order of the polynomial(s) used to define noise model parameters as functions of yhat. The larger this is, the more heteroskedastic an inferred noise model is likely to be. Set to 0 to enforce a homoskedastic noise model. Has no effect on MPA models. Must be >= 0.
- normalize_phi: (bool)
Whether to fix diffeomorphic modes after model training.
- mpa_hidden_nodes: (int)
Number of hidden nodes (i.e. sigmoidal contributions) to use when defining the MPA measurement process. Must be >= 1.
- theta_regularization: (float)
L2 regularization strength for G-P map parameters theta. Must be >= 0; use 0 for no regularization.
- eta_regularization: (float)
L2 regularization strength for measurement process parameters eta. Must be >= 0; use 0 for no regularization.
- ohe_batch_size: (int)
DISABLED. How many sequences to one-hot encode at a time when calling Model.set_data(). Typically, the larger this number is the quicker the encoding will happen. A number too large, however, may cause the computer’s memory to run out. Must be >= 1.
- custom_gpmap: (GPMapLayer sub-class)
A custom G-P map provided by the user. Must be a subclass of GPMapLayer, which defines the functionality for x_to_phi_layer.
- initial_weights: (np.array)
NumPy array of weights to use as the initial weights of the model. Ignored if None.
- __init__(L, alphabet, regression_type, gpmap_type='additive', gpmap_kwargs={}, Y=2, ge_nonlinearity_type='nonlinear', ge_nonlinearity_monotonic=True, ge_nonlinearity_hidden_nodes=50, ge_noise_model_type='Gaussian', ge_heteroskedasticity_order=0, normalize_phi=True, mpa_hidden_nodes=50, theta_regularization=0.01, eta_regularization=0.001, ohe_batch_size=50000, custom_gpmap=None, initial_weights=None)
Model() class constructor.
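A typical workflow chains the constructor with the set_data() and fit() methods documented in this reference. The function below is a sketch only: it uses the signatures listed here, while the name train_additive_ge and the choice of a DNA alphabet are illustrative assumptions. It is defined but not executed, because running it requires MAVE-NN, TensorFlow, and real data.

```python
def train_additive_ge(x, y, L, epochs=50):
    """Fit an additive GE model to sequences x and measurements y (sketch)."""
    import mavenn  # imported lazily so this snippet loads without MAVE-NN

    model = mavenn.Model(
        L=L,
        alphabet="dna",            # assumed alphabet for this illustration
        regression_type="GE",
        gpmap_type="additive",
    )
    # Training data must be set before fit() is called.
    model.set_data(x=x, y=y, validation_frac=0.2)
    model.fit(epochs=epochs, learning_rate=0.005, early_stopping=True)
    return model
```

The returned object could then be queried with methods such as x_to_phi() or get_theta(), or written to disk with save().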
- set_data(x, y, dy=None, ct=None, validation_frac=0.2, validation_flags=None, shuffle=True, knn_fuzz=0.01, verbose=True)
Set training data.
Prepares data for use during training, e.g. by shuffling and one-hot encoding training data sequences. Must be called before Model.fit().
- Parameters:
- x: (np.ndarray)
1D array of N sequences, each of length L.
- y: (np.ndarray)
Array of measurements. For GE models, y must be a 1D array of N floats. For MPA models, y must be either a 1D or 2D array of nonnegative ints. If 1D, y must be of length N, and will be interpreted as listing bin numbers, i.e. 0, 1, …, Y-1. If 2D, y must be of shape (N,Y), and will be interpreted as listing the observed counts for each of the N sequences in each of the Y bins.
- dy: (np.ndarray)
User-supplied error bars associated with continuous measurements, to be used as sigma in the Gaussian noise model.
- ct: (np.ndarray, None)
Only used for MPA models when y is 1D. In this case, ct must be a 1D array, length N, of nonnegative integers, and represents the number of observations of each sequence in each bin. Use ct=None for GE models, as well as for MPA models when y is 2D.
- validation_frac: (float)
Fraction of observations to use for the validation set. Is overridden when setting validation_flags. Must be in the range [0,1].
- validation_flags: (np.ndarray, None)
1D array of N boolean values, with True indicating which observations should be reserved for the validation set. If None, the training and validation sets will be randomly assigned based on the value of validation_frac.
- shuffle: (bool)
Whether to shuffle the observations, e.g., to ensure similar composition of the training and validation sets when validation_flags is not set.
- knn_fuzz: (float>0)
Amount of noise to add to y values before passing them to the KNN estimator (for computing I_var during training). Specifically, Gaussian noise with standard deviation knn_fuzz * np.std(y) is added to y values. This is needed to mitigate errors caused by multiple observations of the same sequence. Only used for GE regression.
- verbose: (bool)
Whether to provide printed feedback.
- Returns:
- None
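To control exactly which observations are held out, validation_flags can be built by hand. The sketch below draws random flags that reserve a fraction validation_frac of N observations. It illustrates the documented input format only; make_validation_flags is a hypothetical helper, and MAVE-NN's own random assignment may differ.

```python
import numpy as np

def make_validation_flags(N, validation_frac=0.2, seed=0):
    """Return a 1D boolean array of length N; True marks validation rows."""
    if not 0.0 <= validation_frac <= 1.0:
        raise ValueError("validation_frac must be in [0, 1]")
    rng = np.random.default_rng(seed)
    n_val = int(round(validation_frac * N))
    flags = np.zeros(N, dtype=bool)
    # Pick n_val distinct row indices without replacement.
    flags[rng.choice(N, size=n_val, replace=False)] = True
    return flags

flags = make_validation_flags(N=1000, validation_frac=0.2)
```

The result could then be passed as model.set_data(x, y, validation_flags=flags), in which case validation_frac is overridden.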
- fit(epochs=50, learning_rate=0.005, validation_split=0.2, verbose=True, early_stopping=True, early_stopping_patience=20, restore_best_weights=True, batch_size=50, linear_initialization=True, freeze_theta=False, callbacks=None, try_tqdm=True, optimizer='Adam', optimizer_kwargs={}, fit_kwargs={})
Infer values for model parameters.
Uses training algorithms from TensorFlow to learn model parameters. Before this is run, the training data must be set using Model.set_data().
- Parameters:
- epochs: (int)
Maximum number of epochs to complete during model training. Must be >= 0.
- learning_rate: (float)
Learning rate. Must be > 0.
- validation_split: (float in [0,1])
Fraction of training data to reserve for validation.
- verbose: (bool)
Whether to show progress during training.
- early_stopping: (bool)
Whether to use early stopping.
- early_stopping_patience: (int)
Number of epochs to wait, after a minimum value of validation loss is observed, before terminating the model training process.
- restore_best_weights: (bool)
Whether to restore model weights from the epoch with the best value of the monitored quantity. If False, the model weights obtained at the last step of training are used. An epoch will be restored regardless of the performance relative to the baseline. If no epoch improves on baseline, training will run for patience epochs and restore weights from the best epoch in that set.
- batch_size: (None, int)
Batch size to use for stochastic gradient descent and related algorithms. If None, a full-sized batch is used. Note that the negative log likelihood loss function used by MAVE-NN is extrinsic in batch_size.
- linear_initialization: (bool)
Whether to initialize the G-P map parameters theta from the results of a linear regression computation. Has no effect when gpmap_type='blackbox'.
- freeze_theta: (bool)
Whether to set the weights of the G-P map layer to be non-trainable. Note that setting linear_initialization=True and freeze_theta=True will set theta to be initialized at the linear regression solution and then become frozen during training.
- callbacks: (list, None)
Optional list of tf.keras.callbacks.Callback objects to use during training.
- try_tqdm: (bool)
If True, mavenn will attempt to load the package tqdm and append TqdmCallback(verbose=0) to the callbacks list in order to improve the visual display of training progress. If users do not have tqdm installed, this will do nothing.
- optimizer: (str)
Optimizer to use for training. Valid options include: 'SGD', 'RMSprop', 'Adam', 'Adadelta', 'Adagrad', 'Adamax', 'Nadam', 'Ftrl'.
- optimizer_kwargs: (dict)
Additional keyword arguments to pass to the tf.keras.optimizers.Optimizer constructor.
- fit_kwargs: (dict)
Additional keyword arguments to pass to tf.keras.Model.fit().
- Returns:
- history: (tf.keras.callbacks.History)
Standard TensorFlow record of the training session.
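The interaction between early_stopping_patience and restore_best_weights can be seen in a framework-free simulation. The sketch replays a list of validation losses and reports the epoch at which training would stop and the epoch whose weights would be kept. It mimics a common patience rule as an assumption, does not call TensorFlow, and always keeps the best epoch when restore_best_weights=True, which may differ in detail from a given Keras version.

```python
def simulate_early_stopping(val_losses, patience=20, restore_best_weights=True):
    """Return (stop_epoch, kept_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    stop_epoch = len(val_losses) - 1          # default: every epoch was run
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:                  # new minimum: reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:              # `patience` epochs without improvement
                stop_epoch = epoch
                break
    kept_epoch = best_epoch if restore_best_weights else stop_epoch
    return stop_epoch, kept_epoch

losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
restored = simulate_early_stopping(losses, patience=3)
last = simulate_early_stopping(losses, patience=3, restore_best_weights=False)
```

In this toy run the minimum occurs at epoch 2 and training halts three epochs later, so the two settings keep weights from different epochs.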
- phi_to_yhat(phi)
Compute yhat given phi; GE models only.
- Parameters:
- phi: (array-like)
Latent phenotype values, provided as an np.ndarray of floats.
- Returns:
- y_hat: (array-like)
Observable values in an np.ndarray the same shape as phi.
- handle_errors()
Handle anticipated errors in a function.
This decorator allows the user to pass the keyword argument ‘should_fail’ to any wrapped function.
If should_fail is None (or is not set by user), the function executes normally, and can be called as:
result = func(*args, **kwargs)
In particular, Python execution will halt if any errors are raised.
However, if the user specifies should_fail=True or should_fail=False, then Python will not halt even in the presence of an error. Moreover, the function will return a tuple, e.g.:
result, mistake = func(*args, should_fail=True, **kwargs)
with mistake flagging whether or not the function failed or succeeded as expected.
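The behavior described above can be reproduced with a small decorator. This is a sketch of the documented contract, not MAVE-NN's source: the name handle_errors_sketch, the use of a generic Exception, and the exact meaning of mistake (True when the outcome differs from should_fail) are assumptions.

```python
import functools

def handle_errors_sketch(func):
    """Wrap func so that it accepts an optional should_fail keyword."""
    @functools.wraps(func)
    def wrapper(*args, should_fail=None, **kwargs):
        if should_fail is None:
            return func(*args, **kwargs)      # normal call: errors propagate
        try:
            result = func(*args, **kwargs)
            failed = False
        except Exception:
            result = None
            failed = True
        mistake = failed != should_fail       # True if the outcome was unexpected
        return result, mistake
    return wrapper

@handle_errors_sketch
def reciprocal(x):
    return 1 / x

ok = reciprocal(4, should_fail=False)         # succeeds, as expected
bad = reciprocal(0, should_fail=True)         # fails, as expected
surprise = reciprocal(0, should_fail=False)   # fails unexpectedly
```

A pattern like this lets a test suite exercise both valid and invalid inputs without halting on the first anticipated error.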
- check(condition, message)
Check whether a condition is satisfied; if not, throw MavennError.
- Parameters:
- condition: (bool)
A condition that, if false, halts mavenn execution and raises a clean error to user
- message: (str)
The string to show user if condition is False.
- Returns:
- None
- get_theta(gauge='empirical', p_lc=None, x_wt=None, unobserved_value=nan)
Return parameters of the G-P map.
This function returns a dict containing the parameters of the model’s G-P map. Keys are of type str, values are of type np.ndarray. Relevant (key, value) pairs are: 'theta_0', constant term; 'theta_lc', additive effects in the form of a 2D array with shape (L,C); 'theta_lclc', pairwise effects in the form of a 4D array of shape (L,C,L,C); 'theta_bb', all parameters for gpmap_type='blackbox' models.
Importantly, this function gauge-fixes model parameters before returning them, i.e., it pins down non-identifiable degrees of freedom. Gauge fixing is performed using a hierarchical gauge, which maximizes the fraction of variance in phi explained by the lowest-order terms. Computing such variances requires assuming a probability distribution over sequence space, however, and using different distributions will result in different ways of fixing the gauge.
This function assumes that the distribution used to define the gauge factorizes across sequence positions, and can thus be represented by an L x C probability matrix p_lc that lists the probability of each character c at each position l.
An important special case is the wild-type gauge, in which p_lc is the one-hot encoding of a specific “wild-type” sequence x_wt. In this case, the constant parameter theta_0 is the value of phi for x_wt, additive parameters theta_lc represent the effect of single-point mutations away from x_wt, and so on.
- Parameters:
- gauge: (str)
String specification of which gauge to use. Allowed values are: 'uniform', hierarchical gauge using a uniform sequence distribution over the characters at each position observed in the training set (unobserved characters are assigned probability 0); 'empirical', hierarchical gauge using an empirical distribution computed from the training data; 'consensus', wild-type gauge using the training data consensus sequence; 'user', gauge using either p_lc or x_wt supplied by the user; 'none', no gauge fixing.
- p_lc: (None, array)
Custom probability matrix to use for hierarchical gauge fixing. Must be a np.ndarray of shape (L,C). If using this, also set gauge='user'.
- x_wt: (str, None)
Custom wild-type sequence to use for wild-type gauge fixing. Must be a str of length L. If using this, also set gauge='user'.
- unobserved_value: (float, None)
Value to use for parameters when no corresponding sequences were present in the training data. If None, these parameters will be left alone. Using np.nan can help when visualizing models using mavenn.heatmap() or mavenn.heatmap_pairwise().
- Returns:
- theta: (dict)
Model parameters provided as a dict of numpy arrays.
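For an additive model the hierarchical gauge has a simple closed form, which the NumPy sketch below applies. It assumes phi = theta_0 + sum over l of theta_lc[l, x_l], shifts each position so that its p_lc-weighted mean is zero, and absorbs the shifts into theta_0; passing a one-hot p_lc gives the wild-type gauge. The functions fix_gauge_additive and phi_of are hypothetical and cover only the additive case.

```python
import numpy as np

def fix_gauge_additive(theta_0, theta_lc, p_lc):
    """Hierarchical gauge for an additive G-P map, given an (L,C) matrix p_lc."""
    # Mean effect at each position under the assumed distribution, shape (L,).
    mean_l = np.sum(p_lc * theta_lc, axis=1)
    fixed_lc = theta_lc - mean_l[:, None]     # each row now has weighted mean 0
    fixed_0 = theta_0 + mean_l.sum()          # constant term absorbs the shifts
    return fixed_0, fixed_lc

def phi_of(theta_0, theta_lc, idx):
    """Latent phenotype of a sequence given as L character indices."""
    return theta_0 + theta_lc[np.arange(len(idx)), idx].sum()

rng = np.random.default_rng(1)
L, C = 6, 4
theta_0, theta_lc = 0.3, rng.normal(size=(L, C))

x_wt = np.array([0, 1, 2, 3, 0, 1])           # wild-type sequence as indices
p_wt = np.eye(C)[x_wt]                        # one-hot (L,C) probability matrix
wt_0, wt_lc = fix_gauge_additive(theta_0, theta_lc, p_wt)

x = rng.integers(0, C, size=L)                # an arbitrary other sequence
```

In the wild-type gauge the constant term equals phi of x_wt and the wild-type characters have zero additive effect, while phi of any sequence is unchanged by the gauge transformation.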
- get_nn()
Return the underlying TensorFlow neural network.
- Parameters:
- None
- Returns:
- nn: (tf.keras.Model)
The backend TensorFlow model.
- x_to_phi(x)
Compute phi given x.
- Parameters:
- x: (np.ndarray)
Sequences, provided as an np.ndarray of strings, each of length L.
- Returns:
- phi: (array-like of float)
Latent phenotype values, provided as floats within an np.ndarray the same shape as x.
- x_to_yhat(x)
Compute yhat given x.
- Parameters:
- x: (np.ndarray)
Sequences, provided as an np.ndarray of strings, each of length L.
- Returns:
- yhat: (np.ndarray)
Observation values, provided as floats within an np.ndarray the same shape as x.
- simulate_dataset(template_df)
Generate a simulated dataset.
- Parameters:
- template_df: (pd.DataFrame)
Dataset on which to base the simulated dataset. Specifically, the simulated dataset will have the same sequences and the same train/validation/test flags, but different values for 'y' (in the case of a GE regression model) or 'ct_#' (in the case of an MPA regression model).
- Returns:
- simulated_df: (pd.DataFrame)
Simulated dataset in the form of a dataframe. Columns include 'set', 'phi', and 'x'. For GE models, additional columns 'yhat' and 'y' are added. For MPA models, multiple columns of the form 'ct_#' are added.
- I_variational(x, y, ct=None, knn_fuzz=0.01, uncertainty=True)
Estimate variational information.
Variational information, I_var, is the mutual information I[phi;y] between latent phenotypes phi and measurements y under the assumption that the inferred measurement process p(y|phi) is correct. I_var is an affine transformation of log likelihood and thus provides a useful metric during model training. When evaluated on test data, I_var also provides a lower bound to the predictive information I_pred, which does not assume that the inferred measurement process is correct. The difference I_pred - I_var thus quantifies the mismatch between the inferred measurement process and the true conditional distribution p(y|phi).
- Parameters:
- x: (np.ndarray)
1D array of N sequences, each of length L.
- y: (np.ndarray)
Array of measurements. For GE models, y must be a 1D array of N floats. For MPA models, y must be either a 1D or 2D array of nonnegative ints. If 1D, y must be of length N, and will be interpreted as listing bin numbers, i.e. 0, 1, …, Y-1. If 2D, y must be of shape (N,Y), and will be interpreted as listing the observed counts for each of the N sequences in each of the Y bins.
- ct: (np.ndarray, None)
Only used for MPA models when y is 1D. In this case, ct must be a 1D array, length N, of nonnegative integers, and represents the number of observations of each sequence in each bin. Use ct=None for GE models, as well as for MPA models when y is 2D.
- knn_fuzz: (float>0)
Amount of noise to add to y values before passing them to the KNN estimators. Specifically, Gaussian noise with standard deviation knn_fuzz * np.std(y) is added to y values. This is a hack and is not ideal, but is needed to get the KNN estimates to behave well on real MAVE data. Only used for GE regression models.
- uncertainty: (bool)
Whether to estimate the uncertainty of I_var.
- Returns:
- I_var: (float)
Estimated variational information, in bits.
- dI_var: (float)
Standard error for I_var. Is 0 if uncertainty=False is used.
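The statement that I_var is an affine transformation of log likelihood can be written out as follows. This is a sketch under the assumptions that entropies and logarithms are in bits and that the average runs over the N observations (phi_n, y_n); the notation is introduced here for illustration and is not taken from the MAVE-NN source.

```latex
I_{\mathrm{var}}[\phi; y]
  = H[y] + \frac{1}{N} \sum_{n=1}^{N} \log_2 p(y_n \mid \phi_n),
\qquad
I_{\mathrm{var}} \le I_{\mathrm{pred}} .
```

Because the entropy H[y] of the measurements does not depend on the model parameters, increasing the per-observation log likelihood increases I_var by the same amount.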
- I_predictive(x, y, ct=None, knn=5, knn_fuzz=0.01, uncertainty=True, num_subsamples=25, use_LNC=False, alpha_LNC=0.5, verbose=False)
Estimate predictive information.
Predictive information, I_pred, is the mutual information I[phi;y] between latent phenotypes phi and measurements y. Unlike variational information, I_pred does not assume that the inferred measurement process p(y|phi) is correct. I_pred is estimated using the k-nearest-neighbor methods from the NPEET package.
- Parameters:
- x: (np.ndarray)
1D array of N sequences, each of length L.
- y: (np.ndarray)
Array of measurements. For GE models, y must be a 1D array of N floats. For MPA models, y must be either a 1D or 2D array of nonnegative ints. If 1D, y must be of length N, and will be interpreted as listing bin numbers, i.e. 0, 1, …, Y-1. If 2D, y must be of shape (N,Y), and will be interpreted as listing the observed counts for each of the N sequences in each of the Y bins.
- ct: (np.ndarray, None)
Only used for MPA models when y is 1D. In this case, ct must be a 1D array, length N, of nonnegative integers, and represents the number of observations of each sequence in each bin. Use ct=None for GE models, as well as for MPA models when y is 2D.
- knn: (int>0)
Number of nearest neighbors to use in the entropy estimators from the NPEET package.
- knn_fuzz: (float>0)
Amount of noise to add to phi values before passing them to the KNN estimators. Specifically, Gaussian noise with standard deviation knn_fuzz * np.std(phi) is added to phi values. This is a hack and is not ideal, but is needed to get the KNN estimates to behave well on real MAVE data.
- uncertainty: (bool)
Whether to estimate the uncertainty in I_pred. Substantially increases runtime if True.
- num_subsamples: (int)
Number of subsamples to use when estimating the uncertainty in I_pred.
- use_LNC: (bool)
Whether to use the Local Nonuniform Correction (LNC) of Gao et al., 2015 when computing I_pred for GE models. Substantially increases runtime if set to True.
- alpha_LNC: (float in (0,1))
Value of alpha to use when computing the LNC correction. See Gao et al., 2015 for details. Used only for GE models.
- verbose: (bool)
Whether to print results and execution time.
- Returns:
- I_pred: (float)
Estimated predictive information, in bits.
- dI_pred: (float)
Standard error for I_pred. Is 0 if uncertainty=False is used.
- yhat_to_yq(yhat, q=[0.16, 0.84], paired=False)
Compute quantiles of p(y|yhat); GE models only.
- Parameters:
- yhat: (np.ndarray)
Observable values, provided as an array of floats.
- q: (np.ndarray)
Quantile specifications, provided as an array of floats in the range [0,1].
- paired: (bool)
Whether values in yhat and q should be treated as paired. If True, quantiles will be computed using each value in yhat paired with the corresponding value in q. If False, the quantile for each value in yhat will be computed for every value in q.
- Returns:
- yq: (array of floats)
Quantiles of p(y|yhat). If paired=True, yq.shape will be equal to both yhat.shape and q.shape. If paired=False, yq.shape will be given by yhat.shape + q.shape.
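The two output shapes can be illustrated with a homoskedastic Gaussian noise model, for which the q-th quantile is yhat + sigma * z_q. The sketch below is not MAVE-NN's implementation: gaussian_yq and the fixed sigma are assumptions, and the standard normal quantile z_q comes from Python's statistics.NormalDist, which accepts only q strictly between 0 and 1.

```python
import numpy as np
from statistics import NormalDist

def gaussian_yq(yhat, q=(0.16, 0.84), sigma=1.0, paired=False):
    """Quantiles of a Gaussian p(y|yhat) with mean yhat and std sigma."""
    yhat = np.asarray(yhat, dtype=float)
    q = np.asarray(q, dtype=float)
    z = np.vectorize(NormalDist().inv_cdf)(q)     # standard normal quantiles
    if paired:
        if yhat.shape != q.shape:
            raise ValueError("paired=True requires yhat.shape == q.shape")
        return yhat + sigma * z
    # Unpaired: combine every yhat with every q -> shape yhat.shape + q.shape.
    return yhat.reshape(yhat.shape + (1,) * q.ndim) + sigma * z

yhat = np.array([0.0, 1.0, 2.0])
yq = gaussian_yq(yhat, q=[0.16, 0.5, 0.84, 0.975])            # shape (3, 4)
yq_paired = gaussian_yq(yhat, q=[0.5, 0.5, 0.5], paired=True)  # shape (3,)
```

With the default q=[0.16, 0.84], the two columns of the unpaired result bracket roughly the central 68% interval around each yhat.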
- p_of_y_given_phi(y, phi, paired=False)
Compute probabilities p(y|phi).
- Parameters:
- y: (np.ndarray)
Measurement values. For GE models, must be an array of floats. For MPA models, must be an array of ints representing bin numbers.
- phi: (np.ndarray)
Latent phenotype values, provided as an array of floats.
- paired: (bool)
Whether values in y and phi should be treated as paired. If True, the probability of each value in y will be computed using the single paired value in phi. If False, the probability of each value in y will be computed against all values in phi.
- Returns:
- p: (np.ndarray)
Probability of y given phi. If paired=True, p.shape will be equal to both y.shape and phi.shape. If paired=False, p.shape will be given by y.shape + phi.shape.
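The shape rule y.shape + phi.shape also holds for multidimensional inputs, as the following NumPy sketch shows. It evaluates a Gaussian density centered on phi, which amounts to assuming an identity map from phi to yhat and a fixed sigma; both are simplifications for illustration, and gaussian_p_of_y_given_phi is a hypothetical stand-in for the model's inferred measurement process.

```python
import numpy as np

def gaussian_p_of_y_given_phi(y, phi, sigma=1.0, paired=False):
    """Density of y under a Gaussian centered on phi (identity link assumed)."""
    y = np.asarray(y, dtype=float)
    phi = np.asarray(phi, dtype=float)
    if paired:
        if y.shape != phi.shape:
            raise ValueError("paired=True requires y.shape == phi.shape")
        diff = y - phi
    else:
        # Append singleton axes to y so that each y meets every phi.
        diff = y.reshape(y.shape + (1,) * phi.ndim) - phi
    return np.exp(-0.5 * (diff / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

y = np.linspace(-1, 1, 5)                     # shape (5,)
phi = np.zeros((2, 3))                        # shape (2, 3)
p = gaussian_p_of_y_given_phi(y, phi)         # shape (5, 2, 3)
p_paired = gaussian_p_of_y_given_phi(y, y, paired=True)   # shape (5,)
```

The unpaired form is convenient for drawing p(y|phi) on a grid, while the paired form matches per-observation likelihood evaluation.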
- p_of_y_given_yhat(y, yhat, paired=False)
Compute probabilities p(y|yhat); GE models only.
- Parameters:
- y: (np.ndarray)
Measurement values, provided as an array of floats.
- yhat: (np.ndarray)
Observable values, provided as an array of floats.
- paired: (bool)
Whether values in y and yhat should be treated as paired. If True, the probability of each value in y will be computed using the single paired value in yhat. If False, the probability of each value in y will be computed against all values in yhat.
- Returns:
- p: (np.ndarray)
Probability of y given yhat. If paired=True, p.shape will be equal to both y.shape and yhat.shape. If paired=False, p.shape will be given by y.shape + yhat.shape.
- p_of_y_given_x(y, x, paired=True)
Compute probabilities p(y|x).
- Parameters:
- y: (np.ndarray)
Measurement values. For GE models, must be an array of floats. For MPA models, must be an array of ints representing bin numbers.
- x: (np.ndarray)
Sequences, provided as an array of strings, each of length L.
- paired: (bool)
Whether values in y and x should be treated as paired. If True, the probability of each value in y will be computed using the single paired value in x. If False, the probability of each value in y will be computed against all values in x.
- Returns:
- p: (np.ndarray)
Probability of y given x. If paired=True, p.shape will be equal to both y.shape and x.shape. If paired=False, p.shape will be given by y.shape + x.shape.
- save(filename, verbose=True)
Save model.
Saved models are represented by two files having the same root and two different extensions, .pickle and .h5. The .pickle file contains model metadata, including all information needed to reconstruct the model’s architecture. The .h5 file contains the values of the trained neural network weights. Note that training data is not saved.
- Parameters:
- filename: (str)
File directory and root. Do not include extensions.
- verbose: (bool)
Whether to print feedback.
- Returns:
- None
- bootstrap(data_df, num_models=10, verbose=True, initialize_from_self=False, fit_kwargs={})
Sample plausible models using parametric bootstrapping.
Given a copy data_df of the initial dataset used to train/test the model, this function first simulates num_models datasets, each of which has the same sequences and corresponding training, validation, and test set designations as data_df, but simulated measurement values (either y column or ct_# column values) generated using self. One model having the same form as self is then fit to each dataset, and the list of resulting models is returned to the user.
- Parameters:
- data_df: (pd.DataFrame)
The dataset used to fit the original model (i.e., self). Must have a column 'x' listing sequences, as well as a column 'set' whose entries are 'training', 'validation', or 'test'.
- num_models: (int > 0)
Number of models to return.
- verbose: (bool)
Whether to print feedback.
- initialize_from_self: (bool)
Whether to initiate each bootstrapped model from the inferred parameters of self. WARNING: using this option can cause systematic underestimation of parameter uncertainty.
- fit_kwargs: (dict)
Dictionary of keyword arguments. Entries will override the keyword arguments that were passed to self.fit() during initial model training, and which are used by default for training the simulation-inferred model here.
- Returns:
- models: (list)
List of mavenn.Model objects.
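The same recipe can be tried on a toy problem. The sketch below applies parametric bootstrapping to a straight-line fit with Gaussian noise in NumPy: fit once, simulate num_models datasets from the fitted model at the original inputs, refit each one, and read parameter uncertainty from the spread of the refits. Everything here (the linear model, fit_line, the noise estimate) is a stand-in chosen for illustration and does not use mavenn.Model.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares slope, intercept, and residual standard deviation."""
    slope, intercept = np.polyfit(x, y, deg=1)
    resid = y - (slope * x + intercept)
    return slope, intercept, resid.std(ddof=2)

def parametric_bootstrap(x, y, num_models=10, seed=0):
    """Refit the model to datasets simulated from the original fit."""
    rng = np.random.default_rng(seed)
    slope, intercept, sigma = fit_line(x, y)
    fits = []
    for _ in range(num_models):
        # Same inputs as the original data, freshly simulated measurements.
        y_sim = slope * x + intercept + rng.normal(scale=sigma, size=x.shape)
        fits.append(fit_line(x, y_sim)[:2])
    return np.array(fits)                      # shape (num_models, 2)

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)

fits = parametric_bootstrap(x, y, num_models=50)
slope_err, intercept_err = fits.std(axis=0)    # bootstrap standard errors
```

With MAVE-NN models the analogous step would be to call a method such as get_theta() on each model in the returned list and take the spread across models as the parameter uncertainty.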