API Reference¶
Tests¶
A suite of automated tests is provided to ensure proper software installation and execution.
- mavenn.run_tests()¶
Run all MAVE-NN functional tests.
Examples¶
A variety of real-world datasets, pre-trained models, analysis demos, and tutorials can be accessed using the following functions.
- mavenn.load_example_dataset(name=None)¶
Load example dataset provided with MAVE-NN.
- Parameters
- name: (str)
Name of example dataset. If
None
, a list of valid dataset names will be printed.
- Returns
- data_df: (pd.DataFrame)
Dataframe containing the example dataset.
- mavenn.load_example_model(name=None)¶
Load an example model already inferred by MAVE-NN.
- Parameters
- name: (str, None)
Name of model to load. If
None
, a list of valid model names will be printed.
- Returns
- model: (mavenn.Model)
A pre-trained Model object.
- mavenn.run_demo(name=None, print_code=False, print_names=True)¶
Perform demonstration of MAVE-NN.
- Parameters
- name: (str, None)
Name of demo to run. If None, a list of valid demo names will be returned.
- print_code: (bool)
If True, the text of the demo file will be printed along with the output from running this file. If False, only the demo output will be shown.
- print_names: (bool)
If True and name=None, the names of all demos will be printed.
- Returns
- demo_names: (list, None)
List of demo names, returned if the user passes name=None. Otherwise None.
- mavenn.list_tutorials()¶
Reveal local directory where MAVE-NN tutorials are stored, as well as the names of available tutorial notebook files.
Load¶
MAVE-NN allows users to save and load trained models.
- mavenn.load(filename, verbose=True)¶
Load a previously saved model.
Saved models are represented by two files having the same root and two different extensions, .pickle and .h5. The .pickle file contains model metadata, including all information needed to reconstruct the model’s architecture. The .h5 file contains the values of the trained neural network weights.
- Parameters
- filename: (str)
File directory and root. Do not include extensions.
- verbose: (bool)
Whether to print feedback.
- Returns
- loaded_model: (mavenn.Model)
MAVE-NN model object.
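The file-naming convention described above can be sketched as follows. This is only an illustration: the helper `model_file_paths` is hypothetical and not part of MAVE-NN, and dropping an extension supplied by mistake is this sketch's own choice rather than documented library behavior.

```python
import os


def model_file_paths(filename):
    """Return the (.pickle, .h5) paths implied by a root such as 'models/my_model'.

    Hypothetical helper for illustration; not a MAVE-NN function.
    """
    root, ext = os.path.splitext(filename)
    if ext in (".pickle", ".h5"):
        # The documentation asks for a root without extensions,
        # so drop one that was supplied by mistake.
        filename = root
    return filename + ".pickle", filename + ".h5"


paths = model_file_paths("models/my_model")
print(paths)
```

Checking that both paths exist before loading is a cheap way to catch a model that was only partially copied between machines.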
Visualization¶
MAVE-NN provides the following two methods to facilitate the visualization of inferred genotype-phenotype maps.
- mavenn.heatmap(values, alphabet, seq=None, seq_kwargs=None, ax=None, show_spines=False, cbar=True, cax=None, clim=None, clim_quantile=1, ccenter=None, cmap='coolwarm', cmap_size='5%', cmap_pad=0.1)¶
Draw a heatmap illustrating an
L
xC
matrix of values, whereL
is sequence length andC
is the alphabet size.- Parameters
- values: (np.ndarray)
Array of shape
(L,C)
that contains values to plot.- alphabet: (str, np.ndarray)
Alphabet name
'dna'
,'rna'
, or'protein'
, or 1D array containing characters in the alphabet.- seq: (str, None)
The sequence to show, if any, using dots plotted on top of the heatmap. Must have length
L
and be comprised of characters inalphabet
.- seq_kwargs: (dict)
Arguments to pass to
Axes.scatter()
when drawing dots to illustrate the characters inseq
.- ax: (matplotlib.axes.Axes)
The
Axes
object on which the heatmap will be drawn. IfNone
, one will be created. If specified,cbar=True
, andcax=None
,ax
will be split in two to make room for a colorbar.- show_spines: (bool)
Whether to show spines around the edges of the heatmap.
- cbar: (bool)
Whether to draw a colorbar next to the heatmap.
- cax: (matplotlib.axes.Axes, None)
The
Axes
object on which the colorbar will be drawn, if requested. IfNone
, one will be created by splittingax
in two according tocmap_size
andcmap_pad
.- clim: (list, None)
List of the form
[cmin, cmax]
, specifying the maximumcmax
and minimumcmin
values spanned by the colormap. Overridesclim_quantile
.- clim_quantile: (float)
Must be a float in the range [0,1].
clim
will be automatically chosen to include this central quantile of values.- ccenter: (float)
Value at which to position the center of a diverging colormap. Setting
ccenter=0
often makes sense.- cmap: (str, matplotlib.colors.Colormap)
Colormap to use.
- cmap_size: (str)
Fraction of ax width to be used for the colorbar. For formatting requirements, see the documentation for mpl_toolkits.axes_grid1.make_axes_locatable().
- cmap_pad: (float)
Space between colorbar and the shrunken heatmap Axes. For formatting requirements, see the documentation for mpl_toolkits.axes_grid1.make_axes_locatable().
- Returns
- ax: (matplotlib.axes.Axes)
Axes
object containing the heatmap.- cb: (matplotlib.colorbar.Colorbar, None)
Colorbar object linked to
ax
, orNone
if no colorbar was drawn.
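To make the meaning of clim_quantile concrete, here is a minimal sketch of how a central quantile could be turned into color limits. The function `central_quantile_limits` is hypothetical, and the linear interpolation between sorted values is an assumption; the reference does not state how MAVE-NN computes these limits internally.

```python
def central_quantile_limits(values, clim_quantile=1.0):
    """Return [cmin, cmax] spanning the central `clim_quantile` of `values`.

    Hypothetical illustration; not MAVE-NN's implementation.
    """
    if not 0.0 <= clim_quantile <= 1.0:
        raise ValueError("clim_quantile must be in [0, 1]")
    vals = sorted(values)
    tail = (1.0 - clim_quantile) / 2.0  # probability mass cut from each side

    def quantile(p):
        # Linear interpolation between the two nearest sorted values.
        pos = p * (len(vals) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(vals) - 1)
        frac = pos - lo
        return vals[lo] * (1.0 - frac) + vals[hi] * frac

    return [quantile(tail), quantile(1.0 - tail)]


values = [float(v) for v in range(101)]
full = central_quantile_limits(values, 1.0)     # spans every value
central = central_quantile_limits(values, 0.9)  # trims 5% from each tail
```

A value below 1 is useful when a few extreme entries would otherwise wash out the colors of the remaining cells.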
- mavenn.heatmap_pairwise(values, alphabet, seq=None, seq_kwargs=None, ax=None, gpmap_type='pairwise', show_position=False, position_size=None, position_pad=1, show_alphabet=True, alphabet_size=None, alphabet_pad=1, show_seplines=True, sepline_kwargs=None, xlim_pad=0.1, ylim_pad=0.1, cbar=True, cax=None, clim=None, clim_quantile=1, ccenter=0, cmap='coolwarm', cmap_size='5%', cmap_pad=0.1)¶
Draw a heatmap illustrating pairwise or neighbor values, e.g. representing model parameters, mutational effects, etc.
Note: The resulting plot has aspect ratio of 1 and is scaled so that pixels have half-diagonal lengths given by half_pixel_diag = 1/(C*2), and blocks of characters have half-diagonal lengths given by half_block_diag = 1/2. This is done so that the horizontal distance between positions (as indicated by x-ticks) is 1.
- Parameters
- values: (np.array)
An array, shape
(L,C,L,C)
, containing pairwise or neighbor values. Note that only values at coordinates[l1, c1, l2, c2]
withl2
>l1
will be plotted. NaN values will not be plotted.- alphabet: (str, np.ndarray)
Alphabet name
'dna'
,'rna'
, or'protein'
, or 1D array containing characters in the alphabet.- seq: (str, None)
The sequence to show, if any, using dots plotted on top of the heatmap. Must have length
L
and be comprised of characters inalphabet
.- seq_kwargs: (dict)
Arguments to pass to
Axes.scatter()
when drawing dots to illustrate the characters inseq
.- ax: (matplotlib.axes.Axes)
The
Axes
object on which the heatmap will be drawn. IfNone
, one will be created. If specified,cbar=True
, andcax=None
,ax
will be split in two to make room for a colorbar.- gpmap_type: (str)
Determines how many pairwise parameters are plotted. Must be
'pairwise'
or'neighbor'
. If'pairwise'
, a triangular heatmap will be plotted. If'neighbor'
, a heatmap resembling a string of diamonds will be plotted.- show_position: (bool)
Whether to annotate the heatmap with position labels.
- position_size: (float)
Font size to use for position labels. Must be >= 0.
- position_pad: (float)
Additional padding, in units of
half_pixel_diag
, used to space the position labels further from the heatmap.- show_alphabet: (bool)
Whether to annotate the heatmap with character labels.
- alphabet_size: (float)
Font size to use for alphabet. Must be >= 0.
- alphabet_pad: (float)
Additional padding, in units of
half_pixel_diag
, used to space the alphabet labels from the heatmap.- show_seplines: (bool)
Whether to draw lines separating character blocks for different position pairs.
- sepline_kwargs: (dict)
Keywords to pass to
Axes.plot()
when drawing seplines.- xlim_pad: (float)
Additional padding to add (in absolute units) both left and right of the heatmap.
- ylim_pad: (float)
Additional padding to add (in absolute units) both above and below the heatmap.
- cbar: (bool)
Whether to draw a colorbar next to the heatmap.
- cax: (matplotlib.axes.Axes, None)
The
Axes
object on which the colorbar will be drawn, if requested. IfNone
, one will be created by splittingax
in two according tocmap_size
andcmap_pad
.- clim: (list, None)
List of the form
[cmin, cmax]
, specifying the maximumcmax
and minimumcmin
values spanned by the colormap. Overridesclim_quantile
.- clim_quantile: (float)
Must be a float in the range [0,1].
clim
will be automatically chosen to include this central quantile of values.- ccenter: (float)
Value at which to position the center of a diverging colormap. Setting
ccenter=0
often makes sense.- cmap: (str, matplotlib.colors.Colormap)
Colormap to use.
- cmap_size: (str)
Fraction of ax width to be used for the colorbar. For formatting requirements, see the documentation for mpl_toolkits.axes_grid1.make_axes_locatable().
- cmap_pad: (float)
Space between colorbar and the shrunken heatmap Axes. For formatting requirements, see the documentation for mpl_toolkits.axes_grid1.make_axes_locatable().
- Returns
- ax: (matplotlib.axes.Axes)
Axes
object containing the heatmap.- cb: (matplotlib.colorbar.Colorbar, None)
Colorbar object linked to
ax
, orNone
if no colorbar was drawn.
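The sketch below illustrates which position pairs end up in the plot and the pixel geometry from the note above. It assumes that 'neighbor' means adjacent positions (l2 = l1 + 1), which the reference implies but does not state; both helper functions are hypothetical and not part of MAVE-NN.

```python
def plotted_position_pairs(L, gpmap_type="pairwise"):
    """List the (l1, l2) position pairs with l2 > l1 that would be drawn.

    Hypothetical illustration; assumes 'neighbor' means l2 == l1 + 1.
    """
    if gpmap_type == "pairwise":
        return [(l1, l2) for l1 in range(L) for l2 in range(l1 + 1, L)]
    if gpmap_type == "neighbor":
        return [(l1, l1 + 1) for l1 in range(L - 1)]
    raise ValueError("gpmap_type must be 'pairwise' or 'neighbor'")


def half_pixel_diag(C):
    """Half-diagonal length of one pixel, as given in the note above."""
    return 1 / (C * 2)


pairwise_pairs = plotted_position_pairs(4, "pairwise")  # triangular layout
neighbor_pairs = plotted_position_pairs(4, "neighbor")  # string of diamonds
```

Each listed pair corresponds to one C x C block of pixels, so a pairwise plot for L positions contains L*(L-1)/2 blocks while a neighbor plot contains L-1.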
Models¶
The mavenn.Model class represents all neural-network-based models inferred by MAVE-NN. A variety of class methods make it easy to
- define models,
- fit models to data,
- access model parameters and metadata,
- save models,
- evaluate models on new data.
In particular, these methods allow users to train and analyze models without prior knowledge of TensorFlow 2, the deep learning framework used by MAVE-NN as a backend.
- class mavenn.Model(L, alphabet, regression_type, gpmap_type='additive', gpmap_kwargs={}, Y=2, ge_nonlinearity_type='nonlinear', ge_nonlinearity_monotonic=True, ge_nonlinearity_hidden_nodes=50, ge_noise_model_type='Gaussian', ge_heteroskedasticity_order=0, normalize_phi=True, mpa_hidden_nodes=50, theta_regularization=0.001, eta_regularization=0.1, ohe_batch_size=50000, custom_gpmap=None, initial_weights=None)¶
Represents a MAVE-NN model, which includes a genotype-phenotype (G-P) map as well as a measurement process. For global epistasis (GE) regression, set regression_type='GE'; for measurement process agnostic (MPA) regression, set regression_type='MPA'.
- Parameters
- L: (int)
Length of each training sequence. Must be
>= 1
.- alphabet: (str, np.ndarray)
Either the alphabet name (
'dna'
,'rna'
, or'protein'
) or a 1D array of characters to be used as the alphabet.- regression_type: (str)
Type of regression implemented by the model. Choices are
'GE'
(for a global epistasis model) and'MPA'
(for a measurement process agnostic model).- gpmap_type: (str)
Type of G-P map to infer. Choices are
'additive'
,'neighbor'
,'pairwise'
, and'blackbox'
.- gpmap_kwargs: (dict)
Additional keyword arguments used for specifying the G-P map.
- Y: (int)
The number of discrete y bins to use when defining an MPA model. Must be >= 2. Has no effect on GE models.
- ge_nonlinearity_monotonic: (boolean)
Whether to enforce a monotonicity constraint on the GE nonlinearity. Has no effect on MPA models.
- ge_nonlinearity_hidden_nodes: (int)
Number of hidden nodes (i.e. sigmoidal contributions) to use when defining the nonlinearity component of a GE model. Has no effect on MPA models.
- ge_noise_model_type: (str)
Noise model to use when defining a GE model. Choices are
'Gaussian'
,'Cauchy'
,'SkewedT'
, or'Empirical'
. Has no effect on MPA models.- ge_heteroskedasticity_order: (int)
In the GE model context, this represents the order of the polynomial(s) used to define noise model parameters as functions of
yhat
. The larger this is, the more heteroskedastic an inferred noise model is likely to be. Set to0
to enforce a homoskedastic noise model. Has no effect on MPA models. Must be>= 0
.- normalize_phi: (bool)
Whether to fix diffeomorphic modes after model training.
- mpa_hidden_nodes: (int)
Number of hidden nodes (i.e. sigmoidal contributions) to use when defining the MPA measurement process. Must be >= 1.
- theta_regularization: (float)
L2 regularization strength for G-P map parameters
theta
. Must be>= 0
; use0
for no regularization.- eta_regularization: (float)
L2 regularization strength for measurement process parameters
eta
. Must be>= 0
; use0
for no regularization.- ohe_batch_size: (int)
DISABLED. How many sequences to one-hot encode at a time when calling
Model.set_data()
. Typically, the larger this number is the quicker the encoding will happen. A number too large, however, may cause the computer’s memory to run out. Must be>= 1
.- custom_gpmap: (GPMapLayer sub-class)
A custom G-P map provided by the user. Must be a subclass of GPMapLayer, which defines the functionality for x_to_phi_layer.
- initial_weights: (np.array)
Numpy array of weights used as the initial weights of the model. Ignored if None.
Methods
I_predictive(x, y[, ct, knn, knn_fuzz, ...]): Estimate predictive information.
I_variational(x, y[, ct, knn_fuzz, uncertainty]): Estimate variational information.
bootstrap(data_df[, num_models, verbose, ...]): Sample plausible models using parametric bootstrapping.
fit([epochs, learning_rate, ...]): Infer values for model parameters.
get_nn(): Return the underlying TensorFlow neural network.
get_theta([gauge, p_lc, x_wt, unobserved_value]): Return parameters of the G-P map.
p_of_y_given_phi(y, phi[, paired]): Compute probabilities p(y|phi).
p_of_y_given_x(y, x[, paired]): Compute probabilities p(y|x).
p_of_y_given_yhat(y, yhat[, paired]): Compute probabilities p(y|yhat); GE models only.
phi_to_yhat(phi): Compute yhat given phi; GE models only.
save(filename[, verbose]): Save model.
set_data(x, y[, dy, ct, validation_frac, ...]): Set training data.
simulate_dataset(template_df): Generate a simulated dataset.
x_to_phi(x): Compute phi given x.
x_to_yhat(x): Compute yhat given x.
yhat_to_yq(yhat[, q, paired]): Compute quantiles of p(y|yhat); GE models only.
- I_predictive(x, y, ct=None, knn=5, knn_fuzz=0.01, uncertainty=True, num_subsamples=25, use_LNC=False, alpha_LNC=0.5, verbose=False)¶
Estimate predictive information.
Predictive information, I_pred, is the mutual information I[phi; y] between latent phenotypes phi and measurements y. Unlike variational information, I_pred does not assume that the inferred measurement process p(y|phi) is correct. I_pred is estimated using the k-nearest-neighbor methods from the NPEET package.
- Parameters
- x: (np.ndarray)
1D array of
N
sequences, each of lengthL
.- y: (np.ndarray)
Array of measurements. For GE models,
y
must be a 1D array ofN
floats. For MPA models,y
must be either a 1D or 2D array of nonnegative ints. If 1D,y
must be of lengthN
, and will be interpreted as listing bin numbers, i.e.0
,1
, …,Y-1
. If 2D,y
must be of shape(N,Y)
, and will be interpreted as listing the observed counts for each of theN
sequences in each of theY
bins.- ct: (np.ndarray, None)
Only used for MPA models when y is 1D. In this case, ct must be a 1D array, length N, of nonnegative integers, and represents the number of observations of each sequence in each bin. Use ct=None for GE models, as well as for MPA models when y is 2D.
- knn: (int>0)
Number of nearest neighbors to use in the entropy estimators from the NPEET package.
- knn_fuzz: (float>0)
Amount of noise to add to
phi
values before passing them to the KNN estimators. Specifically, Gaussian noise with standard deviationknn_fuzz * np.std(phi)
is added tophi
values. This is a hack and is not ideal, but is needed to get the KNN estimates to behave well on real MAVE data.- uncertainty: (bool)
Whether to estimate the uncertainty in
I_pred
. Substantially increases runtime ifTrue
.- num_subsamples: (int)
Number of subsamples to use when estimating the uncertainty in
I_pred
.- use_LNC: (bool)
Whether to use the Local Nonuniform Correction (LNC) of Gao et al., 2015 when computing I_pred for GE models. Substantially increases runtime if set to True.
- alpha_LNC: (float in (0,1))
Value of
alpha
to use when computing the LNC correction. See Gao et al., 2015 for details. Used only for GE models.- verbose: (bool)
Whether to print results and execution time.
- Returns
- I_pred: (float)
Estimated predictive information, in bits.
- dI_pred: (float)
Standard error for
I_pred
. Is0
ifuncertainty=False
is used.
- I_variational(x, y, ct=None, knn_fuzz=0.01, uncertainty=True)¶
Estimate variational information.
Variational information, I_var, is the mutual information I[phi; y] between latent phenotypes phi and measurements y under the assumption that the inferred measurement process p(y|phi) is correct. I_var is an affine transformation of log likelihood and thus provides a useful metric during model training. When evaluated on test data, I_var also provides a lower bound to the predictive information I_pred, which does not assume that the inferred measurement process is correct. The difference I_pred - I_var thus quantifies the mismatch between the inferred measurement process and the true conditional distribution p(y|phi).
- Parameters
- x: (np.ndarray)
1D array of
N
sequences, each of lengthL
.- y: (np.ndarray)
Array of measurements. For GE models,
y
must be a 1D array ofN
floats. For MPA models,y
must be either a 1D or 2D array of nonnegative ints. If 1D,y
must be of lengthN
, and will be interpreted as listing bin numbers, i.e.0
,1
, …,Y-1
. If 2D,y
must be of shape(N,Y)
, and will be interpreted as listing the observed counts for each of theN
sequences in each of theY
bins.- ct: (np.ndarray, None)
Only used for MPA models when y is 1D. In this case, ct must be a 1D array, length N, of nonnegative integers, and represents the number of observations of each sequence in each bin. Use ct=None for GE models, as well as for MPA models when y is 2D.
- knn_fuzz: (float>0)
Amount of noise to add to
y
values before passing them to the KNN estimators. Specifically, Gaussian noise with standard deviationknn_fuzz * np.std(y)
is added toy
values. This is a hack and is not ideal, but is needed to get the KNN estimates to behave well on real MAVE data. Only used for GE regression models.- uncertainty: (bool)
Whether to estimate the uncertainty of
I_var
.
- Returns
- I_var: (float)
Estimated variational information, in bits.
- dI_var: (float)
Standard error for
I_var
. Is0
ifuncertainty=False
is used.
- bootstrap(data_df, num_models=10, verbose=True, initialize_from_self=False, fit_kwargs={})¶
Sample plausible models using parametric bootstrapping.
Given a copy data_df of the initial dataset used to train/test the model, this function first simulates num_models datasets, each of which has the same sequences and corresponding training, validation, and test set designations as data_df, but simulated measurement values (either y column or ct_# column values) generated using self. One model having the same form as self is then fit to each dataset, and the list of resulting models is returned to the user.
- Parameters
- data_df: (pd.DataFrame)
The dataset used to fit the original model (i.e.,
self
). Must have a column'x'
listing sequences, as well as a column'set'
whose entries are'training'
,'validation'
, or'test'
.- num_models: (int > 0)
Number of models to return.
- verbose: (bool)
Whether to print feedback.
- initialize_from_self: (bool)
Whether to initiate each bootstrapped model from the inferred parameters of
self
. WARNING: using this option can cause systematic underestimation of parameter uncertainty.- fit_kwargs: (dict)
Dictionary of keyword arguments. Entries will override the keyword arguments that were passed to
self.fit()
during initial model training, and which are used by default for training the simulation-inferred model here.
- Returns
- models: (list)
List of
mavenn.Model
objects.
- fit(epochs=50, learning_rate=0.005, validation_split=0.2, verbose=True, early_stopping=True, early_stopping_patience=20, batch_size=50, linear_initialization=True, freeze_theta=False, callbacks=[], try_tqdm=True, optimizer='Adam', optimizer_kwargs={}, fit_kwargs={})¶
Infer values for model parameters.
Uses training algorithms from TensorFlow to learn model parameters. Before this is run, the training data must be set using
Model.set_data()
.- Parameters
- epochs: (int)
Maximum number of epochs to complete during model training. Must be
>= 0
.- learning_rate: (float)
Learning rate. Must be
> 0.
- validation_split: (float in [0,1])
Fraction of training data to reserve for validation.
- verbose: (boolean)
Whether to show progress during training.
- early_stopping: (bool)
Whether to use early stopping.
- early_stopping_patience: (int)
Number of epochs to wait, after a minimum value of validation loss is observed, before terminating the model training process.
- batch_size: (None, int)
Batch size to use for stochastic gradient descent and related algorithms. If None, a full-sized batch is used. Note that the negative log likelihood loss function used by MAVE-NN is extrinsic in batch_size.
- linear_initialization: (bool)
Whether to initialize theta from the results of a linear regression computation. Has no effect when
gpmap_type='blackbox'
.- freeze_theta: (bool)
Whether to set the weights of the G-P map layer to be non-trainable. Note that setting
linear_initialization=True
andfreeze_theta=True
will set theta to be initialized at the linear regression solution and then become frozen during training.- callbacks: (list)
Optional list of
tf.keras.callbacks.Callback
objects to use during training.- try_tqdm: (bool)
If true, mavenn will attempt to load the package tqdm and append TqdmCallback(verbose=0) to the callbacks list in order to improve the visual display of training progress. If users do not have tqdm installed, this will do nothing.
- optimizer: (str)
Optimizer to use for training. Valid options include:
'SGD'
,'RMSprop'
,'Adam'
,'Adadelta'
,'Adagrad'
,'Adamax'
,'Nadam'
,'Ftrl'
.- optimizer_kwargs: (dict)
Additional keyword arguments to pass to the tf.keras.optimizers.Optimizer constructor.
- fit_kwargs: (dict)
Additional keyword arguments to pass to tf.keras.Model.fit().
- Returns
- history: (tf.keras.callbacks.History)
Standard TensorFlow record of the training session.
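The patience rule described for early_stopping_patience can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the TensorFlow implementation: the function `stopping_epoch` is hypothetical, and options that a real early-stopping callback may apply (for example a minimum improvement threshold) are ignored.

```python
def stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop.

    Hypothetical illustration: stop once `patience` epochs have passed
    since the lowest validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New minimum of the validation loss: reset the wait.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    # Patience never ran out: training ends at the last epoch.
    return len(val_losses) - 1


losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
stop_short = stopping_epoch(losses, patience=3)   # minimum at epoch 2
stop_long = stopping_epoch(losses, patience=20)   # patience never exhausted
```

With a patience larger than the number of epochs, as in the second call, early stopping has no effect and training runs for the full number of epochs.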
- get_nn()¶
Return the underlying TensorFlow neural network.
- Parameters
- None
- Returns
- nn: (tf.keras.Model)
The backend TensorFlow model.
- get_theta(gauge='empirical', p_lc=None, x_wt=None, unobserved_value=nan)¶
Return parameters of the G-P map.
This function returns a dict containing the parameters of the model’s G-P map. Keys are of type str, values are of type np.ndarray. Relevant (key, value) pairs are:
- 'theta_0', constant term;
- 'theta_lc', additive effects in the form of a 2D array with shape (L,C);
- 'theta_lclc', pairwise effects in the form of a 4D array of shape (L,C,L,C);
- 'theta_bb', all parameters for gpmap_type='blackbox' models.
Importantly, this function gauge-fixes model parameters before returning them, i.e., it pins down non-identifiable degrees of freedom. Gauge fixing is performed using a hierarchical gauge, which maximizes the fraction of variance in phi explained by the lowest-order terms. Computing such variances requires assuming a probability distribution over sequence space, however, and using different distributions will result in different ways of fixing the gauge.
This function assumes that the distribution used to define the gauge factorizes across sequence positions, and can thus be represented by an L x C probability matrix p_lc that lists the probability of each character c at each position l.
An important special case is the wild-type gauge, in which p_lc is the one-hot encoding of a specific “wild-type” sequence x_wt. In this case, the constant parameter theta_0 is the value of phi for x_wt, additive parameters theta_lc represent the effect of single-point mutations away from x_wt, and so on.
- Parameters
- gauge: (str)
String specification of which gauge to use. Allowed values are:
'uniform', hierarchical gauge using a uniform sequence distribution over the characters at each position observed in the training set (unobserved characters are assigned probability 0);
'empirical', hierarchical gauge using an empirical distribution computed from the training data;
'consensus', wild-type gauge using the training data consensus sequence;
'user', gauge using either p_lc or x_wt supplied by the user;
'none', no gauge fixing.
- p_lc: (None, array)
Custom probability matrix to use for hierarchical gauge fixing. Must be a
np.ndarray
of shape(L,C)
. If using this, also setgauge='user'
.- x_wt: (str, None)
Custom wild-type sequence to use for wild-type gauge fixing. Must be a
str
of lengthL
. If using this, also setgauge='user'
.- unobserved_value: (float, None)
Value to use for parameters when no corresponding sequences were present in the training data. If None, these parameters will be left alone. Using np.nan can help when visualizing models using mavenn.heatmap() or mavenn.heatmap_pairwise().
- Returns
- theta: (dict)
Model parameters provided as a
dict
of numpy arrays.
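For an additive G-P map, the wild-type gauge described above can be worked through by hand. The sketch below is a plain-Python illustration, not MAVE-NN code: the function names are hypothetical, parameters are held in nested lists rather than numpy arrays, and the character order 'ACGT' for the DNA alphabet is an assumption.

```python
ALPHABET = "ACGT"  # assumed character order, for illustration only


def additive_phi(theta_0, theta_lc, seq):
    """phi = theta_0 + sum over positions l of theta_lc[l][character at l]."""
    return theta_0 + sum(theta_lc[l][ALPHABET.index(ch)] for l, ch in enumerate(seq))


def to_wild_type_gauge(theta_0, theta_lc, x_wt):
    """Re-express additive parameters so that theta_lc is zero along x_wt."""
    wt_vals = [theta_lc[l][ALPHABET.index(ch)] for l, ch in enumerate(x_wt)]
    new_theta_0 = theta_0 + sum(wt_vals)  # becomes phi evaluated at x_wt
    new_theta_lc = [[v - wt_vals[l] for v in row] for l, row in enumerate(theta_lc)]
    return new_theta_0, new_theta_lc


theta_0 = 1.0
theta_lc = [[0.5, -1.0, 2.0, 0.0],
            [1.0, 1.5, -0.5, 3.0]]
wt_theta_0, wt_theta_lc = to_wild_type_gauge(theta_0, theta_lc, "CG")
```

The two parameter sets assign the same phi to every sequence; only the split between the constant and additive terms changes, which is what makes the choice of gauge a matter of interpretation rather than of model fit.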
- p_of_y_given_phi(y, phi, paired=False)¶
Compute probabilities p(
y
|phi
).- Parameters
- y: (np.ndarray)
Measurement values. For GE models, must be an array of floats. For MPA models, must be an array of ints representing bin numbers.
- phi: (np.ndarray)
Latent phenotype values, provided as an array of floats.
- paired: (bool)
Whether values in y and phi should be treated as paired. If True, the probability of each value in y will be computed using the single paired value in phi. If False, the probability of each value in y will be computed against all values in phi.
- Returns
- p: (np.ndarray)
Probability of
y
givenphi
. Ifpaired=True
,p.shape
will be equal to bothy.shape
andphi.shape
. Ifpaired=False
,p.shape
will be given byy.shape + phi.shape
.
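The effect of paired on the output shape can be sketched with a toy measurement process. Everything here is an assumption made for illustration: a unit-variance Gaussian centered on phi stands in for the inferred p(y|phi), nested lists stand in for numpy arrays, and `toy_p_of_y_given_phi` is a hypothetical name.

```python
import math


def gaussian_pdf(y, mean, sigma=1.0):
    """Density of a normal distribution; a stand-in for the inferred p(y|phi)."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def toy_p_of_y_given_phi(y, phi, paired=False):
    """Mimic the documented shape rule with nested lists (illustration only)."""
    if paired:
        if len(y) != len(phi):
            raise ValueError("paired=True requires y and phi of equal length")
        # One probability per (y[i], phi[i]) pair.
        return [gaussian_pdf(yi, pi) for yi, pi in zip(y, phi)]
    # One probability per combination: outer index over y, inner over phi.
    return [[gaussian_pdf(yi, pj) for pj in phi] for yi in y]


phi = [0.0, 1.0]
p_grid = toy_p_of_y_given_phi([0.0, 1.0, 2.0], phi)       # 3 rows of 2 values
p_pair = toy_p_of_y_given_phi([0.0, 1.0], phi, paired=True)  # 2 values
```

The unpaired form is convenient for drawing p(y|phi) over a grid of y and phi values, while the paired form matches per-observation likelihood calculations.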
- p_of_y_given_x(y, x, paired=True)¶
Compute probabilities p(
y
|x
).- Parameters
- y: (np.ndarray)
Measurement values. For GE models, must be an array of floats. For MPA models, must be an array of ints representing bin numbers.
- x: (np.ndarray)
Sequences, provided as an array of strings, each of length
L
.- paired: (bool)
Whether values in y and x should be treated as paired. If True, the probability of each value in y will be computed using the single paired value in x. If False, the probability of each value in y will be computed against all values in x.
- Returns
- p: (np.ndarray)
Probability of
y
givenx
. Ifpaired=True
,p.shape
will be equal to bothy.shape
andx.shape
. Ifpaired=False
,p.shape
will be given byy.shape + x.shape
.
- p_of_y_given_yhat(y, yhat, paired=False)¶
Compute probabilities p(
y
|yhat
); GE models only.- Parameters
- y: (np.ndarray)
Measurement values, provided as an array of floats.
- yhat: (np.ndarray)
Observable values, provided as an array of floats.
- paired: (bool)
Whether values in y and yhat should be treated as paired. If True, the probability of each value in y will be computed using the single paired value in yhat. If False, the probability of each value in y will be computed against all values in yhat.
- Returns
- p: (np.ndarray)
Probability of
y
givenyhat
. Ifpaired=True
,p.shape
will be equal to bothy.shape
andyhat.shape
. Ifpaired=False
,p.shape
will be given byy.shape + yhat.shape
.
- phi_to_yhat(phi)¶
Compute yhat given phi; GE models only.
- Parameters
- phi: (array-like)
Latent phenotype values, provided as an
np.ndarray
of floats.
- Returns
- y_hat: (array-like)
Observable values in an
np.ndarray
the same shape asphi
.
- save(filename, verbose=True)¶
Save model.
Saved models are represented by two files having the same root and two different extensions,
.pickle
and.h5
. The.pickle
file contains model metadata, including all information needed to reconstruct the model’s architecture. The.h5
file contains the values of the trained neural network weights. Note that training data is not saved.- Parameters
- filename: (str)
File directory and root. Do not include extensions.
- verbose: (bool)
Whether to print feedback.
- Returns
- None
- set_data(x, y, dy=None, ct=None, validation_frac=0.2, validation_flags=None, shuffle=True, knn_fuzz=0.01, verbose=True)¶
Set training data.
Prepares data for use during training, e.g. by shuffling and one-hot encoding training data sequences. Must be called before
Model.fit()
.- Parameters
- x: (np.ndarray)
1D array of
N
sequences, each of lengthL
.- y: (np.ndarray)
Array of measurements. For GE models,
y
must be a 1D array ofN
floats. For MPA models,y
must be either a 1D or 2D array of nonnegative ints. If 1D,y
must be of lengthN
, and will be interpreted as listing bin numbers, i.e.0
,1
, …,Y-1
. If 2D,y
must be of shape(N,Y)
, and will be interpreted as listing the observed counts for each of theN
sequences in each of theY
bins.
- dy: (np.ndarray, None)
User supplied error bars associated with continuous measurements to be used as sigma in the Gaussian noise model.
- ct: (np.ndarray, None)
Only used for MPA models when y is 1D. In this case, ct must be a 1D array, length N, of nonnegative integers, and represents the number of observations of each sequence in each bin. Use ct=None for GE models, as well as for MPA models when y is 2D.
- validation_frac: (float)
Fraction of observations to use for the validation set. Is overridden when setting validation_flags. Must be in the range [0,1].
- validation_flags: (np.ndarray, None)
1D array of N booleans, with True indicating which observations should be reserved for the validation set. If None, the training and validation sets will be randomly assigned based on the value of validation_frac.
- shuffle: (bool)
Whether to shuffle the observations, e.g., to ensure similar composition of the training and validation sets when
validation_flags
is not set.- knn_fuzz: (float>0)
Amount of noise to add to
y
values before passing them to the KNN estimator (for computing I_var during training). Specifically, Gaussian noise with standard deviationknn_fuzz * np.std(y)
is added toy
values. This is needed to mitigate errors caused by multiple observations of the same sequence. Only used for GE regression.- verbose: (bool)
Whether to provide printed feedback.
- Returns
- None
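The two MPA input formats described for y and ct carry the same information. The sketch below converts the 1D form (one row per sequence and bin, with a count) into per-sequence count rows of length Y, as in the 2D form. It is an illustration only: `counts_table` is a hypothetical helper, and it assumes that a sequence may appear on several rows of the 1D form, once per occupied bin.

```python
def counts_table(x, y, ct, Y):
    """Collect 1D-format MPA data into {sequence: [count in bin 0, ..., bin Y-1]}.

    Hypothetical helper for illustration; not part of MAVE-NN.
    """
    table = {}
    for seq, bin_index, count in zip(x, y, ct):
        if not 0 <= bin_index < Y:
            raise ValueError("bin numbers must lie in 0, 1, ..., Y-1")
        row = table.setdefault(seq, [0] * Y)
        row[bin_index] += count
    return table


x = ["AAC", "AAC", "AGC"]   # sequences
y = [0, 2, 1]               # bin number for each row
ct = [5, 3, 7]              # number of observations for each row
table = counts_table(x, y, ct, Y=3)
```

Stacking the rows of the resulting table in a fixed sequence order gives an array of shape (N,Y) in the sense of the 2D format.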
- simulate_dataset(template_df)¶
Generate a simulated dataset.
- Parameters
- template_df: (pd.DataFrame)
Dataset off of which to base the simulated dataset. Specifically, the simulated dataset will have the same sequences and the same train/validation/test flags, but different values for 'y' (in the case of a GE regression model) or 'ct_#' (in the case of an MPA regression model).
- Returns
- simulated_df: (pd.DataFrame)
Simulated dataset in the form of a dataframe. Columns include
'set'
,'phi'
, and'x'
. For GE models, additional columns'yhat'
and'y'
are added. For MPA models, multiple columns of the form'ct_#'
are added.
- x_to_phi(x)¶
Compute
phi
givenx
.- Parameters
- x: (np.ndarray)
Sequences, provided as an
np.ndarray
of strings, each of lengthL
.
- Returns
- phi: (array-like of float)
Latent phenotype values, provided as floats within an
np.ndarray
the same shape asx
.
- x_to_yhat(x)¶
Compute
yhat
givenx
.- Parameters
- x: (np.ndarray)
Sequences, provided as an
np.ndarray
of strings, each of lengthL
.
- Returns
- yhat: (np.ndarray)
Observable values, provided as floats within an
np.ndarray
the same shape asx
.
- yhat_to_yq(yhat, q=[0.16, 0.84], paired=False)¶
Compute quantiles of p(
y
|yhat
); GE models only.- Parameters
- yhat: (np.ndarray)
Observable values, provided as an array of floats.
- q: (np.ndarray)
Quantile specifications, provided as an array of floats in the range [0,1].
- paired: (bool)
Whether values in
yhat
andq
should be treated as paired. IfTrue
, quantiles will be computed using each value inyhat
paired with the corresponding value inq
. IfFalse
, the quantile for each value inyhat
will be computed for every value inq
.
- Returns
- yq: (array of floats)
Quantiles of p(
y
|yhat
). Ifpaired=True
,yq.shape
will be equal to bothyhat.shape
andq.shape
. Ifpaired=False
,yq.shape
will be given byyhat.shape + q.shape
.
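As an illustration of what the default q=[0.16, 0.84] means, the sketch below computes quantiles for a toy homoskedastic Gaussian noise model, for which these two quantiles lie roughly one standard deviation below and above yhat. The function `toy_yhat_to_yq` and the fixed `sigma` are assumptions made for this example; a trained model uses its own inferred noise model.

```python
from statistics import NormalDist


def toy_yhat_to_yq(yhat, q=(0.16, 0.84), sigma=1.0):
    """Quantiles of a Gaussian p(y|yhat) with mean yhat and fixed sigma.

    Mirrors the paired=False layout: one row per yhat, one column per q.
    Hypothetical illustration; not MAVE-NN's implementation.
    """
    return [[NormalDist(mu=m, sigma=sigma).inv_cdf(qi) for qi in q] for m in yhat]


yq = toy_yhat_to_yq([0.0, 2.0], sigma=0.5)
```

Plotting the two columns against yhat gives a band containing about 68% of the measurements expected under the noise model.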