{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "$$\\newcommand{\\mean}{\\mathrm{mean}}\n", "\\newcommand{\\std}{\\mathrm{std}}\n", "\\newcommand{\\wt}{\\mathrm{wt}}\n", "\\newcommand{\\like}{\\mathrm{like}}\n", "\\newcommand{\\pred}{\\mathrm{pred}}\n", "\\newcommand{\\test}{\\mathrm{test}}\n", "\\newcommand{\\intr}{\\mathrm{intr}}\n", "\\newcommand{\\train}{\\mathrm{train}}\n", "\\newcommand{\\true}{\\mathrm{true}}\n", "\\newcommand{\\set}[1]{\\left\\{ #1 \\right\\}}\n", "\\newcommand{\\ev}[1]{\\left. #1 \\right|} \n", "\\newcommand{\\bra}[1]{\\left\\langle #1 \\right|}\n", "\\newcommand{\\ket}[1]{\\left| #1 \\right\\rangle}\n", "\\newcommand{\\braket}[1]{\\left\\langle #1 \\right\\rangle}\n", "\\newcommand{\\bbraket}[1]{\\left\\llangle #1 \\right\\rrangle}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Underlying Mathematics\n", "$$\\newcommand{\\eq}[1]{Eq.\\ (\\ref{#1})}\n", "\\newcommand{\\fig}[2]{Fig.\\ \\ref{#1}#2}\n", "\\newcommand{\\diag}{{\\rm diag}}\n", "\\newcommand{\\cov}{{\\rm cov}}\n", "\\newcommand{\\data}{{\\rm data}}\n", "\\newcommand{\\real}{{\\rm real}}\n", "\\newcommand{\\die}{\\mathrm{die}}\n", "\\newcommand{\\live}{\\mathrm{live}}\n", "\\newcommand{\\Lap}{\\mathrm{Lap}}\n", "\\newcommand{\\sym}{\\{ \\mathrm{sym~factor} \\}}\n", "\\newcommand{\\const}{{\\rm const}}\n", "\\newcommand{\\true}{\\mathrm{true}}\n", "\\newcommand{\\set}[1]{\\left\\{ #1 \\right\\}}\n", "\\newcommand{\\ev}[1]{\\left. #1 \\right|} \n", "\\newcommand{\\bra}[1]{\\left\\langle #1 \\right|}\n", "\\newcommand{\\ket}[1]{\\left| #1 \\right\\rangle}\n", "\\newcommand{\\braket}[1]{\\left\\langle #1 \\right\\rangle}\n", "\\newcommand{\\bbraket}[1]{\\left\\llangle #1 \\right\\rrangle}\n", "\\newcommand{\\mat}[2]{\\left( \\begin{array}{#1} #2 \\end{array} \\right)}\n", "\\newcommand{\\bracemat}[2]{\\left\\{ \\begin{array}{#1} #2 \\end{array} \\right\\}}\n", "\\newcommand{\\comment}[1]{{\\color{red}[#1]}}\n", "\\newcommand{\\code}[1]{\\texttt{#1}}\n", "\\newcommand{\\mean}{\\mathrm{mean}}\n", "\\newcommand{\\std}{\\mathrm{std}}\n", "\\newcommand{\\wt}{\\mathrm{wt}}\n", "\\newcommand{\\var}{\\mathrm{var}}\n", "\\newcommand{\\pred}{\\mathrm{pred}}\n", "\\newcommand{\\test}{\\mathrm{test}}\n", "\\newcommand{\\intr}{\\mathrm{intr}}\n", "\\newcommand{\\train}{\\mathrm{train}}\n", "\\newcommand{\\model}{\\mathrm{model}}\n", "\\newcommand{\\true}{\\mathrm{true}}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notation\n", "\n", "Each MAVE dataset is represented a set of $N$ observations, $\\set{(\\vec{x}_n,y_n)}_{n=1}^N$, where each observation consists of a sequence $\\vec{x}_n$ and a measurement $y_n$. Here, $y_n$ can be either a continuous real-valued number, or a categorical variable representing the bin in which the $n$th sequence was found. Note that, in this representation the same sequence $\\vec{x}$ can be observed multiple times, potentially with different values for $y$ due to experimental noise. Datasets with real-valued measurements $y$ are analyzed using GE regression, while datasets with categorical $y$ values are analyzed using MPA regression. Sets of measurements are denoted $\\set{y_n}$, etc., and the mean and standard deviations of sets of numbers are denoted by the functions $\\mean(\\set{\\cdot})$ and $\\std(\\set{\\cdot})$\n", "\n", "In the ``mavenn.Model`` constructor, ``L`` controls sequence length, ``alphabet`` controls the set of characters, ``Y`` controls the number of $y$-bins (MPA regression only), and ``regression_type`` controls whether GE or MPA regression is used." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Genotype-phenotype (G-P) maps\n", "\n", "We assume that all sequences have the same length $L$, and that at each of the $L$ positions in each sequence there is one of $C$ possible characters ($C = 4$ for DNA and RNA; $C = 20$ for protein). In general, MAVE-NN represents sequences using a vector of binary features $\\vec{x}$. We adopt the following notation for the individual features in $\\vec{x}$:\n", "\n", "$$\\begin{eqnarray}\n", "x_{l:c} &=& \\left\\{ \n", "\\begin{array}{cl} \n", " 1 & \\mathrm{if~character~}c\\mathrm{~occurs~at~position~}l~~~ \\\\ \n", " 0 & \\mathrm{otherwise}, \n", "\\end{array} \\right. \n", "\\end{eqnarray}$$\n", "\n", "where $l = 1, 2, \\ldots, L$ indexes positions within the sequence, and $c$ indexes the $C$ distinct characters. \n", "\n", "We assume that the latent phenotype is given by a function $\\phi (\\vec{x}; \\vec{\\theta} )$ that depends on a vector of G-P map parameters $\\vec{\\theta}$. MAVE-NN currently supports four types of G-P maps, all of which can be inferred using either GE regression or MPA regression. \n", "\n", "In the ``mavenn.Model`` constructor, ``gpmap_type`` controls which type of G-P map is used.\n", "\n", "### Additive G-P maps\n", "Additive G-P maps assume that each position in $\\vec{x}$ contributes independently to the latent phenotype $\\phi$, and is computed using,\n", "\n", "$$\\begin{eqnarray}\n", "\\phi(\\vec{x};\\vec{\\theta})_\\text{additive} &=& \\theta_0 + \\sum_{l=0}^{L-1} \\sum_c \\theta_{l:c}x_{l:c}.\n", "\\end{eqnarray}$$\n", "\n", "To use an additive G-P map, set ``gpmap_type='additive'`` in the ``mavenn.Model`` constructor. \n", "\n", "### Neighbor G-P maps\n", "\n", "Neighbor G-P maps assume that each neighboring pair of positions in $\\vec{x}$ contributes to the latent phenotype $\\phi$, and is comptued using,\n", "\n", "$$\\begin{eqnarray}\n", "\\phi_{\\text{neighbor}}(\\vec{x};\\vec{\\theta}) &=& \\theta_0 + \\sum_{l=0}^{L-1} \\sum_c \\theta_{l:c}x_{l:c} + \\sum_{l=0}^{L-2} \\sum_{c,c^\\prime} \\theta_{l:c,l+1:c'} x_{l:c}x_{l+1:c^\\prime}, \\nonumber\n", "\\end{eqnarray}$$\n", "\n", "To use a neighbor G-P map, set ``gpmap_type='neighbor'`` in the ``mavenn.Model`` constructor. \n", "\n", "### Pairwise G-P maps\n", "\n", "Pairwise G-P maps assume that every pair of positions in $\\vec{x}$ contributes to the latent phenotype $\\phi$, and is comptued using,\n", "\n", "$$\\begin{eqnarray}\n", "\\phi_{\\text{pairwise}} (\\vec{x};\\vec{\\theta}) &=& \\theta_0 + \\sum_{l=0}^{L-1} \\sum_c \\theta_{l:c}x_{l:c} + \\sum_{l=0}^{L-2} \\sum_{l^\\prime=l+1}^{L-1} \\sum_{c,c^\\prime} \\theta_{l:c,l^\\prime:c^\\prime} x_{l:c} x_{l^\\prime:c^\\prime}. \\nonumber\n", "\\end{eqnarray}$$\n", "\n", "To use a pairwise G-P map, set ``gpmap_type='pairwise'`` in the ``mavenn.Model`` constructor. \n", "\n", "### Black box G-P maps\n", "\n", "Black box G-P maps are represented as dense feed-forward neural networks, i.e. as multi-layer perceptrons (MLPs). To use a black box G-P map, set ``gpmap_type='blackbox'`` in the ``mavenn.Model`` constructor. Note that one can set the number of nodes used in each layer of the ML. For instance, to create an MLP with three hidden layers of size 100, 30, and 10, set ``gpmap_kwargs={'hidden_layer_sizes':[100,30,10]}`` in the ``mavenn.Model`` constructor." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MPA measurement processes\n", "\n", "In MPA regression, MAVE-NN directly models the measurement process $p(y|\\phi)$. At present, MAVE-NN only supports MPA regression for discrete values of $y$, which are indexed using nonnegative integers. MAVE-NN takes two forms of input for MPA regression. One is a set of (non-unique) sequence-measurement pairs $\\set{(\\vec{x}_n, y_n)}_{n=0}^{N-1}$, where $N$ is the total number of independent measurements and $y_n \\in \\{0,1,\\ldots,Y-1\\}$, where $Y$ is the total number of bins. The other is a set of (unique) sequence-count-vector pairs $\\set{(\\vec{x}_m, \\vec{c}_m)}_{m=0}^{M-1}$, where $M$ is the total number of unique sequences in the data set, and $\\vec{c}_m = (c_{m0}, c_{m1}, \\ldots, c_{m(Y-1)})$ lists, for each value of $y$, the number of times $c_{my}$ that the sequence $\\vec{x}_m$ was observed in bin $y$. MPA measurement processes are computed internally using the latter representation via\n", "\\begin{eqnarray}\n", "p(y|\\phi) &=& \\frac{w_y(\\phi)}{\\sum_{y'} w_{y'}(\\phi)} \\\\\n", "w_y(\\phi) &=& \\exp \\left[ a_y + \\sum_{k=0}^{K-1} b_{yk} \\tanh(c_{yk} \\phi + d_{yk}) \\right]~~~~~~\n", "\\end{eqnarray}\n", "where $K$ is the number of hidden nodes per value of $y$. The trainable parameters of this measurement process are thus $\\vec{\\eta} = \\set{a_y, b_{yk}, c_{yk}, d_{yk}}$. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GE nonlinearities\n", "\n", "GE models assume that each measurement $y$ of a sequence $\\vec{x}$ is a nonlinear function $g(\\phi)$ plus some noise. In MAVE-NN, this nonlinearity is represented as a sum of $\\tanh$ sigmoids, i.e.,\n", "\n", "$$\\begin{equation}\n", "g(\\phi;\\vec{\\alpha}) = a + \\sum_{k=0}^{K-1} b_k \\tanh(c_k \\phi + d_k) \n", "\\end{equation}$$\n", "\n", "where $K$ specifies the number of \"hidden nodes\" contributing to the sum, and $\\vec{\\alpha} = \\set{a, b_k, c_k, d_k}$ are trainable parameters. By default, MAVE-NN constrains $g(\\phi;\\vec{\\alpha})$ to be monotonic in $\\phi$ by requiring all $b_k \\ge 0$ and $c_k \\ge 0$, but this constraint can be relaxed.\n", "\n", "In the ``mavenn.Model`` constructor, ``ge_nonlinearity_hidden_nodes`` controls the value for $K$, while ``ge_nonlinearity_monotonic`` controls the monotonicity constraint on $g(\\phi;\\vec{\\alpha})$.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GE noise models\n", "\n", "MAVE-NN supports three types of GE noise models: Gaussian, Cauchy, and skewed-t. These are specified using the ``ge_noise_model`` keyword argument in the ``mavenn.Model`` constructor. \n", "\n", "### Gaussian nose models\n", "\n", "The Gaussian noise model, corresponding to ``ge_noise_model='Gaussian'``, is given by\n", "\n", "$$\\begin{eqnarray}\n", " p_\\mathrm{gauss}(y|\\hat{y};s) = \\frac{1}{\\sqrt{2 \\pi s^2}} \\exp\\left[-\\frac{(y - \\hat{y})^2}{2 s^2}\\right],\n", "\\end{eqnarray}$$\n", "\n", "where $s$ denotes the standard deviation. Importantly, MAVE-NN allows this noise model to be heteroskedastic by representing $s$ as an exponentiated polynomial in $\\hat{y}$, i.e.,\n", "\n", "$$\\begin{equation}\n", "s(\\hat{y}) = \\exp \\left[ \\sum_{k=0}^K a_k \\hat{y}^k \\right],\n", "\\end{equation}$$\n", "\n", "where $K$ is the order of the polynomial and $\\set{a_k}$ are trainable parameters. The user has the option to set $K$, and using $K=0$ renders this noise model homoskedastic. Quantiles are computed using $y_q = \\hat{y} + s\\, \\sqrt{2}\\, \\mathrm{erf}^{-1}(2 q -1) $ for user-specified values of $q \\in [0,1]$.\n", "\n", "### Cauchy noise models\n", "\n", "The Cauchy noise model, corresponding to ``ge_noise_model='Cauchy'``, is given by\n", "\n", "$$\\begin{eqnarray}\n", " p_\\mathrm{cauchy}(y|\\hat{y};s) = \\left[ \\pi s \\left( 1 + \\frac{(y-\\hat{y})^2}{s^2} \\right) \\right]^{-1}\n", "\\end{eqnarray}$$\n", "\n", "where the scale parameter $s$ is an exponentiated $K$'th order polynomial in $\\hat{y}$. Quantiles are computed using $y_q = \\hat{y} + s\\, \\mathrm{tan}[\\pi(q-\\frac{1}{2})]$.\n", "\n", "### Skewed-t noise models\n", "\n", "The skewed-t noise model, corresponding to ``ge_noise_model='SkewedT'``, is of the form described by Jones & Faddy (2003), and is given by\n", "\n", "$$\\begin{equation}\n", " p_\\mathrm{skewt}(y|\\hat{y};s,a,b) = s^{-1} f(t; a,b), \n", "\\end{equation}$$\n", "\n", "where\n", "\n", "$$\\begin{equation}\n", " t = t^* + \\frac{y-\\hat{y}}{s}. ~~~t^* = \\frac{(a-b)\\sqrt{a+b}}{\\sqrt{2a+1}\\sqrt{2b+1}},\n", "\\end{equation}$$\n", "\n", "and\n", "\n", "$$\\begin{eqnarray}\n", " f(t;a,b) &=& \\frac{2^{1-a-b}}{\\sqrt{a+b}}\\frac{\\Gamma(a+b)}{\\Gamma(a)\\Gamma(b)} \\left[1 + \\frac{t}{\\sqrt{a+b+t^2}} \\right]^{a+\\frac{1}{2}} \\left[ 1 - \\frac{t}{\\sqrt{a+b+t^2}} \\right]^{b+\\frac{1}{2}}. \n", " \\end{eqnarray}$$\n", " \n", "Note that the $t$ statistic here is an affine function of $y$ chosen so that the distribution's mode (corresponding to $t^*$) is positioned at $\\hat{y}$. The three parameters of this model, $\\set{s, a, b}$, are each represented using $K$-th order exponentiated polynomial with trainable coefficients. Quantiles are computed using\n", "\n", "$$\\begin{equation}\n", "y_q = \\hat{y} + (t_q-t^*)s,\n", "\\end{equation}$$\n", "\n", "where\n", "\n", "$$\\begin{equation} \n", "t_q = \\frac{(2x_q-1)\\sqrt{a+b}}{\\sqrt{1 - (2x_q-1)^2}},~~~x_q = I^{-1}_{q}(a,b),\n", " \\end{equation}$$\n", " \n", "and $I^{-1}$ denotes the inverse of the regularized incomplete Beta function $I_x(a,b)$.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gauge modes and diffemorphic modes\n", "\n", "These G-P maps also have non-identifiable degrees of freedom that must be \"fixed\", i.e. pinned down, before the values of individual parameters can be meaningfully interpreted. These degrees of freedom come in two flavors: gauge modes and diffeomorphic modes. Gauge modes are changes to $\\vec{\\theta}$ that do not alter the values of the latent phenotype $\\phi$. Diffeomorphic modes (Kinney & Atwal, 2014; Atwal & Kinney, 2016) are changes to $\\vec{\\theta}$ that do alter $\\phi$, but do so in ways that can be undone by transformations of the measurement process $p(y|\\phi)$ along corresponding \"dual modes\". As shown in (Kinney & Atwal, 2014), the diffeomorphic modes of linear G-P maps like those considered here will in general correspond to affine transformations of $\\phi$ (though there are exceptions at special values of $\\vec{\\theta}$). MAVE-NN fixes both gauge modes and diffeomorphic modes before providing parameter values or other access to model internals to the user, e.g. when ``model.get_theta()`` is called. \n", "\n", "### Fixing diffeomorphic modes\n", "\n", "Diffeomorphic modes of G-P maps are fit by transforming G-P map parameters $\\vec{\\theta}$ via \n", "\n", "$$\\begin{equation}\n", " \\theta_0 \\to \\theta_0 - a\n", "\\end{equation}$$\n", "\n", "and\n", "\n", "$$\\begin{eqnarray}\n", "\\vec{\\theta} \\to \\vec{\\theta}/b \n", "\\end{eqnarray}$$\n", "\n", "where $a= \\mean(\\set{\\phi_n})$ and $b = \\std(\\set{\\phi_n})$ are the mean and standard deviation of $\\phi$ values computed on the training data. This produces a corresponding change in latent phenotype values $\\phi \\to (\\phi - a)/b$. To avoid altering likelihood contributions $p(y|x)$, MAVE-NN further transforms the measurement process $p(y|\\phi)$ along dual modes in order to compensate. In GE regression this is done by adjusting the GE nonlinearity while keeping the noise model $p(y|\\hat{y})$ fixed,\n", "\n", "$$\\begin{eqnarray}\n", "g(\\phi) \\to g(a + b \\phi),\n", "\\end{eqnarray}$$\n", "\n", "while in MPA regression MAVE-NN adjusts the full measurement process, \n", "\n", "$$\\begin{eqnarray}\n", "p(y | \\phi) \\to p(y | a + b \\phi). \n", "\\end{eqnarray}$$\n", "\n", "### Fixing the gauge\n", "\n", "Gauge modes are fit using a \"hierarchical gauge\", in which lower-order terms in $\\phi(\\vec{x}; \\vec{\\theta})$ account for the highest possible fraction of variance in $\\phi$. Such gauges further require a probability distribution on sequence space, which is used to compute these variances. MAVE-NN assumes that such distributions factorize by position, and can thus be represented by a probability matrix with elements $p_{l:c}$, representing the probability of character $c$ at position $l$. MAVE-NN then transforms G-P map parameters via,\n", "\n", "$$\\begin{eqnarray}\n", " \\theta_0 &\\to & \\theta_0 \n", " + \\sum_{l} \\sum_{c'} \\theta_{l:c'}\\,p_{l:c'} \n", " + \\sum_{l} \\sum_{l'>l} \\sum_{c,c'}\\theta_{l:c,l':c'}\\,p_{l:c}\\,p_{l':c'}, \\nonumber \n", "\\end{eqnarray}$$\n", "\n", "$$\\begin{eqnarray}\n", " \\theta_{l:c}&\\to &{\\theta}_{l:c}\n", " -\\sum_{c'}\\theta_{l:c'}\\,p_{l:c} \\nonumber \\\\\n", " & & +\\sum_{l'>l} \\sum_{c'} \\theta_{l:c,l':c'}\\,p_{l':c'} \n", " +\\sum_{l'l} \\sum_{c',c''} \\theta_{l:c',l':c''}\\,p_{l:c'}\\,p_{l':c''},\n", " -\\sum_{l'