{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\\newcommand{\\mean}{\\mathrm{mean}}\n",
    "\\newcommand{\\std}{\\mathrm{std}}\n",
    "\\newcommand{\\wt}{\\mathrm{wt}}\n",
    "\\newcommand{\\like}{\\mathrm{like}}\n",
    "\\newcommand{\\pred}{\\mathrm{pred}}\n",
    "\\newcommand{\\test}{\\mathrm{test}}\n",
    "\\newcommand{\\intr}{\\mathrm{intr}}\n",
    "\\newcommand{\\train}{\\mathrm{train}}\n",
    "\\newcommand{\\true}{\\mathrm{true}}\n",
    "\\newcommand{\\set}[1]{\\left\\{ #1 \\right\\}}\n",
    "\\newcommand{\\ev}[1]{\\left. #1 \\right|} \n",
    "\\newcommand{\\bra}[1]{\\left\\langle #1 \\right|}\n",
    "\\newcommand{\\ket}[1]{\\left| #1 \\right\\rangle}\n",
    "\\newcommand{\\braket}[1]{\\left\\langle #1 \\right\\rangle}\n",
    "\\newcommand{\\bbraket}[1]{\\left\\llangle #1 \\right\\rrangle}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Underlying Mathematics\n",
    "$$\\newcommand{\\eq}[1]{Eq.\\ (\\ref{#1})}\n",
    "\\newcommand{\\fig}[2]{Fig.\\ \\ref{#1}#2}\n",
    "\\newcommand{\\diag}{{\\rm diag}}\n",
    "\\newcommand{\\cov}{{\\rm cov}}\n",
    "\\newcommand{\\data}{{\\rm data}}\n",
    "\\newcommand{\\real}{{\\rm real}}\n",
    "\\newcommand{\\die}{\\mathrm{die}}\n",
    "\\newcommand{\\live}{\\mathrm{live}}\n",
    "\\newcommand{\\Lap}{\\mathrm{Lap}}\n",
    "\\newcommand{\\sym}{\\{ \\mathrm{sym~factor} \\}}\n",
    "\\newcommand{\\const}{{\\rm const}}\n",
    "\\newcommand{\\true}{\\mathrm{true}}\n",
    "\\newcommand{\\set}[1]{\\left\\{ #1 \\right\\}}\n",
    "\\newcommand{\\ev}[1]{\\left. #1 \\right|} \n",
    "\\newcommand{\\bra}[1]{\\left\\langle #1 \\right|}\n",
    "\\newcommand{\\ket}[1]{\\left| #1 \\right\\rangle}\n",
    "\\newcommand{\\braket}[1]{\\left\\langle #1 \\right\\rangle}\n",
    "\\newcommand{\\bbraket}[1]{\\left\\llangle #1 \\right\\rrangle}\n",
    "\\newcommand{\\mat}[2]{\\left( \\begin{array}{#1} #2 \\end{array} \\right)}\n",
    "\\newcommand{\\bracemat}[2]{\\left\\{ \\begin{array}{#1} #2 \\end{array} \\right\\}}\n",
    "\\newcommand{\\comment}[1]{{\\color{red}[#1]}}\n",
    "\\newcommand{\\code}[1]{\\texttt{#1}}\n",
    "\\newcommand{\\mean}{\\mathrm{mean}}\n",
    "\\newcommand{\\std}{\\mathrm{std}}\n",
    "\\newcommand{\\wt}{\\mathrm{wt}}\n",
    "\\newcommand{\\var}{\\mathrm{var}}\n",
    "\\newcommand{\\pred}{\\mathrm{pred}}\n",
    "\\newcommand{\\test}{\\mathrm{test}}\n",
    "\\newcommand{\\intr}{\\mathrm{intr}}\n",
    "\\newcommand{\\train}{\\mathrm{train}}\n",
    "\\newcommand{\\model}{\\mathrm{model}}\n",
    "\\newcommand{\\true}{\\mathrm{true}}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Notation\n",
    "\n",
    "Each MAVE dataset is represented a set of $N$ observations, $\\set{(\\vec{x}_n,y_n)}_{n=1}^N$, where each observation consists of a sequence $\\vec{x}_n$ and a measurement $y_n$. Here, $y_n$ can be either a continuous real-valued number, or a categorical variable representing the bin in which the $n$th sequence was found. Note that, in this representation the same sequence $\\vec{x}$ can be observed multiple times, potentially with different values for $y$ due to experimental noise. Datasets with real-valued measurements $y$ are analyzed using GE regression, while datasets with categorical $y$ values are analyzed using MPA regression. Sets of measurements are denoted $\\set{y_n}$, etc., and the mean and standard deviations of sets of numbers are denoted by the functions $\\mean(\\set{\\cdot})$ and $\\std(\\set{\\cdot})$\n",
    "\n",
    "In the ``mavenn.Model`` constructor, ``L`` controls sequence length, ``alphabet`` controls the set of characters, ``Y`` controls the number of $y$-bins (MPA regression only), and ``regression_type`` controls whether GE or MPA regression is used."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Genotype-phenotype (G-P) maps\n",
    "\n",
    "We assume that all sequences have the same length $L$, and that at each of the $L$ positions in each sequence there is one of $C$ possible characters ($C = 4$ for DNA and RNA; $C = 20$ for protein). In general, MAVE-NN represents sequences using a vector of binary features $\\vec{x}$. We adopt the following notation for the individual features in $\\vec{x}$:\n",
    "\n",
    "$$\\begin{eqnarray}\n",
    "x_{l:c} &=& \\left\\{ \n",
    "\\begin{array}{cl} \n",
    "    1 & \\mathrm{if~character~}c\\mathrm{~occurs~at~position~}l~~~ \\\\ \n",
    "    0 & \\mathrm{otherwise}, \n",
    "\\end{array} \\right.   \n",
    "\\end{eqnarray}$$\n",
    "\n",
    "where $l = 1, 2, \\ldots, L$ indexes positions within the sequence, and $c$ indexes the $C$ distinct characters. \n",
    "\n",
    "We assume that the latent phenotype is given by a function $\\phi (\\vec{x}; \\vec{\\theta} )$ that depends on a vector of G-P map parameters $\\vec{\\theta}$. MAVE-NN currently supports four types of G-P maps, all of which can be inferred using either GE regression or MPA regression. \n",
    "\n",
    "In the ``mavenn.Model`` constructor, ``gpmap_type`` controls which type of G-P map is used.\n",
    "\n",
    "### Additive G-P maps\n",
    "Additive G-P maps assume that each position in $\\vec{x}$ contributes independently to the latent phenotype $\\phi$, and is computed using,\n",
    "\n",
    "$$\\begin{eqnarray}\n",
    "\\phi(\\vec{x};\\vec{\\theta})_\\text{additive} &=& \\theta_0 + \\sum_{l=0}^{L-1} \\sum_c \\theta_{l:c}x_{l:c}.\n",
    "\\end{eqnarray}$$\n",
    "\n",
    "To use an additive G-P map, set ``gpmap_type='additive'`` in the ``mavenn.Model`` constructor. \n",
    "\n",
    "### Neighbor G-P maps\n",
    "\n",
    "Neighbor G-P maps assume that each neighboring pair of positions in $\\vec{x}$ contributes to the latent phenotype $\\phi$, and is comptued using,\n",
    "\n",
    "$$\\begin{eqnarray}\n",
    "\\phi_{\\text{neighbor}}(\\vec{x};\\vec{\\theta}) &=& \\theta_0 + \\sum_{l=0}^{L-1} \\sum_c \\theta_{l:c}x_{l:c} + \\sum_{l=0}^{L-2} \\sum_{c,c^\\prime} \\theta_{l:c,l+1:c'} x_{l:c}x_{l+1:c^\\prime}, \\nonumber\n",
    "\\end{eqnarray}$$\n",
    "\n",
    "To use a neighbor G-P map, set ``gpmap_type='neighbor'`` in the ``mavenn.Model`` constructor. \n",
    "\n",
    "### Pairwise G-P maps\n",
    "\n",
    "Pairwise G-P maps assume that every pair of positions in $\\vec{x}$ contributes to the latent phenotype $\\phi$, and is comptued using,\n",
    "\n",
    "$$\\begin{eqnarray}\n",
    "\\phi_{\\text{pairwise}} (\\vec{x};\\vec{\\theta})  &=&   \\theta_0 + \\sum_{l=0}^{L-1} \\sum_c \\theta_{l:c}x_{l:c} + \\sum_{l=0}^{L-2} \\sum_{l^\\prime=l+1}^{L-1} \\sum_{c,c^\\prime}  \\theta_{l:c,l^\\prime:c^\\prime} x_{l:c} x_{l^\\prime:c^\\prime}. \\nonumber\n",
    "\\end{eqnarray}$$\n",
    "\n",
    "To use a pairwise G-P map, set ``gpmap_type='pairwise'`` in the ``mavenn.Model`` constructor. \n",
    "\n",
    "### Black box G-P maps\n",
    "\n",
    "Black box G-P maps are represented as dense feed-forward neural networks, i.e. as multi-layer perceptrons (MLPs). To use a black box G-P map, set ``gpmap_type='blackbox'`` in the ``mavenn.Model`` constructor. Note that one can set the number of nodes used in each layer of the ML. For instance, to create an MLP with three hidden layers of size 100, 30, and 10, set ``gpmap_kwargs={'hidden_layer_sizes':[100,30,10]}`` in the ``mavenn.Model`` constructor."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## MPA measurement processes\n",
    "\n",
    "In MPA regression, MAVE-NN directly models the measurement process $p(y|\\phi)$. At present, MAVE-NN only supports MPA regression for discrete values of $y$, which are indexed using nonnegative integers. MAVE-NN takes two forms of input for MPA regression. One is a set of (non-unique) sequence-measurement pairs $\\set{(\\vec{x}_n, y_n)}_{n=0}^{N-1}$, where $N$ is the total number of independent measurements and $y_n \\in  \\{0,1,\\ldots,Y-1\\}$, where $Y$ is the total number of bins. The other is a set of (unique) sequence-count-vector pairs $\\set{(\\vec{x}_m, \\vec{c}_m)}_{m=0}^{M-1}$, where $M$ is the total number of unique sequences in the data set, and $\\vec{c}_m = (c_{m0}, c_{m1}, \\ldots, c_{m(Y-1)})$ lists, for each value of $y$, the number of times $c_{my}$ that the sequence $\\vec{x}_m$ was observed in bin $y$. MPA measurement processes are computed internally using the latter representation via\n",
    "\\begin{eqnarray}\n",
    "p(y|\\phi) &=& \\frac{w_y(\\phi)}{\\sum_{y'} w_{y'}(\\phi)} \\\\\n",
    "w_y(\\phi) &=& \\exp \\left[ a_y + \\sum_{k=0}^{K-1} b_{yk} \\tanh(c_{yk} \\phi + d_{yk}) \\right]~~~~~~\n",
    "\\end{eqnarray}\n",
    "where $K$ is the number of hidden nodes per value of $y$. The trainable parameters of this measurement process are thus $\\vec{\\eta} = \\set{a_y, b_{yk}, c_{yk}, d_{yk}}$. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GE nonlinearities\n",
    "\n",
    "GE models assume that each measurement $y$ of a sequence $\\vec{x}$ is a nonlinear function $g(\\phi)$ plus some noise. In MAVE-NN, this nonlinearity is represented as a sum of $\\tanh$ sigmoids, i.e.,\n",
    "\n",
    "$$\\begin{equation}\n",
    "g(\\phi;\\vec{\\alpha}) = a + \\sum_{k=0}^{K-1} b_k \\tanh(c_k \\phi + d_k) \n",
    "\\end{equation}$$\n",
    "\n",
    "where $K$ specifies the number of \"hidden nodes\" contributing to the sum, and $\\vec{\\alpha} = \\set{a, b_k, c_k, d_k}$ are trainable parameters. By default, MAVE-NN constrains $g(\\phi;\\vec{\\alpha})$ to be monotonic in $\\phi$ by requiring all $b_k \\ge 0$ and $c_k \\ge 0$, but this constraint can be relaxed.\n",
    "\n",
    "In the ``mavenn.Model`` constructor, ``ge_nonlinearity_hidden_nodes`` controls the value for $K$, while  ``ge_nonlinearity_monotonic`` controls the monotonicity constraint on $g(\\phi;\\vec{\\alpha})$.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GE noise models\n",
    "\n",
    "MAVE-NN supports three types of GE noise models: Gaussian, Cauchy, and skewed-t. These are specified using the ``ge_noise_model`` keyword argument in the ``mavenn.Model`` constructor. \n",
    "\n",
    "### Gaussian nose models\n",
    "\n",
    "The Gaussian noise model, corresponding to ``ge_noise_model='Gaussian'``, is given by\n",
    "\n",
    "$$\\begin{eqnarray}\n",
    "    p_\\mathrm{gauss}(y|\\hat{y};s) = \\frac{1}{\\sqrt{2 \\pi s^2}} \\exp\\left[-\\frac{(y - \\hat{y})^2}{2 s^2}\\right],\n",
    "\\end{eqnarray}$$\n",
    "\n",
    "where $s$ denotes the standard deviation. Importantly, MAVE-NN allows this noise model to be heteroskedastic by representing $s$ as an exponentiated polynomial in $\\hat{y}$, i.e.,\n",
    "\n",
    "$$\\begin{equation}\n",
    "s(\\hat{y}) = \\exp \\left[ \\sum_{k=0}^K a_k \\hat{y}^k \\right],\n",
    "\\end{equation}$$\n",
    "\n",
    "where $K$ is the order of the polynomial and $\\set{a_k}$ are trainable parameters. The user has the option to set $K$, and using $K=0$ renders this noise model homoskedastic. Quantiles are computed using $y_q = \\hat{y} + s\\, \\sqrt{2}\\, \\mathrm{erf}^{-1}(2 q -1) $ for user-specified values of $q \\in [0,1]$.\n",
    "\n",
    "### Cauchy noise models\n",
    "\n",
    "The Cauchy noise model, corresponding to ``ge_noise_model='Cauchy'``, is given by\n",
    "\n",
    "$$\\begin{eqnarray}\n",
    "    p_\\mathrm{cauchy}(y|\\hat{y};s) = \\left[ \\pi s \\left( 1 + \\frac{(y-\\hat{y})^2}{s^2} \\right) \\right]^{-1}\n",
    "\\end{eqnarray}$$\n",
    "\n",
    "where the scale parameter $s$ is an exponentiated $K$'th order polynomial in $\\hat{y}$. Quantiles are computed using $y_q = \\hat{y} + s\\, \\mathrm{tan}[\\pi(q-\\frac{1}{2})]$.\n",
    "\n",
    "### Skewed-t noise models\n",
    "\n",
    "The skewed-t noise model, corresponding to ``ge_noise_model='SkewedT'``, is of the form described by Jones & Faddy (2003), and is given by\n",
    "\n",
    "$$\\begin{equation}\n",
    "     p_\\mathrm{skewt}(y|\\hat{y};s,a,b) = s^{-1} f(t; a,b), \n",
    "\\end{equation}$$\n",
    "\n",
    "where\n",
    "\n",
    "$$\\begin{equation}\n",
    "     t = t^* + \\frac{y-\\hat{y}}{s}. ~~~t^* = \\frac{(a-b)\\sqrt{a+b}}{\\sqrt{2a+1}\\sqrt{2b+1}},\n",
    "\\end{equation}$$\n",
    "\n",
    "and\n",
    "\n",
    "$$\\begin{eqnarray}\n",
    " f(t;a,b) &=&  \\frac{2^{1-a-b}}{\\sqrt{a+b}}\\frac{\\Gamma(a+b)}{\\Gamma(a)\\Gamma(b)}  \\left[1 + \\frac{t}{\\sqrt{a+b+t^2}} \\right]^{a+\\frac{1}{2}} \\left[ 1 - \\frac{t}{\\sqrt{a+b+t^2}} \\right]^{b+\\frac{1}{2}}. \n",
    " \\end{eqnarray}$$\n",
    " \n",
    "Note that the $t$ statistic here is an affine function of $y$ chosen so that the distribution's mode (corresponding to $t^*$) is positioned at $\\hat{y}$. The three parameters of this model, $\\set{s, a, b}$, are each represented using $K$-th order exponentiated polynomial with trainable coefficients. Quantiles are computed using\n",
    "\n",
    "$$\\begin{equation}\n",
    "y_q = \\hat{y}  + (t_q-t^*)s,\n",
    "\\end{equation}$$\n",
    "\n",
    "where\n",
    "\n",
    "$$\\begin{equation} \n",
    "t_q = \\frac{(2x_q-1)\\sqrt{a+b}}{\\sqrt{1 - (2x_q-1)^2}},~~~x_q = I^{-1}_{q}(a,b),\n",
    " \\end{equation}$$\n",
    " \n",
    "and $I^{-1}$ denotes the inverse of the regularized incomplete Beta function $I_x(a,b)$.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Gauge modes and diffemorphic modes\n",
    "\n",
    "These G-P maps also have non-identifiable degrees of freedom that must be \"fixed\", i.e. pinned down, before the values of individual parameters can be meaningfully interpreted. These degrees of freedom come in two flavors: gauge modes and diffeomorphic modes. Gauge modes are changes to $\\vec{\\theta}$ that do not alter the values of the latent phenotype $\\phi$. Diffeomorphic modes (Kinney & Atwal, 2014; Atwal & Kinney, 2016) are changes to $\\vec{\\theta}$ that do alter $\\phi$, but do so in ways that can be undone by transformations of the measurement process $p(y|\\phi)$ along corresponding \"dual modes\". As shown in (Kinney & Atwal, 2014), the diffeomorphic modes of linear G-P maps like those considered here will in general correspond to affine transformations of $\\phi$ (though there are exceptions at special values of $\\vec{\\theta}$). MAVE-NN fixes both gauge modes and diffeomorphic modes before providing parameter values or other access to model internals to the user, e.g. when ``model.get_theta()`` is called. \n",
    "\n",
    "### Fixing diffeomorphic modes\n",
    "\n",
    "Diffeomorphic modes of G-P maps are fit by transforming G-P map parameters $\\vec{\\theta}$ via \n",
    "\n",
    "$$\\begin{equation}\n",
    "    \\theta_0 \\to \\theta_0 - a\n",
    "\\end{equation}$$\n",
    "\n",
    "and\n",
    "\n",
    "$$\\begin{eqnarray}\n",
    "\\vec{\\theta} \\to \\vec{\\theta}/b    \n",
    "\\end{eqnarray}$$\n",
    "\n",
    "where $a= \\mean(\\set{\\phi_n})$ and $b = \\std(\\set{\\phi_n})$ are the mean and standard deviation of $\\phi$ values computed on the training data. This produces a corresponding change in latent phenotype values $\\phi \\to (\\phi - a)/b$. To avoid altering likelihood contributions $p(y|x)$, MAVE-NN further transforms the measurement process $p(y|\\phi)$ along dual modes in order to compensate. In GE regression this is done by adjusting the GE nonlinearity while keeping the noise model $p(y|\\hat{y})$ fixed,\n",
    "\n",
    "$$\\begin{eqnarray}\n",
    "g(\\phi) \\to g(a + b \\phi),\n",
    "\\end{eqnarray}$$\n",
    "\n",
    "while in MPA regression MAVE-NN adjusts the full measurement process, \n",
    "\n",
    "$$\\begin{eqnarray}\n",
    "p(y | \\phi) \\to p(y | a + b \\phi).    \n",
    "\\end{eqnarray}$$\n",
    "\n",
    "### Fixing the gauge\n",
    "\n",
    "Gauge modes are fit using a \"hierarchical gauge\", in which lower-order terms in $\\phi(\\vec{x}; \\vec{\\theta})$ account for the highest possible fraction of variance in $\\phi$. Such gauges further require a probability distribution on sequence space, which is used to compute these variances. MAVE-NN assumes that such distributions factorize by position, and can thus be represented by a probability matrix with elements $p_{l:c}$, representing the probability of character $c$ at position $l$. MAVE-NN then transforms G-P map parameters via,\n",
    "\n",
    "$$\\begin{eqnarray}\n",
    "    \\theta_0 &\\to & \\theta_0 \n",
    "             + \\sum_{l} \\sum_{c'} \\theta_{l:c'}\\,p_{l:c'} \n",
    "             + \\sum_{l} \\sum_{l'>l} \\sum_{c,c'}\\theta_{l:c,l':c'}\\,p_{l:c}\\,p_{l':c'}, \\nonumber \n",
    "\\end{eqnarray}$$\n",
    "\n",
    "$$\\begin{eqnarray}\n",
    "    \\theta_{l:c}&\\to &{\\theta}_{l:c}\n",
    "                -\\sum_{c'}\\theta_{l:c'}\\,p_{l:c} \\nonumber \\\\\n",
    "                & &  +\\sum_{l'>l} \\sum_{c'} \\theta_{l:c,l':c'}\\,p_{l':c'} \n",
    "                +\\sum_{l'<l} \\sum_{c'} \\theta_{l':c',l:c}\\,p_{l':c'} \\nonumber \\\\\n",
    "                & &  -\\sum_{l'>l} \\sum_{c',c''}  \\theta_{l:c',l':c''}\\,p_{l:c'}\\,p_{l':c''},\n",
    "                -\\sum_{l'<l} \\sum_{c',c''}  \\theta_{l':c'',l:c'}\\,p_{l:c'}\\,p_{l':c''},\n",
    "\\end{eqnarray}$$\n",
    "\n",
    "and\n",
    "\n",
    "$$\\begin{eqnarray}\n",
    "    \\theta_{l:c,l':c'} & \\to & \\theta_{l:c,l':c'}\n",
    "    -\\sum_{c''} \\theta_{l:c'',l':c'}\\,p_{l:c''} \n",
    "    -\\sum_{c''} \\theta_{l:c,l':c''}\\,p_{l':c''} \n",
    "    +\\sum_{c'',c'''}\\theta_{l:c'',l':c'''}\\,p_{l:c''}\\,p_{l':c'''}.\\nonumber\n",
    "\\end{eqnarray}$$\n",
    "\n",
    "The gauge-fixing transformations for the both the additive and neighbor models follow the same rules, but with $\\theta_{l:c,l':c'} = 0$ for all $l,l'$ (additive model) or whenever $l' \\neq l+1$ (neighbor model). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loss function\n",
    "\n",
    "Let $\\vec{\\theta}$ denote the parameters of G-P map, and $\\vec{\\eta}$ denote the parameters of the measurement process. MAVE-NN optimizations these parameters using stochastic gradient descent (SGD) on a loss function given by \n",
    "\n",
    "$$\\begin{equation}\n",
    "    \\mathcal{L} = \\mathcal{L}_\\var + \\mathcal{L}_\\mathrm{reg}\n",
    "\\end{equation}$$\n",
    "\n",
    "where $\\mathcal{L}_\\like$ is the negative log likelihood of the model, given by\n",
    "\n",
    "$$\\begin{eqnarray}\n",
    "    \\mathcal{L}_\\like[\\vec{\\theta},\\vec{\\eta}] =\n",
    "        -\\sum_{n=0}^{N-1} \\log \\left[ p(y_n|\\phi_n;\\vec{\\eta})\\right],~\\phi_n = \\phi(\\vec{x}_n;\\vec{\\theta}),\n",
    "\\end{eqnarray}$$\n",
    "\n",
    "and $\\mathcal{L}_\\mathrm{reg}$ provides for regularization of the model parameters. \n",
    "\n",
    "In the context of GE regression, we can write $\\vec{\\eta} = (\\vec{\\alpha},\\vec{\\beta})$ where $\\vec{\\alpha}$ represents the parameters of the GE nonlinearity $g(\\phi)$, and $\\vec{\\beta}$ denotes the parameters of the noise model $p(y|\\hat{y})$. The likelihood contributions from each observation $n$ then becomes $p(y_n|\\phi_n;\\vec{\\eta}) = p(y_n|\\hat{y}_n;\\vec{\\beta})$ where $\\hat{y}_n = g(\\phi_n;\\vec{\\alpha})$. In the context of MPA regression with a dataset of the form $\\set{(\\vec{x}_m, \\vec{c}_m)}_{m=1}^M$, the loss function simplifies to\n",
    "\n",
    "$$\\begin{eqnarray}\n",
    "    \\mathcal{L}_\\like[\\vec{\\theta},\\vec{\\eta}] &=& \\sum_{m=0}^{M-1} \\sum_{y=0}^{Y-1} c_{my} \\log [p(y|\\phi_m;\\vec{\\eta})],\n",
    "\\end{eqnarray}$$\n",
    "\n",
    "where $\\phi_m = \\phi(x_m;\\vec{\\theta})$. For the regularization term, MAVE-NN uses an $L_2$ penalty of the form\n",
    "\n",
    "$$\\begin{equation}\n",
    "    \\mathcal{L}_\\mathrm{reg}[\\vec{\\theta},\\vec{\\eta}] = \\lambda_{\\theta}|\\vec{\\theta}|^2 + \\lambda_{\\eta}|\\vec{\\eta}|^2,\n",
    "\\end{equation}$$\n",
    "\n",
    "where $\\lambda_{\\theta}$ and $\\lambda_{\\eta}$ respectively control the strength of regularization for the G-P map and measurement process parameters. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Variational information\n",
    "\n",
    "It is sometimes useful to consider a quantity we call the \"variational information\", $I_\\var[y;\\phi]$, which is an affine transformation of $\\mathcal{L}_\\like$: \n",
    "\n",
    "$$\\begin{equation}\n",
    "    I_\\var[y;\\phi] = H[y] - \\frac{\\log_2(e)}{N}\\mathcal{L}_\\like.\n",
    "\\end{equation}$$\n",
    "\n",
    "Here $H[y]$ is the entropy of the data $\\set{y_n}$, which is estimated using the $k$'th nearest neighbor (kNN) estimator from the NPEET package (Ver Steeg, 2014). Noting that this quantity can also be written as $I_\\var[y;\\phi] = H[y] - \\mean(\\set{Q_n})$ where $Q_n = -\\log_2 p(y_n|\\phi_n)$, we estimate the associated uncertainty using \n",
    "\n",
    "$$\\begin{equation}\n",
    "    \\delta I_\\var[y;\\phi] = \\sqrt{\\delta H[y]^2 + \\var(\\set{Q_n})/N}.\n",
    "\\end{equation} $$\n",
    "\n",
    "Variational information quantifies the mutual information between $\\phi$ and $y$ assuming that the inferred measurement process $p(y|\\phi)$ is correct. We therefore propose $I_\\var \\pm \\delta I_\\var$ as a diagnostic for model fitting and performance, as it is directly comparable to the predictive information of the model, $I_\\pred[y;\\phi]$, and can be computed during model training without any costly entropy estimation steps. \n",
    "\n",
    "It is worth noting that $I_\\var[y;\\phi]$ provides an approximate lower bound to $I_\\pred[y;\\phi]$. In the $N_\\test \\to \\infty$ limit,\n",
    "\n",
    "$$\\begin{eqnarray}\n",
    "I_\\pred[y;\\phi] &=& H[y] - H[y|\\phi] \\\\\n",
    "&=& H[y] + \\braket{ \\log_2 p_\\true(y|\\phi) }_\\true  \\\\\n",
    "&=& H[y] + \\braket{ \\log_2 p_\\model(y|\\phi)}_\\true + \\braket{ \\log_2 \\frac{p_\\true(y|\\phi)}{p_\\model(y|\\phi)} }_\\true  \\\\\n",
    "& = & H[y] - \\frac{\\log_2(e)}{N_\\test} \\mathcal{L}_\\like + D_{KL}(p_\\true || p_\\model) \\\\\n",
    "&=& I_\\var[y;\\phi] + D_\\mathrm{KL}(p_\\true || p_\\model) \\\\\n",
    "&\\geq & I_\\var[y;\\phi],\n",
    "\\end{eqnarray}$$\n",
    "\n",
    "where $D_{KL}$ denotes the Kullback-Leibler divergence and $\\braket{\\cdot}_\\true$ indicates averaging over $p_\\true(y,\\phi)$. When $N_\\test$ is finite this lower bound is only approximate, of course, but the uncertainties in these information quantities are often quite small due to the large size of typical MAVE datasets. Moreover the difference between these quantities, \n",
    "\n",
    "$$\\begin{equation} I_\\pred[y;\\phi] - I_\\var[y;\\phi] \\approx D_\\mathrm{KL}(p_\\true || p_\\model) \\end{equation},$$ \n",
    "\n",
    "quantifies how much $p_\\model$ deviates from $p_\\true$, and can thus be used as a diagnostic for how accurate the assumed form of the measurement process is. \n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predictive information\n",
    "The predictive information $I_\\pred[y;\\phi]$, when computed on data not used for training, provides a measure of how strongly a G-P map predicts experimental measurements irrespective of the corresponding measurement process. To estimate this quantity we use k'th nearest neighbor (kNN) estimators of entropy and mutual information adapted from the NPEET Python package (VanSteeg, 2014). Here, the user has the option of adjusting $k$, which controls a variance/bias tradeoff. When $y$ is discrete (MPA regression), $I_\\pred[y;\\phi]$ is computed using the classic kNN entropy estimator (Vasicek, 1976; Kraskov et al., 2004) via the decomposition $I_\\pred[y;\\phi] = H[\\phi] - \\sum_y p(y) H_y[\\phi]$, where $H_y[\\phi]$ denotes the entropy of $p(\\phi|y)$ for fixed $y$. When $y$ is continuous (GE regression), $I_\\pred[y;\\phi]$ is estimated using the kNN-based KSG algorithm (Kraskov et al, 2004). This approach optionally supports a local nonuniformity correction of (Gao et al., 2014), which is important when $y$ and $\\phi$ exhibit strong dependencies, but which also requires substantially more time to compute. We emphasize that both estimates of $I_\\pred[y;\\phi]$ make no use of the measurement process $p(y|\\phi)$ inferred by MAVE-NN. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Uncertainties in kNN estimates\n",
    "\n",
    "MAVE-NN quantifies uncertainties in $H[y]$ and $I[y;\\phi]$ using multiple random samples of half the data. Let $\\mathcal{D}_{100\\%}$ denote a full dataset, and let $\\mathcal{D}_{50\\%,r}$ denote a 50\\% subsample (indexed by $r$) of this dataset. Given an estimator $E(\\cdot)$ of either entropy or mutual information, as well as the number of subsamples $R$ to use, the uncertainty in $E(\\mathcal{D}_{100\\%})$ is estimated as\n",
    "\n",
    "$$\\begin{equation}\n",
    "    \\delta E(\\mathcal{D}_{100\\%}) = \\frac{1}{\\sqrt{2}} \\std \\left[\\set{E(\\mathcal{D}_{50\\%,r})}_{r=0}^{R-1} \\right].\n",
    "\\end{equation}$$\n",
    "\n",
    "By default MAVE-NN uses $R=25$. We note that computing such uncertainty estimates substantially increase computation time, as $E(\\cdot)$ needs to be evaluated $R+1$ times instead of just once. We also note that boostrap resampling is not advisable in this context, as it systematically underestimates $H[y]$ and overestimates $I[y;z]$ (data not shown)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "1. Jones M, Faddy M (2003). A skew extension of the t-distribution, with applications. [J Roy Stat Soc B Met.  65(1):159-174.](https://dx.doi.org/10.1111/1467-9868.00378)\n",
    "\n",
    "1. Kinney J, Atwal G (2014). Parametric Inference in the Large Data Limit Using Maximally Informative Models. [Neural Comput 26(4):637-653.](https://dx.doi.org/10.1162/neco_a_00568)\n",
    "\n",
    "1. Atwal G, Kinney J (2016). Learning Quantitative Sequence–Function Relationships from Massively Parallel Experiments. [J Stat Phys. 162(5):1203-1243.](https://dx.doi.org/10.1007/s10955-015-1398-3)\n",
    "\n",
    "1. Ver Steeg G (2014). Non-Parametric Entropy Estimation Toolbox (NPEET). https://github.com/gregversteeg/NPEET\n",
    "\n",
    "1. Kraskov A, Stögbauer H, Grassberger P (2004). Estimating mutual information. [Phys Rev E. 69(6):066138.]( https://dx.doi.org/10.1103/physreve.69.066138)\n",
    "\n",
    "1. Gao S, Steeg G, Galstyan A (2015). Efficient Estimation of Mutual Information for Strongly Dependent Variables. [PMLR 38:277-286](http://proceedings.mlr.press/v38/gao15.html)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}