Numerical Solution of Ordinary Differential Equations (ODEs)
Problem Definition
Most ODEs don’t have explicit closed-form solutions, so we resort to numerical solution methods.
Runge-Kutta
\(y^\prime(t) = f\Big(t, y(t) \Big)\), for \(t \in [a,b]\), with initial condition \(y(a) = y_0\)
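As a concrete reference point, here is a minimal sketch of the classical fourth-order Runge-Kutta method for this initial value problem; the step size and test problem are illustrative.

```python
import math

def rk4_step(f, t, y, h):
    """One classical 4th-order Runge-Kutta step for y'(t) = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# Example: y' = -y on [0, 1] with y(0) = 1; the exact solution is e^{-t}.
f = lambda t, y: -y
t, y, h = 0.0, 1.0, 0.01
for _ in range(100):
    y = rk4_step(f, t, y, h)
    t += h
print(y, math.exp(-1.0))  # both ~0.3678794
```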
Key fact used repeatedly below (and made precise by the instantaneous change of variables theorem): the log determinant of the Jacobian of the flow equals the integral over time of \(\textrm{Tr}\big(\partial (d\mathbf{x}/dt) / \partial \mathbf{x}\big)\).
Neural ODE
Given an ODE defined by the parametric function \(f(\mathbf{z}(t), t; \theta)\), we solve the initial value problem
\[\begin{aligned} \mathbf{z}(t_0) &= \mathbf{z}_0 \\ \frac{d\mathbf{z}(t)}{dt} &= f(\mathbf{z}(t), t; \theta) \end{aligned}\]
Flow Function
Karras et al. (2022) define an ODE flow function for diffusion models:
\(f(\mathbf{x}(t), t) = \frac{\mathrm{d} \boldsymbol{x}(t) }{~\mathrm{d}t } = -\dot\sigma(t) ~\sigma(t) ~\nabla_{\hspace{-0.5mm} \boldsymbol{x} } \log p \big( \boldsymbol{x}; \sigma(t) \big)\) where the dot denotes a time derivative.
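A minimal sketch of what integrating this flow looks like in practice, assuming the common \(\sigma(t) = t\) schedule from the same paper (so \(\dot\sigma(t) = 1\)); `score_fn` is a hypothetical stand-in for a learned score model \(\nabla_{\mathbf{x}} \log p(\mathbf{x}; \sigma)\).

```python
import torch

def pf_ode_sample(score_fn, x_T, ts):
    """Euler integration of dx/dt = -sigma'(t) sigma(t) score(x; sigma(t)),
    with sigma(t) = t, along a decreasing time grid ts from T toward 0."""
    x = x_T                                  # start from noise at t = T
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        d = -t_cur * score_fn(x, t_cur)      # dx/dt under sigma(t) = t
        x = x + (t_next - t_cur) * d         # Euler step (t decreases)
    return x
```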
FFJORD
Free-form Jacobian of Reversible Dynamics (FFJORD).
Computing \(\textrm{Tr}\Big( \frac{\partial f}{\partial \mathbf{z}(t)} \Big)\) exactly costs \(\mathcal{O}(D^2)\), or \(D\) evaluations of \(f\).
Vector-Jacobian products \(\mathbf{v}^T \frac{\partial f}{\partial \mathbf{z}}\) can be computed by reverse-mode automatic differentiation at roughly the cost of a single evaluation of \(f\).
We can obtain an unbiased estimate of the trace of a matrix \(A\) by computing a quadratic form of that matrix with a noise vector:
\[\textrm{Tr}(A) = \mathbb{E}_{p(\mathbf{\epsilon})} [ \mathbf{\epsilon}^T A \mathbf{\epsilon} ]\]The above equation holds for any \(D\)-by-\(D\) matrix \(A\) and distribution \(p(\mathbf{\epsilon})\) over \(D\)-dimensional vectors such that \(\mathbb{E}[\mathbf{\epsilon}] = 0\) and \(\mbox{Cov}(\mathbf{\epsilon}) = I\). This Monte Carlo estimator is known as Hutchinson’s trace estimator (Hutchinson, 1989; Adams et al., 2018).
Applying the estimator to the FFJORD log-density integral (as derived in the Stochastic ODE section below) gives \[\log p(\mathbf{z}(t_1)) = \log p(\mathbf{z}(t_0)) - \mathbb{E}_{p(\mathbf{\epsilon})}\left[\int_{t_0}^{t_1} \mathbf{\epsilon}^T \frac{\partial f}{\partial \mathbf{z}(t)} \mathbf{\epsilon} \, dt\right]\]Typical choices of \(p(\mathbf{\epsilon})\) are a standard Gaussian or Rademacher distribution (Hutchinson, 1989).
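A quick numerical check of the estimator, as a sketch; the matrix and sample count are arbitrary.

```python
import numpy as np

# Hutchinson's trace estimator: E[eps^T A eps] = Tr(A)
# whenever E[eps] = 0 and Cov(eps) = I.
rng = np.random.default_rng(0)
D, n = 100, 10_000
A = rng.standard_normal((D, D))
eps = rng.choice([-1.0, 1.0], size=(n, D))     # Rademacher noise vectors
quad = np.einsum('nd,de,ne->n', eps, A, eps)   # eps^T A eps per sample
print(np.trace(A), quad.mean())                # close for large n
```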
Neural ODE
Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a black-box differential equation solver.
This residual update appears in NICE (Dinh et al., 2014) and normalizing flows (Rezende & Mohamed, 2015): \begin{align}\label{eq:res} \mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t, \theta_t) \end{align} where \(t \in \{0 \dots T\}\) and \(\mathbf{h}_t\in \mathbb{R}^D\).
We parameterize the continuous dynamics of hidden units using an ordinary differential equation (ODE) specified by a neural network: \begin{align}\label{eq:ivp} \frac{d\mathbf{h}(t)}{dt} = f(\mathbf{h}(t), t, \theta) \end{align} Starting from the input layer \(\mathbf{h}(0)\), we can define the output layer \(\mathbf{h}(T)\) to be the solution to this ODE initial value problem at some time \(T\).
Modern ODE solvers provide guarantees about the growth of approximation error, monitor the level of error, and adapt their evaluation strategy on the fly to achieve the requested level of accuracy. This allows the cost of evaluating a model to scale with problem complexity. After training, accuracy can be reduced for real-time or low-power applications.
Unlike recurrent neural networks, which require discretizing observation and emission intervals, continuously-defined dynamics can naturally incorporate data which arrives at arbitrary times.
Consider optimizing a scalar-valued loss function \(L(\cdot)\), whose input is the result of an ODE solver: \begin{align} L(\mathbf{z}(t_1)) = L \left(\mathbf{z}(t_0) + \int_{t_0}^{t_1} f(\mathbf{z}(t), t, \theta) dt \right) = L \left( \textrm{ODESolve}( \mathbf{z}(t_0), f, t_0, t_1, \theta) \right) \end{align}
To optimize \(L\), we require gradients with respect to \(\theta\). The first step is to determine how the gradient of the loss depends on the hidden state \(\mathbf{z}(t)\) at each instant. This quantity is called the \emph{adjoint} \(\mathbf{a}(t) = \frac{\partial{L}}{\partial \mathbf{z}(t)}\). Its dynamics are given by another ODE, which can be thought of as the instantaneous analog of the chain rule: \begin{align} \frac{d \mathbf{a}(t)}{d t} = - \mathbf{a}(t) ^{\mkern-1.5mu\mathsf{T}} \frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial \mathbf{z}} \label{eq:adj} \end{align}
We can compute \(\frac{\partial L}{\partial \mathbf{z}(t_0)}\) by another call to an ODE solver. This solver must run backwards, starting from the initial value of \(\frac{\partial L}{\partial \mathbf{z}(t_1)}\). One complication is that solving this ODE requires knowing the value of \(\mathbf{z}(t)\) along its entire trajectory. However, we can simply recompute \(\mathbf{z}(t)\) backwards in time together with the adjoint, starting from its final value \(\mathbf{z}(t_1)\).
Computing the gradients with respect to the parameters \(\theta\) requires evaluating a third integral, which depends on both \(\mathbf{z}(t)\) and \(\mathbf{a}(t)\): \begin{align}\label{eq:params_diffeq} \frac{d L}{d \theta} = -\int^{t_0}_{t_1} \mathbf{a}(t)^{\mkern-1.5mu\mathsf{T}} \frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial \theta} dt \end{align} The vector-Jacobian products \(\mathbf{a}(t)^T\frac{\partial f}{\partial \mathbf{z}}\) and \(\mathbf{a}(t)^T\frac{\partial f}{\partial \theta}\) in \eqref{eq:adj} and \eqref{eq:params_diffeq} can be efficiently evaluated by automatic differentiation, at a time cost similar to that of evaluating \(f\). All integrals for solving \(\mathbf{z}\), \(\mathbf{a}\) and \(\frac{\partial L}{\partial \theta}\) can be computed in a single call to an ODE solver, which concatenates the original state, the adjoint, and the other partial derivatives into a single vector.
https://github.com/rtqichen/torchdiffeq
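A minimal sketch of a Neural ODE layer using the torchdiffeq package linked above; the network sizes are illustrative, and `odeint_adjoint` backpropagates through the solve via the adjoint method described in this section.

```python
import torch
from torchdiffeq import odeint_adjoint as odeint

class ODEFunc(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, 64), torch.nn.Tanh(),
            torch.nn.Linear(64, dim))

    def forward(self, t, h):        # dh/dt = f(h(t), t, theta)
        return self.net(h)

func = ODEFunc(dim=2)
h0 = torch.randn(16, 2)             # batch of initial states h(0)
t = torch.tensor([0.0, 1.0])
h1 = odeint(func, h0, t)[-1]        # h(T): solution at the final time
loss = h1.pow(2).mean()
loss.backward()                     # gradients computed via the adjoint ODE
```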
The discretized equation~\eqref{eq:res} also appears in normalizing flows~\citep{rezende2015variational} and the NICE framework~\citep{dinh2014nice}. These methods use the change of variables theorem to compute exact changes in probability if samples are transformed through a bijective function \(f\): \begin{align} {\mathbf{z}}_1 = f({\mathbf{z}}_0)\; \implies \log p({\mathbf{z}}_1) = \log p({\mathbf{z}}_0) - \log \left| \det \frac{\partial f}{\partial {\mathbf{z}}_0} \right| \label{eq:change of variables} \end{align} An example is the planar normalizing flow~\citep{rezende2015variational}: \begin{align} {\mathbf{z}}(t+1) = {\mathbf{z}}(t) + uh(w^{\mkern-1.5mu\mathsf{T}} {\mathbf{z}}(t) + b),\quad \log p({\mathbf{z}}(t+1)) = \log p({\mathbf{z}}(t)) - \log \left| 1 + u^{\mkern-1.5mu\mathsf{T}} \frac{\partial h}{\partial {\mathbf{z}}} \right| \end{align}
Generally, the main bottleneck to using the change of variables formula is computing the determinant of the Jacobian \(\frac{\partial f}{\partial {\mathbf{z}}}\), which has a cost cubic in either the dimension of \(\mathbf{z}\) or the number of hidden units. Recent work explores the tradeoff between the expressiveness of normalizing flow layers and computational cost~\citep{kingma2016improved,tomczak2016improving,berg2018sylvester}.
Surprisingly, moving from a discrete set of layers to a continuous transformation simplifies the computation of the change in normalizing constant:
Theorem: Instantaneous Change of Variables Let \({\mathbf{z}}(t)\) be a finite continuous random variable with probability \(p({\mathbf{z}}(t))\) dependent on time. Let \(\frac{d{\mathbf{z}}}{dt} = f({\mathbf{z}}(t), t)\) be a differential equation describing a continuous-in-time transformation of \({\mathbf{z}}(t)\). Assuming that \(f\) is uniformly Lipschitz continuous in \({\mathbf{z}}\) and continuous in \(t\), then the change in log probability also follows a differential equation,
\begin{align} \frac{\partial \log p({\mathbf{z}}(t))}{\partial t} = -\textrm{Tr}\left(\frac{d f}{d {\mathbf{z}}(t)}\right) \end{align}
Instead of the log determinant in \eqref{eq:change of variables}, we now only require a trace operation. Also unlike standard finite flows, the differential equation \(f\) does not need to be bijective, since if uniqueness is satisfied, then the entire transformation is automatically bijective.
As an example application of the instantaneous change of variables, we can examine the continuous analog of the planar flow, and its change in normalization constant: \begin{align}\label{eq:cnf} \frac{d{\mathbf{z}}(t)}{dt} = uh(w^{\mkern-1.5mu\mathsf{T}}{\mathbf{z}}(t) + b),\quad \frac{\partial\log p({\mathbf{z}}(t))}{\partial t} = -u^{\mkern-1.5mu\mathsf{T}} \frac{\partial h}{\partial {\mathbf{z}}(t)} \end{align} Given an initial distribution \(p({\mathbf{z}}(0))\), we can sample from \(p({\mathbf{z}}(t))\) and evaluate its density by solving this combined ODE.
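As a sanity-check sketch, this combined ODE can be integrated with plain Euler steps, tracking \(\mathbf{z}(t)\) and \(\log p(\mathbf{z}(t))\) jointly; here \(h = \tanh\) and the parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 2
u, w, b = rng.standard_normal(D), rng.standard_normal(D), 0.5

z = rng.standard_normal(D)                    # z(0) ~ p(z(0)) = N(0, I)
logp = -0.5 * (z @ z) - 0.5 * D * np.log(2 * np.pi)
dt = 1e-3
for _ in range(1000):                         # integrate from t = 0 to t = 1
    a = np.tanh(w @ z + b)
    logp -= dt * (1 - a**2) * (u @ w)         # dlogp/dt = -u^T dh/dz
    z = z + dt * u * a                        # dz/dt = u h(w^T z + b)
# (z, logp) are now a sample from p(z(1)) and its log-density
```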
We present a continuous-time, generative approach to modeling time series. Our model represents each time series by a latent trajectory. Each trajectory is determined from a local initial state, \(\mathbf{z}_{t_0}\), and a global set of latent dynamics shared across all time series. Given observation times \(t_0, t_1, \dots, t_N\) and an initial state \(\mathbf{z}_{t_0}\), an ODE solver produces \(\mathbf{z}_{t_1}, \dots,\mathbf{z}_{t_N}\), which describe the latent state at each observation. We define this generative model formally through a sampling procedure:
\[\begin{align} \mathbf{z}_{t_0} & \sim p(\mathbf{z}_{t_0}) \\ \mathbf{z}_{t_1}, \mathbf{z}_{t_2}, \dots, \mathbf{z}_{t_N} & = \textrm{ODESolve}(\mathbf{z}_{t_0}, f, \theta_f, t_0, \dots, t_N) \\ \textrm{each} \quad \mathbf{x}_{t_i} & \sim p(\mathbf{x} | \mathbf{z}_{t_i}, \theta_{\mathbf{x}}) \end{align}\]The function \(f\) is time-invariant: it takes the value \(\mathbf{z}\) at the current time step and outputs the gradient \(\frac{\partial \mathbf{z}(t)}{\partial t} = f(\mathbf{z}(t), \theta_f)\). We parametrize this function using a neural net. Because \(f\) is time-invariant, given any latent state \(\mathbf{z}(t)\), the entire latent trajectory is uniquely defined. Extrapolating this latent trajectory lets us make predictions arbitrarily far forwards or backwards in time.
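A sketch of this sampling procedure with torchdiffeq (as in the earlier example); the dynamics and decoder networks are illustrative placeholders for the learned \(f(\cdot; \theta_f)\) and the parameters of \(p(\mathbf{x} \mid \mathbf{z}; \theta_{\mathbf{x}})\).

```python
import torch
from torchdiffeq import odeint

latent_dim, obs_dim = 4, 2
dynamics = torch.nn.Sequential(
    torch.nn.Linear(latent_dim, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, latent_dim))
decoder = torch.nn.Linear(latent_dim, obs_dim)

def f(t, z):                        # time-invariant: t is ignored
    return dynamics(z)

z_t0 = torch.randn(latent_dim)      # z_{t0} ~ p(z_{t0})
ts = torch.linspace(0.0, 5.0, 20)   # observation times t_0, ..., t_N
z_traj = odeint(f, z_t0, ts)        # latent state at each observation time
x_means = decoder(z_traj)           # parameters of each p(x | z_{t_i})
```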
Proof of the Instantaneous Change of Variables Theorem
Let \(\mathbf{z}(t)\) be a finite continuous random variable with probability \(p(\mathbf{z}(t))\) dependent on time. Let \(\frac{d\mathbf{z}}{dt} = f(\mathbf{z}(t), t)\) be a differential equation describing a continuous-in-time transformation of \(\mathbf{z}(t)\). Assuming that \(f\) is uniformly Lipschitz continuous in \(\mathbf{z}\) and continuous in \(t\), then the change in log probability also follows a differential equation:
\[\frac{\partial \log p(\mathbf{z}(t))}{\partial t} = -\textrm{tr}\left(\frac{d f}{d \mathbf{z}(t)}\right)\]To prove this theorem, we take the infinitesimal limit of finite changes of \(\log p({\mathbf{z}}(t))\) through time. First we denote the transformation of \({\mathbf{z}}\) over an \(\epsilon\) change in time as \begin{align} {\mathbf{z}}(t+\epsilon) = T_\epsilon({\mathbf{z}}(t)) \end{align} We assume that \(f\) is Lipschitz continuous in \({\mathbf{z}}(t)\) and continuous in \(t\), so every initial value problem has a unique solution by Picard’s existence theorem. We also assume \({\mathbf{z}}(t)\) is bounded. These conditions imply that \(f\), \(T_\epsilon\), and \(\frac{\partial}{\partial {\mathbf{z}}}T_\epsilon\) are all bounded. In the following, we use these conditions to exchange limits and products.
We can write the differential equation \(\frac{\partial\log p({\mathbf{z}}(t))}{\partial t}\) using the discrete change of variables formula, and the definition of the derivative:
\[\begin{align} \frac{\partial\log p({\mathbf{z}}(t))}{\partial t} &= \lim_{\epsilon\rightarrow0^+} \frac{\log p({\mathbf{z}}(t)) - \log \left|\det \frac{\partial}{\partial {\mathbf{z}}} T_\epsilon({\mathbf{z}}(t)) \right| - \log p({\mathbf{z}}(t))}{\epsilon} \\ &= -\lim_{\epsilon\rightarrow0^+} \frac{\log \left|\det \frac{\partial}{\partial {\mathbf{z}}} T_\epsilon({\mathbf{z}}(t)) \right| }{\epsilon} \\ &= -\lim_{\epsilon\rightarrow0^+} \frac{\frac{\partial}{\partial \epsilon}\log \left|\det \frac{\partial}{\partial {\mathbf{z}}} T_\epsilon({\mathbf{z}}(t)) \right| }{\frac{\partial}{\partial \epsilon}\epsilon} \qquad \qquad \textrm{(by L'H\^opital's rule)}\\ &= -\lim_{\epsilon\rightarrow0^+} \frac{\frac{\partial}{\partial \epsilon}\left|\det \frac{\partial}{\partial {\mathbf{z}}} T_\epsilon({\mathbf{z}}(t)) \right| }{\left|\det \frac{\partial}{\partial {\mathbf{z}}} T_\epsilon({\mathbf{z}}(t)) \right|} \qquad \qquad \quad \left( \left. \frac{\partial \log({\mathbf{z}})}{\partial {\mathbf{z}}}\right\vert_{\mathbf{z} = 1} = 1 \right)\\ &= -\underbrace{\left( \lim_{\epsilon\rightarrow0^+} \frac{\partial}{\partial \epsilon}\left|\det \frac{\partial}{\partial {\mathbf{z}}} T_\epsilon({\mathbf{z}}(t)) \right| \right)}_{\text{bounded}} \underbrace{\left( \lim_{\epsilon\rightarrow0^+} \frac{1 }{\left|\det \frac{\partial}{\partial {\mathbf{z}}} T_\epsilon({\mathbf{z}}(t)) \right|} \right)}_{=1} \\ &= -\lim_{\epsilon\rightarrow0^+} \frac{\partial}{\partial \epsilon}\left|\det \frac{\partial}{\partial {\mathbf{z}}} T_\epsilon({\mathbf{z}}(t)) \right| \end{align}\]The derivative of the determinant can be expressed using Jacobi’s formula, which gives
\[\begin{align} \frac{\partial\log p({\mathbf{z}}(t))}{\partial t} &= -\lim_{\epsilon\rightarrow0^+} \textrm{tr} \left( \text{adj}\left(\frac{\partial}{\partial {\mathbf{z}}}T_\epsilon({\mathbf{z}}(t))\right) \frac{\partial }{\partial \epsilon } \frac{\partial }{\partial {\mathbf{z}} } T_\epsilon({\mathbf{z}}(t)) \right) \\ &= -\textrm{tr} \left( \underbrace{\left(\lim_{\epsilon\rightarrow0^+} \text{adj}\left(\frac{\partial}{\partial {\mathbf{z}}}T_\epsilon({\mathbf{z}}(t))\right)\right)}_{=I} \left( \lim_{\epsilon\rightarrow0^+} \frac{\partial }{\partial \epsilon } \frac{\partial }{\partial {\mathbf{z}} } T_\epsilon({\mathbf{z}}(t))\right) \right) \\ &= -\textrm{tr} \left( \lim_{\epsilon\rightarrow0^+} \frac{\partial }{\partial \epsilon } \frac{\partial }{\partial {\mathbf{z}} } T_\epsilon({\mathbf{z}}(t))\right) \end{align}\]Substituting \(T_\epsilon\) with its Taylor series expansion and taking the limit, we complete the proof.
\[\begin{align} \frac{\partial\log p({\mathbf{z}}(t))}{\partial t} &= -\textrm{tr} \left( \lim_{\epsilon\rightarrow0^+} \frac{\partial }{\partial \epsilon } \frac{\partial }{\partial {\mathbf{z}}} \left( {\mathbf{z}} + \epsilon f({\mathbf{z}}(t), t) + \mathcal{O}(\epsilon^2) + \mathcal{O}(\epsilon^3) + \dots \right) \right) \\ &= -\textrm{tr} \left( \lim_{\epsilon\rightarrow0^+} \frac{\partial }{\partial \epsilon } \left( I + \frac{\partial }{\partial {\mathbf{z}}}\epsilon f({\mathbf{z}}(t), t) + \mathcal{O}(\epsilon^2) + \mathcal{O}(\epsilon^3) + \dots \right) \right) \\ &= -\textrm{tr} \left( \lim_{\epsilon\rightarrow0^+} \left(\frac{\partial }{\partial {\mathbf{z}}} f({\mathbf{z}}(t), t) + \mathcal{O}(\epsilon) + \mathcal{O}(\epsilon^2) + \dots \right) \right) \\ &= -\textrm{tr} \left( \frac{\partial }{\partial {\mathbf{z}}} f({\mathbf{z}}(t), t)\right) \end{align}\]
A Modern Proof of the Adjoint Method
We present an alternative proof to the adjoint method~\citep{pontryagin1962mathematical} that is short and easy to follow.
Continuous Backpropagation
Let \(\mathbf{z}(t)\) follow the differential equation \(\frac{d\mathbf{z}(t)}{dt} = f(\mathbf{z}(t), t, \theta)\), where \(\theta\) are the parameters. We will prove that if we define an adjoint state \begin{equation} \mathbf{a}(t) = \frac{dL}{d\mathbf{z}(t)} \end{equation} then it follows the differential equation \begin{equation}\label{eq:adj_method} \frac{d\mathbf{a}(t)}{dt} = - \mathbf{a}(t)\frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial \mathbf{z}(t)} \end{equation} For ease of notation, we denote vectors as row vectors, whereas the main text uses column vectors.
The adjoint state is the gradient with respect to the hidden state at a specified time \(t\). In standard neural networks, the gradient of a hidden layer \(\mathbf{h}_t\) depends on the gradient from the next layer \(\mathbf{h}_{t+1}\) by chain rule \begin{equation} \frac{d L}{d \mathbf{h}_t} = \frac{d L}{d \mathbf{h}_{t+1}} \frac{d \mathbf{h}_{t+1}}{d \mathbf{h}_{t}}. \end{equation} With a continuous hidden state, we can write the transformation after an \(\epsilon\) change in time as \begin{equation} \mathbf{z}(t+\epsilon) = \int_t^{t+\epsilon} f(\mathbf{z}(t), t, \theta) dt + \mathbf{z}(t) = T_\epsilon(\mathbf{z}(t), t) \end{equation} and the chain rule can also be applied: \begin{equation}\label{eq:chain_rule} \frac{d L}{d \mathbf{z}(t)} = \frac{d L}{d \mathbf{z}(t+\epsilon)} \frac{d \mathbf{z}(t+\epsilon)}{d \mathbf{z}(t)} \quad\quad\text{or}\quad\quad \mathbf{a}(t) = \mathbf{a}(t+\epsilon)\frac{\partial T_\epsilon(\mathbf{z}(t), t)}{\partial \mathbf{z}(t)} \end{equation} The proof of \eqref{eq:adj_method} follows from the definition of derivative:
\[\begin{align} \frac{d \mathbf{a}(t)}{d t} &= \lim\limits_{\epsilon\rightarrow 0^+} \frac{\mathbf{a}(t+\epsilon) - \mathbf{a}(t)}{\epsilon} \\ &= \lim\limits_{\epsilon\rightarrow 0^+} \frac{\mathbf{a}(t+\epsilon) - \mathbf{a}(t+\epsilon)\frac{\partial }{\partial \mathbf{z}(t)}T_\epsilon(\mathbf{z}(t))}{\epsilon} \qquad\qquad\qquad\qquad\qquad\;\quad \;\textrm{(by Eq \ref{eq:chain_rule})}\\ &= \lim\limits_{\epsilon\rightarrow 0^+} \frac{\mathbf{a}(t+\epsilon) - \mathbf{a}(t+\epsilon)\frac{\partial }{\partial \mathbf{z}(t)}\left( \mathbf{z}(t) + \epsilon f(\mathbf{z}(t), t, \theta) + \mathcal{O}(\epsilon^2) \right)}{\epsilon} \textrm{(Taylor series around } \mathbf{z}(t))\\ &= \lim\limits_{\epsilon\rightarrow 0^+} \frac{\mathbf{a}(t+\epsilon) - \mathbf{a}(t+\epsilon)\left( I + \epsilon \frac{\partial f(\mathbf{z}(t), t, \theta) }{\partial \mathbf{z}(t)} + \mathcal{O}(\epsilon^2) \right)}{\epsilon} \\ &= \lim\limits_{\epsilon\rightarrow 0^+} \frac{ - \epsilon \mathbf{a}(t+\epsilon)\frac{\partial f(\mathbf{z}(t), t, \theta) }{\partial \mathbf{z}(t)} + \mathcal{O}(\epsilon^2)}{\epsilon} \\ &= \lim\limits_{\epsilon\rightarrow 0^+} - \mathbf{a}(t+\epsilon)\frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial \mathbf{z}(t)} + \mathcal{O}(\epsilon) \\ &= - \mathbf{a}(t)\frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial \mathbf{z}(t)}\label{eq:adjoint_z} \end{align}\]We pointed out the similarity between the adjoint method and backpropagation (eq. \ref{eq:chain_rule}). As with backpropagation, the ODE for the adjoint state needs to be solved \textit{backwards} in time. We specify the constraint on the last time point, which is simply the gradient of the loss with respect to the last time point, and can obtain the gradients with respect to the hidden state at any time, including the initial value.
\[\underbrace{\mathbf{a}(t_N) = \frac{dL}{d \mathbf{z}(t_N)}}_{\text{initial condition of adjoint diffeq.}} \quad\quad\quad\quad \underbrace{\mathbf{a}(t_0) = \mathbf{a}(t_N) + \int_{t_N}^{t_0} \frac{d\mathbf{a}(t)}{dt} \;dt = \mathbf{a}(t_N) - \int_{t_N}^{t_0} \mathbf{a}(t) \frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial \mathbf{z}(t)} \;dt}_{\text{gradient wrt. initial value}}\]Here we assumed that the loss function \(L\) depends only on the last time point \(t_N\). If \(L\) depends also on intermediate time points \(t_1, t_2, \dots, t_{N-1}\), we can repeat the adjoint step for each of the intervals \([t_{N-1}, t_N]\), \([t_{N-2}, t_{N-1}]\) in backward order and sum up the obtained gradients.
Gradients wrt. \(\theta\) and \(t\)
We can generalize \eqref{eq:adj_method} to obtain gradients with respect to \(\theta\) (a constant with respect to \(t\)) and the initial and end times, \(t_0\) and \(t_N\). We view \(\theta\) and \(t\) as states with constant differential equations and write \begin{equation} \frac{\partial \theta(t)}{\partial t} = \mathbf{0} \quad\quad \frac{dt(t)}{dt} = 1 \end{equation} We can then combine these with \(\mathbf{z}\) to form an augmented state\footnote{Note that we’ve overloaded \(t\) to be both a part of the state and the (dummy) independent variable. The distinction is clear given context, so we keep \(t\) as the independent variable for consistency with the rest of the text.} with corresponding differential equation and adjoint state,
\[\frac{d}{dt} \begin{bmatrix} \mathbf{z} \\ \theta \\ t \end{bmatrix}(t) = f_{aug}([\mathbf{z}, \theta, t]) := \begin{bmatrix} f([\mathbf{z}, \theta, t]) \\ \mathbf{0} \\ 1 \end{bmatrix}, \; \mathbf{a}_{aug} := \begin{bmatrix} \mathbf{a} \\ \mathbf{a}_\theta \\ \mathbf{a}_t \end{bmatrix}, \; \mathbf{a}_\theta(t) := \frac{dL}{d\theta(t)},\; \mathbf{a}_t(t) := \frac{dL}{dt(t)}\]Note this formulates the augmented ODE as an autonomous (time-invariant) ODE, but the derivations in the previous section still hold as this is a special case of a time-variant ODE. The Jacobian of \(f\) has the form
\[\frac{\partial f_{aug}}{\partial [\mathbf{z}, \theta, t]} = \begin{bmatrix} \frac{\partial f }{\partial \mathbf{z} } & \frac{\partial f }{\partial \theta } & \frac{\partial f }{\partial t} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} \end{bmatrix}(t)\]where each \(\mathbf{0}\) is a matrix of zeros with the appropriate dimensions. We plug this into \eqref{eq:adj_method} to obtain
\[\frac{d\mathbf{a}_{aug}(t)}{dt} = - \begin{bmatrix} \mathbf{a}(t) & \mathbf{a}_\theta(t) & \mathbf{a}_t(t) \end{bmatrix} \frac{\partial f_{aug}}{\partial [\mathbf{z}, \theta, t]}(t) = - \begin{bmatrix} \mathbf{a} \frac{\partial f}{\partial \mathbf{z}} & \mathbf{a} \frac{\partial f}{\partial \theta} & \mathbf{a} \frac{\partial f}{\partial t} \end{bmatrix}(t)\]The first element is the adjoint differential equation~\eqref{eq:adj_method}, as expected. The second element can be used to obtain the total gradient with respect to the parameters, by integrating over the full interval and setting \(\mathbf{a}_\theta(t_N) = \mathbf{0}\).
\[\frac{dL}{d\theta} = \mathbf{a}_\theta(t_0) = - \int_{t_N}^{t_0} \mathbf{a}(t)\frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial \theta} \;dt\]Finally, we also get gradients with respect to \(t_0\) and \(t_N\), the start and end of the integration interval.
\[\frac{dL}{dt_N} = \mathbf{a}(t_N) f(\mathbf{z}(t_N), t_N, \theta) \quad\quad \frac{dL}{dt_0} = \mathbf{a}_t(t_0) = \mathbf{a}_t(t_N) - \int_{t_N}^{t_0} \mathbf{a}(t) \frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial t} \;dt\]Between the adjoint ODE \eqref{eq:adj_method} and the expressions above for the gradients with respect to the initial value, the parameters \(\theta\), and the times \(t_0\) and \(t_N\), we have gradients for all possible inputs to an initial value problem solver.
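To make the recipe concrete, here is a sketch of the backward pass with fixed-step Euler integration, using autograd only for the vector-Jacobian products \(\mathbf{a}^T \partial f/\partial \mathbf{z}\) and \(\mathbf{a}^T \partial f/\partial \theta\); the toy dynamics and step count are illustrative.

```python
import torch

def f(z, t, theta):
    W = theta.view(z.numel(), z.numel())   # toy dynamics: dz/dt = tanh(W z)
    return torch.tanh(W @ z)

def adjoint_backward(z1, dLdz1, theta, t0, t1, steps=1000):
    """Integrate z(t), a(t) = dL/dz(t), and dL/dtheta backwards from t1 to t0.
    As in the derivation above, a(t) is treated as a row vector."""
    z, a = z1.clone(), dLdz1.clone()
    a_theta = torch.zeros_like(theta)
    dt = (t1 - t0) / steps
    t = t1
    for _ in range(steps):
        zv = z.detach().requires_grad_(True)
        thv = theta.detach().requires_grad_(True)
        fz = f(zv, t, thv)
        # vector-Jacobian products a df/dz and a df/dtheta via autograd
        vjp_z, vjp_th = torch.autograd.grad(fz, (zv, thv), grad_outputs=a)
        z = z - dt * fz.detach()         # recompute z(t) backwards in time
        a = a + dt * vjp_z               # reversed step of da/dt = -a df/dz
        a_theta = a_theta + dt * vjp_th  # accumulates -int_{t1}^{t0} a df/dtheta dt
        t = t - dt
    return z, a, a_theta                 # z(t0), dL/dz(t0), dL/dtheta
```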
FFJORD
Given a random variable \(\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})\), the log density of \(\mathbf{x} = f(\mathbf{z})\) follows
\begin{equation} \log p_{\mathbf{x}}(\mathbf{x}) = \log p_{\mathbf{z}}(\mathbf{z}) - \log \left| \det \frac{\partial f(\mathbf{z})}{\partial \mathbf{z}} \right| \end{equation}
where \(\frac{\partial f(\mathbf{z})}{\partial \mathbf{z}}\) is the Jacobian of \(f\). In general, computation of the log determinant has a time cost of \(\mathcal{O}(D^3)\).
Flow Function
The change in log-density under this model follows a second differential equation, called the instantaneous change of variables formula:
\[\frac{\partial \log p(\mathbf{z}(t)) }{\partial t} = -\textrm{Tr}\left(\frac{\partial f}{\partial \mathbf{z}(t)}\right).\]We can compute the total change in log-density by integrating across time:
\[\log p(\mathbf{z}(t_1)) = \log p(\mathbf{z}(t_0)) - \int_{t_0}^{t_1} \textrm{Tr}\left(\frac{\partial f}{\partial \mathbf{z}(t)}\right) \;dt.\]For a diffusion model, the timesteps are swapped as time \(t=T\) represents pure noise, while \(t=0\) represents the data distribution.
\[\log p(\mathbf{x}(0)) = \log p(\mathbf{x}(T)) - \int_{T}^{0} \textrm{Tr}\left(\frac{\partial f}{\partial \mathbf{x}(t)}\right) \;dt.\]
Reverse IVP:
Given a datapoint \(\mathbf{x}\), we can compute both the point \(\mathbf{z}_0\) which generates \(\mathbf{x}\), as well as \(\log p(\mathbf{x})\) under the model by solving the initial value problem:
\[\begin{align} \underbrace{\begin{bmatrix} \mathbf{z}_0 \\ \log p(\mathbf{x}) - \log p_{z_0}(\mathbf{z}_0) \end{bmatrix}}_{\textrm{solutions}} = \underbrace{\int_{t_1}^{t_0} \begin{bmatrix} f(\mathbf{z}(t), t; \theta) \\ - \textrm{Tr}\left(\frac{\partial f}{\partial \mathbf{z}(t)}\right) \end{bmatrix}dt}_{\textrm{dynamics}},\qquad \underbrace{\begin{bmatrix} \mathbf{z}(t_1) \\ \log p(\mathbf{x}) - \log p(\mathbf{z}(t_1)) \end{bmatrix} = \begin{bmatrix} \mathbf{x} \\ 0 \end{bmatrix}}_{\textrm{initial values}} \end{align}\]which integrates the combined dynamics of \(\mathbf{z}(t)\) and the log-density of the sample backwards in time from \(t_1\) to \(t_0\). We can then compute \(\log p(\mathbf{x})\) using the solution of this reverse IVP and adding \(\log p_{z_0}(\mathbf{z}_0)\).
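A sketch of solving this reverse IVP with torchdiffeq and an exact trace (fine for small \(D\); FFJORD replaces the trace with the Hutchinson estimator below). The dynamics network and shapes are illustrative.

```python
import torch
from torchdiffeq import odeint

D = 2
f_net = torch.nn.Sequential(
    torch.nn.Linear(D + 1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, D))

def aug_dynamics(t, state):
    z, _ = state
    with torch.enable_grad():
        z = z.detach().requires_grad_(True)
        f = f_net(torch.cat([z, t.reshape(1)]))
        # exact Tr(df/dz): sum the diagonal of the Jacobian row by row
        trace = sum(torch.autograd.grad(f[i], z, retain_graph=True)[0][i]
                    for i in range(D))
    return f.detach(), -trace.detach().reshape(1)

x = torch.randn(D)                   # data point: z(t1) = x
t = torch.tensor([1.0, 0.0])         # integrate backwards, t1 -> t0
z_path, delta = odeint(aug_dynamics, (x, torch.zeros(1)), t)
z0 = z_path[-1]                      # latent point that generates x
log_px = (torch.distributions.Normal(0.0, 1.0).log_prob(z0).sum()
          - delta[-1])               # log p(x) = log p_{z0}(z0) - Delta
```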
Adjoint: \(\begin{align} \frac{dL}{d\theta} = -\int_{t_1}^{t_0}\left(\frac{\partial L}{\partial \mathbf{z}(t)}\right)^T \frac{\partial f(\mathbf{z}(t), t; \theta)}{\partial \theta} dt. \end{align}\)
Trace estimator:
\[\begin{align} \textrm{Tr}(A) = \mathbb{E}_{p(\mathbf{\epsilon})}[\mathbf{\epsilon}^T A \mathbf{\epsilon}]. \end{align}\]The above equation holds for any \(D\)-by-\(D\) matrix \(A\) and distribution \(p(\mathbf{\epsilon})\) over \(D\)-dimensional vectors such that \(\mathbb{E}[\mathbf{\epsilon}] = 0\) and \(\mathrm{Cov}(\mathbf{\epsilon}) = I\).
Stochastic ODE:
\[\begin{align} \log p(\mathbf{z}(t_1)) & = \log p(\mathbf{z}(t_0)) - \int_{t_0}^{t_1} \textrm{Tr}\left(\frac{\partial f}{\partial \mathbf{z}(t)}\right) dt\nonumber\\ &= \log p(\mathbf{z}(t_0)) - \int_{t_0}^{t_1} \mathbb{E}_{p(\mathbf{\epsilon})}\left[\mathbf{\epsilon}^T \frac{\partial f}{\partial \mathbf{z}(t)}\mathbf{\epsilon}\right]dt \nonumber\\ &= \log p(\mathbf{z}(t_0)) - \mathbb{E}_{p(\mathbf{\epsilon})}\left[\int_{t_0}^{t_1} \mathbf{\epsilon}^T \frac{\partial f}{\partial \mathbf{z}(t)}\mathbf{\epsilon} dt \right] \end{align}\]
Bottleneck Trick:
When \(f\) is built with a bottleneck, \(f = g \circ h\) with hidden width \(H < D\), cyclic permutation of the Jacobian product inside the trace lets us work with an \(H \times H\) matrix instead of a \(D \times D\) one:
\[\textrm{Tr}\underbrace{\left( \frac{\partial f}{\partial \mathbf{z}} \right)}_{D \times D} = \textrm{Tr}\underbrace{\left( \frac{\partial g}{\partial h}\frac{\partial h}{\partial \mathbf{z}} \right)}_{D \times D} = \textrm{Tr}\underbrace{ \left( \frac{\partial h}{\partial \mathbf{z}}\frac{\partial g}{\partial h} \right)}_{H \times H} = \mathbb{E}_{p(\mathbf{\epsilon})}\left[ \mathbf{\epsilon}^T \frac{\partial h}{\partial \mathbf{z}}\frac{\partial g}{\partial h} \mathbf{\epsilon}\right].\]
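A quick sketch verifying the cyclic-permutation identity the trick relies on, with illustrative Jacobian shapes:

```python
import numpy as np

D, H = 512, 32
rng = np.random.default_rng(2)
dg_dh = rng.standard_normal((D, H))   # dg/dh
dh_dz = rng.standard_normal((H, D))   # dh/dz
big = np.trace(dg_dh @ dh_dz)         # trace of the D x D product
small = np.trace(dh_dz @ dg_dh)       # trace of the H x H product
assert np.isclose(big, small)         # Tr(AB) = Tr(BA)
```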
References
[1] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, David Duvenaud. Neural Ordinary Differential Equations. 2018. [PDF]
[2] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, David Duvenaud. FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models. 2018. [PDF].
[3] M. F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 18:1059–1076, 1989.
[4] R. P. Adams, J. Pennington, M. J. Johnson, J. Smith, Y. Ovadia, B. Patton, and J. Saunderson. Estimating the Spectral Density of Large Implicit Matrices. arXiv e-prints, 2018.
[5] Tero Karras, Miika Aittala, Timo Aila, Samuli Laine. Elucidating the Design Space of Diffusion-Based Generative Models. 2022. [PDF]