Generative Models
Problem Definition
The goal of generative modeling is, given a dataset \(\{ \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N \}\) whose points are drawn independently from an underlying data distribution \(p(\mathbf{x})\), to fit a model of that distribution so that we can synthesize new data points at will by sampling from it (Song, 2021).
See GANs for more info.
VAE
Variational Autoencoders introduce a latent variable \(\mathbf{z}\) and maximize a lower bound on the log-likelihood:
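For reference, this lower bound is the evidence lower bound (ELBO); a standard way to write it (the exact notation here is mine) is:

\[\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\Big[ \log p_\theta(\mathbf{x} \mid \mathbf{z}) \Big] - D_\text{KL}\Big( q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z}) \Big)\]

The first term rewards accurate reconstruction, while the KL term keeps the approximate posterior \(q_\phi(\mathbf{z} \mid \mathbf{x})\) close to the prior \(p(\mathbf{z})\).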
The latent \(\mathbf{z}\) enables powerful interpolation and editing applications.
Maximizing the density function directly would require the normalizing constant \(Z_{\theta}\), which is generally intractable; variational inference lets us sidestep this by optimizing the bound instead.
Score-Based Models
Score-based models learn the score function instead of the density function. They do not require a tractable normalizing constant and can be learned directly via score matching (see Yang Song’s post).
The score function of a distribution \(p(\mathbf{x})\) is defined as \(\nabla_{\mathbf{x}} \log p(\mathbf{x})\).
Once a score-based model \(s_{\theta}(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x})\) has been trained, we can use an iterative MCMC procedure called Langevin dynamics to draw samples from it (Song, 2021).
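As a concrete illustration, here is a minimal sketch of (unadjusted) Langevin dynamics given a trained score model; `score_fn`, the step size, and the number of steps are placeholders rather than values from any particular paper:

```python
import numpy as np

def langevin_sample(score_fn, x_init, step_size=1e-4, n_steps=1000, rng=None):
    """Draw an approximate sample via Langevin dynamics.

    Update: x <- x + step_size * score_fn(x) + sqrt(2 * step_size) * z,
    with z ~ N(0, I). As step_size -> 0 and n_steps -> inf, samples approach
    the distribution whose score is score_fn.
    """
    rng = rng or np.random.default_rng(0)
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + step_size * score_fn(x) + np.sqrt(2 * step_size) * z
    return x

# Example: sample from a standard normal, whose score is -x.
sample = langevin_sample(lambda x: -x, x_init=np.zeros(2))
```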
Diffusion Models
DDPM (Ho et al., 2020) lays out the mathematical framework, Lilian Weng expands the derivations, and Calvin Luo gives an even more complete derivation (Luo, 2022).
Excellent tutorial videos can be found here by Outlier and from a CVPR 2022 Tutorial.
Diffusion models induce two processes: the forward and the reverse process. Let \(x_0\) represent the original image, and let \(x_T\) represent the final, fully noised image, which is approximately an isotropic Gaussian.
Forward Process: \(q(x_t \mid x_{t-1})\) returns an image with a little bit more noise added.
Reverse Process: \(p(x_{t-1} \mid x_t)\) returns an image with less noise.
(Sohl-Dickstein et al., 2015) introduced the diffusion approach to the ML community, and emphasized that estimating small perturbations is more tractable than making a single large perturbation.
DDPM laid out that the network could predict either (1) the mean of the reverse-process Gaussian, (2) the original image \(x_0\), or (3) the noise \(\epsilon\) added to the image.
The Improved DDPM authors at OpenAI (Nichol &amp; Dhariwal, 2021) suggested also learning the variance. The amount of noise added at each step is regulated by a schedule, to make sure the variance doesn’t explode as we add more and more noise.
They also moved from a linear schedule to a cosine schedule: the linear schedule destroys signal too rapidly, and towards the end of the process the samples become nearly redundant.
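A sketch of the cosine schedule from Improved DDPM, which defines \(\bar{\alpha}_t\) directly and derives the per-step \(\beta_t\) from it; the small offset \(s\) and the clipping value follow the paper as I recall them, so treat the constants as assumptions:

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative product alpha_bar_t for t = 0..T under the cosine schedule."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Per-step beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped for stability."""
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)

betas = betas_from_alpha_bar(cosine_alpha_bar(T=1000))
```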
Using Variational Bound on Negative Log Likelihood
To obtain a loss function that we can minimize (that maximizes the log likelihood), we’ll use a variational bound.
\(q(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0) = q(\mathbf{x}_t \mid \mathbf{x}_{t-1})\), where the extra conditioning term is superfluous due to the Markov property.
(Ho et al., 2020) and (Weng, 2021) have similar derivations. One trick we’ll need is that we can rewrite a KL divergence as an expectation:
\[\begin{aligned} D_{KL}(P \| Q) &= \sum\limits_{x \in X} P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_{x \sim P}\Big[ \log \frac{P(x)}{Q(x)} \Big] \end{aligned}\]First, consider adding a strictly nonnegative quantity to the negative log likelihood:
\[\begin{aligned} - \log p_\theta(\mathbf{x}_0) &\leq - \log p_\theta(\mathbf{x}_0) + D_\text{KL}\Big(q(\mathbf{x}_{1:T}\vert\mathbf{x}_0) \| p_\theta(\mathbf{x}_{1:T}\mid \mathbf{x}_0) \Big) \\ &= -\log p_\theta(\mathbf{x}_0) + \mathbb{E}_{\mathbf{x}_{1:T}\sim q(\mathbf{x}_{1:T} \vert \mathbf{x}_0)} \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{ p_\theta(\mathbf{x}_{1:T}\mid \mathbf{x}_0) } \Big] & \mbox{Rewrite KL Divergence as Expectation} \\ &= -\log p_\theta(\mathbf{x}_0) + \mathbb{E}_{\mathbf{x}_{1:T}\sim q(\mathbf{x}_{1:T} \vert \mathbf{x}_0)} \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T}) / p_\theta(\mathbf{x}_0)} \Big] & \mbox{Use Bayes Rule: } P(A \mid B) = \frac{P(A,B) }{ P(B)} \\ &= -\log p_\theta(\mathbf{x}_0) + \mathbb{E}_q \Big[ \log\frac{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} + \log p_\theta(\mathbf{x}_0) \Big] \\ &= -\log p_\theta(\mathbf{x}_0) + \mathbb{E}_q \Big[ \log\frac{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] + \mathbb{E}_q \Big[ \log p_\theta(\mathbf{x}_0) \Big] & \mbox{ Expectation is linear} \\ &= -\log p_\theta(\mathbf{x}_0) + \mathbb{E}_q \Big[ \log\frac{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] + \log p_\theta(\mathbf{x}_0) & \mbox{Rightmost term does not depend on } \mathbf{x}_{1:T} \sim q \\ &= \mathbb{E}_q \Big[ \log \frac{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] & \mbox{ Terms cancel} \\ \text{Let }L_\text{VLB} &= \mathbb{E}_{q(\mathbf{x}_{0:T})} \Big[ \log \frac{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \geq - \mathbb{E}_{q(\mathbf{x}_0)} \log p_\theta(\mathbf{x}_0) \end{aligned}\]We can simplify further by using two facts: first, it is superfluous by Markov property to condition on earliest \(\mathbf{x}_0\) in forward chain process. Second, by Bayes Rule:
\[q(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0) = \frac{ q( \mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) q(\mathbf{x}_t \mid \mathbf{x}_0) }{ q(\mathbf{x}_{t-1} \mid \mathbf{x}_0) }\]Conditioning on \(\mathbf{x}_0\) is what makes the reversed term tractable: \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\) alone would require marginalizing over the unknown data distribution, whereas \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\), \(q(\mathbf{x}_t \mid \mathbf{x}_0)\), and \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)\) are all Gaussians with known closed forms.
Using these properties, we simplify as:
\[\begin{aligned} L_\text{VLB} &= \mathbb{E}_{q(\mathbf{x}_{0:T})} \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \\ &= \mathbb{E}_q \Big[ \log\frac{\prod_{t=1}^T q(\mathbf{x}_t \mid \mathbf{x}_{t-1})}{ p_\theta(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t) } \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=1}^T \log \frac{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)} \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)} \Big] & \mbox{Superfluous by Markov property.} \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \Big( \frac{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)}\cdot \frac{q(\mathbf{x}_t \mid \mathbf{x}_0)}{q(\mathbf{x}_{t-1}\vert\mathbf{x}_0)} \Big) + \log \frac{q(\mathbf{x}_1 \mid \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)} \Big] & \mbox{By Bayes' Rule.} \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t \mid \mathbf{x}_0)}{q(\mathbf{x}_{t-1} \vert \mathbf{x}_0)} + \log\frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)} \Big] & \mbox{Terms cancel in sum of logs.} \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_T \mid \mathbf{x}_0)}{q(\mathbf{x}_1 \vert \mathbf{x}_0)} + \log \frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)} \Big]\\ &= \mathbb{E}_q \Big[ \log\frac{q(\mathbf{x}_T \mid \mathbf{x}_0)}{p_\theta(\mathbf{x}_T)} + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)} - \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1) \Big] \\ &= \mathbb{E}_q [\underbrace{D_\text{KL}\Big(q(\mathbf{x}_T \mid \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_T)\Big)}_{L_T} + \sum_{t=2}^T \underbrace{D_\text{KL}\Big(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)\Big)}_{L_{t-1}} \underbrace{- \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}_{L_0} ] \end{aligned}\]Note that \(L_{t-1}\) is the key loss term here: it says we want each learned reverse step \(p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\) to match the corresponding forward-process posterior \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\). Also note that in (Luo, 2022)’s derivation, the arguments of the KL are reversed in Equations 47-58, which leads to a slightly different result.
Forward Process
Applying one forward step entails: \(q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t \mathbf{I})\). Repeating this an arbitrary number of steps, it turns out, admits a simple closed-form expression.
\[\begin{aligned} q( \mathbf{x}_t \mid \mathbf{x}_{t-1}) &= \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}) \\ \mathbf{x}_t &= \sqrt{1- \beta_t} \mathbf{x}_{t-1} + \sqrt{\beta_t} \epsilon_{t-1} & (\epsilon \sim \mathcal{N}(0,1) \mbox{ By reparameterization trick } X = \mu + \sigma Z) \\ &= \sqrt{\alpha_t} \mathbf{x}_{t-1} + \sqrt{1 - \alpha_t} \epsilon_{t-1} & \mbox{ Let } \alpha_t = 1 - \beta_t \\ \mathbf{x}_{t-1} &= \sqrt{\alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_{t-1}} \epsilon_{t-2} & \mbox{replacing } t \mbox{ from equation above with } t-1 \end{aligned}\]We can recursively evaluate \(x_t\) to find:
\[\begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t} \Bigg( \underbrace{ \sqrt{\alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_{t-1}} \epsilon_{t-2} }_{ \mathbf{x}_{t-1}} \Bigg) + \sqrt{1 - \alpha_t} \epsilon_{t-1} \\ \mathbf{x}_t &= \sqrt{\alpha_t} \sqrt{\alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{\alpha_t} \sqrt{1 - \alpha_{t-1}} \epsilon_{t-2} + \sqrt{1 - \alpha_t} \epsilon_{t-1} \\ \mathbf{x}_t &= \sqrt{\alpha_t} \sqrt{\alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{ \alpha_t (1 - \alpha_{t-1})} \epsilon_{t-2} + \sqrt{1 - \alpha_t} \epsilon_{t-1} \end{aligned}\]We know that the sum of independent normally distributed random variables is also normally distributed, with mean equal to the sum of the means and variance equal to the sum of the variances, i.e.
\[\begin{aligned} z_1 &\sim \mathcal{N}(\mu_x, \sigma_x^2) \\ z_2 &\sim \mathcal{N}(\mu_y, \sigma_y^2) \\ z_3 &= z_1 + z_2, \hspace{10mm} z_3 \sim \mathcal{N}(\mu_x + \mu_y, \sigma_x^2 + \sigma_y^2) \end{aligned}\]Here, the means of both distributions involving \(\epsilon_{t-1}\) and \(\epsilon_{t-2}\) are equal to zero, and by summing their variances, we obtain:
\[\begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t} \sqrt{\alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{ (\alpha_t - \alpha_t\alpha_{t-1}) + (1 - \alpha_t) } \epsilon_{t-2} \\ &= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{ 1 - \alpha_t\alpha_{t-1} } \epsilon_{t-2} \\ \end{aligned}\]We use \(\epsilon_t\) to denote the samples from the standard normal at different timesteps (for reasons that should now be clear), rather than \(\mathbf{z}_t\) (like Weng, 2021) or simply \(\epsilon\). Recursively expanding the definition of \(\mathbf{x}_t\) by three time steps, we see:
\[\begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{ 1 - \alpha_t\alpha_{t-1} } \epsilon_{t-2} \\ &= \sqrt{\alpha_t \alpha_{t-1}} \Bigg( \underbrace{ \sqrt{\alpha_{t-2}} \mathbf{x}_{t-3} + \sqrt{1 - \alpha_{t-2}} \epsilon_{t-3} }_{\mathbf{x}_{t-2}} \Bigg) + \sqrt{ 1 - \alpha_t\alpha_{t-1} } \epsilon_{t-2} & \mbox{Using expression for } \mathbf{x}_{t-2} \\ &= \sqrt{\alpha_t \alpha_{t-1}} \sqrt{\alpha_{t-2}} \mathbf{x}_{t-3} + \sqrt{\alpha_t \alpha_{t-1}}\sqrt{1 - \alpha_{t-2}} \epsilon_{t-3} + \sqrt{ 1 - \alpha_t\alpha_{t-1} } \epsilon_{t-2} \\ &= \sqrt{\alpha_t \alpha_{t-1} \alpha_{t-2}} \mathbf{x}_{t-3} + \sqrt{\alpha_t \alpha_{t-1} (1 - \alpha_{t-2})} \epsilon_{t-3} + \sqrt{ 1 - \alpha_t\alpha_{t-1} } \epsilon_{t-2} \\ &= \sqrt{\alpha_t \alpha_{t-1} \alpha_{t-2}} \mathbf{x}_{t-3} + \sqrt{\alpha_t \alpha_{t-1} (1 - \alpha_{t-2}) + 1 - \alpha_t\alpha_{t-1} } \epsilon_{t-3} & \mbox{Sum of independent normal r.v.s is normally distributed} \\ &= \sqrt{\alpha_t \alpha_{t-1} \alpha_{t-2}} \mathbf{x}_{t-3} + \sqrt{\alpha_t \alpha_{t-1} - \alpha_t \alpha_{t-1} \alpha_{t-2} + 1 - \alpha_t\alpha_{t-1} } \epsilon_{t-3} \\ &= \sqrt{\alpha_t \alpha_{t-1} \alpha_{t-2}} \mathbf{x}_{t-3} + \sqrt{1 - \alpha_t \alpha_{t-1} \alpha_{t-2} } \epsilon_{t-3} \\ \end{aligned}\]Generalizing this to \(N\) timesteps, it turns out we can write the closed-form expression for multiple steps at a time via:
\[\begin{aligned} \mathbf{x}_t &= \Bigg( \sqrt{ \prod_{i=t-N+1}^t \alpha_i} \Bigg) \mathbf{x}_{t-N} + \Bigg(\sqrt{ 1 - \prod_{i=t-N+1}^{t} \alpha_i} \Bigg) \epsilon \\ &= \sqrt{\bar{\alpha}_t} \mathbf{x}_{0} + \sqrt{1 - \bar{\alpha}_t} \epsilon_0 & \mbox{ Let } N = t \mbox{ and } \bar{\alpha}_t = \prod_{s=1}^t \alpha_s. \end{aligned}\]This is a surprising and very useful result: the degradation after \(N\) steps can be computed with a single closed-form expression, which will simplify the description of the training algorithm. A similar derivation can be found in Equations 61-70 of (Luo, 2022).
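A minimal sketch of this closed-form forward jump (a `q_sample`-style helper); the names and the choice of a linear \(\beta\) schedule here are illustrative, not taken from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # illustrative linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # \bar{alpha}_t = prod of alpha_s up to t

def q_sample(x0, t, eps=None):
    """Sample x_t ~ q(x_t | x_0) in one shot: sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    eps = rng.standard_normal(x0.shape) if eps is None else eps
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal((32, 32, 3))   # a stand-in "image"
xt = q_sample(x0, t=500)
```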
We can also use this same expression to derive \(\mathbf{x}_{0}\) from \(\mathbf{x}_t\), which will become useful later on:
\[\begin{aligned} \mathbf{x}_t &= \sqrt{\bar{\alpha}_t} \mathbf{x}_{0} + \sqrt{1 - \bar{\alpha}_t} \epsilon \\ \mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \epsilon &= \sqrt{\bar{\alpha}_t} \mathbf{x}_{0} \\ \frac{1}{\sqrt{\bar{\alpha}_t}} \Big(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \epsilon \Big) &= \mathbf{x}_{0} \end{aligned}\]When predicting the original sample \(\mathbf{x}_0\) from a noisy \(\mathbf{x}_t\), we use the predicted noise \(\mathbf{\epsilon}_{\theta}\), i.e.
\[\mathbf{x}_{0} = \frac{1}{\sqrt{\bar{\alpha}_t}} \Big(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \mathbf{\epsilon}_{\theta} \Big)\]Reverse Process
A neural network can parameterize the reverse-step distribution: \(p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\Big(x_{t-1}; \mu_{\theta}(x_t,t), \Sigma_{\theta}(x_t,t)\Big)\). However, we don’t need a neural network to predict \(\Sigma\); we can instead use a fixed schedule.
We can’t compute the log-likelihood directly, because marginalizing over all possible reverse trajectories \(\mathbf{x}_{1:T}\) is intractable. However, to minimize the negative log-likelihood (i.e., maximize the log-likelihood), we can bound it:
\[-\log\Big(p_\theta(x_0)\Big) \leq -\log\Big(p_\theta(x_0)\Big) + D_{KL}\Big(q(x_{1:T} \mid x_0) \hspace{1 mm} \| \hspace{1 mm} p_{\theta}(x_{1:T} \mid x_0)\Big)\]Compute the variational lower bound for the following objective:
When simplifying, we see that the intractable log-likelihood term cancels, leaving only tractable terms.
Gaussian Form of Forward Step, conditioned on \(x_0\)
\[q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \propto \exp \Big( -\frac{1}{2} \Big[ \frac{(\mathbf{x}_t - \sqrt{\alpha_t}\mathbf{x}_{t-1})^2}{\beta_t} + \frac{(\mathbf{x}_{t-1} - \sqrt{\bar\alpha_{t-1}}\mathbf{x}_0)^2}{1-\bar\alpha_{t-1}} - \frac{(\mathbf{x}_t - \sqrt{\bar\alpha_t}\mathbf{x}_0)^2}{1-\bar\alpha_t} \Big] \Big)\]Completing the square in \(\mathbf{x}_{t-1}\) shows this is a Gaussian, \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_{t-1}; \tilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I}\big)\) with \(\tilde{\beta}_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t\). See Equations 71-84 of (Luo, 2022).
Deriving mean of forward step
What is the form of \(\mu_q\), or \(\tilde{\mu}_t\) as (Weng, 2021) and (Ho, 2020) refer to it? We will want \(\mu_{\theta}\) to match this mean term.
First, we recall that we have a closed-form expression to obtain \(\mathbf{x}_0\) using the forward process \(q(\cdot)\), given \(\mathbf{x}_t\) and a sequence of \(\{\alpha_i\}\) values. Plugging this into the derived true denoising transition mean \(\mathbf{\mu}_q(\mathbf{x}_t, \mathbf{x}_0)\), we can rederive as:
\[\begin{align} \mathbf{\mu}_q(\mathbf{x}_t, \mathbf{x}_0) &= \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\mathbf{x}_{t} + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\mathbf{x}_0}{1 -\bar\alpha_{t}}\\ &= \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\mathbf{x}_{t} + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t) \Bigg( \frac{\mathbf{x}_t - \sqrt{1 - \bar\alpha_t}\mathbf{\epsilon}_0}{\sqrt{\bar\alpha_t}} \Bigg) }{1 -\bar\alpha_{t}} \\ &= \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\mathbf{x}_{t} + \sqrt{ \prod\limits_{i=1}^{t-1} \alpha_i }(1-\alpha_t)\frac{\mathbf{x}_t - \sqrt{1 - \bar\alpha_t }\mathbf{\epsilon}_0}{\sqrt{ \prod\limits_{i=1}^{t} \alpha_i}}}{1-\bar\alpha_t} & \mbox{Expand } \bar\alpha_{t} \mbox{ and } \bar\alpha_{t-1} \mbox{ as products }. \\ &= \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\mathbf{x}_{t} + \frac{ \sqrt{ \prod\limits_{i=1}^{t-1}\alpha_i} (1-\alpha_t) }{ \sqrt{ \prod\limits_{i=1}^{t} \alpha_i } }\Big( \mathbf{x}_t - \sqrt{1 - \bar\alpha_t }\mathbf{\epsilon}_0 \Big)}{1-\bar\alpha_t}\\ &= \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\mathbf{x}_{t} + \sqrt{ \frac{ \prod\limits_{i=1}^{t-1}\alpha_i }{ \prod\limits_{i=1}^{t} \alpha_i }} (1-\alpha_t)\Big( \mathbf{x}_t - \sqrt{1 - \bar\alpha_t }\mathbf{\epsilon}_0 \Big)}{1-\bar\alpha_t}\\ &= \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\mathbf{x}_{t} + \frac{1}{\sqrt{\alpha_t}} (1-\alpha_t) \Big(\mathbf{x}_t - \sqrt{1 - \bar\alpha_t}\mathbf{\epsilon}_0 \Big) }{1 -\bar\alpha_{t}}\\ &= \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\mathbf{x}_{t}}{1 - \bar\alpha_t} + \frac{(1-\alpha_t)\mathbf{x}_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}} - \frac{(1 - \alpha_t)\sqrt{1 - \bar\alpha_t}\mathbf{\epsilon}_0}{(1-\bar\alpha_t)\sqrt{\alpha_t}}\\ &= \left(\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1 - \bar\alpha_t} + \frac{1-\alpha_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}}\right)\mathbf{x}_t - \frac{(1 - \alpha_t)\sqrt{1 - \bar\alpha_t}}{(1-\bar\alpha_t)\sqrt{\alpha_t}}\mathbf{\epsilon}_0\\ &= \left(\frac{\alpha_t(1-\bar\alpha_{t-1})}{(1 - \bar\alpha_t)\sqrt{\alpha_t}} + \frac{1-\alpha_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}}\right)\mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}\sqrt{\alpha_t}}\mathbf{\epsilon}_0 & \mbox{Multiply top and bottom of leftmost expression by } \sqrt{\alpha_t} \\ &= \frac{\alpha_t-\bar\alpha_{t} + 1-\alpha_t}{(1 - \bar\alpha_t)\sqrt{\alpha_t}}\mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}\sqrt{\alpha_t}}\mathbf{\epsilon}_0\\ &= \frac{1-\bar\alpha_t}{(1 - \bar\alpha_t)\sqrt{\alpha_t}}\mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}\sqrt{\alpha_t}}\mathbf{\epsilon}_0\\ &= \frac{1}{\sqrt{\alpha_t}}\mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}\sqrt{\alpha_t}}\mathbf{\epsilon}_0 \end{align}\]Again, beautifully, this simplifies to a fairly clean, closed-form expression. This appears in (Ho, 2020) in Equation 10:
\[\mathbf{\mu}_q(\mathbf{x}_t, \mathbf{x}_0) = \mathbf{\tilde{\mu}}_t = \frac{1}{\sqrt{\alpha_t}} \Bigg( \mathbf{x}_t(\mathbf{x}_0, \epsilon) - \frac{\beta_t}{ \sqrt{1- \bar\alpha_t }} \epsilon \Bigg)\]See Equations 116 to 125 of (Luo, 2022) for a similar derivation.
Loss Function: Mean-Matching
In order to train the network \(\mu_\theta\), we need to define a loss function. We wish for \(\mu_\theta\) to match \(\mathbf{\mu}_q(\mathbf{x}_t, \mathbf{x}_0)\), also referred to as \(\mathbf{\tilde{\mu}}_t\), above.
Since \(\mathbf{x}_t\) is available as an input to the model, we can reparameterize the mean predictor in terms of a noise predictor \(\epsilon_{\theta}\) instead:
\[\begin{aligned} L_t &= \frac{1}{2 \sigma_t^2} \Bigg\| \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1- \bar{\alpha}_t}} \epsilon\Big) - \mu_{\theta}(x_t, t) \Bigg\|^2 \\ &= \frac{1}{2 \sigma_t^2} \Bigg\| \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1- \bar{\alpha}_t}} \epsilon\Big) - \frac{1}{\sqrt{\alpha_t}} \Big(x_t - \frac{ \beta_t }{ \sqrt{1-\bar{\alpha}_t} } \epsilon_{\theta}(x_t, t) \Big) \Bigg\|^2 \\ &= \frac{1}{2 \sigma_t^2} \Bigg\| \frac{1}{\sqrt{\alpha_t}}\Big(x_t\Big) - \Big(\frac{1}{\sqrt{\alpha_t}}\Big) \Big( \frac{\beta_t}{\sqrt{1- \bar{\alpha}_t}} \epsilon\Big) - \frac{1}{\sqrt{\alpha_t}} \Big(x_t\Big) + \Big(\frac{1}{\sqrt{\alpha_t}}\Big) \frac{ \beta_t }{ \sqrt{1-\bar{\alpha}_t} } \epsilon_{\theta}(x_t, t) \Bigg\|^2 \\ &= \frac{1}{2 \sigma_t^2} \Bigg\| \Big(\frac{1}{\sqrt{\alpha_t}}\Big) \frac{\beta_t}{\sqrt{1- \bar{\alpha}_t}} \Big( \epsilon_{\theta}(x_t, t) - \epsilon \Big) \Bigg\|^2 \\ &= \frac{1}{2 \sigma_t^2} \Bigg\| \Big(\frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1- \bar{\alpha}_t}}\Big) \Big( \epsilon_{\theta}(x_t, t) - \epsilon \Big) \Bigg\|^2 \\ &= \frac{1}{2 \sigma_t^2} \Big(\frac{\beta_t^2}{\alpha_t (1- \bar{\alpha}_t)}\Big) \Big\| \epsilon_{\theta}(x_t, t) - \epsilon \Big\|^2 \\ \end{aligned}\]The DDPM authors (Ho et al., 2020) found that dropping the weighting factor at the front yields better sample quality and a simpler implementation; the plain mean squared error \(L_t^{\text{simple}} = \| \epsilon_{\theta}(x_t, t) - \epsilon \|^2\) is sufficient.
Data Pre-processing
Scale pixel values from [0, 255] to [-1, 1], so that the inputs are on a scale comparable to the standard-normal noise added by the forward process.
Training Algorithm
- Sample an image \(\mathbf{x}_0\) from our dataset \(q(\mathbf{x}_0)\).
- Sample \(t \sim \mbox{Uniform}(\{1,\dots,T\})\).
- Sample noise from the normal distribution, \(\epsilon \sim \mathcal{N}(0, \mathbf{I})\)
- Take a gradient descent step on \(\nabla_\theta \| \epsilon - \epsilon_{\theta}(\mathbf{x}_t, t) \|^2\), where \(\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\epsilon\) comes from the closed-form forward expression (see the sketch after this list).
- Repeat until converged.
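A sketch of one training step that follows this recipe, assuming 4-D image batches and an arbitrary `model(x_t, t)` that predicts \(\epsilon\); the names and shapes are illustrative, not from the DDPM reference code:

```python
import torch

def ddpm_training_step(model, optimizer, x0, alpha_bar):
    """One gradient step of the simplified objective || eps - eps_theta(x_t, t) ||^2."""
    B = x0.shape[0]
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)        # t ~ Uniform (0-indexed)
    eps = torch.randn_like(x0)                              # eps ~ N(0, I)
    abar = alpha_bar[t].view(B, 1, 1, 1)
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps         # closed-form forward jump
    loss = torch.mean((eps - model(xt, t)) ** 2)            # simplified loss, no weighting
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```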
Sampling Algorithm
- Sample an image of pure noise \(\mathbf{x}_T \sim \mathcal{N}(0, \mathbf{I})\).
- for \(t= T,\dots,1\) do:
\(\mathbf{z} \sim \mathcal{N}(0, \mathbf{I})\) if \(t>1\), else \(\mathbf{z} = \mathbf{0}\).
\(\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(\mathbf{x}_t, t) \Big) + \sigma_t \mathbf{z}\).
- end for
- Return \(\mathbf{x}_0\) (see the sampling sketch below).
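A matching sketch of the sampling loop, using \(\sigma_t = \sqrt{\beta_t}\) (one of the two variance choices discussed in DDPM); `betas` is assumed to be a 1-D tensor and `model` the trained noise predictor:

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    """Ancestral sampling: start from pure noise and iteratively denoise."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                  # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):                   # t = T-1, ..., 0 (0-indexed)
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps = model(x, torch.full((shape[0],), t))
        coef = (1 - alphas[t]) / (1 - alpha_bar[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt() + betas[t].sqrt() * z
    return x
```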
Model architecture
U-Net backbones are most common. The authors of [7] introduced Adaptive Group Normalization (AdaGN), which multiplies the group-normalized activations by a linear projection of the timestep embedding (and adds a learned shift).
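A rough sketch of how such an adaptive normalization layer can look; this is my reading of the mechanism (scale and shift predicted from the timestep embedding), not the reference implementation, and the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """GroupNorm whose scale/shift are predicted from a timestep embedding."""
    def __init__(self, channels, emb_dim, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        self.proj = nn.Linear(emb_dim, 2 * channels)   # -> (scale, shift)

    def forward(self, h, t_emb):
        scale, shift = self.proj(t_emb).chunk(2, dim=-1)
        scale = scale[:, :, None, None]
        shift = shift[:, :, None, None]
        # (1 + scale) keeps the layer near identity at initialization.
        return self.norm(h) * (1 + scale) + shift

h = torch.randn(2, 64, 32, 32)      # feature map
t_emb = torch.randn(2, 256)         # timestep embedding
out = AdaGN(64, 256)(h, t_emb)
```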
With some notes taken from Justin Johnson’s class slides.
Forward process is fixed (adding noise), so we only need to learn a model for the reverse process. In VAE’s, we needed to learn two models. (See Ari Seff’s video)
Implementation: DDPM Nano
VQ-GAN
A learned codebook, with a CNN encoder, vector quantization of the encoder outputs, a CNN decoder, and a CNN discriminator; a Transformer is then trained over the discrete codes.
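A tiny sketch of the shared vector-quantization step (nearest-codebook-entry lookup); the straight-through gradient and the commitment/codebook losses are omitted, and all shapes are illustrative:

```python
import numpy as np

def quantize(z_e, codebook):
    """Map encoder outputs z_e (N, D) to their nearest codebook vectors (K, D)."""
    # Squared distances between each latent and each codebook entry.
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (N, K)
    idx = d.argmin(axis=1)                                        # discrete codes
    return codebook[idx], idx

rng = np.random.default_rng(0)
codebook = rng.standard_normal((512, 64))      # K=512 entries of dimension 64
z_e = rng.standard_normal((16, 64))            # 16 encoder outputs
z_q, codes = quantize(z_e, codebook)
```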
VQ-VAE
Survey paper:
https://arxiv.org/pdf/2310.07204.pdf
DALL-E
DALL-E (Ramesh et al., 2021)
Step 1. Train VQ-VAE (discrete grid of latent codes). Step 2. Train autoregressive Transformer model to predict sequence of latent codes (Giant model on 250M image/text pairs). Step 3. Given text prompt, sample new image codes; pass through VQ-VAE decoder to generate images.
GLIDE
\[q(x_t \mid x_{t-1}) := \mathcal{N}(x_t; \mu= \sqrt{\alpha_t} x_{t-1}, \Sigma=(1-\alpha_t) \mathbf{I})\]The reverse step is the posterior \(q(x_{t-1} \mid x_t )\), which we will learn a model for:
\(p_{\theta}(x_{t-1} \mid x_t) := \mathcal{N} (\mu_{\theta}(x_t), \Sigma_{\theta}(x_t))\) Start with Gaussian noise \(x_T \sim \mathcal{N}(0, \mathbf{I})\).
Rather than predicting the image itself, we predict the noise \(\epsilon\) that was used to corrupt \(x_0\) into \(x_t\); during training the true noise is known.
Classifier-Free Guidance
Predict the conditional noise \(\epsilon_{\theta}(x_t \mid y)\). If we sometimes drop the label \(y\) during training, the same network also serves as an unconditional generator \(\epsilon_{\theta}(x_t \mid \emptyset)\).
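At sampling time the two predictions are combined; one common way to write the guided noise estimate, with a guidance scale \(w\) (my notation), is:

\[\hat{\epsilon}_{\theta}(x_t \mid y) = \epsilon_{\theta}(x_t \mid \emptyset) + w \Big( \epsilon_{\theta}(x_t \mid y) - \epsilon_{\theta}(x_t \mid \emptyset) \Big)\]

Setting \(w = 1\) recovers the conditional model, while \(w > 1\) magnifies the direction toward the conditioning information, which matches the geometric picture mentioned below.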
see Yannic Kilcher’s GLIDE video.
“Classifier Free Guidance” Ho and Salimans. Sander Dieleman: https://youtube.com/watch?v=9BHQvQlsVdE. Uncommon to hear these principles discussed with such clarity. CFG becomes obvious when we consider geometry (magnify direction in space to make samples look more like the conditioning info).
Used in Imagen
DALL-E 2
Imagen
Latent Diffusion/ Stable Diffusion
Progressive Distillation
Progressive distillation (Salimans &amp; Ho, 2022).
v-parameterization
Instead of parameterizing the model to predict \(x_0\) or \(\epsilon\), we can parameterize it to predict \(v\).
At sampling time, take only a small step in the direction you predicted (a step that is too large would lead to a blurry image): two steps forward, one step back. This makes sampling robust to errors from the neural network.
Because of the linear relationship, \(x_t\) is a linear combination of \(x_0\) and the noise \(\epsilon\); given a prediction of one, we can always convert it to a prediction of the other.
\[x_t = \alpha(t) x_0 + \sigma(t) \cdot \epsilon\] \[v = \alpha(t) \cdot \epsilon - \sigma(t) \cdot x_0\]What follows are notes on a lecture covering a perspective on generative modeling, the concept of iterative refinement, and diffusion models.

Consider generative modeling from a probabilistic perspective: what we are trying to obtain is a model of a distribution over some dataset. Images are a convenient running example because they are easy to visualize: given a collection of images, what is the underlying distribution these images might have been drawn from? That collection is the dataset we train our generative model on.

It is worth distinguishing explicit from implicit probabilistic generative models. In an explicit model, you explicitly formulate a (usually parameterized) version of the distribution \(p(x)\) and fit its parameters so that it represents the data well. In an implicit model, you do not have direct access to \(p(x)\) through the model, but you can draw samples from it; the canonical example is the generative adversarial network, where you feed in noise and a sample comes out, but you never actually know \(p(x)\). Even with explicit models you do not necessarily know \(p(x)\) exactly: some models only give it up to a normalization constant, or only give an approximate estimate.

Another important aspect of generative models is conditioning: feeding extra signals into the model that let us influence the distribution being modeled, so that we are really modeling a conditional distribution \(p(x \mid c)\) rather than \(p(x)\). For example, with ImageNet and its 1,000 classes we could build a class-conditional image model where the class label is the conditioning signal. For images we could also condition on a bounding box specifying where an object should appear, on a segmentation mask, on a textual prompt (a very common choice nowadays), or even on a grayscale image, asking the model to fill in only the color information. It is useful to place these on an axis of conditioning density: how much information the conditioning signal already provides, versus how much the generative model has to fill in. A class label over 1,000 classes carries only \(\log 1000\) bits or nats, which is very little, whereas a grayscale image already provides roughly half the bits.

A textual prompt sits closer to the sparse end of that axis: it is denser than a class label, but the total information in a prompt is tiny compared to the information in an image. A digital image with three 8-bit channels and hundreds of thousands of pixels contains millions of bits, while a prompt can usually be represented in tens or hundreds of bits, orders of magnitude less.

Another important way to classify generative models is whether they exhibit mode-covering or mode-seeking behavior. When we fit a generative model to a data distribution, the model is usually underspecified: it is not flexible enough to capture all the intricacies and nuances of the data. Schematically, suppose the data distribution is a mixture of two Gaussians and our model is a single Gaussian. The mode-covering compromise says that wherever the data distribution has probability mass, the model must also put some mass there; the single Gaussian then sits between the two modes, covering both but also placing a lot of mass in between, where there should not be any, so it overgeneralizes. The mode-seeking compromise instead says that samples from the model should look like they came from the data distribution, even if some parts of the data distribution are skipped entirely. If the two modes are cats and dogs, a mode-covering model will sometimes produce strange cat-dog hybrids, whereas a mode-seeking model might produce only cats and never bother with dogs. Adversarial models fall into this second, mode-seeking category.
Now, let’s talk about iterative refinement, a core concept behind most generative models in practical use today. The idea is that the model does not produce a sample from the distribution in one go; instead it crafts the sample iteratively, repeatedly updating a kind of canvas. There are two dominant approaches. The first is autoregression: represent each example as a one-dimensional sequence and predict that sequence one element at a time, where each new prediction can see all previously predicted elements, so the sequence is generated recursively. The second, which has gained a lot of popularity in recent years for generative models of perceptual media, is diffusion: define a corruption process that gradually removes information from training examples by adding noise, typically Gaussian noise. If you keep adding noise for long enough, eventually all the information originally in the image is destroyed and what remains looks like pure random noise; you then learn to invert that corruption process step by step.

Autoregression is based on the chain rule of probability: decompose the joint distribution over a sequence into sequential conditionals, the probability of each element given all elements that came before it, whose product recovers the joint distribution. This is an explicit model. For images you must first pick some ordering of the pixels, usually raster-scan order (left to right, row by row). If the statistics of the data are well behaved, as they are for natural signals, you can reuse a single model for every step of the sequence and simply apply it recursively. PixelRNN and PixelCNN (from colleagues at DeepMind) applied this idea directly in pixel space, predicting pixels row by row; it works, but it does not scale particularly well, which is why these models are rarely seen in the wild nowadays.

(Audience question: an image has neighboring pixels in all directions, so why condition only on the previous pixels rather than on the neighbors?) Conditioning on neighbors would be a problem at sampling time, because you have to define some order in which to generate the pixels and you cannot rely on pixels that have not been generated yet. The order itself is arbitrary: the chain rule lets you decompose the joint distribution in any order you like, and left-to-right, top-to-bottom is just a convention. For language modeling the order is obvious, since language is already an ordered sequence, but for other kinds of data you have to make this arbitrary choice; one of the nice things about diffusion is that you do not have to.

The same idea applies to audio: autoregression in amplitude space for waveform generation, a 2016 project of the speaker's (the predecessor of the later, non-autoregressive Parallel WaveNet). Sound is represented as a one-dimensional waveform, quantized both in time and in amplitude, giving a time series that can be generated autoregressively just like language tokens. It trains nicely but does not scale well at sampling time, because you have to generate very long sequences: speech at 16 kHz means 16,000 predictions for every second of audio, while audio applications typically want real-time generation.

A broader trend, which we will also see in diffusion modeling, is to move to a latent space: rather than modeling the raw digital representation of images or video directly, first learn a more compressed, higher-level representation, and then run the autoregressive (or other) generative model in that space, saving compute and simplifying the problem. VQ-VAE and VQ-GAN are the canonical papers here: they are autoencoders that allow you to learn higher-level discrete representations of essentially any kind of input.
(Audience question: does working in a latent space make the generative model more or less likely to overfit the data?) Working in a compressed latent space means that images that are technically different, because they have different pixel values, can be mapped to the same latent. That is the point of the latent: to abstract away variability in the input that does not matter perceptually. For example, take an image of a cow in a field, with a grassy texture across the bottom half. If you go into Photoshop, shift that grass one pixel to the right, and leave the rest of the image unchanged, you cannot tell the two images apart, because you perceive the texture, not the individual pixels. Latent models let us abstract away these textural variations, which carry a great deal of entropy but are not perceptually relevant, so trying to learn them is largely wasted effort. Incidentally, this is also why PixelCNN and PixelRNN do not really scale: they spend their capacity learning all of this local entropy and never get around to learning the logical, global structure of the inputs.

As for overfitting versus underfitting: the autoencoder itself is underfitting in a sense, by design, since it cannot represent every possible variation; that is exactly the abstraction we want. A generative model trained in the resulting latent space is then somewhat more prone to overfitting, because there is less diversity left in its input data. In practice, for large-scale text-to-image and text-to-video systems such as Stable Diffusion, overfitting is not really a concern, because the training sets are absolutely massive. With very limited data you might see more overfitting from the latent approach, in which case pre-training on a large dataset and then fine-tuning can help.

The main topic here, though, is diffusion models, and the goal is to build intuition for them. The literature is dense and every paper has its own notation and its own perspective, but once the pieces fall into place the underlying ideas turn out to be quite intuitive.
Diffusion is based on iterative denoising. Define a corruption process that runs from left to right: take an image, say of a rabbit, and repeatedly add noise to it; if you keep doing this long enough, the rabbit eventually disappears and all you see is noise. The diffusion model will try to invert this process step by step.

Let's look at what the forward (corruption) process actually looks like. Training examples are denoted \(x_0\), because we are considering a process with a notion of time in which the information in the signal is gradually destroyed: time zero is the clean training example. At every time step we add a little bit of noise, typically Gaussian noise with a very small variance. After \(t\) steps we have a data point \(x_t\) that is a mixture of the original signal and the accumulated noise, and if we kept going an infinite number of times we would end up with something indistinguishable from Gaussian noise.

The nice thing about using Gaussian noise is that the sum of Gaussian random variables is again Gaussian, with a larger variance: adding a little bit of noise many times is equivalent to adding a lot of noise in one go. This lets us write down a formula for any point \(x_t\) in the process without stepping through it, which is very useful for training, where we have to artificially corrupt our training examples: \(x_t = x_0 + \sigma_t \epsilon\), where \(\epsilon\) is a standard normal variable and \(\sigma_t\) is the total standard deviation accumulated by time step \(t\).
In practice we make this slightly more complicated for practical reasons: rather than only adding noise, we also slightly scale down what we have so far, to keep the variance from growing uncontrollably. We introduce an extra factor \(\alpha_t\) that scales down \(x_0\), giving \(x_t = \alpha_t x_0 + \sigma_t \epsilon\). We also do not run the process an infinite number of times, which we could not afford anyway; thanks to the scaling factor we do not have to, and we can define an endpoint \(T\) for the process (often, for convenience, \(T = 1\), with the parameters adapted accordingly so everything works out).

Here \(\sigma_t\) is the noise schedule: it controls the rate of corruption at different times and can be quite a nonlinear function of \(t\). \(\alpha_t\) is the scale factor, and there are several possible choices:

- Variance preserving: \(\alpha_t = \sqrt{1 - \sigma_t^2}\). If the data distribution is zero-mean and unit-variance, and \(\epsilon\) is of course also zero-mean and unit-variance, then every intermediate noisy representation \(x_t\) also has unit variance, which is nice from a numerical point of view: when feeding things into neural networks you want the scale to be well behaved.
- Variance exploding: \(\alpha_t = 1\), i.e., no rescaling at all. Here you have to choose an endpoint somewhat arbitrarily, because the process cannot be run to infinity; usually there is some variance at which essentially all of the signal is gone, so you run the process up to, say, a variance of 100 and stop there.
- Rectified flow / flow matching: \(\alpha_t = 1 - \sigma_t\), dropping the square root and the squaring. This choice has been gaining popularity, and it is really just another form of diffusion with a different choice of \(\alpha_t\).
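A small sketch of these three choices of \(\alpha(t)\) as a function of the noise level \(\sigma(t)\); the function and argument names are mine, not standard API:

```python
import numpy as np

def alpha_schedule(sigma, kind="variance_preserving"):
    """Scale factor alpha(t) for a given noise level sigma(t)."""
    if kind == "variance_preserving":      # alpha^2 + sigma^2 = 1
        return np.sqrt(1.0 - sigma ** 2)
    elif kind == "variance_exploding":     # no rescaling at all
        return np.ones_like(sigma)
    elif kind == "rectified_flow":         # alpha + sigma = 1
        return 1.0 - sigma
    raise ValueError(kind)

def corrupt(x0, sigma_t, kind="variance_preserving", rng=np.random.default_rng(0)):
    """x_t = alpha(t) * x0 + sigma(t) * eps, for the chosen scale factor."""
    eps = rng.standard_normal(x0.shape)
    return alpha_schedule(sigma_t, kind) * x0 + sigma_t * eps, eps
```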
We have talked about the corruption process at length; how do we actually train a model to invert it? In the backward process we again apply small updates to the input, but now we start from noise and move toward data, and the question is how to find each update: it is not just random noise, but a signal that has to bring back some of the structure the noise destroyed.

It helps to visualize this in two dimensions, with an important caveat. These are very high-dimensional objects: even a 32×32 RGB image is 3,072 scalar values, i.e., a point in a roughly 3,000-dimensional space. It is generally dangerous in machine learning to draw conclusions from low-dimensional pictures and generalize them to high-dimensional spaces, but it is still instructive to visualize what happens when we sample from a diffusion model, so we do it anyway with that caveat in mind.

So represent an image \(x_0\) as a point in this space; adding noise to it gives a different point \(x_t\). What the diffusion model actually does is try to predict, from \(x_t\), where \(x_0\) was. This is a difficult and ill-defined problem: the whole point of adding noise is that it destroys some of the information in the signal, so we cannot recover exactly where \(x_0\) was. We make a prediction anyway, and that prediction, viewed as an image, is blurry: many different original images could have given rise to this particular noisy image, and what we end up predicting is roughly the centroid, the mean, of all those plausible originals. The prediction points to a general region of input space rather than to any particular image, so it does not really make sense to think of it as an image in its own right.

The next step in the backward process is to take a small step in the predicted direction. The analogy with optimization is worth keeping in mind: this is a form of gradient descent, with a little noise in the mix. We take only a small step, because a step that is too large would land us on that blurry centroid; by taking small steps and re-orienting ourselves with a fresh prediction at each step, we are able to sample from the data distribution. Then, typically, and unlike optimization, after taking the small step we add a little bit of the noise back.
Adding noise back might seem counterintuitive. There are theoretical reasons for it that we will not go into, but the intuitive reason is that it turns the procedure into a "two steps forward, one step back" process, which tends to be more robust to systematic errors in the predictions. The predictions come from a neural network, the network is not perfect, and because it is applied repeatedly, errors can accumulate if we are not careful; adding a little noise back, which some (though not all) diffusion sampling algorithms do, counteracts this.

We then simply repeat the process: take a small step in the predicted direction, add a bit of noise, and make a new prediction. The new prediction differs from the previous one because we now have a little more information, there being slightly less noise in the current image; we are still not sure exactly where the image came from, but the new predicted centroid corresponds to a smaller region of input space. Rinse and repeat: subject to some details glossed over here, what you get is a biased random walk toward what we believe the clean input \(x_0\) to be, with that belief being refined over the course of the walk.

(Audience question: is the noise added during sampling subject to the noise schedule discussed earlier?) Yes, it is.
So far this has been framed as the diffusion model predicting the original image \(x_0\). If you have read diffusion papers you might question that, because what is often predicted instead is \(\epsilon\), the standardized noise that was added to corrupt the data. In some sense the two are equivalent: because of the linear relationship between \(x_t\) (the input to the network), \(x_0\), and \(\epsilon\), a prediction of one can always be converted into a prediction of the other. In practice they are not exactly equivalent, because the choice of parameterization affects the relative weighting of noise levels in the loss, which, as we will see, matters a lot; once the model is trained, though, they are interchangeable. Nor do you have to predict exactly \(x_0\) or \(\epsilon\): you can choose to predict any linear combination of the two, and other variants in the literature, such as \(v\)-prediction and flow matching, make different choices here.
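Concretely, with \(x_t = \alpha_t x_0 + \sigma_t \epsilon\) as above, the two parameterizations are related by:

\[\hat{\mathbf{x}}_0 = \frac{\mathbf{x}_t - \sigma_t \hat{\epsilon}}{\alpha_t}, \qquad \hat{\epsilon} = \frac{\mathbf{x}_t - \alpha_t \hat{\mathbf{x}}_0}{\sigma_t}\]

In the variance-preserving case (\(\alpha_t^2 + \sigma_t^2 = 1\)), the \(v\)-prediction defined earlier also converts directly, e.g. \(\hat{\mathbf{x}}_0 = \alpha_t \mathbf{x}_t - \sigma_t \hat{v}\).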
In summary, training a diffusion model looks like this: for each training example \(x_0\), pick a random time step \(t\) in the corruption process; use the closed-form formula above to corrupt \(x_0\) into \(x_t\) in one go, without running the corruption step by step; train a neural network that takes \(x_t\) (and \(t\)) as input and predicts \(x_0\) or \(\epsilon\); and minimize the squared prediction error. The training objective is literally just a mean squared error between the prediction and its target. As noted before, the problem is hard and somewhat ill-defined because information is missing from the input, so the error cannot be driven to zero and the prediction will look blurry if visualized in image space; but the fact that the whole thing is trained with a plain, well-studied mean squared error makes it easy to optimize.
(Audience question: does the mean squared error compare against the "global" noise, rather than the noise added since the previous step?) Yes, \(\epsilon\) here is the global quantity, not the one-step noise. Because of the Gaussian property discussed earlier, many small additions of noise can be aggregated into one big addition, which is what lets us write the formula that jumps from the starting point to any point in the corruption process. And because that relationship is linear, a prediction of \(x_0\) can always be converted into a prediction of what \(\epsilon\) must have been: \(x_t\) is essentially a linear interpolation between a sample from a Gaussian and a sample from the data distribution.

(Question: is the prediction more accurate when \(t\) is small than when \(t\) is large?) It depends on the prediction target. If you predict \(\epsilon\), then at low time steps, meaning low noise, it is a really hard problem, because you are trying to predict the standardized noise while barely being able to distinguish it in the input; at high time steps it becomes easy, because you are basically copying the input. For \(x_0\)-prediction it is the other way around: really easy at low noise levels, really hard at high noise levels.

(Question: can diffusion be defined for noise that is not Gaussian?) Yes, and there have been attempts, but very often things stop being tractable; for instance, jumping directly to an arbitrary point in the process may become intractable. There are formulations, such as diffusion on the simplex, where it does not make sense to use Gaussians and different distributions are used, and the math can be made to work out, although you tend to get things like Bessel functions appearing. The general recommendation: if you can use Gaussians, stick with Gaussians and keep your life simple.

(Question: when the model is given \(x_t\), does it also get to know \(t\)?) Yes: you tell the network where in the corruption process you currently are, along with the noisy input.
All right, a quick summary of diffusion sampling: once you have a trained model, at each sampling time step t we make a prediction and then take a small step in the predicted direction to partially denoise the input. And, fair point, as you can see here we also pass t to the network, telling it how much noise to expect in the input, although that is not always strictly necessary; sometimes it is easy to infer the noise level from the input itself. In some sampling algorithms we then optionally add some noise back, and we repeat. That is my intuitive summary of how diffusion works.
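Here is a minimal sketch of that loop, using the DDPM-style ancestral update as one concrete choice; the `model` callable (predicting ε) and the schedule are assumptions carried over from the sketch above:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                  # same hypothetical schedule as above
alphas, alpha_bar = 1.0 - betas, np.cumprod(1.0 - betas)

def sample(model, shape):
    """DDPM-style ancestral sampling: predict the noise, take a small step in the
    predicted denoising direction, optionally add some noise back, and repeat."""
    x = np.random.randn(*shape)                     # start from pure Gaussian noise x_T
    for t in reversed(range(T)):
        eps_hat = model(x, t)                       # model predicts the (global) noise
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        x = mean + (np.sqrt(betas[t]) * np.random.randn(*shape) if t > 0 else 0.0)
    return x
```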
Now for something completely different: score matching. It looks different at a glance, but I'll show you that it is actually the same thing. I've talked before about fitting probability distributions p(x). One annoying thing about fitting p(x) to data is that p(x) has to be a valid distribution, so it has to be normalized; you have this normalization constant, and keeping things normalized can create a lot of headaches. So people came up with an alternative way to fit distributions, called score matching, based on the score function: you don't fit p(x) directly, you fit the gradient of log p(x) with respect to x. Think of this quantity as a gradient that points towards regions of high density; it tells you how to move x to make it more likely under the distribution. You can match it against the target distribution's score using mean squared error, and in theory that also allows you to fit distributions, because the normalization constraint means that if we know the gradient of log p(x), we also know p(x): normally, knowing the gradient of a quantity determines it only up to a constant, but here that constant is fixed by the normalization requirement. So we can minimize this MSE over scores, except of course we can't, because we don't have the score values for our ground-truth training data. There is a trick, though, called denoising score matching: instead of directly trying to match the scores of the true distribution, we add a little bit of Gaussian noise to the data, and instead of matching the scores of the data distribution we match the scores of the conditional distribution, given a clean example, of
a noisy example. This conditional distribution we know, because we chose it to be Gaussian. And it turns out that if you do this conditional thing, where you add noise, the loss you get is different but has the same minimum as the original score matching loss. So this is a way to make score matching actually tractable in practice for real data.
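Written out with a Gaussian perturbation kernel \(q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}}; \mathbf{x}, \sigma^2 \mathbf{I})\) (the standard form; see Yang Song's post referenced below), the denoising score matching objective is
\[\begin{aligned} \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x}),\, \tilde{\mathbf{x}} \sim q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})} \Big[ \big\| s_\theta(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) \big\|_2^2 \Big], \qquad \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) = -\frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma^2} \end{aligned}\]
so the regression target is just the (scaled) noise that was added, which is what makes the connection to ε-prediction below work.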
It then turns out that this conditional score is exactly the ε we talked about before: it is the same as the ε prediction we were using to train diffusion models, up to a scaling constant. I wanted to mention this connection because you will often see it in the literature, but also because it leads naturally to the next formulation.
We can describe diffusion in terms of stochastic differential equations: the corruption process can be defined as an SDE, which sounds scary but is really just a differential equation with an additional term that injects noise (the dW term). There is some theory associated with SDEs that then lets you write down an SDE for the backward process, which is what is shown at the bottom of the slide, and what appears in this backward SDE is exactly the score function. So having a predictor of the score function allows you to plug it into this reverse SDE and use it to draw samples from a generative model.
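For reference, these are the standard forms from Song et al. (2021): a forward SDE with drift \(\mathbf{f}\) and diffusion coefficient \(g\), and the corresponding reverse-time SDE, in which the score \(\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\) appears (\(\bar{\mathbf{w}}\) is a Brownian motion running backwards in time):
\[\begin{aligned} d\mathbf{x} &= \mathbf{f}(\mathbf{x}, t)\, dt + g(t)\, d\mathbf{w} \\ d\mathbf{x} &= \big[ \mathbf{f}(\mathbf{x}, t) - g(t)^2\, \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \big]\, dt + g(t)\, d\bar{\mathbf{w}} \end{aligned}\]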
This is often the way it is framed in the literature, which is why I wanted to mention it. Another interesting connection I want to talk about is the equivalence between diffusion models and flow-based models, and this one is a little bit crazy. If we have an SDE describing a stochastic process, we can ask at every time step: what is the distribution of x_t at that point? On this diagram, which I have conveniently stolen from Yang Song's paper, you can see a toy example where we start with a distribution that is a mixture of two Gaussians and diffuse it into a single Gaussian. You can do that by following the SDE, adding infinitesimal bits of noise, since time is a continuous variable here, and then ask at each time step what distribution we now observe for x_t. It turns out that you can also write down an ordinary differential equation that gives rise to the same distributions at each time step. Being an ODE, it no longer describes a stochastic process but a deterministic evolution of the input, yet at any time slice of the process you would not be able to tell which of the two you are running just by looking at x_t. That is really cool, because it gives us a way to sample from diffusion models deterministically, without having to add noise at each step. And a model that lets you morph one distribution into another deterministically is exactly a flow-based model; that is the connection between diffusion models and flow-based models.
I also like to think of diffusion models as a kind of RNN trained without backpropagation through time. Let me explain. The sampling procedure for a diffusion model is a repeated application of the denoiser network, updating a canvas: we start at x_T and move towards x₀ by repeatedly applying the denoiser. We can unroll that whole sampling loop and look at it as a very deep computational graph; it is a neural network in its own right, one in which a set of layers is reused many times. Viewed as a very deep network, you could say: we can just train it with backpropagation, essentially backpropagation through time as in recurrent neural networks. You can do that, and the resulting model is called a continuous normalizing flow in the literature. But, as we found out more recently, you can also train it with score matching, and then you only need to backpropagate through a single application of the denoiser rather than through all the repeated applications. So in a sense, diffusion models let you train very deep recurrent neural networks without backpropagation through time. I think a corollary of this is that in many years, when training neural networks with hundreds of thousands of layers is no longer a burden, we might end up going back to training VAEs for everything: just train a variational autoencoder with 100,000 layers, and it will probably be as powerful computationally as this diffusion process.

I also want to talk a little bit about why diffusion works so well for images, because there is an interesting, I would say unstable, equilibrium right now: two main classes of generative models are popular at scale, autoregression and diffusion. Autoregression is always used for language, and diffusion is used for pretty much everything else. Why is that? Why did diffusion come in and take over image modelling? It is worth doing a little bit of signal processing to understand it. If you look at the spectrum of spatial frequencies in natural images, it has a very particular shape: it follows a power law, so if you plot it on a log-log plot, where both axes are logarithmic, you get a straight line.
This line usually has a particular slope as well, of around minus two; it is almost a law of nature, which is very interesting. If we do the same thing with a sample from a Gaussian distribution and plot its spectrum, we get a horizontal line, because Gaussian noise by definition has the same magnitude at every frequency. Now, what we do in diffusion models is add the two together: we take our natural images and add Gaussian noise, and if you look at the resulting spectrum on the log-log plot, you get a kind of hinge shape.
You can do this with different noise levels: comparing one noise level with a higher one, which lifts the energy of the Gaussian noise, you see the hinge move up. What I want you to take away from this is that adding Gaussian noise obscures the high frequencies in images but leaves the low frequencies untouched, because the low frequencies have much more power, much more magnitude. So you can control which frequencies you are removing by changing the amount of noise you add.
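A rough sketch of how one might visualize this hinge effect (the placeholder image and the exact binning are illustrative assumptions, not from the talk):

```python
import numpy as np

def radial_power_spectrum(img):
    """Radially averaged power spectrum of a square grayscale image."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h / 2, x - w / 2).astype(int)   # integer frequency bins
    counts = np.maximum(np.bincount(r.ravel()), 1)
    return np.bincount(r.ravel(), weights=power.ravel()) / counts

img = np.random.rand(256, 256)        # placeholder; use a natural image here
for sigma in (0.0, 0.1, 0.5):         # more noise lifts the flat (noise) part of the spectrum
    spectrum = radial_power_spectrum(img + sigma * np.random.randn(256, 256))
    # Plotting log(spectrum) against log(frequency) shows the hinge shape moving up.
```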
This is the link between diffusion models and autoregressive models, because in a sense you could say that a diffusion model performs spectral autoregression: diffusion models predict images autoregressively, but in frequency space, first predicting the low frequencies and then gradually adding the higher ones. I did not come up with this; there is a really nice paper called “Generative Modelling with Inverse Heat Dissipation” that I took this idea from.
They have a nice figure in there as well, and it gave me this insight of a connection between the variance of the noise we are adding, the corresponding spatial frequencies in images, and therefore also the scale of the features we are modelling at each noise level. This is why it is so important to correctly weight the different noise levels in the loss function of a diffusion model: you are basically telling the model which feature scales you care about and which you don't, and you end up with a loss that is actually a better perceptual loss function than a likelihood in input space would be. So, as I already touched on, the weighting and the spacing of the different noise levels are very important. The weighting in the loss function determines the relative importance of the noise levels, and hence the relative importance of different feature scales. This choice depends on a few things. One is perceptual relevance: human eyes are more sensitive to low frequencies than to high frequencies,
so it makes sense to spend more model capacity on the low frequencies. Another is that, depending on how much noise you add to the input, the learning task becomes harder or easier. Those are two things you have to take into account when choosing the relative weighting of time steps. Unfortunately, the formalism we use for diffusion models means there are lots of different knobs that affect the weighting of the noise levels: the noise schedule σ(t) itself; an explicit time-dependent weighting factor in the loss; the distribution the time steps are sampled from during training, which does not have to be uniform; and, as mentioned before, whether you predict x₀ or ε also affects the relative weighting of the noise levels. Moving to sampling time, the weighting no longer matters, because the model has already been trained, but now we have to choose how to space the noise levels at which we evaluate the model. This again depends on perceptual relevance, but also on how accurate our model is at a particular noise level and on how much potential there is for error accumulation: predictions made early in the sampling process have more profound implications for where we end up than those made later. It also depends on which sampling algorithm we use, i.e. whether or not we inject noise to make it more robust to error accumulation. The spacing is affected by the noise schedule σ(t) and potentially by an explicitly non-uniform spacing of the time steps. I have mostly covered this already, and I talked about latent diffusion.
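One way to see all of these knobs at once is to write the training loss in a generic weighted form (a common parameterization, not a formula from the talk): the schedule \((\alpha(t), \sigma(t))\), the weighting \(w(t)\), the time-step distribution \(p(t)\), and the prediction target (predicting \(\mathbf{x}_0\) instead of \(\epsilon\) amounts to a rescaling of \(w(t)\)) all appear explicitly:
\[\begin{aligned} L(\theta) &= \mathbb{E}_{t \sim p(t),\; \mathbf{x}_0 \sim p(\mathbf{x}_0),\; \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \Big[ w(t)\, \big\| \epsilon_\theta(\mathbf{x}_t, t) - \epsilon \big\|_2^2 \Big], \qquad \mathbf{x}_t = \alpha(t)\, \mathbf{x}_0 + \sigma(t)\, \epsilon \end{aligned}\]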
Guidance
I want to talk about guidance as well, because guidance is what I like to call a cheat code for diffusion models: it allows them to perform above their pay grade, in a sense. I'm going to reuse the diagram I also used to explain the backward process, because we are going to make some modifications to it and it is nice to be able to visualize them. As before, when sampling from the diffusion model, we make a prediction for x₀; this prediction is not going to be very accurate, because there is missing information. But now we also bring in a classifier, say an ImageNet classifier, which uses a soft argmax (softmax) to give us class probabilities, and we can ask: what is the direction in input space in which I should move this image to make it adhere more to a particular class, to make that class more likely? That is just the gradient of the classifier. So that gives us another direction in input space, alongside the direction the diffusion model predicted, and we can combine the two into a new direction and take a step in that direction instead. We are modifying the sampling process as we go along, and if you think about it for a minute, you can see this as an application of Bayes' rule: we are working with gradients of log probabilities, but we can undo the gradient and the log operations to go back to a formula in terms of probabilities, which is what I have done here. At the bottom you see the gradients, the score functions, and at the top you see what this corresponds to in probability space. What classifier guidance allows us to do is take an unconditional diffusion model, add a classifier with an arbitrary dictionary of classes, and turn the model conditional after the fact, without retraining it. That is the power of classifier guidance. But the real power is unlocked when we don't just use this gradient as is, but apply a scale factor to it: we scale it by some factor γ, called the guidance scale, and use that modified direction. What this says is: I really, really would like this image to adhere to this particular class, so make sure you maximize the probability of this class; other than that, the reverse diffusion process continues as normal. Looking at what this means from the Bayesian perspective, again by undoing the gradient and log operations, you can see that we have taken the classifier distribution p(c | x) and raised it to the power γ, so γ acts as an inverse temperature, what I have called a coldness. This coldness sharpens the distribution, which very strongly encourages the model to focus on producing output that maximally activates this particular class.

The most popular form of guidance used today is classifier-free guidance, and as the name implies, we no longer need an external classifier to get gradients. Instead, we make two predictions with our model, a conditional and an unconditional one. We can do that by training the model with conditioning but dropping it out, say, 10% of the time, so that the model can act as both an unconditional and a conditional model. We make these two predictions, compute the difference between them, and this gives us a direction in input space that should make samples look more like they belong to the corresponding class. Then we do the same thing we did in classifier guidance and magnify it: we multiply it by the guidance scale and take a step in that direction.
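In score form (the standard formulations, cf. Dhariwal & Nichol (2021) in the references below), classifier guidance and classifier-free guidance with guidance scale \(\gamma\) are
\[\begin{aligned} \text{classifier guidance:} \quad & \nabla_{\mathbf{x}} \log p(\mathbf{x}) + \gamma\, \nabla_{\mathbf{x}} \log p(c \mid \mathbf{x}) \\ \text{classifier-free guidance:} \quad & \tilde{\epsilon}_\theta(\mathbf{x}_t, c) = \epsilon_\theta(\mathbf{x}_t) + \gamma \big( \epsilon_\theta(\mathbf{x}_t, c) - \epsilon_\theta(\mathbf{x}_t) \big) \end{aligned}\]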
And this is magic: this is what makes diffusion models work so well. Any samples you see from modern diffusion models, say Midjourney or Stable Diffusion, lean very heavily on this trick to make the output look good. Again, the process proceeds as normal after this modification, but we apply it at every step. So this is supercharging the model thanks to Bayes, because it is really just applying Bayes' rule twice. That is what I want you to take away from this: classifier-free guidance is just Bayes' rule, twice.
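As a minimal sketch of what this looks like inside a sampling step (the `model` signature and the null-conditioning convention are assumptions for illustration):

```python
def guided_eps(model, x_t, t, cond, guidance_scale):
    """Classifier-free guidance: combine an unconditional and a conditional prediction,
    magnifying their difference by the guidance scale."""
    eps_uncond = model(x_t, t, cond=None)   # conditioning dropped, as during training
    eps_cond = model(x_t, t, cond=cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```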
Here are some examples from the GLIDE paper, one of the first large text-to-image models, by OpenAI. On the left you see samples from the base diffusion model without classifier-free guidance, and on the right you see what classifier-free guidance does: it gives you much higher quality samples, but the diversity is also reduced quite a bit.
A practical question from the audience: if you have multiple conditioning signals, during training do you drop them out one by one, or all of them at once? That is a very good question. You probably want to drop them independently, so that you can guide specifically by one signal while keeping the others fixed, or guide by all of them. So yes, you probably want to do that independently.
One more example, with a different prompt; again, I think these examples nicely show the trade-off between quality and diversity that guidance gives you. But, as I said, guidance is just Bayes' rule.
Or is it? There is a recent paper from the NVIDIA folks in Finland that calls this into question a little bit. I'm not going to go into it, but it is worth checking out: they do an analysis of guidance and argue that maybe this is not why guidance works, maybe there is something more to it. Worth a read; it came out last month. Then, to wrap up the diffusion section, I want to highlight two amazing papers. The first one I call the diffusion Bible; it is also by the NVIDIA group in Finland and is really a gold mine for diffusion practitioners: read it front to back, including the appendices, because the appendices are where most of the gold is. The second one complements it nicely: it goes really deep into this relative weighting of noise levels, how you can control it, and what it actually means. Also a really great paper, with really great appendices.
All right, in the last five minutes I'll give a few examples of work I have done with diffusion models at DeepMind. One thing I worked on is continuous diffusion for categorical data (CDCD). The idea is to use the framework of continuous diffusion with data that isn't actually continuous. That is a problem, because if you try to apply it naively, the question becomes: what is the gradient of the log probability if your space is discrete? You essentially don't have a score function. An easy fix is to embed the discrete data in a continuous space and then do Gaussian diffusion in that space, and this work is essentially a recipe for making that work. A goal of the work was to do language modelling with diffusion models, and specifically to make it look as similar to existing large language models as possible, so that existing LLM practitioners can easily adopt it and play with it. So we use a standard Transformer model, just without the causal masking, because we don't need that here, and we train the diffusion model with a cross-entropy loss instead of the mean squared error loss. As an overview of what makes that possible, there are two novel components, shown in bold here. The first is score interpolation: instead of having the model predict the score function directly, you train it as a classifier. If you have a vocabulary of, say, 32,000 tokens, as you often do in language models, you can just put a 32,000-way softmax up there.
Then you take the probabilities associated with each possible token in the dictionary, ask what the score would be if that token were the correct prediction, and use the probabilities to interpolate between all these possible score functions. The result is the same kind of averaged estimate of the score function that you also get from a standard diffusion model, so this is a nice way to enable training diffusion models with cross-entropy in the discrete setting. The other component, which I won't go into, is called time warping: we looked at what the relative weighting of noise levels should be in this domain that is unfamiliar for diffusion, and whether we can learn it on the fly rather than having to figure it out ourselves.
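A rough sketch of the score interpolation idea as I understand it from the talk (all names and shapes here are hypothetical; the conversion to a score/ε estimate follows the linear relationship discussed earlier):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def interpolated_x0(logits, embedding_table):
    """Score interpolation (as I understand it): the model outputs a distribution over
    tokens, and the x0 estimate is the probability-weighted average of the candidate
    token embeddings, which can then be converted into a score/epsilon estimate."""
    probs = softmax(logits)              # (seq_len, vocab_size)
    return probs @ embedding_table       # (seq_len, embed_dim)

# Hypothetical usage: cross-entropy is applied to `logits` against the true tokens,
# while the interpolated x0 estimate plugs into the usual diffusion sampling updates.
```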
What this model ends up looking like is actually a lot like BERT, if you have heard of that. BERT is a language model, but not a generative one: it learns internal representations by taking a sequence of tokens, masking out some of them in the input, and then trying to predict them from context. You can think of the CDCD-type models as a version of BERT where, instead of masking, you have Gaussian noise: you add Gaussian noise to the embeddings, which makes it ambiguous which embedding the input could have come from, and then you try to predict the correct embedding sequence. So this gives rise to a kind of continuous version of BERT, if you want. I have some samples here; there is probably not enough time to read them, but the key point is that you don't just have to sample with a prefix: you can fix some tokens anywhere in the sequence and fill in the middle. There is an example in the middle on the right where we fix "a year ago in Paris" at the start and "what a great day" at the end, and let the model fill in the text in between; the model knows how to do that out of the box.
After this paper, my observation was that diffusion language modelling shows promise, but it is going to take a few more breakthroughs to really compete with autoregressive models; I wrote a blog post about that in case you are interested.
just what I’ve been working on more recently which is Imagen 3 and Veo which are Google DeepMind’s text-to-image and text-to-video models respectively now this is a I I left this to the end just to talk about for a few minutes because
I can’t really share anything educational about this yet because this is unpublished work right but I can show you some samples and I can tell you that imagine three will be available to everyone very soon to play with so yeah these are just some some sample images that you can generate with the fusion if you put your mind to it and you have a lot of comput um there we go yeah just yeah on the website you can you can see more and then there there will be a tech report I promise and then on the on the video side we have a model called Veo very comparable actually both in output and also in architecture to the Sora model from open AI so it’s a latent diffusion model we have a latent space in which we represent the videos we have text prompt op optional image prompt and then we just have a diffusion model that produces a 1080p video and I have like a this kind of a show real a bunch of samples edited together just to show what the what this model is capable of so we’re working on making this one available as well but as you can imagine video is a bit more unwieldy so that might take us a little bit longer please bear with us um all right so to wrap up most of the content that I’ve talked about here is on my blog so I have a lot of blog posts um if you if you just remember sander. a you can just find them all there um I think I kind of touched on pretty much all of these except maybe the Paradox of diffusion distillation I didn’t really talk about distilling diffusion models but I have a blog post about that um as well um and I will leave it at that maybe we can do a couple of questions and I wanted to start with the question from the most question something like um what is your opinion on AO regressive models in general it feels weird that they need this intermediate States but they generated with no add input I know there’s work into this generation into a simple noing step um in let let me have a look yeah um okay so so Auto regression is is really beautiful in a sense because it elim Ates this problem of the of the
normalization constant. I talked before about how, if you want to estimate p(x), you had better make sure that p(x) is actually normalized; this is annoying and leads to all kinds of intractability problems. Autoregressive models don't have this problem, because you factorize p(x) into scalar distributions, and each of those scalar distributions is really easy to normalize; it is always tractable. So I think autoregression is very powerful, especially in cases where the data naturally lends itself to sequential modelling, as in language. I do think it is somewhat artificial for some perceptual modalities to have to be turned into a 1D sequence, but it is a very powerful and also very computationally efficient way to model things, so I don't think autoregression is going away. In fact, with the trend towards multimodal models, we are at a point where we are trying to merge language models with text-to-image and text-to-video, to have a single model that generates all these modalities, and this is leading to a bit of a clash: there is an incompatibility, because we use diffusion for some modalities and autoregression for others. I can see this going two ways, and this is why I called it an unstable equilibrium: I don't believe people will tolerate having to deal with two different modelling paradigms indefinitely. Either we go back to doing everything autoregressively, which seems most likely right now because it already kind of works; it misses some of the advantages that diffusion has, but it works well enough, so practically speaking that seems likely to happen. The other possibility is that people figure out the barriers to making text diffusion work well, and then maybe we can train multimodal diffusion models and no longer need autoregression at all. I don't know which one it is going to be. Maybe one more question.
The question is about using generative models as a prior in downstream applications. Autoregressive models are really great for that, because they give you a tractable estimate of p(x). With diffusion, you can estimate likelihoods of individual data points, but it is as expensive as sampling: it requires you to run an iterative procedure, essentially solving an ordinary differential equation, to get an approximate estimate of the likelihood of a particular sample. The other problem is that it is usually an approximation, not an upper or lower bound, so you don't know whether it actually bounds anything.
Imagen Video
Imagen Video (Ho et al., 2022); progressive distillation (Salimans & Ho, 2022).
Diffusion for Sequential Decision Making
- Planning with Diffusion for Flexible Behavior Synthesis PDF
- Value function estimation using conditional diffusion models for control PDF
- Training Diffusion Models with Reinforcement Learning (DDPO) Blog PDF
- Is Conditional Generative Modeling All You Need for Decision-Making? PDF
- Diffusion Policy: Visuomotor Policy Learning via Action Diffusion PDF
“Simple Diffusion” Math Formulation
References
- Patrick Esser et al. “Taming Transformers for High-Resolution Image Synthesis”. CVPR 2021. PDF.
- Ramesh et al. “Zero-Shot Text-to-Image Generation”. ICML 2021. PDF.
- Yang Song. Generative Modeling by Estimating Gradients of the Data Distribution. Post.
- Ermon and Song. CS 236, “Score-Based Models”. Slides 1, Slides 2.
- Jonathan Ho, Ajay Jain, Pieter Abbeel. Denoising Diffusion Probabilistic Models. PDF.
- Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, Surya Ganguli. “Deep Unsupervised Learning using Nonequilibrium Thermodynamics”. ICML 2015. PDF.
- Prafulla Dhariwal, Alexander Nichol. “Diffusion Models Beat GANs on Image Synthesis”. NeurIPS 2021. PDF.
- Lilian Weng. What are Diffusion Models? July 2021. Post.
- Calvin Luo. Understanding Diffusion Models: A Unified Perspective. arXiv, August 2022. PDF.
- Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, Tim Salimans. Imagen Video: High Definition Video Generation with Diffusion Models. PDF.
- Tim Salimans, Jonathan Ho. Progressive Distillation for Fast Sampling of Diffusion Models. PDF.
- Saharia et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. 2022. PDF.
- R. Po, W. Yifan, V. Golyanik, K. Aberman, J. T. Barron, A. Bermano, E. Chan, T. Dekel, A. Holynski, A. Kanazawa, C. K. Liu, L. Liu, B. Mildenhall, M. Nießner, B. Ommer, C. Theobalt, P. Wonka, G. Wetzstein. State of the Art on Diffusion Models for Visual Computing. arXiv, 2023. PDF.