Overview

Something that has surprised me while following recent progress in GenAI, especially text-to-video, is how simple and straightforward the extensions to text-to-image models are that make them spit out coherent video. In fact, the simpler the approach, the better the results.

  • VDM: Change each 2D conv into a space-only 3D conv (a 3x3 kernel becomes 1x3x3). If we do this for every layer of DDPM’s 2D UNet, we get a 3D UNet that operates over a 4D tensor of (frames x height x width x channels). Then simply interleave a few temporal attention blocks.

What is a “temporal attention block”, you ask? Just attention over the frames axis (the first axis), treating the spatial axes as the batch dimension, i.e. re-order (frames x H x W x C) as ([H x W] x frames x C). A sketch of both pieces follows. VDM: https://arxiv.org/abs/2204.03458
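Here is a minimal PyTorch sketch of those two pieces, assuming a (batch, channels, frames, height, width) tensor layout; the class names and hyperparameters are mine, not the paper’s:

```python
import torch
import torch.nn as nn


class SpaceOnlyConv3d(nn.Module):
    """VDM-style conv: a 3x3 spatial kernel applied independently to every frame."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Kernel (1, 3, 3) over (frames, height, width): no mixing across time.
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T, H, W)
        return self.conv(x)


class TemporalAttention(nn.Module):
    """Self-attention over the frame axis; spatial positions are folded into the batch."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        # (B, C, T, H, W) -> (B*H*W, T, C): each spatial location attends over time only.
        tokens = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        out, _ = self.attn(tokens, tokens, tokens)
        # Residual connection, then fold back to (B, C, T, H, W).
        return x + out.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)
```

The reshape is the whole trick: by folding H and W into the batch dimension, the attention only ever sees sequences of length `frames`, so its cost does not grow with spatial resolution.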

  • Video LDM and Stable Video Diffusion: Take your standard Stable Diffusion UNet, freeze the spatial layer weights, interleave 3D conv and temporal attention layers between the existing layers, and retrain; follow this with separate temporal and spatial super-resolution models.
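Under the same (B, C, T, H, W) layout, the recipe looks roughly like the sketch below. The learned blend between the frozen per-frame path and the new temporal path is my paraphrase of the papers’ mixing factor, and the specific temporal layer choice is illustrative:

```python
import torch
import torch.nn as nn


class InterleavedTemporalBlock(nn.Module):
    """Wrap a frozen, pretrained 2D block and add a trainable temporal layer after it."""

    def __init__(self, spatial_block: nn.Module, channels: int):
        super().__init__()
        self.spatial = spatial_block
        for p in self.spatial.parameters():
            p.requires_grad_(False)                # keep the image-model weights fixed
        # Trainable temporal mixer: a 3-tap conv over the frame axis only.
        self.temporal = nn.Conv3d(channels, channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.alpha = nn.Parameter(torch.zeros(1))  # learned spatial/temporal blend

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W); spatial_block is assumed to preserve its input shape.
        b, c, t, h, w = x.shape
        # Run the pretrained 2D block frame by frame by folding T into the batch.
        frames = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        frames = self.spatial(frames)
        hidden = frames.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        a = torch.sigmoid(self.alpha)
        return a * hidden + (1 - a) * self.temporal(hidden)
```

Only the temporal layer and the blend parameter receive gradients, which is what lets the pretrained image backbone keep its per-frame quality while the new layers learn motion.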

  • WALT: Throw away the UNet and use a Diffusion Transformer (DiT) instead. Directly learn a 3D encoder-decoder (autoencoder) that maps the 4D input to a lower-resolution 4D latent space, patchify it, and alternate non-overlapping, window-restricted spatial and spatiotemporal attention.

Why window-restricted? To limit the computational demands of quadratic attention over huge sequence lengths. MAGViT-v2 is the encoder/decoder they use. But WALT only generates low-res 128x128 px video and then needs separate super-resolution models afterwards. A sketch of the windowing is below.
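Concretely, window-restricted attention partitions the patchified latent video into small non-overlapping blocks and runs self-attention inside each block. The sketch below assumes tokens laid out as (B, T, H, W, D); the window shapes are illustrative, not WALT’s exact configuration:

```python
import torch
import torch.nn as nn


def window_attention(tokens: torch.Tensor,
                     attn: nn.MultiheadAttention,
                     window: tuple) -> torch.Tensor:
    """Self-attention restricted to non-overlapping (wt, wh, ww) windows.

    tokens: (B, T, H, W, D) patch embeddings; T, H, W must be divisible by the window.
    """
    b, t, h, w, d = tokens.shape
    wt, wh, ww = window
    # Partition into windows: every window becomes one "batch" element.
    x = tokens.reshape(b, t // wt, wt, h // wh, wh, w // ww, ww, d)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, d)
    x, _ = attn(x, x, x)  # attention never crosses a window boundary
    # Undo the partition.
    x = x.reshape(b, t // wt, h // wh, w // ww, wt, wh, ww, d)
    x = x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(b, t, h, w, d)
    return x


attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
latents = torch.randn(2, 8, 16, 16, 64)                      # (B, T, H, W, D) patch tokens
spatial = window_attention(latents, attn, (1, 8, 8))         # per-frame spatial windows
spatiotemporal = window_attention(latents, attn, (8, 4, 4))  # windows that also span time
```

Because each attention call only sees wt·wh·ww tokens, the cost scales with the number of windows rather than quadratically in the full spatiotemporal sequence length.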

  • Sora: Same overall strategy as WALT, but obtains much higher resolution by generating video at its native resolution, eliminating the need for a cascade (and appears to be trained on significantly more video data). https://openai.com/index/video-generation-models-as-world-simulators/

  • The more complicated methods, like Imagen Video, seem much less effective than the straightforward Sora approach. Imagen Video used VDM as its base model but required a cascade of 7 models in total.

Imagen Video uses temporal convolution in the base model instead of temporal attention to manage the memory and computation cost at a 24 fps frame rate. Its other 6 models are temporal and spatial super-resolution (SR) models. https://arxiv.org/abs/2210.02303
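To make “a cascade of 7 models” concrete, here is a schematic of how such a pipeline composes at sampling time. The sampler interface and names are hypothetical; only the overall shape (one base model followed by alternating temporal and spatial SR samplers) is from the paper:

```python
from typing import Callable, List, Tuple

import torch

# (prompt, coarse video) -> refined video; stands in for a trained diffusion sampler.
Sampler = Callable[[str, torch.Tensor], torch.Tensor]


def cascade_sample(prompt: str,
                   base: Callable[[str], torch.Tensor],
                   sr_stages: List[Tuple[str, Sampler]]) -> torch.Tensor:
    """Run the base text-to-video sampler, then a chain of super-resolution samplers."""
    video = base(prompt)                # a short, low-resolution, low-frame-rate clip
    for _, sampler in sr_stages:        # six stages mixing "TSR" (more frames) and "SSR" (more pixels)
        video = sampler(prompt, video)
    return video                        # the paper's final output is 1280x768 video at 24 fps
```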

  • Make-A-Video also employs a sequential cascade of models and adapts a text-to-image model as their base model. Their key insight: 3D conv is expensive, so approximate it by simply adding a temporal 1D conv layer after each spatial 2D conv layer.

The temporal conv layer is initialized to the identity function, so the stack initially behaves exactly like the pretrained image model. They call this “pseudo-3D convolution” (sketched below) and also build a version for attention called “pseudo-3D attention”.

Make-A-Video: Text-to-Video Generation without Text-Video Data (arxiv.org)
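A minimal PyTorch sketch of that pseudo-3D convolution, again assuming a (B, C, T, H, W) layout; the identity initialization here uses a Dirac kernel and zero bias so that, before training, the layer behaves exactly like the original 2D conv:

```python
import torch
import torch.nn as nn


class Pseudo3dConv(nn.Module):
    """A 2D spatial conv followed by a 1D temporal conv (identity-initialized)."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        p = kernel_size // 2
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size, padding=p)
        self.temporal = nn.Conv1d(out_ch, out_ch, kernel_size, padding=p)
        # Identity init: a centered Dirac kernel and zero bias, so the temporal
        # conv initially passes every frame through unchanged.
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        # Spatial conv per frame: fold T into the batch.
        y = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        c2 = y.shape[1]
        # Temporal conv per pixel: fold H and W into the batch.
        y = y.reshape(b, t, c2, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c2, t)
        y = self.temporal(y)
        # Back to (B, C, T, H, W).
        return y.reshape(b, h, w, c2, t).permute(0, 3, 4, 1, 2)
```

The appeal is cost: a 2D spatial conv plus a 1D temporal conv is far cheaper than a full 3D conv, yet still lets information flow across frames.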

Of course, perhaps these extensions are only simple in hindsight. A nice resource for understanding text-to-video model architectures is this blog post: https://lilianweng.github.io/posts/2024-04-12-diffusion-video/


References

  1. Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, David J. Fleet. Video Diffusion Models. [arXiv].
  2. Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis. Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models. [arXiv].
  3. Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. [arXiv].
  4. OpenAI. Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/
  5. Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, Tim Salimans. Imagen Video: High Definition Video Generation with Diffusion Models. [arXiv].