AI-Generated Content (AIGC) | A recent trend in computer vision where AI is used to generate content.
Diffusion models have become very successful in AIGC, especially in generating images and videos.
They outperform other methods such as GANs and auto-regressive Transformers, excelling in image and video generation and editing.
Introduction
The paper starts by highlighting the success of AI-generated content (AIGC) in computer vision, with diffusion models playing a key role.
It notes that diffusion models are becoming more popular than GANs and auto-regressive Transformers for image generation due to their controllability, photorealistic output, and diversity.
The introduction emphasizes video as a vital medium on the internet, offering dynamic information for a comprehensive user experience.
Research on video tasks using diffusion models is increasing, covering areas like video generation, editing, and understanding.
Preliminaries: Diffusion Model
A class of probabilistic generative models.
Learn to reverse a process that gradually degrades the training data structure.
This degradation can be thought of as adding noise to the data until it becomes pure noise.
Have become the state-of-the-art family of deep generative models.
Denoising Diffusion Probabilistic Models (DDPMs)
DDPMs involve Markov chains for forward (data to noise) and reverse (noise to data) processes.
Forward Markov Chain (Diffusion Process)
Gradually adds noise to the data until it becomes a simple prior distribution (e.g., Gaussian noise).
The transition kernel $q(x_t | x_{t-1})$ defines how each step in the forward process adds noise: $q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I)$
$x_t$: The data at time step $t$. $x_{t-1}$: The data at the previous time step $t-1$.
$\beta_t$: A hyperparameter controlling the amount of noise added at each step, where $\beta_t \in (0, 1)$.
$\mathcal{N}$: A normal (Gaussian) distribution. $I$: The identity matrix.
The joint distribution $q(x_1, ..., x_T | x_0)$ is the product of these transitions: $q(x_1, ..., x_T | x_0) = \prod_{t=1}^{T} q(x_t | x_{t-1})$
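The forward chain above can be sketched numerically. The toy NumPy loop below applies the transition kernel step by step (in practice $x_t$ is usually drawn in closed form from $q(x_t | x_0)$ rather than by looping):

```python
import numpy as np

def forward_diffusion(x0, betas, rng=None):
    """Simulate the forward Markov chain step by step: each transition
    q(x_t | x_{t-1}) scales the sample by sqrt(1 - beta_t) and adds
    Gaussian noise with variance beta_t."""
    rng = rng or np.random.default_rng(0)
    x = x0
    trajectory = [x0]
    for beta in betas:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
        trajectory.append(x)
    return trajectory

# Toy example: a 4-pixel "image" diffused over 1000 steps with a linear schedule.
x0 = np.ones(4)
traj = forward_diffusion(x0, np.linspace(1e-4, 0.02, 1000))
# By the final step the sample is statistically close to N(0, I).
```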
Reverse Markov Chain
The reverse process starts from a prior distribution $p(x_T)$ (typically Gaussian noise) and iteratively denoises to generate new data.
The prior distribution is defined as: $p(x_T) = \mathcal{N}(x_T; 0, I)$
$x_T$: The data at the final time step $T$, which is random noise.
$0$: A vector of zeros.
$I$: The identity matrix.
The learnable transition kernel $p_\theta(x_{t-1} | x_t)$ is modeled by a neural network: $p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$
$x_{t-1}$: The data at the previous time step $t-1$.
$x_t$: The data at time step $t$.
$\mu_\theta(x_t, t)$: The mean predicted by the neural network, dependent on $x_t$ and $t$.
$\Sigma_\theta(x_t, t)$: The variance predicted by the neural network, dependent on $x_t$ and $t$.
$\theta$: The parameters of the neural network.
To generate new data, sample $x_T \sim p(x_T)$ and iteratively sample from $p_\theta(x_{t-1} | x_t)$ until $t = 1$, obtaining $x_0$.
The core idea is to train the reverse Markov chain to accurately reverse the forward Markov chain, effectively learning to generate data from noise.
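A minimal sketch of this sampling loop, assuming the common noise-prediction parameterization (a hypothetical network `eps_model(x, t)` predicts the added noise, and the reverse variance is fixed to $\beta_t I$ rather than learned):

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng=None):
    """Ancestral sampling sketch. `eps_model(x, t)` is a hypothetical
    noise-prediction network; the reverse variance is fixed to beta_t."""
    rng = rng or np.random.default_rng(0)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        eps = eps_model(x, t)
        # posterior mean mu_theta(x_t, t) under the eps-parameterization
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        x = mean if t == 0 else mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Dummy "network" that predicts zero noise, just to exercise the loop.
x_gen = ddpm_sample(lambda x, t: np.zeros_like(x), (4,), np.linspace(1e-4, 0.02, 50))
```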
Score-Based Generative Models (SGMs)
SGMs introduce noise to data at various levels and then estimate 'scores' (gradients of the log probability density) for each noise level by training a noise-conditional score network.
SGMs separate the training of the model from the sample generation process, offering flexibility.
Mathematical Formulation
Let $q(x_0)$ be the data distribution. $0 < \sigma_1 < \sigma_2 < \dots < \sigma_T$ represents a sequence of increasing noise levels.
Gaussian noise is added to data point $x_0$ to get $x_t$, following the distribution $q(x_t | x_0) = \mathcal{N}(x_t; x_0, \sigma_t^2 I)$.
This creates a series of noisy data densities $q(x_1), q(x_2), \dots, q(x_T)$, where $q(x_t)$ is the integral of $q(x_t | x_0)q(x_0)$ over all possible values of $x_0$.
A noise-conditional score network (NCSN) $s_\theta(x, t)$ is trained to estimate the score function $\nabla_{x_t} \log q(x_t)$, which indicates the direction of increasing probability density at a given point $x_t$ and noise level $t$.
Training techniques include score matching, denoising score matching, and sliced score matching.
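Denoising score matching, for example, exploits the fact that the score of the perturbation kernel is known analytically. A minimal sketch at one noise level, with a hypothetical `score_fn` standing in for the trained network:

```python
import numpy as np

def dsm_loss(score_fn, x0, sigma, rng=None):
    """Denoising score matching at one noise level (sketch). For the
    perturbed sample x = x0 + sigma * z, the target score of q(x | x0)
    is -(x - x0) / sigma^2; the network should match it."""
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal(x0.shape)
    x = x0 + sigma * z
    target = -(x - x0) / sigma**2
    return float(np.mean((score_fn(x, sigma) - target) ** 2))

# A "network" that returns the exact target achieves zero loss.
loss = dsm_loss(lambda x, s: -x / s**2, np.zeros(4), sigma=0.5)
print(loss)  # 0.0
```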
Sample Generation | Iterative methods are used with score functions $s_\theta(x, T), s_\theta(x, T-1), \dots, s_\theta(x, 1)$ to generate samples, often using annealed Langevin dynamics (ALD).
ALD is an iterative sampling technique. It starts with a sample from a simple distribution (like Gaussian noise) and gradually refines it by using the score function to move the sample towards regions of higher probability.
The "annealed" part means that the step size or noise level is gradually reduced during the iteration, allowing for more precise refinement at later stages.
SGMs transform a complex data generation problem into a series of simpler noise estimation tasks, guided by the learned score function.
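ALD can be sketched in a few lines, assuming a hypothetical noise-conditional score network `score_fn(x, sigma)` (here replaced by the exact score of a Gaussian so the loop runs):

```python
import numpy as np

def annealed_langevin(score_fn, shape, sigmas, n_steps=10, eps=2e-5, rng=None):
    """Annealed Langevin dynamics (sketch). `score_fn(x, sigma)` is a
    hypothetical noise-conditional score network; `sigmas` runs from the
    largest noise level down to the smallest."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(shape)
    for sigma in sigmas:
        step = eps * (sigma / sigmas[-1]) ** 2  # anneal the step size
        for _ in range(n_steps):
            z = rng.standard_normal(shape)
            x = x + 0.5 * step * score_fn(x, sigma) + np.sqrt(step) * z
    return x

# The exact score of N(0, sigma^2 I) stands in for the trained network.
sample = annealed_langevin(lambda x, s: -x / s**2, (4,), sigmas=[1.0, 0.5, 0.1])
```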
Stochastic Differential Equations (Score SDEs)
Score SDEs use an infinite number of noise scales to perturb data. The diffusion process is modeled as a solution to an SDE: $dx = f(x, t)dt + \sigma(t)dw$,
$dx$ represents the change in the data $x$,
$f(x, t)$ is the drift function, defining the deterministic change in $x$ over time $t$,
$\sigma(t)$ is the diffusion function, controlling the amount of noise added,
$dw$ is the standard Wiener process, representing random noise.
The reverse process, which generates samples $x(0)$ from noise $x(T)$, is defined by the reverse-time SDE: $dx = [f(x, t) - \sigma(t)^2 \nabla_x \log q_t(x)]dt + \sigma(t)d\bar{w}$,
$\nabla_x \log q_t(x)$ is the score function, representing the gradient of the log probability density of the data at time $t$,
$d\bar{w}$ is the standard Wiener process when time flows backward.
Knowing the score function for all $t$ allows deriving and simulating the reverse diffusion process to sample from the original data distribution $p_0$.
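An Euler-Maruyama discretization of the reverse-time SDE makes this concrete. This sketch assumes zero drift $f(x, t) = 0$ and a constant diffusion coefficient, with a hypothetical `score_fn(x, t)` estimating $\nabla_x \log q_t(x)$:

```python
import numpy as np

def reverse_sde_sample(score_fn, shape, T=1.0, n_steps=500, sigma=1.0, rng=None):
    """Euler-Maruyama integration of the reverse-time SDE (sketch).

    Assumes f(x, t) = 0 and constant diffusion sigma(t) = sigma; `score_fn`
    is a hypothetical estimate of the score grad_x log q_t(x)."""
    rng = rng or np.random.default_rng(0)
    dt = T / n_steps
    x = rng.standard_normal(shape)  # start from noise x(T)
    for i in range(n_steps):
        t = T - i * dt
        z = rng.standard_normal(shape)
        # step backward in time: x <- x - [f - sigma^2 * score] * dt + sigma * sqrt(dt) * z
        x = x + sigma**2 * score_fn(x, t) * dt + sigma * np.sqrt(dt) * z
    return x

# Toy score of a standard Gaussian, grad log N(0, I) = -x, just to run the loop.
x_out = reverse_sde_sample(lambda x, t: -x, (4,))
```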
Related Works
Text-to-Video (T2V) Generation
Definition | T2V generates videos directly from natural language descriptions.
Process | The model needs to understand the scenes, objects, and actions described in the text. It then translates this understanding into a series of coherent video frames.
Goal | The generated video should have both logical and visual consistency, meaning the actions and scenes make sense and look realistic.
Applications | T2V has many uses, including automatically generating movies, creating animations, developing virtual reality content, and producing educational videos.
Unconditional Video Generation
Definition | This involves generating videos without any specific input conditions or guidance.
Process | The model starts from random noise or a fixed initial state and creates a continuous, visually coherent video sequence.
Key Difference | Unlike conditional video generation, no external information is provided to guide the process.
Goal | The model must learn to capture temporal dynamics (how things change over time), actions, and visual coherence on its own.
Importance | Unconditional video generation is important for:
Exploring how well generative models can learn video content from unsupervised data (data without labels or specific instructions).
Demonstrating the diversity of content a model can create.
Text-guided Video Editing
Definition | This technique uses textual descriptions to guide the editing of video content.
Process | A user provides a video and a natural language description of the desired changes.
The system analyzes the text, identifying relevant objects, actions, or scenes.
This information is then used to guide the editing process, modifying the video accordingly.
Benefits | Offers an efficient and intuitive way to edit videos.
Allows editors to communicate their intentions using natural language.
Reduces the need for manual, frame-by-frame editing.
Datasets and Metrics
The Comparison of Main Caption-level Video Datasets
The Comparison of Existing Category-level Datasets for Video Generation and Editing
Evaluation Metrics
Qualitative and quantitative measures are used to evaluate video generation.
Qualitative Measures
Human subjective evaluation is used in several works: evaluators compare generated videos and vote on realism, coherence, and text alignment. This approach is costly and may fail to reflect the full capabilities of the model.
Quantitative Evaluation Standards
Image-level and video-level assessments are used.
Image-level Metrics
Videos are composed of image frames, so image-level metrics provide insight into quality.
Fréchet Inception Distance (FID)
FID assesses the quality of generated videos by comparing synthesized video frames to real video frames.
It involves normalizing images to a consistent scale, extracting features from real and synthesized frames with InceptionV3, and computing their mean and covariance matrices.
These statistics are then combined to calculate the FID score.
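The Fréchet distance between the two fitted Gaussians can be computed from those statistics alone. A NumPy-only sketch, with random vectors standing in for real InceptionV3 activations (a symmetric square-root trick replaces a general matrix square root):

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def fid(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets
    (sketch; real InceptionV3 activations would be used in practice)."""
    mu1, mu2 = feats_real.mean(0), feats_gen.mean(0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_gen, rowvar=False)
    c1_half = _sqrtm_psd(c1)
    # tr((c1 c2)^{1/2}) equals tr((c1^{1/2} c2 c1^{1/2})^{1/2}), which is symmetric
    tr_covmean = np.trace(_sqrtm_psd(c1_half @ c2 @ c1_half))
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2) - 2.0 * tr_covmean)

rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 4))
# Identical feature sets give a distance of (numerically) zero.
```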
Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM)
Both are pixel-level metrics. SSIM evaluates brightness, contrast, and structural features of the original and generated images.
$PSNR = 10 \cdot \log_{10}\left(\frac{MAX_I^2}{MSE}\right)$, where $MSE = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2$
$MAX_I$ is the maximum possible pixel value of the image.
$MSE$ is the Mean Squared Error between the original and generated images.
PSNR is a coefficient representing the ratio between peak signal and Mean Squared Error (MSE).
These two metrics are commonly used to assess the quality of reconstructed image frames and are applied in tasks such as super-resolution and in-painting.
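PSNR follows directly from the formulas above; a minimal sketch:

```python
import numpy as np

def psnr(original, generated, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB (higher means closer to the original)."""
    mse = np.mean((original.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val**2 / mse)

a = np.zeros((8, 8))
b = np.full((8, 8), 10.0)  # constant error of 10 -> MSE = 100
print(round(psnr(a, b), 2))  # 28.13
```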
CLIPSIM
CLIPSIM is a method for measuring image-text relevance.
Based on the CLIP model, it extracts both image and text features and then computes the similarity between them.
This metric is often employed in text-conditional video generation or editing tasks.
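At its core the metric is a cosine similarity between the two embeddings; a sketch with placeholder vectors standing in for real CLIP features:

```python
import numpy as np

def clipsim(image_emb, text_emb):
    """Cosine similarity between an image embedding and a text embedding
    (sketch; the vectors would come from CLIP's two encoders)."""
    a = image_emb / np.linalg.norm(image_emb)
    b = text_emb / np.linalg.norm(text_emb)
    return float(a @ b)

# Placeholder 2-D vectors in place of real CLIP features.
score = clipsim(np.array([1.0, 0.0]), np.array([1.0, 1.0]))
print(round(score, 4))  # 0.7071
```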
Video-level Metrics
Image-level metrics focus on individual frames and disregard temporal coherence.
Video-level metrics provide a more comprehensive evaluation of video generation.
Fréchet Video Distance (FVD)
Fréchet Video Distance (FVD) is a video quality evaluation metric based on FID.
Unlike image-level methods that use the Inception network to extract features from a single frame, FVD employs the Inflated-3D Convnets (I3D) pre-trained on Kinetics to extract features from video clips.
Subsequently, FVD scores are computed through the combination of means and covariance matrices.
Kernel Video Distance (KVD)
KVD is also based on I3D features, but it differentiates itself by utilizing Maximum Mean Discrepancy (MMD), a kernel-based method, to assess the quality of generated videos.
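A minimal MMD sketch with an RBF kernel, using random vectors in place of I3D features:

```python
import numpy as np

def mmd_rbf(x, y, gamma=1.0):
    """Squared Maximum Mean Discrepancy with an RBF kernel (sketch; KVD
    applies MMD to I3D features of real vs. generated videos)."""
    def k(a, b):
        sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq_dists)
    return float(k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean())

rng = np.random.default_rng(0)
feats = rng.standard_normal((50, 3))
# Identical feature sets -> MMD of zero; well-separated sets -> larger MMD.
```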
Video Inception Score (IS)
Video IS (Inception Score) calculates the Inception score of generated videos using features extracted by a 3D ConvNet (C3D); it is often applied for evaluation on UCF-101.
High-quality videos yield a conditional class distribution $P(y|x)$ with low entropy, whereas diversity is assessed via the marginal distribution across all videos, which should exhibit high entropy.
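The score exponentiates the average KL divergence between $P(y|x)$ and the marginal $P(y)$; a sketch on a toy probability matrix:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from class probabilities p(y|x) (sketch; for Video IS
    the probabilities come from a C3D classifier over video clips)."""
    p_y = probs.mean(axis=0)  # marginal distribution p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident (low-entropy p(y|x)) and diverse (high-entropy p(y)) predictions
# maximize the score; with 4 one-hot classes it reaches the maximum of 4.
probs = np.eye(4)
print(round(inception_score(probs), 3))  # 4.0
```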
Frame Consistency CLIP Score
Frame Consistency CLIP Score is often used in video editing tasks to measure the coherence of edited videos.
It is calculated by obtaining CLIP image embeddings for all frames and averaging the cosine similarity between all pairs of frames.
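A sketch of that computation, with placeholder vectors standing in for CLIP image embeddings:

```python
import numpy as np

def frame_consistency(frame_embs):
    """Average cosine similarity over all pairs of frame embeddings
    (sketch; the embeddings would come from a CLIP image encoder)."""
    e = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sim = e @ e.T
    i, j = np.triu_indices(len(e), k=1)  # all unordered frame pairs
    return float(sim[i, j].mean())

# Four identical "frames" -> perfect consistency of 1.0.
embs = np.tile(np.array([1.0, 2.0, 3.0]), (4, 1))
print(round(frame_consistency(embs), 6))  # 1.0
```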
A survey on video diffusion models