AI-Generated Content (AIGC) | A recent trend in computer vision where AI is used to generate content.
Diffusion models have become very successful in AIGC, especially in generating images and videos.
They outperform other methods such as GANs and auto-regressive Transformers, excelling in image and video generation and editing.
Introduction
The paper starts by highlighting the success of AI-generated content (AIGC) in computer vision, with diffusion models playing a key role.
It notes that diffusion models are becoming more popular than GANs and auto-regressive Transformers for image generation due to their controllability, photorealistic output, and diversity.
The introduction emphasizes video as a vital medium on the internet, offering dynamic information for a comprehensive user experience.
Research on video tasks using diffusion models is increasing, covering areas like video generation, editing, and understanding.
Preliminaries: Diffusion Model
A class of probabilistic generative models.
Learn to reverse a process that gradually degrades the training data structure.
This degradation can be thought of as adding noise to the data until it becomes pure noise.
Have become the state-of-the-art family of deep generative models.
Denoising Diffusion Probabilistic Models (DDPMs)
DDPMs involve Markov chains for forward (data to noise) and reverse (noise to data) processes.
Forward Markov Chain (Diffusion Process)
Gradually adds noise to the data until it becomes a simple prior distribution (e.g., Gaussian noise).
The transition kernel $q(x_t | x_{t-1})$ defines how each step in the forward process adds noise: $q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I)$
$x_t$: The data at time step $t$. $x_{t-1}$: The data at the previous time step $t-1$.
$\beta_t$: A hyperparameter controlling the amount of noise added at each step, where $\beta_t \in (0, 1)$.
$\mathcal{N}$: A normal (Gaussian) distribution. $I$: The identity matrix.
The joint distribution $q(x_1, ..., x_T | x_0)$ is the product of these transitions: $q(x_1, ..., x_T | x_0) = \prod_{t=1}^{T} q(x_t | x_{t-1})$
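The forward chain above can be sketched numerically. The toy NumPy loop below applies the transition kernel step by step (in practice $x_t$ is usually drawn in closed form from $q(x_t | x_0)$ rather than by looping):

```python
import numpy as np

def forward_diffusion(x0, betas, rng=None):
    """Simulate the forward Markov chain step by step: each transition
    q(x_t | x_{t-1}) scales the sample by sqrt(1 - beta_t) and adds
    Gaussian noise with variance beta_t."""
    rng = rng or np.random.default_rng(0)
    x = x0
    trajectory = [x0]
    for beta in betas:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
        trajectory.append(x)
    return trajectory

# Toy example: a 4-pixel "image" diffused over 1000 steps with a linear schedule.
x0 = np.ones(4)
traj = forward_diffusion(x0, np.linspace(1e-4, 0.02, 1000))
# By the final step the sample is statistically close to N(0, I).
```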
Reverse Markov Chain
The reverse process starts from a prior distribution $p(x_T)$ (typically Gaussian noise) and iteratively denoises to generate new data.
The prior distribution is defined as: $p(x_T) = \mathcal{N}(x_T; 0, I)$
$x_T$: The data at the final time step $T$, which is random noise.
$0$: A vector of zeros.
$I$: The identity matrix.
The learnable transition kernel $p_\theta(x_{t-1} | x_t)$ is modeled by a neural network: $p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$
$x_{t-1}$: The data at the previous time step $t-1$.
$x_t$: The data at time step $t$.
$\mu_\theta(x_t, t)$: The mean predicted by the neural network, dependent on $x_t$ and $t$.
$\Sigma_\theta(x_t, t)$: The variance predicted by the neural network, dependent on $x_t$ and $t$.
$\theta$: The parameters of the neural network.
To generate new data, sample $x_T \sim p(x_T)$ and iteratively sample from $p_\theta(x_{t-1} | x_t)$ until $t = 1$, obtaining $x_0$.
The core idea is to train the reverse Markov chain to accurately reverse the forward Markov chain, effectively learning to generate data from noise.
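A minimal sketch of this sampling loop, assuming the common noise-prediction parameterization (a hypothetical network `eps_model(x, t)` predicts the added noise, and the reverse variance is fixed to $\beta_t I$ rather than learned):

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng=None):
    """Ancestral sampling sketch. `eps_model(x, t)` is a hypothetical
    noise-prediction network; the reverse variance is fixed to beta_t."""
    rng = rng or np.random.default_rng(0)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        eps = eps_model(x, t)
        # posterior mean mu_theta(x_t, t) under the eps-parameterization
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        x = mean if t == 0 else mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Dummy "network" that predicts zero noise, just to exercise the loop.
x_gen = ddpm_sample(lambda x, t: np.zeros_like(x), (4,), np.linspace(1e-4, 0.02, 50))
```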
Score-Based Generative Models (SGMs)
SGMs introduce noise to data at various levels and then estimate 'scores' (gradients of the log probability density) for each noise level by training a noise-conditional score network.
SGMs separate the training of the model from the sample generation process, offering flexibility.
Mathematical Formulation
Let $q(x_0)$ be the data distribution. $0 < \sigma_1 < \sigma_2 < \dots < \sigma_T$ represents a sequence of increasing noise levels.
Gaussian noise is added to data point $x_0$ to get $x_t$, following the distribution $q(x_t | x_0) = \mathcal{N}(x_t; x_0, \sigma_t^2 I)$.
This creates a series of noisy data densities $q(x_1), q(x_2), \dots, q(x_T)$, where $q(x_t)$ is the integral of $q(x_t | x_0)q(x_0)$ over all possible values of $x_0$.
A noise-conditional score network (NCSN) $s_\theta(x, t)$ is trained to estimate the score function $\nabla_{x_t} \log q(x_t)$, which indicates the direction of increasing probability density at a given point $x_t$ and noise level $t$.
Training techniques include score matching, denoising score matching, and sliced score matching.
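Denoising score matching, for example, exploits the fact that the score of the perturbation kernel is known analytically. A minimal sketch at one noise level, with a hypothetical `score_fn` standing in for the trained network:

```python
import numpy as np

def dsm_loss(score_fn, x0, sigma, rng=None):
    """Denoising score matching at one noise level (sketch). For the
    perturbed sample x = x0 + sigma * z, the target score of q(x | x0)
    is -(x - x0) / sigma^2; the network should match it."""
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal(x0.shape)
    x = x0 + sigma * z
    target = -(x - x0) / sigma**2
    return float(np.mean((score_fn(x, sigma) - target) ** 2))

# A "network" that returns the exact target achieves zero loss.
loss = dsm_loss(lambda x, s: -x / s**2, np.zeros(4), sigma=0.5)
print(loss)  # 0.0
```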
Sample Generation | Iterative methods are used with score functions $s_\theta(x, T), s_\theta(x, T-1), \dots, s_\theta(x, 1)$ to generate samples, often using annealed Langevin dynamics (ALD).
ALD is an iterative sampling technique. It starts with a sample from a simple distribution (like Gaussian noise) and gradually refines it by using the score function to move the sample towards regions of higher probability.
The "annealed" part means that the step size or noise level is gradually reduced during the iteration, allowing for more precise refinement at later stages.
SGMs transform a complex data generation problem into a series of simpler noise estimation tasks, guided by the learned score function.
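ALD can be sketched in a few lines, assuming a hypothetical noise-conditional score network `score_fn(x, sigma)` (here replaced by the exact score of a Gaussian so the loop runs):

```python
import numpy as np

def annealed_langevin(score_fn, shape, sigmas, n_steps=10, eps=2e-5, rng=None):
    """Annealed Langevin dynamics (sketch). `score_fn(x, sigma)` is a
    hypothetical noise-conditional score network; `sigmas` runs from the
    largest noise level down to the smallest."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(shape)
    for sigma in sigmas:
        step = eps * (sigma / sigmas[-1]) ** 2  # anneal the step size
        for _ in range(n_steps):
            z = rng.standard_normal(shape)
            x = x + 0.5 * step * score_fn(x, sigma) + np.sqrt(step) * z
    return x

# The exact score of N(0, sigma^2 I) stands in for the trained network.
sample = annealed_langevin(lambda x, s: -x / s**2, (4,), sigmas=[1.0, 0.5, 0.1])
```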
Stochastic Differential Equations (Score SDEs)
Score SDEs use an infinite number of noise scales to perturb data. The diffusion process is modeled as a solution to an SDE: $dx = f(x, t)dt + \sigma(t)dw$,
$dx$ represents the change in the data $x$,
$f(x, t)$ is the drift function, defining the deterministic change in $x$ over time $t$,
$\sigma(t)$ is the diffusion function, controlling the amount of noise added,
$dw$ is the standard Wiener process, representing random noise.
The reverse process, which generates samples $x(0)$ from noise $x(T)$, is defined by the reverse-time SDE: $dx = [f(x, t) - \sigma(t)^2 \nabla_x \log q_t(x)]dt + \sigma(t)d\bar{w}$,
$\nabla_x \log q_t(x)$ is the score function, representing the gradient of the log probability density of the data at time $t$,
$d\bar{w}$ is the standard Wiener process when time flows backward.
Knowing the score function for all $t$ allows deriving and simulating the reverse diffusion process to sample from the original data distribution $p_0$.
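An Euler-Maruyama discretization of the reverse-time SDE makes this concrete. This sketch assumes zero drift $f(x, t) = 0$ and a constant diffusion coefficient, with a hypothetical `score_fn(x, t)` estimating $\nabla_x \log q_t(x)$:

```python
import numpy as np

def reverse_sde_sample(score_fn, shape, T=1.0, n_steps=500, sigma=1.0, rng=None):
    """Euler-Maruyama integration of the reverse-time SDE (sketch).

    Assumes f(x, t) = 0 and constant diffusion sigma(t) = sigma; `score_fn`
    is a hypothetical estimate of the score grad_x log q_t(x)."""
    rng = rng or np.random.default_rng(0)
    dt = T / n_steps
    x = rng.standard_normal(shape)  # start from noise x(T)
    for i in range(n_steps):
        t = T - i * dt
        z = rng.standard_normal(shape)
        # step backward in time: x <- x - [f - sigma^2 * score] * dt + sigma * sqrt(dt) * z
        x = x + sigma**2 * score_fn(x, t) * dt + sigma * np.sqrt(dt) * z
    return x

# Toy score of a standard Gaussian, grad log N(0, I) = -x, just to run the loop.
x_out = reverse_sde_sample(lambda x, t: -x, (4,))
```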
Related Works
Text-to-Video (T2V) Generation
Definition | T2V generates videos directly from natural language descriptions.
Process | The model needs to understand the scenes, objects, and actions described in the text. It then translates this understanding into a series of coherent video frames.
Goal | The generated video should have both logical and visual consistency, meaning the actions and scenes make sense and look realistic.
Applications | T2V has many uses, including automatically generating movies, creating animations, developing virtual reality content, and producing educational videos.
Unconditional Video Generation
Definition | This involves generating videos without any specific input conditions or guidance.
Process | The model starts from random noise or a fixed initial state and creates a continuous, visually coherent video sequence.
Key Difference | Unlike conditional video generation, no external information is provided to guide the process.
Goal | The model must learn to capture temporal dynamics (how things change over time), actions, and visual coherence on its own.
Importance | Unconditional video generation is important for:
Exploring how well generative models can learn video content from unsupervised data (data without labels or specific instructions).
Demonstrating the diversity of content a model can create.
Text-guided Video Editing
Definition | This technique uses textual descriptions to guide the editing of video content.
Process | A user provides a video and a natural language description of the desired changes.
The system analyzes the text, identifying relevant objects, actions, or scenes.
This information is then used to guide the editing process, modifying the video accordingly.
Benefits | Offers an efficient and intuitive way to edit videos.
Allows editors to communicate their intentions using natural language.
Reduces the need for manual, frame-by-frame editing.
Datasets and Metrics
The Comparison of Main Caption-level Video Datasets
The Comparison of Existing Category-level Datasets for Video Generation and Editing
Evaluation Metrics
Qualitative and quantitative measures are used to evaluate video generation.
Qualitative Measures
Human subjective evaluation is used in several works: evaluators compare generated videos and vote on realism, coherence, and text alignment. This approach is costly and may fail to reflect the full capabilities of the model.
Quantitative Evaluation Standards
Image-level and video-level assessments are used.
Image-level Metrics
Videos are composed of image frames, so image-level metrics provide insight into quality.
Fréchet Inception Distance (FID)
FID assesses the quality of generated videos by comparing synthesized video frames to real video frames.
It involves normalizing images to a consistent scale, extracting features from real and synthesized frames with InceptionV3, and computing their mean and covariance matrices.
These statistics are then combined to calculate the FID score.
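The Fréchet distance between the two fitted Gaussians can be computed from those statistics alone. A NumPy-only sketch, with random vectors standing in for real InceptionV3 activations (a symmetric square-root trick replaces a general matrix square root):

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def fid(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets
    (sketch; real InceptionV3 activations would be used in practice)."""
    mu1, mu2 = feats_real.mean(0), feats_gen.mean(0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_gen, rowvar=False)
    c1_half = _sqrtm_psd(c1)
    # tr((c1 c2)^{1/2}) equals tr((c1^{1/2} c2 c1^{1/2})^{1/2}), which is symmetric
    tr_covmean = np.trace(_sqrtm_psd(c1_half @ c2 @ c1_half))
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2) - 2.0 * tr_covmean)

rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 4))
# Identical feature sets give a distance of (numerically) zero.
```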
Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM)
Both are pixel-level metrics. SSIM evaluates brightness, contrast, and structural features of the original and generated images.
$PSNR = 10 \cdot \log_{10}\left(\frac{MAX_I^2}{MSE}\right)$, where $MSE = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2$
$MAX_I$ is the maximum possible pixel value of the image.
$MSE$ is the Mean Squared Error between the original and generated images.
PSNR is a coefficient representing the ratio between peak signal and Mean Squared Error (MSE).
These two metrics are commonly used to assess the quality of reconstructed image frames and are applied in tasks such as super-resolution and in-painting.
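PSNR follows directly from the formulas above; a minimal sketch:

```python
import numpy as np

def psnr(original, generated, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB (higher means closer to the original)."""
    mse = np.mean((original.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val**2 / mse)

a = np.zeros((8, 8))
b = np.full((8, 8), 10.0)  # constant error of 10 -> MSE = 100
print(round(psnr(a, b), 2))  # 28.13
```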
CLIPSIM
CLIPSIM is a method for measuring image-text relevance.
Based on the CLIP model, it extracts both image and text features and then computes the similarity between them.
This metric is often employed in text-conditional video generation or editing tasks.
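At its core the metric is a cosine similarity between the two embeddings; a sketch with placeholder vectors standing in for real CLIP features:

```python
import numpy as np

def clipsim(image_emb, text_emb):
    """Cosine similarity between an image embedding and a text embedding
    (sketch; the vectors would come from CLIP's two encoders)."""
    a = image_emb / np.linalg.norm(image_emb)
    b = text_emb / np.linalg.norm(text_emb)
    return float(a @ b)

# Placeholder 2-D vectors in place of real CLIP features.
score = clipsim(np.array([1.0, 0.0]), np.array([1.0, 1.0]))
print(round(score, 4))  # 0.7071
```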
Video-level Metrics
Image-level metrics focus on individual frames and disregard temporal coherence.
Video-level metrics provide a more comprehensive evaluation of video generation.
Fréchet Video Distance (FVD)
Fréchet Video Distance (FVD) is a video quality evaluation metric based on FID.
Unlike image-level methods that use the Inception network to extract features from a single frame, FVD employs the Inflated-3D Convnets (I3D) pre-trained on Kinetics to extract features from video clips.
Subsequently, FVD scores are computed through the combination of means and covariance matrices.
Kernel Video Distance (KVD)
KVD is also based on I3D features, but it differentiates itself by utilizing Maximum Mean Discrepancy (MMD), a kernel-based method, to assess the quality of generated videos.
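A minimal MMD sketch with an RBF kernel, using random vectors in place of I3D features:

```python
import numpy as np

def mmd_rbf(x, y, gamma=1.0):
    """Squared Maximum Mean Discrepancy with an RBF kernel (sketch; KVD
    applies MMD to I3D features of real vs. generated videos)."""
    def k(a, b):
        sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq_dists)
    return float(k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean())

rng = np.random.default_rng(0)
feats = rng.standard_normal((50, 3))
# Identical feature sets -> MMD of zero; well-separated sets -> larger MMD.
```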
Video Inception Score (IS)
Video IS (Inception Score) calculates the Inception score of generated videos using features extracted by a 3D ConvNet (C3D); it is often applied for evaluation on UCF-101.
High-quality videos yield a conditional class distribution $P(y|x)$ with low entropy, whereas diversity is assessed via the marginal distribution across all videos, which should exhibit high entropy.
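The score exponentiates the average KL divergence between $P(y|x)$ and the marginal $P(y)$; a sketch on a toy probability matrix:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from class probabilities p(y|x) (sketch; for Video IS
    the probabilities come from a C3D classifier over video clips)."""
    p_y = probs.mean(axis=0)  # marginal distribution p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident (low-entropy p(y|x)) and diverse (high-entropy p(y)) predictions
# maximize the score; with 4 one-hot classes it reaches the maximum of 4.
probs = np.eye(4)
print(round(inception_score(probs), 3))  # 4.0
```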
Frame Consistency CLIP Score
Frame Consistency CLIP Score is often used in video editing tasks to measure the coherence of edited videos.
It is calculated by obtaining CLIP image embeddings for all frames and averaging the cosine similarity between all pairs of frames.
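A sketch of that computation, with placeholder vectors standing in for CLIP image embeddings:

```python
import numpy as np

def frame_consistency(frame_embs):
    """Average cosine similarity over all pairs of frame embeddings
    (sketch; the embeddings would come from a CLIP image encoder)."""
    e = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sim = e @ e.T
    i, j = np.triu_indices(len(e), k=1)  # all unordered frame pairs
    return float(sim[i, j].mean())

# Four identical "frames" -> perfect consistency of 1.0.
embs = np.tile(np.array([1.0, 2.0, 3.0]), (4, 1))
print(round(frame_consistency(embs), 6))  # 1.0
```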
A survey on video diffusion models