
vsci-bench-logo

 VideoScience-Bench: Benchmarking Scientific Reasoning in Video Generations

📄 Paper · 📝 Blog · 🤗 Dataset · 🚀 Demo


What this repo provides

VideoScience-Bench evaluates whether video models can go beyond looking plausible to being scientifically correct.

  • 200 undergraduate-level scientific scenarios (physics + chemistry)
    • 160 for T2V evaluation
    • 40 for I2V evaluation
  • 12 topics, 103 concepts, and multi-concept scientific reasoning required in a single prompt
  • Evaluation along 5 dimensions (Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, Spatio-Temporal Coherence)

VideoScience-Judge is an auto evaluation pipeline that supports:

  1. Prompt-specific checklist generation
  2. CV-grounded evidence extraction (e.g., object detection, object tracking, motion tracking)
  3. Salient key-frame selection where scientific phenomena occur
  4. Final grading with a reasoning-capable VLM
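The four stages above can be sketched as a small pipeline. This is an illustrative skeleton, not the repo's actual API: every function name, signature, and the stub logic inside are assumptions for exposition.

```python
from dataclasses import dataclass

@dataclass
class JudgeResult:
    checklist: list   # prompt-specific rubric items
    evidence: dict    # CV-grounded signals (detections, tracks, motion)
    key_frames: list  # frame indices where phenomena should be visible
    scores: dict      # final per-dimension Likert grades

def generate_checklist(prompt):
    # Stage 1 (stub): turn the prompt into concrete, checkable criteria.
    return [f"Does the video show: {prompt}?"]

def extract_cv_evidence(video):
    # Stage 2 (stub): real pipeline runs detectors and trackers here.
    return {"num_frames": len(video)}

def select_key_frames(video):
    # Stage 3 (stub): pick salient frames; here just first and last.
    return [0, len(video) - 1] if video else []

def grade_with_vlm(checklist, evidence, key_frames):
    # Stage 4 (stub): a reasoning VLM would produce Likert 1-4 scores.
    return {"PCS": 4, "PCG": 3, "CDN": 3, "IMB": 4, "STC": 4}

def judge(prompt, video):
    # Chain the four stages into one result.
    checklist = generate_checklist(prompt)
    evidence = extract_cv_evidence(video)
    key_frames = select_key_frames(video)
    scores = grade_with_vlm(checklist, evidence, key_frames)
    return JudgeResult(checklist, evidence, key_frames, scores)
```

The key design point is that stages 1–3 produce intermediate artifacts that the final VLM grading step consumes, rather than the VLM seeing raw video alone.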

Dataset Overview

VideoScience-Bench is curated to stress scientific reasoning in video generation: each prompt typically requires at least 2 interacting scientific concepts to produce the correct phenomenon.

Topics (12)

Physics (7):

  • Classical Mechanics
  • Thermodynamics
  • Electromagnetism
  • Optics
  • Fluid Mechanics
  • Material Mechanics
  • Modern Physics

Chemistry (5):

  • Redox Reactions
  • Acid-Base
  • Reaction Kinetics
  • Solution and Phase Chemistry
  • Materials and Solid-State Chemistry

What each example contains

The prompt suite is lightweight and easy to integrate into any video generation harness.

Common fields (as in the HF release):

  • prompt: the experimental setup + procedure
  • expected phenomenon: a concise description of what should happen if the laws are obeyed
  • keywords: fine-grained scientific concepts involved
  • field: Physics / Chemistry
  • vid: instance id

Loading from Hugging Face

from datasets import load_dataset

ds = load_dataset("lmgame/VideoScienceBench")
data = ds["test"]

# Sanity-check an example:
print(data[0]["prompt"])
print(data[0]["expected phenomenon"])
print(data[0]["keywords"])
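Since each record carries a `field` label, you can split the suite by domain before generation. A minimal sketch on synthetic records (the exact label strings are assumed from the field description above):

```python
def split_by_field(records):
    # Group benchmark records by their scientific domain.
    groups = {}
    for rec in records:
        groups.setdefault(rec["field"], []).append(rec)
    return groups

# Synthetic records with the same keys as the HF release.
sample = [
    {"vid": 1, "field": "Physics"},
    {"vid": 2, "field": "Chemistry"},
    {"vid": 3, "field": "Physics"},
]
groups = split_by_field(sample)
```

The same function works directly on the `data` split loaded above, since `load_dataset` rows behave like dicts.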

Installation

Basic Setup

# Clone the repository
git clone https://github.com/hao-ai-lab/VideoScience.git
cd VideoScience

# Install dependencies
pip install -r requirements.txt

FastVideo Setup

FastVideo is a video generation provider that supports two modes of operation:

Option 1: Remote API Server (Recommended for Production)

If you have a deployed FastVideo API server:

export FASTVIDEO_API_BASE="http://your-fastvideo-server:8000"
export FASTVIDEO_API_KEY="your-api-key"  # Optional, if authentication is required
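A client would typically read these variables and attach the key to each request. The helper below is a sketch; in particular, the bearer-token header scheme is an assumption, not documented FastVideo behavior — verify against your deployment.

```python
import os

def fastvideo_config():
    # Read the server address and optional API key from the environment.
    base = os.environ.get("FASTVIDEO_API_BASE", "http://localhost:8000")
    key = os.environ.get("FASTVIDEO_API_KEY")  # may be unset
    headers = {"Content-Type": "application/json"}
    if key:
        # Assumed auth scheme: bearer token (check your server's docs).
        headers["Authorization"] = f"Bearer {key}"
    return base, headers
```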

Option 2: Local Inference Mode

For local GPU inference:

# Install FastVideo package
pip install fastvideo

# Set the model path (will be downloaded on first use)
export FASTVIDEO_MODEL_PATH="FastVideo/FastWan2.1-T2V-1.3B-Diffusers"

Requirements for local inference:

  • CUDA-capable GPU with sufficient VRAM
  • PyTorch with CUDA support

Usage

1) Batched video generation

  1. Download the data file to data/database/data_filtered.jsonl.
  2. Launch the script:
bash scripts/batched_generation_using_csv.sh
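The batched script iterates over the prompts in the data file. Reading a JSONL prompt file by hand looks like this (the two records below are synthetic, for illustration only):

```python
import json
import os
import tempfile

def load_prompts(path):
    # Each line of a .jsonl file is one standalone JSON record.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Build a tiny synthetic .jsonl file to demonstrate the reader.
tmp = tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False)
tmp.write('{"vid": 1, "prompt": "ice melting in warm water"}\n')
tmp.write('{"vid": 2, "prompt": "pendulum swinging"}\n')
tmp.close()
prompts = load_prompts(tmp.name)
os.unlink(tmp.name)
```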

2) Single video generation

python3 single_generation_frontend.py \
  --provider {provider_name} \
  --model {model_name} \
  --prompt "{your_prompt}"

3) VLM-as-a-judge evaluation

bash judge/batched_evaluate_all_models.sh

Evaluation Metrics

We evaluate each generated video on five dimensions (Likert 1–4):

  • Prompt Consistency (PCS): is the setup/procedure faithful to the prompt?
  • Phenomenon Congruency (PCG): does the correct scientific outcome occur?
  • Correct Dynamism (CDN): are motions / dynamics physically consistent?
  • Immutability (IMB): are static attributes preserved (no flicker/identity drift)?
  • Spatio-Temporal Coherence (STC): is the video coherent over time and space?
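With five Likert scores per video, a per-model aggregate can be as simple as averaging over dimensions and then over videos. The aggregation rule here is illustrative — check the judge scripts for the scheme actually used:

```python
def aggregate_scores(videos):
    # Mean over the five dimensions per video, then mean over videos.
    per_video = [sum(scores.values()) / len(scores) for scores in videos]
    return sum(per_video) / len(per_video)

# Two hypothetical videos scored on the five dimensions (Likert 1-4).
ratings = [
    {"PCS": 4, "PCG": 3, "CDN": 3, "IMB": 4, "STC": 4},
    {"PCS": 2, "PCG": 2, "CDN": 3, "IMB": 3, "STC": 4},
]
```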

VideoScience-Judge vs. Human Annotations

Manual scientific evaluation is expensive. VideoScience-Judge aims to stay aligned with human experts while remaining scalable.

Ranking correlation with expert ratings

We report ranking correlations between automatic metrics and domain-expert annotations across 7 evaluated video models.

Metric                                 Kendall τ   Spearman ρ
VSci-Judge                             0.81        0.89
VSci-Judge (Checklist)                 0.90        0.96
VSci-Judge (Checklist + CV evidence)   0.90        0.96
PhyGenEval                             0.52        0.61
VideoScore2                            0.24        0.29

Note: adding prompt-specific checklists (and optional CV evidence) makes the judge align near-perfectly with expert-ranked model quality on VideoScience-Bench.
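Kendall τ over two score lists can be computed in a few lines (fraction of concordant minus discordant pairs). The judge/expert scores below are hypothetical, not the paper's data:

```python
from itertools import combinations

def kendall_tau(x, y):
    # Concordant minus discordant pairs, over all pairs (assumes no ties).
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical per-model scores from the judge vs. human experts.
judge_scores = [3.1, 2.4, 3.8, 1.9, 2.9]
expert_scores = [2.5, 2.6, 3.9, 2.0, 2.7]
```

For real use, `scipy.stats.kendalltau` and `scipy.stats.spearmanr` handle ties and significance testing.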

VideoScience-Judge Features

  1. [optional] Checklist generation: create an evaluative rubric tied to the prompt
  2. [optional, recommended] CV-based evidence extraction: tracking, motion, attribute changes, key frames
  3. Final grading: a VLM-as-a-judge reasons over the checklist and all evidence

Citation

If you use VideoScience in your research, please cite:

@article{hu2025videoscience,
  title={Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench},
  author={Hu, Lanxiang and Shankarampeta, Abhilash and Huang, Yixin and Dai, Zilin and Yu, Haoyang and Zhao, Yujie and Kang, Haoqiang and Zhao, Daniel and Rosing, Tajana and Zhang, Hao},
  journal={arXiv preprint arXiv:2512.02942},
  year={2025}
}

License

This project is released under the MIT License. See LICENSE.
