VR-Thinker achieves the highest score among all compared reward models on three video preference benchmarks: VideoGen-Reward, GenAI-Bench, and MJ-Bench-Video.
| Model | Size | VideoGen-Reward | GenAI-Bench | MJ-Bench-Video |
|---|---|---|---|---|
| **Classifier-based Reward Models** | | | | |
| VideoScore | 7B | 50.2 | 70.9 | 63.5 |
| VideoReward | 2B | 73.8 | 73.1 | 62.6 |
| VisionReward | 13B | 68.4 | 72.7 | 65.2 |
| **Generative-based Reward Models** | | | | |
| LiFT | 13B | 57.9 | 59.4 | 51.4 |
| UnifiedReward | 7B | 78.6 | 76.8 | 69.5 |
| **Reasoning-based Reward Models** | | | | |
| UnifiedReward-Think | 7B | 79.1 | 80.4 | 71.9 |
| VR-Thinker | 7B | 80.5 | 82.3 | 75.6 |

Numbers are diff (%) on each benchmark, taken from the paper's Table 1 (main results).
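
To make the gaps in the table easier to read, the following minimal sketch computes VR-Thinker's margin over the strongest other model on each benchmark. The scores are copied verbatim from the table above; the names `SCORES`, `BENCHMARKS`, and `margins_over_best_baseline` are illustrative only and are not part of any released code.

```python
# Scores copied from the results table above (diff, %), in benchmark order.
BENCHMARKS = ("VideoGen-Reward", "GenAI-Bench", "MJ-Bench-Video")
SCORES = {
    "VideoScore": (50.2, 70.9, 63.5),
    "VideoReward": (73.8, 73.1, 62.6),
    "VisionReward": (68.4, 72.7, 65.2),
    "LiFT": (57.9, 59.4, 51.4),
    "UnifiedReward": (78.6, 76.8, 69.5),
    "UnifiedReward-Think": (79.1, 80.4, 71.9),
    "VR-Thinker": (80.5, 82.3, 75.6),
}


def margins_over_best_baseline(target="VR-Thinker"):
    """Return {benchmark: (best_other_model, margin_in_points)} for `target`."""
    result = {}
    for i, bench in enumerate(BENCHMARKS):
        # Scores of every model except the target on this benchmark.
        others = {name: row[i] for name, row in SCORES.items() if name != target}
        best = max(others, key=others.get)
        # Round to one decimal to match the table's precision.
        result[bench] = (best, round(SCORES[target][i] - others[best], 1))
    return result


if __name__ == "__main__":
    for bench, (name, margin) in margins_over_best_baseline().items():
        print(f"{bench}: +{margin} points over {name}")
```

On these numbers the closest competitor on every benchmark is UnifiedReward-Think, with margins of 1.4, 1.9, and 3.7 points respectively.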
4-Dimension Evaluation
Videos are scored across four complementary dimensions, each contributing to the overall preference judgment:

| Dimension | Abbreviation | What it measures |
|---|---|---|
| Text Alignment | TA | How well the video content matches the text prompt |
| Motion Quality | MQ | Naturalness and smoothness of motion |
| Visual Quality | VQ | Resolution, clarity, and visual artifacts |
| Overall Assessment | OA | Holistic quality judgment |
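
As an illustration of how a four-dimension pairwise judgment could be represented, here is a minimal sketch. It is not VR-Thinker's actual output format or aggregation rule, which this section does not specify. The class `PairwiseJudgment`, the choice labels `"A"`, `"B"`, `"tie"`, and the plurality-based consistency check are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass

# The four dimensions and abbreviations listed above.
DIMENSIONS = {
    "TA": "Text Alignment",
    "MQ": "Motion Quality",
    "VQ": "Visual Quality",
    "OA": "Overall Assessment",
}
VALID_CHOICES = {"A", "B", "tie"}  # assumed labels for a two-video comparison


@dataclass(frozen=True)
class PairwiseJudgment:
    """Hypothetical container: which video wins on each dimension."""
    TA: str
    MQ: str
    VQ: str
    OA: str

    def __post_init__(self):
        # Reject any label outside the assumed choice set.
        for dim in DIMENSIONS:
            choice = getattr(self, dim)
            if choice not in VALID_CHOICES:
                raise ValueError(f"{dim}: expected one of {sorted(VALID_CHOICES)}, got {choice!r}")

    def supporting_votes(self):
        """Count wins over the three non-holistic dimensions (TA, MQ, VQ)."""
        return Counter(getattr(self, dim) for dim in ("TA", "MQ", "VQ"))

    def agrees_with_overall(self):
        """ASSUMPTION: a sanity check, not the model's rule.

        True when TA/MQ/VQ have a unique plurality winner and OA matches it.
        """
        votes = self.supporting_votes()
        top, top_count = votes.most_common(1)[0]
        unique = list(votes.values()).count(top_count) == 1
        return unique and top == self.OA


if __name__ == "__main__":
    judgment = PairwiseJudgment(TA="A", MQ="A", VQ="B", OA="A")
    print(dict(judgment.supporting_votes()), judgment.agrees_with_overall())
```

Keeping OA as its own field, rather than deriving it from the other three, mirrors the description above: the overall assessment is a holistic judgment that the other dimensions inform but do not mechanically determine.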