VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

Qunzhong Wang1,2, Jie Liu1, Jiajun Liang2, Yilei Jiang1, Yuanxing Zhang2, Yaozhi Zheng1, Xintao Wang2, Pengfei Wan2, Xiangyu Yue1, Jiaheng Liu3
1CUHK MMLab, 2Kling Team, Kuaishou Technology, 3Nanjing University
VR-Thinker overview

VR-Thinker equips video reward models with visual reasoning operations, enabling active visual evidence acquisition during chain-of-thought reasoning.

Abstract

Multimodal reward models face inherent limitations for video preference data: visual inputs consume substantial context budget, forcing RMs to process fewer frames and risking the loss of fine-grained details. Furthermore, all visual information is typically packed into the initial prompt; during the RM's Chain-of-Thought reasoning, the process proceeds purely in text without revisiting or updating visual evidence, which exacerbates forgetting and hallucination.

We introduce a novel thinking-with-image framework that equips the reward model with visual reasoning operations (e.g., SelectFrame) and a configurable visual memory window, enabling active visual evidence acquisition during reasoning. Frame selection allows the model to actively retrieve previously seen frames and acquire unseen visual evidence as new inputs to subsequent reasoning rounds. The configurable memory window retains only the most recently active visual information, keeping the memory footprint stable while extending both the reasoning horizon and the total number of frames the model can process.

Building on this framework, we propose VR-Thinker, the first multimodal reward model capable of visual reasoning. Our training pipeline comprises three stages: (I) Cold Start using curated visual CoT data, (II) Rejection Sampling Fine-Tuning on verified high-quality traces, and (III) GRPO reinforcement learning. A 7B VR-Thinker achieves 80.5% on VideoGen-Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video.

Method

Thinking-with-Image Framework

Unlike standard chain-of-thought approaches that pack all visual frames into the initial prompt, VR-Thinker interleaves reasoning with visual operations. The model iteratively performs tool invocations to retrieve additional frames and updates its reasoning by incorporating the tool-execution outcomes.

Key components:

  • Tool Invocation (SelectFrame): The model actively retrieves frames from the full video when missing evidence prevents a definitive judgment, enabling on-demand visual evidence acquisition.
  • Window Memory: Each tool-execution outcome remains active for a preset number of rounds before being deliberately forgotten, keeping the total context usage stable regardless of how many reasoning steps are taken.
  • Reasoning Format: XML-style tags (<Snapshot>, <Think>, <Recommend Answer>, <Answer>) delineate functional areas, ensuring clarity and consistency.
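
The interplay of these components can be sketched as a simple loop. The snippet below is a toy illustration with a scripted stand-in for the model (the `run_loop` function, the `("select", ...)`/`("answer", ...)` step format, and the window behavior are assumptions for illustration, not the paper's actual API):

```python
from collections import deque

# Toy sketch of the interleaved thinking-with-image loop. `steps` is a
# scripted stand-in for model outputs: each element is either
# ("select", [frame indices]) -- a SelectFrame call -- or ("answer", text).
def run_loop(steps, all_frames, window_size=2, max_rounds=8):
    window = deque(maxlen=window_size)  # only recent tool outcomes stay active
    trace = []
    for kind, payload in steps[:max_rounds]:
        if kind == "select":
            frames = [all_frames[i] for i in payload]
            window.append(frames)       # newest evidence enters the window
            trace.append(f"<Snapshot>frames {payload}</Snapshot>")
        else:                           # final judgment ends the loop
            trace.append(f"<Answer>{payload}</Answer>")
            break
    visible = [f for batch in window for f in batch]
    return trace, visible

steps = [("select", [0, 8]), ("select", [16]), ("select", [31]),
         ("answer", "A")]
trace, visible = run_loop(steps, list(range(32)))
# with window_size=2, the round-1 frames [0, 8] have been forgotten
```

The point of the deque is that context cost is bounded by `window_size` rather than by the number of reasoning rounds, which is what lets the horizon and total frame count grow.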

3-Stage Training Pipeline

Stage I: Cold Start

Using curated visual CoT data generated by GPT-4o, we instill basic textual reasoning skills and tool-calling syntax via SFT. Two-stage filtering ensures format compliance and judgment accuracy.
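
A minimal sketch of the format-compliance half of this filter, assuming the tag set from the reasoning format above (the exact filtering criteria are an assumption):

```python
import re

# Tags from the paper's reasoning format; requiring all of them is an
# illustrative assumption about the format filter.
REQUIRED_TAGS = ["Snapshot", "Think", "Recommend Answer", "Answer"]

def format_compliant(trace: str) -> bool:
    """A trace passes only if every required tag opens and closes."""
    return all(
        re.search(rf"<{tag}>.*?</{tag}>", trace, re.DOTALL)
        for tag in REQUIRED_TAGS
    )

good = ("<Snapshot>frames [0, 8]</Snapshot><Think>...</Think>"
        "<Recommend Answer>A</Recommend Answer><Answer>A</Answer>")
bad = "<Think>...</Think><Answer>A</Answer>"  # missing tags
```

Traces that pass this check would then go through the second stage, which verifies judgment accuracy against labels.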

Stage II: Rejection Sampling FT

The Stage I model generates multiple CoT samples per input. Only traces where all per-dimension and overall judgments are correct are retained for SFT, consolidating high-quality reasoning.
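
The retention rule can be sketched as follows, assuming judgments are stored per dimension (the dict structure and dimension keys are illustrative):

```python
# A sampled CoT trace is kept only if every per-dimension judgment
# (TA, MQ, VQ) and the overall judgment (OA) match the reference labels.
DIMS = ["TA", "MQ", "VQ", "OA"]

def keep_trace(pred: dict, gold: dict) -> bool:
    """All per-dimension and overall judgments must be correct."""
    return all(pred.get(d) == gold[d] for d in DIMS)

gold = {"TA": "A", "MQ": "B", "VQ": "A", "OA": "A"}
kept = keep_trace({"TA": "A", "MQ": "B", "VQ": "A", "OA": "A"}, gold)
dropped = keep_trace({"TA": "A", "MQ": "A", "VQ": "A", "OA": "A"}, gold)
```

Requiring every judgment to be correct, rather than only the overall one, filters out traces that reach the right answer through flawed per-dimension reasoning.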

Stage III: GRPO

Group Relative Policy Optimization with rule-based rewards (format + accuracy + CoT gain + exploratory incentive) reinforces multimodal reasoning capabilities.

Reward Design

The GRPO stage uses a composite reward signal:

  • Format Reward: Ensures correct XML tag structure and answer format compliance.
  • Accuracy Reward: Evaluates both per-dimension judgments (TA, MQ, VQ, OA) and the overall preference, expanding the answer space to 3^(d+1) outcomes to reduce misleading reward signals.
  • CoT Gain Reward: Rewards accuracy improvement between successive reasoning updates, encouraging the model to leverage visual evidence for better conclusions.
  • Exploratory Incentive: Enforces a lower bound on multimodal reasoning proportion to prevent the model from defaulting to purely textual reasoning.
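
The composition of these terms can be sketched as below; the weights, the CoT-gain definition, and the multimodal floor value are illustrative assumptions, not the paper's exact formulation:

```python
# Hedged sketch of the composite GRPO reward signal.
def composite_reward(fmt_ok: bool, acc: float, acc_prev: float,
                     mm_fraction: float, mm_floor: float = 0.3) -> float:
    """Combine format, accuracy, CoT-gain, and exploration terms."""
    r_format = 1.0 if fmt_ok else 0.0
    r_acc = acc                          # fraction of correct judgments
    r_gain = max(0.0, acc - acc_prev)    # reward improvement across rounds
    # exploratory incentive: penalize traces below the multimodal floor
    r_explore = 0.0 if mm_fraction >= mm_floor else -1.0
    return r_format + r_acc + r_gain + r_explore
```

The exploration term acts as a hard constraint rather than a graded bonus here; a graded variant would shade the penalty by how far the trace falls below the floor.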

Qualitative Example

VR-Thinker qualitative case

When frames are downsampled, key information may be missing from the input. VR-Thinker actively retrieves the relevant frames via SelectFrame, recovering the evidence needed for a correct judgment. The model reasons about which visual evidence is missing and strategically requests specific frames to resolve the ambiguity.

Results

VR-Thinker achieves state-of-the-art performance across three major video preference benchmarks.

| Model | Size | VideoGen-Reward | GenAI-Bench | MJ-Bench-Video |
|---|---|---|---|---|
| VideoScore | - | 54.5 | 58.3 | 53.4 |
| VideoReward | - | 65.1 | 70.7 | 63.9 |
| GPT-4o | - | 62.5 | 67.3 | 58.7 |
| Qwen2.5-VL | 7B | 64.8 | 69.5 | 60.2 |
| VR-Thinker | 7B | 80.5 | 82.3 | 75.6 |

4-Dimension Evaluation

Videos are scored across four complementary dimensions, each contributing to the overall preference judgment:

  • Text Alignment (TA): How well video content matches the text prompt
  • Motion Quality (MQ): Naturalness and smoothness of motion
  • Visual Quality (VQ): Resolution, clarity, and visual artifacts
  • Overall Assessment (OA): Holistic quality judgment

BibTeX

@article{wang2025vrthinker,
  title     = {VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning},
  author    = {Wang, Qunzhong and Liu, Jie and Liang, Jiajun and Jiang, Yilei and Zhang, Yuanxing and Chen, Jinyuan and Zheng, Yaozhi and Wang, Xintao and Wan, Pengfei and Yue, Xiangyu and Liu, Jiaheng},
  journal   = {arXiv preprint arXiv:2510.10518},
  year      = {2025},
}