VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

Abstract

Multimodal reward models face inherent limitations on video preference data: visual inputs consume a substantial share of the context budget, forcing RMs to process fewer frames and risking the loss of fine-grained details. Furthermore, all visual information is typically packed into the initial prompt; the RM's Chain-of-Thought reasoning then proceeds purely in text, without revisiting or updating visual evidence, which exacerbates forgetting and hallucination.

We introduce a novel thinking-with-image framework that equips the reward model with visual reasoning operations (e.g., SelectFrame) and a configurable visual memory window, enabling active visual evidence acquisition during reasoning. Frame selection allows the model to actively retrieve previously seen frames and acquire unseen visual evidence as new inputs to subsequent reasoning rounds. The configurable memory window retains only the most recently active visual information, keeping the memory footprint stable while extending both the reasoning horizon and the total number of frames the model can process.

Building on this framework, we propose VR-Thinker, the first multimodal reward model capable of visual reasoning. Our training pipeline comprises three stages: (I) Cold Start using curated visual CoT data, (II) Rejection Sampling Fine-Tuning on verified high-quality traces, and (III) GRPO reinforcement learning. A 7B VR-Thinker achieves 80.5% on VideoGen-Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video.

Method

Thinking-with-Image Framework

Unlike standard chain-of-thought approaches that pack all visual frames into the initial prompt, VR-Thinker interleaves reasoning with visual operations. The model iteratively performs tool invocations to retrieve additional frames and updates its reasoning by incorporating the tool-execution outcomes.

Key components:

Tool Invocation (SelectFrame): The model actively retrieves frames from the full video when missing evidence prevents a definitive judgment, enabling on-demand visual evidence acquisition.
Window Memory: Each tool-execution outcome remains active for a preset number of rounds before being deliberately forgotten, keeping the total context usage stable regardless of how many reasoning steps are taken.
Reasoning Format: XML-style tags (<Snapshot>, <Think>, <Recommend Answer>, <Answer>) delineate functional areas, ensuring clarity and consistency.
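
The interplay between SelectFrame and the window memory can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `WindowMemory`, the `window` parameter, and the representation of a tool outcome as a list of frame indices are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WindowMemory:
    """Keeps tool outcomes active for a fixed number of rounds (hypothetical sketch)."""
    window: int  # rounds an outcome stays visible (assumed parameter)
    entries: list = field(default_factory=list, init=False)  # (round_added, frames)
    round: int = field(default=0, init=False)

    def add(self, frame_indices):
        """Record frames returned by a SelectFrame-style call in the current round."""
        self.entries.append((self.round, list(frame_indices)))

    def step(self):
        """Advance one reasoning round and forget outcomes older than the window."""
        self.round += 1
        self.entries = [
            (r, frames) for r, frames in self.entries
            if self.round - r < self.window
        ]

    def active_frames(self):
        """Frame indices currently visible, in insertion order, without duplicates."""
        seen, out = set(), []
        for _, frames in self.entries:
            for i in frames:
                if i not in seen:
                    seen.add(i)
                    out.append(i)
        return out


memory = WindowMemory(window=2)
memory.add([0, 8, 16])    # round 0: initial downsampled frames
memory.step()
memory.add([10, 11, 12])  # round 1: frames requested via SelectFrame
memory.step()
memory.add([30, 31])      # round 2: round-0 frames have now expired
visible = memory.active_frames()
```

With a window of two rounds, the number of frames in context depends only on the most recent requests, while the total number of frames inspected over the whole trace keeps growing.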

3-Stage Training Pipeline

Stage I: Cold Start

Using curated visual CoT data generated by GPT-4o, we instill basic textual reasoning skills and tool-calling syntax via SFT. Two-stage filtering ensures format compliance and judgment accuracy.

Stage II: Rejection Sampling FT

The Stage I model generates multiple CoT samples per input. Only traces where all per-dimension and overall judgments are correct are retained for SFT, consolidating high-quality reasoning.
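
The retention rule can be illustrated with a short filter. This is a sketch only: the trace layout (a dict with a `judgments` field), the `overall` key, and the labels `"A"`, `"B"`, and `"tie"` are assumptions introduced for the example.

```python
DIMENSIONS = ("TA", "MQ", "VQ", "OA")  # dimension labels used on this page


def is_fully_correct(predicted, reference):
    """True only if every per-dimension judgment and the overall preference match."""
    keys = DIMENSIONS + ("overall",)  # "overall" is a hypothetical key name
    return all(predicted.get(k) == reference[k] for k in keys)


def rejection_sample(traces, reference):
    """Keep only the sampled traces whose judgments are all correct."""
    return [t for t in traces if is_fully_correct(t["judgments"], reference)]


reference = {"TA": "A", "MQ": "A", "VQ": "tie", "OA": "A", "overall": "A"}
traces = [
    {"id": 0, "judgments": dict(reference)},              # all correct: kept
    {"id": 1, "judgments": {**reference, "MQ": "B"}},     # one dimension wrong
    {"id": 2, "judgments": {**reference, "overall": "B"}},  # overall wrong
]
kept_ids = [t["id"] for t in rejection_sample(traces, reference)]
```

A single wrong dimension is enough to discard a trace, so the surviving SFT data contains only reasoning that reached the right conclusion on every axis.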

Stage III: GRPO

Group Relative Policy Optimization with rule-based rewards (format + accuracy + CoT gain + exploratory incentive) reinforces multimodal reasoning capabilities.
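
For readers unfamiliar with GRPO, its core idea is to score each sampled response relative to the other responses drawn for the same input. The sketch below shows the commonly used mean/std normalization; whether VR-Thinker uses exactly this variant (for example population versus sample standard deviation, or the value of `eps`) is not stated here and is an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled traces for one video pair, with hypothetical scalar rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Traces that score above the group mean receive a positive advantage and are reinforced; traces below the mean are discouraged. If every trace in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.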

Reward Design

The GRPO stage uses a composite reward signal:

Format Reward: Ensures correct XML tag structure and answer format compliance.
Accuracy Reward: Evaluates both per-dimension judgments (TA, MQ, VQ, OA) and overall preference, expanding the answer space to 3^(d+1) to reduce misleading reward signals.
CoT Gain Reward: Rewards accuracy improvement between successive reasoning updates, encouraging the model to leverage visual evidence for better conclusions.
Exploratory Incentive: Enforces a lower bound on multimodal reasoning proportion to prevent the model from defaulting to purely textual reasoning.
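
One way the four terms could be combined is sketched below. Everything numeric here is hypothetical: the weights, the bonus size, the `min_ratio` threshold, the trace fields (`format_ok`, `round_answers`, `num_tool_calls`), and the choice to measure accuracy as the share of matching judgments are assumptions for illustration, not values or definitions from the paper.

```python
def accuracy(predicted, reference):
    """Share of judgments (each dimension plus overall) matching the reference."""
    return sum(predicted.get(k) == v for k, v in reference.items()) / len(reference)


def cot_gain(round_answers, reference):
    """Accuracy change from the first to the last intermediate answer."""
    if len(round_answers) < 2:
        return 0.0
    return accuracy(round_answers[-1], reference) - accuracy(round_answers[0], reference)


def exploratory_bonus(uses_tool, group_tool_ratio, min_ratio=0.5, bonus=0.2):
    """Bonus for tool-using traces when the group is below the target ratio."""
    return bonus if uses_tool and group_tool_ratio < min_ratio else 0.0


def composite_reward(trace, reference, group_tool_ratio, weights=(0.5, 1.0, 0.5)):
    """Weighted sum of format, accuracy, and CoT gain, plus the exploratory bonus."""
    w_format, w_acc, w_gain = weights  # hypothetical weights
    answers = trace["round_answers"]
    final = answers[-1] if answers else {}
    return (
        w_format * float(trace["format_ok"])
        + w_acc * accuracy(final, reference)
        + w_gain * cot_gain(answers, reference)
        + exploratory_bonus(trace["num_tool_calls"] > 0, group_tool_ratio)
    )


reference = {"TA": "A", "MQ": "B", "VQ": "tie", "OA": "A", "overall": "A"}
trace = {
    "format_ok": True,  # assumed to come from a separate tag-structure check
    "round_answers": [{**reference, "MQ": "A"}, dict(reference)],  # fixed after a tool call
    "num_tool_calls": 1,
}
reward = composite_reward(trace, reference, group_tool_ratio=0.25)
```

In this example the trace corrects its Motion Quality judgment after retrieving frames, so it earns a positive CoT gain, and because only a quarter of the group used tools it also collects the exploratory bonus.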

Qualitative Example

When frames are downsampled, key information may be absent from the initial input. In such cases VR-Thinker reasons about which visual evidence is missing and uses SelectFrame to request the specific frames needed to resolve the ambiguity, which helps it reach a correct judgment that the downsampled input alone would not support.

Results

VR-Thinker achieves state-of-the-art performance across three major video preference benchmarks.

Model	Size	VideoGen-Reward	GenAI-Bench	MJ-Bench-Video
Classifier-based Reward Models
VideoScore	7B	50.2	70.9	63.5
VideoReward	2B	73.8	73.1	62.6
VisionReward	13B	68.4	72.7	65.2
Generative-based Reward Models
LiFT	13B	57.9	59.4	51.4
UnifiedReward	7B	78.6	76.8	69.5
Reasoning-based Reward Models
UnifiedReward-Think	7B	79.1	80.4	71.9
VR-Thinker	7B	80.5	82.3	75.6

Numbers are diff(%) on each benchmark, taken from the paper's Table 1 (main results).

4-Dimension Evaluation

Videos are scored across four complementary dimensions, each contributing to the overall preference judgment:

Text Alignment (TA): How well video content matches the text prompt
Motion Quality (MQ): Naturalness and smoothness of motion
Visual Quality (VQ): Resolution, clarity, and visual artifacts
Overall Assessment (OA): Holistic quality judgment

BibTeX

@article{wang2025vrthinker,
  title     = {VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning},
  author    = {Wang, Qunzhong and Liu, Jie and Liang, Jiajun and Jiang, Yilei and Zhang, Yuanxing and Chen, Jinyuan and Zheng, Yaozhi and Wang, Xintao and Wan, Pengfei and Yue, Xiangyu and Liu, Jiaheng},
  journal   = {arXiv preprint arXiv:2510.10518},
  year      = {2025},
}