Multimodal reward models face inherent limitations for video
preference data: visual inputs consume substantial context
budget, forcing RMs to process fewer frames and risking the loss
of fine-grained details. Furthermore, all visual information is
typically packed into the initial prompt; during the RM's
Chain-of-Thought reasoning, the process proceeds purely in text
without revisiting or updating visual evidence, which
exacerbates forgetting and hallucination.
We introduce a novel thinking-with-image framework that
equips the reward model with visual reasoning operations (e.g.,
SelectFrame) and a configurable visual memory
window, enabling active visual evidence acquisition during
reasoning. Frame selection allows the model to actively retrieve
previously seen frames and acquire unseen visual evidence as new
inputs to subsequent reasoning rounds. The configurable memory
window retains only the most recently active visual information,
keeping the memory footprint stable while extending both the
reasoning horizon and the total number of frames the model can
process.
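The interplay between frame selection and the memory window can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the class name, method names, and the use of frame indices as stand-ins for visual tokens are all assumptions. The key property it demonstrates is that the in-context visual footprint is capped at the window size while the total number of frames processed keeps growing.

```python
from collections import deque

class VisualMemory:
    """Hypothetical sketch of a configurable visual memory window.

    Only the most recently selected frames remain in context
    (bounded memory footprint), while the model may keep issuing
    SelectFrame-style operations across reasoning rounds.
    """

    def __init__(self, window_size: int):
        # deque with maxlen evicts the oldest frame once the window is full
        self.window = deque(maxlen=window_size)
        self.total_selected = 0  # frames processed over the whole trajectory

    def select_frames(self, frame_ids):
        """SelectFrame-style operation: pull frames (previously seen or
        unseen) into the active context for the next reasoning round."""
        for fid in frame_ids:
            self.window.append(fid)
            self.total_selected += 1

    def context(self):
        """Frames currently visible to the model."""
        return list(self.window)

mem = VisualMemory(window_size=4)
mem.select_frames([0, 8, 16])   # first reasoning round
mem.select_frames([24, 12])     # later round acquires/revisits frames
print(mem.context())            # -> [8, 16, 24, 12]; frame 0 was evicted
print(mem.total_selected)       # -> 5 frames processed in total
```

The point of the `maxlen` eviction is exactly the stability claim above: context cost stays constant per round even as the reasoning horizon, and hence the cumulative frame count, grows.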
Building on this framework, we propose
VR-Thinker, the first multimodal
reward model capable of visual reasoning. Our training pipeline
comprises three stages: (I) Cold Start using curated
visual CoT data, (II) Rejection Sampling Fine-Tuning on
verified high-quality traces, and (III)
GRPO reinforcement learning. A 7B VR-Thinker achieves
accuracies of 80.5% on VideoGen Reward, 82.3% on GenAI-Bench,
and 75.6% on MJ-Bench-Video.
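As a rough illustration of stage (III), GRPO replaces a learned value critic with group-relative advantages: several reasoning traces are sampled per preference example, and each trace's reward is normalized against the group's mean and standard deviation. The helper below is a hypothetical sketch under that standard formulation; the function name and the use of binary verdict rewards are assumptions, not details from this work.

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: z-score each sampled trace's reward
    against its own group (no value network needed)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 traces sampled for one preference pair; reward 1.0 when the
# trace's final verdict matches the ground-truth preference, else 0.0.
advs = grpo_advantages([1.0, 0.0, 1.0, 1.0])
# Correct traces get equal positive advantages, the incorrect trace a
# negative one, and the group's advantages sum to zero.
```

Advantages that are centered within each group make the update reinforce traces that beat their sampled peers, which is what lets verified-correctness rewards train the visual reasoning policy directly.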