Visual Jigsaw Post-Training Improves MLLMs

1S-Lab, Nanyang Technological University, 2Linköping University, 3SenseTime Research  

Abstract

Reinforcement-learning-based post-training has recently emerged as a powerful paradigm for enhancing the alignment and reasoning capabilities of multimodal large language models (MLLMs). While vision-centric post-training is crucial for enhancing MLLMs' intrinsic understanding of visual signals, current post-training paradigms are predominantly text-centric, where dense visual inputs are only leveraged to extract sparse cues for text-based reasoning. A few approaches explore this direction; however, they often still rely on text as an intermediate mediator or introduce additional visual generative designs. In this work, we introduce Visual Jigsaw, a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This naturally aligns with reinforcement learning from verifiable rewards (RLVR), requires no additional visual generative components, and derives its supervisory signal automatically without any annotations. We instantiate Visual Jigsaw across three visual modalities: images, videos, and 3D data. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding. Our findings highlight the potential of self-supervised vision-centric tasks in post-training MLLMs and aim to inspire further research on vision-centric pretext designs.

Try Visual Jigsaw

Drag tiles from the Palette to the Grid. Click Check to score; Reset to restart. You can also upload your own image.

Methodology Overview

The Visual Jigsaw framework is formulated as a general visual ordering problem. Given data from a visual modality (image, video, or 3D), we derive a set of K jigsaw elements by applying a modality-specific partitioning rule, such as splitting an image into patches, segmenting a video into clips, or sampling points in a 3D scene. These elements are then shuffled, and the model is tasked with predicting their original structural arrangement. Formally, the model predicts a permutation of size K as a list of indices, which is then compared against the ground-truth permutation. We optimize this task using the GRPO algorithm.
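To make the formulation concrete, below is a minimal sketch of how an image jigsaw sample could be constructed. The helper name and the 2x2 grid are illustrative assumptions, not the paper's released code: the image is split into K patches in raster order, the patches are shuffled, and the ground-truth permutation that restores the raster order is recorded for verification.

```python
# Minimal sketch (illustrative, not the authors' code): building an Image Jigsaw sample.
import random
from PIL import Image

def make_image_jigsaw(img: Image.Image, grid: int = 2):
    """Partition `img` into a grid x grid set of patches, shuffle them, and
    return the shuffled patches together with the ground-truth permutation:
    for each raster position, the index of the shuffled patch that belongs there."""
    w, h = img.size
    pw, ph = w // grid, h // grid
    # Patches enumerated in raster order (left-to-right, top-to-bottom).
    patches = [
        img.crop((c * pw, r * ph, (c + 1) * pw, (r + 1) * ph))
        for r in range(grid) for c in range(grid)
    ]
    order = list(range(len(patches)))
    random.shuffle(order)                       # shuffled position -> original index
    shuffled = [patches[i] for i in order]
    # Ground truth the model must predict: where each original patch ended up,
    # so that [shuffled[gt[i]] for i in range(K)] recovers the raster order.
    gt = [order.index(i) for i in range(len(patches))]
    return shuffled, gt
```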
In the Image Jigsaw task, an image is partitioned into non-overlapping patches, shuffled into a sequence, and the model is tasked with predicting the correct raster order. In the Video Jigsaw task, a video is segmented into temporal clips, shuffled, and the model predicts their original chronological order. In the 3D Jigsaw task, points with distinct depth values are sampled from an RGB-D image, shuffled and annotated in the RGB view, and the model is required to recover the correct depth order from nearest to farthest. Across all tasks, the policy model outputs an ordering that is compared against the ground truth, and a partial accuracy reward is assigned when only some elements are correctly ordered, as sketched below.
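As one plausible instantiation of this verifiable reward for RLVR (the exact reward shaping in the paper may differ), the sketch below scores a predicted permutation by the fraction of positions that match the ground truth and returns zero for malformed outputs.

```python
# Minimal sketch (assumed reward shaping, not necessarily the paper's exact formula).
def jigsaw_reward(pred: list[int], gt: list[int]) -> float:
    """Fraction of positions where the predicted index matches the ground-truth
    permutation; predictions that are not valid permutations get zero reward."""
    if len(pred) != len(gt) or sorted(pred) != list(range(len(gt))):
        return 0.0
    correct = sum(p == g for p, g in zip(pred, gt))
    return correct / len(gt)

# Example: ground truth [2, 0, 3, 1], prediction [2, 0, 1, 3] -> reward 0.5
print(jigsaw_reward([2, 0, 1, 3], [2, 0, 3, 1]))
```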

Experiments

Image Jigsaw

Image Jigsaw consistently improves MLLMs across three categories of vision-centric benchmarks: 1) fine-grained perception and understanding, 2) monocular spatial understanding, and 3) compositional visual understanding. These results confirm that incorporating image jigsaw post-training significantly enhances MLLMs' perceptual grounding and fine-grained visual understanding beyond reasoning-centric post-training strategies. We attribute these improvements to the fact that solving the image jigsaw requires the model to attend to local patch details, infer global spatial layouts, and reason about inter-patch relations, which directly benefits fine-grained, spatial, and compositional understanding.

Video Jigsaw

Video Jigsaw brings consistent improvements across all video understanding benchmarks and frame settings. While our method enhances general video perception and comprehension, the gains are particularly pronounced on tasks requiring temporal-centric understanding and reasoning about temporal directionality (e.g., AoTBench). Furthermore, the strong gains on CVBench demonstrate improved cross-video understanding and reasoning. These results confirm that solving video jigsaw tasks encourages the model to better capture temporal continuity, understand relationships across videos, reason about directional consistency, and transfer to holistic video understanding scenarios.

3D Jigsaw

3D Jigsaw achieves significant improvements across all benchmarks. Unsurprisingly, the largest gain is on DA-2K, a depth estimation benchmark that is directly related to our depth-ordering post-training task. More importantly, we observe consistent improvements on a wide range of other tasks, including those with single-view inputs (e.g., 3DSRBench, OmniSpatial), multi-view inputs (e.g., ViewSpatial, All-Angles), and egocentric video inputs (e.g., VSI-Bench). These results demonstrate that our approach not only teaches the specific skill of depth ordering but also effectively strengthens the model's general ability to perceive and reason about 3D spatial structure.

BibTeX

@article{Visual_Jigsaw,
  author  = {Wu, Penghao and Zhang, Yushan and Diao, Haiwen and Li, Bo and Lu, Lewei and Liu, Ziwei},
  title   = {Visual Jigsaw Post-Training Improves MLLMs},
  journal = {arXiv preprint arXiv:2509},
  year    = {2025}
}