GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior

S-Lab, Nanyang Technological University; SenseTime Research

Abstract

Multimodal Large Language Models (MLLMs) have shown great potential in revolutionizing Graphical User Interface (GUI) automation. However, existing GUI models mostly rely on learning from nearly error-free offline trajectories and thus lack reflection and error-recovery capabilities. To bridge this gap, we propose GUI-Reflection, a novel framework that explicitly integrates self-reflection and error-correction capabilities into end-to-end multimodal GUI models across dedicated training stages: GUI-specific pre-training, offline supervised fine-tuning (SFT), and online reflection tuning. GUI-Reflection enables the emergence of self-reflection behavior with fully automated data generation and learning processes, without requiring any human annotation. Specifically, 1) we first propose scalable data pipelines to automatically construct reflection and error-correction data from existing successful trajectories; whereas existing GUI models focus mainly on grounding and UI understanding, we propose the GUI-Reflection Task Suite to explicitly learn and evaluate reflection-oriented abilities. 2) Furthermore, we build a diverse and efficient environment for online training and data collection of GUI models on mobile devices. 3) We also present an iterative online reflection tuning algorithm that leverages the proposed environment, enabling the model to continuously enhance its reflection and error-correction abilities. Our framework equips GUI agents with self-reflection and correction capabilities, paving the way for more robust, adaptable, and intelligent GUI automation.

GUI-Reflection Framework

We introduce GUI-Reflection, an automatic framework designed to explicitly integrate self-reflection and error-correction capabilities into end-to-end multimodal GUI models across different training stages. The GUI-Reflection framework includes (1) learning basic reflection-oriented skills from the GUI-Reflection Task Suite in the GUI pre-training stage; (2) learning reflection and correction behaviors from automatically generated error scenarios in the offline SFT stage; and (3) continuously enhancing reflection and correction capabilities via reflection tuning in the online learning stage.

Reflection-oriented Abilities: GUI-Reflection Task Suite

In the current paradigm, the GUI pre-training stage mainly targets enhancing the GUI perception capability of the base MLLM and injecting GUI-related knowledge into it. While GUI grounding and understanding are crucial for basic GUI interactions, it is equally important to maintain or enhance the model's nascent abilities for self-reflection and error recognition within the GUI context. We therefore decompose reflection and correction behaviors into smaller reflection-oriented atomic capabilities and design the GUI-Reflection Task Suite to learn and evaluate these capabilities explicitly.
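To make this concrete, below is a minimal Python sketch of how a single sample in such a task suite could be represented. The class and field names (ReflectionTaskSample, capability, etc.) are our own illustrative assumptions, not the paper's released schema.

    from dataclasses import dataclass

    # Hypothetical sample format for one reflection-oriented atomic task;
    # the actual schema of the GUI-Reflection Task Suite may differ.
    @dataclass
    class ReflectionTaskSample:
        capability: str         # e.g. "action_verification" or "mistake_detection"
        goal: str               # natural-language task goal
        screenshot_before: str  # path to the screenshot before the action
        screenshot_after: str   # path to the screenshot after the action
        action: dict            # executed action, e.g. {"type": "tap", "x": 540, "y": 880}
        question: str           # reflection prompt posed to the model
        answer: str             # expected judgement

    sample = ReflectionTaskSample(
        capability="action_verification",
        goal="Turn on Wi-Fi in the Settings app",
        screenshot_before="step_3_before.png",
        screenshot_after="step_3_after.png",
        action={"type": "tap", "x": 540, "y": 880},
        question="Did the last action achieve its intended effect?",
        answer="No, the Wi-Fi toggle state is unchanged in the new screenshot.",
    )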

Reflection Behavior in Offline SFT

During the SFT stage, the GUI model is trained on offline GUI interaction trajectories that are mostly error-free. Under such a training approach, the ability to recognize possible mistakes from execution results, and the ability to recover or learn from them, are greatly limited. We therefore design a scalable automatic data pipeline that creates realistic reflection and correction data from existing successful trajectories via two approaches.

In the first approach, we modify the original task goal so that an originally correct action becomes incorrect. The modified goal is constructed so that the now-incorrect action appears as an easy or natural mistake that a user unfamiliar with the app, its buttons, or certain operations might make. A reflection step is then constructed after the incorrect action.
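A minimal sketch of this goal-modification pipeline follows, assuming a generic text-generation interface (the llm argument) and trajectory fields (goal, steps, action) that are our own hypothetical names rather than the paper's code:

    # Hypothetical sketch of the goal-modification approach; prompt wording,
    # helper names, and the LLM interface are assumptions.
    def build_goal_modification_sample(trajectory, step_idx, llm):
        """Turn a correct step of a successful trajectory into a plausible mistake."""
        step = trajectory.steps[step_idx]
        # 1. Ask an LLM to rewrite the task goal so that the originally correct
        #    action at this step becomes a natural, easy-to-make mistake.
        modified_goal = llm.generate(
            f"Original goal: {trajectory.goal}\n"
            f"Action taken: {step.action}\n"
            "Rewrite the goal so this action is a plausible but incorrect step."
        )
        # 2. Construct the reflection step that follows the now-incorrect action:
        #    it should recognize the mistake and propose how to recover.
        reflection = llm.generate(
            f"Goal: {modified_goal}\nIncorrect action: {step.action}\n"
            "Write a reflection that recognizes the mistake and proposes a correction."
        )
        return {
            "goal": modified_goal,
            "history": trajectory.steps[:step_idx],
            "incorrect_action": step.action,
            "reflection": reflection,
        }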

In the second approach, we insert an ineffective incorrect action, i.e., one that should not change the screenshot, before a correct action, and augment the original correct action with reflection content about the inserted ineffective action.
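A corresponding sketch of this insertion approach, again with hypothetical trajectory fields (steps, thought, action) and a caller-supplied pool of screen-preserving candidate actions:

    import random

    # Hypothetical sketch of the ineffective-action insertion approach;
    # the sampling strategy and field names are assumptions.
    def build_insertion_sample(trajectory, step_idx, candidate_actions):
        """Insert an ineffective wrong action before a correct step and add reflection."""
        correct_step = trajectory.steps[step_idx]
        # 1. Pick a wrong action that leaves the screen unchanged (e.g. tapping
        #    a non-interactive region), so the original screenshot stays valid.
        ineffective = random.choice(candidate_actions)
        # 2. Keep the correct action, but augment its thought with reflection
        #    content about the ineffective action that preceded it.
        reflected_thought = (
            f"The previous action ({ineffective}) had no effect on the screen, "
            f"so it was likely wrong. {correct_step.thought}"
        )
        return {
            "goal": trajectory.goal,
            "history": trajectory.steps[:step_idx] + [ineffective],
            "action": correct_step.action,
            "thought": reflected_thought,
        }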

Iterative Online Reflection Tuning

We develop a specialized environment for efficient online learning, testing, and data collection of mobile GUI agents. We then design an iterative reflection tuning algorithm that lets the GUI model trained with offline SFT further improve its general and reflection capabilities by interacting with this online environment.
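At a high level, the iterative loop can be sketched as follows; the env, model, and trainer interfaces are hypothetical stand-ins for our environment and training code, not the released implementation:

    # Minimal sketch of iterative online reflection tuning. In each round the
    # current model is rolled out in the environment, verified trajectories
    # (including ones where the model recovers from its own mistakes) are
    # collected, and the model is fine-tuned on them before the next round.
    def online_reflection_tuning(model, env, trainer, tasks, num_rounds=3):
        for _ in range(num_rounds):
            collected = []
            for task in tasks:
                obs = env.reset(task)
                trajectory, done = [], False
                while not done:
                    action = model.act(task, obs, trajectory)
                    obs, done = env.step(action)
                    trajectory.append((action, obs))
                # Keep only trajectories whose outcome the environment verifies;
                # recovered-from-error trajectories supply reflection supervision.
                if env.verify(task):
                    collected.append((task, trajectory))
            model = trainer.finetune(model, collected)
        return model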


Experiments

By evaluating on our GUI-Reflection Task Suite, we find that large-scale general-purpose MLLMs possess some inherent reflection capabilities in the GUI context, while such capabilities remain very limited in smaller-scale models, and standard GUI pre-training tends to further diminish them. However, incorporating training data from our reflection-oriented tasks during the pre-training phase effectively improves these essential capabilities.


By conducting evaluations in our proposed environment, we observe that incorporating reflection data during the offline SFT stage significantly boosts performance. Applying our iterative online reflection tuning algorithm further increases the success rate, demonstrating the benefit of explicitly training for reflection at multiple stages.

To evaluate our model on more general and comprehensive tasks, we combine the training data collected in the online training stage with a similar-sized subset of the original offline data and fine-tune the offline SFT model, injecting valuable reflection experiences while maintaining generalization ability. We evaluate our model on the AndroidWorld benchmark, where it achieves a competitive success rate of 34.5% among end-to-end models, demonstrating the effectiveness of our proposed framework.
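The data-mixing step can be sketched as below; datasets are plain Python lists here, and the "similar-sized" rule is our reading of the text rather than the exact recipe:

    import random

    # Hypothetical sketch: combine online reflection data with a similar-sized
    # random subset of the original offline SFT data for the final fine-tune.
    def mix_for_final_finetune(online_data, offline_data, seed=0):
        rng = random.Random(seed)
        k = min(len(online_data), len(offline_data))
        offline_subset = rng.sample(offline_data, k)
        mixed = online_data + offline_subset
        rng.shuffle(mixed)
        return mixed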


BibTeX


  @article{GUI_Reflection,
    author  = {Wu, Penghao and Ma, Shengnan and Wang, Bo and Yu, Jiaheng and Lu, Lewei and Liu, Ziwei},
    title   = {GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior},
    journal = {arXiv preprint arXiv:2506.08012},
    year    = {2025}
  }