GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior

S-Lab, Nanyang Technological University; SenseTime Research

Abstract

Multimodal Large Language Models (MLLMs) have shown great potential in revolutionizing Graphical User Interface (GUI) automation. However, existing GUI models mostly rely on learning from nearly error-free offline trajectories and thus lack reflection and error-recovery capabilities. To bridge this gap, we propose GUI-Reflection, a novel framework that explicitly integrates self-reflection and error-correction capabilities into end-to-end multimodal GUI models across dedicated training stages: GUI-specific pre-training, offline supervised fine-tuning (SFT), and online reflection tuning. GUI-Reflection enables the emergence of self-reflection behavior with fully automated data generation and learning processes, without requiring any human annotation. Specifically, 1) we first propose scalable data pipelines to automatically construct reflection and error-correction data from existing successful trajectories; whereas existing GUI models focus mainly on grounding and UI understanding, we propose the GUI-Reflection Task Suite to explicitly learn and evaluate reflection-oriented abilities. 2) Furthermore, we build a diverse and efficient environment for online training and data collection of GUI models on mobile devices. 3) We also present an iterative online reflection tuning algorithm that leverages the proposed environment, enabling the model to continuously enhance its reflection and error-correction abilities. Our framework equips GUI agents with self-reflection and correction capabilities, paving the way for more robust, adaptable, and intelligent GUI automation.

GUI-Reflection Framework

We introduce GUI-Reflection, an automatic framework designed to explicitly integrate self-reflection and error-correction capabilities into end-to-end multimodal GUI models across different training stages. The GUI-Reflection framework includes (1) learning basic reflection-oriented skills from the GUI-Reflection Task Suite in the GUI pre-training stage; (2) learning reflection and correction behaviors from automatically generated error scenarios in the offline SFT stage; and (3) continuously enhancing reflection and correction capabilities via reflection tuning in the online learning stage.

Reflection-oriented Abilities: GUI-Reflection Task Suite

In the current paradigm, the GUI pre-training stage mainly targets enhancing the GUI perception capability of the base MLLM and injecting GUI-related knowledge into it. While GUI grounding and understanding are crucial for basic GUI interactions, it is equally important to maintain or enhance the model's nascent abilities for self-reflection and error recognition within the GUI context. We therefore decompose reflection and correction behaviors into smaller reflection-oriented atomic capabilities and design the GUI-Reflection Task Suite to learn and evaluate these capabilities explicitly.
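To make this concrete, below is a minimal Python sketch of how a single sample in such a task suite could be represented. The class and field names (ReflectionTaskSample, capability, etc.) are our own illustrative assumptions, not the paper's released schema.

    from dataclasses import dataclass

    # Hypothetical sample format for one reflection-oriented atomic task;
    # the actual schema of the GUI-Reflection Task Suite may differ.
    @dataclass
    class ReflectionTaskSample:
        capability: str         # e.g. "action_verification" or "mistake_detection"
        goal: str               # natural-language task goal
        screenshot_before: str  # path to the screenshot before the action
        screenshot_after: str   # path to the screenshot after the action
        action: dict            # executed action, e.g. {"type": "tap", "x": 540, "y": 880}
        question: str           # reflection prompt posed to the model
        answer: str             # expected judgement

    sample = ReflectionTaskSample(
        capability="action_verification",
        goal="Turn on Wi-Fi in the Settings app",
        screenshot_before="step_3_before.png",
        screenshot_after="step_3_after.png",
        action={"type": "tap", "x": 540, "y": 880},
        question="Did the last action achieve its intended effect?",
        answer="No, the Wi-Fi toggle state is unchanged in the new screenshot.",
    )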

Reflection Behavior in Offline SFT

During the SFT stage, the GUI model is trained on offline GUI interaction trajectories that are mostly error-free. Under such a training approach, the ability to recognize possible mistakes from execution results, and the ability to recover or learn from them, are greatly limited. We therefore design a scalable automatic data pipeline that creates realistic reflection and correction data from existing successful trajectories via two approaches.

In the first approach, we modify the original task goal so that an originally correct action becomes incorrect. The modified goal is constructed so that the now-incorrect action appears as an easy or natural mistake that a user unfamiliar with the app, its buttons, or certain operations might make. A reflection step is then constructed after the incorrect action.
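A minimal sketch of this goal-modification pipeline follows, assuming a generic text-generation interface (the llm argument) and trajectory fields (goal, steps, action) that are our own hypothetical names rather than the paper's code:

    # Hypothetical sketch of the goal-modification approach; prompt wording,
    # helper names, and the LLM interface are assumptions.
    def build_goal_modification_sample(trajectory, step_idx, llm):
        """Turn a correct step of a successful trajectory into a plausible mistake."""
        step = trajectory.steps[step_idx]
        # 1. Ask an LLM to rewrite the task goal so that the originally correct
        #    action at this step becomes a natural, easy-to-make mistake.
        modified_goal = llm.generate(
            f"Original goal: {trajectory.goal}\n"
            f"Action taken: {step.action}\n"
            "Rewrite the goal so this action is a plausible but incorrect step."
        )
        # 2. Construct the reflection step that follows the now-incorrect action:
        #    it should recognize the mistake and propose how to recover.
        reflection = llm.generate(
            f"Goal: {modified_goal}\nIncorrect action: {step.action}\n"
            "Write a reflection that recognizes the mistake and proposes a correction."
        )
        return {
            "goal": modified_goal,
            "history": trajectory.steps[:step_idx],
            "incorrect_action": step.action,
            "reflection": reflection,
        }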

In the second approach, we insert an ineffective incorrect action, i.e., one that should not change the screenshot, before a correct action, and augment the original correct action with reflection content about the inserted ineffective action.
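A corresponding sketch of this insertion approach, again with hypothetical trajectory fields (steps, thought, action) and a caller-supplied pool of screen-preserving candidate actions:

    import random

    # Hypothetical sketch of the ineffective-action insertion approach;
    # the sampling strategy and field names are assumptions.
    def build_insertion_sample(trajectory, step_idx, candidate_actions):
        """Insert an ineffective wrong action before a correct step and add reflection."""
        correct_step = trajectory.steps[step_idx]
        # 1. Pick a wrong action that leaves the screen unchanged (e.g. tapping
        #    a non-interactive region), so the original screenshot stays valid.
        ineffective = random.choice(candidate_actions)
        # 2. Keep the correct action, but augment its thought with reflection
        #    content about the ineffective action that preceded it.
        reflected_thought = (
            f"The previous action ({ineffective}) had no effect on the screen, "
            f"so it was likely wrong. {correct_step.thought}"
        )
        return {
            "goal": trajectory.goal,
            "history": trajectory.steps[:step_idx] + [ineffective],
            "action": correct_step.action,
            "thought": reflected_thought,
        }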

Iterative Online Reflection Tuning

We develop a specialized environment for efficient online learning, testing, and data collection of mobile GUI agents. We then design an iterative reflection tuning algorithm that lets the GUI model trained with offline SFT further improve its general and reflection capabilities by interacting with this online environment.
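At a high level, the iterative loop can be sketched as follows; the env, model, and trainer interfaces are hypothetical stand-ins for our environment and training code, not the released implementation:

    # Minimal sketch of iterative online reflection tuning. In each round the
    # current model is rolled out in the environment, verified trajectories
    # (including ones where the model recovers from its own mistakes) are
    # collected, and the model is fine-tuned on them before the next round.
    def online_reflection_tuning(model, env, trainer, tasks, num_rounds=3):
        for _ in range(num_rounds):
            collected = []
            for task in tasks:
                obs = env.reset(task)
                trajectory, done = [], False
                while not done:
                    action = model.act(task, obs, trajectory)
                    obs, done = env.step(action)
                    trajectory.append((action, obs))
                # Keep only trajectories whose outcome the environment verifies;
                # recovered-from-error trajectories supply reflection supervision.
                if env.verify(task):
                    collected.append((task, trajectory))
            model = trainer.finetune(model, collected)
        return model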


Experiments

By evaluating on our GUI-Reflection Task Suite, we find that large-scale general-purpose MLLMs possess some inherent reflection capabilities in the GUI context, while such capabilities remain very limited in smaller-scale models, and standard GUI pre-training tends to further diminish them. However, incorporating training data from our reflection-oriented tasks during the pre-training phase effectively improves these essential capabilities.


By conducting evaluations in our proposed environment, we observe that incorporating reflection data during the offline SFT stage significantly boosts performance. Applying our iterative online reflection tuning algorithm further increases the success rate, demonstrating the benefit of explicitly training for reflection at multiple stages.

To evaluate our model on more general and comprehensive tasks, we combine the training data collected in the online training stage with a similar-sized subset of the original offline data and fine-tune the offline SFT model, injecting valuable reflection experiences while maintaining generalization ability. We evaluate our model on the AndroidWorld benchmark, where it achieves a competitive success rate of 34.5% among end-to-end models, demonstrating the effectiveness of our proposed framework.
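The data-mixing step can be sketched as below; datasets are plain Python lists here, and the "similar-sized" rule is our reading of the text rather than the exact recipe:

    import random

    # Hypothetical sketch: combine online reflection data with a similar-sized
    # random subset of the original offline SFT data for the final fine-tune.
    def mix_for_final_finetune(online_data, offline_data, seed=0):
        rng = random.Random(seed)
        k = min(len(online_data), len(offline_data))
        offline_subset = rng.sample(offline_data, k)
        mixed = online_data + offline_subset
        rng.shuffle(mixed)
        return mixed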


BibTeX


  @article{GUI_Reflection,
    author  = {Wu, Penghao and Ma, Shengnan and Wang, Bo and Yu, Jiaheng and Lu, Lewei and Liu, Ziwei},
    title   = {GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior},
    journal = {arXiv preprint arXiv:2506.08012},
    year    = {2025}
  }