GUI-Thinker has highly adaptive self-reflection capabilities in dynamic GUI environments.
No Docker or virtual machine is required for deployment.
Visit our study WorldGUI on the project page. 🌐
User Query: Disable the 'Battery saver' Notifications
GUI-Thinker:
What's new in GUI-Thinker?
GUI-Thinker is a newly developed GUI agent based on a self-reflection mechanism. We systematically investigate GUI automation and establish the following workflow, incorporating three key self-reflection modules:
- Planner-Critic (Post-Planning Critique): Self-corrects the initial plan to ensure its accuracy.
- Step-Check (Pre-Execution Validation): Removes redundant steps or modifies them if necessary.
- Actor-Critic (Post-Action Evaluation): Reviews the task completion status and applies necessary corrections.
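The sketch below shows one way these three stages could be wired into a single control loop. The `planner`, `planner_critic`, `step_checker`, `actor`, and `actor_critic` objects and their methods are hypothetical placeholders for illustration; they are not the actual GUI-Thinker API.

```python
# Minimal sketch of the GUI-Thinker workflow under hypothetical interfaces.
# Only the step statuses (<Pass>, <Modify>, ...) come from this README; every
# class and method name here is an illustrative placeholder, not the real codebase.

def run_task(query, screenshot, planner, planner_critic, step_checker, actor, actor_critic):
    # Post-Planning Critique: draft an initial plan, then let the critic self-correct it.
    plan = planner.generate(query, screenshot)
    plan = planner_critic.refine(plan, query, screenshot)

    for step in plan:
        # Pre-Execution Validation: skip redundant steps or modify them before acting.
        verdict = step_checker.check(step, screenshot)
        if verdict.status == "<Pass>":        # already satisfied, skip
            continue
        if verdict.status == "<Modify>":      # rewrite the step before execution
            step = verdict.revised_step

        # Post-Action Evaluation: execute, compare before/after screenshots,
        # and iteratively correct the action until the critic accepts it.
        before = screenshot
        screenshot = actor.execute(step)
        while not actor_critic.accepts(step, before, screenshot):
            step = actor_critic.correct(step, before, screenshot)
            screenshot = actor.execute(step)
    return screenshot
```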
Figures: overall framework of GUI-Thinker; the State-Aware Planner and Planner-Critic modules; the Step-Check module; the Actor-Critic module; and a comparison of various agents on the WorldGUI Benchmark (meta task).
- [2025.03.11] ⚡ We are excited to introduce a fast version of GUI-Thinker powered by Claude-3.5-Sonnet and Claude-3.7-Sonnet as base models. In this release, the Claude models serve as the Actor without relying on the GUI Parser, which delivers impressive speed. Try it with `test_guithinker_fast.py`.
- [2025.03.08] We made a demo showing GUI-Thinker.
- [2025.03.05] ⚡ GUI-Thinker now supports both instructional-video and non-video inputs. Enjoy!
- [2025.03.05] 😊 We release the code of GUI-Thinker. You can now run our GUI agent locally on your Windows computer (see Getting Started). GUI-Thinker supports various base LMMs through API calls, including GPT-4o, Gemini-2.0, and Claude-3.5-Sonnet. Local model support will be available soon.
- [2025.02.13] We release WorldGUI on arXiv.
- 🏆 High Performance: Our GUI-Thinker surpasses Claude-3.5 Computer Use by 14.9% on our WorldGUI Benchmark.
- 🌐 Universal LMM Support: Seamlessly integrates with a wide range of LMMs (e.g., OpenAI, Anthropic, Gemini).
- 🔀 Flexible Interaction: Supports both instructional-video input and non-video input.
- 🚀 Easy Deployment: Get started instantly with a simple `.\shells\start_server.bat` command followed by `python test_guithinker_custom.py`, with no need for Docker or a virtual machine.
Our codebase includes:
- GUI Parser: Utilizes Google OCR and PyAutoGUI to extract element grounding information.
- State-Aware Planner: Accepts screenshots and instructional videos to generate plans.
- Planner-Critic: Refines the initial plan generated by the planner.
- Step-Check: Verifies task completion and redundancy using various output statuses (e.g., `<Modify>`, `<Pass>`, `<Continue>`, `<Finished>`). It also implements an LLM-driven region search module to locate target elements.
- Actor: Translates action descriptions into executable code (e.g., `click(100, 200)`); it can be any API model or locally running model (see the sketch after this list).
- Actor-Critic: Checks task completion status by comparing before-and-after screenshots and uses an iterative action-correction algorithm to gradually verify and correct actions.
- Input with Instructional Video: Supports execution guided by an instructional video.
- Input without Instructional Video: Supports direct execution from a user query.
- Frontend-backend communication system: Separates the frontend and backend for flexible deployment of locally running models and user interfaces.
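As noted in the Actor item above, the sketch below illustrates how a short action description could be translated into executable PyAutoGUI calls. The `execute_action` helper and the action-string format it expects are assumptions for illustration, not the actual Actor implementation.

```python
# Hypothetical Actor step: the base model emits a short action string such as
# "click(100, 200)" or "type('battery saver')", which is parsed and dispatched
# to PyAutoGUI. Illustrative only; not the real GUI-Thinker Actor.
import ast
import pyautogui

def execute_action(action: str) -> None:
    """Parse one action string and run the corresponding PyAutoGUI call."""
    expr = ast.parse(action, mode="eval").body
    if not isinstance(expr, ast.Call) or not isinstance(expr.func, ast.Name):
        raise ValueError(f"Unrecognized action: {action!r}")
    name = expr.func.id
    args = [ast.literal_eval(arg) for arg in expr.args]

    if name == "click":            # click(x, y)
        pyautogui.click(*args)
    elif name == "type":           # type(text)
        pyautogui.typewrite(args[0], interval=0.05)
    elif name == "press":          # press(key), e.g. press('enter')
        pyautogui.press(args[0])
    else:
        raise ValueError(f"Unsupported action: {name}")

if __name__ == "__main__":
    execute_action("click(100, 200)")   # moves the mouse and clicks at (100, 200)
```

In the full framework, the Actor-Critic would then compare screenshots taken before and after such a call to decide whether a corrected action is needed.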
See our paper for details. GUI-Thinker is released alongside WorldGUI, a newly curated desktop GUI benchmark.
GUI-Thinker is continuously evolving! Here's what's coming:
- ⚡ Fast Version: Supporting a fast version equipped with Anthropic Computer Use, without the GUI Parser.
- 👓 OOTB Usage: Supporting a user-friendly interface based on Gradio.
- 📊 Locally Running Models: Supporting ShowUI or UI-TARS as the Actor in our framework.
- 🎨 Hugging Face Demo: Developing an online demo on Hugging Face.
Feel free to open issues or submit pull requests if you have suggestions. Our project is actively maintained, with new features and bug fixes released regularly. 🚀
Demo Video (The video has been sped up):
demovideo1.mp4
See the 1080p version at https://www.youtube.com/watch?v=RoJ-cbjfZmg
See Getting Started for running GUI-Thinker on your local computer.
- Special thanks to Difei Gao for his hard work on developing the codebase.
- We express our great thanks to Kaiming Yang, Mingyi Yan, and Wendi Yu for their hard work on data annotation and baseline testing.
- OOTB (Computer Use): Computer Use OOTB is an out-of-the-box (OOTB) solution for desktop GUI agents, including API-based models (Claude 3.5 Computer Use) and locally running models (ShowUI, UI-TARS).
- ShowUI: An open-source, end-to-end, lightweight vision-language-action model for GUI agents and computer use.
- AssistGUI: The first work that focuses on desktop productivity software usage, with over 100 realistic GUI tasks.
- VideoGUI: A benchmark for GUI automation from instructional videos. Can a GUI agent behave like a human when given an image-style effect and a user query?
- SWE-bench Multimodal: A dataset for evaluating AI systems on visual software engineering tasks.
If you find WorldGUI useful, please cite using this BibTeX:
@misc{zhao2025worldguidynamictestingcomprehensive,
title={WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation},
author={Henry Hengyuan Zhao and Difei Gao and Mike Zheng Shou},
year={2025},
eprint={2502.08047},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2502.08047},
}
If you have any questions or suggestions, please don't hesitate to let us know. You can directly email Henry Hengyuan Zhao at NUS using the email address [email protected], or post an issue on this repository. We welcome contributions. Feel free to submit pull requests if you have suggestions for improvement.