Research on GUI Models
• Qwen2.5-VL Technical Report (arXiv:2502.13923)
• Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (arXiv:2404.05719)
• ShowUI: One Vision-Language-Action Model for GUI Visual Agent (arXiv:2411.17465)
• Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
• UI-TARS: Pioneering Automated GUI Interaction with Native Agents (arXiv:2501.12326)
• OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (arXiv:2404.07972)
• OS-ATLAS: A Foundation Action Model for Generalist GUI Agents (arXiv:2410.23218)
• Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale (arXiv:2409.08264)
• ScreenAI: A Vision-Language Model for UI and Infographics Understanding (arXiv:2402.04615)
• CogAgent: A Visual Language Model for GUI Agents (arXiv:2312.08914)
• Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (arXiv:2410.05243)
• SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents (arXiv:2401.10935)
• Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding (arXiv:2210.03347)
• Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms (arXiv:2410.18967)
• ScreenAgent: A Vision Language Model-driven Computer Control Agent (arXiv:2402.07945)
• From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces (arXiv:2306.00245)
• GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents (arXiv:2406.10819)
• ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots (arXiv:2209.08199)