Research on GUI Models
• Qwen2.5-VL Technical Report (arXiv:2502.13923)
• Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (arXiv:2404.05719)
• ShowUI: One Vision-Language-Action Model for GUI Visual Agent (arXiv:2411.17465)
• Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
• UI-TARS: Pioneering Automated GUI Interaction with Native Agents (arXiv:2501.12326)
• OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (arXiv:2404.07972)
• OS-ATLAS: A Foundation Action Model for Generalist GUI Agents (arXiv:2410.23218)
• Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale (arXiv:2409.08264)
• ScreenAI: A Vision-Language Model for UI and Infographics Understanding (arXiv:2402.04615)
• CogAgent: A Visual Language Model for GUI Agents (arXiv:2312.08914)
• Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (arXiv:2410.05243)
• SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents (arXiv:2401.10935)
• Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding (arXiv:2210.03347)
• Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms (arXiv:2410.18967)
• ScreenAgent: A Vision Language Model-driven Computer Control Agent (arXiv:2402.07945)
• From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces (arXiv:2306.00245)
• GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents (arXiv:2406.10819)
• ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots (arXiv:2209.08199)