Introduction
OmniParser is an open-source screen parsing tool from Microsoft for pure vision-based GUI agents. It converts interface screenshots into structured elements, giving LLMs such as GPT-4V a representation of the screen they can use for desktop automation and control.
Key Features
- Parses any UI screenshot into structured, actionable elements
- Integrates with OmniTool for multi-agent orchestration and automation
- Supports mainstream LLMs (OpenAI, DeepSeek, Qwen, Anthropic, etc.)
- Provides models on HuggingFace and an online Gradio demo
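To make "structured, actionable elements" concrete, here is a minimal sketch of what a downstream agent might do with parsed output. The `ParsedElement` schema, its field names, and the helpers `click_point` and `to_prompt` are hypothetical illustrations, not OmniParser's documented output format or API; the sketch also assumes bounding boxes are normalized to the range 0 to 1.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ParsedElement:
    """One UI element recovered from a screenshot (hypothetical schema)."""
    kind: str                                # e.g. "text" or "icon"
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed normalized to [0, 1]
    interactable: bool                       # whether an agent could click or type here
    content: str                             # recognized text or a short description


def click_point(el: ParsedElement, width: int, height: int) -> Tuple[int, int]:
    """Return the pixel center of an element's box for a screenshot of the given size."""
    x1, y1, x2, y2 = el.bbox
    return (round((x1 + x2) / 2 * width), round((y1 + y2) / 2 * height))


def to_prompt(elements: List[ParsedElement]) -> str:
    """List interactable elements as numbered lines an LLM can refer to by ID."""
    lines = [
        f"[{i}] {el.kind}: {el.content}"
        for i, el in enumerate(elements)
        if el.interactable
    ]
    return "\n".join(lines)


# Made-up example data standing in for the result of parsing one screenshot.
elements = [
    ParsedElement("text", (0.10, 0.05, 0.30, 0.10), False, "Settings"),
    ParsedElement("icon", (0.80, 0.90, 0.90, 0.96), True, "Save button"),
]

print(to_prompt(elements))
print(click_point(elements[1], 1920, 1080))
```

The idea is that the LLM only sees a compact, numbered list of interactable elements and answers with an ID, while the surrounding code maps that ID back to pixel coordinates for the actual click.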
Use Cases
- Desktop agent automation
- Screen element recognition and interactive region detection
- Building training data pipelines for AI Agent applications
Technical Highlights
OmniParser is built on vision models, supports plugin extensions, and is designed to integrate into existing systems. Model weights are released under AGPL or MIT licenses, depending on the model, and the open-source codebase can be customized. See the technical report on arXiv.
- Project page: OmniParser Project Page
- Online demo: HuggingFace Space Demo