OmniParser

OmniParser is an open-source, purely vision-based screen parsing tool from Microsoft for GUI agents. It converts interface screenshots into structured elements so that LLMs and agents can take precise desktop actions.

Microsoft · Since 2024-09-20

Introduction

OmniParser is an open-source, purely vision-based screen parsing tool from Microsoft for GUI agents. It converts interface screenshots into structured elements, which helps LLMs (e.g., GPT-4V) drive desktop automation and control.

Key Features

Parses any UI screenshot into structured, actionable elements
Integrates with OmniTool for multi-agent orchestration and automation
Supports mainstream LLMs (OpenAI, DeepSeek, Qwen, Anthropic, etc.)
Provides HuggingFace models and Gradio online demo
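
To make "structured, actionable elements" concrete, here is a minimal Python sketch of how an agent might consume parsed output. It is not OmniParser's actual API or schema: the `ParsedElement` record, its fields, the normalized bounding box convention, and the helper functions are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ParsedElement:
    """Hypothetical record for one detected UI element (assumed schema)."""
    kind: str                                # e.g. "text" or "icon"
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed normalized to [0, 1]
    interactable: bool                       # whether the element can be clicked
    label: str                               # short description of the element


def click_point(el: ParsedElement, width: int, height: int) -> Tuple[int, int]:
    """Return the pixel center of the element's box for a given screenshot size."""
    x1, y1, x2, y2 = el.bbox
    return int((x1 + x2) / 2 * width), int((y1 + y2) / 2 * height)


def find_interactable(elements: List[ParsedElement], query: str) -> Optional[ParsedElement]:
    """Return the first interactable element whose label contains the query."""
    q = query.lower()
    for el in elements:
        if el.interactable and q in el.label.lower():
            return el
    return None


# Hand-written sample data standing in for a parsed screenshot.
elements = [
    ParsedElement("text", (0.125, 0.0625, 0.375, 0.125), False, "Settings"),
    ParsedElement("icon", (0.75, 0.875, 0.875, 0.9375), True, "Save button"),
]

target = find_interactable(elements, "save")
if target is not None:
    print(click_point(target, width=1920, height=1080))
```

In a setup like this, an LLM would receive the element list as text, choose an element by label, and the agent would translate that choice into a click at the computed coordinates.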

Use Cases

Desktop agent automation
Screen element recognition and interactive region detection
Building training data pipelines for AI Agent applications

Technical Highlights

OmniParser is built on high-performance vision models, supports plugin extensions, and is easy to integrate into existing systems. Model weights are released under AGPL or MIT licenses, depending on the checkpoint, and the open-source codebase can be customized. See the technical report on arXiv.

Project page: OmniParser Project Page
Online demo: HuggingFace Space Demo
