CogAgent

Open-source end-to-end VLM GUI agent supporting bilingual, screenshot-based interaction.

Author: zai-org

Since: 2023-11-28

CogAgent connects screenshots with language to produce structured GUI actions.

CogAgent is an open-source, end-to-end GUI agent built on a vision-language model (VLM). It takes a screenshot and a natural-language instruction as input and generates a sequence of executable GUI operations. The project provides model weights, inference demos, and documentation that reports its results on GUI grounding and action prediction benchmarks.

Detailed Introduction

CogAgent integrates visual understanding with action generation for GUI tasks, supporting both Chinese and English inputs. The repository includes the CogAgent-9B model, inference examples, and a technical blog presenting evaluations. See the project site for more details.

Main Features

GUI-focused: recognizes UI elements and outputs action sequences for execution.
Bilingual support: accepts tasks and produces outputs in both Chinese and English.
Multi-step execution: uses the history of earlier steps as context and plans step by step.
Open license: code under Apache-2.0; model license details are provided in the repo.
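
The multi-step behaviour listed above can be pictured as a session that records each action and feeds the history back alongside the next screenshot. The sketch below is illustrative only: `GuiAction`, `Session`, `build_prompt`, and the prompt layout are hypothetical names invented for this example, not CogAgent's actual input format, which is documented in the repository.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class GuiAction:
    """One structured GUI step (hypothetical schema, for illustration only)."""
    kind: str                    # e.g. "click", "type", "scroll"
    target: Tuple[int, int]      # pixel coordinates of the UI element
    text: str = ""               # text payload for "type" actions

    def describe(self) -> str:
        suffix = f" text={self.text!r}" if self.text else ""
        return f"{self.kind} at {self.target}{suffix}"


@dataclass
class Session:
    """Holds the task and the actions taken so far."""
    task: str
    history: List[GuiAction] = field(default_factory=list)

    def build_prompt(self) -> str:
        # Assumed layout: the task first, then a numbered list of earlier steps.
        lines = [f"Task: {self.task}", "History:"]
        if not self.history:
            lines.append("(none)")
        for i, action in enumerate(self.history, start=1):
            lines.append(f"{i}. {action.describe()}")
        return "\n".join(lines)


session = Session(task="Open settings and enable dark mode")
session.history.append(GuiAction("click", (40, 12)))
session.history.append(GuiAction("type", (200, 80), text="dark"))
prompt = session.build_prompt()
print(prompt)
```

Keeping the history as structured records rather than free text makes it straightforward to replay, log, or truncate earlier steps before building the next prompt.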

Use Cases

CogAgent is suited to desktop and web automation, visual testing, accessibility assistants, and research prototypes. It can be deployed as a local inference service or integrated into RPA and testing pipelines to automate GUI interactions.
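
One way to picture the pipeline integration is a loop that captures a screenshot, asks the model for the next action, and executes it until the task is finished. The following is a minimal sketch under stated assumptions: `run_task`, the dictionary-based action record, and the `"done"` stop signal are hypothetical, and the model call is replaced by a scripted stub so the example runs without a GPU or a real screen.

```python
from typing import Callable, Dict, List

# Hypothetical action record, e.g. {"kind": "click", "x": 10, "y": 20}.
Action = Dict[str, object]


def run_task(
    task: str,
    capture: Callable[[], bytes],
    predict: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Drive a screenshot -> predict -> execute loop until a "done" action arrives."""
    history: List[Action] = []
    for _ in range(max_steps):       # hard cap so a confused model cannot loop forever
        screenshot = capture()
        action = predict(task, screenshot, history)
        if action.get("kind") == "done":
            break
        execute(action)
        history.append(action)
    return history


# Stub components standing in for the screen grabber, the model, and the executor.
scripted = iter([
    {"kind": "click", "x": 10, "y": 20},
    {"kind": "type", "text": "hello"},
    {"kind": "done"},
])
executed: List[Action] = []

steps = run_task(
    task="Say hello",
    capture=lambda: b"fake-png-bytes",
    predict=lambda task, image, history: next(scripted),
    execute=executed.append,
)
print(len(steps), "actions executed")
```

Passing the capture, predict, and execute steps in as callables keeps the loop independent of any particular inference server or automation backend, which is convenient when the same logic has to run in both a test harness and an RPA tool.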

Technical Features

Implemented in Python, CogAgent builds on VLM base models (e.g., GLM-4V-9B) with multi-stage training and strategy optimizations to improve GUI localization and action generation. The repo contains inference demos, deployment scripts, and fine-tuning guidelines, along with notes on GPU/VRAM requirements.
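
GUI localization eventually has to resolve to a pixel position that an automation tool can click. As a hedged illustration, the sketch below assumes the model reports a bounding box on a resolution-independent grid (0 to 1000 here, which is an assumption for this example; check the repository for the actual coordinate convention) and converts it to the pixel centre of the element. The function name `box_to_pixel_center` and its parameters are hypothetical.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) on the relative grid


def box_to_pixel_center(
    box: Box, width: int, height: int, scale: int = 1000
) -> Tuple[int, int]:
    """Map a box on a relative 0..scale grid to the pixel centre of the element."""
    x1, y1, x2, y2 = box
    if not (0 <= x1 <= x2 <= scale and 0 <= y1 <= y2 <= scale):
        raise ValueError(f"box {box} is inverted or outside the 0..{scale} grid")
    # Average the corners, normalise to 0..1, then scale to the screen size.
    cx = (x1 + x2) / 2 / scale * width
    cy = (y1 + y2) / 2 / scale * height
    return round(cx), round(cy)


center = box_to_pixel_center((250, 500, 750, 600), width=1920, height=1080)
print(center)
```

Working on a relative grid means the same predicted box remains valid if the screenshot is resized before inference, as long as the conversion uses the dimensions of the screen where the click is actually performed.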

Related Resources

NOFX

Agentic Data Scientist

BISHENG