For a curated list of AI tools and resources for developers, see AI Resources.

Self-Operating Computer

A framework that enables multimodal models to operate a computer by observing the screen and issuing mouse/keyboard actions, supporting voice, OCR and multiple model backends.

Summary

The Self-Operating Computer framework lets multimodal models view the screen and perform mouse and keyboard actions to complete tasks. It supports multiple model backends and input modes, which makes it suitable for automation, accessibility, and research.
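
The observe-and-act cycle described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the framework's actual code: Action, run_loop, capture, decide, and execute are hypothetical names, and the screen capture, model, and input driver are replaced by stand-in functions so the example runs without a display or an API key.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    """One step chosen by the model (hypothetical schema, for illustration)."""
    kind: str                    # "click", "type", or "done"
    x: Optional[int] = None      # screen coordinates for "click"
    y: Optional[int] = None
    text: Optional[str] = None   # text to enter for "type"


def run_loop(
    objective: str,
    capture: Callable[[], bytes],
    decide: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat: capture the screen, ask the model for an action, perform it."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()
        action = decide(objective, screenshot, history)
        history.append(action)
        if action.kind == "done":
            break  # the model reports that the objective is complete
        execute(action)
    return history


# Stand-ins (assumptions) so the sketch runs without a display or a model.
def fake_capture() -> bytes:
    return b"<png bytes>"


def scripted_decide(objective: str, screenshot: bytes, history: List[Action]) -> Action:
    # A real backend would send the screenshot to a multimodal model here.
    script = [
        Action("click", x=100, y=200),
        Action("type", text="hello"),
        Action("done"),
    ]
    return script[len(history)]


performed: List[Action] = []
result = run_loop("say hello", fake_capture, scripted_decide, performed.append)
print([a.kind for a in result])  # ['click', 'type', 'done']
```

In a real setup, capture would take a screenshot, decide would call the selected model backend, and execute would drive the mouse and keyboard. The max_steps cap is a design choice in this sketch to keep a confused model from looping indefinitely.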

Key features

  • Multi-model compatibility: works with GPT-4 variants, Gemini Vision, Claude, Qwen-VL, LLaVA and others.
  • Multiple operation modes: voice, OCR, and Set-of-Mark (SoM) visual prompting to improve visual grounding.
  • Easy start: install with pip and run the operate CLI; Docker examples and cross-platform support (macOS/Windows/Linux) are included.
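
To illustrate the idea behind Set-of-Mark prompting mentioned above: candidate screen elements are given numbered marks, so the model can answer with a mark number instead of raw pixel coordinates. The sketch below is a hedged illustration, not the framework's implementation. Box, assign_marks, and resolve_mark are hypothetical names, and the bounding boxes are made-up values standing in for the output of an element detector.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """Bounding box of a detected UI element, in screen pixels."""
    left: int
    top: int
    right: int
    bottom: int

    def center(self) -> Tuple[int, int]:
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def assign_marks(boxes: List[Box]) -> Dict[int, Box]:
    """Number boxes in reading order: top to bottom, then left to right."""
    ordered = sorted(boxes, key=lambda b: (b.top, b.left))
    return {label: box for label, box in enumerate(ordered, start=1)}


def resolve_mark(marks: Dict[int, Box], label: int) -> Tuple[int, int]:
    """Turn the mark number chosen by the model into a click position."""
    if label not in marks:
        raise KeyError(f"unknown mark: {label}")
    return marks[label].center()


# Made-up boxes standing in for detector output (assumption).
boxes = [Box(300, 40, 380, 70), Box(20, 40, 120, 70), Box(20, 200, 220, 240)]
marks = assign_marks(boxes)
print(resolve_mark(marks, 2))  # (340, 55)
```

Resolving a mark to the center of its box means the click position comes from the detector rather than from coordinates the model guessed, which is the grounding benefit the feature list refers to.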

Use cases

  • Desktop automation and visual test scripting.
  • Research into real-world interaction capabilities of multimodal models.

Technical details

  • Pure Python implementation with audio/OCR modules and integrations for local/cloud model providers; includes examples and configuration for model selection and operation modes.

Resource Info
🌱 Open Source 🏗️ Framework