Summary
The Self-Operating Computer framework allows multimodal models to view the screen and perform mouse and keyboard actions to achieve tasks. It supports multiple backends and input modes, suitable for automation, accessibility, and research.
Key features
- Multi-model compatibility: works with GPT-4 variants, Gemini Vision, Claude, Qwen-VL, LLaVA, and others.
- Multiple operation modes: voice, OCR, and Set-of-Mark (SoM) visual prompting to improve visual grounding.
- Easy start: installable via pip, an `operate` CLI, Docker examples, and cross-platform support (macOS/Windows/Linux).
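The quickstart is a two-step process; a minimal sketch follows. The package name and the prompting behavior are assumptions based on the project's published instructions and may differ between versions:

```shell
# Install the framework from PyPI (package name assumed from project docs)
pip install self-operating-computer

# Launch the CLI; on first run it typically prompts for an API key,
# then asks for an objective to carry out on screen
operate
```

Because the tool takes control of the mouse and keyboard, it is best run in an environment where stray clicks are harmless, such as a dedicated desktop session or VM.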
Use cases
- Desktop automation and visual test scripting.
- Research into real-world interaction capabilities of multimodal models.
Technical details
- Pure Python implementation with audio and OCR modules and integrations for local and cloud model providers; includes examples and configuration options for model selection and operation modes.
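Model and mode selection happen at the command line. The sketch below illustrates the pattern; the specific model identifiers and flags are assumptions drawn from the project's documentation, so check `operate --help` for what your installed version actually supports:

```shell
# Choose a model backend with -m (identifiers are illustrative)
operate -m gpt-4-with-ocr

# Set-of-Mark (SoM) visual prompting for better visual grounding
operate -m gpt-4-with-som

# Speak the objective instead of typing it
operate --voice
```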