
Self-Operating Computer

A framework that enables multimodal models to operate a computer by observing the screen and issuing mouse/keyboard actions, supporting voice, OCR and multiple model backends.

Summary

The Self-Operating Computer framework lets a multimodal model view the screen and issue mouse and keyboard actions to complete a user-specified task. It supports multiple model backends and input modes, which makes it suitable for automation, accessibility, and research.
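The observe-then-act cycle described above can be sketched as a small loop. This is a minimal illustration, not the project's actual code: the `Action` type, the `run_task` function, and the three injected callables (`capture_screen`, `decide`, `execute`) are all hypothetical names chosen for this example. In a real setup `decide` would call a multimodal model and `execute` would drive the mouse and keyboard.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    """Hypothetical action record; not the framework's real schema."""
    kind: str                    # "click", "type", or "done"
    x: Optional[int] = None      # screen coordinates for "click"
    y: Optional[int] = None
    text: Optional[str] = None   # text payload for "type"


def run_task(
    objective: str,
    capture_screen: Callable[[], bytes],
    decide: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Observe the screen, ask the model for one action, perform it, repeat."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                    # observe
        action = decide(objective, screenshot, history)  # model picks next step
        if action.kind == "done":                        # model reports completion
            break
        execute(action)                                  # act on the computer
        history.append(action)
    return history


if __name__ == "__main__":
    # Stubbed demo: a scripted "model" and an executor that only prints.
    script = iter([
        Action("click", x=100, y=200),
        Action("type", text="hello"),
        Action("done"),
    ])
    run_task(
        "demo objective",
        capture_screen=lambda: b"",
        decide=lambda objective, shot, history: next(script),
        execute=print,
    )
```

Passing the three steps in as callables keeps the loop independent of any particular model backend or input library, and the `max_steps` cap stops the loop if the model never reports completion.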

Key features

  • Multi-model compatibility: works with GPT-4 variants, Gemini Vision, Claude, Qwen-VL, LLaVA and others.
  • Multiple operation modes: voice, OCR, and Set-of-Mark (SoM) visual prompting to improve visual grounding.
  • Easy start: install with pip, run tasks through the operate CLI, and consult the Docker examples; macOS, Windows, and Linux are supported.
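Set-of-Mark prompting, mentioned in the features above, overlays labeled marks on regions of a screenshot so the model can answer with a label instead of raw pixel coordinates. The sketch below shows only the final step, turning a chosen label back into a click point. It assumes the marks and their bounding boxes are already available; the `Mark` type and the helper functions are hypothetical and do not reflect how the framework implements this internally.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Mark:
    """Hypothetical labeled region drawn on the screenshot."""
    label: str
    left: int
    top: int
    right: int
    bottom: int


def mark_center(mark: Mark) -> Tuple[int, int]:
    """Return the integer center of the mark's bounding box."""
    return ((mark.left + mark.right) // 2, (mark.top + mark.bottom) // 2)


def resolve_click(label: str, marks: Iterable[Mark]) -> Tuple[int, int]:
    """Map the label chosen by the model to screen coordinates."""
    by_label = {m.label: m for m in marks}
    if label not in by_label:
        raise KeyError(f"model chose unknown mark {label!r}")
    return mark_center(by_label[label])


if __name__ == "__main__":
    marks = [Mark("1", 0, 0, 100, 40), Mark("2", 200, 100, 300, 160)]
    print(resolve_click("2", marks))  # (250, 130)
```

Rejecting unknown labels explicitly matters here: a model can name a mark that was never drawn, and clicking a default location in that case would be worse than failing.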

Use cases

  • Desktop automation and visual test scripting.
  • Research into real-world interaction capabilities of multimodal models.

Technical details

  • Pure Python implementation with audio and OCR modules and integrations for local and cloud model providers. It includes examples and configuration for selecting the model and the operation mode.

Resource Info

  • Author: OthersideAI
  • Added Date: 2025-10-02
  • Open Source Since: 2023-11-04
  • Tags: Open Source Framework