Self-Operating Computer

A framework that enables multimodal models to operate a computer by observing the screen and issuing mouse/keyboard actions, supporting voice, OCR and multiple model backends.

OthersideAI · Since 2023-11-04



Summary

The Self-Operating Computer framework lets a multimodal model view the screen and perform mouse and keyboard actions to complete a task. It supports multiple model backends and input modes, which makes it suitable for automation, accessibility, and research.
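The observe-then-act cycle described above can be sketched as a small loop. This is a minimal illustration, not the framework's actual code: the `Action` schema, `run_task`, and the `fake_*` stand-ins are hypothetical names chosen for this example, and the stand-ins replace real screen capture and model calls so the sketch runs without a display or an API key.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    """One step chosen by the model (hypothetical schema)."""
    kind: str                      # "click", "write", "press", or "done"
    x: Optional[float] = None      # click position as a fraction of screen width
    y: Optional[float] = None      # click position as a fraction of screen height
    text: Optional[str] = None     # text to type or key to press


def run_task(
    objective: str,
    capture: Callable[[], bytes],
    decide: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Loop: screenshot -> model decision -> input action, until "done"."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()                       # observe the screen
        action = decide(objective, screenshot, history)  # ask the model
        history.append(action)
        if action.kind == "done":
            break                                    # model reports completion
        execute(action)                              # drive mouse or keyboard
    return history


# Stand-ins (assumptions for this sketch, not real capture or model calls).
def fake_capture() -> bytes:
    return b"fake-png-bytes"


def fake_decide(objective: str, screenshot: bytes, history: List[Action]) -> Action:
    script = [
        Action("click", x=0.5, y=0.1),
        Action("write", text=objective),
        Action("done"),
    ]
    return script[len(history)]


performed: List[Action] = []
history = run_task("open the docs", fake_capture, fake_decide, performed.append)
print([a.kind for a in history])
```

In a real setup, `capture` would take a screenshot, `decide` would call a vision model, and `execute` would send input events; the `max_steps` cap is a design choice that keeps a confused model from acting indefinitely.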

Key features

Multi-model compatibility: works with GPT-4 variants, Gemini Vision, Claude, Qwen-VL, LLaVA and others.
Multiple operation modes: voice, OCR, and Set-of-Mark (SoM) visual prompting to improve visual grounding.
Easy start: install with pip, run the operate CLI, or use the Docker examples; runs on macOS, Windows and Linux.
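To illustrate why Set-of-Mark prompting helps with visual grounding, here is a rough sketch of the general idea: detected UI elements get short labels, the model answers with a label instead of raw coordinates, and the label is mapped back to a click point. The `Box` type, the numeric label format, and both function names are assumptions made for this example; how the framework detects elements and draws marks is not covered here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """Pixel bounding box of a detected UI element (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int


def assign_marks(boxes: List[Box]) -> Dict[str, Box]:
    """Give each box a short label, ordered top-to-bottom then left-to-right."""
    ordered = sorted(boxes, key=lambda b: (b.top, b.left))
    return {str(i): box for i, box in enumerate(ordered, start=1)}


def mark_to_click(
    marks: Dict[str, Box], label: str, screen: Tuple[int, int]
) -> Tuple[float, float]:
    """Turn a label chosen by the model into a click point (fractions of the screen)."""
    box = marks[label]
    width, height = screen
    center_x = (box.left + box.right) / 2
    center_y = (box.top + box.bottom) / 2
    return center_x / width, center_y / height


# Three made-up detections on a 1000x500 screen.
detected = [Box(100, 400, 300, 440), Box(800, 20, 900, 60), Box(100, 20, 200, 60)]
marks = assign_marks(detected)
print(mark_to_click(marks, "2", (1000, 500)))
```

Returning fractions of the screen rather than pixels keeps the click point independent of the display resolution.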

Use cases

Desktop automation and visual test scripting.
Research into real-world interaction capabilities of multimodal models.

Technical details

The framework is implemented in pure Python, with audio and OCR modules and integrations for local and cloud model providers. It includes examples and configuration for selecting the model and operation mode.
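One common way to support several model providers behind a single selection option is a name-to-handler registry. The sketch below shows that pattern only; the backend names (`cloud-vision`, `local-ocr`), the `Handler` type, and the functions are hypothetical and are not the framework's real model identifiers or API.

```python
from typing import Callable, Dict

# Hypothetical handler type: takes an objective, returns a description of the next step.
Handler = Callable[[str], str]

_BACKENDS: Dict[str, Handler] = {}


def register(name: str) -> Callable[[Handler], Handler]:
    """Decorator that records a backend under the name a user would select."""
    def decorator(fn: Handler) -> Handler:
        _BACKENDS[name] = fn
        return fn
    return decorator


@register("cloud-vision")
def cloud_vision(objective: str) -> str:
    # A real handler would call a hosted multimodal model here.
    return f"[cloud-vision] plan for: {objective}"


@register("local-ocr")
def local_ocr(objective: str) -> str:
    # A real handler would run OCR and a locally hosted model here.
    return f"[local-ocr] plan for: {objective}"


def select_backend(name: str) -> Handler:
    """Look up a backend by name, listing the valid choices on failure."""
    try:
        return _BACKENDS[name]
    except KeyError:
        known = ", ".join(sorted(_BACKENDS))
        raise ValueError(f"unknown model {name!r}; choose one of: {known}") from None


print(select_backend("local-ocr")("open a browser"))
```

With this layout, adding a provider means registering one more function, and the selection code stays unchanged.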
