Overview
Skypilot provides a unified layer to schedule and run jobs across cloud and on-prem clusters, handling environment setup and dependencies to make distributed training and inference reproducible and portable.
Key features
- One-command provisioning and management for multi-cloud/on-prem jobs.
- Automated environment setup, dependency installation and node orchestration.
- Support for multiple backends and common ML frameworks.
Use cases
- Rapid experimentation with large-scale training configurations on cloud providers.
- Simplifying model deployment and inference pipelines across heterogeneous clusters.
Technical highlights
- Plugin-based backend adapters for extensibility to different cloud vendors and self-hosted resources.
- CLI and Python SDK for integration into existing CI/CD and training workflows.