Detailed Introduction
gpustack is an open-source platform that unifies heterogeneous GPU resources into a single, orchestratable pool for model training and inference. It provides device discovery, resource abstraction, and centralized scheduling so teams can run distributed training and low-latency inference with improved utilization and observability.
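To make the pooling idea concrete, here is a minimal, illustrative sketch of how a discovered device might be represented once it is abstracted into the shared pool. The field names and values are assumptions for illustration, not gpustack's actual data model.

```python
# Illustrative only: a minimal sketch of a pooled GPU device record after
# discovery. Field names are assumptions, not gpustack's actual schema.
from dataclasses import dataclass

@dataclass
class GPUDevice:
    worker: str            # host or node the device was discovered on
    index: int             # device index on that worker
    model: str             # e.g. "NVIDIA A100" or "AMD MI250"
    memory_total_mib: int
    memory_used_mib: int
    backend: str           # "cuda" or "rocm"
    driver_version: str

    @property
    def memory_free_mib(self) -> int:
        return self.memory_total_mib - self.memory_used_mib

# A scheduler can then treat the whole cluster as one flat pool of devices:
pool = [
    GPUDevice("node-1", 0, "NVIDIA A100", 81920, 20480, "cuda", "550.54"),
    GPUDevice("node-2", 0, "AMD MI250", 65536, 0, "rocm", "6.0"),
]
print(max(pool, key=lambda d: d.memory_free_mib).worker)  # node-2
```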
Main Features
- Resource pooling and device discovery: automatic identification of GPU model, memory and driver details, with support for CUDA and ROCm.
- Intelligent scheduling: policies based on job requirements, priorities, and reservations to maximize utilization and reduce queue time (see the scheduling sketch after this list).
- Observability: built-in metrics collection, job dashboards, and historical statistics with Prometheus/Grafana integration (a query example follows the list).
- Extensibility: plugin hooks for custom schedulers, lifecycle events and monitoring integrations.
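The scheduling sketch referenced above shows one way a placement policy can combine priority ordering with a best-fit memory check. It is a self-contained illustration under assumed field names, not gpustack's actual scheduler.

```python
# Illustrative placement policy: assign each job (highest priority first) to
# the device that satisfies its memory requirement while leaving the least
# free memory behind (best fit). Names and fields are assumptions.
from typing import Optional

def best_fit(devices: list[dict], jobs: list[dict]) -> dict[str, Optional[str]]:
    """Return a mapping of job name -> chosen device id (or None if queued)."""
    placements: dict[str, Optional[str]] = {}
    free = {d["id"]: d["memory_free_mib"] for d in devices}
    for job in sorted(jobs, key=lambda j: j["priority"], reverse=True):
        candidates = [i for i, mem in free.items() if mem >= job["memory_mib"]]
        if not candidates:
            placements[job["name"]] = None          # stays queued
            continue
        chosen = min(candidates, key=lambda i: free[i])   # tightest fit
        free[chosen] -= job["memory_mib"]
        placements[job["name"]] = chosen
    return placements

devices = [{"id": "node-1/gpu0", "memory_free_mib": 61440},
           {"id": "node-2/gpu0", "memory_free_mib": 16384}]
jobs = [{"name": "train-llm", "memory_mib": 40960, "priority": 10},
        {"name": "notebook", "memory_mib": 8192, "priority": 1}]
print(best_fit(devices, jobs))
# {'train-llm': 'node-1/gpu0', 'notebook': 'node-2/gpu0'}
```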
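For the observability feature, metrics exposed through the Prometheus integration can be read back over Prometheus's standard HTTP API. The query below is a hedged example: the metric name and the Prometheus address depend on the exporter and deployment in use.

```python
# Pull a cluster-wide utilization figure from Prometheus. The /api/v1/query
# path is Prometheus's standard query API; the address and the metric name
# (typical of the DCGM exporter) are assumptions about a given deployment.
import requests

PROMETHEUS = "http://prometheus.example.internal:9090"   # assumed address
QUERY = "avg(DCGM_FI_DEV_GPU_UTIL)"                       # assumed exporter metric

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    print(f"cluster-wide average GPU utilization: {result[0]['value'][1]}%")
else:
    print("no samples returned for the query")
```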
Use Cases
- Research and education clusters: share GPUs safely across projects while avoiding memory contention and device conflicts.
- Enterprise training platforms: orchestrate large-scale distributed training and control costs.
- Online inference fleets: allocate GPUs dynamically based on request load for low-latency, cost-effective serving.
Technical Highlights
gpustack follows cloud-native principles and integrates with container ecosystems and orchestration tooling. It exposes a RESTful API and a CLI for automation (see the sketch below), supports modular deployment of the scheduler, monitoring, and access layers, and is released under the Apache-2.0 license, with community documentation available on the project website.
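As a hedged example of driving the RESTful API from a script, the snippet below lists GPU devices and prints their free memory. The endpoint path, server address, response shape, and token handling are assumptions for illustration; the project documentation defines the actual API surface.

```python
# Minimal automation sketch against a gpustack-style REST API. The server
# URL, the /v1/gpu-devices path, and the bearer-token auth are assumptions.
import os
import requests

SERVER = os.environ.get("GPUSTACK_SERVER", "http://localhost")   # assumed address
TOKEN = os.environ["GPUSTACK_API_TOKEN"]                         # assumed env var

session = requests.Session()
session.headers.update({"Authorization": f"Bearer {TOKEN}"})

# List discovered GPU devices and report free memory per device
# (hypothetical endpoint and response fields).
resp = session.get(f"{SERVER}/v1/gpu-devices", timeout=10)
resp.raise_for_status()
for device in resp.json().get("items", []):
    mem = device.get("memory", {})
    print(device.get("name"), mem.get("total", 0) - mem.get("used", 0), "bytes free")
```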