MLServer

MLServer is an open-source, high-performance inference server that supports multi-model serving, the V2 inference protocol over REST and gRPC, and extensible runtimes.

Author: SeldonIO

Since: 2020-06-16


Overview

MLServer is an open-source inference server designed for production model serving. It implements the V2 inference protocol (also known as the Open Inference Protocol) over REST and gRPC, and supports multi-model serving, adaptive batching, and extensible inference runtimes (e.g., MLflow, Hugging Face, XGBoost). MLServer integrates well with Kubernetes-native deployment frameworks such as Seldon Core and KServe.
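To make the V2 protocol concrete, here is a minimal client sketch using only the Python standard library. It assumes a server listening on localhost port 8080 (MLServer's usual HTTP port) and the V2 REST route /v2/models/{name}/infer; the model name and the input name "input-0" are hypothetical placeholders. Each V2 tensor carries an explicit name, shape, datatype, and row-major flattened data.

```python
import json
import urllib.request
from typing import List


def build_v2_request(name: str, data: List[List[float]], datatype: str = "FP32") -> dict:
    """Build a V2 inference request body holding a single 2-D tensor input."""
    # V2 tensors are described by name, shape and datatype; data is flattened row-major.
    flat = [x for row in data for x in row]
    return {
        "inputs": [
            {
                "name": name,
                "shape": [len(data), len(data[0])],
                "datatype": datatype,
                "data": flat,
            }
        ]
    }


def infer(model_name: str, body: dict, base_url: str = "http://localhost:8080") -> dict:
    """POST a V2 request to the REST infer endpoint and decode the JSON response."""
    url = f"{base_url}/v2/models/{model_name}/infer"
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Against a running server, a call such as infer("my-model", build_v2_request("input-0", [[1.0, 2.0]])) would return a JSON document whose "outputs" list uses the same tensor layout. Because the payload shape does not depend on the framework behind the model, the same client works for every model the server hosts.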

Key features

Multi-model serving: run multiple models in the same process for resource efficiency.
Parallel inference and adaptive batching: improve throughput by spreading inference across a pool of worker processes and grouping concurrent requests into batches.
Extensible runtimes and plugins: built-in and custom runtimes to support various model formats.
Standard protocol support: V2-compatible REST/gRPC interfaces for interoperability.
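Adaptive batching trades a small, bounded delay for larger model calls: requests are queued until either a maximum batch size is reached or a maximum wait time expires, then processed in one call whose outputs are split back to the callers. MLServer exposes this behaviour through the max_batch_size and max_batch_time model settings; the asyncio sketch below only illustrates the idea and is not MLServer's implementation. The AdaptiveBatcher class and its parameters are hypothetical, and max_batch_time is in seconds in this sketch.

```python
import asyncio
from typing import Callable, List, Tuple


class AdaptiveBatcher:
    """Toy adaptive batcher: groups concurrent requests into one model call."""

    def __init__(self, predict_batch: Callable[[List[float]], List[float]],
                 max_batch_size: int = 4, max_batch_time: float = 0.01):
        self.predict_batch = predict_batch
        self.max_batch_size = max_batch_size
        self.max_batch_time = max_batch_time  # seconds (assumption for this sketch)
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []  # recorded so batching can be inspected

    async def predict(self, x: float) -> float:
        # Each caller enqueues its input with a future and waits for its own result.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Block for the first request, then collect more until full or timed out.
            items = [await self.queue.get()]
            deadline = loop.time() + self.max_batch_time
            while len(items) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            self.batch_sizes.append(len(items))
            # One model call for the whole batch, then hand each caller its output.
            outputs = self.predict_batch([x for x, _ in items])
            for (_, fut), y in zip(items, outputs):
                fut.set_result(y)


async def demo() -> Tuple[List[float], List[int]]:
    batcher = AdaptiveBatcher(lambda xs: [2 * x for x in xs],
                              max_batch_size=4, max_batch_time=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.predict(float(i)) for i in range(6)))
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return list(results), batcher.batch_sizes
```

Running asyncio.run(demo()) sends six concurrent requests through a batcher capped at four, so they are served in at most a couple of model calls rather than six, while each caller still receives only its own result.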

Use cases

Production model inference on Kubernetes.
Exposing heterogeneous models with a unified inference API.
Building low-latency, high-throughput online inference pipelines.
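Serving heterogeneous models behind one API is largely a matter of configuration: MLServer discovers a model-settings.json file per model, each naming the model, the runtime class ("implementation"), and the artifact location ("parameters.uri"). The sketch below writes such a model repository with the standard library. The model names and artifact file names are hypothetical, and the artifacts themselves are not created here.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def write_model_repository(root: Path, models: Dict[str, Tuple[str, str]]) -> List[Path]:
    """Write one folder per model, each containing its own model-settings.json."""
    written = []
    for name, (implementation, uri) in models.items():
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        settings = {
            "name": name,                    # name used in /v2/models/{name}/infer
            "implementation": implementation,  # runtime class as an import path
            "parameters": {"uri": uri},      # artifact path, relative to this folder
        }
        path = folder / "model-settings.json"
        path.write_text(json.dumps(settings, indent=2))
        written.append(path)
    return written


# Two hypothetical models backed by different runtimes, exposed through one V2 API.
MODELS = {
    "iris-sklearn": ("mlserver_sklearn.SKLearnModel", "./model.joblib"),
    "churn-xgboost": ("mlserver_xgboost.XGBoostModel", "./model.bst"),
}
```

With real artifacts in place, starting the server on the repository folder (for example with mlserver start .) would expose both models under the same protocol, differing only in the model name in the URL.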

Technical notes

Implemented in Python with plugin-based runtime architecture.
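Ships built-in runtimes for frameworks such as scikit-learn, XGBoost, LightGBM, MLflow, and Hugging Face; other formats (e.g., PyTorch, TensorFlow, or ONNX models) can be served through the MLflow runtime or a custom runtime.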
Apache-2.0 licensed, actively maintained with documentation and examples.
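The plugin-based architecture rests on referring to runtimes by import path: a model's settings name a class such as mlserver_sklearn.SKLearnModel, and custom runtimes are typically written by subclassing mlserver.MLModel and implementing its async load and predict methods. The stdlib sketch below shows only the general technique of resolving a dotted "module.ClassName" string to a class; the function name is hypothetical and this is not MLServer's actual loader.

```python
import importlib


def load_runtime_class(implementation: str) -> type:
    """Resolve a dotted 'package.module.ClassName' string to the class it names."""
    module_path, _, class_name = implementation.rpartition(".")
    if not module_path:
        raise ValueError(f"expected 'module.ClassName', got {implementation!r}")
    # Importing the module by name is what lets new runtimes plug in without code changes.
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```

Because the server only needs an importable path, a third-party package can contribute a runtime simply by being installed and referenced from configuration.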

