Overview
MLServer is an open-source inference server designed for production model serving. It implements the V2 inference protocol over both REST and gRPC, supports multi-model serving and adaptive batching, and provides extensible inference runtimes (e.g., MLflow, Hugging Face, XGBoost). MLServer integrates with Kubernetes-native deployment frameworks such as Seldon Core and KServe.
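As a quick illustration of the standardized interface, the sketch below sends a V2 REST inference request to a running MLServer instance. The model name `my-model`, the input tensor layout, and port 8080 are assumptions for illustration; the `/v2/models/{name}/infer` path and the payload shape come from the V2 inference protocol itself.

```python
# Minimal V2 REST client sketch. Assumes MLServer is listening on
# localhost:8080 and serving a model registered as "my-model".
import requests

inference_request = {
    "inputs": [
        {
            "name": "input-0",        # input tensor name
            "shape": [1, 4],          # one row of four features
            "datatype": "FP32",       # V2 datatype string
            "data": [0.1, 0.2, 0.3, 0.4],
        }
    ]
}

response = requests.post(
    "http://localhost:8080/v2/models/my-model/infer",
    json=inference_request,
)
response.raise_for_status()
print(response.json()["outputs"])
```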
Key features
- Multi-model serving: run multiple models in the same process for resource efficiency.
- Parallel inference and adaptive batching: improve throughput via worker pools and dynamic batching.
- Extensible runtimes and plugins: built-in and custom runtimes to support various model formats (a custom runtime sketch follows this list).
- Standard protocol support: V2-compatible REST/gRPC interfaces for interoperability.
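As a concrete example of runtime extensibility, the sketch below outlines a custom runtime that echoes its first input back. It follows MLServer's documented pattern of subclassing `mlserver.MLModel` and overriding `load` and `predict`; the class name `EchoRuntime` and the tensor names are illustrative, and the response is built directly from the V2 types rather than version-specific codec helpers.

```python
# Sketch of a custom MLServer runtime that echoes its first input.
# Register it through the "implementation" field of the model's
# model-settings.json; names here are illustrative.
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput


class EchoRuntime(MLModel):
    async def load(self) -> bool:
        # Load model artifacts here; this toy runtime has none.
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Echo the first V2 input tensor back as the single output.
        first_input = payload.inputs[0]
        return InferenceResponse(
            model_name=self.name,
            outputs=[
                ResponseOutput(
                    name="output-0",
                    shape=first_input.shape,
                    datatype=first_input.datatype,
                    data=first_input.data,
                )
            ],
        )
```

Adaptive batching is configured per model rather than in code: settings such as `max_batch_size` and `max_batch_time` in `model-settings.json` let MLServer group individual requests into batches before they reach `predict`.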
Use cases
- Production model inference on Kubernetes.
- Exposing heterogeneous models with a unified inference API (see the client sketch after this list).
- Building low-latency, high-throughput online inference pipelines.
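Because every model is exposed through the same V2 endpoints, one client code path can query heterogeneous models. In the hypothetical sketch below, the model names `sklearn-model` and `xgboost-model` and the port are assumptions for illustration; the metadata and inference paths are standard V2 endpoints.

```python
# Sketch: the same client code querying two differently-backed models
# through the shared V2 API. Model names and port are illustrative.
import requests

BASE_URL = "http://localhost:8080"
payload = {
    "inputs": [
        {"name": "input-0", "shape": [1, 4], "datatype": "FP32",
         "data": [0.1, 0.2, 0.3, 0.4]}
    ]
}

for model_name in ["sklearn-model", "xgboost-model"]:
    # V2 model metadata: name, versions and tensor signatures.
    meta = requests.get(f"{BASE_URL}/v2/models/{model_name}").json()
    # V2 inference: identical call shape regardless of the runtime.
    resp = requests.post(f"{BASE_URL}/v2/models/{model_name}/infer", json=payload)
    resp.raise_for_status()
    print(meta.get("name"), resp.json()["outputs"])
```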
Technical notes
- Implemented in Python with a plugin-based runtime architecture.
- Supports many model formats/backends via built-in runtimes (e.g., Scikit-Learn, XGBoost, LightGBM, MLflow, Hugging Face), with frameworks such as TensorFlow, PyTorch, and ONNX typically served through the MLflow runtime or custom runtimes.
- Apache-2.0 licensed, actively maintained with documentation and examples.