Overview
OpenLLM is an open-source toolkit maintained by BentoML that simplifies self-hosting large language models. It offers a CLI and Python APIs, an OpenAI-compatible model server (`openllm serve`), a built-in web UI, and integrations with inference backends and cloud deployment platforms.
Key features
- One-command model serving: `openllm serve <model>` launches a service exposing OpenAI-compatible APIs and a web chat UI (see the sketch after this list).
- Broad model support: adapters and model repositories for many open-source LLMs (Llama, Mistral, Qwen, Gemma, etc.).
- Deployment options: Docker, Kubernetes, and BentoML/BentoCloud integrations for production use.
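
Because the server speaks the OpenAI API, the standard `openai` Python client can talk to it directly. A minimal sketch, assuming the server is running locally on OpenLLM's default port 3000; the model id is a placeholder for whatever you passed to `openllm serve`.

```python
# Query a model served by `openllm serve` through its OpenAI-compatible API.
from openai import OpenAI

# Assumption: default local port 3000; the api_key is unused by a local
# server but the client requires a non-empty value.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")

response = client.chat.completions.create(
    model="placeholder-model-id",  # replace with the model you served
    messages=[{"role": "user", "content": "One-line summary of OpenLLM?"}],
)
print(response.choices[0].message.content)
```
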
Use cases
- Quickly self-host models locally for experimentation or production.
- Provide an audit-friendly, observable inference service for teams.
- Integrate custom model repositories for organization-specific models.
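
For the custom-repository use case, one way to register an organization-specific repository is through the CLI; the sketch below wraps it in Python's `subprocess`. The `openllm repo add` subcommand, the repository name, and the URL here are assumptions based on recent OpenLLM releases; verify against `openllm --help` on your installed version.

```python
import subprocess

# Hypothetical name and URL for an organization-specific model repository;
# `openllm repo add <name> <url>` is assumed from recent OpenLLM releases.
subprocess.run(
    ["openllm", "repo", "add", "my-org", "https://github.com/my-org/openllm-models"],
    check=True,
)

# Models from the custom repository should now appear alongside the defaults.
subprocess.run(["openllm", "model", "list"], check=True)
```
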
Technical notes
- Python-based with a CLI and SDK; integrates with vLLM, BentoML, and other inference tooling.
- Does not bundle model weights; gated models require an `HF_TOKEN` and approved access on Hugging Face (see the sketch after this list).
- Apache-2.0 licensed, with an active community and detailed documentation.
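
A minimal sketch of serving a gated model, assuming `HF_TOKEN` is exported in your shell with a token whose account has been granted access; the model identifier is illustrative.

```python
import os
import subprocess

# Gated models need a Hugging Face token; never hard-code it in source.
if "HF_TOKEN" not in os.environ:
    raise SystemExit("export HF_TOKEN before serving gated models")

# Assumption: the model identifier below is illustrative; use
# `openllm model list` to see the names your installation accepts.
subprocess.run(
    ["openllm", "serve", "meta-llama/Llama-3.1-8B-Instruct"],
    check=True,
)
```
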