Introduction
Terminal-Bench is an open-source benchmark suite and execution harness for evaluating AI agents in realistic terminal environments. It provides a reproducible task dataset together with a harness that runs agents end-to-end on system-level tasks such as building code, training models, and setting up services.
Key features
- Reproducible task dataset and test scripts (see the `tasks` directory).
- An execution harness that connects models to a sandboxed terminal environment and supports leaderboard submissions.
- Comprehensive documentation and quickstart guides at https://www.tbench.ai/docs.
- Adapter and contribution mechanisms for extending tasks and integrations.
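To make the "reproducible task" idea concrete: a task generally pairs an instruction for the agent with a scripted, deterministic check of the terminal's end state. The sketch below is a hypothetical illustration of that pattern only; the names used here (`Task`, `instruction`, `check`) are invented for this example and are not Terminal-Bench's actual task schema.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """Hypothetical task descriptor: a natural-language goal for the
    agent plus a deterministic check run against the environment
    after the agent finishes."""
    task_id: str
    instruction: str               # goal given to the agent
    check: Callable[[str], bool]   # verifies the terminal's end state

# Example: the task is solved iff the expected file contents exist.
hello_task = Task(
    task_id="hello-world",
    instruction="Create greeting.txt containing 'hello'",
    check=lambda contents: contents.strip() == "hello",
)

# A harness would run the agent, read the resulting file, then score:
print(hello_task.check("hello\n"))  # passing end state -> True
print(hello_task.check("goodbye"))  # failing end state -> False
```

Because the check is a pure function of the end state, re-running the same task against the same environment yields the same verdict, which is what makes the dataset reproducible.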
Use cases
- Evaluating LLM agents on real-world engineering tasks involving system and environment interactions.
- Regression and capability testing for agent development workflows.
- Building and validating automation pipelines for complex engineering tasks.
Technical details
- Implemented primarily in Python and shell scripts, with a CLI (`tb`) for running evaluations.
- Supports Docker sandboxing and virtual-environment isolation for reproducible, secure testing.
- Extensible task/adapters architecture for easy addition of new benchmarks and integrations.
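Docker sandboxing of the kind described above typically means executing each task's commands inside an ephemeral container so that nothing leaks into or out of the host. The sketch below shows that general pattern by constructing a `docker run` invocation from Python; it is a minimal illustration under assumed defaults (the `python:3.11-slim` image, no network), not the harness's actual implementation.

```python
import subprocess
from typing import List

def sandboxed_argv(image: str, command: str) -> List[str]:
    """Build a `docker run` argv that executes `command` in a
    throwaway container: --rm removes the container afterwards,
    --network none disconnects it, and --read-only mounts the
    root filesystem read-only to limit side effects."""
    return [
        "docker", "run", "--rm",
        "--network", "none",
        "--read-only",
        image,
        "sh", "-c", command,
    ]

argv = sandboxed_argv("python:3.11-slim", "python -c 'print(1 + 1)'")
print(argv[:3])  # ['docker', 'run', '--rm']

# With Docker installed, this would run the command in isolation:
# subprocess.run(argv, capture_output=True, text=True, check=True)
```

The point of building the argv as a list (rather than a shell string) is that arguments are passed to Docker verbatim, avoiding host-side shell injection from task-supplied commands.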