Terminal-Bench

A benchmark and execution harness for evaluating AI agents in real terminal environments.

Terminal-Bench Team · Since 2025-01-17


Introduction

Terminal-Bench is an open-source benchmark suite and execution harness for testing AI agents in realistic terminal environments. Its reproducible tasks evaluate agents on end-to-end, system-level work such as building code, training models, and setting up services.

Key features

Reproducible task dataset and test scripts (see the tasks directory).
An execution harness that connects models to a sandboxed terminal environment and supports leaderboard submissions.
Comprehensive documentation and quickstart guides at https://www.tbench.ai/docs.
Adapter and contribution mechanisms for extending tasks and integrations.

Use cases

Evaluating LLM agents on real-world engineering tasks involving system and environment interactions.
Regression and capacity testing for agent development workflows.
Building and validating automation pipelines for complex engineering tasks.

Technical details

Implemented primarily in Python and shell scripts, with a CLI (tb) for running evaluations.
Supports Docker sandboxing and virtual environment isolation for reproducible, secure testing.
Extensible task and adapter architecture for adding new benchmarks and integrations.
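
To make the Docker sandboxing idea concrete, here is a minimal sketch of how a harness might build an isolated `docker run` invocation for a task's test command. It is an illustration under stated assumptions, not Terminal-Bench's actual implementation: `SandboxTask`, `docker_argv`, and `passed` are hypothetical names, and the real task schema and `tb` internals may differ. Only standard `docker run` options (`--rm`, `--network none`) are used.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxTask:
    """Hypothetical task record; NOT the Terminal-Bench task schema."""
    image: str          # container image the task runs in
    test_command: str   # shell command whose exit code decides pass/fail


def docker_argv(task: SandboxTask, network: bool = False) -> list[str]:
    """Build a `docker run` argument list for a throwaway container."""
    argv = ["docker", "run", "--rm"]        # --rm: delete the container on exit
    if not network:
        argv += ["--network", "none"]       # no network access inside the sandbox
    argv += [task.image, "sh", "-c", task.test_command]
    return argv


def passed(exit_code: int) -> bool:
    """Assumed convention: a test command that exits with 0 counts as a pass."""
    return exit_code == 0


if __name__ == "__main__":
    task = SandboxTask(image="python:3.12-slim", test_command="python --version")
    # Print the command instead of running it, so the sketch works without Docker.
    print(shlex.join(docker_argv(task)))
```

With Docker installed, the argument list could be handed to `subprocess.run(...)` and the resulting `returncode` passed to `passed` to get a pass/fail result for the task.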

