
Terminal-Bench

A benchmark and execution harness for evaluating AI agents in real terminal environments.

Introduction

Terminal-Bench is an open-source benchmark suite and execution harness for testing AI agents in realistic terminal environments. It provides reproducible tasks and connects agents to sandboxed terminals so they can be evaluated on end-to-end, system-level work such as building code, training models, and setting up services.

Key features

  • Reproducible task dataset and test scripts (see the tasks directory).
  • An execution harness that connects models to a sandboxed terminal environment and supports leaderboard submissions.
  • Comprehensive documentation and quickstart guides at https://www.tbench.ai/docs (a minimal quickstart sketch follows this list).
  • Adapter and contribution mechanisms for extending tasks and integrations.
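As a minimal quickstart sketch: this assumes the CLI ships on PyPI as terminal-bench and that a reference "oracle" agent and a hello-world task exist as described in the project docs; exact command names and flags vary by release, so defer to https://www.tbench.ai/docs for the current invocation.

    # Install the Terminal-Bench CLI (assumed PyPI package name: terminal-bench).
    pip install terminal-bench

    # Sanity-check the setup by running the reference ("oracle") agent on a
    # single task in its Docker sandbox; --agent and --task-id follow the
    # quickstart docs at the time of writing and may differ between versions.
    tb run --agent oracle --task-id hello-world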

Use cases

  • Evaluating LLM agents on real-world engineering tasks involving system and environment interactions.
  • Regression and capability testing in agent development workflows.
  • Building and validating automation pipelines for complex engineering tasks.

Technical details

  • Implemented primarily in Python and shell scripts, with a CLI (tb) for running evaluations (see the example run after this list).
  • Supports Docker sandboxing and virtual environment isolation for reproducible, secure testing.
  • Extensible task and adapter architecture for adding new benchmarks and integrations.
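As a hedged sketch of a full evaluation run: the dataset, agent, and model identifiers below are placeholders in the style of the public docs, and the installed version of tb may expect different flag names.

    # Evaluate an agent across the core task dataset; every task runs in
    # its own Docker sandbox, so a running Docker daemon is required.
    export ANTHROPIC_API_KEY=your-key-here   # provider key depends on the model
    tb run \
      --dataset terminal-bench-core \
      --agent terminus \
      --model anthropic/claude-sonnet-4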


Resource Info

  • Author: Terminal-Bench Team
  • Added: 2025-09-30
  • Open source since: 2025-01-17
  • Tags: Open Source, Benchmark, Evaluation