Overview
LongBench (v1 and v2) provides large-scale datasets and evaluation tooling for assessing model capabilities on realistic long-context multitasks. Context lengths range from thousands to millions of words.
Key features
- Multi-task and multi-length coverage including single-document QA, multi-document QA, long in-context learning, long-dialogue understanding, and code-repo tasks.
- Reproducible datasets and evaluation scripts with a public leaderboard for tracking progress.
- Data is provided in multiple formats (Hugging Face datasets, JSON) and includes citation information for academic use.
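The Hugging Face copies can be loaded directly with the `datasets` library. A minimal sketch, assuming the repository IDs `THUDM/LongBench` (per-task subsets) and `THUDM/LongBench-v2` used on the project's Hugging Face pages; check the project site for the exact identifiers and split names before relying on them:

```python
# Minimal sketch of loading LongBench data from Hugging Face.
# Repository IDs, subset names, and splits below are assumptions based on
# the project's public Hugging Face pages.
from datasets import load_dataset

# LongBench v1 is organized as per-task subsets (e.g. "narrativeqa").
v1_task = load_dataset("THUDM/LongBench", "narrativeqa", split="test")

# LongBench v2 ships as a single multiple-choice dataset.
v2 = load_dataset("THUDM/LongBench-v2", split="train")

print(v1_task[0].keys())  # inspect the fields of one v1 example
print(v2[0].keys())       # inspect the fields of one v2 example
print(len(v1_task), len(v2))
```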
Use cases
- Benchmarking and selecting models for long-context applications.
- Research into retrieval-augmented methods, long-context memory, and reasoning improvements.
- Regression testing for long-context services and model deployment validation.
Technical notes
- In LongBench v2, every task is formatted as a multiple-choice question, which enables objective, automated scoring and statistically reliable accuracy numbers (LongBench v1 instead uses task-specific metrics such as F1 and ROUGE-L); a scoring sketch follows this list.
- Evaluation pipelines can be automated with the provided scripts; for example, deploy a model with vLLM and then run the pred.py / result.py workflow (sketched below).
- See the project page and the paper links on the project site for the leaderboard and dataset downloads.
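As a rough illustration of the prediction step that pred.py automates, the sketch below sends one LongBench v2 example to a model served locally through vLLM's OpenAI-compatible API. The endpoint URL, served model name, prompt template, and record field names are illustrative assumptions rather than the project's exact configuration:

```python
# Hedged sketch of the prediction step: query a model served locally via
# vLLM's OpenAI-compatible API with one LongBench v2 record. This mirrors
# the role of pred.py conceptually; it is not the official script.
from datasets import load_dataset
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Field names follow the dataset card; verify them against the downloaded data.
sample = load_dataset("THUDM/LongBench-v2", split="train")[0]
prompt = (
    f"{sample['context']}\n\n"
    f"Question: {sample['question']}\n"
    f"A. {sample['choice_A']}\nB. {sample['choice_B']}\n"
    f"C. {sample['choice_C']}\nD. {sample['choice_D']}\n"
    "Answer with a single letter (A/B/C/D)."
)

response = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",  # whatever model vLLM is serving
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```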
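And a correspondingly minimal scoring sketch for the multiple-choice format (the part result.py covers in the official workflow), assuming predictions were saved as JSON lines with hypothetical `prediction` and `answer` fields:

```python
# Minimal multiple-choice scoring sketch: pull the first A-D letter out of
# each model output and compare it with the gold answer letter.
import json
import re


def extract_choice(text: str) -> str | None:
    """Return the first standalone A/B/C/D letter found in the model output."""
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else None


correct = total = 0
with open("predictions.jsonl") as f:
    for line in f:
        record = json.loads(line)
        total += 1
        if extract_choice(record["prediction"]) == record["answer"]:
            correct += 1

print(f"accuracy: {correct / total:.3f}" if total else "no predictions found")
```

These sketches only show the overall shape of the pipeline; the official scripts additionally handle details such as prompt truncation and per-task aggregation.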