Overview
LongBench (v1 and v2) provides large-scale datasets and evaluation tooling for assessing model capabilities on realistic long-context multitasks. Context lengths range from thousands to millions of words.
Key features
- Multi-task and multi-length coverage including single-document QA, multi-document QA, long in-context learning, long-dialogue understanding, and code-repo tasks.
- Reproducible datasets and evaluation scripts with a public leaderboard for tracking progress.
- Data is provided in multiple formats (Hugging Face datasets, JSON) and includes citation information for academic use.
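The Hugging Face copies can be loaded directly with the `datasets` library. A minimal sketch, assuming the repository IDs `THUDM/LongBench` (per-task subsets) and `THUDM/LongBench-v2` used on the project's Hugging Face pages; check the project site for the exact identifiers and split names before relying on them:

```python
# Minimal sketch of loading LongBench data from Hugging Face.
# Repository IDs, subset names, and splits below are assumptions based on
# the project's public Hugging Face pages.
from datasets import load_dataset

# LongBench v1 is organized as per-task subsets (e.g. "narrativeqa").
v1_task = load_dataset("THUDM/LongBench", "narrativeqa", split="test")

# LongBench v2 ships as a single multiple-choice dataset.
v2 = load_dataset("THUDM/LongBench-v2", split="train")

print(v1_task[0].keys())  # inspect the fields of one v1 example
print(v2[0].keys())       # inspect the fields of one v2 example
print(len(v1_task), len(v2))
```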
Use cases
- Benchmarking and selecting models for long-context applications.
- Research into retrieval-augmented methods, long-context memory, and reasoning improvements.
- Regression testing for long-context services and model deployment validation.
Technical notes
- In LongBench v2, every task is formatted as a multiple-choice question, which enables objective, automated scoring and statistically reliable accuracy numbers (LongBench v1 instead uses task-specific metrics such as F1 and ROUGE-L); a scoring sketch follows this list.
- Evaluation pipelines can be automated with the provided scripts; for example, deploy a model with vLLM and then run the pred.py / result.py workflow (sketched below).
- See the project page and the paper links on the project site for the leaderboard and dataset downloads.
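As a rough illustration of the prediction step that pred.py automates, the sketch below sends one LongBench v2 example to a model served locally through vLLM's OpenAI-compatible API. The endpoint URL, served model name, prompt template, and record field names are illustrative assumptions rather than the project's exact configuration:

```python
# Hedged sketch of the prediction step: query a model served locally via
# vLLM's OpenAI-compatible API with one LongBench v2 record. This mirrors
# the role of pred.py conceptually; it is not the official script.
from datasets import load_dataset
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Field names follow the dataset card; verify them against the downloaded data.
sample = load_dataset("THUDM/LongBench-v2", split="train")[0]
prompt = (
    f"{sample['context']}\n\n"
    f"Question: {sample['question']}\n"
    f"A. {sample['choice_A']}\nB. {sample['choice_B']}\n"
    f"C. {sample['choice_C']}\nD. {sample['choice_D']}\n"
    "Answer with a single letter (A/B/C/D)."
)

response = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",  # whatever model vLLM is serving
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```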
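And a correspondingly minimal scoring sketch for the multiple-choice format (the part result.py covers in the official workflow), assuming predictions were saved as JSON lines with hypothetical `prediction` and `answer` fields:

```python
# Minimal multiple-choice scoring sketch: pull the first A-D letter out of
# each model output and compare it with the gold answer letter.
import json
import re


def extract_choice(text: str) -> str | None:
    """Return the first standalone A/B/C/D letter found in the model output."""
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else None


correct = total = 0
with open("predictions.jsonl") as f:
    for line in f:
        record = json.loads(line)
        total += 1
        if extract_choice(record["prediction"]) == record["answer"]:
            correct += 1

print(f"accuracy: {correct / total:.3f}" if total else "no predictions found")
```

These sketches only show the overall shape of the pipeline; the official scripts additionally handle details such as prompt truncation and per-task aggregation.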