LongBench

LongBench is a bilingual (English/Chinese) multi-task benchmark for long-context understanding and reasoning, covering single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, and code repository understanding.

Overview

LongBench (v1 and v2) provides large-scale datasets and evaluation tooling for assessing model capabilities on realistic, multi-task long-context workloads, with context lengths ranging from thousands to millions of words.

Key features

  • Multi-task and multi-length coverage including single-document QA, multi-document QA, long in-context learning, long-dialogue understanding, and code-repo tasks.
  • Reproducible datasets and evaluation scripts with a public leaderboard for tracking progress.
  • Data is provided in multiple formats (Hugging Face datasets, JSON) and includes citation information for academic use; a short loading sketch follows this list.
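
As a minimal sketch of loading the data, the example below uses the Hugging Face datasets library. The dataset ID THUDM/LongBench-v2, the split name, and the field names shown are assumptions based on the public hub listing; check the project page for the exact identifiers.

```python
# Minimal sketch: load LongBench data via the Hugging Face datasets library.
# The dataset ID and field names below are assumptions -- verify them on the
# project page or Hugging Face hub before use.
from datasets import load_dataset

# LongBench v2 is distributed as a single multiple-choice set (assumed ID).
v2 = load_dataset("THUDM/LongBench-v2", split="train")

sample = v2[0]
print(sample.keys())           # inspect the available fields
print(len(sample["context"]))  # contexts run from thousands to millions of characters
```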

Use cases

  • Benchmarking and selecting models for long-context applications.
  • Research into retrieval-augmented methods, long-context memory, and reasoning improvements.
  • Regression testing for long-context services and model deployment validation.

Technical notes

  • In LongBench v2, tasks are formatted as multiple-choice questions to allow objective, statistically reliable automatic evaluation.
  • Evaluation pipelines can be automated with the provided scripts; for example, deploy a model with vLLM and run the pred.py / result.py workflow (a rough sketch of such a loop follows this list).
  • See the project page for paper links, the leaderboard, and dataset downloads.
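
As a rough illustration of that workflow (not the repository's actual pred.py / result.py), the sketch below assumes a model is already served behind vLLM's OpenAI-compatible endpoint at http://localhost:8000/v1 and that samples carry context, question, choice_A..D, and answer fields; the endpoint URL, model name, and field names are all assumptions. It sends each prompt, extracts the predicted option letter, and reports accuracy.

```python
# Rough sketch of a LongBench-style multiple-choice evaluation loop.
# Assumes a model is already served via vLLM's OpenAI-compatible API;
# the endpoint URL, model name, and sample fields below are assumptions.
import re
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def predict(sample: dict, model: str = "my-long-context-model") -> str:
    """Ask the served model to pick an option letter for one sample."""
    prompt = (
        f"{sample['context']}\n\n"
        f"Question: {sample['question']}\n"
        f"A. {sample['choice_A']}\nB. {sample['choice_B']}\n"
        f"C. {sample['choice_C']}\nD. {sample['choice_D']}\n"
        "Answer with a single letter (A/B/C/D)."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=8,
    )
    match = re.search(r"[ABCD]", resp.choices[0].message.content or "")
    return match.group(0) if match else ""

def accuracy(samples: list[dict]) -> float:
    """Compare predicted letters against gold answers and report accuracy."""
    correct = sum(predict(s) == s["answer"] for s in samples)
    return correct / max(len(samples), 1)
```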

Resource Info
🌱 Open Source 📊 Benchmark