Model Evaluation

There is no silver bullet for LLM evaluation, but the method you choose determines what you are able to see. This article breaks down the mainstream evaluation paradigms and explains the logic and limitations behind each.

Understanding the Four Mainstream Approaches to LLM Evaluation

How do you evaluate a large language model (LLM) rigorously in practice? The question sounds simple but runs surprisingly deep. Whether you are choosing between models, interpreting published results, or tracking the progress of fine-tuned or in-house models, the choice of evaluation method is critical.

This section covers four common ways to evaluate LLMs: multiple choice, verifiers, leaderboards, and LLM judges. Understanding how these methods work makes it much easier to interpret benchmarks, leaderboards, and the numbers reported in papers.
Overview of Evaluation Methods

Mainstream LLM evaluation methods fall into two broad categories: benchmark-based and judgment-based. The four most common methods are:

- Multiple choice
- Verifier
- Leaderboard
- LLM judge

The figure below shows how these four approaches relate to one another and which category each belongs to.
Method 1: Multiple-Choice Accuracy

Multiple-choice question sets such as MMLU are the most common benchmark-based method and primarily measure a model's knowledge recall. MMLU (Massive Multitask Language Understanding), for example, covers 57 subjects with roughly 16,000 multiple-choice questions, and the evaluation metric is accuracy.

The figure below illustrates the basic workflow of multiple-choice evaluation.

Code example: loading the model and running the evaluation

The following code loads the Qwen3 0.6B model in preparation for the multiple-choice evaluation.
```python
from pathlib import Path

import torch

from reasoning_from_scratch.ch02 import get_device
from reasoning_from_scratch.qwen3 import (
    download_qwen3_small, Qwen3Tokenizer,
    Qwen3Model, QWEN_CONFIG_06_B
)

device = get_device()

# Set matmul precision to "high" to
# enable Tensor Cores on compatible GPUs
torch.set_float32_matmul_precision("high")

# Uncomment the following line
# if you encounter device compatibility issues
# device = "cpu"

# Use the base model by default
WHICH_MODEL = "base"

if WHICH_MODEL == "base":
    download_qwen3_small(
        kind="base", tokenizer_only=False, out_dir="qwen3"
    )
    tokenizer_path = Path("qwen3") / "tokenizer-base.json"
    model_path = Path("qwen3") / "qwen3-0.6B-base.pth"
    tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)

elif WHICH_MODEL == "reasoning":
    download_qwen3_small(
        kind="reasoning", tokenizer_only=False, out_dir="qwen3"
    )
    tokenizer_path = Path("qwen3") / "tokenizer-reasoning.json"
    model_path = Path("qwen3") / "qwen3-0.6B-reasoning.pth"
    tokenizer = Qwen3Tokenizer(
        tokenizer_file_path=tokenizer_path,
        apply_chat_template=True,
        add_generation_prompt=True,
        add_thinking=True,
    )

else:
    raise ValueError(f"Invalid choice: WHICH_MODEL={WHICH_MODEL}")

model = Qwen3Model(QWEN_CONFIG_06_B)
model.load_state_dict(torch.load(model_path))
model.to(device)

# Optionally enable model compilation for potential performance gains
USE_COMPILE = False
if USE_COMPILE:
    torch._dynamo.config.allow_unspec_int_on_nn_module = True
    model = torch.compile(model)
```
Formatting a multiple-choice question into a prompt:
```python
def format_prompt(example):
    return (
        f"{example['question']}\n"
        f"A. {example['choices'][0]}\n"
        f"B. {example['choices'][1]}\n"
        f"C. {example['choices'][2]}\n"
        f"D. {example['choices'][3]}\n"
        "Answer: "
    )
```
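To make the expected input format concrete, here is a minimal, hand-written example item (an illustration, not an actual MMLU record; note that MMLU as distributed stores an integer answer index rather than a letter):

```python
# Illustrative MMLU-style item (hand-written, not drawn from the dataset)
example = {
    "question": "Which planet is known as the Red Planet?",
    "choices": ["Venus", "Mars", "Jupiter", "Saturn"],
    "answer": "B",  # assumed letter label for this sketch
}

print(format_prompt(example))
```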
Predicting the model's answer letter and comparing it against the reference answer:
```python
# generate_text_basic_stream_cache is assumed to be provided alongside
# get_device in the book's reasoning_from_scratch.ch02 module
from reasoning_from_scratch.ch02 import generate_text_basic_stream_cache


def predict_choice(model, tokenizer, prompt_fmt, max_new_tokens=8):
    # prompt_fmt is the formatted prompt, already encoded as a
    # (1, num_tokens) token-ID tensor
    pred = None
    for t in generate_text_basic_stream_cache(
        model=model,
        token_ids=prompt_fmt,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id,
    ):
        answer = tokenizer.decode(t.squeeze(0).tolist())
        for letter in answer:
            letter = letter.upper()
            # Stop as soon as a letter appears
            if letter in "ABCD":
                pred = letter
                break
        if pred:
            break
    return pred
```
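A hedged usage sketch for a single question follows; it assumes the tokenizer exposes an `encode` method and that `example["answer"]` holds the reference letter, as in the illustrative item above.

```python
# Encode the formatted prompt and predict an answer letter
prompt = format_prompt(example)
prompt_fmt = torch.tensor(
    tokenizer.encode(prompt), device=device
).unsqueeze(0)  # shape: (1, num_tokens)

pred = predict_choice(model, tokenizer, prompt_fmt)
print("Generated letter:", pred)
print("Correct?", pred == example["answer"])
```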
The actual output looks like this:

```
Generated letter: C
Correct? False
```
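The benchmark score is then simply the fraction of questions answered correctly. A minimal accuracy sketch, assuming `examples` is a list of items in the format shown above:

```python
def evaluate_accuracy(model, tokenizer, examples, device):
    # Fraction of questions whose predicted letter matches the reference
    num_correct = 0
    for ex in examples:
        prompt_fmt = torch.tensor(
            tokenizer.encode(format_prompt(ex)), device=device
        ).unsqueeze(0)
        pred = predict_choice(model, tokenizer, prompt_fmt)
        num_correct += int(pred == ex["answer"])
    return num_correct / len(examples)
```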
Multiple-choice evaluation is simple, direct, and easy to run at scale, which makes it a good fit for quick comparisons across many models, but it only measures knowledge recall and says little about reasoning ability or real-world performance.

Method 2: Automatic Scoring with Verifiers

Verifier-based methods let the model generate an answer freely and then use an external tool (such as a code interpreter or a calculator) to check automatically whether the answer is correct. They suit domains such as math and code, where correctness can be verified automatically.

The figure below illustrates the verifier-based scoring workflow.
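As a concrete illustration, below is a minimal verifier sketch (not from the original article): it checks a free-form numeric answer against a reference by extracting the last number in the model's output. The function name and the "last number is the final answer" convention are assumptions made for this sketch.

```python
import re

def verify_numeric_answer(model_output, reference):
    # Extract all signed integers/decimals from the free-form answer
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output.replace(",", ""))
    if not numbers:
        return False  # no numeric answer found
    # Convention (an assumption): treat the last number as the final answer
    return float(numbers[-1]) == float(reference)

# Usage example
print(verify_numeric_answer("Adding them up gives 12 apples in total.", "12"))  # True
print(verify_numeric_answer("I would estimate about 11.", "12"))                # False
```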
Because problems can be generated and checked automatically at scale, this method works well for evaluating reasoning ability, but it only applies to domains where answers can be verified automatically, and it depends on the accuracy of the external tools.
Method 3: Leaderboards and Preference Voting

Leaderboard methods rank models by how often users (or LLMs) prefer their outputs. LM Arena is a typical example: users compare the outputs of two models side by side and vote for the better one, and the aggregated votes form a leaderboard.

The figure below illustrates the leaderboard voting workflow.

Code example: an Elo rating implementation
```python
def elo_ratings(vote_pairs, k_factor=32, initial_rating=1000):
    # Initialize all models with the same base rating
    ratings = {
        model: initial_rating
        for pair in vote_pairs
        for model in pair
    }

    # Update ratings after each match
    for winner, loser in vote_pairs:
        # Expected score for the current winner
        expected_winner = 1.0 / (
            1.0 + 10 ** (
                (ratings[loser] - ratings[winner])
                / 400.0
            )
        )
        # k_factor determines sensitivity of updates
        ratings[winner] = (
            ratings[winner]
            + k_factor * (1 - expected_winner)
        )
        ratings[loser] = (
            ratings[loser]
            + k_factor * (0 - (1 - expected_winner))
        )
    return ratings
```
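For reference, a hedged usage sketch follows; the vote pairs are invented for illustration and will not reproduce the exact ratings shown below. Each pair records (winner, loser): the winner's expected score is 1 / (1 + 10^((R_loser - R_winner) / 400)), and each rating moves by k_factor times the gap between the actual result (1 for a win, 0 for a loss) and the expected score.

```python
# Illustrative vote pairs (winner, loser); names and votes are made up
vote_pairs = [
    ("GPT-5", "Llama-3"),
    ("GPT-5", "Claude-3"),
    ("Claude-3", "Llama-3"),
    ("Llama-4", "Llama-3"),
]

ratings = elo_ratings(vote_pairs)
for name, rating in sorted(ratings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:10s}: {rating:.1f}")
```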
Sample output:

```
GPT-5    : 1043.7
Claude-3 : 1015.2
Llama-4  : 1000.7
Llama-3  :  940.4
```

Leaderboards capture how well models are received in real-world use, but the results depend heavily on who is voting and what they prefer, and they say little about whether answers are actually correct.
Method 4: LLM Judge (an AI Grader)

LLM-judge methods use a strong LLM (such as GPT-5) as a grader that automatically scores model outputs against a rubric, combining scalability with consistency.

The figure below illustrates the LLM-judge workflow.

Code example: automatic scoring via the Ollama API
```python
import json
import urllib.request


def query_model(
    prompt,
    model="gpt-oss:20b",
    # If you used
    # OLLAMA_HOST=127.0.0.1:11435 ollama serve
    # update the address below
    url="http://localhost:11434/api/chat"
):
    # Create the data payload as a dictionary:
    data = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        # Settings required for deterministic responses:
        "options": {
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048
        }
    }

    # Convert the dictionary to JSON and encode it to bytes
    payload = json.dumps(data).encode("utf-8")

    # Create a POST request and add headers
    request = urllib.request.Request(
        url,
        data=payload,
        method="POST"
    )
    request.add_header("Content-Type", "application/json")

    response_data = ""
    # Send the request and capture the streaming response
    with urllib.request.urlopen(request) as response:
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            # Parse each line into JSON
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]

    return response_data


def rubric_prompt(instruction, reference_answer, model_answer):
    rubric = (
        "You are a fair judge assistant. You will be "
        "given an instruction, a reference answer, and "
        "a candidate answer to evaluate, according to "
        "the following rubric:\n\n"
        "1: The response fails to address the "
        "instruction, providing irrelevant, incorrect, "
        "or excessively verbose content.\n"
        "2: The response partially addresses the "
        "instruction but contains major errors, "
        "omissions, or irrelevant details.\n"
        "3: The response addresses the instruction to "
        "some degree but is incomplete, partially "
        "correct, or unclear in places.\n"
        "4: The response mostly adheres to the "
        "instruction, with only minor errors, "
        "omissions, or lack of clarity.\n"
        "5: The response fully adheres to the "
        "instruction, providing a clear, accurate, and "
        "relevant answer in a concise and efficient "
        "manner.\n\n"
        "Now here is the instruction, the reference "
        "answer, and the response.\n"
    )

    prompt = (
        f"{rubric}\n"
        f"Instruction:\n{instruction}\n\n"
        f"Reference Answer:\n{reference_answer}\n\n"
        f"Answer:\n{model_answer}\n\n"
        f"Evaluation: "
    )
    return prompt
```
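A hedged end-to-end sketch of how the two functions fit together, assuming a local Ollama server is running with the gpt-oss:20b model pulled; the instruction and answers below are illustrative placeholders, not the exact inputs behind the sample output.

```python
instruction = "If penguins could fly, would a penguin be able to fly?"
reference_answer = "Yes. Under that premise, a penguin would be able to fly."
model_answer = "Yes, given the stated premise a penguin would be able to fly."

judge_prompt = rubric_prompt(instruction, reference_answer, model_answer)
print(query_model(judge_prompt))
```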
Sample judge output:

```
Score: 5
The candidate answer directly addresses the question, correctly applies the given premises, and concisely states that a penguin would be able to fly. It is accurate, relevant, and clear.
```
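To aggregate judge scores across a test set, the numeric rating has to be parsed out of the free-text evaluation. A minimal sketch; the "Score: N" pattern and the 1-5 fallback are assumed conventions, not part of the original article.

```python
import re

def extract_score(judge_output):
    # Prefer an explicit "Score: N" pattern, then fall back to the
    # first standalone digit between 1 and 5
    match = re.search(r"Score:\s*([1-5])", judge_output)
    if match is None:
        match = re.search(r"\b([1-5])\b", judge_output)
    return int(match.group(1)) if match else None

print(extract_score("Score: 5\nThe candidate answer ..."))  # 5
```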
LLM judges scale well to large automated evaluations and combine flexibility with consistency, but the results depend on the judge model and the rubric, so a degree of subjectivity remains.

Method Comparison and Practical Recommendations

Each evaluation method has its own strengths and weaknesses; in practice, you should combine several of them to get a well-rounded picture of a model's capabilities. The table below summarizes the key characteristics of the four methods.
| Method | Strengths | Weaknesses |
|---|---|---|
| Multiple choice | Fast, standardized, reproducible | Only tests knowledge recall; does not reflect real-world ability |
| Verifier | Automated, evaluates reasoning, supports free-form generation | Limited to automatically verifiable domains; depends on external tools |
| Leaderboard | Reflects real user preferences; captures style and safety | Sensitive to the voter population; hard to measure correctness |
| LLM judge | Scalable, consistent, handles many task types | Depends on the judge model and rubric; somewhat subjective |
Summary

This article has walked through the four mainstream approaches to LLM evaluation, each accompanied by from-scratch code examples. Every method has its own use cases and limitations; in practice, combine multiple approaches and tailor the evaluation data and pipeline to your business goals. Only then can you measure a model's real capabilities, and its room for improvement, in a comprehensive and objective way.