
Model Evaluation

There is no silver bullet for LLM evaluation, but the method you choose determines what you can see. This article breaks down the mainstream evaluation paradigms and explains the logic and limitations behind each.

Understanding the Four Mainstream Approaches to LLM Evaluation

How do you rigorously evaluate a large language model (LLM) in practice? The question sounds simple but runs deep. Whether you are selecting a model, interpreting results, or tracking the progress of fine-tuned or in-house models, the choice of evaluation method matters enormously.

This section walks through four common approaches to LLM evaluation: multiple choice, verifiers, leaderboards, and LLM judges. Understanding how these methods work will help you interpret benchmarks, leaderboards, and the numbers reported in papers.

Overview of Evaluation Methods

Mainstream LLM evaluation methods fall into two broad categories: benchmark-based and judgment-based. The four most common methods are:

  • Multiple choice
  • Verifier
  • Leaderboard
  • LLM judge

The figure below shows how these four evaluation approaches relate to one another and where each belongs.

Figure 1: Overview of the four mainstream LLM evaluation methods

Method 1: Multiple-Choice Accuracy

Multiple-choice benchmarks such as MMLU are the most common form of benchmark-based evaluation and primarily test a model's ability to recall knowledge. MMLU (Massive Multitask Language Understanding), for example, covers 57 subjects with roughly 16,000 multiple-choice questions, and the evaluation metric is accuracy.

The figure below illustrates the basic workflow of multiple-choice evaluation.

Figure 2: Workflow of MMLU multiple-choice evaluation

Code example: loading the model for evaluation

The code below shows how to load the Qwen3 0.6B model for multiple-choice evaluation.

from pathlib import Path
import torch
from reasoning_from_scratch.ch02 import get_device
from reasoning_from_scratch.qwen3 import (
    download_qwen3_small, Qwen3Tokenizer,
    Qwen3Model, QWEN_CONFIG_06_B
)

device = get_device()

# Set matmul precision to "high" to 
# enable Tensor Cores on compatible GPUs
torch.set_float32_matmul_precision("high")

# Uncomment the following line 
# if you encounter device compatibility issues
# device = "cpu"

# Use the base model by default
WHICH_MODEL = "base"

if WHICH_MODEL == "base":
    download_qwen3_small(
        kind="base", tokenizer_only=False, out_dir="qwen3"
    )
    tokenizer_path = Path("qwen3") / "tokenizer-base.json"
    model_path = Path("qwen3") / "qwen3-0.6B-base.pth"
    tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)

elif WHICH_MODEL == "reasoning":
    download_qwen3_small(
        kind="reasoning", tokenizer_only=False, out_dir="qwen3"
    )
    tokenizer_path = Path("qwen3") / "tokenizer-reasoning.json"
    model_path = Path("qwen3") / "qwen3-0.6B-reasoning.pth"
    tokenizer = Qwen3Tokenizer(
        tokenizer_file_path=tokenizer_path,
        apply_chat_template=True,
        add_generation_prompt=True,
        add_thinking=True,
    )

else:
    raise ValueError(f"Invalid choice: WHICH_MODEL={WHICH_MODEL}")

model = Qwen3Model(QWEN_CONFIG_06_B)
model.load_state_dict(torch.load(model_path))
model.to(device)

# Optionally enable model compilation for potential performance gains
USE_COMPILE = False
if USE_COMPILE:
    torch._dynamo.config.allow_unspec_int_on_nn_module = True
    model = torch.compile(model)

Format the multiple-choice prompt:

def format_prompt(example):
    return (
        f"{example['question']}\n"
        f"A. {example['choices'][0]}\n"
        f"B. {example['choices'][1]}\n"
        f"C. {example['choices'][2]}\n"
        f"D. {example['choices'][3]}\n"
        "Answer: "
    )
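
To make the format concrete, here is what the prompt looks like for a hypothetical MMLU-style item (the question, choices, and answer index below are made up for illustration; they are not taken from the real benchmark):

# Hypothetical MMLU-style item (invented for illustration)
example = {
    "question": "Which gas makes up most of Earth's atmosphere?",
    "choices": ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"],
    "answer": 1,  # index of the correct choice (B)
}

print(format_prompt(example))
# Which gas makes up most of Earth's atmosphere?
# A. Oxygen
# B. Nitrogen
# C. Carbon dioxide
# D. Argon
# Answer: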

Predict the answer letter and compare it with the ground truth:

# Assumption: generate_text_basic_stream_cache (token streaming with a KV cache)
# is importable from reasoning_from_scratch.ch02 alongside get_device
from reasoning_from_scratch.ch02 import generate_text_basic_stream_cache

def predict_choice(model, tokenizer, prompt_fmt, max_new_tokens=8):
    # prompt_fmt: the tokenized prompt as a (1, seq_len) tensor of token ids
    pred = None
    for t in generate_text_basic_stream_cache(
        model=model,
        token_ids=prompt_fmt,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id,
    ):
        answer = tokenizer.decode(t.squeeze(0).tolist())
        for letter in answer:
            letter = letter.upper()
            # stop as soon as a letter appears
            if letter in "ABCD":
                pred = letter
                break
        if pred:
            break
    return pred
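
Putting the pieces together, a single item can be scored roughly as follows. This is a sketch rather than the full evaluation loop, and it assumes that tokenizer.encode returns a plain list of token ids; it reuses the hypothetical example item from above, so the printed result depends on the item and the model:

# Tokenize the formatted prompt and add a batch dimension
prompt_fmt = torch.tensor(
    tokenizer.encode(format_prompt(example)), device=device
).unsqueeze(0)

pred = predict_choice(model, tokenizer, prompt_fmt)
print("Generated letter:", pred)
print("Correct?", pred == "ABCD"[example["answer"]])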

An actual run produced the following output:

Generated letter: C
Correct? False

Multiple-choice evaluation is simple and intuitive and well suited to fast, large-scale comparisons, but it only measures knowledge recall and says little about reasoning ability or real-world performance.

Method 2: Automatic Scoring with Verifiers

With verifiers, the model generates answers freely, and an external tool (such as a code interpreter or calculator) automatically checks whether the answer is correct. This works well for domains like math and code where correctness can be verified automatically.

The figure below sketches the verifier-based scoring workflow.

Figure 3: Verifier-based automatic scoring workflow

Because questions can be generated and checked automatically at scale, this approach is well suited to evaluating reasoning ability, but it only applies to automatically verifiable domains and depends on the accuracy of the external tool. A minimal sketch of such a verifier follows.
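
As an illustration only (not code from this article's accompanying repository), a bare-bones numeric verifier might extract the final number from a free-form answer and compare it with the reference:

import re

def extract_final_number(text):
    # Take the last integer or decimal that appears in the answer text
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None

def verify_answer(model_answer, reference, tol=1e-6):
    # Score as correct if the extracted number matches the reference value
    pred = extract_final_number(model_answer)
    return pred is not None and abs(pred - float(reference)) <= tol

print(verify_answer("The result is 12 * 4 = 48.", "48"))  # True
print(verify_answer("I think the answer is 50.", "48"))   # False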

Method 3: Leaderboards and Preference Voting

Leaderboard methods aggregate preference votes from users (or LLMs) over model outputs to measure how well each model is liked. LM Arena is the classic example: users compare the outputs of two models, vote for the better one, and the votes are aggregated into a ranking.

The figure below illustrates the leaderboard voting workflow.

Figure 4: LM Arena voting interface (illustration)
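
The ranking in the code example below uses the Elo rating system. For a match between models A and B, the expected score of A and the post-match update are:

E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A' = R_A + K \, (S_A - E_A)

where S_A is 1 if A wins and 0 if A loses, and K controls how strongly each match shifts the ratings. The function below implements exactly this update for a list of (winner, loser) vote pairs.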

Code example: an Elo rating implementation

def elo_ratings(vote_pairs, k_factor=32, initial_rating=1000):
    # Initialize all models with the same base rating
    ratings = {
        model: initial_rating
        for pair in vote_pairs
        for model in pair
    }

    # Update ratings after each match
    for winner, loser in vote_pairs:

        # Expected score for the current winner
        expected_winner = 1.0 / (
            1.0 + 10 ** (
                (ratings[loser] - ratings[winner])
                / 400.0
            )
        )

        # k_factor determines sensitivity of updates
        ratings[winner] = (
            ratings[winner]
            + k_factor * (1 - expected_winner)
        )
        ratings[loser] = (
            ratings[loser]
            + k_factor * (0 - (1 - expected_winner))
        )

    return ratings
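
To make the update concrete, here is a minimal usage sketch. The (winner, loser) pairs below are invented for illustration and will not reproduce the example output shown after the code:

# Hypothetical head-to-head votes: (winner, loser)
vote_pairs = [
    ("GPT-5", "Llama-4"),
    ("GPT-5", "Claude-3"),
    ("Claude-3", "Llama-3"),
    ("Llama-4", "Llama-3"),
]

ratings = elo_ratings(vote_pairs)
for model_name, rating in sorted(
    ratings.items(), key=lambda item: item[1], reverse=True
):
    print(f"{model_name} : {rating:.1f}")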

Example output:

GPT-5 : 1043.7
Claude-3 : 1015.2
Llama-4 : 1000.7
Llama-3 : 940.4

Leaderboards reflect how much users like a model's outputs in realistic settings, but the results depend heavily on the voter population and their preferences, and they say little about whether answers are actually correct.

Method 4: LLM Judges (AI Graders)

The LLM-judge approach uses a strong LLM (such as GPT-5) as a grader that scores model outputs automatically against a rubric, combining scalability with consistency.

The figure below illustrates the LLM-judge workflow.

Figure 5: LLM-judge workflow

Code example: automatic scoring via the Ollama API

import json
import urllib.request

def query_model(
    prompt,
    model="gpt-oss:20b",
    # If you used 
    # OLLAMA_HOST=127.0.0.1:11435 ollama serve
    # update the address below
    url="http://localhost:11434/api/chat"
):
    # Create the data payload as a dictionary:
    data = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        # Settings required for deterministic responses:
        "options": {
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048
        }
    }

    # Convert the dictionary to JSON and encode it to bytes
    payload = json.dumps(data).encode("utf-8")

    # Create a POST request and add headers
    request = urllib.request.Request(  
        url,
        data=payload,
        method="POST"
    )
    request.add_header("Content-Type", "application/json")

    response_data = ""

    # Send the request and capture the streaming response
    with urllib.request.urlopen(request) as response:
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            # Parse each line into JSON
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]

    return response_data

def rubric_prompt(instruction, reference_answer, model_answer):
    rubric = (
        "You are a fair judge assistant. You will be "
        "given an instruction, a reference answer, and "
        "a candidate answer to evaluate, according to "
        "the following rubric:\n\n"
        "1: The response fails to address the "
        "instruction, providing irrelevant, incorrect, "
        "or excessively verbose content.\n"
        "2: The response partially addresses the "
        "instruction but contains major errors, "
        "omissions, or irrelevant details.\n"
        "3: The response addresses the instruction to "
        "some degree but is incomplete, partially "
        "correct, or unclear in places.\n"
        "4: The response mostly adheres to the "
        "instruction, with only minor errors, "
        "omissions, or lack of clarity.\n"
        "5: The response fully adheres to the "
        "instruction, providing a clear, accurate, and "
        "relevant answer in a concise and efficient "
        "manner.\n\n"
        "Now here is the instruction, the reference "
        "answer, and the response.\n"
    )

    prompt = (
        f"{rubric}\n"
        f"Instruction:\n{instruction}\n\n"
        f"Reference Answer:\n{reference_answer}\n\n"
        f"Answer:\n{model_answer}\n\n"
        f"Evaluation: "
    )
    return prompt
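
Tying the two functions together, a single candidate answer can be graded as shown below. The instruction, reference answer, and candidate answer are placeholder text written in the spirit of the sample output that follows; the exact item behind that output is not shown in this article:

# Placeholder inputs for illustration only
instruction = (
    "Premise: all birds can fly, and a penguin is a bird. "
    "Based only on these premises, can a penguin fly?"
)
reference_answer = (
    "Yes. Taking the premises literally, a penguin would be able to fly."
)
model_answer = (
    "Yes, given the stated premises, a penguin would be able to fly."
)

judge_prompt = rubric_prompt(instruction, reference_answer, model_answer)
print(query_model(judge_prompt))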

Example grading result:

Score: 5

The candidate answer directly addresses the question, correctly applies the given premises, and concisely states that a penguin would be able to fly. It is accurate, relevant, and clear.

LLM judges are well suited to large-scale automated evaluation, offering both flexibility and consistency, but the results depend on the judge model and the rubric, so a degree of subjectivity remains.

Comparing the Methods and When to Use Them

Each evaluation method has its own strengths and weaknesses; in practice you should combine several of them to get a well-rounded picture of a model's capabilities. The table below summarizes the key characteristics of the four methods.

Method | Strengths | Weaknesses
Multiple choice | Fast, standardized, reproducible | Only tests knowledge recall; does not reflect real-world capability
Verifier | Automated; can evaluate reasoning; supports free-form generation | Limited to verifiable domains; depends on external tools
Leaderboard | Reflects real user preferences; captures style and safety | Shaped by the voter population; hard to measure correctness
LLM judge | Scalable; consistent; supports many task types | Depends on the judge model and rubric; somewhat subjective

Table 1: Comparison of mainstream LLM evaluation methods

Summary

This article has walked through the four mainstream approaches to LLM evaluation, each accompanied by from-scratch code examples. Every method has its own use cases and limitations; in practice, combine several approaches and tailor your evaluation data and pipeline to your business goals. Only then can you measure a model's true capabilities, and the room left for improvement, in a comprehensive and objective way.
