Building an Agent from Scratch in Python

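Agents are engineering, not magic, and understanding what they are made of is what lets you put them to work.

Using minimal code examples, this article walks through calling a large language model API directly from Python and adding conversation memory, tool calling, and an execution loop step by step, so that you understand an agent's core principles and engineering boundaries.

What is the best way to start building agentic systems? There are countless frameworks for building agents, such as CrewAI, LangGraph, and the OpenAI Agents SDK, and choosing one can be overwhelming. Anthropic, on the other hand, recommends starting with direct LLM API calls to understand the fundamentals before relying on framework abstractions.

This article takes a bottom-up approach: without any agent framework, it calls a large language model API directly to build a minimal working agent. The goal is to internalize the core concepts, that is, the boundaries and interactions between model, memory, tools, and loop, as a foundation for adopting more complex agent frameworks or multi-agent setups later.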

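Implementing an agent from scratch

This section implements an Agent() class by integrating the following building blocks one at a time. They are the core components of an agent.

LLM and instructions: the LLM that drives the agent's reasoning and decisions, plus explicit instructions that define its behavior.
Memory: the conversation history (short-term memory) the agent uses to understand the current interaction.
Tools: external functions or APIs the agent can call.

Finally, we put these parts together and run them in a loop.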

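Component 1: Model and behavior instructions (Model & System Prompt)

Every agent's decision-making ability comes from a large language model (LLM). This article uses Google Gemini as the example and shows how to specify the model at initialization and pass in a system prompt that constrains the agent's style and problem-solving strategy.

The pseudocode below shows how to initialize a Gemini client and send a single chat request.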

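# Pseudocode: `google_gemini`, `GeminiClient`, and `client.chat` are placeholder
# names, not a real package. Adapt them to the Gemini SDK or HTTP API you use.
from google_gemini import GeminiClient
import os

client = GeminiClient(api_key=os.getenv("GOOGLE_API_KEY"))

system_prompt = (
    "You are an assistant that reasons step-by-step and prefers concise, checkable answers."
)

def call_model(prompt, context=None):
    """Send one chat request to Gemini (minimal example).

    `context` is the list of earlier messages; the new user prompt is appended to it.
    """
    resp = client.chat(
        model="gemini-pro",
        system=system_prompt,
        messages=(context or []) + [{"role": "user", "content": prompt}],
        max_output_tokens=1024,
        temperature=0.0,  # low randomness keeps the demo repeatable
    )
    return resp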

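The agent can now answer simple questions. Let's test it by calling call_model directly.

# Assumes resp.content holds the reply text (depends on your SDK)
response = call_model("I have 4 apples. How many do you have?")
print(response.content)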

I don't have any apples - as an AI, I don't have a physical form, so I can't possess physical objects like apples. Only you have apples in this scenario (4 of them). 

Is there something you'd like to do with this information, like a math problem involving your apples?

Good. Now let's send a second message.

response = call_model("I ate 1 apple. How many are left?")
print(response.content)

I don't have enough information to answer how many apples are left. To solve this, I would need to know:

**What I need:**
- How many apples you started with

**The calculation would be:**
Starting number of apples - 1 apple eaten = Apples remaining

Could you tell me how many apples you had before eating one?

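As you can see, the agent has lost the information from the first message. That is why the agent needs access to the conversation history.

Component 2: Conversation memory (short-term)

Short-term memory means passing the recent conversation to the model as context so that it stays consistent across turns. The most direct implementation keeps a list of messages (user/assistant) and sends the whole list with every model call.

Note: the longer the conversation, the more of the context window it consumes. Common remedies are windowing (keeping only the most recent N messages), rolling summaries of older turns, or storing long-term information in a vector database and retrieving it when needed.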
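As an illustration of the windowing option, here is a minimal sketch. The helper name trim_history and the max_messages parameter are hypothetical and not part of the article's agent; the sketch assumes messages are dicts with "role" and "content" keys, as in the examples here, and that the window should begin on a user turn.

```python
def trim_history(messages, max_messages=6):
    """Return at most the last `max_messages` entries, starting on a user turn."""
    if max_messages <= 0:
        return []
    window = messages[-max_messages:]
    # Drop leading non-user turns so the window starts with a user message
    while window and window[0]["role"] != "user":
        window = window[1:]
    return window

# Build a toy history of five question/answer pairs
history = []
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

recent = trim_history(history, max_messages=3)
print([m["content"] for m in recent])  # ['question 4', 'answer 4']
```

Counting messages is the simplest policy; a production agent would more likely budget by tokens, which depends on the tokenizer of the model in use.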

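Below is a simplified agent class that uses the call_model wrapper around Gemini as its internal call point:

class Agent:
    def __init__(self, client, system_prompt):
        self.client = client
        self.system_prompt = system_prompt
        self.messages = []  # short-term memory: alternating user/assistant turns

    def chat(self, user_text):
        # Call the model with the history so far as context.
        # call_model appends user_text itself, so it must not already be in the context.
        resp = call_model(user_text, context=self.messages)

        # Assumes resp.content is the reply as a plain string
        assistant_text = resp.content

        # Record both turns in the conversation history once the call has succeeded
        self.messages.append({"role": "user", "content": user_text})
        self.messages.append({"role": "assistant", "content": assistant_text})

        return assistant_text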

Now test the agent again with the earlier example.

agent = Agent(client, system_prompt)

# chat returns the reply text, so it can be printed directly
response = agent.chat("I have 4 apples. How many do you have?")
print(response)

response = agent.chat("I ate 1 apple. How many are left?")
print(response)

I don't have any apples - as an AI, I don't have a physical form and can't possess physical objects like apples. You have 4 apples, and I have 0 apples.

Is there something you'd like to do with your 4 apples, like a math problem or recipe suggestion?
Let me solve this step by step:

**Step 1:** Identify the starting amount
- You started with 4 apples

**Step 2:** Identify what was consumed
- You ate 1 apple

**Step 3:** Calculate the remaining amount
- Apples left = Starting amount - Apples eaten
- Apples left = 4 - 1 = 3

**Answer:** You have 3 apples left.

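As you can see, the agent can now hold a multi-turn conversation and refer back to earlier information.

But what happens if you ask the agent to do a harder calculation?

agent = Agent(client, system_prompt)

response = agent.chat("Please calculate 157.09 * 493.89.")

print(response)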

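I'll work through this with standard long multiplication:

157.09 × 493.89

Overview of the steps (illustrative):
- Multiply digit by digit and accumulate the partial products
- Add all the partial products to get the final result

Partial products (example):
    1,413.81
   12,567.2
   47,127
14,138,100
62,836,000
-----------
77,035,208.01

So the result is **157.09 × 493.89 = 77,035.2081** (note: this hand calculation only illustrates the steps; a tool or calculator is more reliable in practice).

The answer sounds plausible, but if you check it, you'll find that even a strong LLM such as Gemini can get arithmetic wrong without tools. Evaluating the expression directly gives a different value: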

157.09 * 493.89

77585.1801
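To confirm the reference value without any floating-point doubt, the product can be checked with exact decimal arithmetic. This is a small verification sketch using Python's standard decimal module; the variable names are illustrative only.

```python
from decimal import Decimal

# Exact decimal product of the two operands
exact = Decimal("157.09") * Decimal("493.89")

# The value reported in the unaided model reply
model_answer = Decimal("77035.2081")

print(exact)                  # 77585.1801
print(exact - model_answer)   # how far off the model's answer was
```

The model's answer is off by roughly 550, which is easy to miss when the reply is formatted as a confident step-by-step derivation.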

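Component 3: Tools (external capabilities)

An agent's power comes from turning the model's "suggestions" into executable actions, such as exact calculation, database queries, or calls to external APIs. Here a minimal calculator tool shows how to expose a tool to the model.

The key practices are:

Each tool implements the actual function or method (the behavior to execute).
Each tool also provides a schema (name, description, parameter structure) that is passed to the model so that it knows when and how to call the tool.

Below is a very simple calculator tool and a way to add tool information to the agent (pseudocode):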

class CalculatorTool:
    def get_schema(self):
        # input_schema is written as JSON Schema: one required string parameter
        return {
            "name": "calculator",
            "description": "Evaluate math expressions",
            "input_schema": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        }

    def execute(self, expression: str):
        # Warning: eval is used for learning purposes only; use a safe parser in production
        try:
            result = eval(expression)
            return {"result": result}
        except Exception as e:
            return {"error": str(e)}


class AgentWithTools(Agent):
    def __init__(self, client, system_prompt, tools):
        super().__init__(client, system_prompt)
        self.tools = tools
        self.tool_map = {t.get_schema()["name"]: t for t in tools}

    def chat(self, user_text):
        # Pseudocode: a tool-aware SDK call would also receive the tool schemas,
        # e.g. [t.get_schema() for t in self.tools]; call_model does not pass them yet.
        resp = call_model(user_text, context=self.messages)
        self.messages.append({"role": "user", "content": user_text})

        # Simplified: check whether the model asked for a tool (the shape depends on the SDK)
        if getattr(resp, "stop_reason", None) == "tool_use":
            # Assumes resp exposes tool_name and tool_input for the requested call
            tool = self.tool_map[resp.tool_name]
            tool_result = tool.execute(**resp.tool_input)
            # Record the tool request as the assistant turn so roles keep alternating
            self.messages.append(
                {"role": "assistant", "content": f"[tool call] {resp.tool_name}: {resp.tool_input}"}
            )
            # Feed the tool result back as the next user message; the recursive
            # call records it in the history, so it is not appended here
            return self.chat(str(tool_result))

        assistant_text = resp.content
        self.messages.append({"role": "assistant", "content": assistant_text})
        return assistant_text

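The example above shows a common pattern: the model makes the decisions (whether to use a tool, which one, and with what arguments), while the surrounding system executes the tool and returns the result to the model, closing the loop.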
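Because the calculator's execute method relies on eval, one possible replacement is a small evaluator that walks the parsed syntax tree and accepts only numbers and basic arithmetic. This is a sketch, not the article's implementation: safe_eval and the operator tables are hypothetical names, and exponentiation is deliberately left out because very large powers can exhaust CPU and memory.

```python
import ast
import operator

# Whitelisted operators; any other syntax is rejected
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def safe_eval(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        # Accept only int and float literals (bool and str are excluded)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return walk(ast.parse(expression, mode="eval"))

print(safe_eval("20 + 30"))        # 50
print(safe_eval("(50 - 32) * 2"))  # 36
try:
    safe_eval("__import__('os').getcwd()")
except ValueError as err:
    print("rejected:", err)
```

Inside CalculatorTool.execute, the call to eval(expression) could then be swapped for safe_eval(expression); the existing try/except already turns the ValueError into an error result for the model.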

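Component 4: Execution loop (Agent Loop)

Chaining model decisions with tool execution gives the agent's main loop: the model proposes an action, an executor runs it and returns the result, and the model decides again based on the new state. The loop runs until a termination condition is met, for example when the goal is reached or a step limit is hit.

Below is a more compact run_agent driver function (pseudocode) that shows how to handle tool calls and return tool results to the model: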

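import json

def run_agent(user_input, agent, max_turns=10):
    # Pseudocode: assumes agent.chat returns the raw model response, with
    # stop_reason and a list of content blocks, rather than plain text
    input_payload = user_input

    for i in range(1, max_turns + 1):
        print(f"Iteration {i}: input={input_payload}")

        resp = agent.chat(input_payload)

        # Any stop reason other than tool_use means the model gave its final reply
        if getattr(resp, "stop_reason", None) != "tool_use":
            return resp.content

        for block in resp.content:
            if getattr(block, "type", None) == "tool_use":
                tool = agent.tool_map[block.name]
                print(f"Using tool {block.name} with input {block.input}")
                result = tool.execute(**block.input)
                print(f"Tool result: {result}")
                # Send the tool result back as the next turn's input
                input_payload = [{
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(result),
                }]
                break  # handle one tool call per iteration

    # Step limit reached without a final answer
    return None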
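To see both termination conditions without any network access, the loop can be exercised against a scripted stand-in for the model. Everything in this sketch is hypothetical scaffolding (ScriptedModel, tool_call, final, run_loop, and the toy calculator); it only mirrors the shape of the loop described above and makes no claim about how a real SDK structures its responses.

```python
from types import SimpleNamespace

def calculator(expression: str) -> str:
    # Toy tool for the demo: only handles "a + b" and "a - b" with integers
    left, op, right = expression.split()
    value = int(left) + int(right) if op == "+" else int(left) - int(right)
    return str(value)

class ScriptedModel:
    """Stand-in for an LLM: replays a fixed list of responses in order."""
    def __init__(self, script):
        self.script = list(script)

    def respond(self, messages):
        return self.script.pop(0)

def tool_call(expression):
    return SimpleNamespace(stop_reason="tool_use", tool="calculator",
                           args={"expression": expression}, text=None)

def final(text):
    return SimpleNamespace(stop_reason="end_turn", tool=None, args=None, text=text)

def run_loop(model, tools, user_input, max_turns=5):
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_turns):
        resp = model.respond(messages)
        if resp.stop_reason != "tool_use":
            return resp.text, messages  # termination: final answer
        result = tools[resp.tool](**resp.args)
        # Record the tool request and its result so the next decision sees them
        messages.append({"role": "assistant", "content": f"call {resp.tool}({resp.args})"})
        messages.append({"role": "user", "content": f"tool result: {result}"})
    return None, messages  # termination: step limit reached

model = ScriptedModel([tool_call("20 + 30"), tool_call("50 - 32"),
                       final("Your brother is 18 years old.")])
answer, history = run_loop(model, {"calculator": calculator},
                           "I am 20. How old is my brother?")
print(answer)        # Your brother is 18 years old.
print(len(history))  # 5: the question plus two tool round trips
```

Lowering max_turns below the number of scripted tool calls makes run_loop return None, which is the step-limit path.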

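Testing the agent

The examples below exercise the implementation.

Test 1: General question (no tool needed)

This test shows that the agent can answer a simple general question that needs no external tool.

agent = AgentWithTools(client, system_prompt, tools=[CalculatorTool()])
response = run_agent("I have 4 apples. How many do you have?", agent)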

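Iteration 1:
User input: I have 4 apples. How many do you have?
Agent output: I don't have a physical form, so I can't own physical objects. But I can help you calculate or work with anything about your 4 apples.

What would you like to do with this information, such as counting something or getting recipe suggestions?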

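Test 2: Tool call

This test shows how the agent uses CalculatorTool to get the correct result when a task needs a tool.

agent = AgentWithTools(client, system_prompt, tools=[CalculatorTool()])
response = run_agent("Please calculate 157.09 * 493.89.", agent)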

Iteration 1:
User input: Please calculate 157.09 * 493.89.
Agent output: I'll calculate 157.09 * 493.89 for you.
Using tool calculator with input {'expression': '157.09 * 493.89'}
Tool result: {'result': 77585.1801}

Iteration 2:
User input: [{'type': 'tool_result', 'tool_use_id': 'toolu_01FC9yLWt2Cf6a8zLGhj7ZJz', 'content': '{"result": 77585.1801}'}]
Agent output: The result of 157.09 * 493.89 is **77,585.1801**.

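Test 3: Step-by-step tool calls

This test shows that the agent can break a more complex problem into smaller steps and call CalculatorTool several times in one session to reach the final answer.

agent = AgentWithTools(client, system_prompt, tools=[CalculatorTool()])
response = run_agent("If I am 20 years old, my mother is 30 years older than me, and my brother is 32 years younger than my mother, how old is my brother?", agent)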

Iteration 1:
User input: If I am 20 years old, my mother is 30 years older than me, and my brother is 32 years younger than my mother, how old is my brother?
Agent output: I'll work through this step by step.

Given:
- You: 20 years old
- Your mother is 30 years older than you
- Your brother is 32 years younger than your mother

First, your mother's age:
Using tool calculator with input {'expression': '20 + 30'}
Tool result: {'result': 50}

Iteration 2:
User input: [{'type': 'tool_result', 'tool_use_id': 'toolu_01WPMQRzCi4roua9vQ7qXeCR', 'content': '{"result": 50}'}]
Agent output: So your mother is 50 years old.

Now your brother's age:
Using tool calculator with input {'expression': '50 - 32'}
Tool result: {'result': 18}

Iteration 3:
User input: [{'type': 'tool_result', 'tool_use_id': 'toolu_01UL7n7a85XJUn7Tgk8kiHhX', 'content': '{"result": 18}'}]
Agent output: Your brother is 18 years old.

Summary:
- You: 20 years old
- Your mother: 50 years old (30 years older than you)
- Your brother: 18 years old (32 years younger than your mother)

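Summary

Breaking the agent down into the model (decision maker), short-term memory, tools (executors), and the main loop, and implementing each by hand, makes the responsibility of every layer clear. Some practical advice:

Early in development, call the API directly to understand the underlying mechanics (as with the Gemini calls in this article), and consider a framework later to reduce engineering cost.
Give tool execution a clear interface and safety boundary: don't use eval directly in production, and add timeouts and error handling to external calls.
As the conversation grows, use windowing, summaries, or retrieval-based memory to stay within the context limit.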
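As one way to apply the advice on timeouts and error handling, a tool call can be wrapped so that the agent always receives a structured result instead of an exception or a hang. This is a sketch under assumptions: execute_tool, divide, and slow are hypothetical names, and the thread-based timeout only stops waiting for the tool; it does not terminate the running call.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool for tool calls
_pool = ThreadPoolExecutor(max_workers=4)

def execute_tool(func, timeout_s=2.0, **kwargs):
    """Run a tool call, returning {"result": ...} or {"error": ...}."""
    future = _pool.submit(func, **kwargs)
    try:
        return {"result": future.result(timeout=timeout_s)}
    except FutureTimeout:
        # The worker thread keeps running; we only stop waiting for it
        return {"error": f"tool timed out after {timeout_s}s"}
    except Exception as exc:
        # Any exception raised by the tool becomes an error the model can read
        return {"error": f"{type(exc).__name__}: {exc}"}

def divide(a, b):
    return a / b

def slow():
    time.sleep(0.5)
    return "done"

print(execute_tool(divide, a=6, b=3))     # {'result': 2.0}
print(execute_tool(divide, a=1, b=0))     # {'error': 'ZeroDivisionError: division by zero'}
print(execute_tool(slow, timeout_s=0.1))  # {'error': 'tool timed out after 0.1s'}
```

Returning errors as data lets the model see what went wrong and choose a different action on the next turn, which fits the loop's pattern of feeding every tool outcome back to the model.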

References

Building an AI agent from scratch in Python - leoniemonigatti.com