Core Philosophy: Data-Driven Persona

Most persona chatbots rely on handwritten prompts: "You are a friendly assistant with a lively style..." This approach has a fundamental problem: how do you define "lively"?

Aime takes a different approach. It doesn't describe style; it measures it:

Traditional approach:
  manually describe the personality → handwrite a prompt → pray the AI understands

Aime:
  WeChat chat history → statistical analysis of 7 dimensions → quantified numbers encoded into the prompt
                      → select 20 representative conversations → few-shot examples

Specifically: instead of saying "you write short messages", say "your messages average 14.2 characters, with 44.4% under 10 characters". Instead of "you occasionally use emoji", say "18.2% of messages contain emoji; the most used are 😂😭🤮🙄😏".

Key insight: LLMs follow numbers far more faithfully than adjectives. "Average 14 characters" is far more precise than "brief", and much easier for the model to execute.

Training Pipeline: From WeChat to Prompt

The entire process involves zero model training: five scripts, run once.

Step 1: WeChat HTML export ──→ parse_wechat.py ──→ chat_pairs.jsonl
Step 2: chat_pairs.jsonl ──→ analyze_style.py ──→ style_analysis.json
Step 3: chat_pairs.jsonl ──→ build_vectordb.py ──→ vectors.npz (for RAG)
Step 4: style stats + Q&A questionnaire ──→ prompt_builder.py ──→ system prompt
Step 5: system prompt ──→ chat_engine.py ──→ deployment

Step 1: Parsing WeChat History

WeChat history is exported to HTML via WeChatExporter, and parsing that HTML presents several challenges.
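As a rough sketch of the pairing logic (the real export's HTML layout differs; the `msg` / `sender` / `content` class names and the `[图片]` media-placeholder format here are assumptions, not WeChatExporter's actual markup):

```python
import json
from bs4 import BeautifulSoup

def parse_export(html, my_name):
    """Pair each friend message with my next reply, yielding chat pairs."""
    soup = BeautifulSoup(html, "html.parser")
    pairs, pending = [], None
    for div in soup.select("div.msg"):
        sender = div.select_one(".sender").get_text(strip=True)
        text = div.select_one(".content").get_text(strip=True)
        if not text or text.startswith("["):  # skip media placeholders like [图片]
            continue
        if sender != my_name:
            pending = text        # most recent friend message becomes the prompt side
        elif pending is not None:
            pairs.append({"friend": pending, "me": text})
            pending = None
    return pairs

def save_jsonl(pairs, path):
    """Write one JSON object per line, matching the chat_pairs.jsonl shape."""
    with open(path, "w", encoding="utf-8") as f:
        for p in pairs:
            f.write(json.dumps(p, ensure_ascii=False) + "\n")
```

Consecutive friend messages collapse to the latest one before a reply, which is one simple way to handle bursty chats.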

Step 2: Statistical Style Analysis

This is the soul of the entire system, covered in detail in the next section.

Step 3: Building the Vector Database (Optional)

The Gemini Embedding API converts every conversation pair into a 768-dimensional vector, stored in an .npz file; at runtime, cosine similarity retrieves the closest matches. The Python version uses this; the Web version does not.
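The retrieval step can be sketched as follows; the array key `"vectors"` inside the .npz file is an assumption, not the script's actual schema:

```python
import numpy as np

def top_k_similar(query_vec, vectors, k=5):
    """Indices of the k stored vectors most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    m = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity against every row
    return np.argsort(sims)[::-1][:k] # highest similarity first

# Hypothetical usage: data = np.load("vectors.npz")
#                     idx = top_k_similar(embed(query), data["vectors"], k=5)
```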


Statistical Analysis: Quantifying Chat Style

Seven dimensions of statistics are extracted from 8,394 real messages:

Highlights: 14.2 avg length · 97% no punctuation · 18.2% emoji usage · 29.3% mixed language

Dimension | Analysis | Actual Data
Message length | Avg character count, short/mid/long ratio | Avg 14.2 chars, 44.4% < 10 chars
Punctuation | Per-mark frequency, no-punctuation rate | 97% no punctuation
Catchphrases | Top 15 frequent words with counts | 哈哈 (342), 嗯 (128), kk...
Emoji | Usage rate, top 10 WeChat/Unicode emoji | 18.2% contain emoji; favorites 😂😭🤮
Line breaks | Multi-line message ratio | Frequently splits into multiple short messages
Mixed language | Pure CN / pure EN / mixed ratio | 29.3% mixed Chinese-English
Style summary | AI-generated overall description | Short sentences, casual, laid-back

These numbers are encoded directly into the system prompt. When the AI sees "average 14.2 characters", it knows not to output long paragraphs; when it sees "97% no punctuation", it naturally omits periods.
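A sketch of the kind of statistics involved; the regexes, thresholds, and field names here are illustrative, not analyze_style.py's exact schema:

```python
import re

# Rough emoji range and end-of-message punctuation, for illustration only
EMOJI = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")
ENDS_PUNCT = re.compile(r"[。!?!?.~~]$")

def style_stats(messages):
    """Compute length, punctuation, and emoji ratios over a message list."""
    n = len(messages)
    lengths = [len(m) for m in messages]
    return {
        "avg_length": round(sum(lengths) / n, 1),
        "short_ratio": round(sum(l < 10 for l in lengths) / n, 3),
        "no_punct_ratio": round(sum(not ENDS_PUNCT.search(m) for m in messages) / n, 3),
        "emoji_ratio": round(sum(bool(EMOJI.search(m)) for m in messages) / n, 3),
    }
```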

Selecting Representative Examples

From thousands of conversations, 20 are selected as few-shot examples using a three-way split strategy:


1/3 Catchphrase Conversations

Messages containing frequent words (haha, hmm, kk), showcasing the most typical speaking habits.

1/3 Length-Balanced

A mix of short/mid/long messages, so AI knows how much to write in different scenarios.

1/3 Random Sampling

Fills remaining slots for diversity, preventing examples from skewing toward one topic.

Formatted Output

Each example includes context, formatted as: Friend(nickname): xxx → You: xxx, helping AI understand conversation rhythm.
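The three-way split above can be sketched like this; the catchword list and the length buckets are illustrative values, not the script's actual constants:

```python
import random

def pick_examples(pairs, n=20, catchwords=("哈哈", "嗯", "kk"), seed=42):
    """Pick ~1/3 catchphrase replies, ~1/3 length-balanced, rest random."""
    rng = random.Random(seed)
    chosen = []
    # 1/3: replies containing a catchphrase
    catch = [p for p in pairs if any(w in p["me"] for w in catchwords)]
    chosen += rng.sample(catch, min(n // 3, len(catch)))
    # 1/3: balance short / mid / long replies
    for lo, hi in [(0, 10), (10, 30), (30, 10**6)]:
        bucket = [p for p in pairs
                  if lo <= len(p["me"]) < hi and p not in chosen]
        chosen += rng.sample(bucket, min(n // 9 + 1, len(bucket)))
    # 1/3: random fill for topical diversity
    rest = [p for p in pairs if p not in chosen]
    chosen += rng.sample(rest, min(n - len(chosen), len(rest)))
    return chosen[:n]
```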


System Prompt Structure

The final system prompt is ~4,000 characters, composed of six modules:

[Identity] You are Chai Ning, not an AI assistant; you ARE Chai Ning chatting on WeChat
[Personal Info] Age (dynamically calculated), job, location, family, interests
[Personality] Humorous, self-deprecating, opinionated, social organizer, generous
[Speech Style] Stats: length, punctuation, catchphrases, emoji, CN-EN mixing
[Examples] 20 representative conversations (with context)
[Safety] Privacy protection, company info, legal compliance
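Assembly of these modules can be sketched as below; the section titles and parameter names are assumptions based on the structure above, not prompt_builder.py's actual code:

```python
def build_system_prompt(identity, profile, personality, style, examples, safety):
    """Concatenate the six modules into one system prompt string."""
    sections = [
        ("Identity", identity),
        ("Personal Info", profile),
        ("Personality", personality),
        ("Speech Style", style),
        ("Examples", "\n".join(examples)),
        ("Safety", safety),
    ]
    return "\n\n".join(f"【{title}】\n{body}" for title, body in sections)
```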

Dynamic Time Injection

The system prompt is regenerated on every request, injecting the current time and the computed age:

```javascript
// Recomputed on every API call
const now = new Date()
const year = now.toLocaleDateString("zh-CN", {
  timeZone: "America/Los_Angeles",
  year: "numeric",
})
const age = Number(year) - 1994

// Injected into the prompt
`【当前真实时间:${dateStr} ${timeStr} 太平洋时间。】`
```

This way, the AI never misstates its age or the current time.

Dual Mode Design

Casual Mode (Default)

Short messages, colloquial, with filler words. 5-20 chars per message, may split into 2-3 messages.

Knowledge Mode

Triggered by: technical questions, professional topics. Allows longer replies but maintains personal voice.


Runtime Architecture

From the moment a user sends a message to receiving a reply, the following pipeline executes:

User input → React frontend
  ↓
/api/chat → Vercel Function
  ↓
1. Dynamically generate the system prompt (with current time/age)
2. Format the last 20 history messages
3. If older messages exist, inject a system note: "N earlier messages"
  ↓
Gemini 2.0 Flash → maxOutputTokens: 128, temperature: 0.8
  ↓
4. Split the multi-line reply (by newline)
5. Filter [Emoji] placeholders
6. Cap at 3 messages max
  ↓
Frontend display → typing delay per message
  ↓
Redis → async save (non-blocking), 90-day TTL
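Steps 2-3 of this pipeline (the actual implementation lives in a TypeScript Vercel Function) can be sketched in Python; the wording of the system note is illustrative:

```python
def format_history(history, window=20):
    """Keep the last `window` messages; summarize older ones as a system note."""
    recent = history[-window:]
    dropped = len(history) - len(recent)
    msgs = []
    if dropped > 0:
        msgs.append({"role": "system", "content": f"({dropped} earlier messages omitted)"})
    msgs.extend({"role": m["role"], "content": m["content"]} for m in recent)
    return msgs
```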

Key Parameters

Parameter | Value | Reason
maxOutputTokens | 128 | Forces short replies, matching WeChat message length
temperature | 0.8 | Natural but not off the rails; casual chat needs randomness
Context window | 20 messages | Balances coherence against token cost
Max reply count | 3 | Simulates WeChat multi-message bursts without overdoing it
Redis TTL | 90 days | Enough for review without growing indefinitely

Multi-Bubble Replies: Simulating Real Chat Rhythm

Real people don't send one big paragraph on WeChat; they split it into 2-3 short messages with natural typing pauses in between. Aime fully simulates this behavior.

Backend: Splitting Replies

```javascript
// Split Gemini's output on newlines
const replies = raw
  .split("\n")
  .map(line => line.trim())
  .filter(line => line.length > 0)
  .slice(0, 3)

// Always return at least one message
const finalReplies = replies.length > 0 ? replies : ["嗯"]
```

Frontend: Simulating Typing Delay

```javascript
// Delay for each message is proportional to its length
for (let i = 0; i < replies.length; i++) {
  const delay = Math.min(400 + replies[i].length * 80, 2500)
  await new Promise(r => setTimeout(r, delay))
  addMessage(replies[i])
}

// "哈哈" (2 chars) → 560ms delay
// "我觉得这个方案可以" (9 chars) → 1120ms delay
// A long paragraph → capped at 2500ms
```

Combined with the frontend's typing indicator (three bouncing dots), the experience closely mimics a real person sending messages.


RAG Analysis: Is It Worth the Complexity?
RAG Analysis: Is It Worth the Complexity?

The Python version includes a complete RAG pipeline: it vectorizes 5,000+ historical conversations and, at runtime, retrieves the 5 most similar as reference. The Web version doesn't use it. Why?


Where RAG Helps

Niche topics. When users ask about specific shared memories, obscure inside jokes, or rare catchphrases, static prompts can't cover them — RAG can retrieve relevant conversations from history.

Where RAG Is Over-Engineering

Daily chat. Statistical style rules already handle 90% of common conversations. RAG adds ~100ms latency and infrastructure complexity for limited gain.

Conclusion: for persona chatbots, the prompt-only approach currently wins. A single API call takes ~800ms, requires zero additional infrastructure, and iterating means editing one file. RAG is kept as a fallback, to be enabled when "factual recall" capability is needed.

Review System: Closed-Loop Feedback

Prompt engineering is not a one-time task; you need to continuously observe behavior, spot issues, and iterate. Aime ships with a built-in Review page to close this loop.

Data Collection

Every interaction is automatically saved to Redis: user message, AI replies, context history, and session ID. The save is non-blocking, fire-and-forget.
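A fire-and-forget save could look like the sketch below; the key layout and record shape are assumptions, and any client exposing a `setex`-style call (e.g. redis-py, or a stub in tests) works:

```python
import json
import threading
import time

TTL_90_DAYS = 90 * 24 * 3600  # seconds

def save_interaction(client, session_id, user_msg, replies):
    """Persist one interaction in a background thread, off the reply path."""
    record = {"ts": time.time(), "user": user_msg, "replies": replies, "rating": None}
    key = f"chat:{session_id}:{int(record['ts'] * 1000)}"
    t = threading.Thread(
        target=client.setex,
        args=(key, TTL_90_DAYS, json.dumps(record, ensure_ascii=False)),
        daemon=True,
    )
    t.start()
    return t  # returned only so callers/tests can join if needed
```

The `setex` call bundles write and TTL in one command, so expired conversations age out of Redis without any cleanup job.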

Review Features


Dual View Mode

Session View: Grouped by session, see full conversation flow. Flat View: One-by-one, ideal for quick scanning.

Rating System

Each interaction can be marked Good / Bad. Used to track quality changes before and after prompt adjustments.

Filters

All / Unrated / Good / Bad with real-time counts. Quickly locate problematic replies.

JSON Export

One-click export of all conversation data. Can be used for further analysis or as a fine-tuning training set.
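The live counts behind the four filter tabs reduce to a simple aggregation; the record shape (a `rating` field holding `None`, `"good"`, or `"bad"`) is an assumption about the stored data:

```python
def filter_counts(records):
    """Counts for the All / Unrated / Good / Bad filter tabs."""
    counts = {"all": len(records), "unrated": 0, "good": 0, "bad": 0}
    for r in records:
        rating = r.get("rating")
        if rating in ("good", "bad"):
            counts[rating] += 1
        else:
            counts["unrated"] += 1
    return counts
```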

Iteration Cycle

1. Deploy the new prompt → users chat
2. Inspect conversation quality on the Review page
3. Spot issues (too long / too stiff / style drift)
4. Adjust the prompt (change rules / swap examples / tune parameters)
5. Redeploy → back to step 1

Each iteration: edit one file, deploy once with vercel --prod

Safety Boundaries

Persona chatbots carry a unique risk: they impersonate a real person. If the AI fabricates personal information or leaks sensitive data, the consequences are far more serious than for a generic chatbot.

The system prompt contains an explicit "most important" rules section:

```
// System prompt safety boundaries (excerpt, translated)

[Absolutely forbidden]
- Never leak private data: phone number, address, ID number, passwords, bank cards, employee ID
- Never reveal company secrets: TikTok architecture, models, internal tools
- Never assist illegal activity: phishing emails, scams, hacking
- Married: no flirting, no playing at ambiguity
- Never encourage self-harm or suicide; direct the person to professional help
- Never fabricate personal information: only state the basics provided above
```
Design principle: whitelist over blacklist. Not "don't say X", but "you may only say these things". The AI only knows the personal information supplied in the prompt; anything beyond that is deflected with humor: "I'd rather not say, haha".

Conclusion

Aime's core formula:

Authentic persona = quantified speaking style + curated conversation examples + strict safety boundaries + continuous review iteration

No fine-tuning needed. No training data. No GPU.

A BeautifulSoup parsing script, a statistical analysis script, and a well-crafted system prompt: three components are all it takes to teach an AI a real person's chat style, precise down to average message length and emoji frequency.

If you want to build a persona chatbot, don't start with model fine-tuning. Start with data: quantify your speaking habits first, then have the AI execute those numbers.