generate-synthetic-data by hamelsmu/evals-skills
npx skills add https://github.com/hamelsmu/evals-skills --skill generate-synthetic-data
Generate diverse, realistic test inputs that cover the failure space of an LLM pipeline.
Before generating synthetic data, identify where the pipeline is likely to fail. Ask the user about known failure-prone areas, review existing user feedback, or form hypotheses from available traces. Dimensions (Step 1) must target anticipated failures, not arbitrary variation.
Dimensions are axes of variation specific to your application. Choose dimensions based on where you expect failures.
Dimension 1: [Name] — [What it captures]
Values: [value_a, value_b, value_c, ...]
Dimension 2: [Name] — [What it captures]
Values: [value_a, value_b, value_c, ...]
Dimension 3: [Name] — [What it captures]
Values: [value_a, value_b, value_c, ...]
Example for a real estate assistant:
Feature: what task the user wants
Values: [property search, scheduling, email drafting]
Client Persona: who the user serves
Values: [first-time buyer, investor, luxury buyer]
Scenario Type: query clarity
Values: [well-specified, ambiguous, out-of-scope]
Start with 3 dimensions. Add more only if initial traces reveal failure patterns along new axes.
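As an illustration, the three starter dimensions above can be written down as plain data and the full cross product enumerated. This is a minimal Python sketch using the real estate example's values; the `DIMENSIONS` name and the helper function are ours, not part of the skill:

```python
import itertools

# The three starter dimensions from the real estate assistant example.
DIMENSIONS = {
    "feature": ["property search", "scheduling", "email drafting"],
    "persona": ["first-time buyer", "investor", "luxury buyer"],
    "scenario": ["well-specified", "ambiguous", "out-of-scope"],
}

def full_tuple_space(dimensions):
    """Enumerate every combination of dimension values."""
    names = list(dimensions)
    return [dict(zip(names, combo))
            for combo in itertools.product(*dimensions.values())]

space = full_tuple_space(DIMENSIONS)
print(len(space))  # 3 * 3 * 3 = 27 combinations
```

With three dimensions of three values each, the whole space is only 27 tuples, small enough that the 20 drafts in the next step cover most of it.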
A tuple is one combination of dimension values defining a specific test case. Present 20 draft tuples to the user and iterate until they confirm the tuples reflect realistic scenarios. The user's domain knowledge is essential here — they know which combinations actually occur and which are unrealistic.
(Feature: Property Search, Persona: Investor, Scenario: Ambiguous)
(Feature: Scheduling, Persona: First-time Buyer, Scenario: Well-specified)
(Feature: Email Drafting, Persona: Luxury Buyer, Scenario: Out-of-scope)
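Drawing the 20 draft tuples can be done locally before involving an LLM at all, by sampling distinct combinations from the cross product. A sketch, again using the example's dimension values (function and seed are our choices):

```python
import itertools
import random

# Dimension table from the real estate assistant example.
DIMENSIONS = {
    "feature": ["property search", "scheduling", "email drafting"],
    "persona": ["first-time buyer", "investor", "luxury buyer"],
    "scenario": ["well-specified", "ambiguous", "out-of-scope"],
}

def draft_tuples(dimensions, n=20, seed=0):
    """Sample n distinct tuples from the full cross product for user review."""
    combos = list(itertools.product(*dimensions.values()))
    rng = random.Random(seed)  # fixed seed so drafts are reproducible
    return rng.sample(combos, min(n, len(combos)))

drafts = draft_tuples(DIMENSIONS)
```

Sampling without replacement guarantees no duplicate tuples, which the prompt-based approach below has to request explicitly.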
Generate 10 random combinations of ({dim1}, {dim2}, {dim3})
for a {your application description}.
The dimensions are:
{dim1}: {description}. Possible values: {values}
{dim2}: {description}. Possible values: {values}
{dim3}: {description}. Possible values: {values}
Output each tuple in the format: ({dim1}, {dim2}, {dim3})
Avoid duplicates. Vary values across dimensions.
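If you keep the dimensions in a data structure, the tuple-generation prompt above can be rendered from it instead of edited by hand. A sketch, assuming dimensions are stored as name → (description, values) pairs (the function name and storage shape are ours):

```python
def build_tuple_prompt(app_description, dimensions, n=10):
    """Render the tuple-generation prompt template from a dimensions table.

    dimensions maps dimension name -> (description, list of values).
    """
    names = ", ".join(dimensions)
    lines = [f"{name}: {desc}. Possible values: {', '.join(vals)}"
             for name, (desc, vals) in dimensions.items()]
    return (
        f"Generate {n} random combinations of ({names})\n"
        f"for a {app_description}.\n"
        "The dimensions are:\n"
        + "\n".join(lines) + "\n"
        f"Output each tuple in the format: ({names})\n"
        "Avoid duplicates. Vary values across dimensions."
    )

prompt = build_tuple_prompt(
    "real estate assistant",
    {
        "Feature": ("what task the user wants",
                    ["property search", "scheduling", "email drafting"]),
        "Client Persona": ("who the user serves",
                           ["first-time buyer", "investor", "luxury buyer"]),
        "Scenario Type": ("query clarity",
                          ["well-specified", "ambiguous", "out-of-scope"]),
    },
)
```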
Use a separate prompt for this step. Single-step generation (tuples + queries together) produces repetitive phrasing.
We are generating synthetic user queries for a {your application}.
{Brief description of what it does.}
Given:
{dim1}: {value}
{dim2}: {value}
{dim3}: {value}
Write a realistic query that a user might enter. The query should
reflect the specified persona and scenario characteristics.
Example: "{one of your hand-written examples}"
Now generate a new query.
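The second-step prompt can likewise be filled per tuple before each LLM call. A sketch; the app description and hand-written example query below are invented placeholders, not part of the skill:

```python
def build_query_prompt(app_name, app_description, tuple_values, example_query):
    """Render the query-generation prompt for one tuple.

    tuple_values maps dimension name -> the chosen value for this test case.
    """
    given = "\n".join(f"{k}: {v}" for k, v in tuple_values.items())
    return (
        f"We are generating synthetic user queries for a {app_name}.\n"
        f"{app_description}\n"
        "Given:\n"
        f"{given}\n"
        "Write a realistic query that a user might enter. The query should\n"
        "reflect the specified persona and scenario characteristics.\n"
        f'Example: "{example_query}"\n'
        "Now generate a new query."
    )

# Hypothetical application description and example query, for illustration only.
p = build_query_prompt(
    "real estate assistant",
    "Helps agents search listings, schedule showings, and draft emails.",
    {"Feature": "Scheduling", "Persona": "First-time Buyer",
     "Scenario": "Well-specified"},
    "Can you set up a viewing of 12 Oak St this Saturday at 10am?",
)
```

Calling this once per confirmed tuple, rather than generating all queries in one completion, is what keeps the phrasing varied.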
Review generated queries. Discard and regenerate any that are unrealistic for the specified persona and scenario.
Optional: use an LLM to rate realism on a 1-5 scale and discard queries scoring below 3.
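The optional realism gate can be sketched as a small filter. Here the 1-5 rater is passed in as a plain callable so the sketch stays self-contained; in practice `rate_fn` would wrap an LLM-as-judge prompt (both names are our assumptions):

```python
def filter_by_realism(queries, rate_fn, threshold=3):
    """Keep queries whose 1-5 realism rating meets the threshold.

    rate_fn is any callable mapping a query string to an integer rating;
    injecting it keeps the filter independent of a specific LLM client.
    """
    return [q for q in queries if rate_fn(q) >= threshold]

# Usage with a stand-in rater (an LLM judge would replace this dict lookup):
ratings = {"realistic query": 5, "nonsense": 2, "borderline": 3}
kept = filter_by_realism(list(ratings), ratings.get)
```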
Execute all queries through the full LLM pipeline. Capture complete traces: input, all intermediate steps, tool calls, retrieved docs, final output.
Target: ~100 high-quality, diverse traces. This is a rough heuristic for reaching saturation (where new traces stop revealing new failure categories). The number depends on system complexity.
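One way to structure the capture step is a record type whose fields mirror the list above (input, intermediate steps, tool calls, retrieved docs, final output), with the pipeline passed in as a callable. This is a sketch under those assumptions, not the skill's prescribed schema:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class Trace:
    """One complete trace of a pipeline run."""
    query: str
    steps: list = field(default_factory=list)          # intermediate steps
    tool_calls: list = field(default_factory=list)
    retrieved_docs: list = field(default_factory=list)
    output: str = ""

def capture_traces(queries, pipeline):
    """Run each query through the pipeline (a callable returning a Trace)
    and serialize the results as JSON lines for later review."""
    traces = [pipeline(q) for q in queries]
    jsonl = "\n".join(json.dumps(asdict(t)) for t in traces)
    return traces, jsonl
```

Persisting traces as JSON lines makes the ~100-trace target easy to review incrementally and to diff as the pipeline changes.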
When you have real queries available, don't sample randomly; use stratified sampling so each query category is represented.
When both real and synthetic data are available, use synthetic data to fill gaps in underrepresented query types.
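Stratified sampling over real queries can be sketched as grouping by a category key and drawing a fixed number per group; the grouping key (feature, persona, or any other dimension) is supplied by the caller. Function and parameter names here are our own:

```python
import random
from collections import defaultdict

def stratified_sample(queries, key_fn, per_stratum, seed=0):
    """Sample up to per_stratum queries from each stratum.

    key_fn maps a query to its stratum label (e.g. its feature or persona),
    so rare categories are represented instead of drowned out by common ones.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for q in queries:
        strata[key_fn(q)].append(q)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```

Strata with fewer real queries than `per_stratum` contribute everything they have; those are exactly the gaps to backfill with synthetic queries.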
Weekly Installs: 129
Repository: hamelsmu/evals-skills
GitHub Stars: 955
First Seen: Mar 3, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: codex (127), gemini-cli (126), kimi-cli (126), github-copilot (126), cursor (126), opencode (126)