generate-synthetic-data by hamelsmu/evals-skills
npx skills add https://github.com/hamelsmu/evals-skills --skill generate-synthetic-data
Generate diverse, realistic test inputs that cover the failure space of an LLM pipeline.
Before generating synthetic data, identify where the pipeline is likely to fail. Ask the user about known failure-prone areas, review existing user feedback, or form hypotheses from available traces. Dimensions (Step 1) must target anticipated failures, not arbitrary variation.
Dimensions are axes of variation specific to your application. Choose dimensions based on where you expect failures.
Dimension 1: [Name] — [What it captures]
Values: [value_a, value_b, value_c, ...]
Dimension 2: [Name] — [What it captures]
Values: [value_a, value_b, value_c, ...]
Dimension 3: [Name] — [What it captures]
Values: [value_a, value_b, value_c, ...]
Example for a real estate assistant:
Feature: what task the user wants
Values: [property search, scheduling, email drafting]
Client Persona: who the user serves
Values: [first-time buyer, investor, luxury buyer]
Scenario Type: query clarity
Values: [well-specified, ambiguous, out-of-scope]
Start with 3 dimensions. Add more only if initial traces reveal failure patterns along new axes.
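As an illustration, the three starter dimensions above can be written down as plain data and the full cross product enumerated. This is a minimal Python sketch using the real estate example's values; the `DIMENSIONS` name and the helper function are ours, not part of the skill:

```python
import itertools

# The three starter dimensions from the real estate assistant example.
DIMENSIONS = {
    "feature": ["property search", "scheduling", "email drafting"],
    "persona": ["first-time buyer", "investor", "luxury buyer"],
    "scenario": ["well-specified", "ambiguous", "out-of-scope"],
}

def full_tuple_space(dimensions):
    """Enumerate every combination of dimension values."""
    names = list(dimensions)
    return [dict(zip(names, combo))
            for combo in itertools.product(*dimensions.values())]

space = full_tuple_space(DIMENSIONS)
print(len(space))  # 3 * 3 * 3 = 27 combinations
```

With three dimensions of three values each, the whole space is only 27 tuples, small enough that the 20 drafts in the next step cover most of it.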
A tuple is one combination of dimension values defining a specific test case. Present 20 draft tuples to the user and iterate until they confirm the tuples reflect realistic scenarios. The user's domain knowledge is essential here — they know which combinations actually occur and which are unrealistic.
(Feature: Property Search, Persona: Investor, Scenario: Ambiguous)
(Feature: Scheduling, Persona: First-time Buyer, Scenario: Well-specified)
(Feature: Email Drafting, Persona: Luxury Buyer, Scenario: Out-of-scope)
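Drawing the 20 draft tuples can be done locally before involving an LLM at all, by sampling distinct combinations from the cross product. A sketch, again using the example's dimension values (function and seed are our choices):

```python
import itertools
import random

# Dimension table from the real estate assistant example.
DIMENSIONS = {
    "feature": ["property search", "scheduling", "email drafting"],
    "persona": ["first-time buyer", "investor", "luxury buyer"],
    "scenario": ["well-specified", "ambiguous", "out-of-scope"],
}

def draft_tuples(dimensions, n=20, seed=0):
    """Sample n distinct tuples from the full cross product for user review."""
    combos = list(itertools.product(*dimensions.values()))
    rng = random.Random(seed)  # fixed seed so drafts are reproducible
    return rng.sample(combos, min(n, len(combos)))

drafts = draft_tuples(DIMENSIONS)
```

Sampling without replacement guarantees no duplicate tuples, which the prompt-based approach below has to request explicitly.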
Generate 10 random combinations of ({dim1}, {dim2}, {dim3})
for a {your application description}.
The dimensions are:
{dim1}: {description}. Possible values: {values}
{dim2}: {description}. Possible values: {values}
{dim3}: {description}. Possible values: {values}
Output each tuple in the format: ({dim1}, {dim2}, {dim3})
Avoid duplicates. Vary values across dimensions.
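If you keep the dimensions in a data structure, the tuple-generation prompt above can be rendered from it instead of edited by hand. A sketch, assuming dimensions are stored as name → (description, values) pairs (the function name and storage shape are ours):

```python
def build_tuple_prompt(app_description, dimensions, n=10):
    """Render the tuple-generation prompt template from a dimensions table.

    dimensions maps dimension name -> (description, list of values).
    """
    names = ", ".join(dimensions)
    lines = [f"{name}: {desc}. Possible values: {', '.join(vals)}"
             for name, (desc, vals) in dimensions.items()]
    return (
        f"Generate {n} random combinations of ({names})\n"
        f"for a {app_description}.\n"
        "The dimensions are:\n"
        + "\n".join(lines) + "\n"
        f"Output each tuple in the format: ({names})\n"
        "Avoid duplicates. Vary values across dimensions."
    )

prompt = build_tuple_prompt(
    "real estate assistant",
    {
        "Feature": ("what task the user wants",
                    ["property search", "scheduling", "email drafting"]),
        "Client Persona": ("who the user serves",
                           ["first-time buyer", "investor", "luxury buyer"]),
        "Scenario Type": ("query clarity",
                          ["well-specified", "ambiguous", "out-of-scope"]),
    },
)
```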
Use a separate prompt for this step. Single-step generation (tuples + queries together) produces repetitive phrasing.
We are generating synthetic user queries for a {your application}.
{Brief description of what it does.}
Given:
{dim1}: {value}
{dim2}: {value}
{dim3}: {value}
Write a realistic query that a user might enter. The query should
reflect the specified persona and scenario characteristics.
Example: "{one of your hand-written examples}"
Now generate a new query.
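The second-step prompt can likewise be filled per tuple before each LLM call. A sketch; the app description and hand-written example query below are invented placeholders, not part of the skill:

```python
def build_query_prompt(app_name, app_description, tuple_values, example_query):
    """Render the query-generation prompt for one tuple.

    tuple_values maps dimension name -> the chosen value for this test case.
    """
    given = "\n".join(f"{k}: {v}" for k, v in tuple_values.items())
    return (
        f"We are generating synthetic user queries for a {app_name}.\n"
        f"{app_description}\n"
        "Given:\n"
        f"{given}\n"
        "Write a realistic query that a user might enter. The query should\n"
        "reflect the specified persona and scenario characteristics.\n"
        f'Example: "{example_query}"\n'
        "Now generate a new query."
    )

# Hypothetical application description and example query, for illustration only.
p = build_query_prompt(
    "real estate assistant",
    "Helps agents search listings, schedule showings, and draft emails.",
    {"Feature": "Scheduling", "Persona": "First-time Buyer",
     "Scenario": "Well-specified"},
    "Can you set up a viewing of 12 Oak St this Saturday at 10am?",
)
```

Calling this once per confirmed tuple, rather than generating all queries in one completion, is what keeps the phrasing varied.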
Review generated queries. Discard and regenerate any that are unrealistic for the specified persona and scenario.
Optional: use an LLM to rate realism on a 1-5 scale and discard queries scoring below 3.
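The optional realism gate can be sketched as a small filter. Here the 1-5 rater is passed in as a plain callable so the sketch stays self-contained; in practice `rate_fn` would wrap an LLM-as-judge prompt (both names are our assumptions):

```python
def filter_by_realism(queries, rate_fn, threshold=3):
    """Keep queries whose 1-5 realism rating meets the threshold.

    rate_fn is any callable mapping a query string to an integer rating;
    injecting it keeps the filter independent of a specific LLM client.
    """
    return [q for q in queries if rate_fn(q) >= threshold]

# Usage with a stand-in rater (an LLM judge would replace this dict lookup):
ratings = {"realistic query": 5, "nonsense": 2, "borderline": 3}
kept = filter_by_realism(list(ratings), ratings.get)
```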
Execute all queries through the full LLM pipeline. Capture complete traces: input, all intermediate steps, tool calls, retrieved docs, final output.
Target: ~100 high-quality, diverse traces. This is a rough heuristic for reaching saturation (where new traces stop revealing new failure categories). The number depends on system complexity.
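One way to structure the capture step is a record type whose fields mirror the list above (input, intermediate steps, tool calls, retrieved docs, final output), with the pipeline passed in as a callable. This is a sketch under those assumptions, not the skill's prescribed schema:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class Trace:
    """One complete trace of a pipeline run."""
    query: str
    steps: list = field(default_factory=list)          # intermediate steps
    tool_calls: list = field(default_factory=list)
    retrieved_docs: list = field(default_factory=list)
    output: str = ""

def capture_traces(queries, pipeline):
    """Run each query through the pipeline (a callable returning a Trace)
    and serialize the results as JSON lines for later review."""
    traces = [pipeline(q) for q in queries]
    jsonl = "\n".join(json.dumps(asdict(t)) for t in traces)
    return traces, jsonl
```

Persisting traces as JSON lines makes the ~100-trace target easy to review incrementally and to diff as the pipeline changes.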
When you have real queries available, don't sample randomly; use stratified sampling so each query category is represented.
When both real and synthetic data are available, use synthetic data to fill gaps in underrepresented query types.
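Stratified sampling over real queries can be sketched as grouping by a category key and drawing a fixed number per group; the grouping key (feature, persona, or any other dimension) is supplied by the caller. Function and parameter names here are our own:

```python
import random
from collections import defaultdict

def stratified_sample(queries, key_fn, per_stratum, seed=0):
    """Sample up to per_stratum queries from each stratum.

    key_fn maps a query to its stratum label (e.g. its feature or persona),
    so rare categories are represented instead of drowned out by common ones.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for q in queries:
        strata[key_fn(q)].append(q)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```

Strata with fewer real queries than `per_stratum` contribute everything they have; those are exactly the gaps to backfill with synthetic queries.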
Weekly Installs: 129
Repository: hamelsmu/evals-skills
GitHub Stars: 955
First Seen: Mar 3, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: codex (127), gemini-cli (126), kimi-cli (126), github-copilot (126), cursor (126), opencode (126)