experiment-pipeline by evoscientist/evoskills
npx skills add https://github.com/evoscientist/evoskills --skill experiment-pipeline
A structured 4-stage framework for executing research experiments from initial implementation through ablation study, with attempt budgets and gate conditions that prevent wasted effort. This follows the Experiment Tree Search design from the EvoScientist paper, where the engineer agent iteratively generates executable code, runs experiments, and records structured execution results at each stage.
Experiments fail for two reasons: wrong order and no stopping criteria. Most researchers jump straight to testing their novel method without verifying their baseline setup, then wonder why results don't make sense. Others spend weeks tuning hyperparameters without a budget, hoping the next run will work.
The 4-stage pipeline solves both problems. It enforces a strict order (each stage validates assumptions the next stage depends on) and assigns attempt budgets (forcing systematic thinking over brute-force iteration).
If coming from idea-tournament, your research proposal (Phase 4) provides the experiment plan — datasets, baselines, metrics, and ablation design — that maps directly to Stages 1-4 below.
Before entering the pipeline, load Experimentation Memory (M_E) from prior cycles:
→ Read M_E at /memory/experiment-memory.md

Each stage follows a generate → execute → record → diagnose → revise loop:
| Stage | Goal | Budget (N_E^s) | Gate Condition |
|---|---|---|---|
| 1. Initial Implementation | Get baseline code running and reproduce known results | ≤20 attempts | Metrics within 2% of reported values (or within reported variance) |
| 2. Hyperparameter Tuning | Optimize config for your setup | ≤12 attempts | Stable config, variance < 5% across 3 runs |
| 3. Proposed Method | Implement & validate novel method | ≤12 attempts | Outperforms tuned baseline on primary metric, consistent across 3 runs |
| 4. Ablation Study | Prove each component's contribution | ≤18 attempts | All claims evidenced with controlled experiments |
Each stage saves artifacts to /experiments/stageN_name/.
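The table above can be encoded as a simple tracker that enforces the budgets mechanically. A minimal sketch, assuming nothing beyond the table itself — the `Stage` class and its field names are illustrative, not part of the skill:

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One pipeline stage with its attempt budget (N_E^s) and gate condition."""
    name: str
    budget: int        # maximum attempts allowed in this stage
    gate: str          # human-readable gate condition
    attempts: int = 0
    passed: bool = False

    def record_attempt(self) -> None:
        """Count one attempt; refuse to exceed the budget."""
        if self.attempts >= self.budget:
            raise RuntimeError(
                f"{self.name}: budget exhausted; escalate to evo-memory IVE"
            )
        self.attempts += 1


# Stage names, budgets, and gates taken directly from the table above.
PIPELINE = [
    Stage("initial_implementation", 20, "metrics within 2% of reported values"),
    Stage("hyperparameter_tuning", 12, "variance < 5% across 3 runs"),
    Stage("proposed_method", 12, "outperforms tuned baseline across 3 runs"),
    Stage("ablation_study", 18, "all claims backed by controlled experiments"),
]
```

Making budget exhaustion an exception, rather than a silent counter, matches the skill's rule that running out of attempts is a hand-off point, not a license to keep iterating.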
Within every stage, repeat this cycle for each attempt:
If an attempt needs diagnosis, load experiment-craft for the 5-step diagnostic flow.

Goal : Find or generate executable baseline code and verify it reproduces published results. This stage corresponds to the paper's "initial implementation" — the engineer agent searches for working code, runs it, and records structured execution results.
Why this matters : If you can't get the baseline running and reproducing known results, every subsequent comparison is meaningless. Initial implementation validates your data pipeline, evaluation code, training infrastructure, and understanding of prior work.
Budget : ≤20 attempts (N_E^1=20). Baselines can be tricky — missing details in papers, version mismatches, unreported preprocessing steps. 20 attempts gives enough room to debug without allowing infinite tinkering.
Gate : Primary metrics within 2% of reported values (or within the reported variance if provided).
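This gate can be checked mechanically. A minimal sketch, assuming "within reported variance" means within one reported standard deviation — the function name and that interpretation are ours, not the skill's:

```python
from typing import Optional


def passes_stage1_gate(measured: float, reported: float,
                       reported_std: Optional[float] = None) -> bool:
    """Stage 1 gate: measured metric within 2% of the reported value,
    or within the reported variance (one std dev) when the paper gives it."""
    if reported_std is not None:
        return abs(measured - reported) <= reported_std
    return abs(measured - reported) <= 0.02 * abs(reported)
```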
Process :
When to load experiment-craft : If attempts 1-5 all fail significantly (>10% gap), switch to the 5-step diagnostic flow to isolate the cause before burning more attempts.
Output : /experiments/stage1_baseline/ containing results, config, and verified baseline code.
See references/stage-protocols.md for detailed initial implementation checklists.
Goal : Find the optimal hyperparameter configuration for YOUR specific setup.
Why this matters : Published hyperparameters are tuned for the authors' setup. Your hardware, data version, framework version, or subtle implementation differences mean their config may not be optimal for you. Tuning now prevents confounding your novel method's results with suboptimal baselines.
Budget : ≤12 attempts. Hyperparameter tuning has diminishing returns. If 12 structured attempts don't find a stable config, the problem is likely deeper than hyperparameters.
Gate : Stable configuration found — variance < 5% across 3 independent runs with different random seeds.
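A minimal check for this gate, interpreting "<5% variance" as relative standard deviation against the mean — one reasonable reading; the skill's references may define the threshold differently:

```python
import statistics


def is_stable(run_metrics: list, threshold: float = 0.05) -> bool:
    """Stage 2 gate: >= 3 seeded runs whose relative standard
    deviation (stdev / mean) falls below the threshold."""
    if len(run_metrics) < 3:
        return False
    mean = statistics.mean(run_metrics)
    if mean == 0:
        return False
    return statistics.stdev(run_metrics) / abs(mean) < threshold
```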
Process :
Priority order for tuning : Learning rate → batch size → loss weights → regularization → architecture-specific params. This order reflects typical sensitivity.
When to load experiment-craft : If results are highly unstable (variance > 20%) across runs, there's likely a training instability issue. Use the diagnostic flow.
Output : /experiments/stage2_tuning/ containing tuning logs, final config, and stability verification.
See references/attempt-budget-guide.md for budget rationale and adjustment rules.
Goal : Implement and validate your novel method, demonstrating improvement over the tuned baseline.
Why this matters : This is the core contribution. But because you've verified the baseline (Stage 1) and optimized the config (Stage 2), any improvement you see is genuinely attributable to your method — not to a better-tuned setup or a broken baseline.
Budget : ≤12 attempts. Your method should work within a reasonable number of iterations if the underlying idea is sound. Excessive attempts suggest a fundamental problem, not a tuning issue.
Gate : Outperforms the tuned baseline on the primary metric. The improvement should be consistent across at least 3 runs.
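"Consistent across at least 3 runs" can be read strictly as a paired comparison: the method wins in every run. A sketch under that assumption (the function name is illustrative):

```python
def beats_baseline(method_runs: list, baseline_runs: list) -> bool:
    """Stage 3 gate: method beats the tuned baseline on the primary
    metric in every one of >= 3 paired runs (higher is better)."""
    if len(method_runs) < 3 or len(method_runs) != len(baseline_runs):
        return False
    return all(m > b for m, b in zip(method_runs, baseline_runs))
```

A looser reading (e.g. wins on the mean with non-overlapping variance) is also defensible; the strict all-runs version errs on the side of the reviewer.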
Process :
Integration strategy : Add your method's components one at a time to the working baseline. Each added component should stay within 20% of the baseline's performance — if a single component causes a >20% regression, isolate and debug it before proceeding. Never integrate the full method in one shot.
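The integration strategy above can be sketched as a loop with the 20% regression guard. Both hooks are hypothetical: `components` is your ordered component list and `evaluate(active)` runs the pipeline with those components enabled and returns the primary metric (higher is better):

```python
def integrate_incrementally(baseline_score: float, components, evaluate):
    """Add components one at a time to the working baseline; stop and
    flag the first component that causes a > 20% regression relative
    to the baseline score, so it can be debugged in isolation."""
    active = []
    for comp in components:
        score = evaluate(active + [comp])
        if score < 0.8 * baseline_score:   # > 20% regression
            return comp, active            # debug `comp` before proceeding
        active.append(comp)
    return None, active                     # full method integrated
```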
When to load experiment-craft : When your method underperforms the baseline despite correct implementation. The 5-step diagnostic flow will help distinguish between implementation bugs and fundamental issues.
Critical decision — failure classification : If the method underperforms the baseline after exhausting the attempt budget, hand off to evo-memory for IVE (Idea Validation Evolution) — this is evo-memory's job, not this skill's. IVE triggers under two conditions:
The evo-memory skill will classify the failure as:
Output : /experiments/stage3_method/ containing method code, results, comparison with baseline.
Goal : Prove that each component of your method contributes meaningfully to the final result.
Why this matters : Reviewers will ask "is component X really necessary?" for every part of your method. Without ablation, you can't answer. More importantly, ablation helps YOU understand why your method works — sometimes components you thought were important aren't, and vice versa.
Budget : ≤18 attempts. Ablation requires multiple controlled experiments — one per component being ablated, plus interaction effects. 18 attempts covers a method with 4-5 components.
Gate : Every claimed contribution is supported by a controlled experiment showing its effect.
Process :
Three ablation designs :
When to load experiment-craft : If ablation results contradict your hypothesis (removing a component improves results), use the diagnostic flow to understand why.
Output : /experiments/stage4_ablation/ containing ablation results table, per-component analysis.
See references/stage-protocols.md for detailed ablation design patterns.
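One common design, leave-one-out ablation, can be sketched as follows. The `evaluate(enabled)` hook is hypothetical: it runs the method with the given set of components enabled and returns the primary metric:

```python
def leave_one_out_ablation(components, evaluate):
    """Run the full method, then re-run with each component removed.
    Returns each component's contribution: full score minus the
    score with that component ablated."""
    full = evaluate(set(components))
    return {
        comp: full - evaluate(set(components) - {comp})
        for comp in components
    }
```

A near-zero (or negative) contribution is exactly the signal to load experiment-craft, per the trigger above.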
When a stage attempt fails, refer to the experiment-craft skill for structured diagnosis:
Trigger points : After any failed attempt in any stage. Especially important:
Every attempt across all stages should be logged in a structured format that captures not just WHAT you did but WHY and WHAT YOU LEARNED. These logs feed into evo-memory's Experiment Strategy Evolution (ESE) mechanism.
For each attempt, record:
See references/code-trajectory-logging.md for the full logging format and how logs feed into evo-memory.
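A structured attempt record might look like the following. The field names here are illustrative assumptions; the canonical format is defined in references/code-trajectory-logging.md:

```python
import json
from datetime import datetime, timezone

# Illustrative attempt record: WHAT was done, WHY, and WHAT WAS LEARNED.
# Field names are assumptions; see code-trajectory-logging.md for the
# canonical schema consumed by evo-memory's ESE mechanism.
attempt_log = {
    "stage": "stage2_tuning",
    "attempt": 7,
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "what": "lowered learning rate 3e-4 -> 1e-4",
    "why": "loss spiked mid-training in attempt 6",
    "result": {"primary_metric": 0.742, "runs": 3, "variance": 0.031},
    "learned": "[Reusable] 500 warmup steps remove the mid-training spike",
    "gate_passed": True,
}

print(json.dumps(attempt_log, indent=2))
```

Keeping the record JSON-serializable means the same file can be appended per attempt and parsed later for strategy extraction.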
Prioritize these rules during experiment execution:
Initial implementation is not wasted time : It validates your entire infrastructure — data pipeline, evaluation code, training setup. Skipping it means every subsequent result is built on unverified ground. Most "method doesn't work" bugs are actually baseline setup bugs.
Budget limits prevent rabbit holes : Fixed attempt budgets force you to think systematically. When you know you have 12 attempts, you design each one to maximize information. Without limits, attempt #47 is rarely more informative than attempt #12 — it's just more desperate.
Stage order is non-negotiable : Each stage validates assumptions the next depends on. Skipping Stage 1 means Stage 3 results could be wrong due to a broken baseline. Skipping Stage 2 means Stage 3 improvements might just be better hyperparameters, not a better method. There are no shortcuts.
Ablation is not optional cleanup : It's the primary evidence that your method works for the right reasons. A method that outperforms the baseline but has no ablation is a method you don't understand. Reviewers know this.
Failed attempts are data, not waste : Each failed attempt narrows the search space and reveals something about the problem. Log failures carefully — they feed into evo-memory and prevent future researchers from repeating the same mistakes.
Early termination is a feature : Stopping before budget exhaustion is smart, not lazy. If the gate is clearly unachievable after systematic attempts, escalate to evo-memory IVE rather than burning remaining budget on increasingly random variations.
When all four stages are complete, pass these artifacts to paper-writing:
| Artifact | Source Stage | Used By |
|---|---|---|
| Initial implementation results | Stage 1 | Comparison tables, setup verification |
| Optimal hyperparameter config | Stage 2 | Reproducibility section |
| Method vs baseline comparison | Stage 3 | Main results table |
| Ablation study results | Stage 4 | Ablation table, contribution claims |
| Code trajectory logs (all stages) | All stages | Method section details, supplementary |
| Implementation details and tricks | Stages 1-3 | Method section, reproducibility (captured in trajectory log Analysis fields and [Reusable] tags) |
Also pass results to evo-memory for evolution updates:
Refer to the evo-memory skill to read Experimentation Memory: → Read M_E at /memory/experiment-memory.md
Refer to the experiment-craft skill for 5-step diagnostic: → Run diagnosis → Return to pipeline
Refer to the evo-memory skill for failure classification: → Run IVE protocol
Refer to the evo-memory skill for strategy extraction: → Run ESE protocol with trajectory logs
Refer to the paper-writing skill: → Pass all stage artifacts
| Topic | Reference File | When to Use |
|---|---|---|
| Per-stage checklists and patterns | stage-protocols.md | Detailed guidance for each stage |
| Budget rationale and adjustment | attempt-budget-guide.md | When budgets feel too tight or too loose |
| Code trajectory logging format | code-trajectory-logging.md | Recording attempts for evo-memory |
| Stage log template | stage-log-template.md | Logging a single stage's progress |
| Pipeline tracker template | pipeline-tracker-template.md | Tracking the full 4-stage pipeline |
Weekly Installs: 69
Repository
GitHub Stars: 105
First Seen: 9 days ago
Security Audits: Gen Agent Trust Hub: Warn · Socket: Pass · Snyk: Warn
Installed on:
kimi-cli: 69
gemini-cli: 69
amp: 69
cline: 69
github-copilot: 69
codex: 69