experiment-pipeline by evoscientist/evoskills
npx skills add https://github.com/evoscientist/evoskills --skill experiment-pipeline
A structured 4-stage framework for executing research experiments from initial implementation through ablation study, with attempt budgets and gate conditions that prevent wasted effort. This follows the Experiment Tree Search design from the EvoScientist paper, where the engineer agent iteratively generates executable code, runs experiments, and records structured execution results at each stage.
Experiments fail for two reasons: wrong order and no stopping criteria. Most researchers jump straight to testing their novel method without verifying their baseline setup, then wonder why results don't make sense. Others spend weeks tuning hyperparameters without a budget, hoping the next run will work.
The 4-stage pipeline solves both problems. It enforces a strict order (each stage validates assumptions the next stage depends on) and assigns attempt budgets (forcing systematic thinking over brute-force iteration).
If coming from idea-tournament, your research proposal (Phase 4) provides the experiment plan — datasets, baselines, metrics, and ablation design — that maps directly to Stages 1-4 below.
Before entering the pipeline, load Experimentation Memory (M_E) from prior cycles:
→ Read M_E at /memory/experiment-memory.md

Each stage follows a generate → execute → record → diagnose → revise loop:
| Stage | Goal | Budget (N_E^s) | Gate Condition |
|---|---|---|---|
| 1. Initial Implementation | Get baseline code running and reproduce known results | ≤20 attempts | Metrics within 2% of reported values (or within reported variance) |
| 2. Hyperparameter Tuning | Optimize config for your setup | ≤12 attempts | Stable config, variance < 5% across 3 runs |
| 3. Proposed Method | Implement & validate novel method | ≤12 attempts | Outperforms tuned baseline on primary metric, consistent across 3 runs |
| 4. Ablation Study | Prove each component's contribution | ≤18 attempts | All claims evidenced with controlled experiments |
Each stage saves artifacts to /experiments/stageN_name/.
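The table above can be encoded as a simple tracker that enforces the budgets mechanically. A minimal sketch, assuming nothing beyond the table itself — the `Stage` class and its field names are illustrative, not part of the skill:

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One pipeline stage with its attempt budget (N_E^s) and gate condition."""
    name: str
    budget: int        # maximum attempts allowed in this stage
    gate: str          # human-readable gate condition
    attempts: int = 0
    passed: bool = False

    def record_attempt(self) -> None:
        """Count one attempt; refuse to exceed the budget."""
        if self.attempts >= self.budget:
            raise RuntimeError(
                f"{self.name}: budget exhausted; escalate to evo-memory IVE"
            )
        self.attempts += 1


# Stage names, budgets, and gates taken directly from the table above.
PIPELINE = [
    Stage("initial_implementation", 20, "metrics within 2% of reported values"),
    Stage("hyperparameter_tuning", 12, "variance < 5% across 3 runs"),
    Stage("proposed_method", 12, "outperforms tuned baseline across 3 runs"),
    Stage("ablation_study", 18, "all claims backed by controlled experiments"),
]
```

Making budget exhaustion an exception, rather than a silent counter, matches the skill's rule that running out of attempts is a hand-off point, not a license to keep iterating.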
Within every stage, repeat this cycle for each attempt:
If an attempt needs diagnosis, load experiment-craft for the 5-step diagnostic flow.

Goal : Find or generate executable baseline code and verify it reproduces published results. This stage corresponds to the paper's "initial implementation" — the engineer agent searches for working code, runs it, and records structured execution results.
Why this matters : If you can't get the baseline running and reproducing known results, every subsequent comparison is meaningless. Initial implementation validates your data pipeline, evaluation code, training infrastructure, and understanding of prior work.
Budget : ≤20 attempts (N_E^1=20). Baselines can be tricky — missing details in papers, version mismatches, unreported preprocessing steps. 20 attempts gives enough room to debug without allowing infinite tinkering.
Gate : Primary metrics within 2% of reported values (or within the reported variance if provided).
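This gate can be checked mechanically. A minimal sketch, assuming "within reported variance" means within one reported standard deviation — the function name and that interpretation are ours, not the skill's:

```python
from typing import Optional


def passes_stage1_gate(measured: float, reported: float,
                       reported_std: Optional[float] = None) -> bool:
    """Stage 1 gate: measured metric within 2% of the reported value,
    or within the reported variance (one std dev) when the paper gives it."""
    if reported_std is not None:
        return abs(measured - reported) <= reported_std
    return abs(measured - reported) <= 0.02 * abs(reported)
```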
Process :
When to load experiment-craft : If attempts 1-5 all fail significantly (>10% gap), switch to the 5-step diagnostic flow to isolate the cause before burning more attempts.
Output : /experiments/stage1_baseline/ containing results, config, and verified baseline code.
See references/stage-protocols.md for detailed initial implementation checklists.
Goal : Find the optimal hyperparameter configuration for YOUR specific setup.
Why this matters : Published hyperparameters are tuned for the authors' setup. Your hardware, data version, framework version, or subtle implementation differences mean their config may not be optimal for you. Tuning now prevents confounding your novel method's results with suboptimal baselines.
Budget : ≤12 attempts. Hyperparameter tuning has diminishing returns. If 12 structured attempts don't find a stable config, the problem is likely deeper than hyperparameters.
Gate : Stable configuration found — variance < 5% across 3 independent runs with different random seeds.
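A minimal check for this gate, interpreting "<5% variance" as relative standard deviation against the mean — one reasonable reading; the skill's references may define the threshold differently:

```python
import statistics


def is_stable(run_metrics: list, threshold: float = 0.05) -> bool:
    """Stage 2 gate: >= 3 seeded runs whose relative standard
    deviation (stdev / mean) falls below the threshold."""
    if len(run_metrics) < 3:
        return False
    mean = statistics.mean(run_metrics)
    if mean == 0:
        return False
    return statistics.stdev(run_metrics) / abs(mean) < threshold
```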
Process :
Priority order for tuning : Learning rate → batch size → loss weights → regularization → architecture-specific params. This order reflects typical sensitivity.
When to load experiment-craft : If results are highly unstable (variance > 20%) across runs, there's likely a training instability issue. Use the diagnostic flow.
Output : /experiments/stage2_tuning/ containing tuning logs, final config, and stability verification.
See references/attempt-budget-guide.md for budget rationale and adjustment rules.
Goal : Implement and validate your novel method, demonstrating improvement over the tuned baseline.
Why this matters : This is the core contribution. But because you've verified the baseline (Stage 1) and optimized the config (Stage 2), any improvement you see is genuinely attributable to your method — not to a better-tuned setup or a broken baseline.
Budget : ≤12 attempts. Your method should work within a reasonable number of iterations if the underlying idea is sound. Excessive attempts suggest a fundamental problem, not a tuning issue.
Gate : Outperforms the tuned baseline on the primary metric. The improvement should be consistent across at least 3 runs.
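"Consistent across at least 3 runs" can be read strictly as a paired comparison: the method wins in every run. A sketch under that assumption (the function name is illustrative):

```python
def beats_baseline(method_runs: list, baseline_runs: list) -> bool:
    """Stage 3 gate: method beats the tuned baseline on the primary
    metric in every one of >= 3 paired runs (higher is better)."""
    if len(method_runs) < 3 or len(method_runs) != len(baseline_runs):
        return False
    return all(m > b for m, b in zip(method_runs, baseline_runs))
```

A looser reading (e.g. wins on the mean with non-overlapping variance) is also defensible; the strict all-runs version errs on the side of the reviewer.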
Process :
Integration strategy : Add your method's components one at a time to the working baseline. Each added component should stay within 20% of the baseline's performance — if a single component causes a >20% regression, isolate and debug it before proceeding. Never integrate the full method in one shot.
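The integration strategy above can be sketched as a loop with the 20% regression guard. Both hooks are hypothetical: `components` is your ordered component list and `evaluate(active)` runs the pipeline with those components enabled and returns the primary metric (higher is better):

```python
def integrate_incrementally(baseline_score: float, components, evaluate):
    """Add components one at a time to the working baseline; stop and
    flag the first component that causes a > 20% regression relative
    to the baseline score, so it can be debugged in isolation."""
    active = []
    for comp in components:
        score = evaluate(active + [comp])
        if score < 0.8 * baseline_score:   # > 20% regression
            return comp, active            # debug `comp` before proceeding
        active.append(comp)
    return None, active                     # full method integrated
```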
When to load experiment-craft : When your method underperforms the baseline despite correct implementation. The 5-step diagnostic flow will help distinguish between implementation bugs and fundamental issues.
Critical decision — failure classification : If the method underperforms the baseline after exhausting the attempt budget, hand off to evo-memory for IVE (Idea Validation Evolution) — this is evo-memory's job, not this skill's. IVE triggers under two conditions:
The evo-memory skill will classify the failure as:
Output : /experiments/stage3_method/ containing method code, results, comparison with baseline.
Goal : Prove that each component of your method contributes meaningfully to the final result.
Why this matters : Reviewers will ask "is component X really necessary?" for every part of your method. Without ablation, you can't answer. More importantly, ablation helps YOU understand why your method works — sometimes components you thought were important aren't, and vice versa.
Budget : ≤18 attempts. Ablation requires multiple controlled experiments — one per component being ablated, plus interaction effects. 18 attempts covers a method with 4-5 components.
Gate : Every claimed contribution is supported by a controlled experiment showing its effect.
Process :
Three ablation designs :
When to load experiment-craft : If ablation results contradict your hypothesis (removing a component improves results), use the diagnostic flow to understand why.
Output : /experiments/stage4_ablation/ containing ablation results table, per-component analysis.
See references/stage-protocols.md for detailed ablation design patterns.
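One common design, leave-one-out ablation, can be sketched as follows. The `evaluate(enabled)` hook is hypothetical: it runs the method with the given set of components enabled and returns the primary metric:

```python
def leave_one_out_ablation(components, evaluate):
    """Run the full method, then re-run with each component removed.
    Returns each component's contribution: full score minus the
    score with that component ablated."""
    full = evaluate(set(components))
    return {
        comp: full - evaluate(set(components) - {comp})
        for comp in components
    }
```

A near-zero (or negative) contribution is exactly the signal to load experiment-craft, per the trigger above.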
When a stage attempt fails, refer to the experiment-craft skill for structured diagnosis:
Trigger points : After any failed attempt in any stage. Especially important:
Every attempt across all stages should be logged in a structured format that captures not just WHAT you did but WHY and WHAT YOU LEARNED. These logs feed into evo-memory's Experiment Strategy Evolution (ESE) mechanism.
For each attempt, record:
See references/code-trajectory-logging.md for the full logging format and how logs feed into evo-memory.
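A structured attempt record might look like the following. The field names here are illustrative assumptions; the canonical format is defined in references/code-trajectory-logging.md:

```python
import json
from datetime import datetime, timezone

# Illustrative attempt record: WHAT was done, WHY, and WHAT WAS LEARNED.
# Field names are assumptions; see code-trajectory-logging.md for the
# canonical schema consumed by evo-memory's ESE mechanism.
attempt_log = {
    "stage": "stage2_tuning",
    "attempt": 7,
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "what": "lowered learning rate 3e-4 -> 1e-4",
    "why": "loss spiked mid-training in attempt 6",
    "result": {"primary_metric": 0.742, "runs": 3, "variance": 0.031},
    "learned": "[Reusable] 500 warmup steps remove the mid-training spike",
    "gate_passed": True,
}

print(json.dumps(attempt_log, indent=2))
```

Keeping the record JSON-serializable means the same file can be appended per attempt and parsed later for strategy extraction.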
Prioritize these rules during experiment execution:
Initial implementation is not wasted time : It validates your entire infrastructure — data pipeline, evaluation code, training setup. Skipping it means every subsequent result is built on unverified ground. Most "method doesn't work" bugs are actually baseline setup bugs.
Budget limits prevent rabbit holes : Fixed attempt budgets force you to think systematically. When you know you have 12 attempts, you design each one to maximize information. Without limits, attempt #47 is rarely more informative than attempt #12 — it's just more desperate.
Stage order is non-negotiable : Each stage validates assumptions the next depends on. Skipping Stage 1 means Stage 3 results could be wrong due to a broken baseline. Skipping Stage 2 means Stage 3 improvements might just be better hyperparameters, not a better method. There are no shortcuts.
Ablation is not optional cleanup : It's the primary evidence that your method works for the right reasons. A method that outperforms the baseline but has no ablation is a method you don't understand. Reviewers know this.
Failed attempts are data, not waste : Each failed attempt narrows the search space and reveals something about the problem. Log failures carefully — they feed into evo-memory and prevent future researchers from repeating the same mistakes.
Early termination is a feature : Stopping before budget exhaustion is smart, not lazy. If the gate is clearly unachievable after systematic attempts, escalate to evo-memory IVE rather than burning remaining budget on increasingly random variations.
When all four stages are complete, pass these artifacts to paper-writing:
| Artifact | Source Stage | Used By |
|---|---|---|
| Initial implementation results | Stage 1 | Comparison tables, setup verification |
| Optimal hyperparameter config | Stage 2 | Reproducibility section |
| Method vs baseline comparison | Stage 3 | Main results table |
| Ablation study results | Stage 4 | Ablation table, contribution claims |
| Code trajectory logs (all stages) | All stages | Method section details, supplementary |
| Implementation details and tricks | Stages 1-3 | Method section, reproducibility (captured in trajectory log Analysis fields and [Reusable] tags) |
Also pass results to evo-memory for evolution updates:
Refer to the evo-memory skill to read Experimentation Memory: → Read M_E at /memory/experiment-memory.md
Refer to the experiment-craft skill for 5-step diagnostic: → Run diagnosis → Return to pipeline
Refer to the evo-memory skill for failure classification: → Run IVE protocol
Refer to the evo-memory skill for strategy extraction: → Run ESE protocol with trajectory logs
Refer to the paper-writing skill: → Pass all stage artifacts
| Topic | Reference File | When to Use |
|---|---|---|
| Per-stage checklists and patterns | stage-protocols.md | Detailed guidance for each stage |
| Budget rationale and adjustment | attempt-budget-guide.md | When budgets feel too tight or too loose |
| Code trajectory logging format | code-trajectory-logging.md | Recording attempts for evo-memory |
| Stage log template | stage-log-template.md | Logging a single stage's progress |
| Pipeline tracker template | pipeline-tracker-template.md | Tracking the full 4-stage pipeline |
Weekly Installs: 69
Repository
GitHub Stars: 105
First Seen: 9 days ago
Security Audits: Gen Agent Trust Hub: Warn · Socket: Pass · Snyk: Warn
Installed on:
kimi-cli: 69
gemini-cli: 69
amp: 69
cline: 69
github-copilot: 69
codex: 69