autoresearch by supercent-io/skills-template
npx skills add https://github.com/supercent-io/skills-template --skill autoresearch

"The researcher's job shifts from writing Python to writing Markdown." — Andrej Karpathy
Autoresearch is an autonomous ML experimentation framework. An AI agent iteratively modifies train.py, runs fixed 5-minute GPU experiments, evaluates with a single metric (val_bpb), and commits only improvements via git ratcheting. The result: wake up to 100+ logged experiments and a monotonically better model.
You author program.md with research directives for the agent, and read results.tsv to understand what the agent found.

Human authors program.md
│
▼
Agent reads program.md + train.py
│
▼
Agent modifies train.py → git commit
│
▼
uv run train.py (exactly 300 seconds)
│
▼
Extract val_bpb + peak_vram_mb
│
┌────┴────┐
improved? no improvement
│ │
keep commit git reset HEAD~1
│ │
└──────┬───────┘
│
log to results.tsv
│
▼
repeat ∞
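The keep-or-revert branch in the diagram reduces to a strict-improvement check on val_bpb. As a minimal sketch, assuming a hypothetical helper named ratchet_step (not code from the repository), the decision rule looks like this:

```shell
# Sketch of the git-ratchet decision rule (hypothetical helper, not from the repo).
# Lower val_bpb is better; a commit survives only on strict improvement.
ratchet_step() {
  best_bpb="$1"   # best val_bpb kept so far
  new_bpb="$2"    # val_bpb extracted from the latest run
  # awk does the floating-point comparison; exit 0 means "improved"
  if awk -v new="$new_bpb" -v best="$best_bpb" 'BEGIN { exit !(new < best) }'; then
    echo keep      # caller keeps the commit and updates best_bpb
  else
    echo discard   # caller runs: git reset --hard HEAD~1
  fi
}
```

For example, `ratchet_step 0.9979 0.9821` prints `keep`, while a tie or regression prints `discard` — the ratchet never accepts "about as good".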
| File | Agent access | Purpose |
|---|---|---|
| train.py | Read + write | Model, optimizer, training loop (~630 lines) |
| program.md | Read-only | Human research directives |
| prepare.py | Read-only | Data pipeline + evaluate_bpb() harness |
| constants.py | Read-only | TIME_BUDGET=300, MAX_SEQ_LEN, EVAL_TOKENS |
| pyproject.toml | Read-only | Locked dependencies (no new packages) |
| results.tsv | Append-only | All experiments: kept and discarded |
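The read-only contract above can be spot-checked mechanically. A minimal sketch, assuming a snapshot file named .protected.sha256 (an invented name, not part of the repository):

```shell
# Sketch: detect violations of the read-only contract with checksums.
# .protected.sha256 is an assumed file name, not part of the repo.
snapshot_protected() {
  # record checksums once, right after setup
  sha256sum program.md prepare.py constants.py pyproject.toml > .protected.sha256
}
check_protected() {
  # exits non-zero (and names the offending file) if any protected file changed
  sha256sum --check --quiet .protected.sha256 && echo ok
}
```

Running check_protected before accepting a commit catches an agent that edited prepare.py or constants.py by accident.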
# Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone the repository
git clone https://github.com/karpathy/autoresearch
cd autoresearch
# Install locked dependencies
uv sync
# Downloads FineWeb-Edu parquet shards, trains BPE tokenizer
# Last shard is reserved for validation — never seen during training
uv run prepare.py
For constrained hardware, edit prepare.py before running:
# Lower MAX_SEQ_LEN for GPUs with limited VRAM
MAX_SEQ_LEN = 256 # default: 2048
# Single 5-minute experiment to verify setup
uv run train.py > run.log 2>&1
# Extract key metrics
grep "^val_bpb:\|^peak_vram_mb:" run.log
Expected output:
val_bpb: 0.9979
peak_vram_mb: 38420
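It helps to fail fast when a smoke-test run did not emit both metrics. A minimal sketch, assuming a hypothetical helper named check_run that takes the log file path (e.g. run.log):

```shell
# Sketch: parse the two metrics out of a run log and fail fast if missing.
# check_run is a hypothetical helper, not part of the repo.
check_run() {
  bpb=$(awk -F': ' '/^val_bpb:/ { print $2 }' "$1")
  vram=$(awk -F': ' '/^peak_vram_mb:/ { print $2 }' "$1")
  if [ -z "$bpb" ] || [ -z "$vram" ]; then
    echo "metrics missing in $1 (crash or OOM?)" >&2
    return 1
  fi
  echo "val_bpb=$bpb peak_vram_mb=$vram"
}
```

A missing metric usually means the run crashed before evaluation, so the helper returns non-zero instead of printing partial numbers.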
program.md is the human-written research charter the agent reads at the start of every loop iteration. Write it as precise Markdown instructions:
# Research Program
## Goal
Minimize val_bpb on the FineWeb-Edu validation set within the 300-second budget.
## Current Baseline
val_bpb: 0.9979 (depth-12 GPT, Muon + AdamW optimizer)
## Directions to Explore
1. Attention variants: MLA, GQA, sliding window, local-global hybrid
2. Layer types: MoE FFN layers, SwiGLU activations
3. Optimizer tuning: Muon momentum, AdamW β values, learning rate schedule
4. Architectural depth/width tradeoffs within VRAM budget
## Constraints
- Must complete within 300 seconds
- Peak VRAM must stay under 39GB
- No new packages (use only what is in pyproject.toml)
- Do not modify prepare.py or constants.py
## Notes from Previous Runs
- Depth-12 improvements transfer to depth-24 (scale-invariant gains)
- RoPE positional encoding outperformed learned embeddings (+0.008 val_bpb)
Principles for an effective program.md: keep the directives precise, and keep the current baseline val_bpb as a reference point.

Point your AI agent (Claude Code, Codex, etc.) at the repository with program.md as its research context. The agent will:

1. Read program.md + the current train.py
2. Modify train.py + commit
3. Run uv run train.py (300 seconds)
4. Extract val_bpb; keep or revert via git
5. Log to results.tsv

With Claude Code (OMC):
# From inside autoresearch/
# Give Claude the context: "Run the autoresearch loop following program.md"
With the Claude Code CLI directly:
claude "Follow program.md. Run autonomous research loop on train.py.
Execute: uv run train.py, extract val_bpb, keep improvements, revert failures.
Log everything to results.tsv. Do not stop until I say so."
# Live monitoring during a run
watch -n 30 "tail -20 results.tsv"
# Count kept vs. discarded
awk -F'\t' '{print $4}' results.tsv | sort | uniq -c
# Find the best experiment
sort -t$'\t' -k2 -n results.tsv | head -5
# Check current best val_bpb
git log --oneline -5
commit val_bpb memory_gb status description
a3f2c91 0.9697 37.2 keep SwiGLU activation + depth-12
b8e1d04 0.9821 38.1 discard MoE 4-expert: marginal gain
c1a5f30 crash — crash OOM: sequence length 4096
| Status | Meaning |
|---|---|
| keep | val_bpb improved; commit retained on the branch |
| discard | No improvement; git reset HEAD~1 applied |
| crash | OOM, syntax error, or timeout; always reverted |
Session summary: 126 experiments, 18 improvements
Best val_bpb: 0.9697 (started: 0.9979)
Top improvements:
- SwiGLU activation: -0.012 val_bpb
- GQA with 4 KV heads: -0.009 val_bpb
- Muon momentum 0.92→0.95: -0.006 val_bpb
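A summary like the one above can be regenerated from results.tsv itself. This is an illustrative sketch that assumes the tab-separated layout shown earlier (commit, val_bpb, memory_gb, status, description); summarize is an invented helper name:

```shell
# Sketch: rebuild a session summary from results.tsv.
# Assumes tab-separated columns: commit, val_bpb, memory_gb, status, description.
summarize() {
  awk -F'\t' '
    { total++ }
    $4 == "keep" {
      kept++
      # track the lowest val_bpb among kept experiments
      if (best == "" || $2 + 0 < best + 0) best = $2
    }
    END { printf "experiments=%d kept=%d best_val_bpb=%s\n", total, kept, best }
  ' "$1"
}
```

Only keep rows feed the best-val_bpb tracker, so crashed runs (whose metric column holds "crash") never pollute the minimum.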
# In prepare.py — edit before uv run prepare.py
MAX_SEQ_LEN = 256 # was 2048
EVAL_TOKENS = 2_097_152 # was 20_971_520 (scale down proportionally)
# Find all attention-related experiments
grep -i "attention\|GQA\|MLA\|MHA" results.tsv
# List only improvements sorted by gain
awk -F'\t' '$4=="keep"' results.tsv | sort -t$'\t' -k2 -n
Run from inside the autoresearch repository directory:
| Script | Purpose | Usage |
|---|---|---|
| setup.sh | One-time environment setup | bash scripts/setup.sh [--seq-len 512] |
| run-experiment.sh | Single 5-minute experiment + metric extraction | bash scripts/run-experiment.sh |
| run-loop.sh | Autonomous loop: run → keep/revert → repeat | bash scripts/run-loop.sh [--max 20] |
| show-results.sh | Human-readable results.tsv report | bash scripts/show-results.sh [--top 10] |
| check-hardware.sh | GPU/CUDA/uv availability check (JSON output) | bash scripts/check-hardware.sh |
# Typical overnight session
bash scripts/check-hardware.sh
bash scripts/setup.sh --seq-len 512 # adjust for your VRAM
# Edit program.md with your research directives
bash scripts/run-loop.sh --max 100 --desc "session-1"
bash scripts/show-results.sh --kept-only
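For a genuinely unattended overnight run, the loop can be detached from the terminal. A sketch, assuming the log and PID file names loop.log and loop.pid (invented names, not produced by the repo's scripts):

```shell
# Sketch: detach the loop for an unattended overnight run.
# loop.log and loop.pid are assumed names, not part of the repo's scripts.
nohup bash scripts/run-loop.sh --max 100 --desc "session-1" > loop.log 2>&1 &
echo $! > loop.pid        # keep the PID so the run can be stopped later
# Next morning:
#   tail -f loop.log                # watch progress
#   kill "$(cat loop.pid)"          # stop the loop if it is still running
```

tmux or screen works just as well; the only requirement is that the loop survives the SSH session ending.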
Detailed documentation in references/:

| File | Contents |
|---|---|
| references/architecture.md | System design, immutability contract, git ratcheting, key design decisions |
| references/program-md-guide.md | How to write effective program.md directives; full template + principles |
| references/hardware-config.md | VRAM settings by GPU, memory-optimization techniques, troubleshooting |
- Run uv run train.py manually before launching the loop to confirm the setup works
- Keep MAX_SEQ_LEN in prepare.py consistent — changing it mid-run invalidates val_bpb comparisons
- Never modify prepare.py or constants.py — the evaluation harness must stay fixed for results to be meaningful
- Commit program.md updates — version-control your research directives alongside results.tsv for reproducibility
- Add a peak_vram_mb constraint in program.md for your GPU's headroom
- The agent cannot pip install; it can only use what is in pyproject.toml

| Hardware | Status | Notes |
|---|---|---|
| H100 80GB | Recommended | Default config, full MAX_SEQ_LEN=2048 |
| A100 40GB | Supported | Lower MAX_SEQ_LEN if needed |
| RTX 4090 24GB | Community | Reduce MAX_SEQ_LEN to 512 |
| GTX 1660 Ti 6GB | Community fork | MAX_SEQ_LEN=256, reduced EVAL_TOKENS |
| Apple Silicon (M-series) | MLX port | Community fork; different optimizer API |
| Windows RTX | Community | WSL2 + CUDA recommended |
| Metric | Direction | Description |
|---|---|---|
| val_bpb | Lower = better | Validation bits-per-byte; independent of vocabulary size |
| peak_vram_mb | Lower = more headroom | Peak GPU memory during the training run |
| Experiments/hour | Higher = faster search | ~12 at TIME_BUDGET=300 |
Weekly installs: 525
GitHub stars: 88
First seen: Mar 11, 2026
Security audits: Gen Agent Trust Hub: Fail; Socket: Pass; Snyk: Fail
Installed on: gemini-cli (476), codex (469), opencode (457), github-copilot (454), cursor (454), kimi-cli (453)