autoresearch by supercent-io/skills-template
npx skills add https://github.com/supercent-io/skills-template --skill autoresearch

"The researcher's job shifts from writing Python to writing Markdown." — Andrej Karpathy
Autoresearch is an autonomous ML experimentation framework. An AI agent iteratively modifies train.py, runs fixed 5-minute GPU experiments, evaluates with a single metric (val_bpb), and commits only improvements via git ratcheting. The result: wake up to 100+ logged experiments and a monotonically better model.
You author program.md with research directives for the agent, and read results.tsv to understand what the agent found.

Human authors program.md
│
▼
Agent reads program.md + train.py
│
▼
Agent modifies train.py → git commit
│
▼
uv run train.py (exactly 300 seconds)
│
▼
Extract val_bpb + peak_vram_mb
│
┌────┴────┐
improved? no improvement
│ │
keep commit git reset HEAD~1
│ │
└──────┬───────┘
│
log to results.tsv
│
▼
repeat ∞
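The keep-or-revert branch in the diagram reduces to a strict-improvement check on val_bpb. As a minimal sketch, assuming a hypothetical helper named ratchet_step (not code from the repository), the decision rule looks like this:

```shell
# Sketch of the git-ratchet decision rule (hypothetical helper, not from the repo).
# Lower val_bpb is better; a commit survives only on strict improvement.
ratchet_step() {
  best_bpb="$1"   # best val_bpb kept so far
  new_bpb="$2"    # val_bpb extracted from the latest run
  # awk does the floating-point comparison; exit 0 means "improved"
  if awk -v new="$new_bpb" -v best="$best_bpb" 'BEGIN { exit !(new < best) }'; then
    echo keep      # caller keeps the commit and updates best_bpb
  else
    echo discard   # caller runs: git reset --hard HEAD~1
  fi
}
```

For example, `ratchet_step 0.9979 0.9821` prints `keep`, while a tie or regression prints `discard` — the ratchet never accepts "about as good".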
| File | Agent access | Purpose |
|---|---|---|
| train.py | Read + write | Model, optimizer, training loop (~630 lines) |
| program.md | Read-only | Human research directives |
| prepare.py | Read-only | Data pipeline + evaluate_bpb() harness |
| constants.py | Read-only | TIME_BUDGET=300, MAX_SEQ_LEN, EVAL_TOKENS |
| pyproject.toml | Read-only | Locked dependencies (no new packages) |
| results.tsv | Append-only | All experiments: kept and discarded |
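The read-only contract above can be spot-checked mechanically. A minimal sketch, assuming a snapshot file named .protected.sha256 (an invented name, not part of the repository):

```shell
# Sketch: detect violations of the read-only contract with checksums.
# .protected.sha256 is an assumed file name, not part of the repo.
snapshot_protected() {
  # record checksums once, right after setup
  sha256sum program.md prepare.py constants.py pyproject.toml > .protected.sha256
}
check_protected() {
  # exits non-zero (and names the offending file) if any protected file changed
  sha256sum --check --quiet .protected.sha256 && echo ok
}
```

Running check_protected before accepting a commit catches an agent that edited prepare.py or constants.py by accident.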
# Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone the repository
git clone https://github.com/karpathy/autoresearch
cd autoresearch
# Install locked dependencies
uv sync
# Downloads FineWeb-Edu parquet shards, trains BPE tokenizer
# Last shard is reserved for validation — never seen during training
uv run prepare.py
For constrained hardware, edit prepare.py before running:
# Lower MAX_SEQ_LEN for GPUs with limited VRAM
MAX_SEQ_LEN = 256 # default: 2048
# Single 5-minute experiment to verify setup
uv run train.py > run.log 2>&1
# Extract key metrics
grep "^val_bpb:\|^peak_vram_mb:" run.log
Expected output:
val_bpb: 0.9979
peak_vram_mb: 38420
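It helps to fail fast when a smoke-test run did not emit both metrics. A minimal sketch, assuming a hypothetical helper named check_run that takes the log file path (e.g. run.log):

```shell
# Sketch: parse the two metrics out of a run log and fail fast if missing.
# check_run is a hypothetical helper, not part of the repo.
check_run() {
  bpb=$(awk -F': ' '/^val_bpb:/ { print $2 }' "$1")
  vram=$(awk -F': ' '/^peak_vram_mb:/ { print $2 }' "$1")
  if [ -z "$bpb" ] || [ -z "$vram" ]; then
    echo "metrics missing in $1 (crash or OOM?)" >&2
    return 1
  fi
  echo "val_bpb=$bpb peak_vram_mb=$vram"
}
```

A missing metric usually means the run crashed before evaluation, so the helper returns non-zero instead of printing partial numbers.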
program.md is the human-written research charter the agent reads at the start of every loop iteration. Write it as precise Markdown instructions:
# Research Program
## Goal
Minimize val_bpb on the FineWeb-Edu validation set within the 300-second budget.
## Current Baseline
val_bpb: 0.9979 (depth-12 GPT, Muon + AdamW optimizer)
## Directions to Explore
1. Attention variants: MLA, GQA, sliding window, local-global hybrid
2. Layer types: MoE FFN layers, SwiGLU activations
3. Optimizer tuning: Muon momentum, AdamW β values, learning rate schedule
4. Architectural depth/width tradeoffs within VRAM budget
## Constraints
- Must complete within 300 seconds
- Peak VRAM must stay under 39GB
- No new packages (use only what is in pyproject.toml)
- Do not modify prepare.py or constants.py
## Notes from Previous Runs
- Depth-12 improvements transfer to depth-24 (scale-invariant gains)
- RoPE positional encoding outperformed learned embeddings (+0.008 val_bpb)
Principles for an effective program.md: keep the directives precise, and keep the current baseline val_bpb as a reference point.

Point your AI agent (Claude Code, Codex, etc.) at the repository with program.md as its research context. The agent will:

1. Read program.md + the current train.py
2. Modify train.py + commit
3. Run uv run train.py (300 seconds)
4. Extract val_bpb; keep or revert via git
5. Log to results.tsv

With Claude Code (OMC):
# From inside autoresearch/
# Give Claude the context: "Run the autoresearch loop following program.md"
With the Claude Code CLI directly:
claude "Follow program.md. Run autonomous research loop on train.py.
Execute: uv run train.py, extract val_bpb, keep improvements, revert failures.
Log everything to results.tsv. Do not stop until I say so."
# Live monitoring during a run
watch -n 30 "tail -20 results.tsv"
# Count kept vs. discarded
awk -F'\t' '{print $4}' results.tsv | sort | uniq -c
# Find the best experiment
sort -t$'\t' -k2 -n results.tsv | head -5
# Check current best val_bpb
git log --oneline -5
commit val_bpb memory_gb status description
a3f2c91 0.9697 37.2 keep SwiGLU activation + depth-12
b8e1d04 0.9821 38.1 discard MoE 4-expert: marginal gain
c1a5f30 crash — crash OOM: sequence length 4096
| Status | Meaning |
|---|---|
| keep | val_bpb improved; commit retained on the branch |
| discard | No improvement; git reset HEAD~1 applied |
| crash | OOM, syntax error, or timeout; always reverted |
Session summary: 126 experiments, 18 improvements
Best val_bpb: 0.9697 (started: 0.9979)
Top improvements:
- SwiGLU activation: -0.012 val_bpb
- GQA with 4 KV heads: -0.009 val_bpb
- Muon momentum 0.92→0.95: -0.006 val_bpb
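A summary like the one above can be regenerated from results.tsv itself. This is an illustrative sketch that assumes the tab-separated layout shown earlier (commit, val_bpb, memory_gb, status, description); summarize is an invented helper name:

```shell
# Sketch: rebuild a session summary from results.tsv.
# Assumes tab-separated columns: commit, val_bpb, memory_gb, status, description.
summarize() {
  awk -F'\t' '
    { total++ }
    $4 == "keep" {
      kept++
      # track the lowest val_bpb among kept experiments
      if (best == "" || $2 + 0 < best + 0) best = $2
    }
    END { printf "experiments=%d kept=%d best_val_bpb=%s\n", total, kept, best }
  ' "$1"
}
```

Only keep rows feed the best-val_bpb tracker, so crashed runs (whose metric column holds "crash") never pollute the minimum.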
# In prepare.py — edit before uv run prepare.py
MAX_SEQ_LEN = 256 # was 2048
EVAL_TOKENS = 2_097_152 # was 20_971_520 (scale down proportionally)
# Find all attention-related experiments
grep -i "attention\|GQA\|MLA\|MHA" results.tsv
# List only improvements sorted by gain
awk -F'\t' '$4=="keep"' results.tsv | sort -t$'\t' -k2 -n
Run from inside the autoresearch repository directory:
| Script | Purpose | Usage |
|---|---|---|
| setup.sh | One-time environment setup | bash scripts/setup.sh [--seq-len 512] |
| run-experiment.sh | Single 5-minute experiment + metric extraction | bash scripts/run-experiment.sh |
| run-loop.sh | Autonomous loop: run → keep/revert → repeat | bash scripts/run-loop.sh [--max 20] |
| show-results.sh | Human-readable results.tsv report | bash scripts/show-results.sh [--top 10] |
| check-hardware.sh | GPU/CUDA/uv availability check (JSON output) | bash scripts/check-hardware.sh |
# Typical overnight session
bash scripts/check-hardware.sh
bash scripts/setup.sh --seq-len 512 # adjust for your VRAM
# Edit program.md with your research directives
bash scripts/run-loop.sh --max 100 --desc "session-1"
bash scripts/show-results.sh --kept-only
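For a genuinely unattended overnight run, the loop can be detached from the terminal. A sketch, assuming the log and PID file names loop.log and loop.pid (invented names, not produced by the repo's scripts):

```shell
# Sketch: detach the loop for an unattended overnight run.
# loop.log and loop.pid are assumed names, not part of the repo's scripts.
nohup bash scripts/run-loop.sh --max 100 --desc "session-1" > loop.log 2>&1 &
echo $! > loop.pid        # keep the PID so the run can be stopped later
# Next morning:
#   tail -f loop.log                # watch progress
#   kill "$(cat loop.pid)"          # stop the loop if it is still running
```

tmux or screen works just as well; the only requirement is that the loop survives the SSH session ending.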
Detailed documentation in references/:

| File | Contents |
|---|---|
| references/architecture.md | System design, immutability contract, git ratcheting, key design decisions |
| references/program-md-guide.md | How to write effective program.md directives; full template + principles |
| references/hardware-config.md | VRAM settings by GPU, memory-optimization techniques, troubleshooting |
- Run uv run train.py manually before launching the loop to confirm the setup works
- Keep MAX_SEQ_LEN in prepare.py consistent — changing it mid-run invalidates val_bpb comparisons
- Never modify prepare.py or constants.py — the evaluation harness must stay fixed for results to be meaningful
- Commit program.md updates — version-control your research directives alongside results.tsv for reproducibility
- Add a peak_vram_mb constraint in program.md for your GPU's headroom
- The agent cannot pip install; it can only use what is in pyproject.toml

| Hardware | Status | Notes |
|---|---|---|
| H100 80GB | Recommended | Default config, full MAX_SEQ_LEN=2048 |
| A100 40GB | Supported | Lower MAX_SEQ_LEN if needed |
| RTX 4090 24GB | Community | Reduce MAX_SEQ_LEN to 512 |
| GTX 1660 Ti 6GB | Community fork | MAX_SEQ_LEN=256, reduced EVAL_TOKENS |
| Apple Silicon (M-series) | MLX port | Community fork; different optimizer API |
| Windows RTX | Community | WSL2 + CUDA recommended |
| Metric | Direction | Description |
|---|---|---|
| val_bpb | Lower = better | Validation bits-per-byte; independent of vocabulary size |
| peak_vram_mb | Lower = more headroom | Peak GPU memory during the training run |
| Experiments/hour | Higher = faster search | ~12 at TIME_BUDGET=300 |
Weekly installs: 525
GitHub stars: 88
First seen: Mar 11, 2026
Security audits: Gen Agent Trust Hub: Fail; Socket: Pass; Snyk: Fail
Installed on: gemini-cli (476), codex (469), opencode (457), github-copilot (454), cursor (454), kimi-cli (453)