Codex Autoresearch Loop: An AI-Driven Skill for Autonomous Code Optimization and Verification Loops

codex-autoresearch-loop by aradotso/trending-skills

385 weekly installs

22 GitHub Stars


Install command

npx skills add https://github.com/aradotso/trending-skills --skill codex-autoresearch-loop

AI/Machine Learning, Automation, Code Quality


Codex Autoresearch

Skill by ara.so — Daily 2026 Skills collection.

Codex Autoresearch is a Codex skill that runs an autonomous modify→verify→keep/revert loop on your codebase. You describe a measurable goal in one sentence; Codex confirms the plan, then iterates unattended — every improvement stacks in git, every failure reverts automatically — until interrupted or a cap is reached. Inspired by Karpathy's autoresearch concept, generalized beyond ML training to any software metric.

Installation

Option A — manual copy into your project:

git clone https://github.com/leo-lilinxiao/codex-autoresearch.git
cp -r codex-autoresearch your-project/.agents/skills/codex-autoresearch

Option B — Codex skill installer:

$skill-installer install https://github.com/leo-lilinxiao/codex-autoresearch

The skill lives at .agents/skills/codex-autoresearch/ inside your project. No config file is required before first use.

How to Activate

Open Codex in your project directory and prefix your goal with $codex-autoresearch:

$codex-autoresearch
I want to get rid of all `any` types in my TypeScript code

Codex will:

1. Scan the repo and infer scope, metric, verify command, and guard command.
2. Present a confirmation summary; reply go (or correct anything).
3. Run the loop unattended until you interrupt it or the goal is met.

You never write config. Codex infers everything.

Confirmation Flow

Before the loop starts Codex always shows what it found and asks you to confirm. Example exchange:

Codex: I found 47 `any` occurrences across src/**/*.ts.

       Confirmed:
       - Target: eliminate `any` types in src/**/*.ts
       - Metric: `any` count (current: 47), direction: lower
       - Verify: grep + tsc --noEmit as guard

       Need to confirm:
       - Run until all gone, or cap at N iterations?

       Reply "go" to start, or tell me what to change.

You:   Go, run overnight.

Codex: Starting — baseline: 47. Iterating until interrupted.

Up to five confirmation rounds are possible. After that, Codex proceeds.

The Loop (internals)

PHASE 0: Probe environment (CPU/GPU/RAM/toolchains), check for session resume
PHASE 1: Read context + lessons file from prior run (if any)

LOOP (forever or N times):
  1. Review current state, git history, results log, lessons
  2. Pick ONE hypothesis (apply perspectives, filter by environment)
     -- or N hypotheses if parallel mode is active
  3. Make ONE atomic change
  4. git commit (before verification)
  5. Run verify command  →  did the target metric improve?
     Run guard command   →  did anything else break?
  6. Improved → keep (extract lesson)
     Worse    → approved rollback strategy (git revert)
     Crashed  → fix or skip
  7. Log the result to results log
  8. Health check (disk, git, verify health)
  9. If 3+ consecutive discards → REFINE; 5+ → PIVOT; 2 PIVOTs → web search
 10. Repeat. Never stop. Never ask.

The loop runs unbounded unless you say Iterations: N during confirmation.
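The keep/revert cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the skill's actual implementation: `run_loop`, `LoopResult`, and the four callables are hypothetical names, the callables stand in for the real git commit/revert and shell commands, and `max_safety` is a cap added here so the sketch always terminates.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LoopResult:
    best: float                      # best metric value seen so far
    kept: int = 0                    # number of changes that were kept
    reverted: int = 0                # number of changes that were rolled back
    log: List[str] = field(default_factory=list)


def run_loop(
    apply_change: Callable[[int], None],   # hypothetical: make one atomic change
    revert_change: Callable[[int], None],  # hypothetical: undo that change
    verify: Callable[[], float],           # returns the current metric value
    guard: Callable[[], bool],             # True if nothing else broke
    lower_is_better: bool = True,
    iterations: Optional[int] = None,      # None stands for "until interrupted"
    max_safety: int = 1000,                # assumption: cap used by this sketch only
) -> LoopResult:
    result = LoopResult(best=verify())     # establish the baseline first
    limit = iterations if iterations is not None else max_safety
    for i in range(limit):
        apply_change(i)
        metric = verify()
        if lower_is_better:
            improved = metric < result.best
        else:
            improved = metric > result.best
        if improved and guard():
            result.best = metric           # keep: the new value becomes the baseline
            result.kept += 1
            result.log.append(f"{i}: KEPT {metric}")
        else:
            revert_change(i)               # discard: restore the previous state
            result.reverted += 1
            result.log.append(f"{i}: REVERTED {metric}")
    return result
```

Comparing each change against the best value so far, rather than the original baseline, is what lets improvements stack: a change that merely matches the current best is treated as a discard.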

Dual-Gate Verification

Two commands serve distinct purposes:

Gate	Purpose	Failure means
Verify	Did the target metric improve?	Change discarded, reverted
Guard	Did anything else break?	Change reworked (up to 2 attempts), then reverted

Guard files are never modified by the loop.

Example verify + guard pair for a Python coverage run:

Verify: pytest --cov=src --cov-report=term 2>&1 | grep TOTAL | awk '{print $NF}' | tr -d '%'
Guard:  python -m mypy src --ignore-missing-imports

Example for TypeScript type cleanup:

Verify: grep -rw "any" src --include="*.ts" | wc -l
Guard:  npx tsc --noEmit
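One way to read the two-gate table is as a small decision function. The sketch below is an assumption about how the gates combine, not code from the skill: `decide`, `Outcome`, and `MAX_GUARD_REWORKS` are hypothetical names, and the rework limit of 2 is taken from the "up to 2 attempts" wording in the table.

```python
from enum import Enum


class Outcome(Enum):
    KEEP = "keep"
    REWORK = "rework"
    REVERT = "revert"


MAX_GUARD_REWORKS = 2  # "up to 2 attempts", per the table above


def decide(baseline: float, metric: float, guard_ok: bool,
           reworks_used: int, lower_is_better: bool = True) -> Outcome:
    """Combine the verify gate and the guard gate into one outcome."""
    improved = metric < baseline if lower_is_better else metric > baseline
    if not improved:
        return Outcome.REVERT      # verify gate failed: discard the change
    if guard_ok:
        return Outcome.KEEP        # both gates passed
    if reworks_used < MAX_GUARD_REWORKS:
        return Outcome.REWORK      # guard failed: try to repair the change
    return Outcome.REVERT          # guard still failing after the allowed reworks
```

For the TypeScript example, `decide(47, 45, guard_ok=False, reworks_used=0)` returns `Outcome.REWORK`: the `any` count dropped, but `tsc --noEmit` failed, so the change gets another attempt before it is rolled back.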

Modes

Codex maps your sentence to one of seven modes automatically, so you rarely need to pick one explicitly. To force a mode, put its name after the prefix, as in $codex-autoresearch fix.

`loop` — iterate toward a measurable target (default)

$codex-autoresearch
Improve test coverage in src/ to at least 80%



$codex-autoresearch
Reduce bundle size — it's currently 2.3 MB, get it under 1 MB

`plan` — turn a vague goal into a validated loop config

$codex-autoresearch
I want to make our API faster but I don't know where to start

Codex will interview you (p95 latency vs throughput? which endpoint?) and produce a ready-to-run loop config.

`fix` — repair errors until count reaches zero

$codex-autoresearch
pytest is failing, 12 tests broken after the refactor — fix them all

`debug` — evidence-driven root-cause hunting

$codex-autoresearch
Our API returns 503 randomly under load, no idea why

Each iteration tests one falsifiable hypothesis. Codex presents evidence, not guesses.

`security` — read-only STRIDE + OWASP audit

$codex-autoresearch
Is this code secure?

`ship` — readiness verification and release gating

$codex-autoresearch
Ship it

`exec` — one-shot execution with no loop

$codex-autoresearch
Run the benchmark suite and summarize results

Inline Configuration (optional)

You can override defaults inline during the confirmation step — no file edits needed:

Phrase	Effect
`Iterations: 20`	Cap the loop at 20 iterations
`Parallel: 3`	Test 3 hypotheses concurrently per round
`Guard: npm test`	Override the inferred guard command
`Verify: <command>`	Override the inferred verify command
`Scope: src/api/`	Restrict changes to a subdirectory

Example during confirmation:

You:   Go. Iterations: 30, Guard: npm test, Scope: src/api/
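To make the phrase format concrete, here is a minimal sketch of how such a reply could be split into overrides. Codex interprets the reply itself, so this parser is purely illustrative: `parse_overrides` is a hypothetical helper, and the assumption that phrases are separated by a comma followed by the next key is mine.

```python
import re
from typing import Dict, Union

KEYS = ("Iterations", "Parallel", "Guard", "Verify", "Scope")
_KEY_ALT = "|".join(KEYS)
# A value runs until the next ", <Key>:" or the end of the reply, so commands
# that contain spaces or pipes (e.g. "npm test") stay in one piece.
_PATTERN = re.compile(rf"({_KEY_ALT}):\s*(.*?)(?=,\s*(?:{_KEY_ALT}):|$)")


def parse_overrides(reply: str) -> Dict[str, Union[int, str]]:
    """Extract inline overrides from a confirmation reply (illustrative only)."""
    overrides: Dict[str, Union[int, str]] = {}
    for key, value in _PATTERN.findall(reply):
        value = value.strip()
        # Iterations and Parallel are counts; the other keys are free text.
        overrides[key] = int(value) if key in ("Iterations", "Parallel") else value
    return overrides
```

Applied to the reply above, this yields `Iterations` as the integer 30, `Guard` as `npm test`, and `Scope` as `src/api/`; a reply with no recognized phrase yields an empty dictionary.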

Cross-Run Learning

At the end of each iteration Codex writes a structured lesson to .agents/skills/codex-autoresearch/lessons.md:

Iteration 7 — KEPT
Hypothesis: replace explicit `any` with inferred generic in src/utils/mapper.ts
Change: added <T extends Record<string, unknown>> to mapKeys()
Result: any count 31 → 29
Lesson: Generic constraints on utility functions eliminate clusters of `any` downstream.

On session resume Codex reads this file first. Each new run benefits from prior runs.

To resume an interrupted run:

$codex-autoresearch
Resume

Codex re-reads the lessons file, checks git state, re-establishes the baseline, and continues.
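If you want to inspect the lessons file yourself, entries in the format shown above are easy to parse. The sketch below assumes every entry follows that five-line layout exactly; `parse_lessons` is a hypothetical helper and not part of the skill.

```python
import re
from typing import Dict, List, Optional

# Header line: "Iteration <n> <separator> <STATUS>", e.g. iteration 7 marked KEPT.
_HEADER = re.compile(r"^Iteration\s+(\d+)\s+\S+\s+(\w+)\s*$")
_FIELD = re.compile(r"^(Hypothesis|Change|Result|Lesson):\s*(.*)$")


def parse_lessons(text: str) -> List[Dict[str, str]]:
    """Turn lessons.md text into one dict per iteration (assumed layout)."""
    lessons: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    for line in text.splitlines():
        header = _HEADER.match(line)
        if header:
            current = {"iteration": header.group(1), "status": header.group(2)}
            lessons.append(current)
            continue
        entry = _FIELD.match(line)
        if entry and current is not None:
            # Store fields under lowercase keys: hypothesis, change, result, lesson.
            current[entry.group(1).lower()] = entry.group(2).strip()
    return lessons
```

A typical use would be filtering for entries whose `status` is `KEPT` to see which kinds of change have paid off so far.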

Parallel Experiments

Request parallel mode during confirmation or at any time:

You:   Go, parallel 4

Codex runs four hypotheses concurrently, keeps the best result, and discards the rest. This is useful when the hypothesis space is large.
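The "keep the best, discard the rest" step can be pictured as a simple selection over the measured metrics. This is a sketch under assumptions: `pick_best` is a hypothetical name, and treating a round in which no candidate beats the baseline as "discard all" is my reading rather than something the text states.

```python
from typing import Dict, Optional


def pick_best(results: Dict[str, float], baseline: float,
              lower_is_better: bool = True) -> Optional[str]:
    """Return the name of the winning hypothesis, or None if none improved."""
    if not results:
        return None
    chooser = min if lower_is_better else max
    name = chooser(results, key=results.get)   # best metric among the candidates
    value = results[name]
    improved = value < baseline if lower_is_better else value > baseline
    return name if improved else None          # assumption: no winner means discard all
```

With a baseline of 47 and candidate counts of 45, 41, 47, and 50, the candidate that reached 41 is kept and the other three are discarded.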

Pivot Protocol

If the loop stalls, escalation happens automatically:

Consecutive discards	Action
3	REFINE — narrow hypothesis, try smaller atomic changes
5	PIVOT — change strategy entirely
2 PIVOTs	Web search — Codex fetches external references to unstick itself

You are never asked for permission during escalation. The loop continues.
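The escalation table maps directly onto a threshold check. The sketch below is illustrative only: `escalation` is a hypothetical function, and how the skill counts discards after a PIVOT (for example, whether the counter resets) is not stated in the text, so the caller is assumed to track both counters.

```python
def escalation(consecutive_discards: int, pivots_so_far: int) -> str:
    """Map the stall counters onto the action named in the table above."""
    if pivots_so_far >= 2:
        return "WEB_SEARCH"    # two PIVOTs already tried: look for outside references
    if consecutive_discards >= 5:
        return "PIVOT"         # change strategy entirely
    if consecutive_discards >= 3:
        return "REFINE"        # narrow the hypothesis, try smaller changes
    return "CONTINUE"          # no escalation needed yet
```

Checking the PIVOT count first means a loop that has already pivoted twice goes straight to web search instead of refining again.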

Real Code Examples

Example 1 — TypeScript `any` elimination (Python verify script)

If you want a custom verify script instead of a one-liner:

# scripts/count_any.py
import subprocess, sys

result = subprocess.run(
    ["grep", "-r", "--include=*.ts", r"\bany\b", "src/"],
    capture_output=True, text=True
)
count = len(result.stdout.strip().splitlines())
print(count)
sys.exit(0)  # always exit 0; the number is what matters

Tell Codex during confirmation:

Verify: python scripts/count_any.py
Guard:  npx tsc --noEmit

Example 2 — pytest coverage loop (Python)

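# scripts/coverage_pct.py
import re
import subprocess
import sys

# Use run() rather than check_output(): pytest exits non-zero when any test
# fails, and the coverage total should still be reported in that case.
proc = subprocess.run(
    ["pytest", "--cov=src", "--cov-report=term", "-q"],
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
)
# The TOTAL row ends with the overall percentage, e.g. "TOTAL  120  18  85%".
# Matching the last column also works when branch coverage adds extra columns.
match = re.search(r"^TOTAL\s.*?(\d+)%\s*$", proc.stdout, re.MULTILINE)
print(int(match.group(1)) if match else 0)  # print 0 if no TOTAL row was found
sys.exit(0)  # always exit 0; the number is what matters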



$codex-autoresearch
Improve test coverage — target 85%

Verify: python scripts/coverage_pct.py
Guard:  python -m mypy src
Direction: higher
Target: 85
Iterations: 50

Example 3 — bundle size loop (Node.js project)

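#!/usr/bin/env bash
# scripts/bundle_size.sh
# Print the bundle size in KiB as a single number.
set -e                                  # stop if the build fails, so a stale bundle is never measured
npm run build --silent >/dev/null 2>&1  # discard build output so only the number is printed
du -k dist/bundle.js | awk '{print $1}'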



$codex-autoresearch
Reduce our JS bundle size, currently ~2300 KB, target under 900 KB

Verify: bash scripts/bundle_size.sh
Guard:  npm test
Direction: lower
Target: 900

Example 4 — lint warning count (any language)

#!/usr/bin/env bash
# scripts/lint_count.sh
# Print the total number of ESLint messages (warnings and errors) in src/.
npx eslint src/ --format json 2>/dev/null \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print(sum(len(f['messages']) for f in d))"



$codex-autoresearch
Get our ESLint warning count to zero

Verify: bash scripts/lint_count.sh
Direction: lower
Target: 0

Unattended Runs

For overnight or long runs, ensure Codex CLI approval settings do not interrupt git commit or git revert commands. The simplest option is to run in a disposable or sandboxed repo clone:

git clone . /tmp/autoresearch-sandbox
cd /tmp/autoresearch-sandbox
# launch Codex here with full permissions

Results accumulate in git history. Pull the winning commits back to your main repo when done:

# in your main repo (replace main with the branch used in the sandbox)
git fetch /tmp/autoresearch-sandbox main
git cherry-pick <winning-commit-sha>

Session Artifacts

File	Contents
`.agents/skills/codex-autoresearch/lessons.md`	Structured lessons from every iteration
`.agents/skills/codex-autoresearch/results.log`	Full per-iteration log (metric value, kept/reverted, elapsed)
`.agents/skills/codex-autoresearch/session.json`	Current session state for resume

These files persist across Codex sessions. Delete them to start fresh.

Troubleshooting

Loop reverts every change:

Verify command may be returning a non-numeric value. Test it manually: bash -c "<your verify command>" should print a single number.
Metric direction may be wrong. Confirm Direction: lower or Direction: higher during setup.
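The manual check above can also be scripted. The following is a rough sketch rather than anything shipped with the skill: `parse_metric` and `run_verify` are hypothetical helpers, and the 600-second timeout is an arbitrary assumption.

```python
import subprocess
from typing import Optional


def parse_metric(output: str) -> Optional[float]:
    """Return the metric if the output is exactly one number, else None."""
    text = output.strip()
    try:
        return float(text)   # rejects "85%", empty output, and multi-line output
    except ValueError:
        return None


def run_verify(command: str, timeout_s: int = 600) -> Optional[float]:
    """Run a verify command through bash and parse what it prints."""
    proc = subprocess.run(
        ["bash", "-c", command],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return parse_metric(proc.stdout)
```

If `run_verify` returns `None` for your command, the usual culprits are a trailing unit such as `%`, extra build output on stdout, or no output at all.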

Guard fires on unrelated files:

Narrow scope: Scope: src/specific-module/
Or tell Codex explicitly during confirmation: Do not touch tests/

Session resume picks up wrong baseline:

Delete session.json to force a fresh baseline: rm .agents/skills/codex-autoresearch/session.json

Parallel mode produces merge conflicts:

Codex handles this internally via the pivot protocol, but if it gets stuck, reduce parallelism: Parallel: 2

Codex asks questions mid-loop:

This means a guard crash produced ambiguous output. Pre-empt it by specifying Guard: <command> || true if guard failures should be non-fatal, or by giving Codex fuller sandbox permissions so it can run git commands freely.

Loop hits PIVOT but makes no progress:

Supply a seed hypothesis during confirmation: Hint: try tree-shaking unused imports first
Or run plan mode first to produce a richer hypothesis list before switching to loop.

Quick Reference

# Start a loop
$codex-autoresearch
<your goal in one sentence>

# Resume interrupted run
$codex-autoresearch
Resume

# Bounded run
$codex-autoresearch
<goal> — Iterations: 25

# Parallel hypotheses
$codex-autoresearch
<goal> — Parallel: 4

# Force a mode
$codex-autoresearch fix
pytest has 8 failures, repair them

# Read-only audit
$codex-autoresearch security
Audit src/api/ for injection vulnerabilities

Weekly Installs

113

Repository

aradotso/trending-skills

First Seen

3 days ago

Security Audits

Gen Agent Trust Hub: Fail
Socket: Warn
Snyk: Fail

Installed on

github-copilot: 113
codex: 113
warp: 113
kimi-cli: 113
gemini-cli: 113
amp: 113
