Important Prerequisite
Installing AI Skills requires a working proxy connection with TUN mode enabled; this directly determines whether the installation can complete successfully.
project-development by guanyang/antigravity-skills
npx skills add https://github.com/guanyang/antigravity-skills --skill project-development
This skill covers the principles for identifying tasks suited to LLM processing, designing effective project architectures, and iterating rapidly using agent-assisted development. The methodology applies whether building a batch processing pipeline, a multi-agent research system, or an interactive agent application.
Activate this skill when:
Evaluate task-model fit before writing any code, because building automation on a fundamentally mismatched task wastes days of effort. Run every proposed task through these two tables to decide proceed-or-stop.
Proceed when the task has these characteristics:
| Characteristic | Rationale |
|---|---|
| Synthesis across sources | LLMs combine information from multiple inputs better than rule-based alternatives |
| Subjective judgment with rubrics | Grading, evaluation, and classification with criteria map naturally to language reasoning |
| Natural language output | When the goal is human-readable text, LLMs deliver it natively |
| Error tolerance | Individual failures do not break the overall system, so LLM non-determinism is acceptable |
| Batch processing | No conversational state required between items, which keeps context clean |
| Domain knowledge in training | The model already has relevant context, reducing prompt engineering overhead |
Stop when the task has these characteristics:
| Characteristic | Rationale |
|---|---|
| Precise computation | Math, counting, and exact algorithms are unreliable in language models |
| Real-time requirements | LLM latency is too high for sub-second responses |
| Perfect accuracy requirements | Hallucination risk makes 100% accuracy impossible |
| Proprietary data dependence | The model lacks necessary context and cannot acquire it from prompts alone |
| Sequential dependencies | Each step depends heavily on the previous result, compounding errors |
| Deterministic output requirements | Same input must produce identical output, which LLMs cannot guarantee |
Always validate task-model fit with a manual test before investing in automation. Copy one representative input into the model interface, evaluate the output quality, and use the result to answer these questions:
Do this because a failed manual prototype predicts a failed automated system, while a successful one provides both a quality baseline and a prompt-design template. The test takes minutes and prevents hours of wasted development.
Structure LLM projects as staged pipelines because separation of deterministic and non-deterministic stages enables fast iteration and cost control. Design each stage to be:
Use this canonical pipeline structure:
acquire -> prepare -> process -> parse -> render
Stages 1, 2, 4, and 5 are deterministic. Stage 3 is non-deterministic and expensive. Maintain this separation because it allows re-running the expensive LLM stage only when necessary, while iterating quickly on parsing and rendering.
Use the file system to track pipeline state rather than databases or in-memory structures, because file existence provides natural idempotency and human-readable debugging.
data/{id}/
raw.json # acquire stage complete
prompt.md # prepare stage complete
response.md # process stage complete
parsed.json # parse stage complete
Check if an item needs processing by checking whether the output file exists. Re-run a stage by deleting its output file and downstream files. Debug by reading the intermediate files directly. This pattern works because each directory is independent, enabling simple parallelization and trivial caching.
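The existence-checking pattern above can be sketched in Python. A minimal sketch, assuming the directory layout shown; the helper names and the sample item are illustrative, not part of any real API:

```python
from pathlib import Path

# Output files, in pipeline order; each marks its stage as complete.
STAGES = ["raw.json", "prompt.md", "response.md", "parsed.json"]

def needs_stage(item_dir: Path, output_name: str) -> bool:
    """A stage still needs to run exactly when its output file is absent."""
    return not (item_dir / output_name).exists()

def rerun_from(item_dir: Path, output_name: str) -> None:
    """Re-run a stage by deleting its output file and all downstream files."""
    idx = STAGES.index(output_name)
    for name in STAGES[idx:]:
        (item_dir / name).unlink(missing_ok=True)

item = Path("data") / "item-001"
item.mkdir(parents=True, exist_ok=True)
(item / "raw.json").write_text("{}")      # simulate: acquire stage complete
print(needs_stage(item, "raw.json"))      # False: acquire already done
print(needs_stage(item, "prompt.md"))     # True: prepare still pending
rerun_from(item, "raw.json")              # wipe this item back to the start
print(needs_stage(item, "raw.json"))      # True again
```

Because each item directory is independent, a driver can scan `data/*/`, skip complete items, and process pending ones in parallel with no shared state.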
Design prompts for structured, parseable outputs because prompt design directly determines parsing reliability. Include these elements in every structured prompt:
Build parsers that handle LLM output variations gracefully, because LLMs do not follow instructions perfectly. Use regex patterns flexible enough for minor formatting variations, provide sensible defaults when sections are missing, and log parsing failures for review rather than crashing.
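A minimal sketch of such a lenient parser, assuming a rubric-style response; the "Score" and "Summary" section names are hypothetical placeholders for whatever structure your prompt requests:

```python
import logging
import re

def parse_response(text: str) -> dict:
    """Extract fields tolerantly: defaults on missing sections, logs instead of crashing."""
    result = {"score": None, "summary": ""}

    # Tolerate variations like "Score: 7", "**Score:** 7", "score - 7".
    m = re.search(r"score\W{0,5}(\d{1,2})", text, re.IGNORECASE)
    if m:
        result["score"] = int(m.group(1))
    else:
        logging.warning("score section missing; keeping default None")

    m = re.search(r"summary\W{0,5}(.+)", text, re.IGNORECASE)
    if m:
        result["summary"] = m.group(1).strip()
    else:
        logging.warning("summary section missing; keeping default ''")

    return result

print(parse_response("**Score:** 8\nSummary: held up well."))
# → {'score': 8, 'summary': 'held up well.'}
```

Logged failures accumulate into a review list, so a handful of malformed responses degrade into defaults rather than aborting a whole batch run.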
Use agent-capable models to accelerate development through rapid iteration: describe the project goal and constraints, let the agent generate initial implementation, test and iterate on specific failures, then refine prompts and architecture based on results.
Adopt these practices because they keep agent output focused and high-quality:
Estimate LLM processing costs before starting, because token costs compound quickly at scale and late discovery of budget overruns forces costly rework. Use this formula:
Total cost = (items x tokens_per_item x price_per_token) + API overhead
For batch processing, estimate input tokens per item (prompt + context), estimate output tokens per item (typical response length), multiply by item count, and add 20-30% buffer for retries and failures.
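The formula and buffer above can be turned into a quick estimator. The item counts and per-million-token prices below are illustrative placeholders, not real API rates:

```python
def estimate_cost(items: int, in_tokens: int, out_tokens: int,
                  in_price_per_mtok: float, out_price_per_mtok: float,
                  buffer: float = 0.25) -> float:
    """Batch cost estimate: per-item token costs times item count, plus a retry buffer."""
    base = items * (in_tokens * in_price_per_mtok +
                    out_tokens * out_price_per_mtok) / 1_000_000
    return base * (1 + buffer)  # 20-30% buffer for retries and failures

# e.g. 930 items, 4k input / 1k output tokens, at $3 / $15 per million tokens
cost = estimate_cost(930, 4_000, 1_000, 3.0, 15.0)
print(f"${cost:,.2f}")
```

Running the estimate before the first API call turns "is this in budget?" into a one-line check instead of a mid-project surprise.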
Track actual costs during development. If costs exceed estimates significantly, reduce context length through truncation, use smaller models for simpler items, cache and reuse partial results, or add parallel processing to reduce wall-clock time.
Default to single-agent pipelines for batch processing with independent items, because they are simpler to manage, cheaper to run, and easier to debug. Escalate to multi-agent architectures only when one of these conditions holds:
Choose multi-agent for context isolation, not role anthropomorphization. Sub-agents get fresh context windows for focused subtasks, which prevents context degradation on long-running tasks.
See multi-agent-patterns skill for detailed architecture guidance.
Start with minimal architecture and add complexity only when production evidence proves it necessary, because over-engineered scaffolding often constrains rather than enables model performance.
Vercel's d0 agent achieved 100% success rate (up from 80%) by reducing from 17 specialized tools to 2 primitives: bash command execution and SQL. The file system agent pattern uses standard Unix utilities (grep, cat, find, ls) instead of custom exploration tools.
Reduce when:
Add complexity when:
See tool-design skill for detailed tool architecture guidance.
Plan for multiple architectural iterations from the start, because production agent systems at scale always require refactoring. Manus refactored their agent framework five times since launch. The Bitter Lesson suggests that structures added for current model limitations become constraints as models improve.
Build for change by following these practices:
Follow this template in order, because each step validates assumptions before the next step invests effort.
Task Analysis
Manual Validation
Architecture Selection
Cost Estimation
Development Plan
Example 1: Batch Analysis Pipeline (Karpathy's HN Time Capsule)
Task: Analyze 930 HN discussions from 10 years ago with hindsight grading.
Architecture:
Results: $58 total cost, ~1 hour execution, static HTML output.
Example 2: Architectural Reduction (Vercel d0)
Task: Text-to-SQL agent for internal analytics.
Before: 17 specialized tools, 80% success rate, 274s average execution.
After: 2 tools (bash + SQL), 100% success rate, 77s average execution.
Key insight: The semantic layer was already good documentation. Claude just needed access to read files directly.
See Case Studies for detailed analysis.
This skill connects to:
Internal references:
Related skills in this collection:
External resources:
Created: 2025-12-25 | Last Updated: 2026-03-17 | Author: Agent Skills for Context Engineering Contributors | Version: 1.1.0
Weekly Installs: 58
GitHub Stars: 544
First Seen: Jan 26, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: opencode (53), codex (51), cursor (49), github-copilot (49), gemini-cli (48), amp (47)