```
npx skills add https://github.com/crinkj/common-claude-setting --skill project-development
```
This skill covers the principles for identifying tasks suited to LLM processing, designing effective project architectures, and iterating rapidly using agent-assisted development. The methodology applies whether building a batch processing pipeline, a multi-agent research system, or an interactive agent application.
Activate this skill when:
Not every problem benefits from LLM processing. The first step in any project is evaluating whether the task characteristics align with LLM strengths. This evaluation should happen before writing any code.
LLM-suited tasks share these characteristics:
| Characteristic | Why It Fits |
|---|---|
| Synthesis across sources | LLMs excel at combining information from multiple inputs |
| Subjective judgment with rubrics | LLMs handle grading, evaluation, and classification with criteria |
| Natural language output | When the goal is human-readable text, not structured data |
| Error tolerance | Individual failures do not break the overall system |
| Batch processing | No conversational state required between items |
| Domain knowledge in training | The model already has relevant context |
LLM-unsuited tasks share these characteristics:
| Characteristic | Why It Fails |
|---|---|
| Precise computation | Math, counting, and exact algorithms are unreliable |
| Real-time requirements | LLM latency is too high for sub-second responses |
| Perfect accuracy requirements | Hallucination risk makes 100% accuracy impossible |
| Proprietary data dependence | The model lacks necessary context |
| Sequential dependencies | Each step depends heavily on the previous result |
| Deterministic output requirements | Same input must produce identical output |
The evaluation should happen through manual prototyping: take one representative example and test it directly with the target model before building any automation.
Before investing in automation, validate task-model fit with a manual test. Copy one representative input into the model interface. Evaluate the output quality. This takes minutes and prevents hours of wasted development.
This validation answers critical questions:
If the manual prototype fails, the automated system will fail. If it succeeds, you have a baseline for comparison and a template for prompt design.
LLM projects benefit from staged pipeline architectures where each stage is:
The canonical pipeline structure:
```
acquire → prepare → process → parse → render
```
Stages 1, 2, 4, and 5 are deterministic. Stage 3 is non-deterministic and expensive. This separation allows re-running the expensive LLM stage only when necessary, while iterating quickly on parsing and rendering.
Use the file system to track pipeline state rather than databases or in-memory structures. Each processing unit gets a directory. Each stage completion is marked by file existence.
```
data/{id}/
├── raw.json       # acquire stage complete
├── prompt.md      # prepare stage complete
├── response.md    # process stage complete
├── parsed.json    # parse stage complete
```
To check if an item needs processing: check if the output file exists. To re-run a stage: delete its output file and downstream files. To debug: read the intermediate files directly.
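The file-existence convention above can be sketched in a few lines of Python (a minimal sketch; the file names follow the example layout, and the helper names are illustrative, not part of any prescribed API):

```python
from pathlib import Path

# Stage outputs in dependency order; a stage is complete iff its file exists.
STAGE_OUTPUTS = ["raw.json", "prompt.md", "response.md", "parsed.json"]

def pending_stages(item_dir: Path) -> list[str]:
    """Return the stage outputs that still need to be produced for this item."""
    return [name for name in STAGE_OUTPUTS if not (item_dir / name).exists()]

def invalidate_from(item_dir: Path, stage_output: str) -> None:
    """Force a re-run of a stage by deleting its output and all downstream files."""
    idx = STAGE_OUTPUTS.index(stage_output)
    for name in STAGE_OUTPUTS[idx:]:
        (item_dir / name).unlink(missing_ok=True)
```

Because state lives entirely in the file system, a crashed run resumes by rerunning the driver: items with all four files are skipped automatically.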
This pattern provides:
When LLM outputs must be parsed programmatically, prompt design directly determines parsing reliability. The prompt must specify exact format requirements with examples.
Effective structure specification includes:
Example prompt structure:
```
Analyze the following and provide your response in exactly this format:

## Summary
[Your summary here]

## Score
Rating: [1-10]

## Details
- Key point 1
- Key point 2

Follow this format exactly because I will be parsing it programmatically.
```
The parsing code must handle variations gracefully. LLMs do not follow instructions perfectly. Build parsers that:
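A tolerant parser for the example format above might be sketched as follows (the section names match the example prompt; the fallback behavior is illustrative, not prescribed):

```python
import re

def parse_response(text: str) -> dict:
    """Extract '## Heading' sections, tolerating extra whitespace, case
    differences, and missing sections rather than raising."""
    sections = {}
    for match in re.finditer(r"^[ \t]*##\s*(.+?)\s*$", text, re.MULTILINE):
        name = match.group(1).strip().lower()
        start = match.end()
        nxt = re.search(r"^[ \t]*##\s", text[start:], re.MULTILINE)
        end = start + nxt.start() if nxt else len(text)
        sections[name] = text[start:end].strip()

    # Score: accept "Rating: 7", "7/10", or a bare number; None if absent.
    score = None
    m = re.search(r"\b(\d{1,2})\b", sections.get("score", ""))
    if m:
        score = int(m.group(1))
    return {"summary": sections.get("summary"),
            "score": score,
            "details": sections.get("details")}
```

Missing sections surface as `None` rather than exceptions, so one malformed response degrades a single item instead of aborting the batch.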
Modern agent-capable models can accelerate development significantly. The pattern is:
This is about rapid iteration: generate, test, fix, repeat. The agent handles boilerplate and initial structure while you focus on domain-specific requirements and edge cases.
Key practices for effective agent-assisted development:
LLM processing has predictable costs that should be estimated before starting. The formula:
```
Total cost = (items × tokens_per_item × price_per_token) + API overhead
```
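Plugging numbers into the formula looks like this (the token counts and per-million-token prices below are placeholder assumptions, not current rates for any model):

```python
def estimate_cost(items: int,
                  input_tokens: int, output_tokens: int,
                  price_in_per_mtok: float, price_out_per_mtok: float,
                  overhead: float = 0.0) -> float:
    """Estimate batch cost in dollars; prices are per million tokens."""
    per_item = (input_tokens * price_in_per_mtok
                + output_tokens * price_out_per_mtok) / 1_000_000
    return items * per_item + overhead

# Hypothetical: 930 items, ~8k input / 2k output tokens each,
# at $3/Mtok input and $15/Mtok output.
cost = estimate_cost(930, 8_000, 2_000, 3.0, 15.0)
```

Running the estimate before the batch makes the per-item cost explicit, so a decision like "use a cheaper model for the first pass" can be made on numbers rather than surprise invoices.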
For batch processing:
Track actual costs during development. If costs exceed estimates significantly, re-evaluate the approach. Consider:
Single-agent pipelines work for:
Multi-agent architectures work for:
The primary reason for multi-agent is context isolation, not role anthropomorphization. Sub-agents get fresh context windows for focused subtasks. This prevents context degradation on long-running tasks.
See multi-agent-patterns skill for detailed architecture guidance.
Start with minimal architecture. Add complexity only when proven necessary. Production evidence shows that removing specialized tools often improves performance.
Vercel's d0 agent achieved 100% success rate (up from 80%) by reducing from 17 specialized tools to 2 primitives: bash command execution and SQL. The file system agent pattern uses standard Unix utilities (grep, cat, find, ls) instead of custom exploration tools.
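The two-primitive idea can be sketched as a minimal tool registry (a hypothetical wiring; d0's internals are not public, so this only illustrates the shape, using SQLite as a stand-in backend):

```python
import sqlite3
import subprocess

def run_bash(command: str, timeout: int = 30) -> str:
    """Primitive 1: execute a shell command and return its combined output.
    Exploration (grep/cat/find/ls over schema docs) happens through this."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr

def run_sql(db_path: str, query: str) -> list[tuple]:
    """Primitive 2: execute a SQL statement against the analytics database."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(query).fetchall()

# The agent sees exactly two tools instead of 17 specialized ones.
TOOLS = {"bash": run_bash, "sql": run_sql}
```

The design point is that both primitives are composable: anything a bespoke "list_tables" or "describe_schema" tool would do can be expressed as a SQL query or a shell one-liner, so removing the specialized tools removes decision overhead without removing capability.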
When reduction outperforms complexity:
When complexity is necessary:
See tool-design skill for detailed tool architecture guidance.
Expect to refactor. Production agent systems at scale require multiple architectural iterations; Manus has refactored its agent framework five times since launch. The Bitter Lesson suggests that structure added to work around current model limitations becomes a constraint as models improve.
Build for change:
1. Task Analysis
2. Manual Validation
3. Architecture Selection
4. Cost Estimation
5. Development Plan
- **Skipping manual validation**: Building automation before verifying the model can do the task wastes significant time when the approach is fundamentally flawed.
- **Monolithic pipelines**: Combining all stages into one script makes debugging and iteration difficult. Separate stages with persistent intermediate outputs.
- **Over-constraining the model**: Adding guardrails, pre-filtering, and validation logic that the model could handle on its own. Test whether your scaffolding helps or hurts.
- **Ignoring costs until production**: Token costs compound quickly at scale. Estimate and track from the beginning.
- **Perfect parsing requirements**: Expecting LLMs to follow format instructions perfectly. Build robust parsers that handle variations.
- **Premature optimization**: Adding caching, parallelization, and optimization before the basic pipeline works correctly.
Example 1: Batch Analysis Pipeline (Karpathy's HN Time Capsule)
Task: Analyze 930 HN discussions from 10 years ago with hindsight grading.
Architecture:
Results: $58 total cost, ~1 hour execution, static HTML output.
Example 2: Architectural Reduction (Vercel d0)
Task: Text-to-SQL agent for internal analytics.
Before: 17 specialized tools, 80% success rate, 274s average execution.
After: 2 tools (bash + SQL), 100% success rate, 77s average execution.
Key insight: The semantic layer was already good documentation. Claude just needed access to read files directly.
See Case Studies for detailed analysis.
This skill connects to:
Internal references:
Related skills in this collection:
External resources:
Created: 2025-12-25 · Last Updated: 2025-12-25 · Author: Agent Skills for Context Engineering Contributors · Version: 1.0.0