agent-orchestration-improve-agent by sickn33/antigravity-awesome-skills
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill agent-orchestration-improve-agent
Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration.
[Extended thinking: Agent optimization requires a data-driven approach combining performance metrics, user feedback analysis, and advanced prompt engineering techniques. Success depends on systematic evaluation, targeted improvements, and rigorous testing with rollback capabilities for production safety.]
Comprehensive analysis of agent performance using context-manager for historical data collection.
Use: context-manager
Command: analyze-agent-performance $ARGUMENTS --days 30
Collect metrics including:
Identify recurring patterns in user interactions:
Categorize failures by root cause:
Generate quantitative baseline metrics:
Performance Baseline:
- Task Success Rate: [X%]
- Average Corrections per Task: [Y]
- Tool Call Efficiency: [Z%]
- User Satisfaction Score: [1-10]
- Average Response Latency: [Xms]
- Token Efficiency Ratio: [X:Y]
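As a rough illustration, the baseline above can be derived from interaction logs. This sketch assumes hypothetical record fields (`success`, `corrections`, `tokens_in`, `tokens_out`, `latency_ms`); the skill itself does not prescribe a log schema.

```python
# Sketch: derive baseline metrics from a list of interaction log records.
# The record fields used here are assumptions for illustration only.

def compute_baseline(records):
    n = len(records)
    if n == 0:
        raise ValueError("no records to analyze")
    successes = sum(1 for r in records if r["success"])
    total_in = sum(r["tokens_in"] for r in records)
    total_out = sum(r["tokens_out"] for r in records)
    return {
        "task_success_rate": round(100 * successes / n, 1),           # [X%]
        "avg_corrections_per_task": round(
            sum(r["corrections"] for r in records) / n, 2),           # [Y]
        "avg_latency_ms": round(
            sum(r["latency_ms"] for r in records) / n, 1),            # [Xms]
        "token_efficiency_ratio": f"{total_in}:{total_out}",          # [X:Y]
    }
```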
Apply advanced prompt optimization techniques using prompt-engineer agent.
Implement structured reasoning patterns:
Use: prompt-engineer
Technique: chain-of-thought-optimization
Curate high-quality examples from successful interactions:
Example structure:
Good Example:
Input: [User request]
Reasoning: [Step-by-step thought process]
Output: [Successful response]
Why this works: [Key success factors]
Bad Example:
Input: [Similar request]
Output: [Failed response]
Why this fails: [Specific issues]
Correct approach: [Fixed version]
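The example structure above can be rendered into a few-shot prompt section mechanically. A minimal sketch, assuming a simple dict representation of each curated example:

```python
# Sketch: render curated good/bad examples into a few-shot prompt section
# following the structure above. Field names are illustrative assumptions.

def render_example(ex):
    if ex["kind"] == "good":
        return (
            f"Good Example:\n"
            f"Input: {ex['input']}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"Output: {ex['output']}\n"
            f"Why this works: {ex['why']}"
        )
    return (
        f"Bad Example:\n"
        f"Input: {ex['input']}\n"
        f"Output: {ex['output']}\n"
        f"Why this fails: {ex['why']}\n"
        f"Correct approach: {ex['fix']}"
    )

def build_few_shot_block(examples):
    # Join rendered examples with blank lines for the prompt.
    return "\n\n".join(render_example(ex) for ex in examples)
```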
Strengthen agent identity and capabilities:
Implement self-correction mechanisms:
Constitutional Principles:
1. Verify factual accuracy before responding
2. Self-check for potential biases or harmful content
3. Validate output format matches requirements
4. Ensure response completeness
5. Maintain consistency with previous responses
Add critique-and-revise loops:
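A minimal sketch of such a loop, combining the constitutional principles above with iterative revision; the check predicates here are illustrative placeholders for real validators or a critique model:

```python
# Sketch: apply a constitutional checklist before emitting a response,
# then revise until clean. Checks cover principles 3 and 4 above;
# real systems would add factuality, bias, and consistency checks.

def check_format(response, spec):
    # Principle 3: output format matches requirements (here: required keys).
    return all(key in response for key in spec["required_keys"])

def check_completeness(response, spec):
    # Principle 4: response completeness (here: a minimum-length heuristic).
    return len(response.get("answer", "")) >= spec.get("min_length", 1)

def self_check(response, spec):
    checks = {"format": check_format, "completeness": check_completeness}
    return [name for name, fn in checks.items() if not fn(response, spec)]

def critique_and_revise(response, spec, revise_fn, max_rounds=3):
    # Re-run the checklist, asking revise_fn to fix violations each round.
    for _ in range(max_rounds):
        violations = self_check(response, spec)
        if not violations:
            break
        response = revise_fn(response, violations)
    return response
```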
Optimize response structure:
Comprehensive testing framework with A/B comparison.
Create representative test scenarios:
Test Categories:
1. Golden path scenarios (common successful cases)
2. Previously failed tasks (regression testing)
3. Edge cases and corner scenarios
4. Stress tests (complex, multi-step tasks)
5. Adversarial inputs (potential breaking points)
6. Cross-domain tasks (combining capabilities)
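One way to keep these six categories covered is a small scenario registry; a sketch, with category identifiers and scenario fields chosen here for illustration:

```python
# Sketch: a minimal test-scenario registry keyed by the categories above.
from collections import defaultdict

CATEGORIES = [
    "golden_path", "regression", "edge_case",
    "stress", "adversarial", "cross_domain",
]

class ScenarioRegistry:
    def __init__(self):
        self._by_category = defaultdict(list)

    def add(self, category, prompt, expected):
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        self._by_category[category].append(
            {"prompt": prompt, "expected": expected})

    def coverage_gaps(self):
        # Categories that still have no scenarios.
        return [c for c in CATEGORIES if not self._by_category[c]]
```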
Compare original vs improved agent:
Use: parallel-test-runner
Config:
- Agent A: Original version
- Agent B: Improved version
- Test set: 100 representative tasks
- Metrics: Success rate, speed, token usage
- Evaluation: Blind human review + automated scoring
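The comparison above can be sketched as a small harness that runs both versions over the same test set; `score_fn` stands in for the blind-review plus automated-scoring step and is an assumption of this sketch:

```python
# Sketch: run the same test set against two agent versions and tally the
# metrics named above. `agent` is any callable returning (output, tokens).
import time

def run_ab_test(agent_a, agent_b, test_set, score_fn):
    results = {}
    for name, agent in (("A", agent_a), ("B", agent_b)):
        successes, tokens, elapsed = 0, 0, 0.0
        for case in test_set:
            start = time.perf_counter()
            output, used_tokens = agent(case["prompt"])
            elapsed += time.perf_counter() - start
            tokens += used_tokens
            successes += score_fn(output, case["expected"])
        n = len(test_set)
        results[name] = {
            "success_rate": successes / n,
            "avg_latency_s": elapsed / n,
            "total_tokens": tokens,
        }
    return results
```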
Statistical significance testing:
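For the success-rate metric, a two-proportion z-test is one standard choice; a stdlib-only sketch (the usual 0.05 threshold is conventional, not mandated by the skill):

```python
# Sketch: two-proportion z-test for the success-rate difference between
# the original (A) and improved (B) agent.
from math import sqrt, erf

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```

A jump from 50/100 to 75/100 successes is clearly significant; 50/100 to 52/100 is not.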
Comprehensive scoring framework:
Task-Level Metrics:
Quality Metrics:
Performance Metrics:
Structured human review process:
Safe rollout with monitoring and rollback capabilities.
Systematic versioning strategy:
Version Format: agent-name-v[MAJOR].[MINOR].[PATCH]
Example: customer-support-v2.3.1
MAJOR: Significant capability changes
MINOR: Prompt improvements, new examples
PATCH: Bug fixes, minor adjustments
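A small helper for the version format above; a sketch, assuming versions are stored as plain strings:

```python
# Sketch: parse and bump versions in the agent-name-v[MAJOR].[MINOR].[PATCH]
# format described above.
import re

VERSION_RE = re.compile(
    r"^(?P<name>.+)-v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$")

def bump(version, level):
    m = VERSION_RE.match(version)
    if not m:
        raise ValueError(f"not in agent-name-vX.Y.Z form: {version}")
    major, minor, patch = (int(m[g]) for g in ("major", "minor", "patch"))
    if level == "major":        # significant capability changes
        major, minor, patch = major + 1, 0, 0
    elif level == "minor":      # prompt improvements, new examples
        minor, patch = minor + 1, 0
    elif level == "patch":      # bug fixes, minor adjustments
        patch += 1
    else:
        raise ValueError(f"unknown bump level: {level}")
    return f"{m['name']}-v{major}.{minor}.{patch}"
```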
Maintain version history:
Progressive deployment strategy:
Quick recovery mechanism:
Rollback Triggers:
- Success rate drops >10% from baseline
- Critical errors increase >5%
- User complaints spike
- Cost per task increases >20%
- Safety violations detected
Rollback Process:
1. Detect issue via monitoring
2. Alert team immediately
3. Switch to previous stable version
4. Analyze root cause
5. Fix and re-test before retry
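The detection step above can be sketched as a trigger check over live metrics; the thresholds mirror the trigger list, while the metric field names are assumptions of this sketch (the user-complaint trigger is omitted since it needs an external signal):

```python
# Sketch: evaluate the rollback triggers above against live metrics,
# relative to the recorded baseline. Rates are fractions in [0, 1].

def rollback_triggers(baseline, live):
    triggers = []
    if live["success_rate"] < baseline["success_rate"] - 0.10:
        triggers.append("success rate drops >10% from baseline")
    if live["critical_error_rate"] > baseline["critical_error_rate"] + 0.05:
        triggers.append("critical errors increase >5%")
    if live["cost_per_task"] > baseline["cost_per_task"] * 1.20:
        triggers.append("cost per task increases >20%")
    if live.get("safety_violations", 0) > 0:
        triggers.append("safety violations detected")
    return triggers

def should_rollback(baseline, live):
    return bool(rollback_triggers(baseline, live))
```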
Real-time performance tracking:
Agent improvement is successful when:
After 30 days of production use:
Establish regular improvement cadence:
Remember: Agent optimization is an iterative process. Each cycle builds upon previous learnings, gradually improving performance while maintaining stability and safety.
Weekly Installs: 211
GitHub Stars: 27.1K
First Seen: Jan 28, 2026
Security Audits: Gen Agent Trust Hub: Pass, Socket: Pass, Snyk: Pass
Installed on: opencode (199), gemini-cli (193), codex (191), github-copilot (189), cursor (178), amp (176)