npx skills add https://smithery.ai/skills/markusdegen/production-readiness
89% of organizations have implemented observability for agents, but 32% cite quality issues as the primary barrier to production. Production MAS requires first-class treatment of cost optimization, observability, governance and compliance, error handling, and multi-channel deployment.
Multi-agent systems can cost 2-4x more than single agents. Systematic optimization achieves 30-50% reductions.
{
  "token_budgets": {
    "Planner": {
      "max_input_tokens": 4000,
      "max_output_tokens": 2000,
      "model": "claude-3-haiku"
    },
    "Executor": {
      "max_input_tokens": 8000,
      "max_output_tokens": 4000,
      "model": "claude-3-sonnet"
    },
    "Verifier": {
      "max_input_tokens": 6000,
      "max_output_tokens": 1000,
      "model": "claude-3-haiku"
    }
  }
}
def check_budget_before_call(agent_id: str, context: str, estimated_tokens: int) -> str:
    """Compress the context if the estimated input exceeds the agent's budget."""
    budget = token_budgets[agent_id]
    if estimated_tokens > budget["max_input_tokens"]:
        # Compress context before proceeding
        return compress_context(context, budget["max_input_tokens"])
    return context
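The check above needs a token estimate before the model call is made. A crude heuristic sketch (the ~4-characters-per-token ratio is an assumption, not from the source; production code should use the provider's tokenizer):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for budget checks.

    The chars_per_token ratio is a heuristic for English text;
    swap in the model provider's tokenizer for accurate counts.
    """
    return max(1, round(len(text) / chars_per_token))
```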
Route tasks to appropriate model tiers:
| Task Type | Model | Cost Savings |
|---|---|---|
| Classification | Haiku | 90% |
| Summarization | Haiku | 90% |
| Simple extraction | Haiku | 90% |
| Complex reasoning | Sonnet | Baseline |
| Critical decisions | Opus | -200% (worth it) |
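The routing table can be sketched as a plain lookup with a conservative fallback (the task-type keys and the mid-tier default are illustrative assumptions):

```python
# Map task types to the cheapest model that handles them well.
MODEL_TIERS = {
    "classification": "claude-3-haiku",
    "summarization": "claude-3-haiku",
    "simple_extraction": "claude-3-haiku",
    "complex_reasoning": "claude-3-sonnet",
    "critical_decision": "claude-3-opus",
}

def route_model(task_type: str) -> str:
    # Unknown task types fall back to the mid-tier model
    # rather than the most expensive one.
    return MODEL_TIERS.get(task_type, "claude-3-sonnet")
```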
| Strategy | Token Savings | Use Case |
|---|---|---|
| Sliding window + summarization | 60-70% | Conversations >5 turns |
| Concise output formats | 70-85% | Agent-to-agent chaining |
| IDs-only formats | 85-95% | Bulk operations |
| Selective context loading | 40-60% | Large knowledge bases |
Problem: Agents accumulate context without pruning.
Impact: Token costs escalate rapidly, and systems fail when they hit context limits.
Fix: Implement max_history_tokens budgets and compress older context.
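A minimal sketch of the sliding-window fix under a max_history_tokens budget, assuming a crude character-based token estimate and a caller-supplied summarize callback (both assumptions for illustration):

```python
def trim_history(messages: list[str], max_history_tokens: int,
                 summarize=lambda msgs: "[summary of earlier turns]") -> list[str]:
    """Keep the most recent messages within budget; replace the
    overflow with a single summary message."""
    def tokens(msg: str) -> int:
        return max(1, len(msg) // 4)  # crude heuristic, not a real tokenizer

    kept, used = [], 0
    for msg in reversed(messages):          # walk backwards from the newest
        if used + tokens(msg) > max_history_tokens:
            break
        kept.append(msg)
        used += tokens(msg)
    kept.reverse()
    dropped = messages[: len(messages) - len(kept)]
    return ([summarize(dropped)] if dropped else []) + kept
```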
Capture complete execution paths:
{
  "trace": {
    "trace_id": "workflow_uuid",
    "spans": [
      {
        "span_id": "uuid",
        "parent_span_id": null,
        "operation": "workflow.start",
        "agent": "Orchestrator",
        "start_time": "ISO8601",
        "end_time": "ISO8601",
        "attributes": {
          "user_request": "...",
          "tokens_input": 150
        }
      },
      {
        "span_id": "uuid",
        "parent_span_id": "prev_uuid",
        "operation": "plan.create",
        "agent": "Planner",
        "attributes": {
          "tokens_input": 1200,
          "tokens_output": 800,
          "model": "claude-3-haiku",
          "latency_ms": 450
        }
      }
    ]
  }
}
| Category | Metrics |
|---|---|
| Cost | Tokens per agent, cost per workflow, model utilization ratio |
| Latency | Per-agent latency, total workflow latency, queue wait time |
| Quality | Success rate, verification pass rate, retry rate |
| Coordination | Message volume, conflict rate, escalation rate |
Log state at key points:
def log_state_transition(
    agent_id: str,
    operation: str,
    state_before: dict,
    state_after: dict
):
    """Log state changes for replay capability."""
    logger.info({
        "event": "state_transition",
        "agent": agent_id,
        "operation": operation,
        "state_diff": compute_diff(state_before, state_after),
        "timestamp": datetime.utcnow().isoformat()
    })
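compute_diff is referenced above but not defined in the source; a minimal flat-dictionary version might look like:

```python
def compute_diff(before: dict, after: dict) -> dict:
    """Return only the keys that were added, removed, or changed."""
    diff = {}
    for key in before.keys() | after.keys():
        if before.get(key) != after.get(key):
            diff[key] = {"before": before.get(key), "after": after.get(key)}
    return diff
```

Logging only the diff keeps state-transition events small even when the full agent state is large; nested state would need a recursive variant.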
Use the MAST taxonomy (Multi-Agent System Failure Taxonomy) for automated failure classification:
| Category | Percentage | Examples |
|---|---|---|
| Specification | 41.77% | Missing inputs, vague outputs, no constraints |
| Alignment | 36.94% | Silent ignoring, role confusion, state conflicts |
| Verification | 21.30% | Weak checks, premature termination, self-grading |
Static permission lists are insufficient. Implement dynamic governance:
1. Certified Identity and Purpose
{
  "agent": {
    "id": "planner_001",
    "name": "TaskPlanner",
    "purpose": "Decompose requirements into task graphs",
    "risk_level": "low",
    "data_access": ["requirements", "constraints"],
    "forbidden_data": ["pii", "credentials"]
  }
}
2. Central Policy Engine
def check_policy(agent_id: str, action: str, resource: str) -> bool:
    """Check if action is permitted for agent."""
    agent = get_agent_config(agent_id)
    checks = [
        action_aligns_with_purpose(action, agent.purpose),
        resource_in_allowed_data(resource, agent.data_access),
        not_in_forbidden(resource, agent.forbidden_data),
        not_chaining_too_many_actions(agent_id)
    ]
    return all(checks)
3. Runtime Enforcement with Audit
def enforce_and_audit(agent_id: str, action: str, resource: str):
    """Intercept, enforce, and audit agent actions."""
    # Check policy
    permitted = check_policy(agent_id, action, resource)
    # Audit regardless of outcome
    audit_log.append({
        "timestamp": datetime.utcnow().isoformat(),
        "agent": agent_id,
        "action": action,
        "resource": resource,
        "permitted": permitted,
        "policy_version": current_policy_version
    })
    if not permitted:
        raise PolicyViolation(f"{agent_id} cannot {action} on {resource}")
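As a usage sketch, the enforcement check can be attached to tools with a decorator. The stub policy table below stands in for the central policy engine purely for illustration:

```python
class PolicyViolation(Exception):
    pass

# Illustrative stub: agent id -> resources it may read.
# A real deployment would call the central policy engine instead.
ALLOWED = {"planner_001": {"requirements", "constraints"}}

def guarded(agent_id: str, resource: str):
    """Decorator sketch: enforce the policy check before a tool runs."""
    def wrap(fn):
        def inner(*args, **kwargs):
            if resource not in ALLOWED.get(agent_id, set()):
                raise PolicyViolation(f"{agent_id} cannot read {resource}")
            return fn(*args, **kwargs)
        return inner
    return wrap

@guarded("planner_001", "requirements")
def read_requirements():
    return ["REQ-1", "REQ-2"]
```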
Key requirements for AI agents:
| Requirement | Implementation |
|---|---|
| Transparency | Document agent capabilities and limitations |
| Human oversight | Escalation paths, human-in-the-loop for critical decisions |
| Risk management | Classify agents by risk, implement proportional controls |
| Technical documentation | Maintain specs, audit logs, test results |
| Accuracy/robustness | Verification processes, failure handling |
SOC 2 Type II audits now scrutinize:
Principle: Feed errors back into context so agents can self-correct. Implement error counters to prevent spin-outs.
When an agent encounters an error, feed it back as context for self-correction:
from collections import defaultdict
from datetime import datetime

class ToolSpinOut(Exception):
    """Raised when a single tool keeps failing."""

class WorkflowSpinOut(Exception):
    """Raised when the workflow accumulates too many errors overall."""

class ErrorContextManager:
    MAX_ERRORS_PER_TOOL = 3
    MAX_TOTAL_ERRORS = 5

    def __init__(self):
        self.error_counts = defaultdict(int)
        self.error_history = []

    def record_error(self, tool: str, error: dict) -> dict:
        """Record error and return context for retry."""
        self.error_counts[tool] += 1
        self.error_history.append({
            "tool": tool,
            "error": error,
            "timestamp": datetime.utcnow().isoformat(),
            "attempt": self.error_counts[tool]
        })
        # Check for spin-out
        if self.error_counts[tool] >= self.MAX_ERRORS_PER_TOOL:
            raise ToolSpinOut(f"{tool} failed {self.MAX_ERRORS_PER_TOOL} times")
        if sum(self.error_counts.values()) >= self.MAX_TOTAL_ERRORS:
            raise WorkflowSpinOut(f"Total errors exceeded {self.MAX_TOTAL_ERRORS}")
        # Return context for self-correction
        return {
            "previous_error": error,
            "attempt_number": self.error_counts[tool],
            "suggestion": f"Previous attempt failed: {error['message']}. Try a different approach."
        }
Feed errors back to the agent in a structured format:
{
"role": "system",
"content": "PREVIOUS ATTEMPT FAILED\n\nError: [error message]\nAttempt: 2 of 3\n\nPlease try a different approach. Consider:\n- Alternative method X\n- Check assumption Y\n- Verify input Z"
}
| Counter | Threshold | Action |
|---|---|---|
| Per-tool errors | 3 | Stop using that tool, try alternative |
| Total errors | 5 | Pause workflow, request human help |
| Retry without progress | 2 | Force different approach |
| Same error repeated | 2 | Escalate immediately |
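The "same error repeated" rule from the table can be sketched as a small detector over recent error signatures:

```python
from collections import deque

class RepeatDetector:
    """Escalate immediately when the same error signature repeats."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.recent = deque(maxlen=threshold)  # sliding window of signatures

    def should_escalate(self, error_message: str) -> bool:
        self.recent.append(error_message)
        # Window is full and every entry is identical -> the agent is stuck.
        return (len(self.recent) == self.threshold
                and len(set(self.recent)) == 1)
```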
consecutive_errors = 0
while True:
    try:
        result = await handle_next_step(thread, next_step)
        thread["events"].append({
            "type": next_step["intent"] + "_result",
            "data": result,
        })
        # Success! Reset the error counter
        consecutive_errors = 0
    except Exception as e:
        consecutive_errors += 1
        if consecutive_errors < 3:
            # Feed error back into context and retry
            thread["events"].append({
                "type": "error",
                "data": format_error_for_context(e),
            })
        else:
            # Too many errors - break loop, escalate
            await escalate_to_human(thread, e)
            break
See references/ops-runbook.md for detailed error handling procedures.
Principle: Meet users where they are. Agent triggering should be channel-agnostic.
class WorkflowTrigger:
    """
    Channel-agnostic workflow triggering.
    Same workflow can be started from any channel.
    """
    def trigger(self,
                input_data: dict,
                channel: str,
                user_id: str,
                metadata: dict = None) -> str:
        """
        Trigger workflow from any channel.

        Args:
            input_data: The actual request/task
            channel: Where this came from (slack, email, cli, api, webhook)
            user_id: Who triggered it
            metadata: Channel-specific metadata

        Returns:
            workflow_id for tracking
        """
        # Normalize input regardless of channel
        normalized = self.normalize_input(input_data, channel)
        # Create workflow with channel context
        workflow_id = self.workflow_controller.launch(
            input=normalized,
            context={
                "channel": channel,
                "user_id": user_id,
                "reply_to": self.get_reply_destination(channel, metadata)
            }
        )
        return workflow_id
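normalize_input is referenced above but not defined in the source; a sketch (the payload field names are assumptions) that flattens channel-specific payloads into one canonical shape:

```python
def normalize_input(input_data: dict, channel: str) -> dict:
    """Map channel-specific payloads to one canonical request shape."""
    if channel == "slack":
        text = input_data.get("text", "")
    elif channel == "email":
        # Combine subject and body into a single request string.
        text = input_data.get("subject", "") + "\n" + input_data.get("body", "")
    elif channel in ("api", "webhook"):
        text = input_data.get("request", "")
    else:  # cli and anything else: pass the raw argument through
        text = input_data.get("raw", "")
    return {"request": text.strip(), "source_channel": channel}
```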
| Channel | Trigger Method | Response Method |
|---|---|---|
| Slack | Slash command, mention, DM | Thread reply |
| Email | Send to agent@domain.com | Reply email |
| CLI | mas run <workflow> | Stdout |
| API | POST /workflows | Webhook or poll |
| Webhook | POST from external system | Callback URL |
| Dashboard | Button click | UI notification |
┌─────────┐ ┌─────────────────┐ ┌──────────────┐
│ Slack │────►│ │────►│ │
├─────────┤ │ Unified │ │ Workflow │
│ Email │────►│ Trigger │────►│ Engine │
├─────────┤ │ Interface │ │ │
│ CLI │────►│ │────►│ │
├─────────┤ │ │ │ │
│ API │────►│ │────►│ │
└─────────┘ └─────────────────┘ └──────────────┘
When workflow completes, route response back through original channel:
def complete_workflow(self, workflow_id: str, result: dict):
    """Route result back to originating channel."""
    context = self.get_workflow_context(workflow_id)
    reply_to = context["reply_to"]
    match reply_to["type"]:
        case "slack":
            self.slack_client.post_message(
                reply_to["channel"],
                format_slack_response(result)
            )
        case "email":
            self.email_client.send(
                reply_to["to"],
                format_email_response(result)
            )
        case "webhook":
            self.http_client.post(reply_to["url"], result)
        case "cli":
            print(format_cli_response(result))
channels:
  slack:
    enabled: true
    app_token: ${SLACK_APP_TOKEN}
    triggers:
      - slash_command: /mas
      - mention: "@mas-agent"
    response_format: slack_blocks
  email:
    enabled: true
    inbox: mas-agent@company.com
    response_format: html
  api:
    enabled: true
    auth: bearer_token
    rate_limit: 100/minute
  cli:
    enabled: true
    response_format: text
See references/ops-runbook.md for multi-channel deployment procedures.
For detailed implementation guides:
references/cost-optimization.md - Detailed cost reduction strategies
references/compliance-details.md - Full compliance requirements
references/ops-runbook.md - Operational procedures (includes error handling and multi-channel)
../agent-specification/references/twelve-factor-agents.md - Quick reference for all 12 factors