customaize-agent:context-engineering by neolabhq/context-engineering-kit
npx skills add https://github.com/neolabhq/context-engineering-kit --skill customaize-agent:context-engineering

Context is the complete state available to a language model at inference time. It includes everything the model can attend to when generating responses: system instructions, tool definitions, retrieved documents, message history, and tool outputs. Understanding context fundamentals is prerequisite to effective context engineering.
Context comprises several distinct components, each with different characteristics and constraints. The attention mechanism creates a finite budget that constrains effective context usage. Progressive disclosure manages this constraint by loading information only as needed. The engineering discipline is curating the smallest high-signal token set that achieves desired outcomes.
System Prompts System prompts establish the agent's core identity, constraints, and behavioral guidelines. They are loaded once at session start and typically persist throughout the conversation. System prompts should be extremely clear and use simple, direct language at the right altitude for the agent.
The right altitude balances two failure modes. At one extreme, engineers hardcode complex brittle logic that creates fragility and maintenance burden. At the other extreme, engineers provide vague high-level guidance that fails to give concrete signals for desired outputs or falsely assumes shared context. The optimal altitude strikes a balance: specific enough to guide behavior effectively, yet flexible enough to provide strong heuristics.
Organize prompts into distinct sections using XML tagging or Markdown headers to delineate background information, instructions, tool guidance, and output description. The exact formatting matters less as models become more capable, but structural clarity remains valuable.
Tool Definitions Tool definitions specify the actions an agent can take. Each tool includes a name, description, parameters, and return format. Tool definitions live near the front of context after serialization, typically before or after the system prompt.
Tool descriptions collectively steer agent behavior. Poor descriptions force agents to guess; optimized descriptions include usage context, examples, and defaults. The consolidation principle states that if a human engineer cannot definitively say which tool should be used in a given situation, an agent cannot be expected to do better.
Retrieved Documents Retrieved documents provide domain-specific knowledge, reference materials, or task-relevant information. Agents use retrieval augmented generation to pull relevant documents into context at runtime rather than pre-loading all possible information.
The just-in-time approach maintains lightweight identifiers (file paths, stored queries, web links) and uses these references to load data into context dynamically. This mirrors human cognition: we generally do not memorize entire corpuses of information but rather use external organization and indexing systems to retrieve relevant information on demand.
Message History Message history contains the conversation between the user and agent, including previous queries, responses, and reasoning. For long-running tasks, message history can grow to dominate context usage.
Message history serves as scratchpad memory where agents track progress, maintain task state, and preserve reasoning across turns. Effective management of message history is critical for long-horizon task completion.
Tool Outputs Tool outputs are the results of agent actions: file contents, search results, command execution output, API responses, and similar data. Tool outputs comprise the majority of tokens in typical agent trajectories, with research showing observations (tool outputs) can reach 83.9% of total context usage.
Tool outputs consume context whether they are relevant to current decisions or not. This creates pressure for strategies like observation masking, compaction, and selective tool result retention.
The Attention Budget Constraint Language models process tokens through attention mechanisms that create pairwise relationships between all tokens in context. For n tokens, this creates n^2 relationships that must be computed and stored. As context length increases, the model's ability to capture these relationships gets stretched thin.
Models develop attention patterns from training data distributions where shorter sequences predominate. This means models have less experience with and fewer specialized parameters for context-wide dependencies. The result is an "attention budget" that depletes as context grows.
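The quadratic scaling described above can be made concrete with a toy sketch (illustrative only; real attention implementations batch and approximate this computation):

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of pairwise relationships full attention forms over a context."""
    return n_tokens * n_tokens

# Doubling the context length quadruples the pairwise relationships,
# while the model's trained attention capacity stays fixed.
growth = attention_pairs(20_000) / attention_pairs(10_000)  # 4x the work for 2x the tokens
```

This is why "just add more context" carries a superlinear cost in both compute and attention dilution.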
Position Encoding and Context Extension Position encoding interpolation allows models to handle sequences longer than those seen in training by remapping positions into the originally trained range. However, this adaptation introduces degradation in token position understanding. Models remain highly capable at longer contexts but show reduced precision for information retrieval and long-range reasoning compared to performance on shorter contexts.
The Progressive Disclosure Principle Progressive disclosure manages context efficiently by loading information only as needed. At startup, agents load only skill names and descriptions--sufficient to know when a skill might be relevant. Full content loads only when a skill is activated for specific tasks.
This approach keeps agents fast while giving them access to more context on demand. The principle applies at multiple levels: skill selection, document loading, and even tool result retrieval.
The assumption that larger context windows solve memory problems has been empirically debunked. Context engineering means finding the smallest possible set of high-signal tokens that maximize the likelihood of desired outcomes.
Several factors create pressure for context efficiency. Processing cost grows disproportionately with context length--not just double the cost for double the tokens, but exponentially more in time and computing resources. Model performance degrades beyond certain context lengths even when the window technically supports more tokens. Long inputs remain expensive even with prefix caching.
The guiding principle is informativity over exhaustiveness. Include what matters for the decision at hand, exclude what does not, and design systems that can access additional information on demand.
Context must be treated as a finite resource with diminishing marginal returns. Like humans with limited working memory, language models have an attention budget drawn on when parsing large volumes of context.
Every new token introduced depletes this budget by some amount. This creates the need for careful curation of available tokens. The engineering problem is optimizing utility against inherent constraints.
Context engineering is iterative and the curation phase happens each time you decide what to pass to the model. It is not a one-time prompt writing exercise but an ongoing discipline of context management.
Agents with filesystem access can use progressive disclosure naturally. Store reference materials, documentation, and data externally. Load files only when needed using standard filesystem operations. This pattern avoids stuffing context with information that may not be relevant.
The file system itself provides structure that agents can navigate. File sizes suggest complexity; naming conventions hint at purpose; timestamps serve as proxies for relevance. Metadata of file references provides a mechanism to efficiently refine behavior.
The most effective agents employ hybrid strategies. Pre-load some context for speed (like CLAUDE.md files or project rules), but enable autonomous exploration for additional context as needed. The decision boundary depends on task characteristics and context dynamics.
For contexts with less dynamic content, pre-loading more upfront makes sense. For rapidly changing or highly specific information, just-in-time loading avoids stale context.
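The just-in-time pattern above can be sketched as a lightweight reference object. `DocumentReference` and its fields are hypothetical names for illustration, not part of any real framework:

```python
from pathlib import Path

class DocumentReference:
    """Keep a cheap identifier in context; load the full text only on demand."""

    def __init__(self, path: str, summary: str):
        self.path = path        # lightweight identifier the agent carries around
        self.summary = summary  # one-line hint used to decide whether to load
        self._content = None    # full text stays out of context until needed

    def load(self) -> str:
        # Full content enters context only when the task actually requires it.
        if self._content is None:
            self._content = Path(self.path).read_text()
        return self._content
```

Only `path` and `summary` cost tokens up front; the document body is paid for exactly when a task touches it.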
Design with explicit context budgets in mind. Know the effective context limit for your model and task. Monitor context usage during development. Implement compaction triggers at appropriate thresholds. Design systems assuming context will degrade rather than hoping it will not.
Effective context budgeting requires understanding not just raw token counts but also attention distribution patterns. The middle of context receives less attention than the beginning and end. Place critical information at attention-favored positions.
Example 1: Organizing System Prompts
<BACKGROUND_INFORMATION>
You are a Python expert helping a development team.
Current project: Data processing pipeline in Python 3.9+
</BACKGROUND_INFORMATION>
<INSTRUCTIONS>
- Write clean, idiomatic Python code
- Include type hints for function signatures
- Add docstrings for public functions
- Follow PEP 8 style guidelines
</INSTRUCTIONS>
<TOOL_GUIDANCE>
Use bash for shell operations, python for code tasks.
File operations should use pathlib for cross-platform compatibility.
</TOOL_GUIDANCE>
<OUTPUT_DESCRIPTION>
Provide actionable feedback with specific line references.
Explain the reasoning behind suggestions.
</OUTPUT_DESCRIPTION>
Example 2: Progressive Document Loading
# Instead of loading all documentation at once:
# Step 1: Load summary
docs/architecture_overview.md # Lightweight overview
# Step 2: Load specific sections as needed
docs/api/endpoints.md # Only when API work is needed
docs/database/schemas.md # Only when data layer work is needed
Example 3: Skill Description Design
# Bad: Vague description that loads into context but provides little signal
description: Helps with code things
# Good: Specific description that helps the model decide when to activate
description: Analyze code quality and suggest refactoring patterns. Use when reviewing pull requests or improving existing code structure.
Language models exhibit predictable degradation patterns as context length increases. Understanding these patterns is essential for diagnosing failures and designing resilient systems. Context degradation is not a binary state but a continuum of performance loss that manifests in several distinct ways.
Context degradation manifests through several distinct patterns. The lost-in-middle phenomenon causes information in the center of context to receive less attention. Context poisoning occurs when errors compound through repeated reference. Context distraction happens when irrelevant information overwhelms relevant content. Context confusion arises when the model cannot determine which context applies. Context clash develops when accumulated information directly conflicts.
These patterns are predictable and can be mitigated through architectural patterns like compaction, masking, partitioning, and isolation.
The most well-documented degradation pattern is the "lost-in-middle" effect, where models demonstrate U-shaped attention curves. Information at the beginning and end of context receives reliable attention, while information buried in the middle suffers from dramatically reduced recall accuracy.
Empirical Evidence Research demonstrates that relevant information placed in the middle of context experiences 10-40% lower recall accuracy compared to the same information at the beginning or end. This is not a failure of the model but a consequence of attention mechanics and training data distributions.
Models allocate massive attention to the first token (often the BOS token) to stabilize internal states. This creates an "attention sink" that soaks up attention budget. As context grows, the limited budget is stretched thinner, and middle tokens fail to garner sufficient attention weight for reliable retrieval.
Practical Implications Design context placement with attention patterns in mind. Place critical information at the beginning or end of context. Consider whether information will be queried directly or needs to support reasoning--if the latter, placement matters less but overall signal quality matters more.
For long documents or conversations, use summary structures that surface key information at attention-favored positions. Use explicit section headers and transitions to help models navigate structure.
Context poisoning occurs when hallucinations, errors, or incorrect information enters context and compounds through repeated reference. Once poisoned, context creates feedback loops that reinforce incorrect beliefs.
How Poisoning Occurs Poisoning typically enters through three pathways. First, tool outputs may contain errors or unexpected formats that models accept as ground truth. Second, retrieved documents may contain incorrect or outdated information that models incorporate into reasoning. Third, model-generated summaries or intermediate outputs may introduce hallucinations that persist in context.
The compounding effect is severe. If an agent's goals section becomes poisoned, it develops strategies that take substantial effort to undo. Each subsequent decision references the poisoned content, reinforcing incorrect assumptions.
Detection and Recovery Watch for symptoms including degraded output quality on tasks that previously succeeded, tool misalignment where agents call wrong tools or parameters, and hallucinations that persist despite correction attempts. When these symptoms appear, consider context poisoning.
Recovery requires removing or replacing poisoned content. This may involve truncating context to before the poisoning point, explicitly noting the poisoning in context and asking for re-evaluation, or restarting with clean context and preserving only verified information.
Context distraction emerges when context grows so long that models over-focus on provided information at the expense of their training knowledge. The model attends to everything in context regardless of relevance, and this creates pressure to use provided information even when internal knowledge is more accurate.
The Distractor Effect Research shows that even a single irrelevant document in context reduces performance on tasks involving relevant documents. Multiple distractors compound degradation. The effect is not about noise in absolute terms but about attention allocation--irrelevant information competes with relevant information for limited attention budget.
Models do not have a mechanism to "skip" irrelevant context. They must attend to everything provided, and this obligation creates distraction even when the irrelevant information is clearly not useful.
Mitigation Strategies Mitigate distraction through careful curation of what enters context. Apply relevance filtering before loading retrieved documents. Use namespacing and organization to make irrelevant sections easy to ignore structurally. Consider whether information truly needs to be in context or can be accessed through tool calls instead.
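A minimal sketch of the relevance-filtering step described above; real systems would score with embeddings rather than this crude lexical overlap:

```python
def relevance_score(query: str, document: str) -> float:
    """Fraction of query terms that appear in the document (crude lexical proxy)."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)

def filter_relevant(query, documents, threshold=0.3):
    """Drop documents below the threshold before they ever enter context."""
    return [d for d in documents if relevance_score(query, d) >= threshold]
```

Even a filter this simple removes the single-distractor failure mode: irrelevant documents never compete for attention because they never reach the window.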
Context confusion arises when irrelevant information influences responses in ways that degrade quality. This is related to distraction but distinct--confusion concerns the influence of context on model behavior rather than attention allocation.
If you put something in context, the model has to pay attention to it. The model may incorporate irrelevant information, use inappropriate tool definitions, or apply constraints that came from different contexts. Confusion is especially problematic when context contains multiple task types or when switching between tasks within a single session.
Signs of Confusion Watch for responses that address the wrong aspect of a query, tool calls that seem appropriate for a different task, or outputs that mix requirements from multiple sources. These indicate confusion about what context applies to the current situation.
Architectural Solutions Architectural solutions include explicit task segmentation where different tasks get different context windows, clear transitions between task contexts, and state management that isolates context for different objectives.
Context clash develops when accumulated information directly conflicts, creating contradictory guidance that derails reasoning. This differs from poisoning where one piece of information is incorrect--in clash, multiple correct pieces of information contradict each other.
Sources of Clash Clash commonly arises from multi-source retrieval where different sources have contradictory information, version conflicts where outdated and current information both appear in context, and perspective conflicts where different viewpoints are valid but incompatible.
Resolution Approaches Resolution approaches include explicit conflict marking that identifies contradictions and requests clarification, priority rules that establish which source takes precedence, and version filtering that excludes outdated information from context.
Research reveals several counterintuitive patterns that challenge assumptions about context management.
Shuffled Haystacks Outperform Coherent Ones Studies found that shuffled (incoherent) haystacks produce better performance than logically coherent ones. This suggests that coherent context may create false associations that confuse retrieval, while incoherent context forces models to rely on exact matching.
Single Distractors Have Outsized Impact Even a single irrelevant document reduces performance significantly. The effect is not proportional to the amount of noise but follows a step function where the presence of any distractor triggers degradation.
Needle-Question Similarity Correlation Lower similarity between needle and question pairs shows faster degradation with context length. Tasks requiring inference across dissimilar content are particularly vulnerable.
Larger context windows do not uniformly improve performance. In many cases, larger contexts create new problems that outweigh benefits.
Performance Degradation Curves Models exhibit non-linear degradation with context length. Performance remains stable up to a threshold, then degrades rapidly. The threshold varies by model and task complexity. For many models, meaningful degradation begins around 8,000-16,000 tokens even when context windows support much larger sizes.
Cost Implications Processing cost grows disproportionately with context length. The cost to process a 400K token context is not double the cost of 200K--it increases exponentially in both time and computing resources. For many applications, this makes large-context processing economically impractical.
Cognitive Load Metaphor Even with an infinite context, asking a single model to maintain consistent quality across dozens of independent tasks creates a cognitive bottleneck. The model must constantly switch context between items, maintain a comparative framework, and ensure stylistic consistency. This is not a problem that more context solves.
Four strategies address different aspects of context degradation:
Write: Save context outside the window using scratchpads, file systems, or external storage. This keeps active context lean while preserving information access.
Select: Pull relevant context into the window through retrieval, filtering, and prioritization. This addresses distraction by excluding irrelevant information.
Compress: Reduce tokens while preserving information through summarization, abstraction, and observation masking. This extends effective context capacity.
Isolate: Split context across sub-agents or sessions to prevent any single context from growing large enough to degrade. This is the most aggressive strategy but often the most effective.
Implement these strategies through specific architectural patterns. Use just-in-time context loading to retrieve information only when needed. Use observation masking to replace verbose tool outputs with compact references. Use sub-agent architectures to isolate context for different tasks. Use compaction to summarize growing context before it exceeds limits.
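Observation masking, one of the patterns just listed, can be sketched as follows. The `obs://` reference scheme is an illustrative convention, not a standard:

```python
def mask_observation(tool_name, output, store, max_chars=500):
    """Replace a verbose tool output with a compact, retrievable reference."""
    if len(output) <= max_chars:
        return output  # short outputs stay inline
    ref = f"obs://{tool_name}/{len(store)}"
    store[ref] = output  # full text parked outside the context window
    first_line = output.splitlines()[0][:80]
    return f"[masked {len(output)} chars; starts: {first_line!r}; fetch via {ref}]"
```

The agent keeps enough signal to know what the observation was and how to retrieve it, while the token cost in the active window drops to one line.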
Example 1: Detecting Degradation in Prompt Design
# Signs your command/skill prompt may be too large:
Early signs (context ~50-70% utilized):
- Agent occasionally misses instructions
- Responses become less focused
- Some guidelines ignored
Warning signs (context ~70-85% utilized):
- Inconsistent behavior across runs
- Agent "forgets" earlier instructions
- Quality varies significantly
Critical signs (context >85% utilized):
- Agent ignores key constraints
- Hallucinations increase
- Task completion fails
Example 2: Mitigating Lost-in-Middle in Prompt Structure
# Organize critical information at edge positions
<CRITICAL_CONSTRAINTS> # At the beginning (high attention)
- Never modify production files directly
- Always run tests before committing
- Maximum file size: 500 lines
</CRITICAL_CONSTRAINTS>
<DETAILED_GUIDELINES> # In the middle (lower attention)
- Code style preferences
- Documentation templates
- Review checklists
- Example patterns
</DETAILED_GUIDELINES>
<KEY_REMINDERS> # At the end (high attention)
- Run tests: npm test
- Format code: npm run format
- Create a PR with a description
</KEY_REMINDERS>
Example 3: Sub-Agent Context Isolation
# Instead of one agent handling everything:
## Orchestrator agent (lean context)
- Understands task decomposition
- Delegates to specialized sub-agents
- Synthesizes results
## Code review sub-agent (isolated context)
- Loads only code review guidelines
- Focuses only on the review task
- Returns structured findings
## Test writing sub-agent (isolated context)
- Loads only testing patterns
- Focuses only on test creation
- Returns test files
This section translates context degradation detection and mitigation concepts into actionable multi-agent workflows for Claude Code. Use these patterns when building commands, skills, or complex agent pipelines to ensure quality and reliability.
Hallucinations in agent outputs poison downstream context and propagate errors through multi-step workflows. This workflow detects hallucinations before they compound.
Step 1: Generate output
Let the primary agent complete its task normally.
Step 2: Extract claims
Spawn a verification sub-agent with the following prompt:
<TASK>
Extract all factual claims from the following output. List each claim on a separate line.
</TASK>
<FOCUS_AREAS>
- File paths and their existence
- Referenced function/class/method names
- Assertions about code behavior ("this function returns X")
- External facts about APIs, libraries, or specifications
- Numeric values and metrics
</FOCUS_AREAS>
<OUTPUT_TO_ANALYZE>
{agent_output}
</OUTPUT_TO_ANALYZE>
<OUTPUT_FORMAT>
One claim per line, prefixed with its category:
[PATH] /src/auth/login.ts exists
[CODE] validateCredentials() returns a boolean
[FACT] JWT tokens expire after 24 hours by default
[METRIC] The function has O(n) complexity
</OUTPUT_FORMAT>
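Because the claim format is line-oriented, downstream tooling can parse it mechanically. A sketch of such a parser (the category set mirrors the prompt above):

```python
import re

CLAIM_LINE = re.compile(r"^\[(PATH|CODE|FACT|METRIC)\]\s+(.+)$")

def parse_claims(text):
    """Return (category, claim) pairs from the extraction sub-agent's output."""
    claims = []
    for line in text.splitlines():
        match = CLAIM_LINE.match(line.strip())
        if match:
            claims.append((match.group(1), match.group(2)))
    return claims
```

Non-matching lines are ignored, so stray commentary from the sub-agent does not break aggregation.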
Step 3: Verify claims
For each group of extracted claims, spawn a verification agent:
<TASK>
Verify this claim by inspecting the actual codebase and context.
</TASK>
<CLAIM>
{claim}
</CLAIM>
<VERIFICATION_APPROACH>
- For file paths: check existence with file tools
- For code claims: read the actual code and verify the behavior
- For external facts: cross-reference documentation or web search
- For metrics: analyze the code structure
</VERIFICATION_APPROACH>
<RESPONSE_FORMAT>
STATUS: [VERIFIED | FALSE | UNVERIFIABLE]
EVIDENCE: [your findings]
CONFIDENCE: [HIGH | MEDIUM | LOW]
</RESPONSE_FORMAT>
Step 4: Compute poisoning risk
Aggregate the verification results:
total_claims = number of extracted claims
verified_count = number of claims marked VERIFIED
false_count = number of claims marked FALSE
unverifiable_count = number of claims marked UNVERIFIABLE
poisoning_risk = (false_count * 2 + unverifiable_count) / total_claims
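The step-4 aggregation is a direct computation; false claims are weighted double because they actively mislead rather than merely lack support:

```python
def poisoning_risk(false_count, unverifiable_count, total_claims):
    """Risk score from the step-4 formula; 0.0 means every claim verified clean."""
    if total_claims == 0:
        return 0.0  # nothing asserted, nothing to poison
    return (false_count * 2 + unverifiable_count) / total_claims
```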
Step 5: Decision thresholds
Risk < 0.1: the output is reliable; proceed normally
Risk 0.1-0.3: manually review the flagged claims before proceeding
Risk > 0.3: regenerate the output with more explicit grounding instructions:
<REGENERATION_PROMPT> The previous output contained {false_count} false claims and {unverifiable_count} unverifiable claims.
Specific issues: {list the FALSE and UNVERIFIABLE claims with evidence}
Please regenerate your response. For each factual claim:
Critical information buried in the middle of a long prompt receives less attention. This workflow detects which parts of a prompt are at risk of being ignored by running multiple agents and verifying their outputs against the original instructions.
Step 1: Identify critical instructions
Extract all critical instructions from your prompt that agents must follow:
Critical instructions to verify:
1. "Never modify files in /production"
2. "Always run tests before committing"
3. "Use TypeScript strict mode"
4. "Maximum function length: 50 lines"
5. "Include JSDoc for public APIs"
6. "Format output as JSON"
7. "Log all file modifications"
Step 2: Run multiple agents with the same prompt
Spawn 3-5 agents with the same prompt (the command/skill/agent under test). Each agent runs independently on the same input:
<AGENT_RUN_CONFIG>
Number of runs: 5
Prompt: {the full prompt under test}
Task: {a representative task that exercises all instructions}
Save per run:
- run_id: unique identifier
- agent_output: full response from the agent
- timestamp: when the run completed
</AGENT_RUN_CONFIG>
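The run configuration above amounts to a simple harness loop. In this sketch, `run_agent` is a hypothetical callable standing in for however you invoke the agent under test:

```python
import time

def run_trials(run_agent, prompt, task, n_runs=5):
    """Collect independent runs of the same prompt for compliance analysis."""
    records = []
    for i in range(n_runs):
        output = run_agent(prompt, task)  # each run starts from a fresh context
        records.append({
            "run_id": f"run-{i}",
            "agent_output": output,
            "timestamp": time.time(),
        })
    return records
```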
Step 3: Verify each output against the original prompt
For each agent's output, spawn a fresh verification agent that checks compliance with every critical instruction:
<VERIFICATION_AGENT_PROMPT>
<TASK>
You are a compliance verification agent. Analyze whether the agent output followed every instruction in the original prompt.
</TASK>
<ORIGINAL_PROMPT>
{the full prompt under test}
</ORIGINAL_PROMPT>
<CRITICAL_INSTRUCTIONS>
{numbered list of critical instructions}
</CRITICAL_INSTRUCTIONS>
<AGENT_OUTPUT>
{output from run N}
</AGENT_OUTPUT>
<VERIFICATION_APPROACH>
For each critical instruction:
1. Determine whether the instruction applies to this task
2. If it applies, check whether the output complies
3. Look for explicit violations and omissions
4. Note any partial compliance
</VERIFICATION_APPROACH>
<OUTPUT_FORMAT>
RUN_ID: {run_id}
INSTRUCTION_COMPLIANCE:
- Instruction 1: "Never modify files in /production"
  STATUS: [FOLLOWED | VIOLATED | NOT_APPLICABLE]
  EVIDENCE: {quote from the output or explanation}
- Instruction 2: "Always run tests before committing"
  STATUS: [FOLLOWED | VIOLATED | NOT_APPLICABLE]
  EVIDENCE: {quote from the output or explanation}
[... continue for all instructions ...]
SUMMARY:
- Instructions followed: {count}
- Instructions violated: {count}
- Not applicable: {count}
</OUTPUT_FORMAT>
</VERIFICATION_AGENT_PROMPT>
Step 4: Aggregate results and identify at-risk sections
Collect the verification results from all runs and identify instructions that are followed inconsistently:
<AGGREGATION_LOGIC>
For each instruction:
followed_count = number of runs with STATUS == FOLLOWED
violated_count = number of runs with STATUS == VIOLATED
applicable_runs = total_runs - (runs with STATUS == NOT_APPLICABLE)
compliance_rate = followed_count / applicable_runs
Classification:
- compliance_rate == 1.0: RELIABLE (always followed)
- compliance_rate >= 0.8: MOSTLY_RELIABLE (minor inconsistency)
- compliance_rate >= 0.5: AT_RISK (inconsistent - likely lost in the middle)
- compliance_rate < 0.5: FREQUENTLY_IGNORED (serious problem)
- compliance_rate == 0.0: ALWAYS_IGNORED (critical failure)
AT_RISK instructions are the primary signal of lost-in-middle problems.
These instructions work sometimes but not consistently, suggesting they sit at attention-starved positions.
</AGGREGATION_LOGIC>
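The classification thresholds above translate directly into code:

```python
def classify_compliance(followed, not_applicable, total_runs):
    """Map a per-instruction compliance rate to the status labels above."""
    applicable = total_runs - not_applicable
    if applicable == 0:
        return None, "NOT_APPLICABLE"
    rate = followed / applicable
    if rate == 1.0:
        label = "RELIABLE"
    elif rate >= 0.8:
        label = "MOSTLY_RELIABLE"
    elif rate >= 0.5:
        label = "AT_RISK"
    elif rate > 0.0:
        label = "FREQUENTLY_IGNORED"
    else:
        label = "ALWAYS_IGNORED"
    return rate, label
```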
<AGGREGATION_OUTPUT_FORMAT>
Instruction compliance summary:
| Instruction | Followed | Violated | Compliance rate | Status |
|-------------|----------|----------|-----------------|--------|
| 1. Never modify /production | 5/5 | 0/5 | 100% | RELIABLE |
| 2. Run tests before committing | 3/5 | 2/5 | 60% | AT_RISK |
| 3. TypeScript strict mode | 4/5 | 1/5 | 80% | MOSTLY_RELIABLE |
| 4. Max function length 50 | 2/5 | 3/5 | 40% | FREQUENTLY_IGNORED |
| 5. Include JSDoc | 5/5 | 0/5 | 100% | RELIABLE |
| 6. Format as JSON | 1/5 | 4/5 | 20% | FREQUENTLY_IGNORED |
| 7. Log modifications | 3/5 | 2/5 | 60% | AT_RISK |
At-risk instructions (likely in the lost-in-middle zone):
- Instruction 2: "Run tests before committing" (60% compliance)
- Instruction 4: "Max function length 50" (40% compliance)
- Instruction 6: "Format as JSON" (20% compliance)
- Instruction 7: "Log modifications" (60% compliance)
</AGGREGATION_OUTPUT_FORMAT>
Step 5: Output recommendations
Based on the identified at-risk sections, provide concrete remediation guidance:
<RECOMMENDATIONS_OUTPUT>
Lost-in-middle analysis complete
At-risk instructions detected: {count}
These instructions were followed inconsistently, suggesting they likely
sit at attention-starved positions (the middle of the prompt).
Specific recommendations:
1. Move critical information to attention-favored positions
The following instructions should be relocated to the beginning or end of the prompt:
- "Run tests before committing" -> move to <CRITICAL_CONSTRAINTS> at the start of the prompt
- "Max function length 50" -> move to <KEY_REMINDERS> at the end of the prompt
- "Format as JSON" -> move to <OUTPUT_FORMAT> at the end of the prompt
- "Log modifications" -> add to both the opening and closing sections
2. Use explicit markers to highlight critical information
Restructure at-risk instructions with emphasis:
Before: "Always run tests before committing"
After: "**CRITICAL:** Tests MUST be run before committing. Never skip this step."
Before: "Maximum function length: 50 lines"
After: "3. [REQUIRED] Maximum function length: 50 lines"
Use numbered lists, bold markers, or explicit tags such as [REQUIRED], [CRITICAL], [MUST].
3. Consider splitting the context to shrink the middle section
If your prompt has many instructions, consider:
- Breaking it into focused sub-prompts targeting different aspects
- Using sub-agents with specialized, shorter contexts
- Moving detailed guidance into on-demand sections loaded only when needed
The current prompt structure creates a large middle section where
{count} instructions are being lost. Shrink the middle by:
- Moving the 2-3 most critical items to the edges
- Converting the remaining middle items into a numbered checklist
- Adding an explicit "verify these items" reminder at the end
</RECOMMENDATIONS_OUTPUT>
# Example: Testing a code review command
## Original prompt under test:
"Review the code for: security issues, performance problems,
code style, test coverage, documentation completeness,
error handling, and logging practices."
## Run 5 agents:
Each agent reviews the same code sample with this prompt.
## Verification results:
| Instruction | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Rate |
|-------------|-------|-------|-------|-------|-------|------|
| Security | Y | Y | Y | Y | Y | 100% |
| Performance | Y | X | Y | X | Y | 60% |
| Code style | X | X | Y | X | X | 20% |
| Test coverage | X | Y | X | X | Y | 40% |
| Documentation | X | X | X | Y | X | 20% |
| Error handling | Y | Y | X | Y | Y | 80% |
| Logging | Y | Y | Y | Y | Y | 100% |
## Analysis:
- RELIABLE: security, logging (at the edges of the list)
- MOSTLY_RELIABLE: error handling (80%)
- AT_RISK: performance (60%)
- FREQUENTLY_IGNORED: code style, test coverage, documentation (middle of the list)
## Remediation applied:
"**CRITICAL REVIEW AREAS:**
1. Security vulnerabilities
2. Test coverage gaps
3. Documentation completeness
Also review: performance, code style, error handling, logging.
**Before finishing:** Verify you have addressed items 1-3 above."
In a multi-agent chain, errors from early agents propagate and amplify in later agents. This workflow traces errors back to their source.
Step 1: Capture agent chain outputs
Record the output of each agent in the chain:
Agent chain record:
- Agent 1 (analyzer): {output_1}
- Agent 2 (planner): {output_2}
- Agent 3 (implementer): {output_3}
- Agent 4 (reviewer): {output_4}
Step 2: Identify error symptoms
Spawn an error identification agent:
<TASK>
Analyze the final output and identify all errors, inconsistencies, or quality problems.
</TASK>
<FINAL_OUTPUT>
{output from the last agent}
</FINAL_OUTPUT>
<OUTPUT_FORMAT>
ERROR_ID: E1
DESCRIPTION: Function lacks a null check
LOCATION: src/utils/parser.ts:45
SEVERITY: HIGH
ERROR_ID: E2
...
</OUTPUT_FORMAT>
Step 3: Trace each error backward
For each identified error, spawn a tracing agent:
<TASK>
Trace this error backward through the agent chain to find its origin.
</TASK>
<ERROR>
{error_description}
</ERROR>
<AGENT_CHAIN_OUTPUTS>
Agent 1 output: {output_1}
Agent 2 output: {output_2}
Agent 3 output: {output_3}
Agent 4 output: {output_4}
</AGENT_CHAIN_OUTPUTS>
<ANALYSIS_APPROACH>
For each agent output (starting from the last):
1. Does this output contain the error?
2. If so, was the error already present in this agent's input?
3. If the error is in the output but not the input: this agent introduced it
4. If the error is in both: this agent propagated it
</ANALYSIS_APPROACH>
<OUTPUT_FORMAT>
ERROR: {error_id}
ORIGIN_AGENT: Agent {N}
ORIGIN_TYPE: [INTRODUCED | PROPAGATED_FROM_CONTEXT | PROPAGATED_FROM_TOOL_OUTPUT]
ROOT_CAUSE: {explanation}
CONTEXT_THAT_CAUSED_IT: {relevant context snippet, if applicable}
</OUTPUT_FORMAT>
Step 4: Compute propagation metrics
For each agent in the chain:
errors_introduced = number of errors this agent created
errors_propagated = number of errors this agent passed along
errors_caught = number of errors this agent fixed or flagged
propagation_rate = errors_at_end / errors_introduced_total
amplification_factor = errors_at_end / errors_at_start
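Once errors have been traced, the step-4 metrics can be computed mechanically:

```python
def propagation_metrics(errors_at_start, errors_at_end, errors_introduced_total):
    """Chain-level metrics from the step-4 formulas."""
    propagation_rate = (
        errors_at_end / errors_introduced_total if errors_introduced_total else 0.0
    )
    # Undefined when the chain started error-free; report None rather than divide by zero.
    amplification_factor = (
        errors_at_end / errors_at_start if errors_at_start else None
    )
    return {"propagation_rate": propagation_rate,
            "amplification_factor": amplification_factor}
```

An amplification factor above 1.0 means later agents are compounding upstream mistakes rather than catching them.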
Step 5: Establish error boundaries
Based on the analysis, add validation checkpoints:
<ERROR_BOUNDARY_TEMPLATE>
After agent {N} completes:
1. Spawn a validation agent to check for common error patterns:
- {error_pattern_1 that agent N tends to introduce}
- {error_pattern_2 that agent N tends to introduce}
2. If errors are detected:
- Log the errors for analysis
- Either: fix inline and continue
- Or: regenerate agent N's output with explicit guidance
3. Proceed to agent {N+1} only after validation passes
</ERROR_BOUNDARY_TEMPLATE>
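The error-boundary template can be wrapped as a reusable checkpoint. Here `produce`, `validate`, and `regenerate` are hypothetical callables standing in for the agent invocations in the template:

```python
def with_error_boundary(produce, validate, regenerate, max_attempts=2):
    """Validate an agent step's output and regenerate it before passing downstream."""
    output = produce()
    for _ in range(max_attempts):
        errors = validate(output)
        if not errors:
            return output  # validation passed; safe to hand to the next agent
        output = regenerate(output, errors)
    # Out of attempts: surface the failure instead of propagating it silently.
    raise RuntimeError(f"validation still failing after {max_attempts} attempts")
```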
Not all parts of a prompt contribute equally to task completion. This workflow identifies distractor sections that consume attention budget without adding value.
Context is the complete state available to a language model at inference time. It includes everything the model can attend to when generating responses: system instructions, tool definitions, retrieved documents, message history, and tool outputs. Understanding context fundamentals is prerequisite to effective context engineering.
Context comprises several distinct components, each with different characteristics and constraints. The attention mechanism creates a finite budget that constrains effective context usage. Progressive disclosure manages this constraint by loading information only as needed. The engineering discipline is curating the smallest high-signal token set that achieves desired outcomes.
System Prompts System prompts establish the agent's core identity, constraints, and behavioral guidelines. They are loaded once at session start and typically persist throughout the conversation. System prompts should be extremely clear and use simple, direct language at the right altitude for the agent.
The right altitude balances two failure modes. At one extreme, engineers hardcode complex brittle logic that creates fragility and maintenance burden. At the other extreme, engineers provide vague high-level guidance that fails to give concrete signals for desired outputs or falsely assumes shared context. The optimal altitude strikes a balance: specific enough to guide behavior effectively, yet flexible enough to provide strong heuristics.
Organize prompts into distinct sections using XML tagging or Markdown headers to delineate background information, instructions, tool guidance, and output description. The exact formatting matters less as models become more capable, but structural clarity remains valuable.
Tool Definitions Tool definitions specify the actions an agent can take. Each tool includes a name, description, parameters, and return format. Tool definitions live near the front of context after serialization, typically before or after the system prompt.
Tool descriptions collectively steer agent behavior. Poor descriptions force agents to guess; optimized descriptions include usage context, examples, and defaults. The consolidation principle states that if a human engineer cannot definitively say which tool should be used in a given situation, an agent cannot be expected to do better.
Retrieved Documents Retrieved documents provide domain-specific knowledge, reference materials, or task-relevant information. Agents use retrieval augmented generation to pull relevant documents into context at runtime rather than pre-loading all possible information.
The just-in-time approach maintains lightweight identifiers (file paths, stored queries, web links) and uses these references to load data into context dynamically. This mirrors human cognition: we generally do not memorize entire corpuses of information but rather use external organization and indexing systems to retrieve relevant information on demand.
Message History Message history contains the conversation between the user and agent, including previous queries, responses, and reasoning. For long-running tasks, message history can grow to dominate context usage.
Message history serves as scratchpad memory where agents track progress, maintain task state, and preserve reasoning across turns. Effective management of message history is critical for long-horizon task completion.
Tool Outputs Tool outputs are the results of agent actions: file contents, search results, command execution output, API responses, and similar data. Tool outputs comprise the majority of tokens in typical agent trajectories, with research showing observations (tool outputs) can reach 83.9% of total context usage.
Tool outputs consume context whether they are relevant to current decisions or not. This creates pressure for strategies like observation masking, compaction, and selective tool result retention.
The Attention Budget Constraint Language models process tokens through attention mechanisms that create pairwise relationships between all tokens in context. For n tokens, this creates n^2 relationships that must be computed and stored. As context length increases, the model's ability to capture these relationships gets stretched thin.
Models develop attention patterns from training data distributions where shorter sequences predominate. This means models have less experience with and fewer specialized parameters for context-wide dependencies. The result is an "attention budget" that depletes as context grows.
Position Encoding and Context Extension Position encoding interpolation allows models to handle longer sequences by adapting them to originally trained smaller contexts. However, this adaptation introduces degradation in token position understanding. Models remain highly capable at longer contexts but show reduced precision for information retrieval and long-range reasoning compared to performance on shorter contexts.
The Progressive Disclosure Principle Progressive disclosure manages context efficiently by loading information only as needed. At startup, agents load only skill names and descriptions--sufficient to know when a skill might be relevant. Full content loads only when a skill is activated for specific tasks.
This approach keeps agents fast while giving them access to more context on demand. The principle applies at multiple levels: skill selection, document loading, and even tool result retrieval.
The assumption that larger context windows solve memory problems has been empirically debunked. Context engineering means finding the smallest possible set of high-signal tokens that maximize the likelihood of desired outcomes.
Several factors create pressure for context efficiency. Processing cost grows disproportionately with context length--not just double the cost for double the tokens, but exponentially more in time and computing resources. Model performance degrades beyond certain context lengths even when the window technically supports more tokens. Long inputs remain expensive even with prefix caching.
The guiding principle is informativity over exhaustiveness. Include what matters for the decision at hand, exclude what does not, and design systems that can access additional information on demand.
Context must be treated as a finite resource with diminishing marginal returns. Like humans with limited working memory, language models have an attention budget drawn on when parsing large volumes of context.
Every new token introduced depletes this budget by some amount. This creates the need for careful curation of available tokens. The engineering problem is optimizing utility against inherent constraints.
Context engineering is iterative and the curation phase happens each time you decide what to pass to the model. It is not a one-time prompt writing exercise but an ongoing discipline of context management.
Agents with filesystem access can use progressive disclosure naturally. Store reference materials, documentation, and data externally. Load files only when needed using standard filesystem operations. This pattern avoids stuffing context with information that may not be relevant.
The file system itself provides structure that agents can navigate. File sizes suggest complexity; naming conventions hint at purpose; timestamps serve as proxies for relevance. Metadata of file references provides a mechanism to efficiently refine behavior.
The most effective agents employ hybrid strategies. Pre-load some context for speed (like CLAUDE.md files or project rules), but enable autonomous exploration for additional context as needed. The decision boundary depends on task characteristics and context dynamics.
For contexts with less dynamic content, pre-loading more upfront makes sense. For rapidly changing or highly specific information, just-in-time loading avoids stale context.
Design with explicit context budgets in mind. Know the effective context limit for your model and task. Monitor context usage during development. Implement compaction triggers at appropriate thresholds. Design systems assuming context will degrade rather than hoping it will not.
Effective context budgeting requires understanding not just raw token counts but also attention distribution patterns. The middle of context receives less attention than the beginning and end. Place critical information at attention-favored positions.
Example 1: Organizing System Prompts
<BACKGROUND_INFORMATION>
You are a Python expert helping a development team.
Current project: Data processing pipeline in Python 3.9+
</BACKGROUND_INFORMATION>
<INSTRUCTIONS>
- Write clean, idiomatic Python code
- Include type hints for function signatures
- Add docstrings for public functions
- Follow PEP 8 style guidelines
</INSTRUCTIONS>
<TOOL_GUIDANCE>
Use bash for shell operations, python for code tasks.
File operations should use pathlib for cross-platform compatibility.
</TOOL_GUIDANCE>
<OUTPUT_DESCRIPTION>
Provide actionable feedback with specific line references.
Explain the reasoning behind suggestions.
</OUTPUT_DESCRIPTION>
Example 2: Progressive Document Loading
# Instead of loading all documentation at once:
# Step 1: Load summary
docs/architecture_overview.md # Lightweight overview
# Step 2: Load specific section as needed
docs/api/endpoints.md # Only when API work needed
docs/database/schemas.md # Only when data layer work needed
Example 3: Skill Description Design
# Bad: Vague description that loads into context but provides little signal
description: Helps with code things
# Good: Specific description that helps model decide when to activate
description: Analyze code quality and suggest refactoring patterns. Use when reviewing pull requests or improving existing code structure.
Language models exhibit predictable degradation patterns as context length increases. Understanding these patterns is essential for diagnosing failures and designing resilient systems. Context degradation is not a binary state but a continuum of performance degradation that manifests in several distinct ways.
Context degradation manifests through several distinct patterns. The lost-in-middle phenomenon causes information in the center of context to receive less attention. Context poisoning occurs when errors compound through repeated reference. Context distraction happens when irrelevant information overwhelms relevant content. Context confusion arises when the model cannot determine which context applies. Context clash develops when accumulated information directly conflicts.
These patterns are predictable and can be mitigated through architectural patterns like compaction, masking, partitioning, and isolation.
The most well-documented degradation pattern is the "lost-in-middle" effect, where models demonstrate U-shaped attention curves. Information at the beginning and end of context receives reliable attention, while information buried in the middle suffers from dramatically reduced recall accuracy.
Empirical Evidence Research demonstrates that relevant information placed in the middle of context experiences 10-40% lower recall accuracy compared to the same information at the beginning or end. This is not a failure of the model but a consequence of attention mechanics and training data distributions.
Models allocate massive attention to the first token (often the BOS token) to stabilize internal states. This creates an "attention sink" that soaks up attention budget. As context grows, the limited budget is stretched thinner, and middle tokens fail to garner sufficient attention weight for reliable retrieval.
Practical Implications Design context placement with attention patterns in mind. Place critical information at the beginning or end of context. Consider whether information will be queried directly or needs to support reasoning--if the latter, placement matters less but overall signal quality matters more.
For long documents or conversations, use summary structures that surface key information at attention-favored positions. Use explicit section headers and transitions to help models navigate structure.
Context poisoning occurs when hallucinations, errors, or incorrect information enters context and compounds through repeated reference. Once poisoned, context creates feedback loops that reinforce incorrect beliefs.
How Poisoning Occurs Poisoning typically enters through three pathways. First, tool outputs may contain errors or unexpected formats that models accept as ground truth. Second, retrieved documents may contain incorrect or outdated information that models incorporate into reasoning. Third, model-generated summaries or intermediate outputs may introduce hallucinations that persist in context.
The compounding effect is severe. If an agent's goals section becomes poisoned, it develops strategies that take substantial effort to undo. Each subsequent decision references the poisoned content, reinforcing incorrect assumptions.
Detection and Recovery Watch for symptoms including degraded output quality on tasks that previously succeeded, tool misalignment where agents call wrong tools or parameters, and hallucinations that persist despite correction attempts. When these symptoms appear, consider context poisoning.
Recovery requires removing or replacing poisoned content. This may involve truncating context to before the poisoning point, explicitly noting the poisoning in context and asking for re-evaluation, or restarting with clean context and preserving only verified information.
Context distraction emerges when context grows so long that models over-focus on provided information at the expense of their training knowledge. The model attends to everything in context regardless of relevance, and this creates pressure to use provided information even when internal knowledge is more accurate.
The Distractor Effect Research shows that even a single irrelevant document in context reduces performance on tasks involving relevant documents. Multiple distractors compound degradation. The effect is not about noise in absolute terms but about attention allocation--irrelevant information competes with relevant information for limited attention budget.
Models do not have a mechanism to "skip" irrelevant context. They must attend to everything provided, and this obligation creates distraction even when the irrelevant information is clearly not useful.
Mitigation Strategies Mitigate distraction through careful curation of what enters context. Apply relevance filtering before loading retrieved documents. Use namespacing and organization to make irrelevant sections easy to ignore structurally. Consider whether information truly needs to be in context or can be accessed through tool calls instead.
Context confusion arises when irrelevant information influences responses in ways that degrade quality. This is related to distraction but distinct--confusion concerns the influence of context on model behavior rather than attention allocation.
If you put something in context, the model has to pay attention to it. The model may incorporate irrelevant information, use inappropriate tool definitions, or apply constraints that came from different contexts. Confusion is especially problematic when context contains multiple task types or when switching between tasks within a single session.
Signs of Confusion Watch for responses that address the wrong aspect of a query, tool calls that seem appropriate for a different task, or outputs that mix requirements from multiple sources. These indicate confusion about what context applies to the current situation.
Architectural Solutions Architectural solutions include explicit task segmentation where different tasks get different context windows, clear transitions between task contexts, and state management that isolates context for different objectives.
Context clash develops when accumulated information directly conflicts, creating contradictory guidance that derails reasoning. This differs from poisoning where one piece of information is incorrect--in clash, multiple correct pieces of information contradict each other.
Sources of Clash Clash commonly arises from multi-source retrieval where different sources have contradictory information, version conflicts where outdated and current information both appear in context, and perspective conflicts where different viewpoints are valid but incompatible.
Resolution Approaches Resolution approaches include explicit conflict marking that identifies contradictions and requests clarification, priority rules that establish which source takes precedence, and version filtering that excludes outdated information from context.
Research reveals several counterintuitive patterns that challenge assumptions about context management.
Shuffled Haystacks Outperform Coherent Ones Studies found that shuffled (incoherent) haystacks produce better performance than logically coherent ones. This suggests that coherent context may create false associations that confuse retrieval, while incoherent context forces models to rely on exact matching.
Single Distractors Have Outsized Impact Even a single irrelevant document reduces performance significantly. The effect is not proportional to the amount of noise but follows a step function where the presence of any distractor triggers degradation.
Needle-Question Similarity Correlation Lower similarity between needle and question pairs shows faster degradation with context length. Tasks requiring inference across dissimilar content are particularly vulnerable.
Larger context windows do not uniformly improve performance. In many cases, larger contexts create new problems that outweigh benefits.
Performance Degradation Curves Models exhibit non-linear degradation with context length. Performance remains stable up to a threshold, then degrades rapidly. The threshold varies by model and task complexity. For many models, meaningful degradation begins around 8,000-16,000 tokens even when context windows support much larger sizes.
Cost Implications Processing cost grows disproportionately with context length. The cost to process a 400K token context is not double the cost of 200K--it increases exponentially in both time and computing resources. For many applications, this makes large-context processing economically impractical.
Cognitive Load Metaphor Even with an infinite context, asking a single model to maintain consistent quality across dozens of independent tasks creates a cognitive bottleneck. The model must constantly switch context between items, maintain a comparative framework, and ensure stylistic consistency. This is not a problem that more context solves.
Four strategies address different aspects of context degradation:
Write : Save context outside the window using scratchpads, file systems, or external storage. This keeps active context lean while preserving information access.
Select : Pull relevant context into the window through retrieval, filtering, and prioritization. This addresses distraction by excluding irrelevant information.
Compress : Reduce tokens while preserving information through summarization, abstraction, and observation masking. This extends effective context capacity.
Isolate : Split context across sub-agents or sessions to prevent any single context from growing large enough to degrade. This is the most aggressive strategy but often the most effective.
Implement these strategies through specific architectural patterns. Use just-in-time context loading to retrieve information only when needed. Use observation masking to replace verbose tool outputs with compact references. Use sub-agent architectures to isolate context for different tasks. Use compaction to summarize growing context before it exceeds limits.
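As a concrete illustration of the observation-masking pattern, here is a minimal sketch. The names (`mask_observation`, `OBSERVATION_STORE`) are hypothetical, and the 2,000-character threshold is a placeholder to tune per model and task:

```python
import hashlib

# Hypothetical store mapping reference IDs to full tool outputs.
OBSERVATION_STORE: dict[str, str] = {}

MASK_THRESHOLD = 2000  # characters; placeholder value, tune per model and task


def mask_observation(output: str) -> str:
    """Replace a verbose tool output with a compact reference.

    The full output stays retrievable from OBSERVATION_STORE, so the
    agent can re-expand it on demand instead of carrying it in context.
    """
    if len(output) <= MASK_THRESHOLD:
        return output  # short outputs stay inline
    ref = "obs:" + hashlib.sha1(output.encode()).hexdigest()[:8]
    OBSERVATION_STORE[ref] = output
    preview = output[:200].rstrip()
    return f"[masked observation {ref}, {len(output)} chars] {preview}..."


def expand_observation(ref: str) -> str:
    """Retrieve the full output when the agent actually needs it."""
    return OBSERVATION_STORE[ref]
```

The same store-and-reference shape also works for just-in-time loading: keep the identifier in context, fetch the content only when a decision depends on it.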
Example 1: Detecting Degradation in Prompt Design
# Signs your command/skill prompt may be too large:
Early signs (context ~50-70% utilized):
- Agent occasionally misses instructions
- Responses become less focused
- Some guidelines ignored
Warning signs (context ~70-85% utilized):
- Inconsistent behavior across runs
- Agent "forgets" earlier instructions
- Quality varies significantly
Critical signs (context >85% utilized):
- Agent ignores key constraints
- Hallucinations increase
- Task completion fails
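The stages above can be folded into a simple utilization monitor. The thresholds mirror the checklist; the function name and return labels are illustrative:

```python
def classify_context_health(used_tokens: int, window_tokens: int) -> str:
    """Map context utilization to the degradation stages listed above."""
    utilization = used_tokens / window_tokens
    if utilization > 0.85:
        return "CRITICAL"  # agent ignores key constraints, hallucinations rise
    if utilization > 0.70:
        return "WARNING"   # inconsistent behavior, earlier instructions forgotten
    if utilization >= 0.50:
        return "EARLY"     # occasional missed instructions, less focused output
    return "HEALTHY"
```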
Example 2: Mitigating Lost-in-Middle in Prompt Structure
# Organize prompts with critical info at edges
<CRITICAL_CONSTRAINTS> # At start (high attention)
- Never modify production files directly
- Always run tests before committing
- Maximum file size: 500 lines
</CRITICAL_CONSTRAINTS>
<DETAILED_GUIDELINES> # Middle (lower attention)
- Code style preferences
- Documentation templates
- Review checklists
- Example patterns
</DETAILED_GUIDELINES>
<KEY_REMINDERS> # At end (high attention)
- Run tests: npm test
- Format code: npm run format
- Create PR with description
</KEY_REMINDERS>
Example 3: Sub-Agent Context Isolation
# Instead of one agent handling everything:
## Coordinator Agent (lean context)
- Understands task decomposition
- Delegates to specialized sub-agents
- Synthesizes results
## Code Review Sub-Agent (isolated context)
- Loaded only with code review guidelines
- Focuses solely on review task
- Returns structured findings
## Test Writer Sub-Agent (isolated context)
- Loaded only with testing patterns
- Focuses solely on test creation
- Returns test files
This section transforms context degradation detection and mitigation concepts into actionable multi-agent workflows for Claude Code. Use these patterns when building commands, skills, or complex agent pipelines to ensure quality and reliability.
Hallucinations in agent output can poison downstream context and propagate errors through multi-step workflows. This workflow detects hallucinations before they compound.
Step 1: Generate Output
Have the primary agent complete its task normally.
Step 2: Extract Claims
Spawn a verification sub-agent with this prompt:
<TASK>
Extract all factual claims from the following output. List each claim on a separate line.
</TASK>
<FOCUS_AREAS>
- File paths and their existence
- Function/class/method names referenced
- Code behavior assertions ("this function returns X")
- External facts about APIs, libraries, or specifications
- Numerical values and metrics
</FOCUS_AREAS>
<OUTPUT_TO_ANALYZE>
{agent_output}
</OUTPUT_TO_ANALYZE>
<OUTPUT_FORMAT>
One claim per line, prefixed with category:
[PATH] /src/auth/login.ts exists
[CODE] validateCredentials() returns a boolean
[FACT] JWT tokens expire after 24 hours by default
[METRIC] The function has O(n) complexity
</OUTPUT_FORMAT>
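A downstream verifier has to parse these claim lines back out of the sub-agent's response. A minimal sketch of that parser (the helper names are hypothetical):

```python
import re

# Matches lines like "[PATH] /src/auth/login.ts exists"
CLAIM_LINE = re.compile(r"^\[(PATH|CODE|FACT|METRIC)\]\s+(.+)$")


def parse_claims(extraction_output: str) -> list[tuple[str, str]]:
    """Turn the sub-agent's claim list into (category, claim) pairs,
    skipping any lines that do not match the expected format."""
    claims = []
    for line in extraction_output.splitlines():
        m = CLAIM_LINE.match(line.strip())
        if m:
            claims.append((m.group(1), m.group(2)))
    return claims
```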
Step 3: Verify Claims
For each group of extracted claims, spawn a verification agent:
<TASK>
Verify this claim by checking the actual codebase and context.
</TASK>
<CLAIM>
{claim}
</CLAIM>
<VERIFICATION_APPROACH>
- For file paths: Use file tools to check existence
- For code claims: Read the actual code and verify behavior
- For external facts: Cross-reference with documentation or web search
- For metrics: Analyze the code structure
</VERIFICATION_APPROACH>
<RESPONSE_FORMAT>
STATUS: [VERIFIED | FALSE | UNVERIFIABLE]
EVIDENCE: [What you found]
CONFIDENCE: [HIGH | MEDIUM | LOW]
</RESPONSE_FORMAT>
Step 4: Calculate Poisoning Risk
Aggregate verification results:
total_claims = number of claims extracted
verified_count = claims marked VERIFIED
false_count = claims marked FALSE
unverifiable_count = claims marked UNVERIFIABLE
poisoning_risk = (false_count * 2 + unverifiable_count) / total_claims
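The risk formula and the decision thresholds from Step 5 translate directly into code. A sketch (function names are illustrative):

```python
def poisoning_risk(verified: int, false: int, unverifiable: int) -> float:
    """Weight FALSE claims double: a confirmed falsehood is more likely
    to poison downstream context than a merely unverifiable claim."""
    total = verified + false + unverifiable
    if total == 0:
        return 0.0
    return (false * 2 + unverifiable) / total


def decide(risk: float) -> str:
    """Apply the Step 5 decision thresholds."""
    if risk < 0.1:
        return "PROCEED"
    if risk <= 0.3:
        return "MANUAL_REVIEW"
    return "REGENERATE"
```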
Step 5: Decision Threshold
Risk < 0.1: Output is reliable, proceed normally
Risk 0.1-0.3: Review flagged claims manually before proceeding
Risk > 0.3: Regenerate output with more explicit grounding instructions:
<REGENERATION_PROMPT> Previous output contained {false_count} false claims and {unverifiable_count} unverifiable claims.
Specific issues: {list of FALSE and UNVERIFIABLE claims with evidence}
Please regenerate your response. For each factual claim, ground it in evidence you can verify from the codebase or context.
</REGENERATION_PROMPT>
Critical information buried in the middle of long prompts receives less attention. This workflow detects which parts of your prompt are at risk of being ignored by running multiple agents and verifying their outputs against the original instructions.
Step 1: Identify Critical Instructions
Extract all critical instructions from your prompt that the agent MUST follow:
Critical instructions to verify:
1. "Never modify files in /production"
2. "Always run tests before committing"
3. "Use TypeScript strict mode"
4. "Maximum function length: 50 lines"
5. "Include JSDoc for public APIs"
6. "Format output as JSON"
7. "Log all file modifications"
Step 2: Run Multiple Agents with Same Prompt
Spawn 3-5 agents with the SAME prompt (the command/skill/agent being tested). Each agent runs independently with identical inputs:
<AGENT_RUN_CONFIG>
Number of runs: 5
Prompt: {your_full_prompt_being_tested}
Task: {representative_task_that_exercises_all_instructions}
For each run, save:
- run_id: unique identifier
- agent_output: complete response from agent
- timestamp: when run completed
</AGENT_RUN_CONFIG>
Step 3: Verify Each Output Against Original Prompt
For each agent's output, spawn a NEW verification agent that checks compliance with every critical instruction:
<VERIFICATION_AGENT_PROMPT>
<TASK>
You are a compliance verification agent. Analyze whether the agent output followed each instruction from the original prompt.
</TASK>
<ORIGINAL_PROMPT>
{the_full_prompt_being_tested}
</ORIGINAL_PROMPT>
<CRITICAL_INSTRUCTIONS>
{numbered_list_of_critical_instructions}
</CRITICAL_INSTRUCTIONS>
<AGENT_OUTPUT>
{output_from_run_N}
</AGENT_OUTPUT>
<VERIFICATION_APPROACH>
For each critical instruction:
1. Determine if the instruction was applicable to this task
2. If applicable, check whether the output complies
3. Look for both explicit violations and omissions
4. Note any partial compliance
</VERIFICATION_APPROACH>
<OUTPUT_FORMAT>
RUN_ID: {run_id}
INSTRUCTION_COMPLIANCE:
- Instruction 1: "Never modify files in /production"
STATUS: [FOLLOWED | VIOLATED | NOT_APPLICABLE]
EVIDENCE: {quote from output or explanation}
- Instruction 2: "Always run tests before committing"
STATUS: [FOLLOWED | VIOLATED | NOT_APPLICABLE]
EVIDENCE: {quote from output or explanation}
[... continue for all instructions ...]
SUMMARY:
- Instructions followed: {count}
- Instructions violated: {count}
- Not applicable: {count}
</OUTPUT_FORMAT>
</VERIFICATION_AGENT_PROMPT>
Step 4: Aggregate Results and Identify At-Risk Parts
Collect verification results from all runs and identify instructions that were inconsistently followed:
<AGGREGATION_LOGIC>
For each instruction:
followed_count = number of runs where STATUS == FOLLOWED
violated_count = number of runs where STATUS == VIOLATED
applicable_runs = total_runs - (runs where STATUS == NOT_APPLICABLE)
compliance_rate = followed_count / applicable_runs
Classification:
- compliance_rate == 1.0: RELIABLE (always followed)
- compliance_rate >= 0.8: MOSTLY_RELIABLE (minor inconsistency)
- compliance_rate >= 0.5: AT_RISK (inconsistent - likely lost-in-middle)
- compliance_rate < 0.5: FREQUENTLY_IGNORED (severe issue)
- compliance_rate == 0.0: ALWAYS_IGNORED (critical failure)
AT_RISK instructions are the primary signal for lost-in-middle problems.
These are instructions that work sometimes but not consistently, indicating
they are in attention-weak positions.
</AGGREGATION_LOGIC>
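The aggregation logic above is mechanical enough to sketch directly; the function names are illustrative:

```python
def classify_compliance(rate: float) -> str:
    """Map a compliance rate to the classification bands above."""
    if rate == 1.0:
        return "RELIABLE"
    if rate >= 0.8:
        return "MOSTLY_RELIABLE"
    if rate >= 0.5:
        return "AT_RISK"
    if rate > 0.0:
        return "FREQUENTLY_IGNORED"
    return "ALWAYS_IGNORED"


def aggregate(statuses: list[str]) -> tuple[float, str]:
    """statuses: one FOLLOWED/VIOLATED/NOT_APPLICABLE entry per run
    for a single instruction."""
    applicable = [s for s in statuses if s != "NOT_APPLICABLE"]
    if not applicable:
        return 1.0, "NOT_APPLICABLE"
    rate = applicable.count("FOLLOWED") / len(applicable)
    return rate, classify_compliance(rate)
```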
<AGGREGATION_OUTPUT_FORMAT>
INSTRUCTION COMPLIANCE SUMMARY:
| Instruction | Followed | Violated | Compliance Rate | Status |
|-------------|----------|----------|-----------------|--------|
| 1. Never modify /production | 5/5 | 0/5 | 100% | RELIABLE |
| 2. Run tests before commit | 3/5 | 2/5 | 60% | AT_RISK |
| 3. TypeScript strict mode | 4/5 | 1/5 | 80% | MOSTLY_RELIABLE |
| 4. Max function length 50 | 2/5 | 3/5 | 40% | FREQUENTLY_IGNORED |
| 5. Include JSDoc | 5/5 | 0/5 | 100% | RELIABLE |
| 6. Format as JSON | 1/5 | 4/5 | 20% | ALWAYS_IGNORED |
| 7. Log modifications | 3/5 | 2/5 | 60% | AT_RISK |
AT-RISK INSTRUCTIONS (likely in lost-in-middle zone):
- Instruction 2: "Run tests before commit" (60% compliance)
- Instruction 4: "Max function length 50" (40% compliance)
- Instruction 6: "Format as JSON" (20% compliance)
- Instruction 7: "Log modifications" (60% compliance)
</AGGREGATION_OUTPUT_FORMAT>
Step 5: Output Recommendations
Based on the at-risk parts identified, provide specific remediation guidance:
<RECOMMENDATIONS_OUTPUT>
LOST-IN-MIDDLE ANALYSIS COMPLETE
At-Risk Instructions Detected: {count}
These instructions are inconsistently followed, indicating they likely
reside in attention-weak positions (middle of prompt).
SPECIFIC RECOMMENDATIONS:
1. MOVE CRITICAL INFORMATION TO ATTENTION-FAVORED POSITIONS
The following instructions should be relocated to the beginning or end of your prompt:
- "Run tests before commit" -> Move to <CRITICAL_CONSTRAINTS> at prompt START
- "Max function length 50" -> Move to <KEY_REMINDERS> at prompt END
- "Format as JSON" -> Move to <OUTPUT_FORMAT> at prompt END
- "Log modifications" -> Add to both START and END sections
2. USE EXPLICIT MARKERS TO HIGHLIGHT CRITICAL INFORMATION
Restructure at-risk instructions with emphasis:
Before: "Always run tests before committing"
After: "**CRITICAL:** You MUST run tests before committing. Never skip this step."
Before: "Maximum function length: 50 lines"
After: "3. [REQUIRED] Maximum function length: 50 lines"
Use numbered lists, bold markers, or explicit tags like [REQUIRED], [CRITICAL], [MUST].
3. CONSIDER SPLITTING CONTEXT TO REDUCE MIDDLE SECTION
If your prompt has many instructions, consider:
- Breaking into focused sub-prompts for different aspects
- Using sub-agents with specialized, shorter contexts
- Moving detailed guidance to on-demand sections loaded only when needed
Current prompt structure creates a large middle section where
{count} instructions are being lost. Reduce middle section by:
- Moving 2-3 most critical items to edges
- Converting remaining middle items to a numbered checklist
- Adding explicit "verify these items" reminder at end
</RECOMMENDATIONS_OUTPUT>
# Example: Testing a Code Review Command
## Original Prompt Being Tested:
"Review the code for: security issues, performance problems,
code style, test coverage, documentation completeness,
error handling, and logging practices."
## Run 5 Agents:
Each agent reviews the same code sample with this prompt.
## Verification Results:
| Instruction | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Rate |
|-------------|-------|-------|-------|-------|-------|------|
| Security | Y | Y | Y | Y | Y | 100% |
| Performance | Y | X | Y | X | Y | 60% |
| Code style | X | X | Y | X | X | 20% |
| Test coverage | X | Y | X | X | Y | 40% |
| Documentation | X | X | X | Y | X | 20% |
| Error handling | Y | Y | X | Y | Y | 80% |
| Logging | Y | Y | Y | Y | Y | 100% |
## Analysis:
- RELIABLE: Security, Logging (at edges of list)
- AT_RISK: Performance, Error handling
- FREQUENTLY_IGNORED: Code style, Test coverage, Documentation (middle of list)
## Remediation Applied:
"**CRITICAL REVIEW AREAS:**
1. Security vulnerabilities
2. Test coverage gaps
3. Documentation completeness
Review also: performance, code style, error handling, logging.
**BEFORE COMPLETING:** Verify you addressed items 1-3 above."
In multi-agent chains, errors from early agents propagate and amplify through subsequent agents. This workflow traces errors to their source.
Step 1: Capture Agent Chain Outputs
Record the output of each agent in your chain:
Agent Chain Record:
- Agent 1 (Analyzer): {output_1}
- Agent 2 (Planner): {output_2}
- Agent 3 (Implementer): {output_3}
- Agent 4 (Reviewer): {output_4}
Step 2: Identify Error Symptoms
Spawn an error identification agent:
<TASK>
Analyze the final output and identify all errors, inconsistencies, or quality issues.
</TASK>
<FINAL_OUTPUT>
{output_from_last_agent}
</FINAL_OUTPUT>
<OUTPUT_FORMAT>
ERROR_ID: E1
DESCRIPTION: Function missing null check
LOCATION: src/utils/parser.ts:45
SEVERITY: HIGH
ERROR_ID: E2
...
</OUTPUT_FORMAT>
Step 3: Trace Each Error Backward
For each identified error, spawn a trace agent:
<TASK>
Trace this error backward through the agent chain to find its origin.
</TASK>
<ERROR>
{error_description}
</ERROR>
<AGENT_CHAIN_OUTPUTS>
Agent 1 Output: {output_1}
Agent 2 Output: {output_2}
Agent 3 Output: {output_3}
Agent 4 Output: {output_4}
</AGENT_CHAIN_OUTPUTS>
<ANALYSIS_APPROACH>
For each agent output (starting from the last):
1. Does this output contain the error?
2. If yes, was the error present in the input to this agent?
3. If error is in output but not input: This agent INTRODUCED the error
4. If error is in both: This agent PROPAGATED the error
</ANALYSIS_APPROACH>
<OUTPUT_FORMAT>
ERROR: {error_id}
ORIGIN_AGENT: Agent {N}
ORIGIN_TYPE: [INTRODUCED | PROPAGATED_FROM_CONTEXT | PROPAGATED_FROM_TOOL_OUTPUT]
ROOT_CAUSE: {explanation}
CONTEXT_THAT_CAUSED_IT: {relevant context snippet if applicable}
</OUTPUT_FORMAT>
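The backward trace in Step 3 reduces to checking, per agent, whether the error already appeared upstream. A simplified sketch, using substring matching as a stand-in for the LLM-based judgment a real trace agent would perform:

```python
def trace_origin(error_signature: str, chain_outputs: list[str]) -> dict:
    """Walk the chain from the first agent forward: the first output
    containing the error INTRODUCED it; later outputs that still
    contain it PROPAGATED it.

    error_signature is a substring identifying the error; a real
    implementation would use an LLM judge rather than substring match.
    """
    origin = None
    propagated_by = []
    for i, output in enumerate(chain_outputs, start=1):
        if error_signature in output:
            if origin is None:
                origin = i
            else:
                propagated_by.append(i)
    return {"origin_agent": origin, "propagated_by": propagated_by}
```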
Step 4: Calculate Propagation Metrics
For each agent in chain:
errors_introduced = count of errors this agent created
errors_propagated = count of errors this agent passed through
errors_caught = count of errors this agent fixed or flagged
propagation_rate = errors_at_end / errors_introduced_total
amplification_factor = errors_at_end / errors_at_start
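The amplification factor can be computed from per-stage error counts. A sketch (the zero-errors guard is an added assumption to keep the formula total):

```python
def propagation_metrics(errors_per_stage: list[int]) -> dict:
    """errors_per_stage[i] = error count observed after agent i+1.

    amplification > 1 means the chain adds errors faster than
    downstream agents catch them; < 1 means later agents clean up.
    """
    errors_at_start = errors_per_stage[0]
    errors_at_end = errors_per_stage[-1]
    if errors_at_start:
        amplification = errors_at_end / errors_at_start
    else:
        # Guard for a clean start: any error at the end is pure amplification.
        amplification = float("inf") if errors_at_end else 1.0
    return {
        "errors_at_start": errors_at_start,
        "errors_at_end": errors_at_end,
        "amplification_factor": amplification,
    }
```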
Step 5: Establish Error Boundaries
Based on analysis, add verification checkpoints:
<ERROR_BOUNDARY_TEMPLATE>
After Agent {N} completes:
1. Spawn verification agent to check for common error patterns:
- {error_pattern_1 that Agent N tends to introduce}
- {error_pattern_2 that Agent N tends to introduce}
2. If errors detected:
- Log error for analysis
- Either: Fix inline and continue
- Or: Regenerate Agent N output with explicit guidance
3. Only proceed to Agent {N+1} if verification passes
</ERROR_BOUNDARY_TEMPLATE>
Not all parts of a prompt contribute equally to task completion. This workflow identifies distractor parts within a prompt that consume attention budget without adding value.
Step 1: Split Prompt into Parts
Divide the prompt (command/skill/agent) into logical sections. Each part should be a coherent unit:
<PROMPT_PARTS>
PART_1:
ID: background
CONTENT: |
You are a Python expert helping a development team.
Current project: Data processing pipeline in Python 3.9+
PART_2:
ID: code_style_rules
CONTENT: |
- Write clean, idiomatic Python code
- Include type hints for function signatures
- Add docstrings for public functions
- Follow PEP 8 style guidelines
PART_3:
ID: historical_context
CONTENT: |
The project was migrated from Python 2.7 in 2019.
Original team used camelCase naming but we now use snake_case.
Legacy modules in /legacy folder are frozen.
PART_4:
ID: output_format
CONTENT: |
Provide actionable feedback with specific line references.
Explain the reasoning behind suggestions.
</PROMPT_PARTS>
Split along coherent logical boundaries, as in the example above: background, rules, historical context, and output format each form their own part.
Step 2: Spawn Scoring Agents
Spawn multiple scoring agents in parallel:
<TASK>
Score how relevant each prompt part is for accomplishing the specified task.
</TASK>
<TASK_DESCRIPTION>
{description of what the agent should accomplish}
Example: "Review a pull request for code quality issues and suggest improvements"
</TASK_DESCRIPTION>
<PROMPT_PARTS>
{contents of all the parts being evaluated}
</PROMPT_PARTS>
<SCORING_CRITERIA>
Score 0-10 based on these criteria:
ESSENTIAL (8-10):
- Part directly enables task completion
- Removing this part would cause task failure
- Part contains critical constraints that prevent errors
- Part defines required output format or structure
HELPFUL (5-7):
- Part improves output quality but is not strictly required
- Part provides useful context that guides better decisions
- Part contains preferences that affect style but not correctness
MARGINAL (2-4):
- Part has tangential relevance to the task
- Part might occasionally be useful but usually is not
- Part provides historical context rarely needed
DISTRACTOR (0-1):
- Part is irrelevant to the task
- Part could confuse the agent about what to focus on
- Part competes for attention without contributing value
</SCORING_CRITERIA>
<OUTPUT_FORMAT>
RELEVANCE_SCORE: [0-10]
JUSTIFICATION: [2-3 sentences explaining the score]
USAGE_LIKELIHOOD: [How often would the agent reference this part during task execution? ALWAYS | OFTEN | SOMETIMES | RARELY | NEVER]
</OUTPUT_FORMAT>
Step 3: Aggregate Relevance Scores
Collect scores from all scoring agents:
PART_SCORES = [
{id: "background", score: 8, usage: "ALWAYS"},
{id: "code_style_rules", score: 9, usage: "ALWAYS"},
{id: "historical_context", score: 3, usage: "RARELY"},
{id: "output_format", score: 7, usage: "OFTEN"}
]
Calculate aggregate metrics:
total_parts = count(PART_SCORES)
high_relevance_parts = count(parts where score >= 5)
distractor_parts = count(parts where score < 5)
context_efficiency = high_relevance_parts / total_parts
average_relevance = sum(scores) / total_parts
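These aggregate metrics are straightforward to compute. A sketch over the sample scores above (the function name is illustrative):

```python
def relevance_summary(part_scores: list[dict]) -> dict:
    """part_scores entries: {"id": str, "score": int in 0-10}.
    Parts scoring below 5 count as distractors, per the default threshold."""
    total = len(part_scores)
    distractors = [p["id"] for p in part_scores if p["score"] < 5]
    return {
        "total_parts": total,
        "distractor_parts": distractors,
        "context_efficiency": (total - len(distractors)) / total,
        "average_relevance": sum(p["score"] for p in part_scores) / total,
    }
```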
Step 4: Identify Distractor Parts
Apply the distractor threshold (score < 5):
DISTRACTOR_ANALYSIS:
Identified Distractors:
1. PART: historical_context
SCORE: 3/10
JUSTIFICATION: "Migration history from Python 2.7 is rarely relevant to reviewing current code. The naming convention note is useful but should be in code_style_rules instead."
RECOMMENDATION: REMOVE or RELOCATE
Summary:
- Total parts: 4
- High-relevance parts (>=5): 3
- Distractor parts (<5): 1
- Context efficiency: 75%
- Average relevance: 6.75
Token Impact:
- Distractor tokens: ~45 (historical_context)
- Potential savings: 45 tokens (11% of prompt)
Step 5: Generate Optimization Recommendations
Based on distractor analysis, provide actionable recommendations:
OPTIMIZATION_RECOMMENDATIONS:
1. REMOVE: historical_context
Reason: Score 3/10, usage RARELY. Migration history does not inform code review decisions.
2. RELOCATE: "we now use snake_case" from historical_context
Target: code_style_rules section
Reason: This specific rule is relevant but buried in irrelevant historical context.
3. CONSIDER CONDENSING: background
Current: 2 sentences
Could be: 1 sentence ("Python 3.9+ data pipeline expert")
Savings: ~15 tokens
OPTIMIZED PROMPT STRUCTURE:
- background (condensed): 8 tokens
- code_style_rules (with snake_case added): 52 tokens
- output_format: 28 tokens
- Total: 88 tokens (down from 133 tokens)
- Efficiency improvement: 34% reduction
The default threshold of 5 balances comprehensiveness against efficiency:
| Threshold | Use Case |
|---|---|
| < 3 | Conservative pruning for critical prompts |
| < 5 | Standard optimization (recommended default) |
| < 7 | Aggressive pruning for token-constrained contexts |
Adjust the threshold based on how token-constrained the context is and how critical the prompt's instructions are.
For efficiency, parallelize scoring agents:
# Parallel execution pattern
spawn_parallel([
scoring_agent(part_1, task_description),
scoring_agent(part_2, task_description),
scoring_agent(part_3, task_description),
...
])
# Collect and aggregate
scores = await_all(scoring_agents)
analysis = aggregate_scores(scores)
For large prompts (>10 parts), batch scoring agents in groups of 5-7 to manage orchestration overhead.
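The pseudocode above maps naturally onto asyncio. A runnable sketch with a stubbed scoring call (`score_part` is a placeholder for the real sub-agent invocation, and the batch size reflects the 5-7 guidance above):

```python
import asyncio


async def score_part(part_id: str, content: str, task: str) -> dict:
    """Placeholder for spawning a scoring sub-agent; a real version
    would call the model API here and parse the RELEVANCE_SCORE."""
    await asyncio.sleep(0)  # stand-in for the async model call
    return {"id": part_id, "score": 5}  # stub score


async def score_all(parts: dict[str, str], task: str,
                    batch_size: int = 5) -> list[dict]:
    """Score all parts, batched in groups to limit orchestration overhead."""
    items = list(parts.items())
    results = []
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        results += await asyncio.gather(
            *(score_part(pid, content, task) for pid, content in batch)
        )
    return results
```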
Long-running agent sessions accumulate context that degrades over time. This workflow monitors context health and triggers intervention.
Step 1: Periodic Symptom Detection
Every N turns (recommended: every 10 turns), spawn a health check agent:
<TASK>
Analyze the recent conversation history for signs of context degradation.
</TASK>
<RECENT_HISTORY>
{last 10 turns of conversation}
</RECENT_HISTORY>
<SYMPTOM_CHECKLIST>
Check for these degradation symptoms:
LOST_IN_MIDDLE:
- [ ] Agent missing instructions from early in conversation
- [ ] Critical constraints being ignored
- [ ] Agent asking for information already provided
CONTEXT_POISONING:
- [ ] Same error appearing repeatedly
- [ ] Agent referencing incorrect information as fact
- [ ] Hallucinations that persist despite correction
CONTEXT_DISTRACTION:
- [ ] Responses becoming unfocused
- [ ] Agent using irrelevant context inappropriately
- [ ] Quality declining on previously-successful tasks
CONTEXT_CONFUSION:
- [ ] Agent mixing up different task requirements
- [ ] Wrong tool selections for obvious tasks
- [ ] Outputs that blend requirements from different tasks
CONTEXT_CLASH:
- [ ] Agent expressing uncertainty about conflicting information
- [ ] Inconsistent behavior between turns
- [ ] Agent asking for clarification on resolved issues
</SYMPTOM_CHECKLIST>
<OUTPUT_FORMAT>
HEALTH_STATUS: [HEALTHY | DEGRADED | CRITICAL]
SYMPTOMS_DETECTED: [list of checked symptoms]
RECOMMENDED_ACTION: [CONTINUE | COMPACT | RESTART]
SPECIFIC_ISSUES: [detailed description of problems found]
</OUTPUT_FORMAT>
Step 2: Automated Intervention
Based on health status, trigger appropriate intervention:
IF HEALTH_STATUS == "DEGRADED" or HEALTH_STATUS == "CRITICAL":
<RESTART_INTERVENTION>
1. Extract the essential state to preserve and save it to a file
2. Ask the user to start a new session with a clean context, then load the preserved state from the file once the new session begins
</RESTART_INTERVENTION>
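The restart intervention hinges on serializing a minimal state file. A sketch of what that could look like (the field names and helper functions are hypothetical):

```python
import json
from pathlib import Path


def preserve_state(path: str, task_goal: str, decisions: list[str],
                   open_items: list[str]) -> None:
    """Write the minimal state a fresh session needs: the goal, the
    decisions already made, and what is still open. Everything else
    (verbose tool outputs, dead ends) is deliberately dropped."""
    Path(path).write_text(json.dumps({
        "task_goal": task_goal,
        "decisions": decisions,
        "open_items": open_items,
    }, indent=2))


def restore_state(path: str) -> dict:
    """Load the preserved state at the start of the new session."""
    return json.loads(Path(path).read_text())
```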
Context optimization extends the effective capacity of limited context windows through strategic compression, masking, caching, and partitioning. The goal is not to magically increase context windows but to make better use of available capacity. Effective optimization can double or triple effective context capacity without requiring larger models or longer contexts.
The four primary strategies are compaction (summarizing context near limits), observation masking (replacing verbose outputs with references), KV-cache optimization (reusing cached computations), and context partitioning (splitting work across isolated contexts).
The key insight is that context quality matters more than quantity. Optimization preserves signal while reducing noise. The art lies in selecting what to keep versus what to discard, and when to apply each technique.
What is Compaction Compaction is the practice of summarizing context contents when approaching limits, then reinitializing a new context window with the summary. This distills the contents of a context window in a high-fidelity manner, enabling the agent to continue with minimal performance degradation.
Compaction typically serves as the first lever in context optimization.
Compaction in Practice Compaction works by identifying sections that can be compressed, generating summaries that capture essential points, and replacing full content with summaries. Priority for compression:
Summary Generation Effective summaries preserve different elements depending on content type:
The Observation Problem Tool outputs can comprise 80%+ of token usage in agent trajectories. Much of this is verbose output that has already served its purpose. Once an agent has used a tool output to make a decision, keeping the full output provides diminishing value while consuming significant context.
Observation masking replaces verbose tool outputs with compact references. The information remains accessible if needed but does not consume context continuously.
Masking Strategy Selection Not all observations should be masked equally:
Never mask:
Consider masking:
Always mask:
Sub-Agent Partitioning The most aggressive form of context optimization is partitioning work across sub-agents with isolated contexts. Each sub-agent operates in a clean context focused on its subtask without carrying accumulated context from other subtasks.
This approach achieves separation of concerns--the detailed search context remains isolated within sub-agents while the coordinator focuses on synthesis and analysis.
When to Partition Consider partitioning when:
Result Aggregation Aggregate results from partitioned subtasks by:
When to optimize:
What to apply:
Command Optimization Commands load on-demand, so focus on keeping individual commands focused:
# Good: Focused command with clear scope
---
name: review-security
description: Review code for security vulnerabilities
---
# Specific security review instructions only
# Avoid: Overloaded command trying to do everything
---
name: review-all
description: Review code for everything
---
# 50 different review checklists crammed together
Skill Optimization Skills load their descriptions by default, so descriptions must be concise:
# Good: Concise description
description: Analyze code architecture. Use for design reviews.
# Avoid: Verbose description that wastes context budget
description: This skill provides comprehensive analysis of code
architecture including but not limited to class hierarchies,
dependency graphs, coupling metrics, cohesion analysis...
Sub-Agent Context Design When spawning sub-agents, provide focused context:
# Coordinator provides minimal handoff:
"Review authentication module for security issues.
Return findings in structured format."
# NOT this verbose handoff:
"I need you to look at the authentication module which is
located in src/auth/ and contains several files including
login.ts, session.ts, tokens.ts... [500 more tokens of context]"