context-optimization by guanyang/antigravity-skills
npx skills add https://github.com/guanyang/antigravity-skills --skill context-optimization
Context optimization extends the effective capacity of limited context windows through strategic compression, masking, caching, and partitioning. Effective optimization can double or triple effective context capacity without requiring larger models or longer windows — but only when applied with discipline. The techniques below are ordered by impact and risk.
Activate this skill when:
Apply four primary strategies in this priority order:
KV-cache optimization — Reorder and stabilize prompt structure so the inference engine reuses cached Key/Value tensors. This is the cheapest optimization: zero quality risk, immediate cost and latency savings. Apply it first and unconditionally.
Observation masking — Replace verbose tool outputs with compact references once their purpose has been served. Tool outputs consume 80%+ of tokens in typical agent trajectories, so masking them yields the largest capacity gains. The original content remains retrievable if needed downstream.
Compaction — Summarize accumulated context when utilization exceeds 70%, then reinitialize with the summary. This distills the window's contents while preserving task-critical state. Compaction is lossy — apply it after masking has already removed the low-value bulk.
Context partitioning — Split work across sub-agents with isolated contexts when a single window cannot hold the full problem. Each sub-agent operates in a clean context focused on its subtask. Reserve this for tasks where estimated context exceeds 60% of the window limit, because coordination overhead is real.
The governing principle: context quality matters more than quantity. Every optimization preserves signal while reducing noise. Measure before optimizing, then measure the optimization's effect.
Trigger compaction when context utilization exceeds 70%: summarize the current context, then reinitialize with the summary. This distills the window's contents in a high-fidelity manner, enabling continuation with minimal performance degradation. Prioritize compressing tool outputs first (they consume 80%+ of tokens), then old conversation turns, then retrieved documents. Never compress the system prompt — it anchors model behavior and its removal causes unpredictable degradation.
Preserve different elements by message type:
Target 50-70% token reduction with less than 5% quality degradation. If compaction exceeds 70% reduction, audit the summary for critical information loss — over-aggressive compaction is the most common failure mode.
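The reduction-ratio audit described above can be sketched as follows. This is a minimal illustration, not a real implementation: `summarize` is a hypothetical stand-in for a model summarization call.

```python
# Sketch: audit compaction against the 50-70% reduction target band.
# `summarize` is a hypothetical placeholder for an actual model call.

def summarize(text: str) -> str:
    # Placeholder: a real implementation would call a model to summarize.
    return text[: max(1, len(text) // 3)]

def compact_with_audit(context: str) -> tuple[str, bool]:
    """Compact context and flag summaries that compress too aggressively."""
    summary = summarize(context)
    reduction = 1 - len(summary) / len(context)
    # Reductions beyond 70% are the most common failure mode:
    # flag them for a manual audit instead of trusting the summary.
    needs_audit = reduction > 0.70
    return summary, needs_audit

summary, needs_audit = compact_with_audit("x" * 300)
```

The audit flag does not block compaction; it routes the summary for inspection before the window is reinitialized.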
Mask observations selectively based on recency and ongoing relevance — not uniformly. Apply these rules:
[Obs:{ref_id} elided. Key: {summary}. Full content retrievable.]

Masking should achieve 60-80% reduction in masked observations with less than 2% quality impact. The key is maintaining retrievability — store the full content externally and keep the reference ID in context so the agent can request the original if needed.
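A minimal sketch of retrievable masking, assuming an in-memory store and an illustrative length threshold (a real system would persist observations externally):

```python
# Sketch: mask a verbose observation but keep it retrievable by reference ID.
# The store, threshold, and ID scheme are illustrative assumptions.

_obs_store: dict[str, str] = {}

def mask_observation(observation: str, summary: str, max_length: int = 200) -> str:
    if len(observation) <= max_length:
        return observation  # short outputs stay inline
    ref_id = f"obs-{len(_obs_store) + 1}"
    _obs_store[ref_id] = observation  # full content stays retrievable
    return f"[Obs:{ref_id} elided. Key: {summary}. Full content retrievable.]"

def retrieve_observation(ref_id: str) -> str:
    return _obs_store[ref_id]

masked = mask_observation("very long tool output " * 50, "grep found 3 matches")
```

Because the reference ID stays in context, the agent can ask for the original content in a later turn instead of losing it permanently.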
Maximize prefix cache hits by structuring prompts so that stable content occupies the prefix and dynamic content appears at the end. KV-cache stores Key and Value tensors computed during inference; when consecutive requests share an identical prefix, the cached tensors are reused, saving both cost and latency.
Apply this ordering in every prompt:
Design prompts for cache stability: remove timestamps, session counters, and request IDs from the system prompt. Move dynamic metadata into a separate user message or tool result where it does not break the prefix. Even a single whitespace change in the prefix invalidates the entire cached block downstream of that change.
Target 70%+ cache hit rate for stable workloads. At scale, this translates to 50%+ cost reduction and 40%+ latency reduction on cached tokens.
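The stable-prefix rule can be sketched as a prompt assembler that keeps the cacheable messages byte-identical across requests. The message shape and prompt strings are illustrative assumptions:

```python
# Sketch: keep the cacheable prefix byte-identical across requests by
# moving dynamic metadata (dates, session IDs) into a trailing message.
import datetime

SYSTEM_PROMPT = "You are a coding agent."            # immutable: no interpolation
TOOL_DEFINITIONS = "tools: [read_file, write_file]"  # illustrative stand-in

def build_messages(user_input: str) -> list[dict]:
    dynamic = f"Current date: {datetime.date.today().isoformat()}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},             # stable prefix
        {"role": "system", "content": TOOL_DEFINITIONS},          # stable prefix
        {"role": "user", "content": f"{dynamic}\n{user_input}"},  # dynamic tail
    ]

a = build_messages("fix the bug")
b = build_messages("add a test")
# The first two messages are identical across calls, so a prefix
# cache can reuse their Key/Value tensors.
```

Interpolating the date into `SYSTEM_PROMPT` instead would make the prefix differ between days and force a full cache miss.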
Partition work across sub-agents when a single context cannot hold the full problem without triggering aggressive compaction. Each sub-agent operates in a clean, focused context for its subtask, then returns a structured result to a coordinator agent.
Plan partitioning when estimated task context exceeds 60% of the window limit. Decompose the task into independent subtasks, assign each to a sub-agent, and aggregate results. Validate that all partitions completed before merging, merge compatible results, and apply summarization if the aggregated output still exceeds budget.
This approach achieves separation of concerns — detailed search context stays isolated within sub-agents while the coordinator focuses on synthesis. However, coordination has real token cost: the coordinator prompt, result aggregation, and error handling all consume tokens. Only partition when the savings exceed this overhead.
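The partition-or-not decision above reduces to a simple gate. This sketch encodes the 60%-of-window threshold and the minimum-subtask rule; the function and its parameters are illustrative, not part of any real framework:

```python
# Sketch: gate partitioning on window pressure and subtask count.
# Below 3 independent subtasks, coordination overhead usually
# exceeds the context savings.

def should_partition(estimated_tokens: int, window_limit: int,
                     n_subtasks: int) -> bool:
    """Partition only when the task would use >60% of the window
    and there are enough independent subtasks to beat overhead."""
    return estimated_tokens > 0.6 * window_limit and n_subtasks >= 3
```

A fuller coordinator would also estimate total tokens (coordinator plus all sub-agents) before committing, as the pitfalls section below advises.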
Allocate explicit token budgets across context categories before the session begins: system prompt, tool definitions, retrieved documents, message history, tool outputs, and a reserved buffer (5-10% of total). Monitor usage against budget continuously and trigger optimization when any category exceeds its allocation or total utilization crosses 70%.
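A budget monitor along these lines might look as follows. Category names, limits, and the window size are illustrative assumptions:

```python
# Sketch: per-category token budgets with trigger-based checks.
# All limits below are illustrative, not recommendations.

BUDGETS = {
    "system_prompt": 2_000,
    "tool_definitions": 3_000,
    "retrieved_docs": 30_000,
    "message_history": 40_000,
    "tool_outputs": 20_000,
}
RESERVE_FRACTION = 0.10   # reserved buffer (5-10% of total)
WINDOW_LIMIT = 128_000

def over_budget(usage: dict[str, int]) -> list[str]:
    """Return the categories (plus 'total') that should trigger optimization."""
    flags = [cat for cat, used in usage.items() if used > BUDGETS.get(cat, 0)]
    usable = WINDOW_LIMIT * (1 - RESERVE_FRACTION)
    if sum(usage.values()) > 0.70 * usable:   # 70% utilization trigger
        flags.append("total")
    return flags
```

Checking on every turn (trigger-based) rather than on a timer matches the guidance in the next paragraph.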
Use trigger-based optimization rather than periodic optimization. Monitor these signals:
Select the optimization technique based on what dominates the context:
| Context Composition | First Action | Second Action |
|---|---|---|
| Tool outputs dominate (>50%) | Observation masking | Compaction of remaining turns |
| Retrieved documents dominate | Summarization | Partitioning if docs are independent |
| Message history dominates | Compaction with selective preservation | Partitioning for new subtasks |
| Multiple components contribute | KV-cache optimization first | Layer masking + compaction |
| Near-limit with active debugging | Mask resolved tool outputs only — preserve error details | — |
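The decision table above can be encoded as a small dispatcher. The function signature and the fraction-based input format are illustrative assumptions:

```python
# Sketch: pick the first action from the decision table, given the
# fraction of context each component occupies.

def first_action(fractions: dict[str, float], debugging: bool) -> str:
    if debugging:
        # Near-limit during active debugging: never hide error details.
        return "mask resolved tool outputs only"
    if fractions.get("tool_outputs", 0) > 0.5:
        return "observation masking"
    dominant = max(fractions, key=fractions.get)
    if dominant == "retrieved_docs":
        return "summarization"
    if dominant == "message_history":
        return "compaction with selective preservation"
    # Multiple components contribute: start with the zero-risk option.
    return "kv-cache optimization"
```

The second action from the table (e.g. compaction of remaining turns) would follow once the first action's effect has been measured.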
Track these metrics to validate optimization effectiveness:
Iterate on strategies based on measured results. If an optimization technique does not measurably improve the target metric, remove it — optimization machinery itself consumes tokens and adds latency.
Example 1: Compaction Trigger
if context_tokens / context_limit > 0.8:  # utilization past the 70-80% trigger band
    context = compact_context(context)    # summarize, then reinitialize with the summary
Example 2: Observation Masking
if len(observation) > max_length:            # mask only verbose outputs
    ref_id = store_observation(observation)  # full content stays retrievable
    return f"[Obs:{ref_id} elided. Key: {extract_key(observation)}]"
Example 3: Cache-Friendly Ordering
# Stable content first
context = [system_prompt, tool_definitions] # Cacheable
context += [reused_templates] # Reusable
context += [unique_content] # Unique
Whitespace breaks KV-cache: Even a single whitespace or newline change in the prompt prefix invalidates the entire KV-cache block downstream of that point. Pin system prompts as immutable strings — do not interpolate timestamps, version numbers, or session IDs into them. Diff prompt templates byte-for-byte between deployments.
Timestamps in system prompts destroy cache hit rates: Including Current date: {today} or similar dynamic content in the system prompt forces a full cache miss on every new day (or every request, if using time-of-day). Move dynamic metadata into a user message or a separate tool result appended after the stable prefix.
Compaction under pressure loses critical state: When the model performing compaction is itself under context pressure (>85% utilization), its summarization quality degrades — it omits task goals, drops user constraints, and flattens nuanced state. Trigger compaction at 70-80%, not 90%+. If compaction must happen late, use a separate model call with a clean context containing only the material to summarize.
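The "separate clean call" mitigation can be sketched as follows, with `call_model` as a hypothetical stand-in for an LLM client:

```python
# Sketch: late compaction via a separate model call whose context holds
# only the material to summarize, so the summarizer itself is not under
# context pressure. `call_model` is a hypothetical placeholder.

def call_model(system: str, user: str) -> str:
    # Placeholder: a real implementation would invoke a model API.
    return f"summary of {len(user)} chars"

def compact_late(material: str) -> str:
    """Summarize in a fresh context containing only the material itself."""
    return call_model(
        system="Summarize, preserving task goals and user constraints.",
        user=material,
    )
```

The key property is that the summarizing call starts from an empty window rather than inheriting the overloaded one.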
Masking error outputs breaks debugging loops: Over-aggressive masking hides error messages, stack traces, and failure details that the agent needs in subsequent turns to diagnose and fix issues. During active debugging (error in the last 3 turns), suspend masking for all error-related observations until the issue is resolved.
Partitioning overhead can exceed savings: Each sub-agent requires its own system prompt, tool definitions, and coordination messages. For tasks with fewer than 3 independent subtasks, the coordination overhead often exceeds the context savings. Estimate total tokens (coordinator + all sub-agents) before committing to partitioning.
Cache miss cost spikes after deployment changes: Reordering tools, rewording the system prompt, or changing few-shot examples between deployments invalidates the entire prefix cache, causing a temporary cost spike of 2-5x until the new cache warms up. Roll out prompt changes gradually and monitor cache hit rate during deployment windows.
This skill builds on context-fundamentals and context-degradation. It connects to:
Internal reference:
Related skills in this collection:
External resources:
Created: 2025-12-20 · Last Updated: 2026-03-17 · Author: Agent Skills for Context Engineering Contributors · Version: 2.0.0
Weekly Installs
53
Repository
GitHub Stars
518
First Seen
Jan 26, 2026
Security Audits
Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Pass
Installed on
opencode: 49
codex: 48
github-copilot: 47
cursor: 46
gemini-cli: 46
amp: 45
Compaction creates false confidence in stale summaries: Once context is compacted, the summary looks authoritative but may reflect outdated state. If the task has evolved since compaction (new user requirements, corrected assumptions), the summary silently carries forward stale information. After compaction, re-validate the summary against the current task goal before proceeding.