context-optimization by guanyang/antigravity-skills
npx skills add https://github.com/guanyang/antigravity-skills --skill context-optimization
Context optimization extends the effective capacity of limited context windows through strategic compression, masking, caching, and partitioning. Effective optimization can double or triple effective context capacity without requiring larger models or longer windows — but only when applied with discipline. The techniques below are ordered by impact and risk.
Activate this skill when:
Apply four primary strategies in this priority order:
KV-cache optimization — Reorder and stabilize prompt structure so the inference engine reuses cached Key/Value tensors. This is the cheapest optimization: zero quality risk, immediate cost and latency savings. Apply it first and unconditionally.
Observation masking — Replace verbose tool outputs with compact references once their purpose has been served. Tool outputs consume 80%+ of tokens in typical agent trajectories, so masking them yields the largest capacity gains. The original content remains retrievable if needed downstream.
Compaction — Summarize accumulated context when utilization exceeds 70%, then reinitialize with the summary. This distills the window's contents while preserving task-critical state. Compaction is lossy — apply it after masking has already removed the low-value bulk.
Context partitioning — Split work across sub-agents with isolated contexts when a single window cannot hold the full problem. Each sub-agent operates in a clean context focused on its subtask. Reserve this for tasks where estimated context exceeds 60% of the window limit, because coordination overhead is real.
The governing principle: context quality matters more than quantity. Every optimization preserves signal while reducing noise. Measure before optimizing, then measure the optimization's effect.
Trigger compaction when context utilization exceeds 70%: summarize the current context, then reinitialize with the summary. This distills the window's contents in a high-fidelity manner, enabling continuation with minimal performance degradation. Prioritize compressing tool outputs first (they consume 80%+ of tokens), then old conversation turns, then retrieved documents. Never compress the system prompt — it anchors model behavior and its removal causes unpredictable degradation.
Preserve different elements by message type:
Target 50-70% token reduction with less than 5% quality degradation. If compaction exceeds 70% reduction, audit the summary for critical information loss — over-aggressive compaction is the most common failure mode.
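The reduction-ratio audit described above can be sketched as follows. This is a minimal illustration, not a real implementation: `summarize` is a hypothetical stand-in for a model summarization call.

```python
# Sketch: audit compaction against the 50-70% reduction target band.
# `summarize` is a hypothetical placeholder for an actual model call.

def summarize(text: str) -> str:
    # Placeholder: a real implementation would call a model to summarize.
    return text[: max(1, len(text) // 3)]

def compact_with_audit(context: str) -> tuple[str, bool]:
    """Compact context and flag summaries that compress too aggressively."""
    summary = summarize(context)
    reduction = 1 - len(summary) / len(context)
    # Reductions beyond 70% are the most common failure mode:
    # flag them for a manual audit instead of trusting the summary.
    needs_audit = reduction > 0.70
    return summary, needs_audit

summary, needs_audit = compact_with_audit("x" * 300)
```

The audit flag does not block compaction; it routes the summary for inspection before the window is reinitialized.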
Mask observations selectively based on recency and ongoing relevance — not uniformly. Apply these rules:
[Obs:{ref_id} elided. Key: {summary}. Full content retrievable.]

Masking should achieve 60-80% reduction in masked observations with less than 2% quality impact. The key is maintaining retrievability — store the full content externally and keep the reference ID in context so the agent can request the original if needed.
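A minimal sketch of retrievable masking, assuming an in-memory store and an illustrative length threshold (a real system would persist observations externally):

```python
# Sketch: mask a verbose observation but keep it retrievable by reference ID.
# The store, threshold, and ID scheme are illustrative assumptions.

_obs_store: dict[str, str] = {}

def mask_observation(observation: str, summary: str, max_length: int = 200) -> str:
    if len(observation) <= max_length:
        return observation  # short outputs stay inline
    ref_id = f"obs-{len(_obs_store) + 1}"
    _obs_store[ref_id] = observation  # full content stays retrievable
    return f"[Obs:{ref_id} elided. Key: {summary}. Full content retrievable.]"

def retrieve_observation(ref_id: str) -> str:
    return _obs_store[ref_id]

masked = mask_observation("very long tool output " * 50, "grep found 3 matches")
```

Because the reference ID stays in context, the agent can ask for the original content in a later turn instead of losing it permanently.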
Maximize prefix cache hits by structuring prompts so that stable content occupies the prefix and dynamic content appears at the end. KV-cache stores Key and Value tensors computed during inference; when consecutive requests share an identical prefix, the cached tensors are reused, saving both cost and latency.
Apply this ordering in every prompt:
Design prompts for cache stability: remove timestamps, session counters, and request IDs from the system prompt. Move dynamic metadata into a separate user message or tool result where it does not break the prefix. Even a single whitespace change in the prefix invalidates the entire cached block downstream of that change.
Target 70%+ cache hit rate for stable workloads. At scale, this translates to 50%+ cost reduction and 40%+ latency reduction on cached tokens.
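The stable-prefix rule can be sketched as a prompt assembler that keeps the cacheable messages byte-identical across requests. The message shape and prompt strings are illustrative assumptions:

```python
# Sketch: keep the cacheable prefix byte-identical across requests by
# moving dynamic metadata (dates, session IDs) into a trailing message.
import datetime

SYSTEM_PROMPT = "You are a coding agent."            # immutable: no interpolation
TOOL_DEFINITIONS = "tools: [read_file, write_file]"  # illustrative stand-in

def build_messages(user_input: str) -> list[dict]:
    dynamic = f"Current date: {datetime.date.today().isoformat()}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},             # stable prefix
        {"role": "system", "content": TOOL_DEFINITIONS},          # stable prefix
        {"role": "user", "content": f"{dynamic}\n{user_input}"},  # dynamic tail
    ]

a = build_messages("fix the bug")
b = build_messages("add a test")
# The first two messages are identical across calls, so a prefix
# cache can reuse their Key/Value tensors.
```

Interpolating the date into `SYSTEM_PROMPT` instead would make the prefix differ between days and force a full cache miss.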
Partition work across sub-agents when a single context cannot hold the full problem without triggering aggressive compaction. Each sub-agent operates in a clean, focused context for its subtask, then returns a structured result to a coordinator agent.
Plan partitioning when estimated task context exceeds 60% of the window limit. Decompose the task into independent subtasks, assign each to a sub-agent, and aggregate results. Validate that all partitions completed before merging, merge compatible results, and apply summarization if the aggregated output still exceeds budget.
This approach achieves separation of concerns — detailed search context stays isolated within sub-agents while the coordinator focuses on synthesis. However, coordination has real token cost: the coordinator prompt, result aggregation, and error handling all consume tokens. Only partition when the savings exceed this overhead.
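The partition-or-not decision above reduces to a simple gate. This sketch encodes the 60%-of-window threshold and the minimum-subtask rule; the function and its parameters are illustrative, not part of any real framework:

```python
# Sketch: gate partitioning on window pressure and subtask count.
# Below 3 independent subtasks, coordination overhead usually
# exceeds the context savings.

def should_partition(estimated_tokens: int, window_limit: int,
                     n_subtasks: int) -> bool:
    """Partition only when the task would use >60% of the window
    and there are enough independent subtasks to beat overhead."""
    return estimated_tokens > 0.6 * window_limit and n_subtasks >= 3
```

A fuller coordinator would also estimate total tokens (coordinator plus all sub-agents) before committing, as the pitfalls section below advises.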
Allocate explicit token budgets across context categories before the session begins: system prompt, tool definitions, retrieved documents, message history, tool outputs, and a reserved buffer (5-10% of total). Monitor usage against budget continuously and trigger optimization when any category exceeds its allocation or total utilization crosses 70%.
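A budget monitor along these lines might look as follows. Category names, limits, and the window size are illustrative assumptions:

```python
# Sketch: per-category token budgets with trigger-based checks.
# All limits below are illustrative, not recommendations.

BUDGETS = {
    "system_prompt": 2_000,
    "tool_definitions": 3_000,
    "retrieved_docs": 30_000,
    "message_history": 40_000,
    "tool_outputs": 20_000,
}
RESERVE_FRACTION = 0.10   # reserved buffer (5-10% of total)
WINDOW_LIMIT = 128_000

def over_budget(usage: dict[str, int]) -> list[str]:
    """Return the categories (plus 'total') that should trigger optimization."""
    flags = [cat for cat, used in usage.items() if used > BUDGETS.get(cat, 0)]
    usable = WINDOW_LIMIT * (1 - RESERVE_FRACTION)
    if sum(usage.values()) > 0.70 * usable:   # 70% utilization trigger
        flags.append("total")
    return flags
```

Checking on every turn (trigger-based) rather than on a timer matches the guidance in the next paragraph.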
Use trigger-based optimization rather than periodic optimization. Monitor these signals:
Select the optimization technique based on what dominates the context:
| Context Composition | First Action | Second Action |
|---|---|---|
| Tool outputs dominate (>50%) | Observation masking | Compaction of remaining turns |
| Retrieved documents dominate | Summarization | Partitioning if docs are independent |
| Message history dominates | Compaction with selective preservation | Partitioning for new subtasks |
| Multiple components contribute | KV-cache optimization first | Layer masking + compaction |
| Near-limit with active debugging | Mask resolved tool outputs only — preserve error details | — |
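The decision table above can be encoded as a small dispatcher. The function signature and the fraction-based input format are illustrative assumptions:

```python
# Sketch: pick the first action from the decision table, given the
# fraction of context each component occupies.

def first_action(fractions: dict[str, float], debugging: bool) -> str:
    if debugging:
        # Near-limit during active debugging: never hide error details.
        return "mask resolved tool outputs only"
    if fractions.get("tool_outputs", 0) > 0.5:
        return "observation masking"
    dominant = max(fractions, key=fractions.get)
    if dominant == "retrieved_docs":
        return "summarization"
    if dominant == "message_history":
        return "compaction with selective preservation"
    # Multiple components contribute: start with the zero-risk option.
    return "kv-cache optimization"
```

The second action from the table (e.g. compaction of remaining turns) would follow once the first action's effect has been measured.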
Track these metrics to validate optimization effectiveness:
Iterate on strategies based on measured results. If an optimization technique does not measurably improve the target metric, remove it — optimization machinery itself consumes tokens and adds latency.
Example 1: Compaction Trigger
if context_tokens / context_limit > 0.8:  # utilization past the 70-80% trigger band
    context = compact_context(context)    # summarize, then reinitialize with the summary
Example 2: Observation Masking
if len(observation) > max_length:            # mask only verbose outputs
    ref_id = store_observation(observation)  # full content stays retrievable
    return f"[Obs:{ref_id} elided. Key: {extract_key(observation)}]"
Example 3: Cache-Friendly Ordering
# Stable content first
context = [system_prompt, tool_definitions] # Cacheable
context += [reused_templates] # Reusable
context += [unique_content] # Unique
Whitespace breaks KV-cache: Even a single whitespace or newline change in the prompt prefix invalidates the entire KV-cache block downstream of that point. Pin system prompts as immutable strings — do not interpolate timestamps, version numbers, or session IDs into them. Diff prompt templates byte-for-byte between deployments.
Timestamps in system prompts destroy cache hit rates: Including Current date: {today} or similar dynamic content in the system prompt forces a full cache miss on every new day (or every request, if using time-of-day). Move dynamic metadata into a user message or a separate tool result appended after the stable prefix.
Compaction under pressure loses critical state: When the model performing compaction is itself under context pressure (>85% utilization), its summarization quality degrades — it omits task goals, drops user constraints, and flattens nuanced state. Trigger compaction at 70-80%, not 90%+. If compaction must happen late, use a separate model call with a clean context containing only the material to summarize.
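The "separate clean call" mitigation can be sketched as follows, with `call_model` as a hypothetical stand-in for an LLM client:

```python
# Sketch: late compaction via a separate model call whose context holds
# only the material to summarize, so the summarizer itself is not under
# context pressure. `call_model` is a hypothetical placeholder.

def call_model(system: str, user: str) -> str:
    # Placeholder: a real implementation would invoke a model API.
    return f"summary of {len(user)} chars"

def compact_late(material: str) -> str:
    """Summarize in a fresh context containing only the material itself."""
    return call_model(
        system="Summarize, preserving task goals and user constraints.",
        user=material,
    )
```

The key property is that the summarizing call starts from an empty window rather than inheriting the overloaded one.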
Masking error outputs breaks debugging loops: Over-aggressive masking hides error messages, stack traces, and failure details that the agent needs in subsequent turns to diagnose and fix issues. During active debugging (error in the last 3 turns), suspend masking for all error-related observations until the issue is resolved.
Partitioning overhead can exceed savings: Each sub-agent requires its own system prompt, tool definitions, and coordination messages. For tasks with fewer than 3 independent subtasks, the coordination overhead often exceeds the context savings. Estimate total tokens (coordinator + all sub-agents) before committing to partitioning.
Cache miss cost spikes after deployment changes: Reordering tools, rewording the system prompt, or changing few-shot examples between deployments invalidates the entire prefix cache, causing a temporary cost spike of 2-5x until the new cache warms up. Roll out prompt changes gradually and monitor cache hit rate during deployment windows.
This skill builds on context-fundamentals and context-degradation. It connects to:
Internal reference:
Related skills in this collection:
External resources:
Created: 2025-12-20 · Last Updated: 2026-03-17 · Author: Agent Skills for Context Engineering Contributors · Version: 2.0.0
Weekly Installs
53
Repository
GitHub Stars
518
First Seen
Jan 26, 2026
Security Audits
Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Pass
Installed on
opencode: 49
codex: 48
github-copilot: 47
cursor: 46
gemini-cli: 46
amp: 45
Compaction creates false confidence in stale summaries: Once context is compacted, the summary looks authoritative but may reflect outdated state. If the task has evolved since compaction (new user requirements, corrected assumptions), the summary silently carries forward stale information. After compaction, re-validate the summary against the current task goal before proceeding.