npx skills add https://github.com/crinkj/common-claude-setting --skill context-optimization
Context optimization extends the effective capacity of limited context windows through strategic compression, masking, caching, and partitioning. The goal is not to magically increase context windows but to make better use of available capacity. Effective optimization can double or triple effective context capacity without requiring larger models or longer contexts.
Activate this skill when:
Context optimization extends effective capacity through four primary strategies: compaction (summarizing context near limits), observation masking (replacing verbose outputs with references), KV-cache optimization (reusing cached computations), and context partitioning (splitting work across isolated contexts).
The key insight is that context quality matters more than quantity. Optimization preserves signal while reducing noise. The art lies in selecting what to keep versus what to discard, and when to apply each technique.
What is Compaction
Compaction is the practice of summarizing context contents when approaching limits, then reinitializing a new context window with the summary. This distills the contents of a context window in a high-fidelity manner, enabling the agent to continue with minimal performance degradation.
Compaction typically serves as the first lever in context optimization. The art lies in selecting what to keep versus what to discard.
Compaction Implementation
Compaction works by identifying sections that can be compressed, generating summaries that capture essential points, and replacing full content with the summaries. Compression priority runs: tool outputs first (replace with summaries), then old turns (summarize early conversation), then retrieved docs (summarize if newer versions exist); the system prompt is never compressed.
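The priority order above can be sketched as a small lookup. This is a minimal illustration, not part of the skill itself: the message `kind` labels and the ordering function are assumptions.

```python
# Sketch of the compression priority order; "kind" labels are assumed.
COMPACTION_PRIORITY = {
    "tool_output": 0,    # compress first: replace with summaries
    "old_turn": 1,       # then summarize early conversation
    "retrieved_doc": 2,  # then summarize superseded documents
}

def compaction_order(messages):
    """Return compressible messages, most-compressible first.

    Kinds absent from the table (e.g. system_prompt) are never
    compressed, so they are excluded entirely.
    """
    candidates = [m for m in messages if m["kind"] in COMPACTION_PRIORITY]
    return sorted(candidates, key=lambda m: COMPACTION_PRIORITY[m["kind"]])
```

Keeping the system prompt out of the table, rather than giving it a low priority, makes "never compress" impossible to violate by accident.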
Summary Generation
Effective summaries preserve different elements depending on message type:
Tool outputs: Preserve key findings, metrics, and conclusions. Remove verbose raw output.
Conversational turns: Preserve key decisions, commitments, and context shifts. Remove filler and back-and-forth.
Retrieved documents: Preserve key facts and claims. Remove supporting evidence and elaboration.
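The per-type preservation rules can be expressed as a simple dispatch table. The field names and message shape below are assumptions made for the sketch:

```python
# Illustrative preservation rules keyed by message type (field names assumed).
PRESERVE = {
    "tool_output": {"key_findings", "metrics", "conclusions"},
    "conversation": {"decisions", "commitments", "context_shifts"},
    "document": {"key_facts", "claims"},
}

def summarize_message(message):
    """Drop every field except those worth preserving for this type."""
    keep = PRESERVE.get(message["type"], set())
    return {k: v for k, v in message.items() if k in keep}
```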
The Observation Problem
Tool outputs can comprise 80%+ of token usage in agent trajectories. Much of this is verbose output that has already served its purpose. Once an agent has used a tool output to make a decision, keeping the full output provides diminishing value while consuming significant context.
Observation masking replaces verbose tool outputs with compact references. The information remains accessible if needed but does not consume context continuously.
Masking Strategy Selection
Not all observations should be masked equally:
Never mask: Observations critical to current task, observations from the most recent turn, observations used in active reasoning.
Consider masking: Observations from 3+ turns ago, verbose outputs with key points extractable, observations whose purpose has been served.
Always mask: Repeated outputs, boilerplate headers/footers, outputs already summarized in conversation.
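The three tiers above can be sketched as a policy function. The observation flags (`repeated`, `critical`, `turn`, and so on) are assumed fields for illustration, not a real schema:

```python
# Sketch of the three-tier masking policy; observation fields are assumed.
def masking_tier(obs, current_turn):
    if obs.get("repeated") or obs.get("boilerplate") or obs.get("already_summarized"):
        return "always"
    if obs.get("critical") or obs.get("in_active_reasoning") or obs["turn"] == current_turn:
        return "never"
    if current_turn - obs["turn"] >= 3 or obs.get("purpose_served"):
        return "consider"
    return "never"  # default: keep recent, still-relevant observations
```

Checking the "always" tier first means boilerplate gets masked even if it appeared in the current turn; swap the first two branches if critical observations should win that tie instead.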
Understanding KV-Cache
The KV-cache stores Key and Value tensors computed during inference, growing linearly with sequence length. Reusing the KV-cache across requests that share an identical prefix avoids recomputation.
Prefix caching reuses KV blocks across requests with identical prefixes using hash-based block matching. This dramatically reduces cost and latency for requests with common prefixes like system prompts.
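Hash-based block matching can be sketched as follows. This is a toy model of the idea, not any engine's real implementation; the block size and hashing scheme are assumptions. The key property is that each block's hash is chained over the whole prefix, so a cached block matches only when everything before it matches too:

```python
import hashlib

BLOCK = 256  # tokens per KV block; the size is illustrative

def block_hashes(token_ids):
    """Chained hash per full block: each hash covers the entire prefix
    up to and including that block."""
    hashes, h = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK
    for i in range(0, full, BLOCK):
        h.update(repr(token_ids[i:i + BLOCK]).encode())
        hashes.append(h.hexdigest())
    return hashes

def reusable_blocks(cached, incoming):
    """Count leading KV blocks an incoming request can reuse from cache."""
    n = 0
    for x, y in zip(block_hashes(cached), block_hashes(incoming)):
        if x != y:
            break
        n += 1
    return n
```

A long shared system prompt at the start of both requests translates directly into reusable leading blocks, which is why stable prefixes matter.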
Cache Optimization Patterns
Optimize for caching by reordering context elements to maximize cache hits. Place stable elements first (system prompt, tool definitions), then frequently reused elements, then unique elements last.
Design prompts to maximize cache stability: avoid dynamic content like timestamps, use consistent formatting, keep structure stable across sessions.
Sub-Agent Partitioning
The most aggressive form of context optimization is partitioning work across sub-agents with isolated contexts. Each sub-agent operates in a clean context focused on its subtask without carrying accumulated context from other subtasks.
This approach achieves separation of concerns—the detailed search context remains isolated within sub-agents while the coordinator focuses on synthesis and analysis.
Result Aggregation
Aggregate results from partitioned subtasks by validating all partitions completed, merging compatible results, and summarizing if still too large.
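The partition-then-aggregate flow can be sketched as one coordinator function. `run_subagent`, `merge`, and `summarize` are caller-supplied callables in this illustration, not a real API, and character length stands in for a token count:

```python
# Sketch of partition-then-aggregate; the callables are assumptions.
def run_partitioned(subtasks, run_subagent, merge, summarize, max_chars=4000):
    # Each subtask runs in a clean, isolated context inside run_subagent;
    # only its compact result returns to the coordinator.
    results = [run_subagent(task) for task in subtasks]
    # Validate that all partitions completed before merging.
    if any(r is None for r in results):
        raise RuntimeError("a partition failed to complete")
    merged = merge(results)
    # Summarize if the merged result is still too large for the coordinator.
    return summarize(merged) if len(merged) > max_chars else merged
```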
Context Budget Allocation
Design explicit context budgets. Allocate tokens to categories: system prompt, tool definitions, retrieved docs, message history, and reserved buffer. Monitor usage against budget and trigger optimization when approaching limits.
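A budget can be as simple as a dictionary checked on every turn. The allocation numbers below are illustrative, assuming a 128k-token window:

```python
# Example explicit budget for a 128k-token window (numbers illustrative).
BUDGET = {
    "system_prompt": 4_000,
    "tool_definitions": 8_000,
    "retrieved_docs": 40_000,
    "message_history": 60_000,
    "reserved_buffer": 16_000,
}

def over_budget(usage, budget=BUDGET):
    """Return the categories whose measured usage exceeds their allocation."""
    return sorted(k for k, v in usage.items() if v > budget.get(k, 0))
```

Category-level checks tell you which optimization to reach for: message history over budget suggests compaction, retrieved docs over budget suggests masking or re-retrieval.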
Trigger-Based Optimization
Monitor signals for optimization triggers: token utilization above 80%, context-degradation indicators, and measured performance drops. When a trigger fires, apply the technique matched to the dominant context component: compaction for long message histories, masking for verbose tool outputs, partitioning for separable subtasks, and cache-friendly reordering for stable prefixes.
Compaction should achieve 50-70% token reduction with less than 5% quality degradation. Masking should achieve 60-80% reduction in masked observations. Cache optimization should achieve 70%+ hit rate for stable workloads.
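The compaction target above can be turned into a simple acceptance check. The thresholds come from the text; the function name and inputs are illustrative:

```python
def compaction_on_target(tokens_before, tokens_after, quality_drop):
    """Check a compaction run against the stated targets:
    50-70% token reduction with under 5% quality degradation."""
    reduction = 1 - tokens_after / tokens_before
    return 0.5 <= reduction <= 0.7 and quality_drop < 0.05
```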
Monitor and iterate on optimization strategies based on measured effectiveness.
Example 1: Compaction Trigger

```python
# Trigger compaction at 80% utilization (compact_context is illustrative).
if context_tokens / context_limit > 0.8:
    context = compact_context(context)
```
Example 2: Observation Masking

```python
# Replace a verbose observation with a compact reference
# (store_observation and extract_key are illustrative).
if len(observation) > max_length:
    ref_id = store_observation(observation)
    return f"[Obs:{ref_id} elided. Key: {extract_key(observation)}]"
```
Example 3: Cache-Friendly Ordering

```python
# Stable content first
context = [system_prompt, tool_definitions]  # cacheable across requests
context += [reused_templates]                # reused across some requests
context += [unique_content]                  # unique per request
```
This skill builds on context-fundamentals and context-degradation, and connects to the related skills in this collection.
Created: 2025-12-20 | Last Updated: 2025-12-20 | Author: Agent Skills for Context Engineering Contributors | Version: 1.0.0