chunking-strategy by giuseppe-trisciuoglio/developer-kit
npx skills add https://github.com/giuseppe-trisciuoglio/developer-kit --skill chunking-strategy

Implement optimal chunking strategies for Retrieval-Augmented Generation (RAG) systems and document processing pipelines. This skill provides a comprehensive framework for breaking large documents into smaller, semantically meaningful segments that preserve context while enabling efficient retrieval and search.
Use this skill when building RAG systems, optimizing vector search performance, implementing document processing pipelines, handling multi-modal content, or performance-tuning existing RAG systems with poor retrieval quality.
Select the appropriate chunking strategy based on document type and use case:
Fixed-Size Chunking (Level 1)
Recursive Character Chunking (Level 2)
Structure-Aware Chunking (Level 3)
Semantic Chunking (Level 4)
Advanced Methods (Level 5)
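As an illustration of Level 1, a fixed-size chunker can be sketched in a few lines. The function name and defaults below are illustrative assumptions, not part of the skill itself:

```python
def fixed_size_chunks(text, chunk_size=256, overlap=25):
    """Split text into fixed-size character windows with overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Each window repeats the last `overlap` characters of its predecessor, so a sentence cut at a boundary survives intact in at least one chunk. The higher levels trade this simplicity for boundaries that respect structure or meaning.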
Reference detailed strategy implementations in references/strategies.md.
Follow these steps to implement effective chunking:
Pre-process documents
Select strategy parameters
Process and validate
Evaluate and iterate
Reference detailed implementation guidelines in references/implementation.md.
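The steps above can be sketched as a single pipeline. Every function name, parameter, and threshold here is an illustrative assumption rather than the skill's actual API:

```python
import re

def preprocess(doc):
    # Step 1: normalize whitespace so chunk sizes are comparable
    return re.sub(r"\s+", " ", doc).strip()

def make_chunks(doc, chunk_size=256, overlap=25):
    # Step 2: apply the chosen strategy parameters and cut the document
    step = chunk_size - overlap
    return [doc[i:i + chunk_size] for i in range(0, len(doc), step)]

def validate(chunks, min_len=20):
    # Step 3: discard fragments too short to carry meaning on their own
    return [c for c in chunks if len(c) >= min_len]

def run_pipeline(doc):
    # Step 4 (evaluate and iterate) would wrap this call in a feedback loop
    return validate(make_chunks(preprocess(doc)))

result = run_pipeline("Hello world. " * 50)
```

In practice the fixed-size splitter in `make_chunks` would be swapped for whichever strategy level the document type calls for; the surrounding preprocess/validate/evaluate scaffolding stays the same.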
Use these metrics to evaluate chunking effectiveness:
Reference detailed evaluation framework in references/evaluation.md.
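Before wiring up full retrieval metrics, the chunk-size distribution is a cheap first check: chunks that are too small lose context, and chunks that are too large dilute relevance. The helper below is an illustrative sketch, not part of the skill's evaluation framework:

```python
def chunk_stats(chunks):
    """Summarize the size distribution of a chunk set."""
    lengths = [len(c) for c in chunks]
    return {
        "count": len(lengths),
        "mean_len": sum(lengths) / len(lengths),
        "min_len": min(lengths),
        "max_len": max(lengths),
    }

stats = chunk_stats(["a" * 200, "a" * 300, "a" * 100])
```

A wide spread between `min_len` and `max_len` is often the first signal that strategy parameters need another iteration.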
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configure for factoid queries: small chunks with light overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=25,
    length_function=len,
)
chunks = splitter.split_documents(documents)
def chunk_python_code(code):
    """Split Python code into semantic chunks, one per function or class."""
    import ast
    tree = ast.parse(code)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(code, node))
    return chunks
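A quick usage sketch (with the definition repeated so the snippet runs standalone, and a hypothetical sample source) shows one property worth knowing: `ast.walk` also visits nodes nested inside classes, so a class body and each of its methods appear as separate chunks:

```python
import ast

def chunk_python_code(code):
    # Repeated from the example above so this snippet is self-contained
    tree = ast.parse(code)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(code, node))
    return chunks

sample = (
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "class Greeter:\n"
    "    def hello(self):\n"
    "        return 'hi'\n"
)
pieces = chunk_python_code(sample)
# Yields add, the whole Greeter class, and Greeter.hello again on its own
```

Whether that duplication is desirable depends on the retrieval use case; restricting the walk to top-level nodes via `ast.iter_child_nodes(tree)` would emit each source span exactly once.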
def semantic_chunk(text, similarity_threshold=0.8):
    """Chunk text at semantic boundaries between adjacent sentences."""
    sentences = split_into_sentences(text)
    if not sentences:
        return []
    embeddings = generate_embeddings(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])
        if similarity < similarity_threshold:
            # A drop in similarity marks a topic shift: close the chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(" ".join(current_chunk))
    return chunks
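The snippet above assumes three helpers that are not shown. A minimal stand-in, using naive regex sentence splitting and bag-of-words term counts in place of a real embedding model (all three names come from the snippet; these bodies are illustrative only), could be:

```python
import math
import re
from collections import Counter

def split_into_sentences(text):
    # Naive splitter: break after terminal punctuation followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def generate_embeddings(sentences):
    # Bag-of-words term counts as a stand-in for real embedding vectors
    return [Counter(s.lower().split()) for s in sentences]

def cosine_similarity(a, b):
    # Cosine similarity over sparse term-count vectors
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

In production these would be a proper sentence tokenizer and a trained embedding model; the bag-of-words stand-in only makes the control flow runnable and testable.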
Reference detailed documentation in the references/ folder.