grepai-chunking by yoanbernabeu/grepai-skills
npx skills add https://github.com/yoanbernabeu/grepai-skills --skill grepai-chunking
This skill covers how GrepAI splits code files into chunks for embedding, and how to optimize chunking for your codebase.
Chunking is the process of splitting source files into smaller segments for embedding:
┌─────────────────────────────────────┐
│          Large Source File          │
│           (1000+ tokens)            │
└─────────────────────────────────────┘
                  ↓
┌─────────┐  ┌─────────┐  ┌─────────┐
│ Chunk 1 │  │ Chunk 2 │  │ Chunk 3 │
│  ~512   │  │  ~512   │  │  ~512   │
│ tokens  │  │ tokens  │  │ tokens  │
└─────────┘  └─────────┘  └─────────┘
                  ↓
           Each chunk gets
          its own embedding
Chunking matters because embedding models have optimal input sizes.
# .grepai/config.yaml
chunking:
  size: 512      # Tokens per chunk
  overlap: 50    # Overlap between chunks
size: the target number of tokens per chunk.
| Size | Effect |
|---|---|
| 256 | More precise, less context |
| 512 | Balanced (default) |
| 1024 | More context, less precise |
overlap: the number of tokens shared between adjacent chunks. Overlap preserves context at chunk boundaries.
| Overlap | Effect |
|---|---|
| 0 | No overlap, may lose context at boundaries |
| 50 | Standard overlap (default) |
| 100 | More context, larger index |
With size=512 and overlap=50:
File: auth.go (1000 tokens)

Chunk 1: tokens 1-512
┌────────────────────────────────────┐
│ func Login(user, pass)...          │
└────────────────────────────────────┘
                  ↘
                    50 token overlap
                  ↙
Chunk 2: tokens 463-974
┌────────────────────────────────────┐
│ ...validate credentials...         │
└────────────────────────────────────┘
                  ↘
                    50 token overlap
                  ↙
Chunk 3: tokens 925-1000
┌──────────────┐
│ ...return    │
└──────────────┘
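
The ranges in the diagram follow from a plain sliding window: each chunk starts size minus overlap tokens after the previous one. The sketch below reproduces that arithmetic. It assumes fixed-size windows and 1-based token positions; `chunk_spans` is a hypothetical helper for illustration, not part of GrepAI.

```python
def chunk_spans(total_tokens, size=512, overlap=50):
    """Return 1-based inclusive (start, end) token ranges for a plain sliding window."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap  # distance between the starts of adjacent chunks
    spans = []
    start = 1
    while start <= total_tokens:
        end = min(start + size - 1, total_tokens)
        spans.append((start, end))
        if end == total_tokens:
            break  # the file is fully covered
        start += stride
    return spans


if __name__ == "__main__":
    for start, end in chunk_spans(1000):
        print(f"tokens {start}-{end}")
    # prints:
    # tokens 1-512
    # tokens 463-974
    # tokens 925-1000
```

The overlap must stay smaller than the chunk size, otherwise the window would never advance.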
Example settings for different goals (the comment on each size line says when it applies):

chunking:
  size: 768      # Larger to capture full methods
  overlap: 75

chunking:
  size: 512      # Standard size
  overlap: 50

chunking:
  size: 384      # Smaller for precise results
  overlap: 40

chunking:
  size: 384      # Capture individual functions
  overlap: 40

chunking:
  size: 768      # Capture more context
  overlap: 100

chunking:
  size: 512      # Balanced default
  overlap: 50
GrepAI uses approximate token counting. For example, the following function counts as roughly 45 tokens:

func calculateTotal(items []Item) float64 {
    total := 0.0
    for _, item := range items {
        total += item.Price * float64(item.Quantity)
    }
    return total
}

≈ 45 tokens
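
The source does not say how the approximation is computed. As a rough illustration only, the sketch below uses the common rule of thumb of about four characters per token. That ratio and the `estimate_tokens` helper are assumptions, not GrepAI's documented behavior.

```python
import math

CHARS_PER_TOKEN = 4  # assumption: a common rule of thumb, not GrepAI's documented ratio

SNIPPET = """func calculateTotal(items []Item) float64 {
    total := 0.0
    for _, item := range items {
        total += item.Price * float64(item.Quantity)
    }
    return total
}"""


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Estimate the token count of text from its character length."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


if __name__ == "__main__":
    print(estimate_tokens(SNIPPET))
```

With this heuristic the function above lands in the low 40s, the same ballpark as the ≈ 45 quoted in the example.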
Larger overlap = more chunks = larger index:
| Size | Overlap | Chunks per 10K tokens | Index Impact |
|---|---|---|---|
| 512 | 0 | ~20 | Smallest |
| 512 | 50 | ~22 | Standard |
| 512 | 100 | ~24 | +10% |
| 256 | 50 | ~44 | +100% |
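
The chunk counts in the table can be estimated from the stride (size minus overlap). The sketch below assumes a plain sliding window with no boundary-aware splitting, so it only approximates what an index will actually contain. `chunk_count` is a hypothetical helper.

```python
import math


def chunk_count(total_tokens, size, overlap):
    """Number of chunks a plain sliding window needs to cover total_tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if total_tokens <= size:
        return 1 if total_tokens > 0 else 0
    stride = size - overlap
    # One full first chunk, then one more chunk per stride of remaining tokens.
    return 1 + math.ceil((total_tokens - size) / stride)


if __name__ == "__main__":
    for size, overlap in [(512, 0), (512, 50), (512, 100), (256, 50)]:
        n = chunk_count(10_000, size, overlap)
        print(f"size={size} overlap={overlap}: {n} chunks per 10K tokens")
```

Under this model the four rows come to 20, 22, 25, and 49 chunks. That is close to the table for 512-token chunks and somewhat higher for 256-token chunks, so treat the table's figures as rough estimates.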
Chunks too small:

Query: "authentication middleware"
Result: "...c.AbortWithStatus(401)..."
(Fragment, missing context)

Chunks well sized:

Query: "authentication middleware"
Result: "func AuthMiddleware() gin.HandlerFunc {
    return func(c *gin.Context) {
        token := c.GetHeader("Authorization")
        if token == "" {
            c.AbortWithStatus(401)
            return
        }
        // validate token...
    }
}"
(Complete function with context)

Chunks too large:

Query: "authentication middleware"
Result: "// Multiple unrelated functions...
func AuthMiddleware()... (your match)
func LoggingMiddleware()...
func CORSMiddleware()..."
(Too much noise)
1. Change the chunking settings in .grepai/config.yaml:
chunking:
  size: 384
  overlap: 40
2. Re-index:
rm .grepai/index.gob
grepai watch
3. Test with searches:
grepai search "your query"
4. Adjust and repeat until satisfied.
Before changing settings, save a search result:
grepai search "authentication" > before.txt
After changing settings and re-indexing:
grepai search "authentication" > after.txt
diff before.txt after.txt
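
If a plain diff is hard to read, the two saved outputs can also be compared programmatically. The sketch below treats each file as a list of text lines, because the exact output format of grepai search is not described here. `diff_results` and `shared_line_ratio` are hypothetical helpers built on Python's standard difflib module.

```python
import difflib


def diff_results(before_text, after_text):
    """Return unified-diff lines between two saved search outputs."""
    return list(
        difflib.unified_diff(
            before_text.splitlines(),
            after_text.splitlines(),
            fromfile="before.txt",
            tofile="after.txt",
            lineterm="",
        )
    )


def shared_line_ratio(before_text, after_text):
    """Fraction of distinct non-empty lines that appear in both outputs."""
    before = {line.strip() for line in before_text.splitlines() if line.strip()}
    after = {line.strip() for line in after_text.splitlines() if line.strip()}
    if not before and not after:
        return 1.0  # two empty outputs are trivially identical
    return len(before & after) / len(before | after)
```

Read before.txt and after.txt into strings and pass them to both functions. A low ratio means the new settings changed most of the returned lines, which is worth reviewing by hand before keeping the change.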
GrepAI tries to split at logical boundaries, so actual chunk sizes may vary slightly from the target.
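
To see why boundary-aware splitting makes chunk sizes drift from the target, consider the minimal sketch below. It uses blank lines as a stand-in for logical boundaries and a four-characters-per-token estimate. Both are assumptions for illustration, since the source does not list which boundaries GrepAI recognizes, and overlap is left out for brevity.

```python
import math


def estimate_tokens(text):
    """Rough token estimate; the 4-characters-per-token ratio is an assumption."""
    return math.ceil(len(text) / 4)


def split_at_blank_lines(source, target_tokens=512):
    """Pack blank-line-separated blocks into chunks of roughly target_tokens."""
    blocks = [b for b in source.split("\n\n") if b.strip()]
    chunks, current = [], []
    for block in blocks:
        candidate = "\n\n".join(current + [block])
        if current and estimate_tokens(candidate) > target_tokens:
            # Adding this block would overshoot, so close the chunk at the boundary.
            chunks.append("\n\n".join(current))
            current = [block]
        else:
            current.append(block)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because a chunk is only closed between blocks, each chunk ends up somewhat under the target, and a single block larger than the target becomes one oversized chunk.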
❌ Problem: Search results are too fragmented
✅ Solution: Increase chunk size:
chunking:
  size: 768

❌ Problem: Search results have too much irrelevant context
✅ Solution: Decrease chunk size:
chunking:
  size: 384

❌ Problem: Results miss related code at function boundaries
✅ Solution: Increase overlap:
chunking:
  overlap: 100
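❌ Problem: Index is too large
✅ Solutions: Reduce overlap or increase chunk size. Either change lowers the number of chunks and therefore the index size.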
Chunking status:
✅ Chunking Configuration
Size: 512 tokens
Overlap: 50 tokens
Index Statistics:
- Total files: 245
- Total chunks: 1,234
- Avg chunks/file: 5.0
- Avg chunk size: 478 tokens
Recommendations:
- Current settings are balanced
- Consider size: 384 for more precise results
- Consider size: 768 for more context