ChromaDB semantic search tool: knowledge-base search and summarisation for Markdown/PDF/DOCX/XLSX files

repo-search by dandcg/claude-skills

2 weekly installs

Install command

npx skills add https://github.com/dandcg/claude-skills --skill repo-search

Repo Search & Summarisation

Semantic search across a directory of documents using ChromaDB vector embeddings. Supports Markdown, PDF, DOCX, and XLSX files. Retrieves relevant chunks without loading entire files into context. Designed for use with a "second brain" or personal knowledge base, but works with any collection of documents.

Prerequisites

A Python virtual environment has been set up (run setup.sh if it has not)
An index has been built (run ingest.py if no .vectordb/ directory exists)

First-Time Setup

# Set up Python environment (one-time)
~/.claude/skills/repo-search/setup.sh

# Build the index (run from brain repo root)
~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/ingest.py /path/to/your/markdown-repo --verbose

Rebuild Index (after adding/changing files)

# Incremental update (only changed files)
~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/ingest.py /path/to/your/markdown-repo

# Full rebuild
~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/ingest.py /path/to/your/markdown-repo --force --verbose
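
The incremental update reprocesses only changed files. How ingest.py detects a change is not stated here; one common approach is to compare a content hash per file against a manifest saved at the previous run. The sketch below illustrates that idea only. The names `file_digest` and `changed_files` and the manifest structure are hypothetical, not part of the skill.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files under root whose digest differs from the stored manifest.

    manifest maps a path relative to root to the digest recorded at the
    previous ingest (hypothetical structure). New files count as changed.
    """
    changed = []
    for path in sorted(root.rglob("*")):
        rel_path = path.relative_to(root)
        # Skip directories and the index directory itself.
        if not path.is_file() or ".vectordb" in rel_path.parts:
            continue
        rel = rel_path.as_posix()
        if manifest.get(rel) != file_digest(path):
            changed.append(rel)
    return changed
```

A full rebuild with --force corresponds to ignoring such a manifest and processing every file again.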

Search Operations

Semantic Search (default)

Find content semantically related to a query:

# Basic search (returns top 10 chunks)
~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/query.py --db-path /path/to/your/markdown-repo/.vectordb search "query text here"

# More results
~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/query.py --db-path /path/to/your/markdown-repo/.vectordb search "query text here" -k 20

# Filter by area
~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/query.py --db-path /path/to/your/markdown-repo/.vectordb search "query text" --area finance

# JSON output (for programmatic use)
~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/query.py --db-path /path/to/your/markdown-repo/.vectordb -f json search "query text" -k 5
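
For programmatic use, the JSON command above can be wrapped in a small helper. This is a minimal sketch that only reuses the flags shown in this document (`--db-path`, `-f json`, `search`, `-k`, `--area`, `--mode`). The function names are hypothetical, and the shape of the JSON output is not documented here, so the result is returned as parsed without assuming any fields.

```python
import json
import subprocess
from pathlib import Path

# Install location used by the commands in this document.
SKILL_DIR = Path.home() / ".claude" / "skills" / "repo-search"


def build_search_command(repo, query, k=5, area=None, mode=None):
    """Assemble the argv list for a JSON search, mirroring the commands above."""
    cmd = [
        str(SKILL_DIR / ".venv" / "bin" / "python"),
        str(SKILL_DIR / "query.py"),
        "--db-path", str(Path(repo) / ".vectordb"),
        "-f", "json",
        "search", query,
        "-k", str(k),
    ]
    if area is not None:
        cmd += ["--area", area]
    if mode is not None:
        cmd += ["--mode", mode]
    return cmd


def run_search(repo, query, **options):
    """Run the search and parse stdout as JSON (schema not assumed)."""
    result = subprocess.run(
        build_search_command(repo, query, **options),
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems when the query contains spaces or quotes.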

Hybrid Search (vector + keyword)

Combines semantic similarity with BM25 keyword matching for better precision, especially with exact terms, names, or acronyms:

# Hybrid search (recommended for most queries)
~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/query.py --db-path /path/to/your/markdown-repo/.vectordb search "query text" --mode hybrid

# Keyword-only search (BM25)
~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/query.py --db-path /path/to/your/markdown-repo/.vectordb search "exact phrase" --mode keyword

Search modes: semantic (default), hybrid (vector + BM25 via Reciprocal Rank Fusion), keyword (BM25 only).
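
Reciprocal Rank Fusion merges several ranked lists by summing 1 / (k + rank) for each item across the lists. The sketch below shows the general technique, not the skill's own code. The constant k = 60 is a commonly used default and an assumption here; the value the skill uses is not stated.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))


vector_hits = ["a", "b", "c"]   # hypothetical chunk IDs from vector search
keyword_hits = ["c", "a", "d"]  # hypothetical chunk IDs from BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Because only ranks are used, the vector distances and BM25 scores never need to be put on a common scale.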

Browse by Area

Retrieve all chunks for an area (useful for summarisation):

~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/query.py --db-path /path/to/your/markdown-repo/.vectordb area finance
~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/query.py --db-path /path/to/your/markdown-repo/.vectordb area health -k 100

Browse by File

Get all chunks for a specific file:

~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/query.py --db-path /path/to/your/markdown-repo/.vectordb file "areas/finance/index.md"

Date Range Query

Retrieve chunks within a date range (for timelines):

~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/query.py --db-path /path/to/your/markdown-repo/.vectordb date-range 2025-01-01 2025-12-31

Database Info

# Statistics
~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/query.py --db-path /path/to/your/markdown-repo/.vectordb stats

# List all indexed files
~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/query.py --db-path /path/to/your/markdown-repo/.vectordb list

# Prune orphaned chunks (for files deleted from disk)
~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/query.py --db-path /path/to/your/markdown-repo/.vectordb prune /path/to/your/markdown-repo
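
Conceptually, pruning compares the file paths stored in the index with what is still on disk. The sketch below shows only that comparison, under the assumption that indexed paths are stored relative to the repository root (as in "areas/finance/index.md"). `find_orphans` is a hypothetical helper, not the skill's implementation.

```python
from pathlib import Path


def find_orphans(indexed_paths, root):
    """Return indexed relative paths that no longer exist under root."""
    root = Path(root)
    return sorted(p for p in set(indexed_paths) if not (root / p).is_file())
```

Chunks belonging to the returned paths are the ones a prune would delete from the index.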

Named Collections

Use --collection to manage separate indexes for different corpora:

# Ingest into a named collection
~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/ingest.py /path/to/work-docs --collection work

# Search a named collection
~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/query.py --db-path /path/to/work-docs/.vectordb --collection work search "query"

Default collection name is brain.

Summarisation Workflow

For large aggregation tasks (timelines, domain summaries, cross-cutting analysis):

1. Retrieve relevant chunks using search or area/date-range queries with JSON output.
2. Batch chunks into manageable groups (by file, date, or topic).
3. Summarise each batch using Claude.
4. Synthesise the batch summaries into the final output.

Example workflow for "summarise my financial position":

# Step 1: Get all finance chunks as JSON
~/.claude/skills/repo-search/.venv/bin/python ~/.claude/skills/repo-search/query.py --db-path /path/to/your/markdown-repo/.vectordb -f json area finance -k 100

# Step 2: Read the JSON output and synthesise with Claude
# (Claude does this step naturally after reading the chunks)
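
The batching step of the workflow can be sketched as grouping chunks by source file and capping each batch by size. The field names `file` and `text` and the `max_chars` budget are assumptions for illustration; the actual JSON fields emitted by query.py are not documented here and should be checked against real output.

```python
from collections import defaultdict


def batch_chunks(chunks, key="file", max_chars=8000):
    """Group chunk dicts by key, then split each group into size-capped batches."""
    groups = defaultdict(list)
    for chunk in chunks:
        groups[chunk.get(key, "unknown")].append(chunk)

    batches = []
    for name in sorted(groups):
        current, size = [], 0
        for chunk in groups[name]:
            length = len(chunk.get("text", ""))
            # Start a new batch when adding this chunk would exceed the budget.
            if current and size + length > max_chars:
                batches.append(current)
                current, size = [], 0
            current.append(chunk)
            size += length
        if current:
            batches.append(current)
    return batches
```

Each batch can then be summarised on its own, and the batch summaries combined in a final pass.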

Available Areas

The brain is organised into these areas:

areas → business, technical, health, relationships, finance, philosophy, mental, career, income
projects → Active initiatives
decisions → Decision logs
resources → Reference material
reviews → Daily/weekly/monthly reflections
outputs → Finished content
docs → Plans and design documents

Chunking & Embedding Details

Markdown: Heading-aware chunking (respects #, ##, ### boundaries). Each chunk is enriched with its heading chain (e.g. [Title > Section > Subsection]) and document title for better embedding context.
PDF: Page-aware chunking at 1000 chars default.
DOCX: Paragraph-aware chunking at 1500 chars default.
XLSX: Row-group chunking at 2000 chars default with sheet names preserved.
Embedding model: all-MiniLM-L6-v2 (ChromaDB default). Model name is stored in collection metadata.
BM25 index: Built automatically during ingestion for hybrid search support.
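
Heading-aware chunking with a heading chain can be sketched as follows. This is a simplified illustration of the idea described for Markdown, not the code in ingest.py: it splits on #, ##, and ### lines and prefixes each chunk with its chain, but it does not add the document title separately, enforce a size limit, or skip lines starting with # inside fenced code blocks. `chunk_markdown` is a hypothetical name.

```python
import re

# Matches #, ##, or ### followed by whitespace and the heading text.
HEADING = re.compile(r"^(#{1,3})\s+(.*)$")


def chunk_markdown(text):
    """Split Markdown at level 1-3 headings; prefix each chunk with its heading chain."""
    chunks = []
    chain = []  # current headings, outermost first, e.g. ["Title", "Section"]
    body = []

    def flush():
        content = "\n".join(body).strip()
        if content:
            prefix = "[" + " > ".join(chain) + "]\n" if chain else ""
            chunks.append(prefix + content)
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the chunk that belongs to the previous heading
            level = len(match.group(1))
            # Drop headings at this level or deeper, then add the new one.
            del chain[level - 1:]
            chain.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Title\nintro\n## Section\nbody A\n### Sub\nbody B\n## Other\nbody C\n"
chunks = chunk_markdown(sample)
```

Prefixing the chain gives a short chunk such as "body B" enough context to be embedded meaningfully on its own.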

Error Handling

"Database not found": Run the ingest script first
"No results": Try broader query terms, remove the area filter, increase -k, or try --mode hybrid
Stale results: Re-run ingest to pick up file changes (incremental, fast)
Orphaned chunks: Use the prune command to remove chunks for deleted files
Slow first query: ChromaDB loads the embedding model on first use (~10-20s); subsequent queries are fast
"Failed to extract": The file may be corrupted or password-protected; check stderr for details
