Vector Search Designer by jmsktm/claude-settings
npx skills add https://github.com/jmsktm/claude-settings --skill 'Vector Search Designer'
The Vector Search Designer skill helps you architect and implement vector similarity search systems that power semantic search, recommendation engines, and AI applications. It guides you through selecting the right vector database, designing index structures, optimizing query performance, and scaling to millions or billions of vectors.
Vector search has become foundational to modern AI systems, from RAG pipelines to product recommendations. This skill covers the full stack: understanding approximate nearest neighbor (ANN) algorithms, choosing between database options, tuning recall vs latency tradeoffs, and implementing production-ready search infrastructure.
Whether you are building on Pinecone, Weaviate, Qdrant, pgvector, or implementing your own solution, this skill ensures your vector search system meets your performance and accuracy requirements.
1. Choose ANN algorithm
2. Configure index parameters:
# HNSW example configuration
hnsw_config = {
    "M": 16,                # Connections per node (higher = better recall, more memory)
    "efConstruction": 200,  # Build-time search depth
    "efSearch": 100,        # Query-time search depth
}

# IVF example configuration
ivf_config = {
    "nlist": 1024,  # Number of clusters
    "nprobe": 32,   # Clusters to search at query time
}
3. Plan sharding strategy for scale
4. Design metadata schema for filtering
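A quick way to validate the parameter choices in steps 1–2 is to measure recall against exact brute-force search on a held-out query set. A minimal sketch, where `ann_search` is a stand-in for whatever index is being tuned (all names here are illustrative):

```python
import numpy as np

def recall_at_k(ann_search, queries, corpus, k=10):
    """Fraction of exact top-k neighbors that the ANN index also returns."""
    hits, total = 0, 0
    for q in queries:
        # Brute-force ground truth: indices of the k closest corpus vectors.
        exact = np.argsort(np.linalg.norm(corpus - q, axis=1))[:k]
        approx = ann_search(q, k)
        hits += len(set(exact) & set(approx))
        total += k
    return hits / total
```

Sweeping `efSearch` (or `nprobe`) while plotting recall@k against query latency makes the recall/speed tradeoff concrete for your data.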
| Action | Command/Trigger |
|---|---|
| Choose database | "Which vector database for [use case]" |
| Design index | "Design vector index for [scale]" |
| Optimize search | "Speed up vector search" |
| Add filtering | "Add metadata filters to vector search" |
| Scale vectors | "Scale to [N] million vectors" |
| Benchmark search | "Benchmark vector search performance" |
- Right-Size Your Database: Don't over-engineer for scale you don't need
- Understand the Recall vs Speed Tradeoff: ANN is approximate by design
- Use Hybrid Search When Needed: Combine vector and keyword search
- Design Metadata for Filtering: Plan your filter strategy upfront
- Batch Operations When Possible: Reduce network overhead
- Monitor and Alert: Production search needs observability
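One common way to combine vector and keyword results, as the hybrid-search practice above suggests, is reciprocal rank fusion (RRF). A minimal sketch, assuming the two inputs are ranked lists of document IDs (the parameter `k=60` is a conventional default, not something prescribed by this skill):

```python
def rrf_fuse(vector_results, keyword_results, k=60, top_n=10):
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranked in (vector_results, keyword_results):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents appearing high in both lists accumulate the largest scores.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

RRF needs no score normalization, which is why it is popular for fusing cosine similarities with BM25 scores that live on different scales.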
Handle documents with multiple representations:
class MultiVectorIndex:
    def __init__(self):
        self.title_index = VectorIndex(dim=768)
        self.content_index = VectorIndex(dim=768)
        self.summary_index = VectorIndex(dim=768)

    def search(self, query_embedding, weights=None):
        weights = weights or {"title": 0.3, "content": 0.5, "summary": 0.2}
        results = {}
        for field, weight in weights.items():
            index = getattr(self, f"{field}_index")
            field_results = index.search(query_embedding, k=20)
            for doc_id, score in field_results:
                results[doc_id] = results.get(doc_id, 0) + score * weight
        return sorted(results.items(), key=lambda x: x[1], reverse=True)[:10]
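The `VectorIndex` class above is assumed rather than defined. For local experimentation, a brute-force stand-in that returns `(doc_id, cosine_score)` pairs could look like this; it is a sketch, not a real library API:

```python
import numpy as np

class VectorIndex:
    def __init__(self, dim):
        self.dim = dim
        self.ids, self.vectors = [], []

    def add(self, doc_id, vector):
        v = np.asarray(vector, dtype=np.float32)
        # Store unit vectors so dot product equals cosine similarity.
        self.ids.append(doc_id)
        self.vectors.append(v / np.linalg.norm(v))

    def search(self, query, k=10):
        q = np.asarray(query, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scores = np.stack(self.vectors) @ q
        order = np.argsort(-scores)[:k]
        return [(self.ids[i], float(scores[i])) for i in order]
```

Note that the weighted sum in `MultiVectorIndex.search` implicitly assumes all three indexes return scores on the same scale; cosine similarity (as here) satisfies that, while raw inner products may not.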
Optimize search with filters:
def filtered_search(query_embedding, filters, k=10):
    # Strategy 1: Pre-filter (for selective filters)
    if estimate_selectivity(filters) < 0.1:
        candidate_ids = apply_filters(filters)
        return vector_search_subset(query_embedding, candidate_ids, k)
    # Strategy 2: Post-filter (for non-selective filters)
    elif estimate_selectivity(filters) > 0.5:
        results = vector_search(query_embedding, k * 3)
        filtered = [r for r in results if matches_filters(r, filters)]
        return filtered[:k]
    # Strategy 3: Hybrid (general case)
    else:
        return vector_search_with_filters(query_embedding, filters, k)
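The `estimate_selectivity` helper above is left undefined. One rough sketch estimates the matching fraction from per-field value counts collected at index time; the extra arguments and the independence assumption are illustrative, not part of the original API:

```python
def estimate_selectivity(filters, value_counts, total_docs):
    """Estimate the fraction of documents matching all equality filters.

    value_counts: {field: {value: doc_count}} gathered at index time.
    Assumes filters on different fields are statistically independent,
    so per-field fractions are multiplied together.
    """
    selectivity = 1.0
    for field, value in filters.items():
        count = value_counts.get(field, {}).get(value, 0)
        selectivity *= count / total_docs
    return selectivity
```

Real systems often refine this with sampled joint statistics, since correlated fields (e.g. language and region) make the independence assumption overestimate selectivity.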
Reduce memory with acceptable accuracy loss:
# Product Quantization example configuration
pq_config = {
    "m": 16,     # Number of sub-quantizers
    "nbits": 8,  # Bits per sub-quantizer
    # 768-dim * 4 bytes = 3 KB/vector -> 16 codes * 1 byte = 16 bytes/vector
}

# Binary quantization (extreme compression)
binary_config = {
    "threshold": 0,  # Values > 0 -> 1, else -> 0
    # 768-dim * 4 bytes = 3 KB/vector -> 768 bits = 96 bytes/vector
}
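The binary-quantization arithmetic can be made concrete with NumPy's bit packing. The function names below are hypothetical; the thresholding rule matches the config comment above:

```python
import numpy as np

def binarize(vectors, threshold=0.0):
    """Pack float vectors into bits: value > threshold -> 1, else 0.

    A 768-dim float32 vector (3 KB) becomes 768 bits = 96 bytes.
    """
    bits = np.asarray(vectors) > threshold
    return np.packbits(bits, axis=-1)

def hamming_distance(a, b):
    """Distance between two packed codes: count of differing bits."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())
```

Hamming distance on packed codes is typically used as a cheap first-pass filter, with the surviving candidates re-ranked using full-precision vectors.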
Handle dynamic data efficiently:
class DynamicVectorIndex:
    def __init__(self, rebuild_threshold=10000):
        self.main_index = build_optimized_index()
        self.delta_index = []  # Recent additions
        self.rebuild_threshold = rebuild_threshold

    def add(self, vector, metadata):
        self.delta_index.append((vector, metadata))
        if len(self.delta_index) >= self.rebuild_threshold:
            self.rebuild()

    def search(self, query, k):
        main_results = self.main_index.search(query, k)
        delta_results = brute_force_search(self.delta_index, query, k)
        return merge_results(main_results, delta_results, k)

    def rebuild(self):
        all_data = self.main_index.get_all() + self.delta_index
        self.main_index = build_optimized_index(all_data)
        self.delta_index = []
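The `merge_results` helper used above is not defined in the text. A simple sketch, assuming both inputs are `(doc_id, score)` pairs with higher scores being better:

```python
def merge_results(main_results, delta_results, k):
    """Merge two result lists, deduplicating by id and keeping the best score."""
    best = {}
    for doc_id, score in list(main_results) + list(delta_results):
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    return sorted(best.items(), key=lambda x: x[1], reverse=True)[:k]
```

Deduplication matters here because a document updated after the last rebuild can appear in both the main index and the delta buffer.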