npx skills add https://github.com/hamelsmu/evals-skills --skill evaluate-rag
Complete error analysis on RAG pipeline traces before selecting metrics. Inspect what was retrieved vs. what the model needed. Determine whether the problem is retrieval, generation, or both. Fix retrieval first.
Measure each component independently. Use the appropriate metric for each retrieval stage:
You need queries paired with ground-truth relevant document chunks.
Manual curation (highest quality): Write realistic questions and map each to the exact chunk(s) containing the answer.
Synthetic QA generation (scalable): For each document chunk, prompt an LLM to extract a fact and generate a question answerable only from that fact.
Synthetic QA prompt template:
Given a chunk of text, extract a specific, self-contained fact from it.
Then write a question that is directly and unambiguously answered
by that fact alone.
Return output in JSON format:
{ "fact": "...", "question": "..." }
Chunk: "{text_chunk}"
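A minimal sketch of driving this template, assuming a `call_llm` function (hypothetical; any prompt-in, text-out model client works) and that the model returns the JSON shape the prompt requests:

```python
import json

QA_PROMPT = """\
Given a chunk of text, extract a specific, self-contained fact from it.
Then write a question that is directly and unambiguously answered
by that fact alone.
Return output in JSON format:
{{ "fact": "...", "question": "..." }}
Chunk: "{text_chunk}"
"""

def generate_qa_pair(chunk: str, call_llm) -> dict:
    """Fill the template for one chunk and parse the model's JSON reply.
    `call_llm` is a stand-in for any prompt-in, text-out model client."""
    reply = call_llm(QA_PROMPT.format(text_chunk=chunk))
    pair = json.loads(reply)
    # Keep the source chunk alongside the pair: it becomes the ground-truth
    # relevant document for this query in the retrieval eval set.
    return {"fact": pair["fact"], "question": pair["question"], "source_chunk": chunk}
```

In practice you would also wrap `json.loads` in error handling and retry, since models occasionally emit malformed JSON.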
Adversarial question generation: Create harder queries that resemble content in multiple chunks but are answered by only one.
Example: Only chunk A contains the answer. Chunk B is a plausible distractor.
Filtering synthetic questions: Rate synthetic queries for realism using few-shot LLM scoring. Keep only those rated realistic (4-5 on a 1-5 scale). Likert scoring is appropriate here, since the goal is fuzzy ranking for dataset curation, not measuring failure rates.
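The filtering step reduces to a threshold over judge scores. In this sketch, `score_fn` stands in for a few-shot LLM judge (hypothetical) that returns an integer from 1 to 5:

```python
def filter_realistic(questions: list[dict], score_fn, threshold: int = 4) -> list[dict]:
    """Keep synthetic questions a judge scores at or above `threshold` on the 1-5 scale.
    `score_fn` is a stand-in for a few-shot LLM judge returning an int in 1..5."""
    return [q for q in questions if score_fn(q["question"]) >= threshold]
```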
Recall@k: Fraction of relevant documents found in the top k results.
Recall@k = (relevant docs in top k) / (total relevant docs for query)
Prioritize recall for first-pass retrieval. LLMs can ignore irrelevant content but cannot generate from missing content.
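A direct translation of the formula, assuming documents are identified by string IDs:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of this query's relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)
```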
Precision@k: Fraction of top k results that are relevant.
Precision@k = (relevant docs in top k) / k
Use for reranking evaluation.
Mean Reciprocal Rank (MRR): How early the first relevant document appears.
MRR = (1/N) * sum(1/rank_of_first_relevant_doc)
Best for single-fact lookups where only one key chunk is needed.
NDCG@k (Normalized Discounted Cumulative Gain): For graded relevance where documents have varying utility. Rewards placing more relevant items higher.
DCG@k = sum over i=1..k of: rel_i / log2(i+1)
IDCG@k = DCG@k with documents sorted by decreasing relevance
NDCG@k = DCG@k / IDCG@k
Caveat: Optimal ranking of weakly relevant documents can outscore a highly relevant document ranked lower. Supplement with Recall@k.
Choosing k: k varies by query type. A factual lookup uses k=1-2. A synthesis query ("summarize market trends") uses k=5-10.
| Query Type | Primary Metric |
|---|---|
| Single-fact lookups | MRR |
| Broad coverage needed | Recall@k |
| Ranked quality matters | NDCG@k or Precision@k |
| Multi-hop reasoning | Two-hop Recall@k |
Treat chunking as a tunable hyperparameter. Even with the same retriever, metrics vary based on chunking alone.
Grid search for fixed-size chunking: Test combinations of chunk size and overlap. Re-index the corpus for each configuration. Measure retrieval metrics on your evaluation dataset.
Example search grid:
| Chunk size | Overlap | Recall@5 | NDCG@5 |
|---|---|---|---|
| 128 tokens | 0 | 0.82 | 0.69 |
| 128 tokens | 64 | 0.88 | 0.75 |
| 256 tokens | 0 | 0.86 | 0.74 |
| 256 tokens | 128 | 0.89 | 0.77 |
| 512 tokens | 0 | 0.80 | 0.72 |
| 512 tokens | 256 | 0.83 | 0.74 |
Content-aware chunking: When fixed-size chunks split related information:
After confirming retrieval works, evaluate what the LLM does with the retrieved context along two dimensions:
Answer faithfulness: Does the output accurately reflect the retrieved context? Check for:
Answer relevance: Does the output address the original query? An answer can be faithful to the context but fail to answer what the user asked.
Use error analysis to discover specific manifestations in your pipeline. Identify what kind of information gets hallucinated and which constraints get omitted.
| Context Relevance | Faithfulness | Answer Relevance | Diagnosis |
|---|---|---|---|
| High | High | Low | Generator attended to wrong section of a correct document |
| High | Low | -- | Hallucination or misinterpretation of retrieved content |
| Low | -- | -- | Retrieval problem. Fix chunking, embeddings, or query preprocessing |
For queries requiring information from multiple chunks:
Two-hop Recall@k: Fraction of 2-hop queries where both ground-truth chunks appear in the top k results.
TwoHopRecall@k = (1/N) * sum(1 if {Chunk1, Chunk2} ⊆ top_k_results)
Diagnose failures by classifying: hop 1 miss, hop 2 miss, or rank-out-of-top-k.
Weekly Installs
138
Repository
GitHub Stars
955
First Seen
Mar 3, 2026
Security Audits
Gen Agent Trust HubPassSocketPassSnykFail
Installed on
codex136
gemini-cli135
kimi-cli135
cursor135
opencode135
amp135
AI 代码实施计划编写技能 | 自动化开发任务分解与 TDD 流程规划工具
47,700 周安装