building-paper-screening-rubrics

npx skills add https://smithery.ai/skills/kthorn/building-paper-screening-rubrics
Core principle: Build screening rubrics collaboratively through brainstorming → test → refine → automate → review → iterate.
Good rubrics come from understanding edge cases upfront and testing on real papers before bulk screening.
Use this skill when:
When NOT to use:
Ask domain-agnostic questions to understand what makes papers relevant:
Core Concepts:
Data Types & Artifacts:
Paper Types:
Relationships & Context:
Edge Cases:
Document responses in screening-criteria.json
Based on brainstorming, propose scoring logic:
Scoring (0-10):
Keywords Match (0-3 pts):
- Core term 1: +1 pt
- Core term 2 OR synonym: +1 pt
- Related term: +1 pt
Data Type Match (0-4 pts):
- Measurement type (IC50, Ki, EC50, etc.): +2 pts
- Dataset/code available: +1 pt
- Methods described: +1 pt
Specificity (0-3 pts):
- Primary research: +3 pts
- Methods paper: +2 pts
- Review: +1 pt
Special Rules:
- If mentions exclusion term: score = 0
Threshold: ≥7 = relevant, 5-6 = possibly relevant, <5 = not relevant
Present to user and ask: "Does this logic match your expectations?"
Save initial rubric to screening-criteria.json:
{
"version": "1.0.0",
"created": "2025-10-11T15:30:00Z",
"keywords": {
"core_terms": ["term1", "term2"],
"synonyms": {"term1": ["alt1", "alt2"]},
"related_terms": ["related1", "related2"],
"exclusion_terms": ["exclude1", "exclude2"]
},
"data_types": {
"measurements": ["IC50", "Ki", "MIC"],
"datasets": ["GEO:", "SRA:", "PDB:"],
"methods": ["protocol", "synthesis", "assay"]
},
"scoring": {
"keywords_max": 3,
"data_type_max": 4,
"specificity_max": 3,
"relevance_threshold": 7
},
"special_rules": [
{
"name": "scaffold_analogs",
"condition": "mentions target scaffold AND (analog OR derivative)",
"action": "add 3 points"
}
]
}
Do a quick PubMed search to get candidate papers:
# Search for 20 papers using initial keywords
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=YOUR_QUERY&retmax=20&retmode=json"
Fetch abstracts for first 10-15 papers:
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=PMID1,PMID2,...&retmode=xml&rettype=abstract"
Present abstracts to user one at a time:
Paper 1/10:
Title: [Title]
PMID: [12345678]
DOI: [10.1234/example]
Abstract:
[Full abstract text]
Is this paper RELEVANT to your research question? (y/n/maybe)
Record user judgments in test-set.json:
{
"test_papers": [
{
"pmid": "12345678",
"doi": "10.1234/example",
"title": "Paper title",
"abstract": "Full abstract text...",
"user_judgment": "relevant",
"timestamp": "2025-10-11T15:45:00Z"
}
]
}
Continue until you have 5-10 papers with clear judgments
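A small helper for appending each judgment as it is made (a sketch; the field names follow the test-set.json format above, and the path is illustrative):

import json
from datetime import datetime, timezone

def record_judgment(pmid, doi, title, abstract, judgment,
                    path="test-set.json"):
    """Append one user judgment ("relevant" / "not_relevant" / "maybe")."""
    try:
        with open(path) as f:
            data = json.load(f)
    except FileNotFoundError:
        data = {"test_papers": []}
    data["test_papers"].append({
        "pmid": pmid,
        "doi": doi,
        "title": title,
        "abstract": abstract,
        "user_judgment": judgment,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    with open(path, "w") as f:
        json.dump(data, f, indent=2)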
Apply rubric to each test paper:
for paper in test_papers:
score = calculate_score(paper['abstract'], rubric)
predicted_status = "relevant" if score >= 7 else "not_relevant"
paper['predicted_score'] = score
paper['predicted_status'] = predicted_status
Calculate accuracy:
correct = sum(1 for p in test_papers
if p['predicted_status'] == p['user_judgment'])
accuracy = correct / len(test_papers)
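To surface the misclassifications called out in the report below, the same pass can also collect false negatives and false positives (a sketch using the fields set above):

false_negatives = [p for p in test_papers
                   if p["predicted_status"] == "not_relevant"
                   and p["user_judgment"] == "relevant"]
false_positives = [p for p in test_papers
                   if p["predicted_status"] == "relevant"
                   and p["user_judgment"] == "not_relevant"]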
Present classification report:
RUBRIC TEST RESULTS (5 papers):
✓ PMID 12345678: Score 9 → relevant (user: relevant) ✓
✗ PMID 23456789: Score 4 → not_relevant (user: relevant) ← FALSE NEGATIVE
✓ PMID 34567890: Score 8 → relevant (user: relevant) ✓
✓ PMID 45678901: Score 3 → not_relevant (user: not_relevant) ✓
✗ PMID 56789012: Score 7 → relevant (user: not_relevant) ← FALSE POSITIVE
Accuracy: 60% (3/5 correct)
Target: ≥80%
--- FALSE NEGATIVE: PMID 23456789 ---
Title: "Novel analogs of compound X with improved potency"
Score breakdown:
- Keywords: 1 pt (matched "compound X")
- Data type: 2 pts (mentioned IC50 values)
- Specificity: 1 pt (primary research)
- Total: 4 pts → not_relevant
Why missed: Paper discusses "analogs" but didn't trigger scaffold_analogs rule
Abstract excerpt: "We synthesized 12 analogs of compound X..."
--- FALSE POSITIVE: PMID 56789012 ---
Title: "Review of kinase inhibitors"
Score breakdown:
- Keywords: 2 pts
- Data type: 3 pts
- Specificity: 2 pts (review, not primary)
- Total: 7 pts → relevant
Why wrong: Review paper, user wants primary research only
Ask user for adjustments:
Current accuracy: 60% (below 80% threshold)
Suggestions to improve rubric:
1. Strengthen scaffold_analogs rule - should "synthesized N analogs" always trigger?
2. Lower points for review papers (currently 2 pts, maybe 0 pts?)
3. Add more synonym terms for core concepts?
What would you like to adjust?
Update screening-criteria.json based on feedback
Example update:
{
"special_rules": [
{
"name": "scaffold_analogs",
"condition": "mentions target scaffold AND (analog OR derivative OR synthesized)",
"action": "add 3 points"
}
],
"paper_types": {
"primary_research": 3,
"methods": 2,
"review": 0 // Changed from 1
}
}
Re-score test papers with updated rubric
Show new results:
UPDATED RUBRIC TEST RESULTS (5 papers):
✓ PMID 12345678: Score 9 → relevant (user: relevant) ✓
✓ PMID 23456789: Score 7 → relevant (user: relevant) ✓ (FIXED!)
✓ PMID 34567890: Score 8 → relevant (user: relevant) ✓
✓ PMID 45678901: Score 3 → not_relevant (user: not_relevant) ✓
✓ PMID 56789012: Score 5 → not_relevant (user: not_relevant) ✓ (FIXED!)
Accuracy: 100% (5/5 correct) ✓
Target: ≥80% ✓
Rubric is ready for bulk screening!
If accuracy ≥80%: proceed to bulk screening. If <80%: continue iterating.
Once rubric validated on test set:
1. Fetch abstracts for all candidate papers
2. Cache them in abstracts-cache.json:
{
"10.1234/example": {
"pmid": "12345678",
"title": "Paper title",
"abstract": "Full abstract text...",
"fetched": "2025-10-11T16:00:00Z"
}
}
3. Score all papers, save to papers-reviewed.json:
{
"10.1234/example": {
"pmid": "12345678",
"status": "relevant",
"score": 9,
"source": "pubmed_search",
"timestamp": "2025-10-11T16:00:00Z",
"rubric_version": "1.0.0"
}
}
4. Generate summary report:
Screened 127 papers using validated rubric:
- Highly relevant (≥8): 12 papers
- Relevant (7): 18 papers
- Possibly relevant (5-6): 23 papers
- Not relevant (<5): 74 papers
All abstracts cached for re-screening.
Results saved to papers-reviewed.json.
Review offline and provide feedback if any misclassifications found.
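The summary above can be regenerated from papers-reviewed.json (a sketch; the score bands match the thresholds used throughout):

import json
from collections import Counter

with open("papers-reviewed.json") as f:
    reviewed = json.load(f)

def band(score):
    if score >= 8: return "Highly relevant (≥8)"
    if score == 7: return "Relevant (7)"
    if score >= 5: return "Possibly relevant (5-6)"
    return "Not relevant (<5)"

counts = Counter(band(p["score"]) for p in reviewed.values())
for label in ["Highly relevant (≥8)", "Relevant (7)",
              "Possibly relevant (5-6)", "Not relevant (<5)"]:
    print(f"- {label}: {counts.get(label, 0)} papers")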
User reviews papers offline, identifies issues:
User: "I reviewed the results. Three papers were misclassified:
- PMID 23456789 scored 4 but is actually relevant (discusses scaffold analogs)
- PMID 34567890 scored 8 but not relevant (wrong target)
- PMID 45678901 scored 6 but is highly relevant (has key dataset)
Can we update the rubric?"
Update rubric based on feedback:
Re-screening workflow:
# Load all abstracts from abstracts-cache.json
# Apply updated rubric to each
# Generate change report
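A minimal sketch of that loop, assuming the session-file layout and the calculate_score helper shown below:

import json

with open("screening-criteria.json") as f:
    rubric = json.load(f)
with open("abstracts-cache.json") as f:
    cache = json.load(f)
with open("papers-reviewed.json") as f:
    reviewed = json.load(f)

threshold = rubric["scoring"]["relevance_threshold"]
for doi, paper in cache.items():
    new_score = calculate_score(paper["abstract"], rubric)
    new_status = "relevant" if new_score >= threshold else "not_relevant"
    entry = reviewed[doi]
    if new_status != entry["status"]:
        # Report every paper whose classification flipped
        print(f"PMID {entry['pmid']}: score {entry['score']}→{new_score}, "
              f"{entry['status']} → {new_status}")
    entry.update(score=new_score, status=new_status,
                 rubric_version=rubric["version"])

with open("papers-reviewed.json", "w") as f:
    json.dump(reviewed, f, indent=2)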
RUBRIC UPDATE: v1.0.0 → v1.1.0
Changes:
- Added "derivative" to scaffold_analogs rule
- Increased dataset bonus from +1 to +2 pts
Re-screening 127 cached papers...
Status changes:
not_relevant → relevant: 3 papers
- PMID 23456789 (score 4→7)
- PMID 45678901 (score 6→8)
relevant → not_relevant: 1 paper
- PMID 34567890 (score 8→6)
Updated papers-reviewed.json with new scores.
New summary:
- Highly relevant: 13 papers (+1)
- Relevant: 19 papers (+1)
research-sessions/YYYY-MM-DD-topic/
├── screening-criteria.json # Rubric definition (weights, rules, version)
├── test-set.json # Ground truth papers used for validation
├── abstracts-cache.json # Full abstracts for all screened papers
├── papers-reviewed.json # Simple tracking: DOI, score, status
└── rubric-changelog.md # History of rubric changes and why
Before evaluating-paper-relevance:
When creating helper scripts:
During answering-research-questions:
def calculate_score(abstract, rubric, paper_type="primary_research"):
    # Special rule: any exclusion term zeroes the score
    if any(t.lower() in abstract.lower()
           for t in rubric["keywords"]["exclusion_terms"]):
        return 0
    score = 0
    score += count_keyword_matches(abstract, rubric["keywords"])      # 0-3 pts
    score += count_data_type_matches(abstract, rubric["data_types"])  # 0-4 pts
    score += specificity_score(paper_type)                            # 0-3 pts
    # Apply special rules (domain-rule format below, with a "bonus" field)
    for rule in rubric["special_rules"]:
        if matches_special_rule(abstract, rule):
            score += rule["bonus"]
    return score
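The helper functions are left abstract in the skill; a minimal interpretation, assuming case-insensitive matching and the keywords/bonus rule format used in the domain examples below:

import re

def count_keyword_matches(abstract, keywords):
    """0-3 pts: +1 per core term (or one of its synonyms), +1 for any related term."""
    text = abstract.lower()
    pts = sum(1 for term in keywords["core_terms"]
              if any(v.lower() in text
                     for v in [term] + keywords["synonyms"].get(term, [])))
    if any(t.lower() in text for t in keywords["related_terms"]):
        pts += 1
    return min(pts, 3)

def count_data_type_matches(abstract, data_types):
    """0-4 pts: +2 for a measurement type, +1 for a dataset accession, +1 for methods terms."""
    text = abstract.lower()
    pts = 2 if any(m.lower() in text for m in data_types["measurements"]) else 0
    pts += 1 if any(d.lower() in text for d in data_types["datasets"]) else 0
    pts += 1 if any(m.lower() in text for m in data_types["methods"]) else 0
    return pts

def specificity_score(paper_type):
    """0-3 pts by paper type, mirroring the initial rubric's specificity weights."""
    return {"primary_research": 3, "methods": 2, "review": 1}.get(paper_type, 0)

def matches_special_rule(abstract, rule):
    """Domain-rule format: each entry in rule['keywords'] is a pipe-separated
    OR-group, and the rule fires only when every group matches (AND of ORs)."""
    return all(re.search(pat, abstract, re.IGNORECASE) for pat in rule["keywords"])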
Medicinal chemistry:
{
"special_rules": [
{
"name": "scaffold_analogs",
"keywords": ["target_scaffold", "analog|derivative|series"],
"bonus": 3
},
{
"name": "sar_data",
"keywords": ["IC50|Ki|MIC", "structure-activity|SAR"],
"bonus": 2
}
]
}
Genomics:
{
"special_rules": [
{
"name": "public_data",
"keywords": ["GEO:|SRA:|ENA:", "accession"],
"bonus": 3
},
{
"name": "differential_expression",
"keywords": ["DEG|differentially expressed", "RNA-seq|microarray"],
"bonus": 2
}
]
}
Computational methods:
{
"special_rules": [
{
"name": "code_available",
"keywords": ["github|gitlab|bitbucket", "code available|software"],
"bonus": 3
},
{
"name": "benchmark",
"keywords": ["benchmark|comparison", "performance|accuracy"],
"bonus": 2
}
]
}
Common pitfalls:
- Skipping test-driven validation: bulk screening without testing the rubric → many misclassifications, wasted time
- Not caching abstracts: re-fetching from PubMed when re-screening → slow, hits rate limits
- No ground-truth testing: can't measure rubric accuracy → don't know if it's working
- Too few test papers: testing on only 2-3 papers → the rubric overfits and doesn't generalize
- Too complex a rubric: Boolean logic with 10+ rules → hard to debug, update, explain
- Not documenting changes: updating the rubric without tracking why → can't reproduce or learn from mistakes
- Setting the threshold too high: requiring 95% accuracy → iterating forever for diminishing returns
- Not saving the rubric with the session: rubric lost when the session ends → screening can't be reproduced
Rubric is ready when:
| Task | File | Contents |
|---|---|---|
| Define rubric | screening-criteria.json | Keywords, weights, special rules |
| Test rubric | test-set.json | Ground truth papers with user judgments |
| Cache abstracts | abstracts-cache.json | Full abstract text for re-screening |
| Track results | papers-reviewed.json | DOI, score, status, rubric version |
| Document changes | rubric-changelog.md | History of updates and rationale |
Agent: "I'll help you build a screening rubric. Let's start with some questions about what makes papers relevant..."
[Brainstorming phase - ask 5 question categories]
User: "I'm looking for papers on tuberculosis drug resistance with MIC data for bedaquiline and clofazimine. Papers about analogs are also relevant."
Agent: "Got it. Here's the initial rubric:
- Keywords: tuberculosis, drug resistance, bedaquiline, clofazimine (0-3 pts)
- Data: MIC values, IC50, in vitro activity (0-4 pts)
- Specificity: Primary research (0-3 pts)
- Special rule: If mentions target drugs + 'analog/derivative' → +3 bonus pts
- Threshold: ≥7 = relevant
Does this look right?"
User: "Yes, let's test it."
Agent: "I'll search PubMed for 20 papers and show you abstracts to classify..."
[Shows 10 abstracts, user classifies them]
Agent: "Testing rubric on your classifications...
Accuracy: 70% (7/10 correct)
- 2 false negatives (missed relevant papers)
- 1 false positive (marked review as relevant)
False negative example: Paper discusses 'derivatives of bedaquiline' but scored only 5 pts because it didn't explicitly say 'analog'.
Suggestion: Add 'derivative' and 'compound' to special rule trigger terms?"
User: "Yes, and lower review paper points."
Agent: "Updated rubric. Re-testing...
Accuracy: 90% (9/10 correct) ✓
Ready to run bulk screening on 127 papers?"
User: "Yes, go ahead."
Agent: [Screens 127 papers, saves results]
"Done! Found 18 relevant papers. All abstracts cached for re-screening.
Review offline and let me know if any misclassifications."
[Later...]
User: "Found 2 misclassifications. Can we update the rubric?"
Agent: "Sure! What did I miss?"
User: "PMID 12345 discusses pretomanid combination therapy with bedaquiline - should be relevant."
Agent: "Got it. Adding 'combination therapy' as related term with +2 bonus pts.
Re-screening all 127 cached papers...
Status changes: 3 papers now relevant (including PMID 12345).
Updated papers-reviewed.json."
After building rubric: