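code-clone-assistant by terrylica/cc-skills
Install: `npx skills add https://github.com/terrylica/cc-skills --skill code-clone-assistant`
Detect code clones and guide refactoring using PMD CPD (exact duplicates) and Semgrep (pattern matching).
Tested: October 2025, 30 violations detected across 3 sample files.
Coverage: about 3x more violations than using either tool alone.
Use this skill when:
PMD CPD and Semgrep detect different clone types: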
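| Aspect | PMD CPD | Semgrep |
|---|---|---|
| Detects | Exact copy-paste duplicates | Similar patterns with variations |
| Scope | Across files ✅ | Within a file; across files in Pro only |
| Matching | Token-based (ignores formatting) | Pattern-based (AST matching) |
| Rules | ❌ No custom rules | ✅ Custom rules |
Result: using both tools finds about 3x more DRY violations.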
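| Type | Description | PMD CPD | Semgrep |
|---|---|---|---|
| Type-1 | Exact copies | ✅ Default | ✅ |
| Type-2 | Renamed identifiers | ✅ --ignore-* | ✅ |
| Type-3 | Near-miss clones with variations | ⚠️ Partial | ✅ Pattern matching |
| Type-4 | Semantic clones (same behavior) | ❌ | ❌ |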
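To make the Type-1 and Type-2 rows concrete, here is a minimal sketch of token-based comparison using Python's standard `tokenize` module. It is not PMD CPD's tokenizer; the helper `normalized_tokens` and the sample snippets are hypothetical and only illustrate why reformatting does not hide a clone and why ignoring identifiers exposes renamed copies.

```python
import io
import keyword
import tokenize

# Token types that carry only layout or comments; a token-based matcher skips them.
_LAYOUT = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def normalized_tokens(source: str, ignore_identifiers: bool = False) -> list:
    """Return the token strings of `source` with layout and comments dropped."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _LAYOUT:
            continue
        is_identifier = tok.type == tokenize.NAME and not keyword.iskeyword(tok.string)
        if ignore_identifiers and is_identifier:
            tokens.append("ID")  # collapse every identifier to one placeholder
        else:
            tokens.append(tok.string)
    return tokens


# Hypothetical snippets: the same function, reformatted, then renamed.
original = "def total(items):\n    return sum(i.price for i in items)\n"
reformatted = (
    "def total( items ):  # same code, new layout\n"
    "    return sum(i.price\n"
    "               for i in items)\n"
)
renamed = "def cost(rows):\n    return sum(r.price for r in rows)\n"

type1_match = normalized_tokens(original) == normalized_tokens(reformatted)
type2_match = normalized_tokens(original, True) == normalized_tokens(renamed, True)
```

Here `type1_match` holds because only layout and a comment changed, while the renamed copy matches only once identifiers are collapsed, which is roughly the effect the `--ignore-*` options aim for.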
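```shell
# Step 1: Detect exact duplicates (PMD CPD)
pmd cpd -d . -l python --minimum-tokens 20 -f markdown > pmd-results.md

# Step 2: Detect pattern violations (Semgrep)
semgrep --config=clone-rules.yaml --sarif --quiet > semgrep-results.sarif

# Step 3: Analyze combined results (Claude Code)
# Parse both outputs, prioritize by severity

# Step 4: Refactor (Claude Code with user approval)
# Extract shared functions, consolidate patterns, verify tests
```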
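Step 3 can be sketched as follows for the Semgrep side. This assumes the SARIF 2.1.0 layout (`runs[].results[]` with `ruleId` and `level`) and that each result carries its own `level`; the function names and the rule IDs in the sample are hypothetical, and a missing level is treated as `warning` as an assumption of this sketch.

```python
import json
from collections import Counter

# SARIF result levels, most severe first.
_LEVEL_ORDER = {"error": 0, "warning": 1, "note": 2, "none": 3}


def summarize_sarif(sarif: dict) -> list:
    """Return (level, rule_id, count) rows sorted by severity, then by count."""
    counts = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            level = result.get("level", "warning")  # assumption: default to warning
            counts[(level, result.get("ruleId", "<unknown>"))] += 1
    return sorted(
        ((level, rule, n) for (level, rule), n in counts.items()),
        key=lambda row: (_LEVEL_ORDER.get(row[0], 99), -row[2], row[1]),
    )


def summarize_file(path: str) -> list:
    """Load a SARIF file such as semgrep-results.sarif and summarize it."""
    with open(path, encoding="utf-8") as fh:
        return summarize_sarif(json.load(fh))


# Hypothetical in-memory sample standing in for real Semgrep output.
sample = {
    "version": "2.1.0",
    "runs": [{"results": [
        {"ruleId": "dup-validation", "level": "warning"},
        {"ruleId": "dup-validation", "level": "warning"},
        {"ruleId": "dup-error-handler", "level": "error"},
    ]}],
}
summary = summarize_sarif(sample)
```

Sorting by level first keeps the highest-severity rules at the top of the list, which is one simple way to decide what to review before proposing any refactoring.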
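Not all code duplication is a problem. Some codebases deliberately use copy-and-adapt patterns where refactoring would be harmful. When running clone detection, always check for accepted exceptions before recommending refactoring.
| Pattern | Why acceptable | Example |
|---|---|---|
| Generation-per-directory experiments | Each generation is an immutable, self-contained experiment. Sharing code across generations would break provenance and make past experiments non-reproducible. | SQL templates and sweep scripts where each gen{NNN}/ directory is independent |
| SQL templates with placeholder substitution | SQL has no import/include mechanism. Templates use sed placeholder replacement (__PLACEHOLDER__), not function calls. Extracting shared CTEs into separate files would break the single-file execution model. | ClickHouse sweep templates that share signal-detection and metrics CTEs |
| Protocol/schema boilerplate | Serialization formats, API contracts, and wire protocols require exactly the same structure in each location. Abstracting them hides the contract. | NDJSON telemetry line construction in wrapper scripts |
| Test fixtures and golden files | Test data intentionally duplicates production patterns to verify behavior. Sharing fixtures creates brittle cross-test dependencies. | Test setup code, expected output snapshots |
When clone detection finds duplication that matches an accepted exception pattern:
Example output format: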
```
Code Clone Analysis Results

PMD CPD Findings:
  Clone 1: 115 lines (575 tokens) — base_bars → signals CTEs
    gen610_template.sql:33 ↔ gen710_template.sql:38
    Status: ACCEPTED EXCEPTION (generation-per-directory experiment)
    Reason: Each generation is immutable. Shared CTEs would break
            experiment provenance and reproducibility.
  Clone 2: 36 lines (478 tokens) — metrics aggregation
    gen610_template.sql:207 ↔ gen710_template.sql:244
    Status: ACCEPTED EXCEPTION (SQL template without include mechanism)

Actionable Findings: 0
Accepted Exceptions: 2
```
Projects can declare accepted exception patterns in their CLAUDE.md:
```markdown
## Code Clone Exceptions
- `sql/gen*_template.sql` — generation-per-directory experiments (immutable)
- `scripts/gen*/` — copy-and-adapt sweep scripts (no shared infrastructure)
- `tests/fixtures/` — intentional duplication for test isolation
```
When this section exists in a project's CLAUDE.md, the code-clone-assistant should check it before classifying findings.
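One way such a check could work is sketched below. It is not the skill's actual implementation: the helpers `load_exceptions` and `accepted_reason` are hypothetical, the entry format is assumed to be a backticked glob, a single separator token, and a reason, and a trailing slash is assumed to mean "everything under this directory".

```python
import re
from fnmatch import fnmatchcase
from typing import Optional

HEADING = "## Code Clone Exceptions"
# "- `glob` <separator> reason", where the separator is any single token.
_ENTRY = re.compile(r"^-\s+`([^`]+)`\s+\S+\s+(.*)$")


def load_exceptions(claude_md: str) -> list:
    """Return (glob, reason) pairs listed under the exceptions heading."""
    entries, in_section = [], False
    for line in claude_md.splitlines():
        if line.startswith("## "):
            in_section = line.strip() == HEADING  # any other H2 ends the section
            continue
        if in_section:
            match = _ENTRY.match(line.strip())
            if match:
                entries.append((match.group(1), match.group(2)))
    return entries


def accepted_reason(path: str, entries: list) -> Optional[str]:
    """Return the declared reason if `path` matches an exception, else None."""
    for pattern, reason in entries:
        # Assumption: a trailing slash marks a directory, so match all paths below it.
        glob = pattern + "*" if pattern.endswith("/") else pattern
        if fnmatchcase(path, glob):
            return reason
    return None


# Hypothetical CLAUDE.md content mirroring the declaration format above.
doc = (
    "# Project\n\n"
    "## Code Clone Exceptions\n"
    "- `sql/gen*_template.sql` - generation-per-directory experiments (immutable)\n"
    "- `tests/fixtures/` - intentional duplication for test isolation\n\n"
    "## Other\n"
    "- `src/*` - not an exception\n"
)
entries = load_exceptions(doc)
```

A finding whose files all return a reason from `accepted_reason` would be reported as an accepted exception, and anything returning `None` would stay actionable.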
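For detailed information, see:
| Issue | Cause | Solution |
|---|---|---|
| PMD CPD not found | Not installed or not in PATH | brew install pmd, or download from the PMD releases page |
| Semgrep timeout | Large codebase scan | Use --exclude to limit scope |
| No duplicates detected | minimum-tokens too high | Lower --minimum-tokens (try 15) |
| Too many false positives | minimum-tokens too low | Raise --minimum-tokens (try 30+) |
| Language not recognized | Wrong -l flag | Check the list of languages PMD CPD supports |
| SARIF parse error | Malformed Semgrep output | Upgrade Semgrep to the latest version |
| Memory error on large repo | Java heap too small | Set PMD_JAVA_OPTS=-Xmx4g |
| Missing clone rules file | Custom rules not created | Create clone-rules.yaml or use the default config |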
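The first row of the table can be caught before a scan starts. The following is a small preflight sketch, assuming the executables are named `pmd` and `semgrep` on PATH; `missing_tools` is a hypothetical helper, and the `which` parameter exists only so the lookup can be swapped out in tests.

```python
import shutil


def missing_tools(tools=("pmd", "semgrep"), which=shutil.which) -> list:
    """Return the names of required executables that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    absent = missing_tools()
    if absent:
        print("Not found on PATH: " + ", ".join(absent))
    else:
        print("pmd and semgrep are both available")
```

Running the check first turns a confusing mid-workflow failure into a clear message about which tool to install.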
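Weekly installs: 67
Repository: terrylica/cc-skills
GitHub stars: 26
First seen: Jan 24, 2026
Security audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: opencode (63), gemini-cli (62), codex (61), claude-code (60), cursor (60), github-copilot (59)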