⚠️

重要前提

安装AI Skills的关键前提是：必须科学上网，且开启TUN模式，这一点至关重要，直接决定安装能否顺利完成，在此郑重提醒三遍：科学上网，科学上网，科学上网。查看完整安装教程 →

内容相似度检查器 - 使用余弦、杰卡德、莱文斯坦算法比较文档和文本相似度

content-similarity-checker by dkyazzentwatwa/chatgpt-skills

62 周安装量

38 GitHub Stars

GitHub

安装命令

npx skills add https://github.com/dkyazzentwatwa/chatgpt-skills --skill content-similarity-checker

文本分析数据处理自然语言处理

🇨🇳中文介绍

内容相似度检查器

使用多种算法比较文档和文本的相似度。

功能特性

余弦相似度：基于 TF-IDF 的比较
杰卡德相似度：基于集合的比较
莱文斯坦距离：用于短文本的编辑距离
批量比较：比较多个文档
相似度矩阵：所有文档的成对比较
报告：详细的相似度报告

快速开始

from similarity_checker import SimilarityChecker

checker = SimilarityChecker()

# 比较两段文本
score = checker.compare(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps over a sleepy dog"
)
print(f"Similarity: {score:.2%}")

# 比较文档
score = checker.compare_files("doc1.txt", "doc2.txt")

命令行使用

# 比较两段文本
python similarity_checker.py --text1 "Hello world" --text2 "Hello there world"

# 比较两个文件
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt

# 比较文件夹中的所有文件
python similarity_checker.py --folder ./documents/ --output matrix.csv

# 使用特定算法
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --method jaccard

# 查找相似文档（阈值）
python similarity_checker.py --folder ./documents/ --threshold 0.7

# JSON 输出
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --json

广告位招租

在这里展示您的产品或服务

触达数万 AI 开发者，精准高效

联系我们

SimilarityChecker 类

class SimilarityChecker:
    def __init__(self, method: str = "cosine")

    # 文本比较
    def compare(self, text1: str, text2: str) -> float
    def compare_files(self, file1: str, file2: str) -> float

    # 多种算法
    def compare_all_methods(self, text1: str, text2: str) -> dict

    # 批量比较
    def compare_to_corpus(self, text: str, corpus: list) -> list
    def similarity_matrix(self, documents: list) -> pd.DataFrame
    def find_duplicates(self, documents: list, threshold: float = 0.8) -> list

    # 文件夹操作
    def compare_folder(self, folder: str, threshold: float = None) -> dict
    def find_most_similar(self, text: str, folder: str, top_n: int = 5) -> list

    # 报告
    def generate_report(self, output: str) -> str

余弦相似度（默认）

最适合比较不同长度的文档：

checker = SimilarityChecker(method="cosine")
score = checker.compare(text1, text2)
# 返回值：0.0 到 1.0

适合比较单词/标记的集合：

checker = SimilarityChecker(method="jaccard")
score = checker.compare(text1, text2)
# 返回值：0.0 到 1.0

莱文斯坦距离（编辑距离）

最适合短文本、拼写错误检测：

checker = SimilarityChecker(method="levenshtein")
score = checker.compare(text1, text2)
# 返回值：0.0 到 1.0（归一化）

TF-IDF + 余弦相似度

高级：考虑术语重要性：

checker = SimilarityChecker(method="tfidf")
score = checker.compare(text1, text2)

checker = SimilarityChecker()

target = "Machine learning is a subset of artificial intelligence."
corpus = [
    "AI includes machine learning and deep learning.",
    "Python is a programming language.",
    "Neural networks power deep learning systems."
]

results = checker.compare_to_corpus(target, corpus)

# 返回：
[
    {"index": 0, "similarity": 0.65, "text": "AI includes..."},
    {"index": 2, "similarity": 0.42, "text": "Neural networks..."},
    {"index": 1, "similarity": 0.12, "text": "Python is..."}
]

documents = [
    "Document one content...",
    "Document two content...",
    "Document three content..."
]

matrix = checker.similarity_matrix(documents)

# 返回 DataFrame：
#          doc_0    doc_1    doc_2
# doc_0    1.000    0.750    0.320
# doc_1    0.750    1.000    0.410
# doc_2    0.320    0.410    1.000

documents = [...]  # 文本列表

duplicates = checker.find_duplicates(documents, threshold=0.85)

# 返回：
[
    {"doc1_index": 0, "doc2_index": 3, "similarity": 0.92},
    {"doc1_index": 2, "doc2_index": 7, "similarity": 0.88}
]

获取所有算法的相似度分数：

checker = SimilarityChecker()
results = checker.compare_all_methods(text1, text2)

# 返回：
{
    "cosine": 0.82,
    "jaccard": 0.65,
    "levenshtein": 0.71,
    "tfidf": 0.78,
    "average": 0.74
}

比较文件夹中的所有文件

checker = SimilarityChecker()
results = checker.compare_folder("./documents/")

# 返回：
{
    "files": ["doc1.txt", "doc2.txt", "doc3.txt"],
    "comparisons": 3,
    "similar_pairs": [
        {"file1": "doc1.txt", "file2": "doc3.txt", "similarity": 0.87}
    ],
    "matrix": <DataFrame>
}

查找与查询最相似的文档

query = "Your search text here..."
results = checker.find_most_similar(query, "./documents/", top_n=5)

# 返回：
[
    {"file": "doc3.txt", "similarity": 0.89},
    {"file": "doc1.txt", "similarity": 0.72},
    ...
]

result = checker.compare_with_details(text1, text2)

# 返回：
{
    "similarity": 0.82,
    "method": "cosine",
    "text1_length": 150,
    "text2_length": 180,
    "common_words": 25,
    "unique_words_text1": 10,
    "unique_words_text2": 15,
    "interpretation": "High similarity - likely related content"
}

checker = SimilarityChecker()

submission = open("student_paper.txt").read()
results = checker.compare_folder("./source_materials/")

suspicious = [p for p in results["similar_pairs"] if p["similarity"] > 0.6]

if suspicious:
    print(f"Warning: Found {len(suspicious)} potentially similar sources")
    for p in suspicious:
        print(f"  {p['file1']} matches {p['file2']}: {p['similarity']:.0%}")

checker = SimilarityChecker()

# 加载所有文档
docs = {}
for file in Path("./articles/").glob("*.txt"):
    docs[file.name] = file.read_text()

# 查找近似重复项
duplicates = checker.find_duplicates(list(docs.values()), threshold=0.9)

print(f"Found {len(duplicates)} duplicate pairs")

checker = SimilarityChecker()

query = "Best practices for Python web development"
results = checker.find_most_similar(query, "./blog_posts/", top_n=10)

print("Most relevant articles:")
for r in results:
    print(f"  {r['file']}: {r['similarity']:.0%} match")

scikit-learn>=1.3.0
nltk>=3.8.0
numpy>=1.24.0
pandas>=2.0.0

🇺🇸English

Content Similarity Checker

Compare documents and text for similarity using multiple algorithms.

Features

Cosine Similarity : TF-IDF based comparison
Jaccard Similarity : Set-based comparison
Levenshtein Distance : Edit distance for short texts
Batch Comparison : Compare multiple documents
Similarity Matrix : Pairwise comparison of all documents
Reports : Detailed similarity reports

Quick Start

from similarity_checker import SimilarityChecker

checker = SimilarityChecker()

# Compare two texts
score = checker.compare(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps over a sleepy dog"
)
print(f"Similarity: {score:.2%}")

# Compare documents
score = checker.compare_files("doc1.txt", "doc2.txt")

CLI Usage

# Compare two texts
python similarity_checker.py --text1 "Hello world" --text2 "Hello there world"

# Compare two files
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt

# Compare all files in folder
python similarity_checker.py --folder ./documents/ --output matrix.csv

# Use specific algorithm
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --method jaccard

# Find similar documents (threshold)
python similarity_checker.py --folder ./documents/ --threshold 0.7

# JSON output
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --json

API Reference

SimilarityChecker Class

class SimilarityChecker:
    def __init__(self, method: str = "cosine")

    # Text comparison
    def compare(self, text1: str, text2: str) -> float
    def compare_files(self, file1: str, file2: str) -> float

    # Multiple algorithms
    def compare_all_methods(self, text1: str, text2: str) -> dict

    # Batch comparison
    def compare_to_corpus(self, text: str, corpus: list) -> list
    def similarity_matrix(self, documents: list) -> pd.DataFrame
    def find_duplicates(self, documents: list, threshold: float = 0.8) -> list

    # Folder operations
    def compare_folder(self, folder: str, threshold: float = None) -> dict
    def find_most_similar(self, text: str, folder: str, top_n: int = 5) -> list

    # Report
    def generate_report(self, output: str) -> str

Similarity Methods

Cosine Similarity (Default)

Best for comparing documents of different lengths:

checker = SimilarityChecker(method="cosine")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0

Jaccard Similarity

Good for comparing sets of words/tokens:

checker = SimilarityChecker(method="jaccard")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0

Levenshtein (Edit Distance)

Best for short texts, typo detection:

checker = SimilarityChecker(method="levenshtein")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0 (normalized)

TF-IDF + Cosine

Advanced: considers term importance:

checker = SimilarityChecker(method="tfidf")
score = checker.compare(text1, text2)

Batch Comparison

Compare to Corpus

checker = SimilarityChecker()

target = "Machine learning is a subset of artificial intelligence."
corpus = [
    "AI includes machine learning and deep learning.",
    "Python is a programming language.",
    "Neural networks power deep learning systems."
]

results = checker.compare_to_corpus(target, corpus)

# Returns:
[
    {"index": 0, "similarity": 0.65, "text": "AI includes..."},
    {"index": 2, "similarity": 0.42, "text": "Neural networks..."},
    {"index": 1, "similarity": 0.12, "text": "Python is..."}
]

Similarity Matrix

documents = [
    "Document one content...",
    "Document two content...",
    "Document three content..."
]

matrix = checker.similarity_matrix(documents)

# Returns DataFrame:
#          doc_0    doc_1    doc_2
# doc_0    1.000    0.750    0.320
# doc_1    0.750    1.000    0.410
# doc_2    0.320    0.410    1.000

Find Duplicates

documents = [...]  # List of texts

duplicates = checker.find_duplicates(documents, threshold=0.85)

# Returns:
[
    {"doc1_index": 0, "doc2_index": 3, "similarity": 0.92},
    {"doc1_index": 2, "doc2_index": 7, "similarity": 0.88}
]

Compare All Methods

Get similarity scores from all algorithms:

checker = SimilarityChecker()
results = checker.compare_all_methods(text1, text2)

# Returns:
{
    "cosine": 0.82,
    "jaccard": 0.65,
    "levenshtein": 0.71,
    "tfidf": 0.78,
    "average": 0.74
}

Folder Operations

Compare All Files in Folder

checker = SimilarityChecker()
results = checker.compare_folder("./documents/")

# Returns:
{
    "files": ["doc1.txt", "doc2.txt", "doc3.txt"],
    "comparisons": 3,
    "similar_pairs": [
        {"file1": "doc1.txt", "file2": "doc3.txt", "similarity": 0.87}
    ],
    "matrix": <DataFrame>
}

Find Most Similar to Query

query = "Your search text here..."
results = checker.find_most_similar(query, "./documents/", top_n=5)

# Returns:
[
    {"file": "doc3.txt", "similarity": 0.89},
    {"file": "doc1.txt", "similarity": 0.72},
    ...
]

Output Format

Comparison Result

result = checker.compare_with_details(text1, text2)

# Returns:
{
    "similarity": 0.82,
    "method": "cosine",
    "text1_length": 150,
    "text2_length": 180,
    "common_words": 25,
    "unique_words_text1": 10,
    "unique_words_text2": 15,
    "interpretation": "High similarity - likely related content"
}

Example Workflows

Plagiarism Check

checker = SimilarityChecker()

submission = open("student_paper.txt").read()
results = checker.compare_folder("./source_materials/")

suspicious = [p for p in results["similar_pairs"] if p["similarity"] > 0.6]

if suspicious:
    print(f"Warning: Found {len(suspicious)} potentially similar sources")
    for p in suspicious:
        print(f"  {p['file1']} matches {p['file2']}: {p['similarity']:.0%}")

Document Deduplication

checker = SimilarityChecker()

# Load all documents
docs = {}
for file in Path("./articles/").glob("*.txt"):
    docs[file.name] = file.read_text()

# Find near-duplicates
duplicates = checker.find_duplicates(list(docs.values()), threshold=0.9)

print(f"Found {len(duplicates)} duplicate pairs")

Content Matching

checker = SimilarityChecker()

query = "Best practices for Python web development"
results = checker.find_most_similar(query, "./blog_posts/", top_n=10)

print("Most relevant articles:")
for r in results:
    print(f"  {r['file']}: {r['similarity']:.0%} match")

Dependencies

scikit-learn>=1.3.0
nltk>=3.8.0
numpy>=1.24.0
pandas>=2.0.0

Weekly Installs

Repository

dkyazzentwatwa/…t-skills

GitHub Stars

First Seen

Jan 24, 2026

Security Audits

Gen Agent Trust HubPass SocketPass SnykPass

Installed on

gemini-cli50

opencode50

codex47

cursor47

github-copilot45

amp42

奥派经济聊天室：AI模拟哈耶克与米塞斯对话，探讨奥地利学派经济学

1,600 周安装

内容相似度检查器 - 使用余弦、杰卡德、莱文斯坦算法比较文档和文本相似度

🇨🇳中文介绍

内容相似度检查器

功能特性

快速开始

命令行使用

相关 Skills

API 参考

SimilarityChecker 类

相似度方法

余弦相似度（默认）

杰卡德相似度