Important prerequisite
Installing AI Skills requires unrestricted network access to GitHub. If you are behind a restrictive network, enable a proxy with TUN mode first; without it the installation is likely to fail. See the full installation guide.
sentence-transformers by orchestra-research/ai-research-skills

npx skills add https://github.com/orchestra-research/ai-research-skills --skill sentence-transformers
A Python framework for sentence and text embeddings using transformers.
Use when:
Metrics:
Use alternatives instead:
pip install sentence-transformers
from sentence_transformers import SentenceTransformer
# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Generate embeddings
sentences = [
"This is an example sentence",
"Each sentence is converted to a vector"
]
embeddings = model.encode(sentences)
print(embeddings.shape) # (2, 384)
# Cosine similarity
from sentence_transformers.util import cos_sim
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
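Under the hood, `cos_sim` computes the dot product of the two vectors divided by the product of their L2 norms. A minimal numpy sketch of the same computation (an illustration, not the library's implementation):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])
print(f"{cosine_similarity(a, b):.4f}")  # 0.5000
```

The result ranges from -1 (opposite) through 0 (orthogonal) to 1 (identical direction), which is why it works well for comparing embeddings regardless of their magnitude.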
# Fast, good quality (384 dim)
model = SentenceTransformer('all-MiniLM-L6-v2')
# Better quality (768 dim)
model = SentenceTransformer('all-mpnet-base-v2')
# Best quality (1024 dim, slower)
model = SentenceTransformer('all-roberta-large-v1')
# 50+ languages
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
# 100+ languages
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
# Legal domain
model = SentenceTransformer('nlpaueb/legal-bert-base-uncased')
# Scientific papers
model = SentenceTransformer('allenai/specter')
# Code
model = SentenceTransformer('microsoft/codebert-base')
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
# Corpus
corpus = [
"Python is a programming language",
"Machine learning uses algorithms",
"Neural networks are powerful"
]
# Encode corpus
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
# Query
query = "What is Python?"
query_embedding = model.encode(query, convert_to_tensor=True)
# Find most similar
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)
print(hits)
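`semantic_search` returns one result list per query; each entry is a dict with `corpus_id` and `score` keys, sorted by descending score. The core ranking step it performs can be sketched in plain numpy (a simplified sketch assuming L2-normalized embeddings, where dot product equals cosine similarity):

```python
import numpy as np

def top_k_search(query_vec: np.ndarray, corpus_vecs: np.ndarray, top_k: int = 3):
    """Rank corpus rows by similarity to the query; all vectors unit-length."""
    scores = corpus_vecs @ query_vec  # dot product == cosine for unit vectors
    order = np.argsort(-scores)[:top_k]  # indices of the top_k highest scores
    return [{"corpus_id": int(i), "score": float(scores[i])} for i in order]

# Toy unit vectors: corpus row 0 matches the query exactly
corpus = np.array([[1.0, 0.0], [0.0, 1.0]])
query = np.array([1.0, 0.0])
print(top_k_search(query, corpus, top_k=1))  # [{'corpus_id': 0, 'score': 1.0}]
```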
# Encode two sentences to compare (embedding1, embedding2 are single vectors)
embedding1, embedding2 = model.encode(["text one", "text two"], convert_to_tensor=True)
# Cosine similarity
similarity = util.cos_sim(embedding1, embedding2)
# Dot product
similarity = util.dot_score(embedding1, embedding2)
# Pairwise cosine similarity matrix for a batch (e.g. corpus_embeddings above)
similarities = util.cos_sim(embeddings, embeddings)
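For L2-normalized embeddings, `dot_score` and `cos_sim` give identical values, which is why normalized embeddings let you use faster dot-product vector indexes. A quick numpy check of that identity:

```python
import numpy as np

v1 = np.array([3.0, 4.0])
v2 = np.array([4.0, 3.0])
# Normalize both vectors to unit length
u1, u2 = v1 / np.linalg.norm(v1), v2 / np.linalg.norm(v2)
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
dot_normalized = np.dot(u1, u2)
print(abs(cosine - dot_normalized) < 1e-9)  # True
```

`model.encode(..., normalize_embeddings=True)` produces unit-length vectors directly, so the dot product can be used as a drop-in for cosine similarity.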
# Efficient batch processing
sentences = ["sentence 1", "sentence 2"] * 1000  # placeholder corpus of 2,000 texts
embeddings = model.encode(
sentences,
batch_size=32,
show_progress_bar=True,
convert_to_tensor=False # or True for PyTorch tensors
)
from sentence_transformers import InputExample, losses
from torch.utils.data import DataLoader
# Training data
train_examples = [
InputExample(texts=['sentence 1', 'sentence 2'], label=0.8),
InputExample(texts=['sentence 3', 'sentence 4'], label=0.3),
]
train_dataloader = DataLoader(train_examples, batch_size=16)
# Loss function
train_loss = losses.CosineSimilarityLoss(model)
# Train
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=10,
warmup_steps=100
)
# Save
model.save('my-finetuned-model')
# Reload later with SentenceTransformer('my-finetuned-model')
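By default, `CosineSimilarityLoss` compares the cosine similarity of the two sentence embeddings against the label using mean-squared error. A numpy sketch of the per-pair loss (an illustration of the objective, not the library code):

```python
import numpy as np

def cosine_similarity_loss(u: np.ndarray, v: np.ndarray, label: float) -> float:
    """Squared error between cos(u, v) and the target similarity label."""
    cos = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return (cos - label) ** 2

u = np.array([1.0, 0.0])
v = np.array([1.0, 0.0])
# Identical vectors (cos = 1.0) against a label of 0.8
print(cosine_similarity_loss(u, v, label=0.8))  # ≈ 0.04
```

This is why the training labels in the example above are similarity scores in [0, 1]: the model is pushed to make embedding cosine similarity match them.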
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
model_name="sentence-transformers/all-mpnet-base-v2"
)
# Use with vector stores
from langchain_chroma import Chroma
vectorstore = Chroma.from_documents(
documents=docs,  # docs: a list of langchain Document objects
embedding=embeddings
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(
model_name="sentence-transformers/all-mpnet-base-v2"
)
from llama_index.core import Settings, VectorStoreIndex
Settings.embed_model = embed_model
# Use in index (documents: a list of llama_index Document objects)
index = VectorStoreIndex.from_documents(documents)
| Model | Dimensions | Speed | Quality | Use Case |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | General, prototyping |
| all-mpnet-base-v2 | 768 | Medium | Better | Production RAG |
| all-roberta-large-v1 | 1024 | Slow | Best | High accuracy needed |
| paraphrase-multilingual | 768 | Medium | Good | Multilingual |
| Model | Speed (sentences/sec) | Memory | Dimension |
|---|---|---|---|
| MiniLM | ~2000 | 120MB | 384 |
| MPNet | ~600 | 420MB | 768 |
| RoBERTa | ~300 | 1.3GB | 1024 |
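Dimension also drives index storage: a float32 embedding costs 4 bytes per dimension, so at one million documents the three models above need roughly 1.5 GB, 3 GB, and 4 GB of raw vector storage. A back-of-envelope estimate (not a measured benchmark):

```python
# Approximate raw float32 embedding storage for 1M documents at each dimension
BYTES_PER_FLOAT32 = 4
n_docs = 1_000_000
for name, dim in [("MiniLM", 384), ("MPNet", 768), ("RoBERTa", 1024)]:
    gb = n_docs * dim * BYTES_PER_FLOAT32 / 1e9
    print(f"{name}: {gb:.2f} GB")
```

Real vector stores add index overhead on top of this, so treat these numbers as a lower bound when sizing deployments.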
Weekly installs: 67
GitHub stars: 5.6K
First seen: Feb 7, 2026
Security audits: Gen Agent Trust Hub (pass), Socket (pass), Snyk (pass)
Installed on: opencode (58), codex (57), cursor (57), gemini-cli (56), claude-code (55), github-copilot (55)