sentencepiece by orchestra-research/ai-research-skills
npx skills add https://github.com/orchestra-research/ai-research-skills --skill sentencepiece

An unsupervised tokenizer that works directly on raw text, with no language-specific preprocessing required.
Use SentencePiece when:
Performance:
Use alternatives instead:
# Python
pip install sentencepiece
# C++ (requires CMake)
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install
# Command line (BPE, vocab size 8000)
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe
# Python API
import sentencepiece as spm
spm.SentencePieceTrainer.train(
input='data.txt',
model_prefix='m',
vocab_size=8000,
model_type='bpe'
)
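Training writes two files, m.model and m.vocab. A minimal sketch for inspecting the result, assuming the m.model trained above:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# The number of pieces matches the requested vocab_size
print(sp.get_piece_size())  # 8000

# The first IDs are the special tokens, followed by learned subwords
for i in range(5):
    print(i, sp.id_to_piece(i))
```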
Training time: ~1-2 minutes for a 100 MB corpus
import sentencepiece as spm
# Load the model
sp = spm.SentencePieceProcessor(model_file='m.model')
# Encode to pieces
pieces = sp.encode('This is a test', out_type=str)
print(pieces) # ['▁This', '▁is', '▁a', '▁test']
# Encode to IDs
ids = sp.encode('This is a test', out_type=int)
print(ids) # [284, 47, 11, 1243]
# Decode
text = sp.decode(ids)
print(text) # "This is a test"
text = "Hello world"
pieces = sp.encode(text, out_type=str)
print(pieces) # ['▁Hello', '▁world']
# Decoding preserves spaces
decoded = sp.decode_pieces(pieces)
print(decoded) # "Hello world"
Key principle: treat text as raw Unicode; whitespace becomes ▁ (the meta symbol)
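Because every space is encoded as the ▁ piece, detokenization is a plain string operation. A minimal sketch of doing it by hand, reusing the m.model from above:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')
pieces = sp.encode('Hello world', out_type=str)

# Joining the pieces and mapping ▁ (U+2581) back to spaces restores the
# input exactly; no language-specific detokenizer is needed
text = ''.join(pieces).replace('\u2581', ' ').lstrip()
assert text == 'Hello world'
```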
spm.SentencePieceTrainer.train(
input='data.txt',
model_prefix='bpe_model',
vocab_size=16000,
model_type='bpe'
)
Used by: mBART
spm.SentencePieceTrainer.train(
input='data.txt',
model_prefix='unigram_model',
vocab_size=8000,
model_type='unigram'
)
Used by: T5, ALBERT, XLNet
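The two algorithms usually segment the same text differently, so it can be worth comparing them side by side. A minimal sketch, assuming the bpe_model.model and unigram_model.model files trained above:

```python
import sentencepiece as spm

bpe = spm.SentencePieceProcessor(model_file='bpe_model.model')
uni = spm.SentencePieceProcessor(model_file='unigram_model.model')

# BPE merges greedily upward from characters; unigram prunes a large
# seed vocabulary down, so the segmentations often differ
sentence = 'Tokenization is unsupervised'
print(bpe.encode(sentence, out_type=str))
print(uni.encode(sentence, out_type=str))
```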
spm.SentencePieceTrainer.train(
input='corpus.txt',
model_prefix='m',
vocab_size=32000,
model_type='unigram',
character_coverage=0.9995,  # use 1.0 for CJK languages
user_defined_symbols=['[SEP]', '[CLS]'],
unk_piece='<unk>',
num_threads=16
)
| Language type | Coverage | Rationale |
|---|---|---|
| English | 0.9995 | Covers the most common characters |
| CJK (e.g. Chinese) | 1.0 | All characters are needed |
| Multilingual | 0.9995 | Balanced trade-off |
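A quick way to check whether the chosen coverage is adequate is to measure how often encoding falls back to the unknown piece. A minimal sketch, assuming a trained m.model and a held-out sample sentence:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# Characters excluded by character_coverage map to <unk>; with
# coverage 1.0 on CJK corpora this rate should be near zero
ids = sp.encode('a held-out sample sentence', out_type=int)
unk_rate = sum(i == sp.unk_id() for i in ids) / len(ids)
print(f'unk rate: {unk_rate:.2%}')
```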
# Sample different tokenizations
for _ in range(3):
    pieces = sp.encode('tokenization', out_type=str, enable_sampling=True, alpha=0.1)
    print(pieces)
# Output (may differ on each run):
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']
Use case: data augmentation to make models more robust.
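In training pipelines this sampling is typically applied on the fly, so the model sees a different segmentation of the same sentence each epoch. A minimal sketch of such an augmentation step (augment is a hypothetical helper; nbest_size=-1 samples from all hypotheses and requires a unigram model):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='unigram_model.model')

def augment(text, alpha=0.1):
    # Hypothetical helper: returns one sampled segmentation as IDs
    return sp.encode(text, out_type=int,
                     enable_sampling=True, alpha=alpha, nbest_size=-1)

# Each epoch feeds the model a freshly sampled segmentation
for epoch in range(3):
    print(epoch, augment('tokenization is fun'))
```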
spm.SentencePieceTrainer.train(
input='c4_corpus.txt',
model_prefix='t5',
vocab_size=32000,
model_type='unigram',
user_defined_symbols=[f'<extra_id_{i}>' for i in range(100)],
unk_id=2,
eos_id=1,
pad_id=0
)
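Sentinel tokens registered via user_defined_symbols are guaranteed to stay single pieces, which T5's span-corruption objective depends on. A short sketch verifying this, assuming the t5.model file produced above:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='t5.model')

# Each sentinel maps to exactly one ID and round-trips unchanged
sid = sp.piece_to_id('<extra_id_0>')
assert sp.id_to_piece(sid) == '<extra_id_0>'

# The reserved IDs match the training arguments
print(sp.pad_id(), sp.eos_id(), sp.unk_id())  # 0 1 2
```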
from transformers import T5Tokenizer
# T5 uses SentencePiece internally
tokenizer = T5Tokenizer.from_pretrained('t5-base')
inputs = tokenizer('translate English to French: Hello', return_tensors='pt')
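The tokenizer wraps the same SentencePiece model, so you can inspect the underlying pieces and decode back to the original string. A short usage sketch:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-base')
inputs = tokenizer('translate English to French: Hello', return_tensors='pt')

# The IDs correspond to SentencePiece pieces (note the ▁ prefixes)
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]))

# Decoding drops the appended </s> and restores the original text
print(tokenizer.decode(inputs['input_ids'][0], skip_special_tokens=True))
```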
| Corpus | BPE (16k) | Unigram (8k) |
|---|---|---|
| 100 MB | 1-2 min | 3-4 min |
| 1 GB | 10-15 min | 30-40 min |
T5 family: t5-base, t5-large (32k vocab, Unigram)
ALBERT: albert-base-v2 (30k vocab, Unigram)
XLNet: xlnet-base-cased (32k vocab, Unigram)
mBART: facebook/mbart-large-50 (250k vocab, BPE)
Weekly Installs: 62
Repository: github.com/orchestra-research/ai-research-skills
GitHub Stars: 5.5K
First Seen: Feb 7, 2026
Security Audits: Gen Agent Trust Hub: Fail · Socket: Pass · Snyk: Warn
Installed on: codex (53), opencode (53), cursor (53), gemini-cli (52), claude-code (51), github-copilot (51)