sentencepiece by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill sentencepiece
Unsupervised tokenizer that works on raw text without language-specific preprocessing.
Use SentencePiece when:
Performance:
Use alternatives instead:
# Python
pip install sentencepiece
# C++ (requires CMake)
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install
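A quick way to confirm the Python bindings are installed (recent releases expose a __version__ attribute; if yours does not, a plain import succeeding is enough):

import sentencepiece as spm
print(spm.__version__)  # prints the installed release, e.g. 0.2.x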
# Command-line (BPE with 8000 vocab)
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe
# Python API
import sentencepiece as spm
spm.SentencePieceTrainer.train(
input='data.txt',
model_prefix='m',
vocab_size=8000,
model_type='bpe'
)
Training time: ~1-2 minutes for a 100 MB corpus
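To check that figure on your own hardware, the same training call can be wrapped in a timer; wall-clock time scales roughly with corpus size and the num_threads setting:

import time
import sentencepiece as spm

start = time.perf_counter()
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='m',
    vocab_size=8000,
    model_type='bpe'
)
print(f'Training took {time.perf_counter() - start:.1f} s')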
import sentencepiece as spm
# Load model
sp = spm.SentencePieceProcessor(model_file='m.model')
# Encode to pieces
pieces = sp.encode('This is a test', out_type=str)
print(pieces) # ['▁This', '▁is', '▁a', '▁test']
# Encode to IDs
ids = sp.encode('This is a test', out_type=int)
print(ids) # [284, 47, 11, 1243]
# Decode
text = sp.decode(ids)
print(text) # "This is a test"
text = "Hello world"
pieces = sp.encode(text, out_type=str)
print(pieces) # ['▁Hello', '▁world']
# Decode preserves spaces
decoded = sp.decode_pieces(pieces)
print(decoded) # "Hello world"
Key principle: text is treated as a raw Unicode sequence, and whitespace is encoded as the meta symbol ▁ (U+2581)
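Because each piece keeps its ▁ marker where a space was, detokenization is just concatenation plus one character substitution. A small sketch reusing the sp processor loaded above (the exact split depends on your training corpus):

pieces = sp.encode('New York City', out_type=str)
print(pieces)  # e.g. ['▁New', '▁York', '▁City'], possibly split further

# Manual detokenization: join the pieces, then turn the meta symbol back into spaces
manual = ''.join(pieces).replace('▁', ' ').lstrip()
print(manual)                       # "New York City"
print(manual == sp.decode(pieces))  # True, as long as every character is covered by the model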
spm.SentencePieceTrainer.train(
input='data.txt',
model_prefix='bpe_model',
vocab_size=16000,
model_type='bpe'
)
Used by: mBART
spm.SentencePieceTrainer.train(
input='data.txt',
model_prefix='unigram_model',
vocab_size=8000,
model_type='unigram'
)
Used by: T5, ALBERT, XLNet
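To see how the two algorithms segment the same input differently, a sketch that loads the bpe_model and unigram_model trained above (the exact splits depend on your corpus):

import sentencepiece as spm

bpe = spm.SentencePieceProcessor(model_file='bpe_model.model')
uni = spm.SentencePieceProcessor(model_file='unigram_model.model')

for text in ['internationalization', 'tokenizers are fun']:
    print('BPE    :', bpe.encode(text, out_type=str))
    print('Unigram:', uni.encode(text, out_type=str))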
spm.SentencePieceTrainer.train(
input='corpus.txt',
model_prefix='m',
vocab_size=32000,
model_type='unigram',
character_coverage=0.9995, # 1.0 for CJK
user_defined_symbols=['[SEP]', '[CLS]'],
unk_piece='<unk>',
num_threads=16
)
| Language Type | Coverage | Rationale |
|---|---|---|
| English | 0.9995 | Most common chars |
| CJK (Chinese) | 1.0 | All characters needed |
| Multilingual | 0.9995 | Balance |
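Applied to a CJK corpus, the full-coverage setting from the table looks like this (zh_corpus.txt is a hypothetical file; byte_fallback is an optional flag in recent SentencePiece releases that maps any remaining unseen characters to bytes instead of <unk>):

spm.SentencePieceTrainer.train(
    input='zh_corpus.txt',       # hypothetical Chinese corpus
    model_prefix='zh',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=1.0,      # CJK: keep every character seen in the corpus
    byte_fallback=True           # optional: unseen characters fall back to bytes
)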
# Sample different tokenizations
for _ in range(3):
pieces = sp.encode('tokenization', out_type=str, enable_sampling=True, alpha=0.1)
print(pieces)
# Output (different each time):
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']
Use case: data augmentation (subword regularization) to make models more robust to segmentation variance.
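A sketch of how sampled segmentations can feed a training pipeline so that each epoch sees a different split of the same sentence. Note that the semantics depend on the model type: for unigram models, nbest_size and alpha control sampling from the segmentation lattice; for BPE models, alpha acts as a dropout probability and nbest_size is ignored.

# Produce several sampled segmentations per sentence as augmented training data
sentences = ['this is a test', 'subword regularization improves robustness']

augmented = []
for sent in sentences:
    for _ in range(4):  # 4 differently segmented copies per sentence
        ids = sp.encode(sent, out_type=int, enable_sampling=True,
                        nbest_size=-1, alpha=0.1)
        augmented.append(ids)

print(len(augmented))  # 8 tokenized samples; segmentations vary between copies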
spm.SentencePieceTrainer.train(
input='c4_corpus.txt',
model_prefix='t5',
vocab_size=32000,
model_type='unigram',
user_defined_symbols=[f'<extra_id_{i}>' for i in range(100)],
unk_id=2,
eos_id=1,
pad_id=0
)
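After training, it is worth checking that the sentinel tokens and special IDs came out as configured; a small verification sketch (assumes t5.model was produced by the call above):

sp_t5 = spm.SentencePieceProcessor(model_file='t5.model')

print(sp_t5.piece_to_id('<extra_id_0>'))               # sentinel tokens are single pieces
print(sp_t5.pad_id(), sp_t5.eos_id(), sp_t5.unk_id())  # 0, 1, 2 as configured above
print(sp_t5.get_piece_size())                          # total vocabulary size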
from transformers import T5Tokenizer
# T5 uses SentencePiece internally
tokenizer = T5Tokenizer.from_pretrained('t5-base')
inputs = tokenizer('translate English to French: Hello', return_tensors='pt')
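Converting the IDs back to tokens makes the underlying SentencePiece pieces visible (the ▁ prefix marks word starts):

tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist())
print(tokens)  # pieces such as '▁translate', '▁English', ..., plus the closing '</s>'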
| Corpus | BPE (16k) | Unigram (8k) |
|---|---|---|
| 100 MB | 1-2 min | 3-4 min |
| 1 GB | 10-15 min | 30-40 min |
T5 family: t5-base, t5-large (32k vocab, Unigram)
ALBERT: albert-base-v2 (30k vocab, Unigram)
XLNet: xlnet-base-cased (32k vocab, Unigram)
mBART: facebook/mbart-large-50 (250k vocab, BPE)
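These vocabulary sizes can be checked directly against the released tokenizers (requires the transformers and sentencepiece packages plus network access; reported sizes may differ slightly because of added special tokens):

from transformers import AutoTokenizer

for name in ['t5-base', 'albert-base-v2', 'xlnet-base-cased']:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.vocab_size)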