sentencepiece by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill sentencepiece
Unsupervised tokenizer that works on raw text without language-specific preprocessing.
Use SentencePiece when:
Performance:
Use alternatives instead:
# Python
pip install sentencepiece
# C++ (requires CMake)
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install
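A quick way to confirm the Python bindings are installed (recent releases expose a __version__ attribute; if yours does not, a plain import succeeding is enough):

import sentencepiece as spm
print(spm.__version__)  # prints the installed release, e.g. 0.2.x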
# Command-line (BPE with 8000 vocab)
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe
# Python API
import sentencepiece as spm
spm.SentencePieceTrainer.train(
input='data.txt',
model_prefix='m',
vocab_size=8000,
model_type='bpe'
)
Training time: ~1-2 minutes for a 100 MB corpus
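To check that figure on your own hardware, the same training call can be wrapped in a timer; wall-clock time scales roughly with corpus size and the num_threads setting:

import time
import sentencepiece as spm

start = time.perf_counter()
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='m',
    vocab_size=8000,
    model_type='bpe'
)
print(f'Training took {time.perf_counter() - start:.1f} s')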
import sentencepiece as spm
# Load model
sp = spm.SentencePieceProcessor(model_file='m.model')
# Encode to pieces
pieces = sp.encode('This is a test', out_type=str)
print(pieces) # ['▁This', '▁is', '▁a', '▁test']
# Encode to IDs
ids = sp.encode('This is a test', out_type=int)
print(ids) # [284, 47, 11, 1243]
# Decode
text = sp.decode(ids)
print(text) # "This is a test"
text = "Hello world"
pieces = sp.encode(text, out_type=str)
print(pieces) # ['▁Hello', '▁world']
# Decode preserves spaces
decoded = sp.decode_pieces(pieces)
print(decoded) # "Hello world"
Key principle: text is treated as a raw Unicode sequence, and whitespace is encoded as the meta symbol ▁ (U+2581)
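Because each piece keeps its ▁ marker where a space was, detokenization is just concatenation plus one character substitution. A small sketch reusing the sp processor loaded above (the exact split depends on your training corpus):

pieces = sp.encode('New York City', out_type=str)
print(pieces)  # e.g. ['▁New', '▁York', '▁City'], possibly split further

# Manual detokenization: join the pieces, then turn the meta symbol back into spaces
manual = ''.join(pieces).replace('▁', ' ').lstrip()
print(manual)                       # "New York City"
print(manual == sp.decode(pieces))  # True, as long as every character is covered by the model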
spm.SentencePieceTrainer.train(
input='data.txt',
model_prefix='bpe_model',
vocab_size=16000,
model_type='bpe'
)
Used by: mBART
spm.SentencePieceTrainer.train(
input='data.txt',
model_prefix='unigram_model',
vocab_size=8000,
model_type='unigram'
)
Used by: T5, ALBERT, XLNet
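To see how the two algorithms segment the same input differently, a sketch that loads the bpe_model and unigram_model trained above (the exact splits depend on your corpus):

import sentencepiece as spm

bpe = spm.SentencePieceProcessor(model_file='bpe_model.model')
uni = spm.SentencePieceProcessor(model_file='unigram_model.model')

for text in ['internationalization', 'tokenizers are fun']:
    print('BPE    :', bpe.encode(text, out_type=str))
    print('Unigram:', uni.encode(text, out_type=str))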
spm.SentencePieceTrainer.train(
input='corpus.txt',
model_prefix='m',
vocab_size=32000,
model_type='unigram',
character_coverage=0.9995, # 1.0 for CJK
user_defined_symbols=['[SEP]', '[CLS]'],
unk_piece='<unk>',
num_threads=16
)
| Language Type | Coverage | Rationale |
|---|---|---|
| English | 0.9995 | Most common chars |
| CJK (Chinese) | 1.0 | All characters needed |
| Multilingual | 0.9995 | Balance |
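Applied to a CJK corpus, the full-coverage setting from the table looks like this (zh_corpus.txt is a hypothetical file; byte_fallback is an optional flag in recent SentencePiece releases that maps any remaining unseen characters to bytes instead of <unk>):

spm.SentencePieceTrainer.train(
    input='zh_corpus.txt',       # hypothetical Chinese corpus
    model_prefix='zh',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=1.0,      # CJK: keep every character seen in the corpus
    byte_fallback=True           # optional: unseen characters fall back to bytes
)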
# Sample different tokenizations
for _ in range(3):
pieces = sp.encode('tokenization', out_type=str, enable_sampling=True, alpha=0.1)
print(pieces)
# Output (different each time):
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']
Use case: data augmentation (subword regularization) to make models more robust to segmentation variance.
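A sketch of how sampled segmentations can feed a training pipeline so that each epoch sees a different split of the same sentence. Note that the semantics depend on the model type: for unigram models, nbest_size and alpha control sampling from the segmentation lattice; for BPE models, alpha acts as a dropout probability and nbest_size is ignored.

# Produce several sampled segmentations per sentence as augmented training data
sentences = ['this is a test', 'subword regularization improves robustness']

augmented = []
for sent in sentences:
    for _ in range(4):  # 4 differently segmented copies per sentence
        ids = sp.encode(sent, out_type=int, enable_sampling=True,
                        nbest_size=-1, alpha=0.1)
        augmented.append(ids)

print(len(augmented))  # 8 tokenized samples; segmentations vary between copies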
spm.SentencePieceTrainer.train(
input='c4_corpus.txt',
model_prefix='t5',
vocab_size=32000,
model_type='unigram',
user_defined_symbols=[f'<extra_id_{i}>' for i in range(100)],
unk_id=2,
eos_id=1,
pad_id=0
)
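After training, it is worth checking that the sentinel tokens and special IDs came out as configured; a small verification sketch (assumes t5.model was produced by the call above):

sp_t5 = spm.SentencePieceProcessor(model_file='t5.model')

print(sp_t5.piece_to_id('<extra_id_0>'))               # sentinel tokens are single pieces
print(sp_t5.pad_id(), sp_t5.eos_id(), sp_t5.unk_id())  # 0, 1, 2 as configured above
print(sp_t5.get_piece_size())                          # total vocabulary size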
from transformers import T5Tokenizer
# T5 uses SentencePiece internally
tokenizer = T5Tokenizer.from_pretrained('t5-base')
inputs = tokenizer('translate English to French: Hello', return_tensors='pt')
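Converting the IDs back to tokens makes the underlying SentencePiece pieces visible (the ▁ prefix marks word starts):

tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist())
print(tokens)  # pieces such as '▁translate', '▁English', ..., plus the closing '</s>'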
| Corpus | BPE (16k) | Unigram (8k) |
|---|---|---|
| 100 MB | 1-2 min | 3-4 min |
| 1 GB | 10-15 min | 30-40 min |
T5 family: t5-base, t5-large (32k vocab, Unigram)
ALBERT: albert-base-v2 (30k vocab, Unigram)
XLNet: xlnet-base-cased (32k vocab, Unigram)
mBART: facebook/mbart-large-50 (250k vocab, BPE)
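These vocabulary sizes can be checked directly against the released tokenizers (requires the transformers and sentencepiece packages plus network access; reported sizes may differ slightly because of added special tokens):

from transformers import AutoTokenizer

for name in ['t5-base', 'albert-base-v2', 'xlnet-base-cased']:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.vocab_size)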