npx skills add https://github.com/jmsktm/claude-settings --skill 'NLP Pipeline Builder'
The NLP Pipeline Builder skill guides you through designing and implementing natural language processing pipelines that transform raw text into structured, actionable insights. From preprocessing to advanced analysis, this skill covers the full spectrum of NLP tasks and helps you choose the right approach for your specific needs.
Modern NLP offers multiple paradigms: rule-based approaches, classical ML, and deep learning/LLMs. This skill helps you navigate these options, building pipelines that balance accuracy, latency, cost, and maintainability. Whether you need real-time processing at scale or deep analysis of specific documents, this skill ensures your pipeline is fit for purpose.
From tokenization to semantic analysis, from single documents to streaming text, this skill helps you build robust NLP systems that handle real-world text with all its messiness and complexity.
Define requirements:
- Input: what text? What format? What volume?
- Output: what information to extract?
- Constraints: latency, accuracy, cost
Select pipeline stages:
Standard NLP Pipeline: Text → Preprocessing → Tokenization → Feature Extraction → Task Model → Output
Example stages:
- Preprocessing: cleaning, normalization
- Linguistic: tokenization, POS, NER, parsing
- Semantic: embeddings, topic modeling
- Task-specific: classification, extraction, generation
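Stages like these can be composed as plain functions so each step is testable in isolation. A minimal sketch — the stage functions here are illustrative placeholders, not part of any specific library:

```python
from functools import reduce

def preprocess(text):
    # Stage 1: cleaning/normalization (placeholder: trim + lowercase)
    return text.strip().lower()

def tokenize(text):
    # Stage 2: whitespace tokenization (real pipelines use proper tokenizers)
    return text.split()

def extract_features(tokens):
    # Stage 3: features (placeholder: token counts)
    return {tok: tokens.count(tok) for tok in tokens}

def compose(*stages):
    # Chain stages left to right: the output of one feeds the next
    return lambda x: reduce(lambda acc, f: f(acc), stages, x)

pipeline = compose(preprocess, tokenize, extract_features)
features = pipeline("  The cat sat on the mat  ")
```

Swapping a stage (say, a subword tokenizer for the whitespace one) then touches a single function rather than the whole pipeline.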
Choose approach per stage:

| Stage | Classical | Deep Learning | LLM |
|---|---|---|---|
| Tokenization | Regex, NLTK | SentencePiece | Model-specific |
| NER | CRF, rules | BiLSTM-CRF, BERT | Prompt-based |
| Classification | SVM, NB | CNN, BERT | Zero/few-shot |
| Extraction | Regex, patterns | Seq2Seq | Prompt-based |
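For the LLM column, "prompt-based" classification usually means building a zero-shot prompt and sending it to a model. A sketch of the prompt-construction half only; the wording and label names are illustrative, not tied to any particular provider:

```python
def zero_shot_prompt(text, labels):
    # Build a zero-shot classification prompt; the model call itself
    # is provider-specific and omitted here
    label_list = ", ".join(labels)
    return (
        f"Classify the following text into exactly one of: {label_list}.\n"
        f"Text: {text}\n"
        "Answer with the label only."
    )

prompt = zero_shot_prompt("Refund has not arrived", ["billing", "shipping", "other"])
```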
Clean text:
import unicodedata

def remove_control_characters(text):
    # Drop control characters (category "Cc"), keeping whitespace
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc" or ch in "\t\n\r")

def clean_text(text):
    text = unicodedata.normalize("NFKC", text)  # unify compatibility forms
    text = remove_control_characters(text)
    text = " ".join(text.split())  # collapse whitespace runs
    return text
Segment into units:
- Sentence segmentation
- Paragraph detection
- Document structuring

Tokenize appropriately:
- Word tokenization for analysis
- Subword tokenization for models
- Language-specific considerations

Normalize for consistency:
- Case normalization
- Lemmatization/stemming
- Handling contractions and abbreviations
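The segmentation and normalization steps can be sketched with the standard library alone. This is a naive illustration, assuming simple punctuation-based sentence boundaries; a production pipeline would use a trained splitter (abbreviations like "Dr." break this regex) and a real lemmatizer:

```python
import re

def split_sentences(text):
    # Naive splitter: break after ., !, ? followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def normalize_token(token):
    # Case-fold and strip surrounding punctuation (no lemmatization here)
    return token.casefold().strip(".,!?;:\"'")

sentences = split_sentences("It rained. We stayed inside!")
tokens = [normalize_token(t) for t in sentences[1].split()]
```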
Set up processing infrastructure:
class NLPPipeline:
    def __init__(self, config):
        self.preprocessor = TextPreprocessor(config)
        self.tokenizer = load_tokenizer(config.tokenizer)
        self.models = {
            "ner": load_model(config.ner_model),
            "sentiment": load_model(config.sentiment_model),
            "classification": load_model(config.classifier),
        }
        self.cache = ResultCache() if config.use_cache else None

    def process(self, text, tasks=None):
        tasks = tasks or ["all"]
        # Preprocessing
        cleaned = self.preprocessor.clean(text)
        tokens = self.tokenizer.tokenize(cleaned)
        # Run requested analyses
        results = {"text": text, "tokens": tokens}
        for task, model in self.models.items():
            if task in tasks or "all" in tasks:
                results[task] = model.predict(tokens)
        return results
- Implement batching for throughput
- Add caching for repeated inputs
- Set up monitoring and logging
- Test with diverse inputs
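For the caching step, one simple approach is to key results on a hash of the normalized input, so trivially different copies of the same text still hit the cache. The `ResultCache` name mirrors the class referenced earlier; this implementation is an assumption, not the skill's actual code:

```python
import hashlib

class ResultCache:
    def __init__(self):
        self._store = {}
        self.hits = 0

    def _key(self, text):
        # Hash whitespace-normalized text so near-duplicates share a key
        normalized = " ".join(text.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, text, compute):
        key = self._key(text)
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = compute(text)
        return self._store[key]

cache = ResultCache()
word_count = lambda t: len(t.split())
r1 = cache.get_or_compute("hello  world", word_count)  # computed
r2 = cache.get_or_compute("hello world", word_count)   # cache hit
```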
| Action | Command/Trigger |
|---|---|
| Design pipeline | "Design NLP pipeline for [task]" |
| Preprocess text | "How to preprocess [text type]" |
| Choose tokenizer | "Best tokenizer for [use case]" |
| Extract entities | "Extract entities from text" |
| Classify text | "Build text classifier" |
| Scale pipeline | "Scale NLP to [volume]" |
Understand your text: different text requires different treatment
Preserve what matters: preprocessing shouldn't destroy information
Handle encoding correctly: Unicode is tricky
Batch for efficiency: model inference is expensive
Fail gracefully: text is messy and unpredictable
Version your pipeline: reproducibility matters
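The "fail gracefully" principle can be as simple as wrapping each analysis step so one malformed document records an error instead of aborting the whole run. A minimal sketch; `safe_analyze` and `word_count` are illustrative names, not from the skill:

```python
def safe_analyze(analyzer, text, default=None):
    # Capture failures per document instead of raising,
    # so one bad input does not kill the batch
    try:
        return {"ok": True, "result": analyzer(text)}
    except Exception as exc:
        return {"ok": False, "result": default, "error": str(exc)}

def word_count(text):
    if not isinstance(text, str):
        raise TypeError("expected str")
    return len(text.split())

good = safe_analyze(word_count, "two words")
bad = safe_analyze(word_count, None)
```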
Chain extractors for complex information:
class ExtractionPipeline:
    def __init__(self):
        self.ner = NERModel()
        self.relation = RelationExtractor()
        self.coreference = CoreferenceResolver()

    def extract(self, text):
        # Stage 1: named entity recognition
        entities = self.ner.extract(text)
        # Stage 2: coreference resolution
        resolved = self.coreference.resolve(text, entities)
        # Stage 3: relation extraction
        relations = self.relation.extract(text, resolved)
        # Stage 4: build knowledge graph
        graph = build_graph(resolved, relations)
        return {
            "entities": resolved,
            "relations": relations,
            "graph": graph,
        }
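The `build_graph` helper above is not shown. A minimal version could store entities as nodes and relations as labeled edge triples; this adjacency representation is an assumption, and the example entities are illustrative:

```python
def build_graph(entities, relations):
    # entities: iterable of entity names
    # relations: (head, label, tail) triples
    graph = {"nodes": set(entities), "edges": []}
    for head, label, tail in relations:
        graph["nodes"].update([head, tail])  # ensure endpoints exist as nodes
        graph["edges"].append({"head": head, "label": label, "tail": tail})
    return graph

g = build_graph(
    ["Marie Curie", "Pierre Curie"],
    [("Marie Curie", "spouse_of", "Pierre Curie"),
     ("Marie Curie", "born_in", "Warsaw")],
)
```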
Use LLMs where they add value, classical where they don't:
class HybridPipeline:
    def process(self, text):
        # Fast classical preprocessing
        cleaned = classical_clean(text)
        sentences = classical_sentence_split(cleaned)
        # Classical NER (fast, predictable)
        entities = classical_ner(sentences)
        # LLM for complex tasks (slower, more capable)
        sentiment = llm_sentiment(text)  # nuanced sentiment
        summary = llm_summarize(text)    # abstractive summary
        return {
            "sentences": sentences,
            "entities": entities,    # classical
            "sentiment": sentiment,  # LLM
            "summary": summary,      # LLM
        }
Handle continuous text streams:
import time

class StreamingNLP:
    def __init__(self, pipeline, batch_size=32, timeout_ms=100):
        self.pipeline = pipeline
        self.batch_size = batch_size
        self.timeout_ms = timeout_ms
        self.buffer = []
        self.last_process_time = time.time()

    async def add(self, text):
        self.buffer.append(text)
        # Process if batch is full or the timeout has elapsed
        if len(self.buffer) >= self.batch_size:
            return await self.flush()
        elif (time.time() - self.last_process_time) * 1000 > self.timeout_ms:
            return await self.flush()

    async def flush(self):
        if not self.buffer:
            return []
        batch = self.buffer
        self.buffer = []
        self.last_process_time = time.time()
        # Batch process
        results = await self.pipeline.process_batch(batch)
        return results
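To see the micro-batching pattern end to end without a real model, here is a self-contained, condensed variant (renamed `MicroBatcher` to avoid clashing with the class above) driven by a stub pipeline. The long timeout keeps the demo deterministic; only the size-based flush fires:

```python
import asyncio
import time

class MicroBatcher:
    # Condensed version of the streaming batcher above, pipeline injected
    def __init__(self, pipeline, batch_size=3, timeout_ms=10_000):
        self.pipeline = pipeline
        self.batch_size = batch_size
        self.timeout_ms = timeout_ms
        self.buffer = []
        self.last_process_time = time.time()

    async def add(self, text):
        self.buffer.append(text)
        elapsed_ms = (time.time() - self.last_process_time) * 1000
        if len(self.buffer) >= self.batch_size or elapsed_ms > self.timeout_ms:
            return await self.flush()
        return None

    async def flush(self):
        if not self.buffer:
            return []
        batch, self.buffer = self.buffer, []
        self.last_process_time = time.time()
        return await self.pipeline.process_batch(batch)

class EchoPipeline:
    # Stand-in for a real NLP pipeline: tags each item with its length
    async def process_batch(self, batch):
        return [{"text": t, "length": len(t)} for t in batch]

async def main():
    nlp = MicroBatcher(EchoPipeline(), batch_size=3)
    out = []
    for text in ["a", "bb", "ccc", "dddd"]:
        res = await nlp.add(text)  # third add triggers a size-based flush
        if res:
            out.extend(res)
    out.extend(await nlp.flush())  # drain the remainder on shutdown
    return out

results = asyncio.run(main())
```

Note the final `flush()` on shutdown: without it, items still in the buffer when the stream ends would be silently dropped.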
Handle multilingual text:
class MultilingualPipeline:
    def __init__(self):
        self.detector = LanguageDetector()
        self.pipelines = {
            "en": EnglishPipeline(),
            "es": SpanishPipeline(),
            "zh": ChinesePipeline(),
            "default": UniversalPipeline(),
        }

    def process(self, text):
        lang = self.detector.detect(text)
        pipeline = self.pipelines.get(lang, self.pipelines["default"])
        return {
            "language": lang,
            "results": pipeline.process(text),
        }
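The `LanguageDetector` above is assumed; in practice you would use a library such as `langdetect` or a fastText model. Purely as an illustration of the routing idea, a toy detector that sends CJK text to the Chinese pipeline by Unicode range (this is script detection, not real language identification):

```python
def detect_language(text):
    # Toy heuristic: CJK unified ideographs route to "zh",
    # everything else falls back to the default pipeline
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":
            return "zh"
    return "default"

lang_zh = detect_language("你好,世界")
lang_other = detect_language("Hello, world")
```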