Natural Language Processing by aj-geddes/useful-ai-prompts
npx skills add https://github.com/aj-geddes/useful-ai-prompts --skill 'Natural Language Processing'

此技能提供了一套全面的工具,用于构建 NLP 应用程序,涵盖现代 Transformer 模型(如 BERT、GPT)与经典 NLP 技术,可用于文本分类、命名实体识别、情感分析等任务。
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import re
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
AutoModelForTokenClassification, pipeline,
TextClassificationPipeline)
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import warnings
warnings.filterwarnings('ignore')
# Download every NLTK resource the script actually uses. The original only
# fetched 'punkt', but word_tokenize (newer NLTK also needs 'punkt_tab'),
# stopwords.words('english') and WordNetLemmatizer require their corpora
# too — without them preprocess_text raises LookupError at runtime.
for _resource, _path in [('punkt', 'tokenizers/punkt'),
                         ('punkt_tab', 'tokenizers/punkt_tab'),
                         ('stopwords', 'corpora/stopwords'),
                         ('wordnet', 'corpora/wordnet')]:
    try:
        nltk.data.find(_path)
    except LookupError:
        nltk.download(_resource, quiet=True)
print("=== 1. 文本预处理 ===")
def preprocess_text(text, remove_stopwords=True, lemmatize=True):
    """Full text-preprocessing pipeline.

    Lowercases the input, strips everything except ASCII letters and
    whitespace, tokenizes with NLTK, then optionally drops English
    stopwords and lemmatizes each token.

    Returns a tuple ``(token_list, space_joined_string)``.
    """
    cleaned = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    tokens = word_tokenize(cleaned)
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        tokens = [tok for tok in tokens if tok not in stop_words]
    if lemmatize:
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
    return tokens, ' '.join(tokens)
# Demonstrate the preprocessing pipeline on one sentence.
sample_text = "The quick brown foxes are jumping over the lazy dogs! Amazing performance."
tokens, processed = preprocess_text(sample_text)
print(f"原始文本: {sample_text}")
print(f"处理后: {processed}")
print(f"词元: {tokens}\n")
# 2. Text classification with sklearn
print("=== 2. 传统文本分类 ===")
# Toy labelled sentiment data
texts = [
    "I love this product, it's amazing!",
    "This movie is fantastic and entertaining.",
    "Best purchase ever, highly recommended.",
    "Terrible quality, very disappointed.",
    "Worst experience, waste of money.",
    "Horrible service and poor quality.",
    "The food was delicious and fresh.",
    "Great atmosphere and friendly staff.",
    "Bad weather today, very gloomy.",
    "The book was boring and uninteresting."
]
labels = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]  # 1: positive, 0: negative
# TF-IDF features over unigrams and bigrams
tfidf = TfidfVectorizer(max_features=100, ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(texts)
# Train a multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_tfidf, labels)
# NOTE(review): the metrics below are computed on the training data itself
# (fit and predict share X_tfidf), so they are optimistic in-sample scores.
predictions = clf.predict(X_tfidf)
print(f"准确率: {accuracy_score(labels, predictions):.4f}")
print(f"精确率: {precision_score(labels, predictions):.4f}")
print(f"召回率: {recall_score(labels, predictions):.4f}")
print(f"F1 分数: {f1_score(labels, predictions):.4f}\n")
# 3. Transformer-based text classification
print("=== 3. 基于 Transformer 的分类 ===")
try:
    # Pretrained sentiment pipeline (model is downloaded on first use).
    sentiment_pipeline = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english"
    )
    test_sentences = [
        "This is a wonderful movie!",
        "I absolutely hate this product.",
        "It's okay, nothing special.",
        "Amazing quality and fast delivery!"
    ]
    print("情感分析结果:")
    for sentence in test_sentences:
        result = sentiment_pipeline(sentence)
        print(f" 文本: {sentence}")
        print(f" 情感: {result[0]['label']}, 分数: {result[0]['score']:.4f}\n")
except Exception as e:
    # Offline / model unavailable — degrade gracefully.
    print(f"Transformer 模型不可用: {str(e)}\n")
# 4. Named entity recognition
print("=== 4. 命名实体识别 ===")
try:
    # FIX: the original loaded "distilbert-base-uncased", a bare language
    # model with no token-classification head, so the NER pipeline got a
    # randomly initialized head and produced meaningless entities. Use a
    # checkpoint actually fine-tuned for NER (CoNLL-03) instead.
    ner_pipeline = pipeline(
        "ner",
        model="dslim/bert-base-NER",
        aggregation_strategy="simple"
    )
    text = "Apple Inc. was founded by Steve Jobs in Cupertino, California."
    entities = ner_pipeline(text)
    print(f"文本: {text}")
    print("实体:")
    for entity in entities:
        print(f" {entity['word']}: {entity['entity_group']} (置信度: {entity['score']:.4f})")
except Exception as e:
    print(f"NER 模型不可用: {str(e)}\n")
# 5. Word embeddings and document similarity
print("\n=== 5. 词嵌入与相似度 ===")
from sklearn.metrics.pairwise import cosine_similarity
# Bag-of-words counts serve as simple document embeddings.
vectorizer = CountVectorizer(max_features=50)
docs = [
    "machine learning is great",
    "deep learning uses neural networks",
    "machine learning and deep learning"
]
embeddings = vectorizer.fit_transform(docs).toarray()
# Pairwise cosine similarity between the document vectors.
similarity_matrix = cosine_similarity(embeddings)
doc_names = [f"Doc{i}" for i in range(len(docs))]
print("文档相似度矩阵:")
print(pd.DataFrame(similarity_matrix, columns=doc_names, index=doc_names).round(3))
# 6. Tokenization and vocabulary statistics
print("\n=== 6. 分词分析 ===")
corpus = " ".join(texts)
tokens, _ = preprocess_text(corpus)
# Token frequency table over the classification corpus.
vocab = Counter(tokens)
print(f"词汇表大小: {len(vocab)}")
print("前 10 个最常出现的词:")
for word, count in vocab.most_common(10):
    print(f" {word}: {count}")
# 7. Advanced Transformer pipeline
print("\n=== 7. 高级 NLP 任务 ===")
try:
    # Zero-shot classification: an NLI model scores arbitrary candidate
    # labels without any task-specific fine-tuning.
    zero_shot_pipeline = pipeline(
        "zero-shot-classification",
        model="facebook/bart-large-mnli"
    )
    sequence = "Apple is discussing the possibility of acquiring startup for 1 billion dollars"
    candidate_labels = ["business", "sports", "technology", "politics"]
    result = zero_shot_pipeline(sequence, candidate_labels)
    print("零样本分类结果:")
    for label, score in zip(result['labels'], result['scores']):
        print(f" {label}: {score:.4f}")
except Exception as e:
    # Offline / model unavailable — degrade gracefully.
    print(f"高级流水线不可用: {str(e)}\n")
# 8. Simple per-text statistics rendered as a table.
print("\n=== 8. 文本统计 ===")
sample_texts = [
    "Natural language processing is fascinating.",
    "Machine learning enables artificial intelligence.",
    "Deep learning revolutionizes computer vision."
]
stats_data = []
for entry in sample_texts:
    word_list = entry.split()
    stats_data.append({
        'Text': entry[:40] + '...' if len(entry) > 40 else entry,  # truncate long rows
        'Words': len(word_list),
        'Characters': len(entry),
        'Avg Word Len': np.mean([len(w) for w in word_list])
    })
stats_df = pd.DataFrame(stats_data)
print(stats_df.to_string(index=False))
# 9. Visualization: four summary plots saved to one PNG.
print("\n=== 9. NLP 可视化 ===")
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Word frequency (horizontal bars, most frequent on top)
word_freq = vocab.most_common(15)
words, freqs = zip(*word_freq)
axes[0, 0].barh(range(len(words)), freqs, color='steelblue')
axes[0, 0].set_yticks(range(len(words)))
axes[0, 0].set_yticklabels(words)
axes[0, 0].set_xlabel('频率')
axes[0, 0].set_title('前 15 个最常出现的词')
axes[0, 0].invert_yaxis()
# Sentiment distribution pie chart.
# NOTE(review): this list is hardcoded sample data, not the classifier's
# predictions from section 2 — confirm whether that is intentional.
sentiments = ['Positive', 'Negative', 'Positive', 'Negative', 'Positive']
sentiment_counts = Counter(sentiments)
axes[0, 1].pie(sentiment_counts.values(), labels=sentiment_counts.keys(),
               autopct='%1.1f%%', colors=['green', 'red'])
axes[0, 1].set_title('情感分布')
# Document-similarity heatmap (cosine similarities from section 5)
im = axes[1, 0].imshow(similarity_matrix, cmap='YlOrRd', aspect='auto')
axes[1, 0].set_xticks(range(len(docs)))
axes[1, 0].set_yticks(range(len(docs)))
axes[1, 0].set_xticklabels([f'Doc{i}' for i in range(len(docs))])
axes[1, 0].set_yticklabels([f'Doc{i}' for i in range(len(docs))])
axes[1, 0].set_title('文档相似度热力图')
plt.colorbar(im, ax=axes[1, 0])
# Distribution of text lengths (in words) over the classification corpus
text_lengths = [len(t.split()) for t in texts]
axes[1, 1].hist(text_lengths, bins=5, color='coral', edgecolor='black')
axes[1, 1].set_xlabel('词数')
axes[1, 1].set_ylabel('频率')
axes[1, 1].set_title('文本长度分布')
axes[1, 1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
# NOTE(review): the figure is saved but never closed (plt.close) or shown.
plt.savefig('nlp_analysis.png', dpi=100, bbox_inches='tight')
print("\nNLP 可视化已保存为 'nlp_analysis.png'")
# 10. Summary of the whole run
print("\n=== NLP 总结 ===")
print(f"处理的文本数: {len(texts)}")
print(f"唯一词汇: {len(vocab)} 个词")
print(f"平均文本长度: {np.mean([len(t.split()) for t in texts]):.2f} 个词")
print(f"分类准确率: {accuracy_score(labels, predictions):.4f}")
print("\n自然语言处理设置完成!")
广告位招租
在这里展示您的产品或服务
触达数万 AI 开发者,精准高效
每周安装量
0
代码仓库
GitHub 星标数
126
首次出现
1970年1月1日
安全审计
This skill provides comprehensive tools for building NLP applications using modern transformers, BERT, GPT, and classical NLP techniques for text classification, named entity recognition, sentiment analysis, and more.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import re
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
AutoModelForTokenClassification, pipeline,
TextClassificationPipeline)
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import warnings
warnings.filterwarnings('ignore')
# Download every NLTK resource the script actually uses. The original only
# fetched 'punkt', but word_tokenize (newer NLTK also needs 'punkt_tab'),
# stopwords.words('english') and WordNetLemmatizer require their corpora
# too — without them preprocess_text raises LookupError at runtime.
for _resource, _path in [('punkt', 'tokenizers/punkt'),
                         ('punkt_tab', 'tokenizers/punkt_tab'),
                         ('stopwords', 'corpora/stopwords'),
                         ('wordnet', 'corpora/wordnet')]:
    try:
        nltk.data.find(_path)
    except LookupError:
        nltk.download(_resource, quiet=True)
print("=== 1. Text Preprocessing ===")
def preprocess_text(text, remove_stopwords=True, lemmatize=True):
    """Complete text-preprocessing pipeline.

    Lowercases the input, removes everything but ASCII letters and
    whitespace, tokenizes with NLTK, then optionally filters English
    stopwords and lemmatizes the remaining tokens.

    Returns a tuple ``(token_list, space_joined_string)``.
    """
    cleaned = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    tokens = word_tokenize(cleaned)
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        tokens = [tok for tok in tokens if tok not in stop_words]
    if lemmatize:
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
    return tokens, ' '.join(tokens)
# Demonstrate the preprocessing pipeline on one sentence.
sample_text = "The quick brown foxes are jumping over the lazy dogs! Amazing performance."
tokens, processed = preprocess_text(sample_text)
print(f"Original: {sample_text}")
print(f"Processed: {processed}")
print(f"Tokens: {tokens}\n")
# 2. Text Classification with sklearn
print("=== 2. Traditional Text Classification ===")
# Toy labelled sentiment data
texts = [
    "I love this product, it's amazing!",
    "This movie is fantastic and entertaining.",
    "Best purchase ever, highly recommended.",
    "Terrible quality, very disappointed.",
    "Worst experience, waste of money.",
    "Horrible service and poor quality.",
    "The food was delicious and fresh.",
    "Great atmosphere and friendly staff.",
    "Bad weather today, very gloomy.",
    "The book was boring and uninteresting."
]
labels = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]  # 1: positive, 0: negative
# TF-IDF features over unigrams and bigrams
tfidf = TfidfVectorizer(max_features=100, ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(texts)
# Train a multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_tfidf, labels)
# NOTE(review): the metrics below are computed on the training data itself
# (fit and predict share X_tfidf), so they are optimistic in-sample scores.
predictions = clf.predict(X_tfidf)
print(f"Accuracy: {accuracy_score(labels, predictions):.4f}")
print(f"Precision: {precision_score(labels, predictions):.4f}")
print(f"Recall: {recall_score(labels, predictions):.4f}")
print(f"F1: {f1_score(labels, predictions):.4f}\n")
# 3. Transformer-based text classification
print("=== 3. Transformer-based Classification ===")
try:
    # Pretrained sentiment pipeline (model is downloaded on first use).
    sentiment_pipeline = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english"
    )
    test_sentences = [
        "This is a wonderful movie!",
        "I absolutely hate this product.",
        "It's okay, nothing special.",
        "Amazing quality and fast delivery!"
    ]
    print("Sentiment Analysis Results:")
    for sentence in test_sentences:
        result = sentiment_pipeline(sentence)
        print(f" Text: {sentence}")
        print(f" Sentiment: {result[0]['label']}, Score: {result[0]['score']:.4f}\n")
except Exception as e:
    # Offline / model unavailable — degrade gracefully.
    print(f"Transformer model not available: {str(e)}\n")
# 4. Named Entity Recognition (NER)
print("=== 4. Named Entity Recognition ===")
try:
    # FIX: the original loaded "distilbert-base-uncased", a bare language
    # model with no token-classification head, so the NER pipeline got a
    # randomly initialized head and produced meaningless entities. Use a
    # checkpoint actually fine-tuned for NER (CoNLL-03) instead.
    ner_pipeline = pipeline(
        "ner",
        model="dslim/bert-base-NER",
        aggregation_strategy="simple"
    )
    text = "Apple Inc. was founded by Steve Jobs in Cupertino, California."
    entities = ner_pipeline(text)
    print(f"Text: {text}")
    print("Entities:")
    for entity in entities:
        print(f" {entity['word']}: {entity['entity_group']} (score: {entity['score']:.4f})")
except Exception as e:
    print(f"NER model not available: {str(e)}\n")
# 5. Word embeddings and document similarity
print("\n=== 5. Word Embeddings and Similarity ===")
from sklearn.metrics.pairwise import cosine_similarity
# Bag-of-words counts serve as simple document embeddings.
vectorizer = CountVectorizer(max_features=50)
docs = [
    "machine learning is great",
    "deep learning uses neural networks",
    "machine learning and deep learning"
]
embeddings = vectorizer.fit_transform(docs).toarray()
# Pairwise cosine similarity between the document vectors.
similarity_matrix = cosine_similarity(embeddings)
doc_names = [f"Doc{i}" for i in range(len(docs))]
print("Document Similarity Matrix:")
print(pd.DataFrame(similarity_matrix, columns=doc_names, index=doc_names).round(3))
# 6. Tokenization and vocabulary statistics
print("\n=== 6. Tokenization Analysis ===")
corpus = " ".join(texts)
tokens, _ = preprocess_text(corpus)
# Token frequency table over the classification corpus.
vocab = Counter(tokens)
print(f"Vocabulary size: {len(vocab)}")
print("Top 10 most common words:")
for word, count in vocab.most_common(10):
    print(f" {word}: {count}")
# 7. Advanced Transformer pipeline
print("\n=== 7. Advanced NLP Tasks ===")
try:
    # Zero-shot classification: an NLI model scores arbitrary candidate
    # labels without any task-specific fine-tuning.
    zero_shot_pipeline = pipeline(
        "zero-shot-classification",
        model="facebook/bart-large-mnli"
    )
    sequence = "Apple is discussing the possibility of acquiring startup for 1 billion dollars"
    candidate_labels = ["business", "sports", "technology", "politics"]
    result = zero_shot_pipeline(sequence, candidate_labels)
    print("Zero-shot Classification Results:")
    for label, score in zip(result['labels'], result['scores']):
        print(f" {label}: {score:.4f}")
except Exception as e:
    # Offline / model unavailable — degrade gracefully.
    print(f"Advanced pipeline not available: {str(e)}\n")
# 8. Simple per-text statistics rendered as a table.
print("\n=== 8. Text Statistics ===")
sample_texts = [
    "Natural language processing is fascinating.",
    "Machine learning enables artificial intelligence.",
    "Deep learning revolutionizes computer vision."
]
stats_data = []
for entry in sample_texts:
    word_list = entry.split()
    stats_data.append({
        'Text': entry[:40] + '...' if len(entry) > 40 else entry,  # truncate long rows
        'Words': len(word_list),
        'Characters': len(entry),
        'Avg Word Len': np.mean([len(w) for w in word_list])
    })
stats_df = pd.DataFrame(stats_data)
print(stats_df.to_string(index=False))
# 9. Visualization: four summary plots saved to one PNG.
print("\n=== 9. NLP Visualization ===")
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Word frequency (horizontal bars, most frequent on top)
word_freq = vocab.most_common(15)
words, freqs = zip(*word_freq)
axes[0, 0].barh(range(len(words)), freqs, color='steelblue')
axes[0, 0].set_yticks(range(len(words)))
axes[0, 0].set_yticklabels(words)
axes[0, 0].set_xlabel('Frequency')
axes[0, 0].set_title('Top 15 Most Frequent Words')
axes[0, 0].invert_yaxis()
# Sentiment distribution pie chart.
# NOTE(review): this list is hardcoded sample data, not the classifier's
# predictions from section 2 — confirm whether that is intentional.
sentiments = ['Positive', 'Negative', 'Positive', 'Negative', 'Positive']
sentiment_counts = Counter(sentiments)
axes[0, 1].pie(sentiment_counts.values(), labels=sentiment_counts.keys(),
               autopct='%1.1f%%', colors=['green', 'red'])
axes[0, 1].set_title('Sentiment Distribution')
# Document-similarity heatmap (cosine similarities from section 5)
im = axes[1, 0].imshow(similarity_matrix, cmap='YlOrRd', aspect='auto')
axes[1, 0].set_xticks(range(len(docs)))
axes[1, 0].set_yticks(range(len(docs)))
axes[1, 0].set_xticklabels([f'Doc{i}' for i in range(len(docs))])
axes[1, 0].set_yticklabels([f'Doc{i}' for i in range(len(docs))])
axes[1, 0].set_title('Document Similarity Heatmap')
plt.colorbar(im, ax=axes[1, 0])
# Distribution of text lengths (in words) over the classification corpus
text_lengths = [len(t.split()) for t in texts]
axes[1, 1].hist(text_lengths, bins=5, color='coral', edgecolor='black')
axes[1, 1].set_xlabel('Number of Words')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Text Length Distribution')
axes[1, 1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
# NOTE(review): the figure is saved but never closed (plt.close) or shown.
plt.savefig('nlp_analysis.png', dpi=100, bbox_inches='tight')
print("\nNLP visualization saved as 'nlp_analysis.png'")
# 10. Summary of the whole run
print("\n=== NLP Summary ===")
print(f"Texts processed: {len(texts)}")
print(f"Unique vocabulary: {len(vocab)} words")
print(f"Average text length: {np.mean([len(t.split()) for t in texts]):.2f} words")
print(f"Classification accuracy: {accuracy_score(labels, predictions):.4f}")
print("\nNatural language processing setup completed!")
Weekly Installs
0
Repository
GitHub Stars
126
First Seen
Jan 1, 1970
Security Audits
AI 代码实施计划编写技能 | 自动化开发任务分解与 TDD 流程规划工具
44,500 周安装