scikit-learn by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill scikit-learn

This skill provides comprehensive guidance for machine learning tasks using scikit-learn, the industry-standard Python library for classical machine learning. Use this skill for classification, regression, clustering, dimensionality reduction, preprocessing, model evaluation, and building production-ready ML pipelines.
# Install scikit-learn using uv
uv pip install scikit-learn
# Optional: install visualization dependencies
uv pip install matplotlib seaborn
# Commonly used together with
uv pip install pandas numpy
Use the scikit-learn skill when:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Preprocess
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Evaluate
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
# Define feature types
numeric_features = ['age', 'income']
categorical_features = ['gender', 'occupation']
# Create preprocessing pipelines
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine transformers
preprocessor = ColumnTransformer([
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Full pipeline
model = Pipeline([
('preprocessor', preprocessor),
('classifier', GradientBoostingClassifier(random_state=42))
])
# Fit and predict
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Comprehensive algorithms for classification and regression tasks.
Key algorithms:
When to use:
See: references/supervised_learning.md for detailed algorithm documentation, parameters, and usage examples.
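As a quick illustration of choosing among supervised algorithms, the sketch below compares a few classifiers with cross-validation. The synthetic dataset and the particular model choices are illustrative, not part of this skill's scripts:

```python
# Compare several supervised classifiers via 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "svm": make_pipeline(StandardScaler(), SVC()),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Scale-sensitive models (logistic regression, SVM) are wrapped in a pipeline with a scaler so each candidate is evaluated fairly.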
Discover patterns in unlabeled data through clustering and dimensionality reduction.
Clustering algorithms:
Dimensionality reduction:
When to use:
See: references/unsupervised_learning.md for detailed documentation.
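For intuition on when density-based clustering beats centroid-based clustering, here is a hedged sketch on the classic two-moons toy dataset (the parameters, e.g. `eps=0.3`, are illustrative):

```python
# DBSCAN infers the number of clusters from density; KMeans assumes
# roughly spherical clusters and a fixed cluster count.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

# DBSCAN labels points in low-density regions as noise (-1)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
km_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)

n_db_clusters = len(set(db_labels) - {-1})
print("DBSCAN found", n_db_clusters, "clusters")
```

On this data DBSCAN recovers the two interleaved half-moons, a shape KMeans cannot separate with straight-line (Voronoi) boundaries.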
Tools for robust model evaluation, cross-validation, and hyperparameter tuning.
Cross-validation strategies:
Hyperparameter tuning:
Metrics:
When to use:
See: references/model_evaluation.md for comprehensive metrics and tuning strategies.
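As one example of the evaluation tooling, `cross_validate` can score several metrics in a single cross-validation pass (the dataset and metric selection below are illustrative):

```python
# Score multiple metrics at once with cross_validate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=400, random_state=42)
cv_results = cross_validate(
    RandomForestClassifier(random_state=42),
    X, y, cv=5,
    scoring=["accuracy", "f1", "roc_auc"],  # one pass, several metrics
)
for metric in ("test_accuracy", "test_f1", "test_roc_auc"):
    print(f"{metric}: {cv_results[metric].mean():.3f}")
```

Each requested scorer appears in the results dict under a `test_<name>` key, with one entry per fold.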
Transform raw data into formats suitable for machine learning.
Scaling and normalization:
Encoding categorical variables:
Handling missing values:
Feature engineering:
When to use:
See: references/preprocessing.md for detailed preprocessing techniques.
Build reproducible, production-ready ML workflows.
Key components:
Benefits:
When to use:
See: references/pipelines_and_composition.md for comprehensive pipeline patterns.
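One practical benefit worth sketching: a fitted Pipeline serializes as a single object, so the preprocessing and the model always travel together. This example uses joblib (a scikit-learn dependency); the file path is illustrative:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=42)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Persist and reload the whole pipeline as one artifact
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(pipe, path)
loaded = joblib.load(path)
assert (loaded.predict(X) == pipe.predict(X)).all()
```

Because the scaler is inside the pipeline, the reloaded object applies exactly the same transformation at inference time with no extra bookkeeping.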
Run a complete classification workflow with preprocessing, model comparison, hyperparameter tuning, and evaluation:
python scripts/classification_pipeline.py
This script demonstrates:
Perform a clustering analysis with algorithm comparison and visualization:
python scripts/clustering_analysis.py
This script demonstrates:
This skill includes comprehensive reference files for deep dives into specific topics:
File: references/quick_reference.md
File: references/supervised_learning.md
File: references/unsupervised_learning.md
File: references/model_evaluation.md
File: references/preprocessing.md
File: references/pipelines_and_composition.md
Load and explore data
import pandas as pd
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']
Split data with stratification
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
Create preprocessing pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Handle numeric and categorical features separately
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])
Build complete pipeline
from sklearn.ensemble import RandomForestClassifier
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])
Tune hyperparameters
from sklearn.model_selection import GridSearchCV
param_grid = {
'classifier__n_estimators': [100, 200],
'classifier__max_depth': [10, 20, None]
}
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)
Evaluate on the test set
from sklearn.metrics import classification_report
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))
Preprocess data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Find the optimal number of clusters
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    scores.append(silhouette_score(X_scaled, labels))
optimal_k = range(2, 11)[np.argmax(scores)]
Apply clustering
model = KMeans(n_clusters=optimal_k, random_state=42)
labels = model.fit_predict(X_scaled)
Visualize with dimensionality reduction
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='viridis')
plt.show()
Pipelines prevent data leakage and ensure consistency:
# Good: preprocessing inside the pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression())
])
# Bad: preprocessing outside the pipeline (can leak information)
X_scaled = StandardScaler().fit_transform(X)
Never fit on test data:
# Good
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only
# Bad
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(np.vstack([X_train, X_test]))
Preserve class distribution:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
Algorithms requiring feature scaling:
Algorithms not requiring scaling:
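To see why scaling matters for distance-based methods, the hedged sketch below inflates one feature's scale and compares KNN with and without standardization (synthetic data; the exact scores vary by dataset):

```python
# Distance-based KNN is sensitive to feature scale; wrapping it in a
# pipeline with StandardScaler removes that sensitivity.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=5, random_state=42)
X[:, 0] *= 1000  # blow up one feature's scale so it dominates distances

knn_raw = cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean()
knn_scaled = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier()), X, y, cv=5
).mean()
print(f"KNN unscaled: {knn_raw:.3f}, scaled: {knn_scaled:.3f}")
```

Tree-based models split one feature at a time, so the same rescaling leaves them unaffected.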
Issue: the model did not converge. Solution: increase max_iter or scale the features.
model = LogisticRegression(max_iter=1000)
Issue: overfitting. Solution: add regularization, use cross-validation, or choose a simpler model.
# Add regularization
model = Ridge(alpha=1.0)
# Use cross-validation
scores = cross_val_score(model, X, y, cv=5)
Issue: dataset too large. Solution: use algorithms designed for large data.
# Use SGD for large datasets
from sklearn.linear_model import SGDClassifier
model = SGDClassifier()
# Or use MiniBatchKMeans for clustering
from sklearn.cluster import MiniBatchKMeans
model = MiniBatchKMeans(n_clusters=8, batch_size=100)
Weekly Installs: 272
Repository: https://github.com/davila7/claude-code-templates
GitHub Stars: 22.6K
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: opencode (227), gemini-cli (211), claude-code (208), codex (203), cursor (195), github-copilot (188)