scikit-learn by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill scikit-learn

This skill provides comprehensive guidance for machine learning tasks using scikit-learn, the industry-standard Python library for classical machine learning. Use this skill for classification, regression, clustering, dimensionality reduction, preprocessing, model evaluation, and building production-ready ML pipelines.
# Install scikit-learn using uv
uv pip install scikit-learn
# Optional: install visualization dependencies
uv pip install matplotlib seaborn
# Commonly used together with
uv pip install pandas numpy
Use the scikit-learn skill when:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Preprocess
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Evaluate
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
# Define feature types
numeric_features = ['age', 'income']
categorical_features = ['gender', 'occupation']
# Create preprocessing pipelines
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine transformers
preprocessor = ColumnTransformer([
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Full pipeline
model = Pipeline([
('preprocessor', preprocessor),
('classifier', GradientBoostingClassifier(random_state=42))
])
# Fit and predict
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Comprehensive algorithms for classification and regression tasks.
Key algorithms:
When to use:
See: references/supervised_learning.md for detailed algorithm documentation, parameters, and usage examples.
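As a quick illustration of choosing among supervised algorithms, the sketch below compares a few classifiers with cross-validation. The synthetic dataset and the particular model choices are illustrative, not part of this skill's scripts:

```python
# Compare several supervised classifiers via 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "svm": make_pipeline(StandardScaler(), SVC()),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Scale-sensitive models (logistic regression, SVM) are wrapped in a pipeline with a scaler so each candidate is evaluated fairly.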
Discover patterns in unlabeled data through clustering and dimensionality reduction.
Clustering algorithms:
Dimensionality reduction:
When to use:
See: references/unsupervised_learning.md for detailed documentation.
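For intuition on when density-based clustering beats centroid-based clustering, here is a hedged sketch on the classic two-moons toy dataset (the parameters, e.g. `eps=0.3`, are illustrative):

```python
# DBSCAN infers the number of clusters from density; KMeans assumes
# roughly spherical clusters and a fixed cluster count.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

# DBSCAN labels points in low-density regions as noise (-1)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
km_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)

n_db_clusters = len(set(db_labels) - {-1})
print("DBSCAN found", n_db_clusters, "clusters")
```

On this data DBSCAN recovers the two interleaved half-moons, a shape KMeans cannot separate with straight-line (Voronoi) boundaries.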
Tools for robust model evaluation, cross-validation, and hyperparameter tuning.
Cross-validation strategies:
Hyperparameter tuning:
Metrics:
When to use:
See: references/model_evaluation.md for comprehensive metrics and tuning strategies.
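As one example of the evaluation tooling, `cross_validate` can score several metrics in a single cross-validation pass (the dataset and metric selection below are illustrative):

```python
# Score multiple metrics at once with cross_validate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=400, random_state=42)
cv_results = cross_validate(
    RandomForestClassifier(random_state=42),
    X, y, cv=5,
    scoring=["accuracy", "f1", "roc_auc"],  # one pass, several metrics
)
for metric in ("test_accuracy", "test_f1", "test_roc_auc"):
    print(f"{metric}: {cv_results[metric].mean():.3f}")
```

Each requested scorer appears in the results dict under a `test_<name>` key, with one entry per fold.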
Transform raw data into formats suitable for machine learning.
Scaling and normalization:
Encoding categorical variables:
Handling missing values:
Feature engineering:
When to use:
See: references/preprocessing.md for detailed preprocessing techniques.
Build reproducible, production-ready ML workflows.
Key components:
Benefits:
When to use:
See: references/pipelines_and_composition.md for comprehensive pipeline patterns.
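One practical benefit worth sketching: a fitted Pipeline serializes as a single object, so the preprocessing and the model always travel together. This example uses joblib (a scikit-learn dependency); the file path is illustrative:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=42)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Persist and reload the whole pipeline as one artifact
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(pipe, path)
loaded = joblib.load(path)
assert (loaded.predict(X) == pipe.predict(X)).all()
```

Because the scaler is inside the pipeline, the reloaded object applies exactly the same transformation at inference time with no extra bookkeeping.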
Run a complete classification workflow with preprocessing, model comparison, hyperparameter tuning, and evaluation:
python scripts/classification_pipeline.py
This script demonstrates:
Perform a clustering analysis with algorithm comparison and visualization:
python scripts/clustering_analysis.py
This script demonstrates:
This skill includes comprehensive reference files for deep dives into specific topics:
File: references/quick_reference.md
File: references/supervised_learning.md
File: references/unsupervised_learning.md
File: references/model_evaluation.md
File: references/preprocessing.md
File: references/pipelines_and_composition.md
Load and explore data
import pandas as pd
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']
Split data with stratification
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
Create preprocessing pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Handle numeric and categorical features separately
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])
Build complete pipeline
from sklearn.ensemble import RandomForestClassifier
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])
Tune hyperparameters
from sklearn.model_selection import GridSearchCV
param_grid = {
'classifier__n_estimators': [100, 200],
'classifier__max_depth': [10, 20, None]
}
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)
Evaluate on the test set
from sklearn.metrics import classification_report
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))
Preprocess data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Find the optimal number of clusters
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    scores.append(silhouette_score(X_scaled, labels))
optimal_k = range(2, 11)[np.argmax(scores)]
Apply clustering
model = KMeans(n_clusters=optimal_k, random_state=42)
labels = model.fit_predict(X_scaled)
Visualize with dimensionality reduction
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='viridis')
plt.show()
Pipelines prevent data leakage and ensure consistency:
# Good: preprocessing inside the pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression())
])
# Bad: preprocessing outside the pipeline (can leak information)
X_scaled = StandardScaler().fit_transform(X)
Never fit on test data:
# Good
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only
# Bad
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(np.vstack([X_train, X_test]))
Preserve class distribution:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
Algorithms requiring feature scaling:
Algorithms not requiring scaling:
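To see why scaling matters for distance-based methods, the hedged sketch below inflates one feature's scale and compares KNN with and without standardization (synthetic data; the exact scores vary by dataset):

```python
# Distance-based KNN is sensitive to feature scale; wrapping it in a
# pipeline with StandardScaler removes that sensitivity.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=5, random_state=42)
X[:, 0] *= 1000  # blow up one feature's scale so it dominates distances

knn_raw = cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean()
knn_scaled = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier()), X, y, cv=5
).mean()
print(f"KNN unscaled: {knn_raw:.3f}, scaled: {knn_scaled:.3f}")
```

Tree-based models split one feature at a time, so the same rescaling leaves them unaffected.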
Issue: the model did not converge. Solution: increase max_iter or scale the features.
model = LogisticRegression(max_iter=1000)
Issue: overfitting. Solution: add regularization, use cross-validation, or choose a simpler model.
# Add regularization
model = Ridge(alpha=1.0)
# Use cross-validation
scores = cross_val_score(model, X, y, cv=5)
Issue: dataset too large. Solution: use algorithms designed for large data.
# Use SGD for large datasets
from sklearn.linear_model import SGDClassifier
model = SGDClassifier()
# Or use MiniBatchKMeans for clustering
from sklearn.cluster import MiniBatchKMeans
model = MiniBatchKMeans(n_clusters=8, batch_size=100)
Weekly Installs: 272
Repository: https://github.com/davila7/claude-code-templates
GitHub Stars: 22.6K
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: opencode (227), gemini-cli (211), claude-code (208), codex (203), cursor (195), github-copilot (188)