ml-model-training by secondsky/claude-skills
```shell
npx skills add https://github.com/secondsky/claude-skills --skill ml-model-training
```
Train machine learning models with proper data handling and evaluation.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load and clean data
df = pd.read_csv('data.csv')
df = df.dropna()

# Encode categorical variables
le = LabelEncoder()
df['category'] = le.fit_transform(df['category'])

# Split data (70/15/15); fix the seed so splits are reproducible
X = df.drop('target', axis=1)
y = df['target']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Scale features (fit on train only; see the data-leakage pitfall below)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
print(classification_report(y_val, y_pred))
```
```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.layers(x)

# Convert the scaled NumPy arrays to tensors before training
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)

model = Model(X_train.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()

for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    output = model(X_train_tensor)
    loss = criterion(output, y_train_tensor)
    loss.backward()
    optimizer.step()
```
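The loop above always runs 100 epochs. The decision rule behind early stopping is framework-agnostic, so it can be sketched without any ML library (the `patience` value is illustrative, not tuned):

```python
def should_stop(val_losses, patience=10, min_delta=0.0):
    """Return True once validation loss has not improved
    by more than min_delta for `patience` consecutive epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    return all(loss > best_before - min_delta for loss in recent)

# A loss history that improves, then plateaus
history = [0.9, 0.7, 0.5, 0.45, 0.46, 0.47, 0.46]
print(should_stop(history, patience=3))  # → True
```

In a real training loop you would call this once per epoch on the accumulated validation losses and break out of the loop when it returns True, restoring the best checkpoint.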
| Task | Metrics |
|---|---|
| Classification | Accuracy, Precision, Recall, F1, AUC-ROC |
| Regression | MSE, RMSE, MAE, R² |
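All of the classification metrics in the table are available in `sklearn.metrics`; a minimal sketch on toy labels (note that AUC-ROC takes probability scores, not hard class predictions):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]               # hard class predictions
y_score = [0.1, 0.2, 0.6, 0.9, 0.8, 0.4]  # predicted probabilities for class 1

print(accuracy_score(y_true, y_pred))   # fraction of correct predictions
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_score))   # ranking quality of the scores
```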
PyTorch: See references/pytorch-training.md for complete training loops with early stopping, learning rate scheduling, and checkpointing.
TensorFlow/Keras: See references/tensorflow-keras.md for callback usage, custom training loops, and mobile deployment with TFLite.
Problem: Scaling or transforming data before splitting leads to test set information leaking into training.
Solution: Always split data first, then fit transformers only on training data:

```python
# ✅ Correct: fit on train, transform train/val/test
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)    # transform only
X_test = scaler.transform(X_test)  # transform only

# ❌ Wrong: fitting on all data
X_all = scaler.fit_transform(X)  # leaks test info!
```
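One way to make this mistake hard to repeat is scikit-learn's Pipeline, which refits the scaler inside each cross-validation fold automatically; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# The scaler is fit only on each fold's training portion, so no
# validation-fold statistics leak into preprocessing.
pipe = make_pipeline(StandardScaler(),
                     RandomForestClassifier(n_estimators=50, random_state=42))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```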
Problem: Training on imbalanced datasets (e.g., 95% class A, 5% class B) leads to models that predict only the majority class.
Solution: Use class weights or resampling:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Compute explicit class weights (e.g., for passing to a loss function)
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)

# Or let scikit-learn apply balanced weights directly
model = RandomForestClassifier(class_weight='balanced')

# Or oversample the minority class with SMOTE (requires imbalanced-learn)
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
```
Problem: Complex models memorize training data and perform poorly on validation/test sets.
Solution: Add regularization techniques:

```python
# Dropout in PyTorch
nn.Dropout(0.3)

# Complexity limits in scikit-learn (random forests have no L2 penalty;
# constrain tree depth and split size instead)
RandomForestClassifier(max_depth=10, min_samples_split=20)

# Early stopping in Keras
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[early_stop])
```
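True L2 regularization does exist elsewhere in both stacks: in scikit-learn's linear models via the `C` parameter, and in PyTorch via the optimizer's `weight_decay` argument. A sketch showing the effect of `C` (penalty strengths are illustrative, not tuned):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# C is the inverse of the L2 penalty strength: smaller C means stronger
# regularization, which shrinks the learned coefficients toward zero.
# (The PyTorch analogue is torch.optim.Adam(..., weight_decay=...).)
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)
print(np.linalg.norm(strong.coef_) < np.linalg.norm(weak.coef_))  # → True
```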
Problem: Results are not reproducible across runs, making debugging and comparison impossible.
Solution: Set all random seeds:

```python
import random
import numpy as np
import torch

random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)
```
Problem: Optimizing hyperparameters on the test set leads to overfitting to test data.
Solution: Tune with cross-validation on the training set; use the test set only for final evaluation:

```python
from sklearn.model_selection import GridSearchCV

# ✅ Correct: cross-validate on the training set, evaluate once on test
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 15]}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_

# Final evaluation on the held-out test set
final_score = best_model.score(X_test, y_test)
```
Load reference files when you need:

- references/pytorch-training.md for complete training loops with early stopping, learning rate scheduling, and checkpointing
- references/tensorflow-keras.md for callback usage, custom training loops, and mobile deployment with TFLite

Weekly Installs: 81
GitHub Stars: 91
First Seen: Feb 6, 2026
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Pass
Installed on: gemini-cli (70), claude-code (69), opencode (68), cursor (68), github-copilot (67), codex (67)
Related skills:

- Superpowers skill usage guide: skill invocation priority and workflows for AI assistants (50,500 weekly installs)
- Mock Generator: auto-generate test mocks, stubs, and fixtures for Jest/Vitest/Pytest and other frameworks (2 weekly installs)
- Express.js API generator: TypeScript REST API scaffolding for structured backends (2 weekly installs)
- Dependency vulnerability scanner: one-click security audits of npm/pip/Cargo dependencies with CVE fixes (2 weekly installs)
- Codex config writing skill: a GitHub tool for more efficient AI-assisted coding and documentation (2 weekly installs)
- OpenAI docs configuration tool: streamline Codex integration and API call setup (2 weekly installs)
- iOS SpriteKit game development framework configuration tool: smoother game dev workflows (2 weekly installs)