ml-model-training by secondsky/claude-skills
```shell
npx skills add https://github.com/secondsky/claude-skills --skill ml-model-training
```
Train machine learning models with proper data handling and evaluation.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load and clean data
df = pd.read_csv('data.csv')
df = df.dropna()

# Encode categorical variables
le = LabelEncoder()
df['category'] = le.fit_transform(df['category'])

# Split data (70/15/15); fix the seed so splits are reproducible
X = df.drop('target', axis=1)
y = df['target']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Scale features (fit on train only; see the data-leakage pitfall below)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
print(classification_report(y_val, y_pred))
```
```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.layers(x)

# Convert the scaled NumPy arrays to tensors before training
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)

model = Model(X_train.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()

for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    output = model(X_train_tensor)
    loss = criterion(output, y_train_tensor)
    loss.backward()
    optimizer.step()
```
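The loop above always runs 100 epochs. The decision rule behind early stopping is framework-agnostic, so it can be sketched without any ML library (the `patience` value is illustrative, not tuned):

```python
def should_stop(val_losses, patience=10, min_delta=0.0):
    """Return True once validation loss has not improved
    by more than min_delta for `patience` consecutive epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    return all(loss > best_before - min_delta for loss in recent)

# A loss history that improves, then plateaus
history = [0.9, 0.7, 0.5, 0.45, 0.46, 0.47, 0.46]
print(should_stop(history, patience=3))  # → True
```

In a real training loop you would call this once per epoch on the accumulated validation losses and break out of the loop when it returns True, restoring the best checkpoint.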
| Task | Metrics |
|---|---|
| Classification | Accuracy, Precision, Recall, F1, AUC-ROC |
| Regression | MSE, RMSE, MAE, R² |
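All of the classification metrics in the table are available in `sklearn.metrics`; a minimal sketch on toy labels (note that AUC-ROC takes probability scores, not hard class predictions):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]               # hard class predictions
y_score = [0.1, 0.2, 0.6, 0.9, 0.8, 0.4]  # predicted probabilities for class 1

print(accuracy_score(y_true, y_pred))   # fraction of correct predictions
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_score))   # ranking quality of the scores
```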
PyTorch: See references/pytorch-training.md for complete training loops with early stopping, learning rate scheduling, and checkpointing.
TensorFlow/Keras: See references/tensorflow-keras.md for callback usage, custom training loops, and mobile deployment with TFLite.
Problem: Scaling or transforming data before splitting leads to test set information leaking into training.
Solution: Always split data first, then fit transformers only on training data:

```python
# ✅ Correct: fit on train, transform train/val/test
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)    # transform only
X_test = scaler.transform(X_test)  # transform only

# ❌ Wrong: fitting on all data
X_all = scaler.fit_transform(X)  # leaks test info!
```
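One way to make this mistake hard to repeat is scikit-learn's Pipeline, which refits the scaler inside each cross-validation fold automatically; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# The scaler is fit only on each fold's training portion, so no
# validation-fold statistics leak into preprocessing.
pipe = make_pipeline(StandardScaler(),
                     RandomForestClassifier(n_estimators=50, random_state=42))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```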
Problem: Training on imbalanced datasets (e.g., 95% class A, 5% class B) leads to models that predict only the majority class.
Solution: Use class weights or resampling:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Compute explicit class weights (e.g., for passing to a loss function)
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)

# Or let scikit-learn apply balanced weights directly
model = RandomForestClassifier(class_weight='balanced')

# Or oversample the minority class with SMOTE (requires imbalanced-learn)
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
```
Problem: Complex models memorize training data and perform poorly on validation/test sets.
Solution: Add regularization techniques:

```python
# Dropout in PyTorch
nn.Dropout(0.3)

# Complexity limits in scikit-learn (random forests have no L2 penalty;
# constrain tree depth and split size instead)
RandomForestClassifier(max_depth=10, min_samples_split=20)

# Early stopping in Keras
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[early_stop])
```
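True L2 regularization does exist elsewhere in both stacks: in scikit-learn's linear models via the `C` parameter, and in PyTorch via the optimizer's `weight_decay` argument. A sketch showing the effect of `C` (penalty strengths are illustrative, not tuned):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# C is the inverse of the L2 penalty strength: smaller C means stronger
# regularization, which shrinks the learned coefficients toward zero.
# (The PyTorch analogue is torch.optim.Adam(..., weight_decay=...).)
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)
print(np.linalg.norm(strong.coef_) < np.linalg.norm(weak.coef_))  # → True
```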
Problem: Results are not reproducible across runs, making debugging and comparison impossible.
Solution: Set all random seeds:

```python
import random
import numpy as np
import torch

random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)
```
Problem: Optimizing hyperparameters on the test set leads to overfitting to test data.
Solution: Tune with cross-validation on the training set; use the test set only for final evaluation:

```python
from sklearn.model_selection import GridSearchCV

# ✅ Correct: cross-validate on the training set, evaluate once on test
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 15]}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_

# Final evaluation on the held-out test set
final_score = best_model.score(X_test, y_test)
```
Load reference files when you need:

- references/pytorch-training.md for complete training loops with early stopping, learning rate scheduling, and checkpointing
- references/tensorflow-keras.md for callback usage, custom training loops, and mobile deployment with TFLite

Weekly Installs: 81
GitHub Stars: 91
First Seen: Feb 6, 2026
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Pass
Installed on: gemini-cli (70), claude-code (69), opencode (68), cursor (68), github-copilot (67), codex (67)
Related skills:

- Superpowers skill usage guide: skill invocation priority and workflows for AI assistants (50,500 weekly installs)
- Mock Generator: auto-generate test mocks, stubs, and fixtures for Jest/Vitest/Pytest and other frameworks (2 weekly installs)
- Express.js API generator: TypeScript REST API scaffolding for structured backends (2 weekly installs)
- Dependency vulnerability scanner: one-click security audits of npm/pip/Cargo dependencies with CVE fixes (2 weekly installs)
- Codex config writing skill: a GitHub tool for more efficient AI-assisted coding and documentation (2 weekly installs)
- OpenAI docs configuration tool: streamline Codex integration and API call setup (2 weekly installs)
- iOS SpriteKit game development framework configuration tool: smoother game dev workflows (2 weekly installs)