ml-pipeline by jeffallan/claude-skills
npx skills add https://github.com/jeffallan/claude-skills --skill ml-pipeline
Senior ML pipeline engineer specializing in production-grade machine learning infrastructure, orchestration systems, and automated training workflows.
Load detailed guidance based on context:
| Topic | Reference | Load When |
|---|---|---|
| Feature Engineering | references/feature-engineering.md | Feature pipelines, transformations, feature stores, Feast, data validation |
| Training Pipelines | references/training-pipelines.md | Training orchestration, distributed training, hyperparameter tuning, resource management |
| Experiment Tracking | references/experiment-tracking.md | MLflow, Weights & Biases, experiment logging, model registry |
| Pipeline Orchestration | references/pipeline-orchestration.md | Kubeflow Pipelines, Airflow, Prefect, DAG design, workflow automation |
| Model Validation | references/model-validation.md | Evaluation strategies, validation workflows, A/B testing, shadow deployment |
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import numpy as np

# Pin random state for reproducibility
SEED = 42
np.random.seed(SEED)

# Synthetic data so the snippet is self-contained
X, y = make_classification(n_samples=1_000, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED)

mlflow.set_experiment("my-classifier-experiment")

with mlflow.start_run():
    # Log all hyperparameters — never hardcode silently
    params = {"n_estimators": 100, "max_depth": 5, "random_state": SEED}
    mlflow.log_params(params)

    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)

    # Log metrics
    mlflow.log_metric("accuracy", accuracy_score(y_test, preds))
    mlflow.log_metric("f1", f1_score(y_test, preds, average="weighted"))

    # Log and register the model artifact
    mlflow.sklearn.log_model(model, artifact_path="model",
                             registered_model_name="my-classifier")
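The `average="weighted"` argument above averages per-class F1 scores weighted by each class's support, which matters when labels are imbalanced. A pure-Python sketch of that computation (illustrative only, not sklearn's implementation; classes are taken from `y_true` for brevity):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 weighted by class support (sketch of average='weighted')."""
    support = Counter(y_true)
    total = 0.0
    for c in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += f1 * support[c] / len(y_true)
    return total
```

A class that dominates the dataset therefore dominates the reported score, which is usually what you want for a single headline metric.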
from kfp.v2 import dsl
from kfp.v2.dsl import component, Input, Output, Dataset, Model, Metrics

@component(base_image="python:3.10",
           packages_to_install=["scikit-learn", "pandas"])
def train_model(
    train_data: Input[Dataset],
    model_output: Output[Model],
    metrics_output: Output[Metrics],
    n_estimators: int = 100,
    max_depth: int = 5,
):
    import pickle

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv(train_data.path)
    X, y = df.drop("label", axis=1), df["label"]
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth, random_state=42)
    model.fit(X, y)
    with open(model_output.path, "wb") as f:
        pickle.dump(model, f)
    metrics_output.log_metric("train_samples", len(df))

@dsl.pipeline(name="training-pipeline")
def training_pipeline(data_path: str, n_estimators: int = 100):
    # The component requires a Dataset input; wrap the raw URI with an importer
    raw_data = dsl.importer(artifact_uri=data_path,
                            artifact_class=Dataset, reimport=False)
    train_step = train_model(train_data=raw_data.output,
                             n_estimators=n_estimators)
    # Chain additional steps (validate, register, deploy) here
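The pipeline function above only declares steps; orchestrators like Kubeflow and Airflow derive the execution order from the data dependencies between them. The idea can be sketched with the standard library's topological sorter (step names here are hypothetical, not from the pipeline above):

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each step maps to the set of steps it depends on
dag = {
    "validate_data": {"ingest"},
    "train_model": {"validate_data"},
    "evaluate": {"train_model"},
    "register_model": {"evaluate"},
}

# static_order() yields a dependency-respecting execution order;
# steps with no mutual dependency could run in parallel
order = list(TopologicalSorter(dag).static_order())
# → ['ingest', 'validate_data', 'train_model', 'evaluate', 'register_model']
```

Real orchestrators add retries, caching, and resource scheduling on top, but the ordering contract is the same.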
import great_expectations as ge

def validate_training_data(df):
    """Run schema and distribution checks. Raise on failure — never skip."""
    gdf = ge.from_pandas(df)
    checks = [
        gdf.expect_column_values_to_not_be_null("label"),
        gdf.expect_column_values_to_be_between("feature_1", 0, 1),
    ]
    failed = [c for c in checks if not c["success"]]
    if failed:
        raise ValueError(f"Data validation failed: {failed}")
    return df  # safe to proceed to training
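When pulling in Great Expectations is too heavy for a small pipeline, the same fail-fast checks can be approximated in plain Python. This is a hypothetical helper operating on a column-oriented dict rather than a DataFrame, not a GE replacement:

```python
def basic_validation(table, label_col="label", bounded=None):
    """Fail-fast checks mirroring the expectations above.

    `table` is a column-oriented dict, e.g. {"label": [...], "feature_1": [...]}.
    `bounded` maps column name -> (low, high) inclusive range.
    """
    errors = []
    if any(v is None for v in table[label_col]):
        errors.append(f"{label_col} contains nulls")
    for col, (lo, hi) in (bounded or {}).items():
        if any(not (lo <= v <= hi) for v in table[col]):
            errors.append(f"{col} outside [{lo}, {hi}]")
    if errors:
        raise ValueError("Data validation failed: " + "; ".join(errors))
    return table
```

As with the GE version, the point is to raise before training starts, never to log a warning and continue.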
Always:
Never:
When implementing a pipeline, provide:
Technologies: MLflow, Kubeflow Pipelines, Apache Airflow, Prefect, Feast, Weights & Biases, Neptune, DVC, Great Expectations, Ray, Horovod, Kubernetes, Docker, S3/GCS/Azure Blob, model registry patterns, feature store architecture, distributed training, hyperparameter optimization
Weekly Installs: 741
Repository: jeffallan/claude-skills
GitHub Stars: 7.3K
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub: Warn · Socket: Pass · Snyk: Pass
Installed on:
- opencode: 612
- claude-code: 597
- gemini-cli: 596
- codex: 581
- cursor: 551
- github-copilot: 549