ml-engineer by 404kidwiz/claude-supercode-skills
npx skills add https://github.com/404kidwiz/claude-supercode-skills --skill ml-engineer

Provides MLOps and production ML engineering expertise specializing in end-to-end ML pipelines, model deployment, and infrastructure automation. Bridges data science and production engineering with robust, scalable machine learning systems.
Need to serve predictions?
│
├─ Real-time (Low Latency)?
│ │
│ ├─ High Throughput? → **Kubernetes (KServe/Seldon)**
│ ├─ Low/Medium Traffic? → **Serverless (Lambda/Cloud Run)**
│ └─ Ultra-low latency (<10ms)? → **C++/Rust Inference Server (Triton)**
│
├─ Batch Processing?
│ │
│ ├─ Large Scale? → **Spark / Ray**
│ └─ Scheduled Jobs? → **Airflow / Prefect**
│
└─ Edge / Client-side?
│
├─ Mobile? → **TFLite / CoreML**
└─ Browser? → **TensorFlow.js / ONNX Runtime Web**
Training Environment?
│
├─ Single Node?
│ │
│ ├─ Interactive? → **JupyterHub / SageMaker Notebooks**
│ └─ Automated? → **Docker Container on VM**
│
└─ Distributed?
│
├─ Data Parallelism? → **Ray Train / PyTorch DDP**
└─ Pipeline orchestration? → **Kubeflow / Airflow / Vertex AI**
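Data parallelism (the Ray Train / PyTorch DDP branch above) keeps model replicas in sync by averaging gradients across workers. A toy sketch of that all-reduce step, with illustrative names and numbers — the real synchronization lives in `torch.distributed`:

```python
# Toy sketch of DDP-style gradient averaging; `allreduce_mean` and the
# gradient values are illustrative only.

def allreduce_mean(worker_grads: list[list[float]]) -> list[float]:
    """Average per-parameter gradients across workers (the core of data parallelism)."""
    n_workers = len(worker_grads)
    return [sum(g) / n_workers for g in zip(*worker_grads)]

# Each worker computes gradients on its own shard of the batch...
grads_w0 = [0.25, -0.5, 1.0]
grads_w1 = [0.75, -0.5, 0.0]
# ...then every worker applies the same averaged update, keeping replicas identical.
avg = allreduce_mean([grads_w0, grads_w1])
print(avg)  # [0.5, -0.5, 0.5]
```
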
| Need | Recommendation | Rationale |
|---|---|---|
| Simple / MVP | No feature store | Use SQL/Parquet files; a feature store's overhead is too high. |
| Team consistency | Feast | Open source; manages online/offline consistency. |
| Enterprise / managed | Tecton / Hopsworks | Full governance, lineage, managed SLAs. |
| Cloud native | Vertex / SageMaker FS | Tight integration if already in that cloud ecosystem. |
Red Flags → Escalate to oracle:
Goal: Automate model training, validation, and registration using MLflow.

Steps:

Setup Tracking

```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score

mlflow.set_tracking_uri("http://localhost:5000")  # assumes a local MLflow tracking server
mlflow.set_experiment("churn-prediction-prod")
```
Training Script (train.py)

```python
# X_train, X_test, y_train, y_test come from your data-loading step (not shown).
def train(max_depth, n_estimators):
    with mlflow.start_run():
        # Log params
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("n_estimators", n_estimators)

        # Train
        model = RandomForestClassifier(
            max_depth=max_depth,
            n_estimators=n_estimators,
            random_state=42,
        )
        model.fit(X_train, y_train)

        # Evaluate
        preds = model.predict(X_test)
        acc = accuracy_score(y_test, preds)
        prec = precision_score(y_test, preds)

        # Log metrics
        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("precision", prec)

        # Log model artifact with signature
        from mlflow.models.signature import infer_signature
        signature = infer_signature(X_train, preds)
        mlflow.sklearn.log_model(
            model,
            "model",
            signature=signature,
            registered_model_name="churn-model",
        )
        print(f"Run ID: {mlflow.active_run().info.run_id}")

if __name__ == "__main__":
    train(max_depth=5, n_estimators=100)
```
Pipeline Orchestration (Bash/Airflow)

```bash
#!/bin/bash
# Run training
python train.py
# Check whether the model passed the threshold (e.g. via the MLflow API);
# if yes, transition it to Staging.
```
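The threshold check the script leaves as a comment can be sketched as plain gating logic. `should_promote`, `run_metrics`, and `gates` are hypothetical names; the commented-out lines show where a real MLflow registry transition would go:

```python
# Promotion gate: promote only if every tracked metric meets its minimum.
def should_promote(metrics: dict, thresholds: dict) -> bool:
    return all(metrics.get(name, 0.0) >= minimum
               for name, minimum in thresholds.items())

run_metrics = {"accuracy": 0.91, "precision": 0.88}  # e.g. read back from the MLflow run
gates = {"accuracy": 0.90, "precision": 0.85}

if should_promote(run_metrics, gates):
    # from mlflow.tracking import MlflowClient
    # MlflowClient().transition_model_version_stage(
    #     name="churn-model", version=version, stage="Staging")
    print("Promoting to Staging")
```
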
Goal: Detect whether the production data distribution has drifted from the training data.

Steps:

Baseline Generation (During Training)

```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Calculate a baseline drift profile on the training data
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=test_df)
report.save_json("baseline_drift.json")
```
Production Monitoring Job

```python
# Scheduled daily job
def check_drift():
    # Load production logs (last 24h) and the training reference set
    current_data = load_production_logs()
    reference_data = load_training_data()

    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference_data, current_data=current_data)

    result = report.as_dict()
    dataset_drift = result['metrics'][0]['result']['dataset_drift']
    if dataset_drift:
        trigger_alert("Data Drift Detected!")
        trigger_retraining()
```
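Under the hood, per-feature drift scores like the ones `DataDriftPreset` aggregates are often computed with statistics such as the Population Stability Index. A self-contained sketch — the 0.1/0.25 thresholds are a common rule of thumb, not Evidently's defaults:

```python
import math

# Toy Population Stability Index (PSI) over equal-width bins.
# Rule of thumb (an assumption; tune per dataset): <0.1 stable, >0.25 drifted.
def psi(reference: list[float], current: list[float], bins: int = 10) -> float:
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference
    def frac(data):
        counts = [0] * bins
        for x in data:
            i = min(int((x - lo) / width), bins - 1)
            counts[max(i, 0)] += 1
        # Smooth empty buckets so the log term stays finite.
        return [(c + 1e-6) / (len(data) + bins * 1e-6) for c in counts]
    ref, cur = frac(reference), frac(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

ref = [i / 100 for i in range(100)]              # roughly uniform on [0, 1)
same = [i / 100 for i in range(100)]
shifted = [0.9 + i / 1000 for i in range(100)]   # mass concentrated near 0.9
print(psi(ref, same))     # ~0.0 -> no drift
print(psi(ref, shifted))  # large -> drift, would trigger the alert above
```
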
Goal: Build a production retrieval pipeline using Pinecone/Weaviate and LangChain.

Steps:

Ingestion (Chunking & Embedding)

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# Chunking
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(raw_documents)

# Embedding & indexing
embeddings = OpenAIEmbeddings()
vectorstore = PineconeVectorStore.from_documents(
    docs,
    embeddings,
    index_name="knowledge-base",
)
```
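The `chunk_overlap` parameter exists so that text straddling a chunk boundary survives intact in at least one chunk. A toy fixed-size splitter illustrating the sliding window (the real `RecursiveCharacterTextSplitter` is smarter — it prefers paragraph and sentence boundaries):

```python
# Minimal sliding-window splitter; `split_text` is illustrative, not LangChain's.
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
# Each chunk repeats the tail of its predecessor, so a span that crosses a
# boundary is still retrievable from one chunk instead of being cut in half.
```
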
Retrieval & Generation

```python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
)
response = qa_chain.invoke("How do I reset my password?")
print(response['result'])
```
Optimization (Hybrid Search)
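One common way to implement hybrid search is to run dense (vector) and sparse (BM25 keyword) retrieval separately, then merge the ranked lists with Reciprocal Rank Fusion. A sketch, with `k=60` as in the original RRF paper and illustrative document ids:

```python
# Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document,
# so documents ranked well by *both* retrievers rise to the top.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # vector-similarity order
sparse = ["doc_b", "doc_d", "doc_a"]  # keyword (BM25) order
print(rrf([dense, sparse]))  # doc_b wins: high in both lists
```
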
What it looks like:
Why it fails:
Correct approach:

What it looks like: A data scientist hands a `.pkl` file over to an engineer.
Why it fails:
Correct approach:

What it looks like: The API returns `200 OK`, but the prediction is garbage because the input data was corrupted (e.g., all Nulls) and the model predicts `0` for everything.
Why it fails:
Correct approach:

Reliability: `/health` endpoint implemented (liveness/readiness).
Performance:
Reproducibility: Pinned dependencies (`requirements.txt` / `conda.yaml`).
Monitoring:
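The `/health` checklist item can be as small as a WSGI app. A sketch — real services usually expose this through FastAPI/Flask, and `MODEL` here is a stand-in for an actual model handle:

```python
# Minimal WSGI health/readiness endpoints.
# Liveness (/health): the process is up and responding.
# Readiness (/ready): the model is loaded and traffic is safe to route here.
MODEL = {"loaded": True}  # stand-in for a real model handle

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path == "/health":
        status, body = "200 OK", b"alive"
    elif path == "/ready":
        if MODEL["loaded"]:
            status, body = "200 OK", b"ready"
        else:
            status, body = "503 Service Unavailable", b"loading"
    else:
        status, body = "404 Not Found", b""
    start_response(status, [("Content-Type", "text/plain")])
    return [body]
```

Kubernetes liveness probes would hit `/health` (restart on failure) and readiness probes `/ready` (remove from the load balancer while loading).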
Weekly Installs: 93
Repository: 404kidwiz/claude-supercode-skills
GitHub Stars: 45
First Seen: Jan 24, 2026
Security Audits: Gen Agent Trust Hub: Warn · Socket: Pass · Snyk: Pass
Installed on:
- opencode: 78
- gemini-cli: 71
- codex: 69
- claude-code: 67
- cursor: 62
- github-copilot: 58