ml-engineer by 404kidwiz/claude-supercode-skills
npx skills add https://github.com/404kidwiz/claude-supercode-skills --skill ml-engineer

Provides MLOps and production ML engineering expertise specializing in end-to-end ML pipelines, model deployment, and infrastructure automation. Bridges data science and production engineering with robust, scalable machine learning systems.
Need to serve predictions?
│
├─ Real-time (Low Latency)?
│ │
│ ├─ High Throughput? → **Kubernetes (KServe/Seldon)**
│ ├─ Low/Medium Traffic? → **Serverless (Lambda/Cloud Run)**
│ └─ Ultra-low latency (<10ms)? → **C++/Rust Inference Server (Triton)**
│
├─ Batch Processing?
│ │
│ ├─ Large Scale? → **Spark / Ray**
│ └─ Scheduled Jobs? → **Airflow / Prefect**
│
└─ Edge / Client-side?
│
├─ Mobile? → **TFLite / CoreML**
└─ Browser? → **TensorFlow.js / ONNX Runtime Web**
Training Environment?
│
├─ Single Node?
│ │
│ ├─ Interactive? → **JupyterHub / SageMaker Notebooks**
│ └─ Automated? → **Docker Container on VM**
│
└─ Distributed?
│
├─ Data Parallelism? → **Ray Train / PyTorch DDP**
└─ Pipeline orchestration? → **Kubeflow / Airflow / Vertex AI**
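Data parallelism (the Ray Train / PyTorch DDP branch above) keeps model replicas in sync by averaging gradients across workers. A toy sketch of that all-reduce step, with illustrative names and numbers — the real synchronization lives in `torch.distributed`:

```python
# Toy sketch of DDP-style gradient averaging; `allreduce_mean` and the
# gradient values are illustrative only.

def allreduce_mean(worker_grads: list[list[float]]) -> list[float]:
    """Average per-parameter gradients across workers (the core of data parallelism)."""
    n_workers = len(worker_grads)
    return [sum(g) / n_workers for g in zip(*worker_grads)]

# Each worker computes gradients on its own shard of the batch...
grads_w0 = [0.25, -0.5, 1.0]
grads_w1 = [0.75, -0.5, 0.0]
# ...then every worker applies the same averaged update, keeping replicas identical.
avg = allreduce_mean([grads_w0, grads_w1])
print(avg)  # [0.5, -0.5, 0.5]
```
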
| Need | Recommendation | Rationale |
|---|---|---|
| Simple / MVP | No feature store | Use SQL/Parquet files; a feature store's overhead is too high. |
| Team consistency | Feast | Open source; manages online/offline consistency. |
| Enterprise / managed | Tecton / Hopsworks | Full governance, lineage, managed SLAs. |
| Cloud native | Vertex / SageMaker FS | Tight integration if already in that cloud ecosystem. |
Red Flags → Escalate to oracle:
Goal: Automate model training, validation, and registration using MLflow.

Steps:

Setup Tracking

```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score

mlflow.set_tracking_uri("http://localhost:5000")  # assumes a local MLflow tracking server
mlflow.set_experiment("churn-prediction-prod")
```
Training Script (train.py)

```python
# X_train, X_test, y_train, y_test come from your data-loading step (not shown).
def train(max_depth, n_estimators):
    with mlflow.start_run():
        # Log params
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("n_estimators", n_estimators)

        # Train
        model = RandomForestClassifier(
            max_depth=max_depth,
            n_estimators=n_estimators,
            random_state=42,
        )
        model.fit(X_train, y_train)

        # Evaluate
        preds = model.predict(X_test)
        acc = accuracy_score(y_test, preds)
        prec = precision_score(y_test, preds)

        # Log metrics
        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("precision", prec)

        # Log model artifact with signature
        from mlflow.models.signature import infer_signature
        signature = infer_signature(X_train, preds)
        mlflow.sklearn.log_model(
            model,
            "model",
            signature=signature,
            registered_model_name="churn-model",
        )
        print(f"Run ID: {mlflow.active_run().info.run_id}")

if __name__ == "__main__":
    train(max_depth=5, n_estimators=100)
```
Pipeline Orchestration (Bash/Airflow)

```bash
#!/bin/bash
# Run training
python train.py
# Check whether the model passed the threshold (e.g. via the MLflow API);
# if yes, transition it to Staging.
```
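The threshold check the script leaves as a comment can be sketched as plain gating logic. `should_promote`, `run_metrics`, and `gates` are hypothetical names; the commented-out lines show where a real MLflow registry transition would go:

```python
# Promotion gate: promote only if every tracked metric meets its minimum.
def should_promote(metrics: dict, thresholds: dict) -> bool:
    return all(metrics.get(name, 0.0) >= minimum
               for name, minimum in thresholds.items())

run_metrics = {"accuracy": 0.91, "precision": 0.88}  # e.g. read back from the MLflow run
gates = {"accuracy": 0.90, "precision": 0.85}

if should_promote(run_metrics, gates):
    # from mlflow.tracking import MlflowClient
    # MlflowClient().transition_model_version_stage(
    #     name="churn-model", version=version, stage="Staging")
    print("Promoting to Staging")
```
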
Goal: Detect whether the production data distribution has drifted from the training data.

Steps:

Baseline Generation (During Training)

```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Calculate a baseline drift profile on the training data
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=test_df)
report.save_json("baseline_drift.json")
```
Production Monitoring Job

```python
# Scheduled daily job
def check_drift():
    # Load production logs (last 24h) and the training reference set
    current_data = load_production_logs()
    reference_data = load_training_data()

    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference_data, current_data=current_data)

    result = report.as_dict()
    dataset_drift = result['metrics'][0]['result']['dataset_drift']
    if dataset_drift:
        trigger_alert("Data Drift Detected!")
        trigger_retraining()
```
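Under the hood, per-feature drift scores like the ones `DataDriftPreset` aggregates are often computed with statistics such as the Population Stability Index. A self-contained sketch — the 0.1/0.25 thresholds are a common rule of thumb, not Evidently's defaults:

```python
import math

# Toy Population Stability Index (PSI) over equal-width bins.
# Rule of thumb (an assumption; tune per dataset): <0.1 stable, >0.25 drifted.
def psi(reference: list[float], current: list[float], bins: int = 10) -> float:
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference
    def frac(data):
        counts = [0] * bins
        for x in data:
            i = min(int((x - lo) / width), bins - 1)
            counts[max(i, 0)] += 1
        # Smooth empty buckets so the log term stays finite.
        return [(c + 1e-6) / (len(data) + bins * 1e-6) for c in counts]
    ref, cur = frac(reference), frac(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

ref = [i / 100 for i in range(100)]              # roughly uniform on [0, 1)
same = [i / 100 for i in range(100)]
shifted = [0.9 + i / 1000 for i in range(100)]   # mass concentrated near 0.9
print(psi(ref, same))     # ~0.0 -> no drift
print(psi(ref, shifted))  # large -> drift, would trigger the alert above
```
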
Goal: Build a production retrieval pipeline using Pinecone/Weaviate and LangChain.

Steps:

Ingestion (Chunking & Embedding)

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# Chunking
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(raw_documents)

# Embedding & indexing
embeddings = OpenAIEmbeddings()
vectorstore = PineconeVectorStore.from_documents(
    docs,
    embeddings,
    index_name="knowledge-base",
)
```
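The `chunk_overlap` parameter exists so that text straddling a chunk boundary survives intact in at least one chunk. A toy fixed-size splitter illustrating the sliding window (the real `RecursiveCharacterTextSplitter` is smarter — it prefers paragraph and sentence boundaries):

```python
# Minimal sliding-window splitter; `split_text` is illustrative, not LangChain's.
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
# Each chunk repeats the tail of its predecessor, so a span that crosses a
# boundary is still retrievable from one chunk instead of being cut in half.
```
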
Retrieval & Generation

```python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
)
response = qa_chain.invoke("How do I reset my password?")
print(response['result'])
```
Optimization (Hybrid Search)
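One common way to implement hybrid search is to run dense (vector) and sparse (BM25 keyword) retrieval separately, then merge the ranked lists with Reciprocal Rank Fusion. A sketch, with `k=60` as in the original RRF paper and illustrative document ids:

```python
# Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document,
# so documents ranked well by *both* retrievers rise to the top.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # vector-similarity order
sparse = ["doc_b", "doc_d", "doc_a"]  # keyword (BM25) order
print(rrf([dense, sparse]))  # doc_b wins: high in both lists
```
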
What it looks like:
Why it fails:
Correct approach:

What it looks like: A data scientist hands a `.pkl` file over to an engineer.
Why it fails:
Correct approach:

What it looks like: The API returns `200 OK`, but the prediction is garbage because the input data was corrupted (e.g., all Nulls) and the model predicts `0` for everything.
Why it fails:
Correct approach:

Reliability: `/health` endpoint implemented (liveness/readiness).
Performance:
Reproducibility: Pinned dependencies (`requirements.txt` / `conda.yaml`).
Monitoring:
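The `/health` checklist item can be as small as a WSGI app. A sketch — real services usually expose this through FastAPI/Flask, and `MODEL` here is a stand-in for an actual model handle:

```python
# Minimal WSGI health/readiness endpoints.
# Liveness (/health): the process is up and responding.
# Readiness (/ready): the model is loaded and traffic is safe to route here.
MODEL = {"loaded": True}  # stand-in for a real model handle

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path == "/health":
        status, body = "200 OK", b"alive"
    elif path == "/ready":
        if MODEL["loaded"]:
            status, body = "200 OK", b"ready"
        else:
            status, body = "503 Service Unavailable", b"loading"
    else:
        status, body = "404 Not Found", b""
    start_response(status, [("Content-Type", "text/plain")])
    return [body]
```

Kubernetes liveness probes would hit `/health` (restart on failure) and readiness probes `/ready` (remove from the load balancer while loading).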
Weekly Installs: 93
Repository: 404kidwiz/claude-supercode-skills
GitHub Stars: 45
First Seen: Jan 24, 2026
Security Audits: Gen Agent Trust Hub: Warn · Socket: Pass · Snyk: Pass
Installed on:
- opencode: 78
- gemini-cli: 71
- codex: 69
- claude-code: 67
- cursor: 62
- github-copilot: 58