高级机器学习工程师技能：生产级MLOps、模型部署与LLM集成实战指南

senior-ml-engineer by alirezarezvani/claude-skills

184 周安装量

6,700 GitHub Stars

GitHub

安装命令

npx skills add https://github.com/alirezarezvani/claude-skills --skill senior-ml-engineer

AI/机器学习自动化开发运维

🇨🇳中文介绍

高级机器学习工程师

面向模型部署、MLOps基础设施和LLM集成的生产级机器学习工程模式。

模型部署工作流

将训练好的模型部署到生产环境并进行监控：

将模型导出为标准格式（ONNX、TorchScript、SavedModel）
将模型及其依赖项打包到Docker容器中
部署到预发布环境
针对预发布环境运行集成测试
部署金丝雀版本（5%流量）到生产环境
监控延迟和错误率1小时
如果指标通过，则推广到全量生产环境
验证： p95延迟 < 100ms，错误率 < 0.1%

容器模板

FROM python:3.11-slim

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model/ /app/model/
COPY src/ /app/src/

HEALTHCHECK CMD curl -f http://localhost:8080/health || exit 1

EXPOSE 8080
CMD ["uvicorn", "src.server:app", "--host", "0.0.0.0", "--port", "8080"]

服务选项

选项	延迟	吞吐量

广告位招租

在这里展示您的产品或服务

触达数万 AI 开发者，精准高效

联系我们

MLOps流水线设置

建立自动化的训练和部署流程：

为训练数据配置特征存储（Feast、Tecton）
设置实验跟踪（MLflow、Weights & Biases）
创建包含超参数记录的训练流水线
在模型注册表中注册模型，并附带版本元数据
配置由注册表事件触发的预发布部署
为模型比较设置A/B测试基础设施
启用带有告警的漂移监控
验证： 新模型自动与基线模型进行评估对比

from feast import Entity, Feature, FeatureView, FileSource

user = Entity(name="user_id", value_type=ValueType.INT64)

user_features = FeatureView(
    name="user_features",
    entities=["user_id"],
    ttl=timedelta(days=1),
    features=[
        Feature(name="purchase_count_30d", dtype=ValueType.INT64),
        Feature(name="avg_order_value", dtype=ValueType.FLOAT),
    ],
    online=True,
    source=FileSource(path="data/user_features.parquet"),
)

重新训练触发条件

触发条件	检测方式	操作
计划任务	Cron（每周/每月）	完全重新训练
性能下降	准确率 < 阈值	立即重新训练
数据漂移	PSI > 0.2	评估，然后重新训练
新数据量	X个新样本	增量更新

将LLM API集成到生产应用程序中：

创建供应商抽象层以实现供应商灵活性
实现带指数退避的重试逻辑
配置回退到备用供应商
设置令牌计数和上下文截断
为重复查询添加响应缓存
实现按请求的成本跟踪
使用Pydantic添加结构化输出验证
验证： 响应解析正确，成本在预算内

from abc import ABC, abstractmethod
from tenacity import retry, stop_after_attempt, wait_exponential

class LLMProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> str:
        pass

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def call_llm_with_retry(provider: LLMProvider, prompt: str) -> str:
    return provider.complete(prompt)

供应商	输入成本	输出成本
GPT-4	$0.03/1K	$0.06/1K
GPT-3.5	$0.0005/1K	$0.0015/1K
Claude 3 Opus	$0.015/1K	$0.075/1K
Claude 3 Haiku	$0.00025/1K	$0.00125/1K

构建检索增强生成流水线：

选择向量数据库（Pinecone、Qdrant、Weaviate）
根据质量/成本权衡选择嵌入模型
实现文档分块策略
创建包含元数据提取的摄取流水线
使用查询嵌入构建检索
添加重排序以提高相关性
格式化上下文并发送给LLM
验证： 响应引用了检索到的上下文，无幻觉

向量数据库选择

数据库	托管方式	扩展性	延迟	最适合
Pinecone	托管	高	低	生产环境，托管服务
Qdrant	两者皆可	高	非常低	性能关键型应用
Weaviate	两者皆可	高	低	混合搜索
Chroma	自托管	中等	低	原型开发
pgvector	自托管	中等	中等	已有Postgres

策略	分块大小	重叠	最适合
固定大小	500-1000个令牌	50-100	通用文本
按句子	3-5个句子	1个句子	结构化文本
语义	可变	基于含义	研究论文
递归	分层	父子关系	长文档

监控生产模型的漂移和性能下降：

设置延迟跟踪（p50、p95、p99）
配置错误率告警
实现输入数据漂移检测
跟踪预测分布变化
在可用时记录真实标签
使用A/B测试指标比较模型版本
设置自动重新训练触发条件
验证： 在用户可感知的性能下降前触发告警

from scipy.stats import ks_2samp

def detect_drift(reference, current, threshold=0.05):
    statistic, p_value = ks_2samp(reference, current)
    return {
        "drift_detected": p_value < threshold,
        "ks_statistic": statistic,
        "p_value": p_value
    }

指标	警告	严重
p95延迟	> 100ms	> 200ms
错误率	> 0.1%	> 1%
PSI（漂移）	> 0.1	> 0.2
准确率下降	> 2%	> 5%

references/mlops_production_patterns.md 包含：

包含Kubernetes清单的模型部署流水线
带有Feast示例的特征存储架构
包含漂移检测代码的模型监控
包含流量分割的A/B测试基础设施
使用MLflow的自动重新训练流水线

references/llm_integration_guide.md 包含：

供应商抽象层模式
使用tenacity的重试和回退策略
提示工程模板（少样本、思维链）
使用tiktoken的令牌优化
成本计算和跟踪

references/rag_system_architecture.md 包含：

包含代码的RAG流水线实现
向量数据库比较和集成
分块策略（固定、语义、递归）
嵌入模型选择指南
混合搜索和重排序模式

模型部署流水线

python scripts/model_deployment_pipeline.py --model model.pkl --target staging

生成部署工件：Dockerfile、Kubernetes清单、健康检查。

python scripts/rag_system_builder.py --config rag_config.yaml --analyze

搭建包含向量存储集成和检索逻辑的RAG流水线脚手架。

python scripts/ml_monitoring_suite.py --config monitoring.yaml --deploy

设置漂移检测、告警和性能仪表板。

类别	工具
机器学习框架	PyTorch, TensorFlow, Scikit-learn, XGBoost
LLM框架	LangChain, LlamaIndex, DSPy
MLOps	MLflow, Weights & Biases, Kubeflow
数据	Spark, Airflow, dbt, Kafka
部署	Docker, Kubernetes, Triton
数据库	PostgreSQL, BigQuery, Pinecone, Redis

🇺🇸English

Senior ML Engineer

Production ML engineering patterns for model deployment, MLOps infrastructure, and LLM integration.

Model Deployment Workflow
MLOps Pipeline Setup
LLM Integration Workflow
RAG System Implementation
Model Monitoring
Reference Documentation
Tools

Model Deployment Workflow

Deploy a trained model to production with monitoring:

Export model to standardized format (ONNX, TorchScript, SavedModel)
Package model with dependencies in Docker container
Deploy to staging environment
Run integration tests against staging
Deploy canary (5% traffic) to production
Monitor latency and error rates for 1 hour
Promote to full production if metrics pass
Validation: p95 latency < 100ms, error rate < 0.1%

Container Template

FROM python:3.11-slim

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model/ /app/model/
COPY src/ /app/src/

HEALTHCHECK CMD curl -f http://localhost:8080/health || exit 1

EXPOSE 8080
CMD ["uvicorn", "src.server:app", "--host", "0.0.0.0", "--port", "8080"]

Serving Options

Option	Latency	Throughput	Use Case
FastAPI + Uvicorn	Low	Medium	REST APIs, small models
Triton Inference Server	Very Low	Very High	GPU inference, batching
TensorFlow Serving	Low	High	TensorFlow models
TorchServe	Low	High	PyTorch models
Ray Serve	Medium	High	Complex pipelines, multi-model

MLOps Pipeline Setup

Establish automated training and deployment:

Configure feature store (Feast, Tecton) for training data
Set up experiment tracking (MLflow, Weights & Biases)
Create training pipeline with hyperparameter logging
Register model in model registry with version metadata
Configure staging deployment triggered by registry events
Set up A/B testing infrastructure for model comparison
Enable drift monitoring with alerting
Validation: New models automatically evaluated against baseline

Feature Store Pattern

from feast import Entity, Feature, FeatureView, FileSource

user = Entity(name="user_id", value_type=ValueType.INT64)

user_features = FeatureView(
    name="user_features",
    entities=["user_id"],
    ttl=timedelta(days=1),
    features=[
        Feature(name="purchase_count_30d", dtype=ValueType.INT64),
        Feature(name="avg_order_value", dtype=ValueType.FLOAT),
    ],
    online=True,
    source=FileSource(path="data/user_features.parquet"),
)

Retraining Triggers

Trigger	Detection	Action
Scheduled	Cron (weekly/monthly)	Full retrain
Performance drop	Accuracy < threshold	Immediate retrain
Data drift	PSI > 0.2	Evaluate, then retrain
New data volume	X new samples	Incremental update

LLM Integration Workflow

Integrate LLM APIs into production applications:

Create provider abstraction layer for vendor flexibility
Implement retry logic with exponential backoff
Configure fallback to secondary provider
Set up token counting and context truncation
Add response caching for repeated queries
Implement cost tracking per request
Add structured output validation with Pydantic
Validation: Response parses correctly, cost within budget

Provider Abstraction

from abc import ABC, abstractmethod
from tenacity import retry, stop_after_attempt, wait_exponential

class LLMProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> str:
        pass

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def call_llm_with_retry(provider: LLMProvider, prompt: str) -> str:
    return provider.complete(prompt)

Cost Management

Provider	Input Cost	Output Cost
GPT-4	$0.03/1K	$0.06/1K
GPT-3.5	$0.0005/1K	$0.0015/1K
Claude 3 Opus	$0.015/1K	$0.075/1K
Claude 3 Haiku	$0.00025/1K	$0.00125/1K

RAG System Implementation

Build retrieval-augmented generation pipeline:

Choose vector database (Pinecone, Qdrant, Weaviate)
Select embedding model based on quality/cost tradeoff
Implement document chunking strategy
Create ingestion pipeline with metadata extraction
Build retrieval with query embedding
Add reranking for relevance improvement
Format context and send to LLM
Validation: Response references retrieved context, no hallucinations

Vector Database Selection

Database	Hosting	Scale	Latency	Best For
Pinecone	Managed	High	Low	Production, managed
Qdrant	Both	High	Very Low	Performance-critical
Weaviate	Both	High	Low	Hybrid search
Chroma	Self-hosted	Medium	Low	Prototyping
pgvector	Self-hosted	Medium	Medium	Existing Postgres

Chunking Strategies

Strategy	Chunk Size	Overlap	Best For
Fixed	500-1000 tokens	50-100	General text
Sentence	3-5 sentences	1 sentence	Structured text
Semantic	Variable	Based on meaning	Research papers
Recursive	Hierarchical	Parent-child	Long documents

Model Monitoring

Monitor production models for drift and degradation:

Set up latency tracking (p50, p95, p99)
Configure error rate alerting
Implement input data drift detection
Track prediction distribution shifts
Log ground truth when available
Compare model versions with A/B metrics
Set up automated retraining triggers
Validation: Alerts fire before user-visible degradation

Drift Detection

from scipy.stats import ks_2samp

def detect_drift(reference, current, threshold=0.05):
    statistic, p_value = ks_2samp(reference, current)
    return {
        "drift_detected": p_value < threshold,
        "ks_statistic": statistic,
        "p_value": p_value
    }

Alert Thresholds

Metric	Warning	Critical
p95 latency	> 100ms	> 200ms
Error rate	> 0.1%	> 1%
PSI (drift)	> 0.1	> 0.2
Accuracy drop	> 2%	> 5%

Reference Documentation

MLOps Production Patterns

references/mlops_production_patterns.md contains:

Model deployment pipeline with Kubernetes manifests
Feature store architecture with Feast examples
Model monitoring with drift detection code
A/B testing infrastructure with traffic splitting
Automated retraining pipeline with MLflow

LLM Integration Guide

references/llm_integration_guide.md contains:

Provider abstraction layer pattern
Retry and fallback strategies with tenacity
Prompt engineering templates (few-shot, CoT)
Token optimization with tiktoken
Cost calculation and tracking

RAG System Architecture

references/rag_system_architecture.md contains:

RAG pipeline implementation with code
Vector database comparison and integration
Chunking strategies (fixed, semantic, recursive)
Embedding model selection guide
Hybrid search and reranking patterns

Tools

Model Deployment Pipeline

python scripts/model_deployment_pipeline.py --model model.pkl --target staging

Generates deployment artifacts: Dockerfile, Kubernetes manifests, health checks.

RAG System Builder

python scripts/rag_system_builder.py --config rag_config.yaml --analyze

Scaffolds RAG pipeline with vector store integration and retrieval logic.

ML Monitoring Suite

python scripts/ml_monitoring_suite.py --config monitoring.yaml --deploy

Sets up drift detection, alerting, and performance dashboards.

Tech Stack

Category	Tools
ML Frameworks	PyTorch, TensorFlow, Scikit-learn, XGBoost
LLM Frameworks	LangChain, LlamaIndex, DSPy
MLOps	MLflow, Weights & Biases, Kubeflow
Data	Spark, Airflow, dbt, Kafka
Deployment	Docker, Kubernetes, Triton
Databases	PostgreSQL, BigQuery, Pinecone, Redis

Weekly Installs

182

Repository

alirezarezvani/…e-skills

GitHub Stars

4.3K

First Seen

Jan 20, 2026

Security Audits

Gen Agent Trust HubPass SocketPass SnykPass

Installed on

claude-code160

opencode138

gemini-cli135

codex127

cursor116

github-copilot112

Azure RBAC 权限管理工具：查找最小角色、创建自定义角色与自动化分配

123,100 周安装

FastAPI + Uvicorn	低	中等	REST API，小型模型
Triton Inference Server	非常低	非常高	GPU推理，批处理
TensorFlow Serving	低	高	TensorFlow模型
TorchServe	低	高	PyTorch模型
Ray Serve	中等	高	复杂流水线，多模型

高级机器学习工程师技能：生产级MLOps、模型部署与LLM集成实战指南

🇨🇳中文介绍

高级机器学习工程师

目录

模型部署工作流

容器模板

服务选项

相关 Skills

MLOps流水线设置

特征存储模式

重新训练触发条件

LLM集成工作流

供应商抽象

成本管理

RAG系统实现