ai-mlops by vasilyu1983/ai-agents-public
npx skills add https://github.com/vasilyu1983/ai-agents-public --skill ai-mlops
Production ML lifecycle with modern security practices.
This skill covers:
Modern Best Practices (Jan 2026):
It is execution-focused:
Links to copy-paste templates live in the assets/ directory.

| Task | Tool/Framework | Command | When to Use |
|---|---|---|---|
| Data Ingestion | dlt (data load tool) | dlt pipeline run, dlt init | Loading from APIs, databases to warehouses |
| Batch Deployment | Airflow, Dagster, Prefect | airflow dags trigger, dagster job launch | Scheduled predictions on large datasets |
| API Deployment | FastAPI, Flask, TorchServe | uvicorn app:app, torchserve --start | Real-time inference (<500ms latency) |
| LLM Serving | vLLM, TGI, BentoML | vllm serve model, bentoml serve | High-throughput LLM inference |
| Model Registry | MLflow, W&B, ZenML | mlflow.register_model(), zenml model register | Versioning and promoting models |
| Drift Detection | Statistical tests + monitors | PSI/KS, embedding drift, prediction drift | Detect data/process changes and trigger review |
| Monitoring | Prometheus, Grafana | prometheus.yml, Grafana dashboards | Metrics, alerts, SLO tracking |
| AgentOps | AgentOps, Langfuse, LangSmith | agentops.init(), trace visualization | AI agent observability, session replay |
| Incident Response | Runbooks, PagerDuty | Documented playbooks, alert routing | Handling failures and degradation |
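The Monitoring row pairs Prometheus and Grafana with SLO tracking. As a minimal, hedged sketch of the kind of metric such a stack tracks (class and method names below are my own, not part of this skill's assets), a rolling p95 latency check against an SLO threshold can be computed like this:

```python
from collections import deque


class LatencyTracker:
    """Rolling window of request latencies with a p95 SLO check.
    Illustrative only: in production these values would be exported
    to Prometheus and charted in Grafana rather than computed inline."""

    def __init__(self, slo_ms: float, window: int = 1000):
        self.slo_ms = slo_ms
        self.samples: deque = deque(maxlen=window)

    def observe(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def p95(self) -> float:
        ordered = sorted(self.samples)
        # Nearest-rank percentile: index of the 95th-percentile sample.
        idx = max(0, int(round(0.95 * len(ordered))) - 1)
        return ordered[idx]

    def slo_breached(self) -> bool:
        return bool(self.samples) and self.p95() > self.slo_ms


tracker = LatencyTracker(slo_ms=500.0)
for ms in [120, 180, 250, 300, 420, 460, 480, 510, 90, 200]:
    tracker.observe(ms)
print(tracker.p95(), tracker.slo_breached())
```

A bounded deque keeps memory constant under load; real deployments would use a Prometheus histogram and let the server compute quantiles instead.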
Use this skill when the user asks for deployment, operations, monitoring, incident handling, or governance for ML/LLM/agent systems, e.g.:
If the user is asking only about EDA, modelling, or theory, prefer:
- ai-ml-data-science (EDA, features, modelling, SQL transformation with SQLMesh)
- ai-llm (prompting, fine-tuning, eval)
- ai-rag (retrieval pipeline design)
- ai-llm-inference (compression, spec decode, serving internals)
If the user is asking about SQL transformation (after data is loaded), prefer:
- ai-ml-data-science (SQLMesh templates for staging, intermediate, marts layers)
User needs to deploy: [ML System]
├─ Data Ingestion?
│ ├─ From REST APIs? → dlt REST API templates
│ ├─ From databases? → dlt database sources (PostgreSQL, MySQL, MongoDB)
│ └─ Incremental loading? → dlt incremental patterns (timestamp, ID-based)
│
├─ Model Serving?
│ ├─ Latency <500ms? → FastAPI real-time API
│ ├─ Batch predictions? → Airflow/Dagster batch pipeline
│ └─ Mix of both? → Hybrid (batch features + online scoring)
│
├─ Monitoring & Ops?
│ ├─ Drift detection? → Evidently + automated retraining triggers
│ ├─ Performance tracking? → Prometheus + Grafana dashboards
│ └─ Incident response? → Runbooks + PagerDuty alerts
│
└─ LLM/RAG Production?
├─ Cost optimization? → Caching, prompt templates, token budgets
└─ Safety? → See ai-mlops skill
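The incremental-loading branch above is the pattern that dlt's timestamp/ID cursor helpers automate. As a rough plain-Python illustration of the underlying idea (the function, field names, and records here are hypothetical, not dlt's API), a timestamp-cursor extract looks like:

```python
# Sketch of timestamp-cursor incremental loading -- the pattern dlt's
# incremental helpers automate. All field names and records are illustrative.

def extract_incremental(rows, last_seen):
    """Return rows newer than the stored cursor, plus the advanced cursor."""
    fresh = [r for r in rows if r["updated_at"] > last_seen]
    new_cursor = max((r["updated_at"] for r in fresh), default=last_seen)
    return fresh, new_cursor


source = [
    {"id": 1, "updated_at": "2026-01-01T10:00:00"},
    {"id": 2, "updated_at": "2026-01-02T09:30:00"},
    {"id": 3, "updated_at": "2026-01-03T12:00:00"},
]

# The cursor comes from the previous load; persist new_cursor afterwards
# (dlt stores this in pipeline state automatically).
new_rows, new_cursor = extract_incremental(source, last_seen="2026-01-01T23:59:59")
print([r["id"] for r in new_rows], new_cursor)
```

ISO 8601 timestamps compare correctly as strings, which is why the lexicographic `>` works here; an ID-based cursor follows the same shape with a numeric comparison.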
Do
Avoid
This skill provides production-ready patterns and guides organized into comprehensive references:
Pattern 0: Data Contracts, Ingestion & Lineage → See Data Ingestion Patterns
Pattern 1: Choose Deployment Mode → See Deployment Patterns
Pattern 2: Standard Deployment Lifecycle → See Deployment Lifecycle
Pattern 3: Packaging & Model Registry → See Model Registry Patterns
Pattern 4: Batch Scoring Pipeline → See Deployment Patterns
Pattern 5: Real-Time API Scoring → See API Design Patterns
Pattern 6: Hybrid & Feature Store Integration → See Feature Store Patterns
Pattern 7: Monitoring & Alerting → See Monitoring Best Practices
Pattern 8: Drift Detection & Automated Retraining → See Drift Detection Guide
Pattern 9: Incidents & Runbooks → See Incident Response Playbooks
Pattern 10: LLM / RAG in Production → See LLM & RAG Production Patterns
Pattern 11: Cross-Region, Residency & Rollback → See Multi-Region Patterns
Pattern 12: Online Evaluation & Feedback Loops → See Online Evaluation Patterns
Pattern 13: AgentOps (AI Agent Operations) → See AgentOps Patterns
Pattern 14: Edge MLOps & TinyML → See Edge MLOps Patterns
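Pattern 8's drift checks lean on statistical tests such as PSI. As a minimal sketch (the binning scheme and the 0.1/0.25 thresholds are common rules of thumb I am assuming, not this skill's reference implementation), PSI between a reference sample and a live window can be computed as:

```python
import math


def psi(reference, current, bins=10):
    """Population Stability Index between two numeric samples.
    Bins are derived from the reference distribution; a small epsilon
    keeps empty bins from causing division by zero."""
    lo, hi = min(reference), max(reference)
    eps = 1e-6

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp into [0, bins-1] so out-of-range values land in edge bins.
            i = min(bins - 1, max(0, int((x - lo) / (hi - lo) * bins)))
            counts[i] += 1
        return [max(c / len(sample), eps) for c in counts]

    ref, cur = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


baseline = [i / 100 for i in range(1000)]
stable = psi(baseline, [i / 100 for i in range(1000)])
shifted = psi(baseline, [5 + i / 100 for i in range(1000)])
# Common rule of thumb: PSI < 0.1 stable, > 0.25 significant drift.
print(round(stable, 4), shifted > 0.25)
```

In practice a monitor like Evidently computes this per feature on a schedule; the retraining trigger in Pattern 8 would fire when PSI crosses the chosen threshold.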
For comprehensive operational guides, see:
Core Infrastructure:
Serving & APIs:
Operations & Reliability:
Security & Governance:
Advanced Patterns:
Use these as copy-paste starting points for production artifacts:
For loading data into warehouses and pipelines:
Use dlt when:
For SQL transformation (after ingestion), use:
→ ai-ml-data-science skill (SQLMesh templates for staging/intermediate/marts layers)
Resources
Templates
Data
See data/sources.json for curated references on:
For comprehensive data lake/lakehouse patterns (beyond dlt ingestion), see data-lake-platform:
This skill focuses on ML-specific deployment, monitoring, and security. Use data-lake-platform for general-purpose data infrastructure.
When users ask recommendation questions about MLOps tooling, verify recency before answering.
- Check data/sources.json and prefer sources with add_as_web_search: true.
- Cross-reference recommendations against data/sources.json, and recommend validation steps (POC + evals + rollout plan).
After searching, provide:
For adjacent topics, reference these skills:
Use this skill to turn trained models into reliable services, not to derive the model itself.
Weekly Installs: 82
Repository: vasilyu1983/ai-agents-public
GitHub Stars: 49
First Seen: Jan 22, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: cursor (67), codex (66), opencode (65), gemini-cli (63), github-copilot (62), amp (54)