ai-mlops by vasilyu1983/ai-agents-public
npx skills add https://github.com/vasilyu1983/ai-agents-public --skill ai-mlops
Production ML lifecycle with modern security practices.
This skill covers:
Modern Best Practices (Jan 2026):
It is execution-focused:
Links to copy-paste templates live in the assets/ directory.

| Task | Tool/Framework | Command | When to Use |
|---|---|---|---|
| Data Ingestion | dlt (data load tool) | dlt pipeline run, dlt init | Loading from APIs, databases to warehouses |
| Batch Deployment | Airflow, Dagster, Prefect | airflow dags trigger, dagster job launch | Scheduled predictions on large datasets |
| API Deployment | FastAPI, Flask, TorchServe | uvicorn app:app, torchserve --start | Real-time inference (<500ms latency) |
| LLM Serving | vLLM, TGI, BentoML | vllm serve model, bentoml serve | High-throughput LLM inference |
| Model Registry | MLflow, W&B, ZenML | mlflow.register_model(), zenml model register | Versioning and promoting models |
| Drift Detection | Statistical tests + monitors | PSI/KS, embedding drift, prediction drift | Detect data/process changes and trigger review |
| Monitoring | Prometheus, Grafana | prometheus.yml, Grafana dashboards | Metrics, alerts, SLO tracking |
| AgentOps | AgentOps, Langfuse, LangSmith | agentops.init(), trace visualization | AI agent observability, session replay |
| Incident Response | Runbooks, PagerDuty | Documented playbooks, alert routing | Handling failures and degradation |
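The Monitoring row pairs Prometheus and Grafana with SLO tracking. As a minimal, hedged sketch of the kind of metric such a stack tracks (class and method names below are my own, not part of this skill's assets), a rolling p95 latency check against an SLO threshold can be computed like this:

```python
from collections import deque


class LatencyTracker:
    """Rolling window of request latencies with a p95 SLO check.
    Illustrative only: in production these values would be exported
    to Prometheus and charted in Grafana rather than computed inline."""

    def __init__(self, slo_ms: float, window: int = 1000):
        self.slo_ms = slo_ms
        self.samples: deque = deque(maxlen=window)

    def observe(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def p95(self) -> float:
        ordered = sorted(self.samples)
        # Nearest-rank percentile: index of the 95th-percentile sample.
        idx = max(0, int(round(0.95 * len(ordered))) - 1)
        return ordered[idx]

    def slo_breached(self) -> bool:
        return bool(self.samples) and self.p95() > self.slo_ms


tracker = LatencyTracker(slo_ms=500.0)
for ms in [120, 180, 250, 300, 420, 460, 480, 510, 90, 200]:
    tracker.observe(ms)
print(tracker.p95(), tracker.slo_breached())
```

A bounded deque keeps memory constant under load; real deployments would use a Prometheus histogram and let the server compute quantiles instead.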
Use this skill when the user asks for deployment, operations, monitoring, incident handling, or governance for ML/LLM/agent systems, e.g.:
If the user is asking only about EDA, modelling, or theory, prefer:
- ai-ml-data-science (EDA, features, modelling, SQL transformation with SQLMesh)
- ai-llm (prompting, fine-tuning, eval)
- ai-rag (retrieval pipeline design)
- ai-llm-inference (compression, spec decode, serving internals)
If the user is asking about SQL transformation (after data is loaded), prefer:
- ai-ml-data-science (SQLMesh templates for staging, intermediate, marts layers)
User needs to deploy: [ML System]
├─ Data Ingestion?
│ ├─ From REST APIs? → dlt REST API templates
│ ├─ From databases? → dlt database sources (PostgreSQL, MySQL, MongoDB)
│ └─ Incremental loading? → dlt incremental patterns (timestamp, ID-based)
│
├─ Model Serving?
│ ├─ Latency <500ms? → FastAPI real-time API
│ ├─ Batch predictions? → Airflow/Dagster batch pipeline
│ └─ Mix of both? → Hybrid (batch features + online scoring)
│
├─ Monitoring & Ops?
│ ├─ Drift detection? → Evidently + automated retraining triggers
│ ├─ Performance tracking? → Prometheus + Grafana dashboards
│ └─ Incident response? → Runbooks + PagerDuty alerts
│
└─ LLM/RAG Production?
├─ Cost optimization? → Caching, prompt templates, token budgets
└─ Safety? → See ai-mlops skill
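The incremental-loading branch above is the pattern that dlt's timestamp/ID cursor helpers automate. As a rough plain-Python illustration of the underlying idea (the function, field names, and records here are hypothetical, not dlt's API), a timestamp-cursor extract looks like:

```python
# Sketch of timestamp-cursor incremental loading -- the pattern dlt's
# incremental helpers automate. All field names and records are illustrative.

def extract_incremental(rows, last_seen):
    """Return rows newer than the stored cursor, plus the advanced cursor."""
    fresh = [r for r in rows if r["updated_at"] > last_seen]
    new_cursor = max((r["updated_at"] for r in fresh), default=last_seen)
    return fresh, new_cursor


source = [
    {"id": 1, "updated_at": "2026-01-01T10:00:00"},
    {"id": 2, "updated_at": "2026-01-02T09:30:00"},
    {"id": 3, "updated_at": "2026-01-03T12:00:00"},
]

# The cursor comes from the previous load; persist new_cursor afterwards
# (dlt stores this in pipeline state automatically).
new_rows, new_cursor = extract_incremental(source, last_seen="2026-01-01T23:59:59")
print([r["id"] for r in new_rows], new_cursor)
```

ISO 8601 timestamps compare correctly as strings, which is why the lexicographic `>` works here; an ID-based cursor follows the same shape with a numeric comparison.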
Do
Avoid
This skill provides production-ready patterns and guides organized into comprehensive references:
Pattern 0: Data Contracts, Ingestion & Lineage → See Data Ingestion Patterns
Pattern 1: Choose Deployment Mode → See Deployment Patterns
Pattern 2: Standard Deployment Lifecycle → See Deployment Lifecycle
Pattern 3: Packaging & Model Registry → See Model Registry Patterns
Pattern 4: Batch Scoring Pipeline → See Deployment Patterns
Pattern 5: Real-Time API Scoring → See API Design Patterns
Pattern 6: Hybrid & Feature Store Integration → See Feature Store Patterns
Pattern 7: Monitoring & Alerting → See Monitoring Best Practices
Pattern 8: Drift Detection & Automated Retraining → See Drift Detection Guide
Pattern 9: Incidents & Runbooks → See Incident Response Playbooks
Pattern 10: LLM / RAG in Production → See LLM & RAG Production Patterns
Pattern 11: Cross-Region, Residency & Rollback → See Multi-Region Patterns
Pattern 12: Online Evaluation & Feedback Loops → See Online Evaluation Patterns
Pattern 13: AgentOps (AI Agent Operations) → See AgentOps Patterns
Pattern 14: Edge MLOps & TinyML → See Edge MLOps Patterns
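Pattern 8's drift checks lean on statistical tests such as PSI. As a minimal sketch (the binning scheme and the 0.1/0.25 thresholds are common rules of thumb I am assuming, not this skill's reference implementation), PSI between a reference sample and a live window can be computed as:

```python
import math


def psi(reference, current, bins=10):
    """Population Stability Index between two numeric samples.
    Bins are derived from the reference distribution; a small epsilon
    keeps empty bins from causing division by zero."""
    lo, hi = min(reference), max(reference)
    eps = 1e-6

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp into [0, bins-1] so out-of-range values land in edge bins.
            i = min(bins - 1, max(0, int((x - lo) / (hi - lo) * bins)))
            counts[i] += 1
        return [max(c / len(sample), eps) for c in counts]

    ref, cur = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


baseline = [i / 100 for i in range(1000)]
stable = psi(baseline, [i / 100 for i in range(1000)])
shifted = psi(baseline, [5 + i / 100 for i in range(1000)])
# Common rule of thumb: PSI < 0.1 stable, > 0.25 significant drift.
print(round(stable, 4), shifted > 0.25)
```

In practice a monitor like Evidently computes this per feature on a schedule; the retraining trigger in Pattern 8 would fire when PSI crosses the chosen threshold.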
For comprehensive operational guides, see:
Core Infrastructure:
Serving & APIs:
Operations & Reliability:
Security & Governance:
Advanced Patterns:
Use these as copy-paste starting points for production artifacts:
For loading data into warehouses and pipelines:
Use dlt when:
For SQL transformation (after ingestion), use:
→ ai-ml-data-science skill (SQLMesh templates for staging/intermediate/marts layers)
Resources
Templates
Data
See data/sources.json for curated references on:
For comprehensive data lake/lakehouse patterns (beyond dlt ingestion), see data-lake-platform:
This skill focuses on ML-specific deployment, monitoring, and security. Use data-lake-platform for general-purpose data infrastructure.
When users ask recommendation questions about MLOps tooling, verify recency before answering.
- Check data/sources.json and prefer sources with add_as_web_search: true.
- Cross-reference recommendations against data/sources.json, and recommend validation steps (POC + evals + rollout plan).
After searching, provide:
For adjacent topics, reference these skills:
Use this skill to turn trained models into reliable services, not to derive the model itself.
Weekly Installs: 82
Repository: vasilyu1983/ai-agents-public
GitHub Stars: 49
First Seen: Jan 22, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: cursor (67), codex (66), opencode (65), gemini-cli (63), github-copilot (62), amp (54)