ai-ml-data-science by vasilyu1983/ai-agents-public
npx skills add https://github.com/vasilyu1983/ai-agents-public --skill ai-ml-data-science
This skill turns raw data and questions into validated, documented models ready for production:
Modern emphasis (2026): Feature stores, automated retraining, drift monitoring (Evidently), train-serve parity, and agentic ML loops (plan -> execute -> evaluate -> improve). Tools: LightGBM, CatBoost, scikit-learn, PyTorch, Polars (lazy eval for larger-than-RAM datasets), lakeFS for data versioning.
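To make the drift-monitoring idea concrete, here is a minimal sketch of one widely used drift statistic, the Population Stability Index (PSI), computed between a training-time sample and a serving-time sample of a single numeric feature. This is not Evidently's API; the function name `population_stability_index`, the equal-width binning, and the `eps` smoothing value are all assumptions made for illustration.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a reference (training) sample and a current (serving) sample."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to 1.0 if the column is constant.
    width = (hi - lo) / n_bins or 1.0

    def bin_fractions(values: Sequence[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range are clamped into the edge bins.
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor empty bins at eps so the logarithm below stays finite.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


# Hypothetical data: the serving sample is the training sample shifted by +3.
reference = [i / 100 for i in range(1000)]
shifted = [x + 3.0 for x in reference]

psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
```

An unchanged distribution gives a PSI of zero, and the shifted sample gives a clearly positive value. Any alerting threshold is a project-level choice and should be calibrated on historical data.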
| Task | Tool/Framework | Command | When to Use |
|---|---|---|---|
| EDA & Profiling | Pandas, Great Expectations | df.describe(), ge.validate() | Initial data exploration and quality checks |
| Feature Engineering | Pandas, Polars, Feature Stores | df.transform(), Feast materialization | Creating lag, rolling, categorical features |
| Model Training | Gradient boosting, linear models, scikit-learn | lgb.train(), model.fit() | Strong baselines for tabular ML |
| Hyperparameter Tuning | Optuna, Ray Tune | optuna.create_study(), tune.run() | Optimizing model parameters |
| SQL Transformation | SQLMesh | sqlmesh plan, sqlmesh run | Building staging/intermediate/marts layers |
| Experiment Tracking | MLflow, W&B | mlflow.log_metric(), wandb.log() | Versioning experiments and models |
| Model Evaluation | scikit-learn, custom metrics | metrics.roc_auc_score(), slice analysis | Validating model performance |
For comprehensive data lake/lakehouse patterns (beyond SQLMesh transformation), see data-lake-platform:
This skill focuses on ML feature engineering and modeling. Use data-lake-platform for general-purpose data infrastructure.
For adjacent topics, reference:
User needs ML for: [Problem Type]
- Tabular data?
  - Small-medium (<1M rows)? -> LightGBM (fast, efficient)
  - Large and complex (>1M rows)? -> LightGBM first, then NN if needed
  - High-dim sparse (text, counts)? -> Linear models, then shallow NN
- Time series?
  - Seasonality? -> LightGBM, then see ai-ml-timeseries
  - Long-term dependencies? -> Transformers (see ai-ml-timeseries)
- Text or mixed modalities?
  - LLMs/Transformers -> See ai-llm
- SQL transformations?
  - SQLMesh (staging/intermediate/marts layers)
Rule of thumb: For tabular data, tree-based gradient boosting is a strong baseline, but must be validated against alternatives and constraints.
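The decision tree above can be encoded as a small lookup function, which is handy when an agent or a project template needs a default starting point. This is only a sketch: the function name `suggest_model_family`, its parameters, and the string labels are hypothetical, and treating exactly 1M rows as "large" is an assumption because the tree does not say which branch that case belongs to.

```python
def suggest_model_family(
    data_type: str,
    n_rows: int = 0,
    sparse_high_dim: bool = False,
    long_term_dependencies: bool = False,
) -> str:
    """Return the first model family to try, following the decision tree."""
    if data_type == "tabular":
        if sparse_high_dim:
            return "Linear models, then shallow NN"
        if n_rows < 1_000_000:
            return "LightGBM"
        # Assumption: exactly 1M rows is handled as the large case.
        return "LightGBM first, then NN if needed"
    if data_type == "time_series":
        if long_term_dependencies:
            return "Transformers (see ai-ml-timeseries)"
        return "LightGBM, then see ai-ml-timeseries"
    if data_type in {"text", "mixed"}:
        return "LLMs/Transformers (see ai-llm)"
    if data_type == "sql":
        return "SQLMesh (staging/intermediate/marts layers)"
    raise ValueError(f"Unknown data_type: {data_type!r}")


small_tabular = suggest_model_family("tabular", n_rows=50_000)
large_tabular = suggest_model_family("tabular", n_rows=5_000_000)
sparse_tabular = suggest_model_family("tabular", n_rows=200_000, sparse_high_dim=True)
```

The returned family is a starting baseline only. As the rule of thumb says, it still has to be validated against alternatives and constraints.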
Do
Avoid
Use when: Starting or restructuring any DS/ML project.
Stages:
Detailed guide: EDA Best Practices
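As an illustration of the quality checks that usually open an EDA pass, the sketch below profiles one column for missingness, cardinality, and dominance of the most frequent value. It uses plain Python rather than Pandas or Great Expectations so it stays self-contained; the function name `profile_column` and the reported keys are assumptions, not part of the skill.

```python
from collections import Counter
from typing import Any, Optional, Sequence


def profile_column(values: Sequence[Optional[Any]]) -> dict:
    """Summarize missingness and cardinality for one column (None means missing)."""
    n = len(values)
    present = [v for v in values if v is not None]
    counts = Counter(present)
    top_value, top_count = counts.most_common(1)[0] if counts else (None, 0)
    return {
        "rows": n,
        "missing_rate": (n - len(present)) / n if n else 0.0,
        "n_unique": len(counts),
        "top_value": top_value,
        # Share of non-missing rows taken by the most frequent value.
        "top_share": top_count / len(present) if present else 0.0,
    }


# Hypothetical categorical column with two missing entries.
plan = ["free", "pro", None, "free", "free", None, "team", "free"]
plan_profile = profile_column(plan)
```

A high `missing_rate` or a `top_share` close to 1.0 is typically worth investigating before the column is used as a feature.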
Use when: Designing features before modelling or during model improvement.
By data type:
Key Modern Practice: Use feature stores (Feast, Tecton, Databricks) for versioning, sharing, and train-serve parity.
Detailed guide: Feature Engineering Patterns
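Lag and rolling features are a common source of leakage when the window accidentally includes the current row. The sketch below shows one way to build both so that each feature value depends only on strictly earlier observations. It is written in plain Python for clarity; the helper names `lag_feature` and `rolling_mean_past` are hypothetical, and in practice the same logic would usually be expressed with Pandas or Polars shift and rolling operations.

```python
from typing import Optional, Sequence


def lag_feature(values: Sequence[float], lag: int) -> list[Optional[float]]:
    """Value observed `lag` steps earlier; None where no history exists."""
    if lag < 1:
        raise ValueError("lag must be >= 1 to avoid using the current value")
    return [values[i - lag] if i >= lag else None for i in range(len(values))]


def rolling_mean_past(values: Sequence[float], window: int) -> list[Optional[float]]:
    """Mean of the `window` observations strictly before each position."""
    if window < 1:
        raise ValueError("window must be >= 1")
    out: list[Optional[float]] = []
    for i in range(len(values)):
        if i < window:
            out.append(None)  # not enough history yet
        else:
            past = values[i - window : i]  # excludes values[i]
            out.append(sum(past) / window)
    return out


# Hypothetical daily sales, ordered by time.
sales = [10.0, 12.0, 11.0, 15.0, 14.0]
sales_lag_1 = lag_feature(sales, lag=1)
sales_roll_2 = rolling_mean_past(sales, window=2)
```

Here `sales_lag_1` is `[None, 10.0, 12.0, 11.0, 15.0]` and `sales_roll_2` is `[None, None, 11.0, 11.5, 13.0]`. Computing features the same way at training and serving time is what train-serve parity refers to.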
Use when: Building production ML systems with data quality requirements.
Components:
Detailed guide: Data Contracts & Lineage
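A data contract can be as simple as a list of per-column rules checked before training or scoring. The following is a minimal sketch of that idea, not the skill's own implementation: `ColumnRule`, `validate_rows`, and the example columns are all hypothetical, and a production system would more likely rely on a dedicated validation tool.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Sequence


@dataclass(frozen=True)
class ColumnRule:
    name: str
    dtype: type
    nullable: bool = False
    min_value: Optional[float] = None


def validate_rows(
    rows: Sequence[Mapping[str, Any]], contract: Sequence[ColumnRule]
) -> list[str]:
    """Return human-readable violations; an empty list means the batch passes."""
    violations: list[str] = []
    for i, row in enumerate(rows):
        for rule in contract:
            if rule.name not in row:
                violations.append(f"row {i}: missing column '{rule.name}'")
                continue
            value = row[rule.name]
            if value is None:
                if not rule.nullable:
                    violations.append(f"row {i}: '{rule.name}' is null")
                continue
            if not isinstance(value, rule.dtype):
                violations.append(
                    f"row {i}: '{rule.name}' expected {rule.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )
                continue
            if rule.min_value is not None and value < rule.min_value:
                violations.append(f"row {i}: '{rule.name}' below {rule.min_value}")
    return violations


# Hypothetical contract and batch.
contract = [
    ColumnRule("user_id", int),
    ColumnRule("amount", float, min_value=0.0),
    ColumnRule("country", str, nullable=True),
]
batch = [
    {"user_id": 1, "amount": 9.5, "country": "DE"},
    {"user_id": 2, "amount": -1.0, "country": None},
    {"user_id": "3", "amount": 4.0},
]
violations = validate_rows(batch, contract)
```

For this batch, three violations are reported: a negative `amount`, a string `user_id`, and a missing `country` column. Failing the pipeline on a non-empty result keeps bad data from reaching the model silently.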
Use when: Picking model families and starting experiments.
Decision guide (modern benchmarks):
Detailed guide: Modelling Patterns
Use when: Finalizing a model candidate or handing over to production.
Key components:
Detailed guide: Evaluation Patterns
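Slice analysis means reporting a metric per segment as well as overall, since a good aggregate score can hide a weak segment. The sketch below computes ROC AUC from its pairwise definition (the probability that a random positive is scored above a random negative, with ties counted as one half) and then repeats it per slice. The names `roc_auc` and `auc_by_slice` and the example data are hypothetical; on real projects `sklearn.metrics.roc_auc_score` would normally replace the hand-written metric.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Pairwise ROC AUC; O(n_pos * n_neg), fine for small illustrative data."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_slice(
    y_true: Sequence[int], y_score: Sequence[float], slices: Sequence[Hashable]
) -> dict:
    """AUC computed separately for each slice key."""
    grouped = defaultdict(lambda: ([], []))
    for y, s, key in zip(y_true, y_score, slices):
        grouped[key][0].append(y)
        grouped[key][1].append(s)
    return {key: roc_auc(ys, ss) for key, (ys, ss) in grouped.items()}


# Hypothetical labels, scores, and a segment column.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_score = [0.9, 0.1, 0.8, 0.2, 0.4, 0.6, 0.7, 0.3]
segment = ["web"] * 4 + ["mobile"] * 4

overall_auc = roc_auc(y_true, y_score)
slice_auc = auc_by_slice(y_true, y_score, segment)
```

In this example the overall AUC is 0.9375, while the slices show 1.0 for `web` and 0.75 for `mobile`, which is the kind of gap an evaluation report should surface.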
Use when: Ensuring experiments are reproducible and production-ready.
Modern MLOps (CI/CD/CT/CM):
Versioning:
Detailed guide: Reproducibility Checklist
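Two small habits go a long way towards reproducible experiments: fingerprinting the exact configuration of a run, and deriving every random choice from an explicit seed. The sketch below shows both using only the standard library. The helper names `config_fingerprint` and `seeded_split` and the example config keys are assumptions for illustration; an experiment tracker such as MLflow would typically store the resulting values.

```python
import hashlib
import json
import random


def config_fingerprint(config: dict) -> str:
    """Short, order-independent hash of a JSON-serializable config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def seeded_split(n_rows: int, test_fraction: float, seed: int) -> tuple:
    """Deterministic train/test row indices for a given seed."""
    rng = random.Random(seed)  # local generator, no global state touched
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


# Hypothetical run configuration; key order should not change the fingerprint.
config_a = {"model": "lightgbm", "learning_rate": 0.05, "seed": 42}
config_b = {"seed": 42, "learning_rate": 0.05, "model": "lightgbm"}

fingerprint_a = config_fingerprint(config_a)
fingerprint_b = config_fingerprint(config_b)
train_idx, test_idx = seeded_split(n_rows=100, test_fraction=0.2, seed=config_a["seed"])
train_again, test_again = seeded_split(n_rows=100, test_fraction=0.2, seed=config_a["seed"])
```

Logging the fingerprint next to the metrics makes it possible to tell later whether two runs really used the same settings, and the seeded split guarantees the same rows end up in the test set on every rerun.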
Use when: Managing real-time features and streaming pipelines.
Components:
Detailed guide: Feature Freshness & Streaming
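Feature freshness can be monitored by comparing each feature's last update time against a maximum allowed age. The following sketch illustrates the check under simple assumptions: the function name `stale_features`, the feature names, and the age budgets are hypothetical, and the timestamps would in practice come from the feature store or the streaming pipeline's metadata.

```python
from datetime import datetime, timedelta, timezone


def stale_features(
    last_updated: dict, max_age: dict, now: datetime
) -> list:
    """Names of features whose last update is older than their allowed age."""
    return sorted(
        name for name, ts in last_updated.items() if now - ts > max_age[name]
    )


# Fixed reference time so the example is deterministic.
now = datetime(2026, 1, 23, 12, 0, tzinfo=timezone.utc)

last_updated = {
    "clicks_last_5m": now - timedelta(minutes=12),
    "orders_last_24h": now - timedelta(hours=2),
    "account_age_days": now - timedelta(days=1, hours=3),
}
max_age = {
    "clicks_last_5m": timedelta(minutes=10),
    "orders_last_24h": timedelta(hours=6),
    "account_age_days": timedelta(days=1),
}

stale = stale_features(last_updated, max_age, now)
```

Here `clicks_last_5m` and `account_age_days` exceed their budgets and would be flagged. Whether a stale feature should block serving or only raise an alert depends on how sensitive the model is to that feature.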
Use when: Capturing production signals and implementing continuous improvement.
Components:
Detailed guide: Production Feedback Loops
For comprehensive operational patterns and checklists, see:
Use these as copy-paste starting points:
- assets/project/template-standard.md
- assets/project/template-quick.md
- assets/features/template-feature-engineering.md
- assets/eda/template-eda.md
- assets/evaluation/template-evaluation-report.md
- assets/evaluation/template-model-card.md
- assets/review/experiment-review-template.md

For SQL-based data transformation and feature engineering:

- ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-project.md
- ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-model.md (FULL, INCREMENTAL, VIEW)
- ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-incremental.md
- ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-dag.md
- ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-testing.md

Use SQLMesh when:
For data ingestion (loading raw data), use:
Resources
Templates
Data
See data/sources.json for curated foundational and implementation references:
Use this skill to execute data science projects end-to-end: concrete checklists, patterns, and templates, not theory.
First Seen
Jan 23, 2026