csv-analyzer by casper-studios/casper-marketplace
npx skills add https://github.com/casper-studios/casper-marketplace --skill csv-analyzer
Comprehensive CSV data analysis and visualization engine. Run the script, then use this guide to interpret results and provide insights to users.
cd ~/.claude/skills/csv-analyzer/scripts
export $(grep -v '^#' /path/to/project/.env | xargs 2>/dev/null)
python3 analyze_csv.py /path/to/data.csv
IMPORTANT: Choose charts based on what the user needs to understand:
What is the user trying to understand?
│
├── "What does my data look like?" (Overview)
│ └── Run with defaults → overview_dashboard.png
│
├── "Is my data clean?" (Quality)
│ └── Check: quality_score, missing_values, duplicates
│ └── Show: missing_values.png if problems exist
│
├── "What's the distribution?" (Single Variable)
│ ├── Numeric → numeric_distributions.png (histogram + KDE)
│ ├── Categorical → categorical_distributions.png (bar chart)
│ └── Time-based → time_series.png
│
├── "Are there outliers?" (Anomalies)
│ └── box_plots.png → points beyond whiskers are outliers
│
├── "How are variables related?" (Relationships)
│ ├── 2 numeric vars → correlation_heatmap.png
│ ├── 2-6 numeric vars → pairplot.png (scatter matrix)
│ ├── Numeric vs Categorical → violin_plot.png
│ └── All numeric → correlation_heatmap.png
│
└── "Can I predict X from Y?" (Predictive)
└── correlation_heatmap.png → |r| > 0.5 suggests predictive power
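The whisker rule in the tree above is the standard 1.5×IQR fence (the same default matplotlib uses for box-plot whiskers). A minimal sketch of that check in pandas; the function name and sample data are illustrative, not part of analyze_csv.py:

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return values falling outside the k*IQR whisker fences."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return s[(s < lo) | (s > hi)]

s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])  # 100 is an obvious outlier
print(iqr_outliers(s).tolist())  # → [100]
```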
| Score | Grade | What to Tell User |
|---|---|---|
| 90-100 | A | "Your data is excellent quality - ready for analysis" |
| 80-89 | B | "Good quality data with minor issues worth noting" |
| 70-79 | C | "Moderate quality - address missing values before critical analysis" |
| 60-69 | D | "Significant quality issues - recommend data cleaning first" |
| <60 | F | "Critical issues - data needs substantial cleaning" |
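The score itself is computed inside analyze_csv.py; how it weights missing values, duplicates, and type issues is internal to the script. A hedged sketch of just the score-to-grade mapping implied by the table (the helper name is illustrative):

```python
def grade_quality(score: float) -> tuple:
    """Map a 0-100 quality score to the letter grade and user-facing message."""
    bands = [
        (90, "A", "Your data is excellent quality - ready for analysis"),
        (80, "B", "Good quality data with minor issues worth noting"),
        (70, "C", "Moderate quality - address missing values before critical analysis"),
        (60, "D", "Significant quality issues - recommend data cleaning first"),
    ]
    for cutoff, grade, message in bands:
        if score >= cutoff:
            return grade, message
    return "F", "Critical issues - data needs substantial cleaning"

print(grade_quality(85))  # → ('B', 'Good quality data with minor issues worth noting')
```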
| \|r\| Value | Strength | What to Say |
|---|---|---|
| 0.9 - 1.0 | Very Strong | "X and Y are very strongly related - almost deterministic" |
| 0.7 - 0.9 | Strong | "X and Y have a strong relationship - X could help predict Y" |
| 0.5 - 0.7 | Moderate | "X and Y are moderately correlated - some predictive value" |
| 0.3 - 0.5 | Weak | "X and Y have a weak relationship - limited predictive power" |
| 0.0 - 0.3 | Negligible | "X and Y appear unrelated" |
Sign matters: a positive r means the variables rise and fall together; a negative r means one rises as the other falls. Strength is judged on |r|, not the sign.
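Both strength and sign can be turned into a canned phrase. A minimal sketch following the bands above; the helper is illustrative, not part of the script:

```python
def describe_correlation(r: float) -> str:
    """Label correlation strength per the table; the sign gives direction."""
    a = abs(r)
    if a >= 0.9:
        strength = "very strongly related - almost deterministic"
    elif a >= 0.7:
        strength = "strongly related"
    elif a >= 0.5:
        strength = "moderately correlated"
    elif a >= 0.3:
        strength = "weakly related"
    else:
        return "X and Y appear unrelated"
    direction = "rise together" if r > 0 else "move in opposite directions"
    return f"X and Y are {strength} and {direction} (r={r:.2f})"

print(describe_correlation(-0.82))
```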
| Skewness | Distribution Shape | Recommendation |
|---|---|---|
| < -1 | Heavy left tail | "Most values are high, with some very low outliers" |
| -1 to -0.5 | Mild left skew | "Slightly more low outliers than high" |
| -0.5 to 0.5 | Symmetric | "Nicely balanced distribution - good for most analyses" |
| 0.5 to 1 | Mild right skew | "Slightly more high outliers than low" |
| > 1 | Heavy right tail | "Most values are low, with some very high outliers. Consider log transform for modeling." |
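The bands above can be checked with scipy, which the skill already depends on. A sketch; the helper name and sample data are illustrative:

```python
import numpy as np
from scipy.stats import skew

def describe_skew(values) -> str:
    """Bucket sample skewness into the bands from the table above."""
    g = skew(np.asarray(values, dtype=float))
    if g < -1:
        return "heavy left tail"
    if g < -0.5:
        return "mild left skew"
    if g <= 0.5:
        return "symmetric"
    if g <= 1:
        return "mild right skew"
    return "heavy right tail - consider a log transform"

# Income-like data: mostly small values with a few huge ones
print(describe_skew([30, 32, 35, 40, 41, 45, 50, 400]))
```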
When reporting outliers: give the count per column and the affected value range, and say whether they look like data errors or genuine extreme values.
After running analysis, provide insights in this order:
"Your dataset has [rows] records and [cols] columns:
- [n] numeric columns: [list top 3]
- [n] categorical columns: [list top 3]
- Data quality score: [score]/100 ([grade])"
If quality issues exist:
"I noticed some data quality concerns:
- [X]% missing values in [column] - [recommend: drop/impute/investigate]
- [N] duplicate rows detected - [recommend: keep first/remove all/investigate]"
If strong correlations found:
"Interesting relationships I found:
- [col1] and [col2] are strongly correlated (r=[value]) - [interpretation]
- This suggests [actionable insight]"
If outliers detected:
"I detected outliers in [columns]:
- [column]: [n] values beyond normal range ([min outlier] to [max outlier])
- These could be [data errors / genuine extremes / worth investigating]"
If skewed distributions:
"[Column] has a [right/left]-skewed distribution:
- Most values cluster around [median]
- But there are extreme values up to [max]
- For modeling, consider [log transform / robust methods]"
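The overview template can be filled straight from pandas metadata. A sketch, assuming numeric/categorical columns are classified via select_dtypes (the script's actual classification may differ):

```python
import pandas as pd

def overview_message(df: pd.DataFrame) -> str:
    """Fill the dataset-overview template from DataFrame metadata."""
    numeric = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(include=["object", "category"]).columns.tolist()
    return (
        f"Your dataset has {len(df)} records and {df.shape[1]} columns:\n"
        f"- {len(numeric)} numeric columns: {', '.join(numeric[:3])}\n"
        f"- {len(categorical)} categorical columns: {', '.join(categorical[:3])}"
    )

df = pd.DataFrame({"age": [25, 31], "income": [40000, 52000], "city": ["Oslo", "Lima"]})
print(overview_message(df))
```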
| Finding | Recommendation |
|---|---|
| Missing >20% in column | "Consider dropping this column or investigating why it's missing" |
| Missing <5% scattered | "Safe to impute with median (numeric) or mode (categorical)" |
| High correlation (>0.9) | "These columns may be redundant - consider keeping only one" |
| Many outliers | "Use robust statistics (median instead of mean) or investigate data collection" |
| Highly skewed | "Apply log transform before linear modeling" |
| Low quality score | "Prioritize data cleaning before analysis" |
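The two missing-value rows of the table can be automated with pandas. A hedged sketch; the helper and thresholds follow the table above and are not part of analyze_csv.py:

```python
import pandas as pd

def recommend_missing(df: pd.DataFrame) -> dict:
    """Apply the missing-value thresholds from the table above."""
    recs = {}
    for col, frac in df.isna().mean().items():
        if frac > 0.20:
            recs[col] = "consider dropping or investigating why it's missing"
        elif 0 < frac < 0.05:
            kind = "median" if pd.api.types.is_numeric_dtype(df[col]) else "mode"
            recs[col] = f"safe to impute with {kind}"
    return recs

# "a" has 4% missing (impute), "b" has 40% missing (drop or investigate)
df = pd.DataFrame({"a": [1.0] * 24 + [None], "b": [None] * 10 + ["x"] * 15})
print(recommend_missing(df))
```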
When user asks for a "dashboard" or "comprehensive view":
# Generate all visualizations
python3 analyze_csv.py data.csv --format html --max-charts 10
python3 analyze_csv.py data.csv                                               # full analysis with defaults
python3 analyze_csv.py data.csv --format markdown --max-charts 10             # markdown report, up to 10 charts
python3 analyze_csv.py data.csv --no-charts                                   # statistics only, skip charts
python3 analyze_csv.py huge.csv --sample 50000                                # sample large files to 50,000 rows
python3 analyze_csv.py data.csv --date-columns created_at updated_at          # parse specific columns as dates
python3 analyze_csv.py data.csv --format json --no-charts                     # machine-readable output
python3 analyze_csv.py data.csv --output-dir /path/to/project/.tmp/analysis   # custom chart directory
Then present charts in the order below:
| Chart | When to Show | How to Describe |
|---|---|---|
| overview_dashboard.png | Always for first look | "Here's a bird's eye view of your data" |
| missing_values.png | If missing data exists | "This shows where your data has gaps" |
| numeric_distributions.png | When exploring distributions | "This shows how your numeric values are spread out" |
| box_plots.png | When checking for outliers | "The dots outside the boxes are potential outliers" |
| correlation_heatmap.png | When exploring relationships | "Darker colors = stronger relationships" |
| categorical_distributions.png | For category analysis | "This shows the breakdown of your categories" |
| time_series.png | For temporal data | "Here's how your data changes over time" |
| pairplot.png | For multivariate exploration | "Each cell shows how two variables relate" |
| violin_plot.png | Comparing groups | "This shows how distributions differ across groups" |
| User Says | Action |
|---|---|
| "Analyze this CSV" | Run full analysis, show overview + key insights |
| "Is my data clean?" | Focus on quality_score, missing values, duplicates |
| "Find patterns" | Show correlation_heatmap, highlight strong correlations |
| "Are there outliers?" | Show box_plots, list outlier counts per column |
| "Compare X across Y" | Generate violin_plot for numeric X vs categorical Y |
| "Show me trends" | Generate time_series if datetime column exists |
| "Create a dashboard" | Generate all charts, present organized summary |
| "What should I clean?" | List columns with missing >5%, duplicates, outliers |
Charts are saved to:
~/.claude/skills/csv-analyzer/scripts/.tmp/csv_analysis/
Or pass --output-dir /path/to/project/.tmp/analysis to write them elsewhere. Always copy charts to the user's project .tmp for visibility:
cp ~/.claude/skills/csv-analyzer/scripts/.tmp/csv_analysis/*.png /path/to/project/.tmp/csv_analysis/
Free - runs entirely locally using pandas, matplotlib, seaborn, scipy.
pip install pandas matplotlib seaborn scipy numpy
Weekly Installs: 95
GitHub Stars: 9
First Seen: 14 days ago
Security Audits: Gen Agent Trust Hub - Pass; Socket - Pass; Snyk - Pass
Installed on: opencode, github-copilot, codex, kimi-cli, gemini-cli, cursor