csv-analyzer by casper-studios/casper-marketplace
npx skills add https://github.com/casper-studios/casper-marketplace --skill csv-analyzer
Comprehensive CSV data analysis and visualization engine. Run the script, then use this guide to interpret results and provide insights to users.
cd ~/.claude/skills/csv-analyzer/scripts
export $(grep -v '^#' /path/to/project/.env | xargs 2>/dev/null)
python3 analyze_csv.py /path/to/data.csv
IMPORTANT: Choose charts based on what the user needs to understand:
What is the user trying to understand?
│
├── "What does my data look like?" (Overview)
│ └── Run with defaults → overview_dashboard.png
│
├── "Is my data clean?" (Quality)
│ └── Check: quality_score, missing_values, duplicates
│ └── Show: missing_values.png if problems exist
│
├── "What's the distribution?" (Single Variable)
│ ├── Numeric → numeric_distributions.png (histogram + KDE)
│ ├── Categorical → categorical_distributions.png (bar chart)
│ └── Time-based → time_series.png
│
├── "Are there outliers?" (Anomalies)
│ └── box_plots.png → points beyond whiskers are outliers
│
├── "How are variables related?" (Relationships)
│ ├── 2 numeric vars → correlation_heatmap.png
│ ├── 2-6 numeric vars → pairplot.png (scatter matrix)
│ ├── Numeric vs Categorical → violin_plot.png
│ └── All numeric → correlation_heatmap.png
│
└── "Can I predict X from Y?" (Predictive)
└── correlation_heatmap.png → |r| > 0.5 suggests predictive power
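The whisker rule in the tree above is the standard 1.5×IQR fence (the same default matplotlib uses for box-plot whiskers). A minimal sketch of that check in pandas; the function name and sample data are illustrative, not part of analyze_csv.py:

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return values falling outside the k*IQR whisker fences."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return s[(s < lo) | (s > hi)]

s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])  # 100 is an obvious outlier
print(iqr_outliers(s).tolist())  # → [100]
```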
| Score | Grade | What to Tell User |
|---|---|---|
| 90-100 | A | "Your data is excellent quality - ready for analysis" |
| 80-89 | B | "Good quality data with minor issues worth noting" |
| 70-79 | C | "Moderate quality - address missing values before critical analysis" |
| 60-69 | D | "Significant quality issues - recommend data cleaning first" |
| <60 | F | "Critical issues - data needs substantial cleaning" |
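The score itself is computed inside analyze_csv.py; how it weights missing values, duplicates, and type issues is internal to the script. A hedged sketch of just the score-to-grade mapping implied by the table (the helper name is illustrative):

```python
def grade_quality(score: float) -> tuple:
    """Map a 0-100 quality score to the letter grade and user-facing message."""
    bands = [
        (90, "A", "Your data is excellent quality - ready for analysis"),
        (80, "B", "Good quality data with minor issues worth noting"),
        (70, "C", "Moderate quality - address missing values before critical analysis"),
        (60, "D", "Significant quality issues - recommend data cleaning first"),
    ]
    for cutoff, grade, message in bands:
        if score >= cutoff:
            return grade, message
    return "F", "Critical issues - data needs substantial cleaning"

print(grade_quality(85))  # → ('B', 'Good quality data with minor issues worth noting')
```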
| \|r\| Value | Strength | What to Say |
|---|---|---|
| 0.9 - 1.0 | Very Strong | "X and Y are very strongly related - almost deterministic" |
| 0.7 - 0.9 | Strong | "X and Y have a strong relationship - X could help predict Y" |
| 0.5 - 0.7 | Moderate | "X and Y are moderately correlated - some predictive value" |
| 0.3 - 0.5 | Weak | "X and Y have a weak relationship - limited predictive power" |
| 0.0 - 0.3 | Negligible | "X and Y appear unrelated" |
Sign matters: a positive r means the variables rise and fall together; a negative r means one rises as the other falls. Strength is judged on |r|, not the sign.
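Both strength and sign can be turned into a canned phrase. A minimal sketch following the bands above; the helper is illustrative, not part of the script:

```python
def describe_correlation(r: float) -> str:
    """Label correlation strength per the table; the sign gives direction."""
    a = abs(r)
    if a >= 0.9:
        strength = "very strongly related - almost deterministic"
    elif a >= 0.7:
        strength = "strongly related"
    elif a >= 0.5:
        strength = "moderately correlated"
    elif a >= 0.3:
        strength = "weakly related"
    else:
        return "X and Y appear unrelated"
    direction = "rise together" if r > 0 else "move in opposite directions"
    return f"X and Y are {strength} and {direction} (r={r:.2f})"

print(describe_correlation(-0.82))
```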
| Skewness | Distribution Shape | Recommendation |
|---|---|---|
| < -1 | Heavy left tail | "Most values are high, with some very low outliers" |
| -1 to -0.5 | Mild left skew | "Slightly more low outliers than high" |
| -0.5 to 0.5 | Symmetric | "Nicely balanced distribution - good for most analyses" |
| 0.5 to 1 | Mild right skew | "Slightly more high outliers than low" |
| > 1 | Heavy right tail | "Most values are low, with some very high outliers. Consider log transform for modeling." |
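The bands above can be checked with scipy, which the skill already depends on. A sketch; the helper name and sample data are illustrative:

```python
import numpy as np
from scipy.stats import skew

def describe_skew(values) -> str:
    """Bucket sample skewness into the bands from the table above."""
    g = skew(np.asarray(values, dtype=float))
    if g < -1:
        return "heavy left tail"
    if g < -0.5:
        return "mild left skew"
    if g <= 0.5:
        return "symmetric"
    if g <= 1:
        return "mild right skew"
    return "heavy right tail - consider a log transform"

# Income-like data: mostly small values with a few huge ones
print(describe_skew([30, 32, 35, 40, 41, 45, 50, 400]))
```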
When reporting outliers: give the count per column and the affected value range, and say whether they look like data errors or genuine extreme values.
After running analysis, provide insights in this order:
"Your dataset has [rows] records and [cols] columns:
- [n] numeric columns: [list top 3]
- [n] categorical columns: [list top 3]
- Data quality score: [score]/100 ([grade])"
If quality issues exist:
"I noticed some data quality concerns:
- [X]% missing values in [column] - [recommend: drop/impute/investigate]
- [N] duplicate rows detected - [recommend: keep first/remove all/investigate]"
If strong correlations found:
"Interesting relationships I found:
- [col1] and [col2] are strongly correlated (r=[value]) - [interpretation]
- This suggests [actionable insight]"
If outliers detected:
"I detected outliers in [columns]:
- [column]: [n] values beyond normal range ([min outlier] to [max outlier])
- These could be [data errors / genuine extremes / worth investigating]"
If skewed distributions:
"[Column] has a [right/left]-skewed distribution:
- Most values cluster around [median]
- But there are extreme values up to [max]
- For modeling, consider [log transform / robust methods]"
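The overview template can be filled straight from pandas metadata. A sketch, assuming numeric/categorical columns are classified via select_dtypes (the script's actual classification may differ):

```python
import pandas as pd

def overview_message(df: pd.DataFrame) -> str:
    """Fill the dataset-overview template from DataFrame metadata."""
    numeric = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(include=["object", "category"]).columns.tolist()
    return (
        f"Your dataset has {len(df)} records and {df.shape[1]} columns:\n"
        f"- {len(numeric)} numeric columns: {', '.join(numeric[:3])}\n"
        f"- {len(categorical)} categorical columns: {', '.join(categorical[:3])}"
    )

df = pd.DataFrame({"age": [25, 31], "income": [40000, 52000], "city": ["Oslo", "Lima"]})
print(overview_message(df))
```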
| Finding | Recommendation |
|---|---|
| Missing >20% in column | "Consider dropping this column or investigating why it's missing" |
| Missing <5% scattered | "Safe to impute with median (numeric) or mode (categorical)" |
| High correlation (>0.9) | "These columns may be redundant - consider keeping only one" |
| Many outliers | "Use robust statistics (median instead of mean) or investigate data collection" |
| Highly skewed | "Apply log transform before linear modeling" |
| Low quality score | "Prioritize data cleaning before analysis" |
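The two missing-value rows of the table can be automated with pandas. A hedged sketch; the helper and thresholds follow the table above and are not part of analyze_csv.py:

```python
import pandas as pd

def recommend_missing(df: pd.DataFrame) -> dict:
    """Apply the missing-value thresholds from the table above."""
    recs = {}
    for col, frac in df.isna().mean().items():
        if frac > 0.20:
            recs[col] = "consider dropping or investigating why it's missing"
        elif 0 < frac < 0.05:
            kind = "median" if pd.api.types.is_numeric_dtype(df[col]) else "mode"
            recs[col] = f"safe to impute with {kind}"
    return recs

# "a" has 4% missing (impute), "b" has 40% missing (drop or investigate)
df = pd.DataFrame({"a": [1.0] * 24 + [None], "b": [None] * 10 + ["x"] * 15})
print(recommend_missing(df))
```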
When user asks for a "dashboard" or "comprehensive view":
# Generate all visualizations
python3 analyze_csv.py data.csv --format html --max-charts 10
python3 analyze_csv.py data.csv                                               # full analysis with defaults
python3 analyze_csv.py data.csv --format markdown --max-charts 10             # markdown report, up to 10 charts
python3 analyze_csv.py data.csv --no-charts                                   # statistics only, skip charts
python3 analyze_csv.py huge.csv --sample 50000                                # sample large files to 50,000 rows
python3 analyze_csv.py data.csv --date-columns created_at updated_at          # parse specific columns as dates
python3 analyze_csv.py data.csv --format json --no-charts                     # machine-readable output
python3 analyze_csv.py data.csv --output-dir /path/to/project/.tmp/analysis   # custom chart directory
Then present charts in the order below:
| Chart | When to Show | How to Describe |
|---|---|---|
| overview_dashboard.png | Always for first look | "Here's a bird's eye view of your data" |
| missing_values.png | If missing data exists | "This shows where your data has gaps" |
| numeric_distributions.png | When exploring distributions | "This shows how your numeric values are spread out" |
| box_plots.png | When checking for outliers | "The dots outside the boxes are potential outliers" |
| correlation_heatmap.png | When exploring relationships | "Darker colors = stronger relationships" |
| categorical_distributions.png | For category analysis | "This shows the breakdown of your categories" |
| time_series.png | For temporal data | "Here's how your data changes over time" |
| pairplot.png | For multivariate exploration | "Each cell shows how two variables relate" |
| violin_plot.png | Comparing groups | "This shows how distributions differ across groups" |
| User Says | Action |
|---|---|
| "Analyze this CSV" | Run full analysis, show overview + key insights |
| "Is my data clean?" | Focus on quality_score, missing values, duplicates |
| "Find patterns" | Show correlation_heatmap, highlight strong correlations |
| "Are there outliers?" | Show box_plots, list outlier counts per column |
| "Compare X across Y" | Generate violin_plot for numeric X vs categorical Y |
| "Show me trends" | Generate time_series if datetime column exists |
| "Create a dashboard" | Generate all charts, present organized summary |
| "What should I clean?" | List columns with missing >5%, duplicates, outliers |
Charts are saved to:
~/.claude/skills/csv-analyzer/scripts/.tmp/csv_analysis/
Or pass --output-dir /path/to/project/.tmp/analysis to write them elsewhere. Always copy charts to the user's project .tmp for visibility:
cp ~/.claude/skills/csv-analyzer/scripts/.tmp/csv_analysis/*.png /path/to/project/.tmp/csv_analysis/
Free - runs entirely locally using pandas, matplotlib, seaborn, scipy.
pip install pandas matplotlib seaborn scipy numpy
Weekly Installs: 95
GitHub Stars: 9
First Seen: 14 days ago
Security Audits: Gen Agent Trust Hub - Pass; Socket - Pass; Snyk - Pass
Installed on: opencode, github-copilot, codex, kimi-cli, gemini-cli, cursor