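Exploratory data analysis tool for scientific data: automatically detects 200+ formats and generates EDA reports and visualization recommendations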

exploratory-data-analysis by davila7/claude-code-templates

801 weekly installs

23,500 GitHub Stars

GitHub

Install command

npx skills add https://github.com/davila7/claude-code-templates --skill exploratory-data-analysis

Data visualization, automation, data analysis

Exploratory Data Analysis

Overview

Perform comprehensive exploratory data analysis (EDA) on scientific data files across multiple domains. This skill provides automated file type detection, format-specific analysis, and data quality assessment, and it generates detailed Markdown reports suitable for documentation and downstream analysis planning.

Key Capabilities:

Automatic detection and analysis of 200+ scientific file formats
Comprehensive format-specific metadata extraction
Data quality and integrity assessment
Statistical summaries and distributions
Visualization recommendations
Downstream analysis suggestions
Markdown report generation

When to Use This Skill

Use this skill when:

User provides a path to a scientific data file for analysis
User asks to "explore", "analyze", or "summarize" a data file
User wants to understand the structure and content of scientific data
User needs a comprehensive report of a dataset before analysis
User wants to assess data quality or completeness
User asks what type of analysis is appropriate for a file

Supported File Categories

The skill has comprehensive coverage of scientific file formats organized into six major categories:

1. Chemistry and Molecular Formats (60+ extensions)

Structure files, computational chemistry outputs, molecular dynamics trajectories, and chemical databases.

File types include: .pdb, .cif, .mol, .mol2, .sdf, .xyz, .smi, .gro, .log, .fchk, .cube, .dcd, .xtc, .trr, .prmtop, .psf, and more.

Reference file: references/chemistry_molecular_formats.md

2. Bioinformatics and Genomics Formats (50+ extensions)

Sequence data, alignments, annotations, variants, and expression data.

File types include: .fasta, .fastq, .sam, .bam, .vcf, .bed, .gff, .gtf, .bigwig, .h5ad, .loom, .counts, .mtx, and more.

Reference file: references/bioinformatics_genomics_formats.md

3. Microscopy and Imaging Formats (45+ extensions)

Microscopy images, medical imaging, whole slide imaging, and electron microscopy.

File types include: .tif, .nd2, .lif, .czi, .ims, .dcm, .nii, .mrc, .dm3, .vsi, .svs, .ome.tiff, and more.

Reference file: references/microscopy_imaging_formats.md

4. Spectroscopy and Analytical Chemistry Formats (35+ extensions)

NMR, mass spectrometry, IR/Raman, UV-Vis, X-ray, chromatography, and other analytical techniques.

File types include: .fid, .mzML, .mzXML, .raw, .mgf, .spc, .jdx, .xy, .cif (crystallography), .wdf, and more.

Reference file: references/spectroscopy_analytical_formats.md

5. Proteomics and Metabolomics Formats (30+ extensions)

Mass spec proteomics, metabolomics, lipidomics, and multi-omics data.

File types include: .mzML, .pepXML, .protXML, .mzid, .mzTab, .sky, .mgf, .msp, .h5ad, and more.

Reference file: references/proteomics_metabolomics_formats.md

6. General Scientific Data Formats (30+ extensions)

Arrays, tables, hierarchical data, compressed archives, and common scientific formats.

File types include: .npy, .npz, .csv, .xlsx, .json, .hdf5, .zarr, .parquet, .mat, .fits, .nc, .xml, and more.

Reference file: references/general_scientific_formats.md

Workflow

Step 1: File Type Detection

When a user provides a file path, first identify the file type:

Extract the file extension
Look up the extension in the appropriate reference file
Identify the file category and format description
Load format-specific information

Example:

User: "Analyze data.fastq"
→ Extension: .fastq
→ Category: bioinformatics_genomics
→ Format: FASTQ Format (sequence data with quality scores)
→ Reference: references/bioinformatics_genomics_formats.md
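
The lookup above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual detector: the `EXTENSION_CATEGORIES` table and the `detect_category` helper are hypothetical names, and the table holds only a handful of the extensions listed earlier.

```python
from pathlib import Path

# Hypothetical excerpt of an extension-to-category table; the skill itself
# keeps this information in its reference files.
EXTENSION_CATEGORIES = {
    ".pdb": "chemistry_molecular",
    ".fastq": "bioinformatics_genomics",
    ".nd2": "microscopy_imaging",
    ".ome.tiff": "microscopy_imaging",
    ".csv": "general_scientific",
}


def detect_category(filepath):
    """Return (extension, category), or (extension, None) if the extension is unknown."""
    suffixes = [s.lower() for s in Path(filepath).suffixes]
    # Try the longest compound suffix first so ".ome.tiff" wins over ".tiff".
    for start in range(len(suffixes)):
        candidate = "".join(suffixes[start:])
        if candidate in EXTENSION_CATEGORIES:
            return candidate, EXTENSION_CATEGORIES[candidate]
    extension = suffixes[-1] if suffixes else ""
    return extension, None


print(detect_category("data.fastq"))
```

Returning `None` for the category, rather than raising, leaves room for the fallback steps described under Unknown File Types.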

Step 2: Load Format-Specific Information

Based on the file type, read the corresponding reference file to understand:

Typical Data: What kind of data this format contains
Use Cases: Common applications for this format
Python Libraries: How to read the file in Python
EDA Approach: What analyses are appropriate for this data type

Search the reference file for the specific extension (e.g., search for "### .fastq" in bioinformatics_genomics_formats.md).

Step 3: Perform Data Analysis

Use the scripts/eda_analyzer.py script OR implement custom analysis:

Option A: Use the analyzer script

# The script automatically:
# 1. Detects file type
# 2. Loads reference information
# 3. Performs format-specific analysis
# 4. Generates markdown report

python scripts/eda_analyzer.py <filepath> [output.md]

Option B: Custom analysis in the conversation

Based on the format information from the reference file, perform appropriate analysis:

For tabular data (CSV, TSV, Excel):

Load with pandas
Check dimensions, data types
Analyze missing values
Calculate summary statistics
Identify outliers
Check for duplicates
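
As a rough sketch of the tabular checklist above, the following collects each item into a dictionary. The `summarize_table` helper and the demo DataFrame are illustrative assumptions, and the 1.5 × IQR rule is just one common way to flag outliers, not necessarily the one the analyzer script uses.

```python
import pandas as pd


def summarize_table(df):
    """Collect basic EDA facts about a DataFrame into a plain dict."""
    numeric = df.select_dtypes(include="number")
    # Tukey's rule: values beyond 1.5 * IQR from the quartiles count as outliers.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "outliers": outliers.to_dict(),
        "describe": numeric.describe(),
    }


# Made-up demo data: one duplicated row, one missing value, one extreme value.
df = pd.DataFrame({
    "sample": ["a", "b", "c", "d", "d", "e"],
    "value": [1.0, 1.2, 0.9, 1.1, 1.1, 50.0],
    "batch": [1, 1, 2, 2, 2, None],
})
summary = summarize_table(df)
print(summary["missing"], summary["duplicate_rows"], summary["outliers"])
```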

For sequence data (FASTA, FASTQ):

Count sequences
Analyze length distributions
Calculate GC content
Assess quality scores (FASTQ)
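
For FASTQ, these four measurements can be computed in a single streaming pass. The sketch below uses only the standard library and assumes four-line records (no wrapped sequences) and Phred+33 quality encoding; `fastq_stats` is a hypothetical helper name, and Biopython's `SeqIO` remains the more robust choice for real data.

```python
import io
from collections import Counter


def fastq_stats(handle, phred_offset=33):
    """Stream a FASTQ handle; return read count, lengths, GC fraction, mean quality."""
    lengths = Counter()
    reads = bases = gc = qual_sum = qual_chars = 0
    lines = (line.rstrip("\r\n") for line in handle)
    for header in lines:
        if not header.startswith("@"):
            continue  # skip blank or unexpected lines between records
        seq = next(lines, "")
        next(lines, "")  # '+' separator line
        qual = next(lines, "")
        reads += 1
        lengths[len(seq)] += 1
        bases += len(seq)
        gc += sum(seq.upper().count(base) for base in "GC")
        qual_sum += sum(ord(ch) - phred_offset for ch in qual)
        qual_chars += len(qual)
    return {
        "reads": reads,
        "length_distribution": dict(lengths),
        "gc_fraction": gc / bases if bases else 0.0,
        "mean_quality": qual_sum / qual_chars if qual_chars else 0.0,
    }


# Two made-up reads standing in for a real file.
demo = io.StringIO("@r1\nGATTACA\n+\nIIIIIII\n@r2\nGGCC\n+\n!!!!\n")
stats = fastq_stats(demo)
print(stats)
```

Because the function reads one record at a time, it can be pointed at an open file of any size without loading every read into memory.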

For images (TIFF, ND2, CZI):

Check dimensions (X, Y, Z, C, T)
Analyze bit depth and value range
Extract metadata (channels, timestamps, spatial calibration)
Calculate intensity statistics
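
Once an image has been loaded as a NumPy array, the intensity checks above might look like the following sketch. It assumes the channel axis is known; the `channel_intensity_stats` helper and the synthetic image are illustrative only, and loading the file itself is left to a format-specific reader.

```python
import numpy as np


def channel_intensity_stats(image, channel_axis=0):
    """Per-channel min/max/mean plus the fraction of pixels at the dtype maximum."""
    moved = np.moveaxis(image, channel_axis, 0)
    flat = moved.reshape(moved.shape[0], -1)
    is_integer = np.issubdtype(image.dtype, np.integer)
    dtype_max = np.iinfo(image.dtype).max if is_integer else None
    stats = []
    for index, pixels in enumerate(flat):
        stats.append({
            "channel": index,
            "min": pixels.min().item(),
            "max": pixels.max().item(),
            "mean": float(pixels.mean()),
            # Saturated pixels hint at clipping during acquisition.
            "saturated_fraction": (
                float((pixels == dtype_max).mean()) if is_integer else None
            ),
        })
    return stats


# Synthetic 2-channel, 4x4, 8-bit image standing in for real microscopy data.
image = np.zeros((2, 4, 4), dtype=np.uint8)
image[0, :2, :] = 255
image[1] = 10
stats = channel_intensity_stats(image)
print(stats)
```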

For arrays (NPY, HDF5):

Check shape and dimensions
Analyze data type
Calculate statistical summaries
Check for missing/invalid values
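
A minimal sketch of the array checks, assuming a numeric NumPy array already in memory (for example from `np.load`). The `summarize_array` name is hypothetical, and statistics are computed over finite values only so that NaN or infinity does not mask the rest of the data.

```python
import numpy as np


def summarize_array(arr):
    """Shape, dtype, NaN/inf counts, and statistics over the finite values."""
    info = {"shape": arr.shape, "ndim": arr.ndim, "dtype": str(arr.dtype)}
    if np.issubdtype(arr.dtype, np.floating):
        info["nan_count"] = int(np.isnan(arr).sum())
        info["inf_count"] = int(np.isinf(arr).sum())
        finite = arr[np.isfinite(arr)]
    else:
        # Integer and boolean arrays cannot hold NaN or infinity.
        info["nan_count"] = info["inf_count"] = 0
        finite = arr.ravel()
    if finite.size:
        info.update(
            min=float(finite.min()),
            max=float(finite.max()),
            mean=float(finite.mean()),
        )
    return info


arr = np.array([[1.0, 2.0, np.nan], [4.0, np.inf, 6.0]])
info = summarize_array(arr)
print(info)
```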

Step 4: Generate Comprehensive Report

Create a markdown report with the following sections:

Required Sections:

Title and Metadata
- Filename and timestamp
- File size and location
Basic Information
- File properties
- Format identification
File Type Details
- Format description from reference
- Typical data content
- Common use cases
- Python libraries for reading
Data Analysis
- Structure and dimensions
- Statistical summaries
- Quality assessment
- Data characteristics
Key Findings
- Notable patterns
- Potential issues
- Quality metrics
Recommendations
- Preprocessing steps
- Appropriate analyses
- Tools and methods
- Visualization approaches

Template Location

Use assets/report_template.md as a guide for report structure.
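
To make the section layout concrete, here is a small sketch that renders section titles and bullet points as Markdown. It does not reproduce assets/report_template.md; the `build_report` function, its heading levels, and the placeholder bullets are assumptions for illustration.

```python
from datetime import datetime


def build_report(filename, sections, timestamp=None):
    """Render a dict of section title -> list of bullet strings as Markdown."""
    timestamp = timestamp or datetime.now().isoformat(timespec="seconds")
    lines = [f"# EDA Report: {filename}", "", f"Generated: {timestamp}", ""]
    for title, bullets in sections.items():
        lines.append(f"## {title}")
        lines.extend(f"- {bullet}" for bullet in bullets)
        lines.append("")
    return "\n".join(lines)


report = build_report(
    "reads.fastq",
    {
        "Basic Information": ["Format: FASTQ"],
        "Key Findings": ["Placeholder finding"],
        "Recommendations": ["Placeholder recommendation"],
    },
    timestamp="2026-01-21T00:00:00",  # fixed value so the output is reproducible
)
print(report)
```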

Step 5: Save Report

Save the markdown report with a descriptive filename:

Pattern: {original_filename}_eda_report.md, where {original_filename} is the input file name without its extension
Example: experiment_data.fastq → experiment_data_eda_report.md
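
The naming pattern can be derived with `pathlib`, as in this sketch. The `report_path` helper and its optional `output_dir` parameter are hypothetical and not part of the skill.

```python
from pathlib import Path


def report_path(data_file, output_dir=None):
    """Build '<stem>_eda_report.md' next to the data file or in output_dir."""
    data_file = Path(data_file)
    directory = Path(output_dir) if output_dir is not None else data_file.parent
    return directory / f"{data_file.stem}_eda_report.md"


print(report_path("experiment_data.fastq"))
```

Note that `Path.stem` strips only the last suffix, so a compressed input such as sample.fastq.gz would yield sample.fastq_eda_report.md under this sketch.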

Detailed Format References

Each reference file contains comprehensive information for dozens of file types. To find information about a specific format:

Identify the category from the extension
Read the appropriate reference file
Search for the section heading matching the extension (e.g., "### .pdb")
Extract the format information

Reference File Structure

Each format entry includes:

Description: What the format is
Typical Data: What it contains
Use Cases: Common applications
Python Libraries: How to read it (with code examples)
EDA Approach: Specific analyses to perform

Example lookup:

### .pdb - Protein Data Bank
**Description:** Standard format for 3D structures of biological macromolecules
**Typical Data:** Atomic coordinates, residue information, secondary structure
**Use Cases:** Protein structure analysis, molecular visualization, docking
**Python Libraries:**
- `Biopython`: `Bio.PDB`
- `MDAnalysis`: `MDAnalysis.Universe('file.pdb')`
**EDA Approach:**
- Structure validation (bond lengths, angles)
- B-factor distribution
- Missing residues detection
- Ramachandran plots
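
An entry in this shape is regular enough to parse into a dictionary. The sketch below assumes every field starts with a bold `**Name:**` label and that list items begin with `- `, as in the example; `parse_entry` is a hypothetical helper, and the embedded text is a shortened copy of the entry above.

```python
import re

ENTRY = """### .pdb - Protein Data Bank
**Description:** Standard format for 3D structures of biological macromolecules
**Typical Data:** Atomic coordinates, residue information, secondary structure
**Use Cases:** Protein structure analysis, molecular visualization, docking
**Python Libraries:**
- `Biopython`: `Bio.PDB`
- `MDAnalysis`: `MDAnalysis.Universe('file.pdb')`
**EDA Approach:**
- Structure validation (bond lengths, angles)
- B-factor distribution
"""

# Matches lines such as "**Description:** text" or "**EDA Approach:**".
FIELD = re.compile(r"^\*\*(.+?):\*\*\s*(.*)$")


def parse_entry(text):
    """Split one format entry into a dict of field name -> list of values."""
    fields, current = {}, None
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            current = match.group(1)
            inline = match.group(2).strip()
            fields[current] = [inline] if inline else []
        elif line.startswith("- ") and current is not None:
            fields[current].append(line[2:].strip())
    return fields


entry = parse_entry(ENTRY)
print(entry["EDA Approach"])
```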

Best Practices

Reading Reference Files

Reference files are large (10,000+ words each). To efficiently use them:

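Search by extension: Use grep or a regular expression to find the specific format

import re

with open('references/chemistry_molecular_formats.md', 'r', encoding='utf-8') as f:
    content = f.read()

# Match from the "### .pdb" heading up to the next "### " heading or end of file
pattern = r'^### \.pdb\b.*?(?=^### |\Z)'
match = re.search(pattern, content, re.IGNORECASE | re.DOTALL | re.MULTILINE)
section = match.group(0) if match else None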

Extract relevant sections: Don't load entire reference files into context unnecessarily
Cache format info: If analyzing multiple files of the same type, reuse the format information

Data Analysis

Sample large files: For files with millions of records, analyze a representative sample
Handle errors gracefully: Many scientific formats require specific libraries; provide clear installation instructions
Validate metadata: Cross-check metadata consistency (e.g., stated dimensions vs actual data)
Consider data provenance: Note instrument, software versions, processing steps

Report Generation

Be comprehensive: Include all relevant information for downstream analysis
Be specific: Provide concrete recommendations based on the file type
Be actionable: Suggest specific next steps and tools
Include code examples: Show how to load and work with the data

Examples

Example 1: Analyzing a FASTQ file

# User provides: "Analyze reads.fastq"

# 1. Detect file type
extension = '.fastq'
category = 'bioinformatics_genomics'

# 2. Read reference info
# Search references/bioinformatics_genomics_formats.md for "### .fastq"

# 3. Perform analysis
from Bio import SeqIO
sequences = list(SeqIO.parse('reads.fastq', 'fastq'))
# Calculate: read count, length distribution, quality scores, GC content

# 4. Generate report
# Include: format description, analysis results, QC recommendations

# 5. Save as: reads_eda_report.md

Example 2: Analyzing a CSV dataset

# User provides: "Explore experiment_results.csv"

# 1. Detect: .csv → general_scientific

# 2. Load reference for CSV format

# 3. Analyze
import pandas as pd
df = pd.read_csv('experiment_results.csv')
# Dimensions, dtypes, missing values, statistics, correlations

# 4. Generate report with:
# - Data structure
# - Missing value patterns
# - Statistical summaries
# - Correlation matrix
# - Outlier detection results

# 5. Save report

Example 3: Analyzing microscopy data

# User provides: "Analyze cells.nd2"

# 1. Detect: .nd2 → microscopy_imaging (Nikon format)

# 2. Read reference for ND2 format
# Learn: multi-dimensional (XYZCT), requires nd2reader

# 3. Analyze
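from nd2reader import ND2Reader
with ND2Reader('cells.nd2') as images:
    # Extract: dimensions, channels, timepoints, metadata
    sizes = images.sizes        # axis sizes, e.g. x, y, c, t, z
    metadata = images.metadata  # acquisition metadata parsed from the file
    # Calculate: intensity statistics, frame info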

# 4. Generate report with:
# - Image dimensions (XY, Z-stacks, time, channels)
# - Channel wavelengths
# - Pixel size and calibration
# - Recommendations for image analysis

# 5. Save report

Troubleshooting

Missing Libraries

Many scientific formats require specialized libraries:

Problem: Import error when trying to read a file

Solution: Provide clear installation instructions

try:
    from Bio import SeqIO
except ImportError:
    print("Install Biopython: uv pip install biopython")

Common requirements by category:

Bioinformatics: biopython, pysam, pyBigWig
Chemistry: rdkit, mdanalysis, cclib
Microscopy: tifffile, nd2reader, aicsimageio, pydicom
Spectroscopy: nmrglue, pymzml, pyteomics
General: pandas, numpy, h5py, scipy
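
The try/except pattern above can be wrapped in a small reusable helper, sketched here with `importlib`. The `require` function and the `INSTALL_NAMES` mapping are hypothetical; only the Bio/biopython pair is taken from the example, and other modules fall back to their import name.

```python
import importlib

# Import name -> package name to install. Only the Biopython pair comes from
# the example above; extend the mapping as needed (illustrative assumption).
INSTALL_NAMES = {"Bio": "biopython"}


def require(module_name):
    """Import a module or raise ImportError with an installation hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        package = INSTALL_NAMES.get(module_name.split(".")[0], module_name)
        raise ImportError(
            f"'{module_name}' is required to read this format. "
            f"Install it with: uv pip install {package}"
        ) from exc


json_module = require("json")  # stdlib module, always importable
print(json_module.dumps({"ok": True}))
```

Raising an error with the hint, instead of printing it, lets the calling code decide whether to stop or fall back to a generic analysis.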

Unknown File Types

If a file extension is not in the references:

Ask the user about the file format
Check if it's a vendor-specific variant
Attempt generic analysis based on file structure (text vs binary)
Provide general recommendations
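
One simple way to make the text-versus-binary decision is to inspect the first few kilobytes of the file. This is a heuristic sketch, not the skill's method: `sniff_file` is a hypothetical name, and it treats NUL bytes or invalid UTF-8 as signs of a binary format.

```python
import codecs
import tempfile
from pathlib import Path


def sniff_file(path, sample_size=4096):
    """Classify a file as 'empty', 'text', or 'binary' from its first bytes."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    if not sample:
        return "empty"
    if b"\x00" in sample:
        return "binary"  # NUL bytes almost never occur in text formats
    try:
        # final=False tolerates a multi-byte character cut off at the sample edge.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return "binary"
    return "text"


# Demo on two throwaway files with an unrecognized extension.
with tempfile.TemporaryDirectory() as tmp:
    text_file = Path(tmp) / "table.unknown"
    text_file.write_text("id,value\n1,0.5\n", encoding="utf-8")
    binary_file = Path(tmp) / "blob.unknown"
    binary_file.write_bytes(b"\x89HDF\r\n\x1a\n\x00\x00")
    results = {p.name: sniff_file(p) for p in (text_file, binary_file)}
print(results)
```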

Large Files

For very large files:

Use sampling strategies (first N records)
Use memory-mapped access (for HDF5, NPY)
Process in chunks (for CSV, FASTQ)
Provide estimates based on samples
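
The sampling and chunking strategies can be combined, as in this standard-library sketch for CSV input. The `iter_chunks` and `column_mean` helpers are hypothetical; with pandas, `read_csv(..., chunksize=...)` offers a similar pattern.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any row iterator."""
    rows = iter(rows)
    while chunk := list(islice(rows, chunk_size)):
        yield chunk


def column_mean(handle, column, chunk_size=1000, max_rows=None):
    """Mean of one numeric CSV column, read chunk by chunk.

    If max_rows is given, only the first max_rows records are used (a sample).
    """
    reader = csv.DictReader(handle)
    rows = islice(reader, max_rows) if max_rows is not None else reader
    total, count = 0.0, 0
    for chunk in iter_chunks(rows, chunk_size):
        values = [float(row[column]) for row in chunk if row[column] != ""]
        total += sum(values)
        count += len(values)
    return total / count if count else float("nan")


# Ten made-up rows standing in for a file with millions of records.
data = "id,value\n" + "".join(f"{i},{i}\n" for i in range(10))
full = column_mean(io.StringIO(data), "value", chunk_size=3)
sampled = column_mean(io.StringIO(data), "value", chunk_size=3, max_rows=4)
print(full, sampled)
```

The gap between the two results shows why an estimate from the first N records should be reported as an estimate: the head of a file is not always representative of the whole.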

Script Usage

The scripts/eda_analyzer.py script can be run directly:

# Basic usage
python scripts/eda_analyzer.py data.csv

# Specify output file
python scripts/eda_analyzer.py data.csv output_report.md

# The script will:
# 1. Auto-detect file type
# 2. Load format references
# 3. Perform appropriate analysis
# 4. Generate markdown report

The script supports automatic analysis for many common formats, but custom analysis in the conversation provides more flexibility and domain-specific insights.

Advanced Usage

Multi-File Analysis

When analyzing multiple related files:

Perform individual EDA on each file
Create a summary comparison report
Identify relationships and dependencies
Suggest integration strategies

Quality Control

For data quality assessment:

Check format compliance
Validate metadata consistency
Assess completeness
Identify outliers and anomalies
Compare to expected ranges/distributions

Preprocessing Recommendations

Based on data characteristics, recommend:

Normalization strategies
Missing value imputation
Outlier handling
Batch correction
Format conversions

Resources

scripts/

eda_analyzer.py: Comprehensive analysis script that can be run directly or imported

references/

chemistry_molecular_formats.md: 60+ chemistry/molecular file formats
bioinformatics_genomics_formats.md: 50+ bioinformatics formats
microscopy_imaging_formats.md: 45+ imaging formats
spectroscopy_analytical_formats.md: 35+ spectroscopy formats
proteomics_metabolomics_formats.md: 30+ omics formats
general_scientific_formats.md: 30+ general formats

assets/

report_template.md: Comprehensive markdown template for EDA reports

Weekly Installs

766

Repository

davila7/claude-…emplates

GitHub Stars

23.4K

First Seen

Jan 21, 2026

Security Audits

Gen Agent Trust Hub: Pass, Socket: Pass, Snyk: Pass

Installed on

opencode: 658

gemini-cli: 620

codex: 610

cursor: 573

github-copilot: 551

kimi-cli: 481

