exploratory-data-analysis by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill exploratory-data-analysis
Perform comprehensive exploratory data analysis (EDA) on scientific data files across multiple domains. This skill provides automated file type detection, format-specific analysis, data quality assessment, and generates detailed markdown reports suitable for documentation and downstream analysis planning.
Key Capabilities:
Use this skill when:
The skill has comprehensive coverage of scientific file formats organized into six major categories:
Chemistry and molecular formats: structure files, computational chemistry outputs, molecular dynamics trajectories, and chemical databases.
File types include: .pdb, .cif, .mol, .mol2, .sdf, .xyz, .smi, .gro, .log, .fchk, .cube, .dcd, .xtc, .trr, .prmtop, .psf, and more.
Reference file: references/chemistry_molecular_formats.md
Bioinformatics and genomics formats: sequence data, alignments, annotations, variants, and expression data.
File types include: .fasta, .fastq, .sam, .bam, .vcf, .bed, .gff, .gtf, .bigwig, .h5ad, .loom, .counts, .mtx, and more.
Reference file: references/bioinformatics_genomics_formats.md
Microscopy and imaging formats: microscopy images, medical imaging, whole slide imaging, and electron microscopy.
File types include: .tif, .nd2, .lif, .czi, .ims, .dcm, .nii, .mrc, .dm3, .vsi, .svs, .ome.tiff, and more.
Reference file: references/microscopy_imaging_formats.md
Spectroscopy and analytical formats: NMR, mass spectrometry, IR/Raman, UV-Vis, X-ray, chromatography, and other analytical techniques.
File types include: .fid, .mzML, .mzXML, .raw, .mgf, .spc, .jdx, .xy, .cif (crystallography), .wdf, and more.
Reference file: references/spectroscopy_analytical_formats.md
Proteomics and metabolomics formats: mass spectrometry proteomics, metabolomics, lipidomics, and multi-omics data.
File types include: .mzML, .pepXML, .protXML, .mzid, .mzTab, .sky, .mgf, .msp, .h5ad, and more.
Reference file: references/proteomics_metabolomics_formats.md
General scientific formats: arrays, tables, hierarchical data, compressed archives, and common scientific formats.
File types include: .npy, .npz, .csv, .xlsx, .json, .hdf5, .zarr, .parquet, .mat, .fits, .nc, .xml, and more.
Reference file: references/general_scientific_formats.md
When a user provides a file path, first identify the file type:
Example:
User: "Analyze data.fastq"
→ Extension: .fastq
→ Category: bioinformatics_genomics
→ Format: FASTQ Format (sequence data with quality scores)
→ Reference: references/bioinformatics_genomics_formats.md
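The detection step above can be sketched as a lookup from extension to category. This is a minimal illustration, not the logic of scripts/eda_analyzer.py: the `EXTENSION_CATEGORIES` table and the `detect_file_type` helper are hypothetical names, and the table lists only a handful of the extensions covered by the reference files.

```python
from pathlib import Path

# Hypothetical, abbreviated mapping; the reference files cover many more extensions.
EXTENSION_CATEGORIES = {
    ".pdb": "chemistry_molecular",
    ".sdf": "chemistry_molecular",
    ".fastq": "bioinformatics_genomics",
    ".vcf": "bioinformatics_genomics",
    ".nd2": "microscopy_imaging",
    ".ome.tiff": "microscopy_imaging",
    ".fid": "spectroscopy_analytical",
    ".pepxml": "proteomics_metabolomics",
    ".csv": "general_scientific",
    ".npy": "general_scientific",
}


def detect_file_type(filepath):
    """Return (extension, category, reference_path), or (extension, None, None) if unknown."""
    name = Path(filepath).name.lower()
    # Try the longest known extension first so compound ones like ".ome.tiff" match.
    for ext in sorted(EXTENSION_CATEGORIES, key=len, reverse=True):
        if name.endswith(ext):
            category = EXTENSION_CATEGORIES[ext]
            return ext, category, f"references/{category}_formats.md"
    return Path(name).suffix, None, None


print(detect_file_type("data.fastq"))
```

Some extensions listed above (for example .cif, .mzML, .mgf, and .h5ad) appear under more than one category, so a flat dictionary like this one cannot resolve them; in those cases the file contents or the user's context would have to decide.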
Based on the file type, read the corresponding reference file to understand:
Search the reference file for the specific extension (e.g., search for "### .fastq" in bioinformatics_genomics_formats.md).
Use the scripts/eda_analyzer.py script or implement a custom analysis:
Option A: Use the analyzer script
# The script automatically:
# 1. Detects file type
# 2. Loads reference information
# 3. Performs format-specific analysis
# 4. Generates markdown report
python scripts/eda_analyzer.py <filepath> [output.md]
Option B: Custom analysis in the conversation
Based on the format information from the reference file, perform the appropriate analysis:
For tabular data (CSV, TSV, Excel):
For sequence data (FASTA, FASTQ):
For images (TIFF, ND2, CZI):
For arrays (NPY, HDF5):
Create a markdown report with the following sections:
Title and Metadata
Basic Information
File Type Details
Data Analysis
Key Findings
Recommendations
Use assets/report_template.md as a guide for report structure.
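As a rough sketch of how the sections listed above could be assembled, the snippet below renders a report skeleton and derives the output name using the skill's `{original_filename}_eda_report.md` convention. The helper names `report_path` and `render_report` and the heading layout are assumptions made for illustration; the actual layout is defined by assets/report_template.md, which is not reproduced here.

```python
from datetime import date
from pathlib import Path

# Section titles taken from the list above; "Title and Metadata" is the report header.
REPORT_SECTIONS = [
    "Basic Information",
    "File Type Details",
    "Data Analysis",
    "Key Findings",
    "Recommendations",
]


def report_path(data_file):
    """Build the report name as {original_filename}_eda_report.md (extension dropped)."""
    p = Path(data_file)
    return p.with_name(f"{p.stem}_eda_report.md")


def render_report(data_file, sections):
    """Render a Markdown report; `sections` maps section titles to body text."""
    lines = [
        f"# EDA Report: {Path(data_file).name}",
        "",
        f"Generated: {date.today().isoformat()}",
        "",
    ]
    for title in REPORT_SECTIONS:
        lines += [f"## {title}", "", sections.get(title, "_Not available._"), ""]
    return "\n".join(lines)
```

A caller would fill the `sections` dictionary from the analysis results and write the returned text to `report_path(data_file)`.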
Save the markdown report with a descriptive filename:
{original_filename}_eda_report.md
Example: experiment_data.fastq → experiment_data_eda_report.md
Each reference file contains comprehensive information for dozens of file types. To find information about a specific format:
Each format entry includes:
Example lookup:
### .pdb - Protein Data Bank
**Description:** Standard format for 3D structures of biological macromolecules
**Typical Data:** Atomic coordinates, residue information, secondary structure
**Use Cases:** Protein structure analysis, molecular visualization, docking
**Python Libraries:**
- `Biopython`: `Bio.PDB`
- `MDAnalysis`: `MDAnalysis.Universe('file.pdb')`
**EDA Approach:**
- Structure validation (bond lengths, angles)
- B-factor distribution
- Missing residues detection
- Ramachandran plots
Reference files are large (10,000+ words each). To efficiently use them:
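Search by extension: Use grep or a regular expression to find the specific format
import re
# Read the reference file, then keep only the '### .pdb' section
with open('references/chemistry_molecular_formats.md', 'r', encoding='utf-8') as f:
    content = f.read()
# Match from the '### .pdb' heading up to the next '### ' heading or the end of the file
pattern = r'^### \.pdb\b.*?(?=^### |\Z)'
match = re.search(pattern, content, re.IGNORECASE | re.DOTALL | re.MULTILINE)
section = match.group(0) if match else None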
Extract relevant sections: Don't load entire reference files into context unnecessarily
Cache format info: If analyzing multiple files of the same type, reuse the format information
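One way to combine the extraction and caching advice above is to memoize the section lookup, so repeated analyses of the same file type read the reference file only once. This is a sketch under the assumption that each format entry starts with a `### .ext` heading, as in the example lookup; `load_format_section` is a hypothetical helper, not part of the skill's scripts.

```python
import re
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_format_section(reference_path, extension):
    """Return the '### <extension>' section of a reference file, or None if absent."""
    text = Path(reference_path).read_text(encoding="utf-8")
    # \b keeps '.pdb' from matching a longer extension such as '.pdbqt'.
    pattern = rf"^### {re.escape(extension)}\b.*?(?=^### |\Z)"
    match = re.search(pattern, text, re.IGNORECASE | re.DOTALL | re.MULTILINE)
    return match.group(0).strip() if match else None
```

Because the cache key is the pair of path and extension, edits to a reference file during a session would not be picked up until `load_format_section.cache_clear()` is called.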
# User provides: "Analyze reads.fastq"
# 1. Detect file type
extension = '.fastq'
category = 'bioinformatics_genomics'
# 2. Read reference info
# Search references/bioinformatics_genomics_formats.md for "### .fastq"
# 3. Perform analysis
from Bio import SeqIO
sequences = list(SeqIO.parse('reads.fastq', 'fastq'))
# Calculate: read count, length distribution, quality scores, GC content
# 4. Generate report
# Include: format description, analysis results, QC recommendations
# 5. Save as: reads_eda_report.md
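The calculations named in step 3 above (read count, length distribution, quality scores, GC content) can also be sketched with the standard library alone, reading one record at a time instead of loading every read into a list. The sketch assumes unwrapped four-line FASTQ records and Phred+33 quality encoding; `fastq_summary` is a hypothetical helper name.

```python
from collections import Counter


def fastq_summary(path, phred_offset=33):
    """Summarize a four-line-per-record FASTQ file: reads, lengths, GC, mean quality."""
    lengths = Counter()
    n_reads = gc = bases = qual_sum = qual_count = 0
    with open(path, "r", encoding="ascii") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # skip stray blank lines
            seq = handle.readline().strip().upper()
            handle.readline()  # '+' separator line
            qual = handle.readline().strip()
            n_reads += 1
            lengths[len(seq)] += 1
            bases += len(seq)
            gc += seq.count("G") + seq.count("C")
            qual_sum += sum(ord(c) - phred_offset for c in qual)
            qual_count += len(qual)
    return {
        "reads": n_reads,
        "length_distribution": dict(lengths),
        "gc_fraction": gc / bases if bases else 0.0,
        "mean_quality": qual_sum / qual_count if qual_count else 0.0,
    }
```

Streaming the file this way keeps memory use flat, which matters for sequencing runs with millions of reads.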
# User provides: "Explore experiment_results.csv"
# 1. Detect: .csv → general_scientific
# 2. Load reference for CSV format
# 3. Analyze
import pandas as pd
df = pd.read_csv('experiment_results.csv')
# Dimensions, dtypes, missing values, statistics, correlations
# 4. Generate report with:
# - Data structure
# - Missing value patterns
# - Statistical summaries
# - Correlation matrix
# - Outlier detection results
# 5. Save report
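Part of step 3 of the CSV example (missing values, summary statistics, outlier detection) can be sketched without pandas as shown below. The missing-value markers and the 1.5 × IQR outlier rule are assumptions chosen for illustration, and `csv_summary` is a hypothetical helper; correlations are left out to keep the sketch short.

```python
import csv
import statistics

MISSING_MARKERS = {"", "na", "nan"}  # assumed markers; adjust for the dataset


def csv_summary(path):
    """Per-column missing counts, plus mean, stdev, and IQR outliers for numeric columns."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    summary = {"rows": len(rows), "columns": {}}
    for col in (rows[0].keys() if rows else []):
        values = [r[col] for r in rows]
        present = [v for v in values
                   if v is not None and v.strip().lower() not in MISSING_MARKERS]
        info = {"missing": len(values) - len(present)}
        try:
            nums = [float(v) for v in present]
        except ValueError:
            nums = None  # non-numeric column: report missing values only
        if nums and len(nums) >= 2:
            q1, _, q3 = statistics.quantiles(nums, n=4)
            low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
            info.update(
                mean=statistics.mean(nums),
                stdev=statistics.stdev(nums),
                outliers=[x for x in nums if x < low or x > high],
            )
        summary["columns"][col] = info
    return summary
```

The returned dictionary maps directly onto the "Missing value patterns", "Statistical summaries", and "Outlier detection results" items of the report.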
# User provides: "Analyze cells.nd2"
# 1. Detect: .nd2 → microscopy_imaging (Nikon format)
# 2. Read reference for ND2 format
# Learn: multi-dimensional (XYZCT), requires nd2reader
# 3. Analyze
from nd2reader import ND2Reader
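with ND2Reader('cells.nd2') as images:
    # Extract: dimensions, channels, timepoints, metadata
    sizes = images.sizes        # axis sizes, e.g. x, y, c, t, z
    metadata = images.metadata  # channel names, pixel size, acquisition info
    # Calculate: intensity statistics, frame info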
# 4. Generate report with:
# - Image dimensions (XY, Z-stacks, time, channels)
# - Channel wavelengths
# - Pixel size and calibration
# - Recommendations for image analysis
# 5. Save report
Many scientific formats require specialized libraries:
Problem: Import error when trying to read a file
Solution: Provide clear installation instructions
try:
    from Bio import SeqIO
except ImportError:
    print("Install Biopython: uv pip install biopython")
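The same idea can be extended to check several libraries before an analysis starts, so the user gets one install command instead of a series of import errors. This is a sketch: the `REQUIRED_PACKAGES` table and `missing_packages` helper are hypothetical, and only a few import-name to package-name pairs are shown.

```python
import importlib.util

# Import name -> package name to install. Only a few entries are shown.
REQUIRED_PACKAGES = {
    "Bio": "biopython",
    "pysam": "pysam",
    "nd2reader": "nd2reader",
    "pandas": "pandas",
}


def missing_packages(import_names):
    """Return install names for top-level modules that cannot be found."""
    missing = []
    for name in import_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(REQUIRED_PACKAGES.get(name, name))
    return missing


needed = missing_packages(["Bio", "pandas"])
if needed:
    print("Install missing packages: uv pip install " + " ".join(needed))
```

Checking with `find_spec` avoids actually importing heavy libraries just to see whether they are present.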
Common requirements by category:
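Bioinformatics: biopython, pysam, pyBigWig
Chemistry: rdkit, mdanalysis, cclib
Microscopy: tifffile, nd2reader, aicsimageio, pydicom
Spectroscopy: nmrglue, pymzml, pyteomics
General: pandas, numpy, h5py, scipy
If a file extension is not in the references: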
For very large files:
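One possible approach for large text-based files, sketched here as an assumption rather than as the skill's documented behavior, is to analyze only the first part of the file once it exceeds a size threshold. The `sample_lines` helper and both default limits are hypothetical.

```python
import os
from itertools import islice


def sample_lines(path, max_lines=10_000, size_threshold=100 * 1024 * 1024):
    """Return (lines, sampled): every line for small files, the first max_lines otherwise."""
    sampled = os.path.getsize(path) > size_threshold
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        lines = list(islice(handle, max_lines)) if sampled else handle.readlines()
    return lines, sampled
```

The start of a file is not always representative of the whole, so a report built from a sample should say so explicitly.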
The scripts/eda_analyzer.py script can be used directly:
# Basic usage
python scripts/eda_analyzer.py data.csv
# Specify output file
python scripts/eda_analyzer.py data.csv output_report.md
# The script will:
# 1. Auto-detect file type
# 2. Load format references
# 3. Perform appropriate analysis
# 4. Generate markdown report
The script supports automatic analysis for many common formats, but custom analysis in the conversation provides more flexibility and domain-specific insights.
When analyzing multiple related files:
For data quality assessment:
Based on data characteristics, recommend:
scripts/eda_analyzer.py: Comprehensive analysis script that can be run directly or imported
references/chemistry_molecular_formats.md: 60+ chemistry/molecular file formats
references/bioinformatics_genomics_formats.md: 50+ bioinformatics formats
references/microscopy_imaging_formats.md: 45+ imaging formats
references/spectroscopy_analytical_formats.md: 35+ spectroscopy formats
references/proteomics_metabolomics_formats.md: 30+ omics formats
references/general_scientific_formats.md: 30+ general formats
assets/report_template.md: Comprehensive markdown template for EDA reports
Weekly installs: 766
Repository: davila7/claude-code-templates
GitHub stars: 23.4K
First seen: Jan 21, 2026
Security audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: opencode (658), gemini-cli (620), codex (610), cursor (573), github-copilot (551), kimi-cli (481)