Vaex Python library: out-of-core DataFrames for billion-row datasets, with high-performance data analysis and visualization

vaex by davila7/claude-code-templates


Install command

npx skills add https://github.com/davila7/claude-code-templates --skill vaex




Vaex

Overview

Vaex is a high-performance Python library for lazy, out-of-core DataFrames. It processes and visualizes tabular datasets that are too large to fit into RAM. Vaex can compute aggregate statistics (mean, sum, count, and similar) at rates of up to about a billion rows per second, which makes interactive exploration and analysis practical on datasets with billions of rows.
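To make "out-of-core" concrete, here is a minimal plain-Python sketch of the idea: statistics are accumulated while the data streams past, so memory use does not grow with the number of rows. This is an illustration only and is not how Vaex is implemented. The helper names `streaming_mean_std` and `write_demo_csv`, and the demo file, are hypothetical.

```python
import csv
import math
import os
import tempfile


def streaming_mean_std(path, column):
    """Return (count, mean, std) for one numeric CSV column, reading row by row.

    Uses Welford's online algorithm, so memory use stays constant no matter
    how many rows the file has. The std is the population standard deviation.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            value = float(row[column])
            count += 1
            delta = value - mean
            mean += delta / count
            m2 += delta * (value - mean)
    std = math.sqrt(m2 / count) if count else float("nan")
    return count, mean, std


def write_demo_csv(path, values):
    """Write a small one-column CSV so the sketch can run end to end."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["x"])
        for value in values:
            writer.writerow([value])


# Hypothetical demo data; a real use case would point at a much larger file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
write_demo_csv(demo_path, [1.0, 2.0, 3.0, 4.0])
count, mean, std = streaming_mean_std(demo_path, "x")
print(count, mean, round(std, 4))
```

The same function works unchanged on a file far larger than RAM, because only one row and three running numbers are held in memory at any time.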

When to Use This Skill

Use Vaex when:

Processing tabular datasets larger than available RAM (gigabytes to terabytes)
Performing fast statistical aggregations on massive datasets
Creating visualizations and heatmaps of large datasets
Building machine learning pipelines on big data
Converting between data formats (CSV, HDF5, Arrow, Parquet)
Needing lazy evaluation and virtual columns to avoid memory overhead
Working with astronomical data, financial time series, or other large-scale scientific datasets

Core Capabilities

Vaex provides six primary capability areas, each documented in detail in the references directory:

1. DataFrames and Data Loading

Load and create Vaex DataFrames from various sources, including files (HDF5, CSV, Arrow, Parquet), pandas DataFrames, NumPy arrays, and dictionaries. See references/core_dataframes.md for:

Opening large files efficiently
Converting from pandas/NumPy/Arrow
Working with example datasets
Understanding DataFrame structure

2. Data Processing and Manipulation

Perform filtering, create virtual columns, use expressions, and aggregate data without loading everything into memory. See references/data_processing.md for:

Filtering and selections
Virtual columns and expressions
Groupby operations and aggregations
String operations and datetime handling
Working with missing data
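The idea behind virtual columns and filters can be sketched in plain Python: a virtual column stores only a formula, and values are produced when an aggregation asks for them. The `LazyFrame` class below is a hypothetical toy for illustration and is not Vaex's implementation or API.

```python
class LazyFrame:
    """Toy table: real columns hold data, virtual columns hold only a formula."""

    def __init__(self, **columns):
        self.columns = dict(columns)  # name -> list of stored values
        self.virtual = {}             # name -> function(row dict) -> value

    def add_virtual(self, name, formula):
        # Nothing is computed or stored here; only the formula is kept.
        self.virtual[name] = formula

    def rows(self):
        """Yield one row dict at a time, evaluating virtual columns on demand."""
        names = list(self.columns)
        for values in zip(*(self.columns[name] for name in names)):
            row = dict(zip(names, values))
            for virtual_name, formula in self.virtual.items():
                row[virtual_name] = formula(row)
            yield row

    def mean(self, name, where=None):
        """Mean of a real or virtual column, optionally restricted by a filter."""
        total = 0.0
        count = 0
        for row in self.rows():
            if where is None or where(row):
                total += row[name]
                count += 1
        return total / count if count else float("nan")


frame = LazyFrame(x=[1.0, 2.0, 3.0], y=[10.0, 20.0, 30.0])
frame.add_virtual("x2_plus_y", lambda row: row["x"] ** 2 + row["y"])

# The filter and the virtual column are both applied during the single scan.
filtered_mean = frame.mean("x2_plus_y", where=lambda row: row["x"] > 1)
print(filtered_mean)
```

Adding the virtual column costs no extra storage: `frame.columns` still holds only `x` and `y`, and the derived values exist only while a row is being processed.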

3. Performance and Optimization

Leverage Vaex's lazy evaluation, caching strategies, and memory-efficient operations. See references/performance.md for:

Understanding lazy evaluation
Using delay=True for batching operations
Materializing columns when needed
Caching strategies
Asynchronous operations
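The benefit of batching with `delay=True` is that several aggregations can share one pass over the data instead of each triggering its own scan. The plain-Python sketch below illustrates that idea only. `BatchedAggregator` and `DelayedResult` are hypothetical names and do not mirror Vaex internals.

```python
class DelayedResult:
    """Placeholder for a value that is filled in when the batch executes."""

    def __init__(self):
        self._value = None
        self._ready = False

    def _set(self, value):
        self._value = value
        self._ready = True

    def get(self):
        if not self._ready:
            raise RuntimeError("call execute() before reading the result")
        return self._value


class BatchedAggregator:
    """Collect several reductions, then compute them all in one pass."""

    def __init__(self, rows):
        self._rows = rows   # any re-iterable collection of row dicts
        self._tasks = []    # (initial state, step function, finish function, result)
        self.passes = 0     # how many times the data has been scanned

    def _schedule(self, initial, step, finish):
        result = DelayedResult()
        self._tasks.append((initial, step, finish, result))
        return result

    def sum(self, column):
        return self._schedule(0.0, lambda acc, row: acc + row[column], lambda acc: acc)

    def mean(self, column):
        return self._schedule(
            (0.0, 0),
            lambda acc, row: (acc[0] + row[column], acc[1] + 1),
            lambda acc: acc[0] / acc[1] if acc[1] else float("nan"),
        )

    def execute(self):
        """Scan the rows once, updating every scheduled reduction per row."""
        self.passes += 1
        states = [task[0] for task in self._tasks]
        for row in self._rows:
            states = [task[1](state, row) for task, state in zip(self._tasks, states)]
        for task, state in zip(self._tasks, states):
            task[3]._set(task[2](state))
        self._tasks.clear()


rows = [{"x": 1.0, "z": 10.0}, {"x": 2.0, "z": 20.0}, {"x": 3.0, "z": 30.0}]
batch = BatchedAggregator(rows)
mean_x = batch.mean("x")   # scheduled, not yet computed
sum_z = batch.sum("z")     # scheduled, not yet computed
batch.execute()            # one scan computes both
print(mean_x.get(), sum_z.get(), batch.passes)
```

For data that has to be read from disk, the scan usually dominates the cost, so sharing it across aggregations is where the saving comes from.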

4. Data Visualization

Create interactive visualizations of large datasets, including heatmaps, histograms, and scatter plots. See references/visualization.md for:

Creating 1D and 2D plots
Heatmap visualizations
Working with selections
Customizing plots and subplots
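Heatmaps of very large datasets are typically drawn from counts binned on a regular grid rather than from individual points, so the plotting cost depends on the grid size and not on the row count. The sketch below shows such binning in plain Python. The function name `bin_counts_2d` and its parameters are hypothetical and chosen only for this illustration.

```python
def bin_counts_2d(xs, ys, x_limits, y_limits, shape):
    """Count points per cell on a regular shape[0] x shape[1] grid.

    Points outside the half-open limits [min, max) are ignored.
    The result is a list of lists indexed as grid[ix][iy].
    """
    (x_min, x_max), (y_min, y_max) = x_limits, y_limits
    nx, ny = shape
    grid = [[0] * ny for _ in range(nx)]
    for x, y in zip(xs, ys):
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue
        # Map each coordinate to a cell index; min() guards against rounding
        # pushing a value just below the upper limit into an out-of-range cell.
        ix = min(int((x - x_min) / (x_max - x_min) * nx), nx - 1)
        iy = min(int((y - y_min) / (y_max - y_min) * ny), ny - 1)
        grid[ix][iy] += 1
    return grid


xs = [0.1, 0.2, 0.6, 0.9, 1.5]
ys = [0.1, 0.8, 0.7, 0.9, 0.5]
grid = bin_counts_2d(xs, ys, x_limits=(0.0, 1.0), y_limits=(0.0, 1.0), shape=(2, 2))
print(grid)
```

The resulting grid can be handed to any image-plotting routine. Only the small grid has to fit in memory, not the points themselves.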

5. Machine Learning Integration

Build ML pipelines with transformers, encoders, and integration with scikit-learn, XGBoost, and other frameworks. See references/machine_learning.md for:

Feature scaling and encoding
PCA and dimensionality reduction
K-means clustering
Integration with scikit-learn/XGBoost/CatBoost
Model serialization and deployment

6. I/O Operations

Efficiently read and write data in various formats with optimal performance. See references/io_operations.md for:

File format recommendations
Export strategies
Working with Apache Arrow
CSV handling for large files
Server and remote data access

Quick Start Pattern

For most Vaex tasks, follow this pattern:

import vaex

# 1. Open or create DataFrame
df = vaex.open('large_file.hdf5')  # or .csv, .arrow, .parquet
# OR
df = vaex.from_pandas(pandas_df)

# 2. Explore the data
print(df)  # Shows first/last rows and column info
df.describe()  # Statistical summary

# 3. Create virtual columns (no memory overhead)
df['new_column'] = df.x ** 2 + df.y

# 4. Filter rows (returns a filtered view; the underlying data is not copied)
df_filtered = df[df.age > 25]

# 5. Compute statistics (fast, lazy evaluation)
mean_val = df.x.mean()
stats = df.groupby('category').agg({'value': 'sum'})

# 6. Visualize (vaex 4.x accessor API; older releases expose df.plot1d and df.plot)
df.viz.histogram(df.x, limits=[0, 100])
df.viz.heatmap(df.x, df.y, limits='99.7%')

# 7. Export if needed
df.export_hdf5('output.hdf5')

Working with References

The reference files contain detailed information about each capability area. Load references into context based on the specific task:

Basic operations: Start with references/core_dataframes.md and references/data_processing.md
Performance issues: Check references/performance.md
Visualization tasks: Use references/visualization.md
ML pipelines: See references/machine_learning.md
File I/O: Consult references/io_operations.md

Best Practices

Use HDF5 or Apache Arrow formats for optimal performance with large datasets
Leverage virtual columns instead of materializing data to save memory
Batch operations using delay=True when performing multiple calculations
Export to efficient formats rather than keeping data in CSV
Use expressions for complex calculations without intermediate storage
Profile with df.stat() to understand memory usage and optimize operations

Common Patterns

Pattern: Converting Large CSV to HDF5

import vaex

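# Read the CSV in chunks and convert it to HDF5 on disk.
# Without convert=True, from_csv loads the whole file into memory via pandas.
df = vaex.from_csv('large_file.csv', convert=True, chunk_size=5_000_000)

# Export to HDF5 for faster future access
df.export_hdf5('large_file.hdf5')

# Later loads are memory-mapped and near-instant
df = vaex.open('large_file.hdf5')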

Pattern: Efficient Aggregations

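# Use delay=True to schedule several aggregations without running them yet
mean_x = df.x.mean(delay=True)
std_y = df.y.std(delay=True)
sum_z = df.z.sum(delay=True)

# Run all scheduled aggregations in a single pass over the data
df.execute()

# Each delayed result is a promise; read its value with .get()
results = [mean_x.get(), std_y.get(), sum_z.get()]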

Pattern: Virtual Columns for Feature Engineering

# No memory overhead - computed on the fly
df['age_squared'] = df.age ** 2
df['full_name'] = df.first_name + ' ' + df.last_name
df['is_adult'] = df.age >= 18

Resources

This skill includes reference documentation in the references/ directory:

core_dataframes.md - DataFrame creation, loading, and basic structure
data_processing.md - Filtering, expressions, aggregations, and transformations
performance.md - Optimization strategies and lazy evaluation
visualization.md - Plotting and interactive visualizations
machine_learning.md - ML pipelines and model integration
io_operations.md - File formats and data import/export


First seen: Jan 21, 2026
