polars by k-dense-ai/claude-scientific-skills

npx skills add https://github.com/k-dense-ai/claude-scientific-skills --skill polars
Polars is a lightning-fast DataFrame library for Python and Rust built on Apache Arrow. Work with Polars' expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities for efficient data processing, pandas migration, and data pipeline optimization.
Install Polars:
uv pip install polars
Basic DataFrame creation and operations:
import polars as pl
# Create DataFrame
df = pl.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"city": ["NY", "LA", "SF"]
})
# Select columns
df.select("name", "age")
# Filter rows
df.filter(pl.col("age") > 25)
# Add computed columns
df.with_columns(
age_plus_10=pl.col("age") + 10
)
Expressions are the fundamental building blocks of Polars operations. They describe transformations on data and can be composed, reused, and optimized.
Key principles:
pl.col("column_name") to reference columnsExample:
# Expression-based computation
df.select(
pl.col("name"),
(pl.col("age") * 12).alias("age_in_months")
)
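Because an expression is just a value, it can be bound to a name and reused anywhere an expression is accepted. A minimal sketch, reusing the df defined above:
# Build the expression once, reuse it across contexts
age_in_months = (pl.col("age") * 12).alias("age_in_months")
df.select(age_in_months)          # as a projection
df.with_columns(age_in_months)    # as an added column
df.filter(age_in_months > 300)    # inside a predicate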
Eager (DataFrame): Operations execute immediately
df = pl.read_csv("file.csv") # Reads immediately
result = df.filter(pl.col("age") > 25) # Executes immediately
Lazy (LazyFrame): Operations build a query plan, optimized before execution
lf = pl.scan_csv("file.csv") # Doesn't read yet
result = lf.filter(pl.col("age") > 25).select("name", "age")
df = result.collect() # Now executes optimized query
When to use lazy: prefer LazyFrame for large files and multi-step pipelines.
Benefits of lazy evaluation: the query optimizer can prune unused columns (projection pushdown), push filters into the scan (predicate pushdown), and avoid materializing intermediate results. A sketch of inspecting the plan follows.
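To verify what the optimizer will actually run, print the plan before collecting. A minimal sketch using LazyFrame.explain() on the query above:
lf = pl.scan_csv("file.csv")
query = lf.filter(pl.col("age") > 25).select("name", "age")
print(query.explain())  # optimized plan: filter and projection pushed into the scan
df = query.collect()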
For detailed concepts, load references/core_concepts.md.
Select and manipulate columns:
# Select specific columns
df.select("name", "age")
# Select with expressions
df.select(
pl.col("name"),
(pl.col("age") * 2).alias("double_age")
)
# Select all columns matching a pattern
df.select(pl.col("^.*_id$"))
Filter rows by conditions:
# Single condition
df.filter(pl.col("age") > 25)
# Multiple conditions (cleaner than using &)
df.filter(
pl.col("age") > 25,
pl.col("city") == "NY"
)
# Complex conditions
df.filter(
(pl.col("age") > 25) | (pl.col("city") == "LA")
)
Add or modify columns while preserving existing ones:
# Add new columns
df.with_columns(
age_plus_10=pl.col("age") + 10,
name_upper=pl.col("name").str.to_uppercase()
)
# Parallel computation (all columns computed in parallel)
df.with_columns(
pl.col("value") * 10,
pl.col("value") * 100,
)
Group data and compute aggregations:
# Basic grouping
df.group_by("city").agg(
pl.col("age").mean().alias("avg_age"),
pl.len().alias("count")
)
# Multiple group keys
df.group_by("city", "department").agg(
pl.col("salary").sum()
)
# Conditional aggregations
df.group_by("city").agg(
(pl.col("age") > 30).sum().alias("over_30")
)
For detailed operation patterns, load references/operations.md.
Common aggregations within group_by context:
* pl.len() - count rows
* pl.col("x").sum() - sum values
* pl.col("x").mean() - average
* pl.col("x").min() / pl.col("x").max() - extremes
* pl.first() / pl.last() - first/last values

Window functions with over() apply aggregations while preserving row count:
# Add group statistics to each row
df.with_columns(
avg_age_by_city=pl.col("age").mean().over("city"),
rank_in_city=pl.col("salary").rank().over("city")
)
# Multiple grouping columns
df.with_columns(
group_avg=pl.col("value").mean().over("category", "region")
)
Mapping strategies (a sketch follows the list):
* group_to_rows (default): preserves original row order
* explode: faster, but groups rows together
* join: creates list columns
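A minimal sketch of switching strategies, assuming a df with hypothetical "city" and "salary" columns:
# group_to_rows (default) maps each result back to its source row;
# "join" instead packs every group's results into a list column
df.with_columns(
    rank_rows=pl.col("salary").rank().over("city"),
    rank_list=pl.col("salary").rank().over("city", mapping_strategy="join"),
)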
Polars supports reading and writing multiple formats.

CSV:
# Eager
df = pl.read_csv("file.csv")
df.write_csv("output.csv")
# Lazy (preferred for large files)
lf = pl.scan_csv("file.csv")
result = lf.filter(...).select(...).collect()
Parquet (recommended for performance):
df = pl.read_parquet("file.parquet")
df.write_parquet("output.parquet")
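For results too large to materialize in memory, the lazy API can also stream straight to disk. A minimal sketch using scan_parquet and sink_parquet (hypothetical file names):
lf = pl.scan_parquet("input.parquet")
lf.filter(pl.col("age") > 25).sink_parquet("filtered.parquet")  # streams to disk without collecting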
JSON:
df = pl.read_json("file.json")
df.write_json("output.json")
For comprehensive I/O documentation, load references/io_guide.md.
Combine DataFrames:
# Inner join
df1.join(df2, on="id", how="inner")
# Left join
df1.join(df2, on="id", how="left")
# Join on different column names
df1.join(df2, left_on="user_id", right_on="id")
Stack DataFrames:
# Vertical (stack rows)
pl.concat([df1, df2], how="vertical")
# Horizontal (add columns)
pl.concat([df1, df2], how="horizontal")
# Diagonal (union with different schemas)
pl.concat([df1, df2], how="diagonal")
Reshape data:
# Pivot (wide format)
df.pivot(on="product", index="date", values="sales")
# Unpivot (long format)
df.unpivot(index="id", on=["col1", "col2"])
For detailed transformation examples, load references/transformations.md.
Polars offers significant performance improvements over pandas with a cleaner API. Key differences:
| Operation | Pandas | Polars |
|---|---|---|
| Select column | df["col"] | df.select("col") |
| Filter | df[df["col"] > 10] | df.filter(pl.col("col") > 10) |
| Add column | df.assign(x=...) | df.with_columns(x=...) |
| Group by | df.groupby("col").agg(...) | df.group_by("col").agg(...) |
| Window | df.groupby("col").transform(...) | df.with_columns(...).over("col") |
Pandas sequential (slow):
df.assign(
col_a=lambda df_: df_.value * 10,
col_b=lambda df_: df_.value * 100
)
Polars parallel (fast):
df.with_columns(
col_a=pl.col("value") * 10,
col_b=pl.col("value") * 100,
)
For comprehensive migration guide, load references/pandas_migration.md.
1. Use lazy evaluation for large datasets:
lf = pl.scan_csv("large.csv")  # Don't use read_csv
result = lf.filter(...).select(...).collect()
2. Avoid Python functions in hot paths:
* Stay within the expression API for parallelization
* Use .map_elements() only when necessary
* Prefer native Polars operations
3. Use streaming for very large data:
lf.collect(streaming=True)
4. Select only needed columns early:
# Good: Select columns early
lf.select("col1", "col2").filter(...)
# Bad: Filter on all columns first
lf.filter(...).select("col1", "col2")
5. Use appropriate data types:
* Categorical for low-cardinality strings
* Appropriate integer sizes (i32 vs i64)
* Date types for temporal data
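A minimal sketch applying these types, assuming hypothetical columns "city", "age", and "signup" (ISO date strings):
df = df.with_columns(
    pl.col("city").cast(pl.Categorical),       # low-cardinality strings
    pl.col("age").cast(pl.Int32),              # i32 is plenty here
    pl.col("signup").str.to_date("%Y-%m-%d"),  # parse strings into a Date column
)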
Conditional operations:
pl.when(condition).then(value).otherwise(other_value)
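For example, bucketing rows on the age column from earlier:
df.with_columns(
    age_group=pl.when(pl.col("age") > 30)
    .then(pl.lit("senior"))
    .otherwise(pl.lit("junior"))
)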
Column operations across multiple columns:
df.select(pl.col("^.*_value$") * 2) # Regex pattern
Null handling:
pl.col("x").fill_null(0)
pl.col("x").is_null()
pl.col("x").drop_nulls()
For additional best practices and patterns, load references/best_practices.md.
This skill includes comprehensive reference documentation:
* core_concepts.md - Detailed explanations of expressions, lazy evaluation, and the type system
* operations.md - Comprehensive guide to all common operations with examples
* pandas_migration.md - Complete migration guide from pandas to Polars
* io_guide.md - Data I/O operations for all supported formats
* transformations.md - Joins, concatenation, pivots, and reshaping operations
* best_practices.md - Performance optimization tips and common patterns

Load these references as needed when users require detailed information about specific topics.
Weekly Installs: 58
GitHub Stars: 17.3K
First Seen: Jan 20, 2026
Security Audits: Pass (Gen Agent Trust Hub, Socket, Snyk)
Installed on: opencode (49), codex (48), gemini-cli (48), claude-code (47), cursor (46), github-copilot (44)
df.group_by("col").agg(...) |
| Window | df.groupby("col").transform(...) | df.with_columns(...).over("col") |