polars by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill polars
Polars is a lightning-fast DataFrame library for Python and Rust built on Apache Arrow. Work with Polars' expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities for efficient data processing, pandas migration, and data pipeline optimization.
Install Polars:
uv pip install polars
Basic DataFrame creation and operations:
import polars as pl
# Create DataFrame
df = pl.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"city": ["NY", "LA", "SF"]
})
# Select columns
df.select("name", "age")
# Filter rows
df.filter(pl.col("age") > 25)
# Add computed columns
df.with_columns(
age_plus_10=pl.col("age") + 10
)
Expressions are the fundamental building blocks of Polars operations. They describe transformations on data and can be composed, reused, and optimized.
Key principles:
- Use pl.col("column_name") to reference columns

Example:
# Expression-based computation
df.select(
pl.col("name"),
(pl.col("age") * 12).alias("age_in_months")
)
Eager (DataFrame): Operations execute immediately
df = pl.read_csv("file.csv") # Reads immediately
result = df.filter(pl.col("age") > 25) # Executes immediately
Lazy (LazyFrame): Operations build a query plan, optimized before execution
lf = pl.scan_csv("file.csv") # Doesn't read yet
result = lf.filter(pl.col("age") > 25).select("name", "age")
df = result.collect() # Now executes optimized query
When to use lazy: prefer scan_* / LazyFrame for large files and multi-step pipelines.
Benefits of lazy evaluation: the query optimizer can push filters and column projections down to the scan, combine operations, and avoid materializing intermediate results.
For detailed concepts, load references/core_concepts.md.
Select and manipulate columns:
# Select specific columns
df.select("name", "age")
# Select with expressions
df.select(
pl.col("name"),
(pl.col("age") * 2).alias("double_age")
)
# Select all columns matching a pattern
df.select(pl.col("^.*_id$"))
Filter rows by conditions:
# Single condition
df.filter(pl.col("age") > 25)
# Multiple conditions (cleaner than using &)
df.filter(
pl.col("age") > 25,
pl.col("city") == "NY"
)
# Complex conditions
df.filter(
(pl.col("age") > 25) | (pl.col("city") == "LA")
)
Add or modify columns while preserving existing ones:
# Add new columns
df.with_columns(
age_plus_10=pl.col("age") + 10,
name_upper=pl.col("name").str.to_uppercase()
)
# Parallel computation (all columns computed in parallel)
df.with_columns(
pl.col("value") * 10,
pl.col("value") * 100,
)
Group data and compute aggregations:
# Basic grouping
df.group_by("city").agg(
pl.col("age").mean().alias("avg_age"),
pl.len().alias("count")
)
# Multiple group keys
df.group_by("city", "department").agg(
pl.col("salary").sum()
)
# Conditional aggregations
df.group_by("city").agg(
(pl.col("age") > 30).sum().alias("over_30")
)
For detailed operation patterns, load references/operations.md.
Common aggregations within group_by context:
- pl.len() - count rows
- pl.col("x").sum() - sum values
- pl.col("x").mean() - average
- pl.col("x").min() / pl.col("x").max() - extremes
- pl.first() / pl.last() - first/last values

Window functions with over() apply aggregations while preserving row count:
# Add group statistics to each row
df.with_columns(
avg_age_by_city=pl.col("age").mean().over("city"),
rank_in_city=pl.col("salary").rank().over("city")
)
# Multiple grouping columns
df.with_columns(
group_avg=pl.col("value").mean().over("category", "region")
)
Mapping strategies:
- group_to_rows (default): preserves original row order
- explode: faster, but groups rows together
- join: creates list columns

Polars supports reading and writing:
CSV:
# Eager
df = pl.read_csv("file.csv")
df.write_csv("output.csv")
# Lazy (preferred for large files)
lf = pl.scan_csv("file.csv")
result = lf.filter(...).select(...).collect()
Parquet (recommended for performance):
df = pl.read_parquet("file.parquet")
df.write_parquet("output.parquet")
JSON:
df = pl.read_json("file.json")
df.write_json("output.json")
For comprehensive I/O documentation, load references/io_guide.md.
Combine DataFrames:
# Inner join
df1.join(df2, on="id", how="inner")
# Left join
df1.join(df2, on="id", how="left")
# Join on different column names
df1.join(df2, left_on="user_id", right_on="id")
Stack DataFrames:
# Vertical (stack rows)
pl.concat([df1, df2], how="vertical")
# Horizontal (add columns)
pl.concat([df1, df2], how="horizontal")
# Diagonal (union with different schemas)
pl.concat([df1, df2], how="diagonal")
Reshape data:
# Pivot (wide format)
df.pivot(on="product", index="date", values="sales")
# Unpivot (long format)
df.unpivot(index="id", on=["col1", "col2"])
For detailed transformation examples, load references/transformations.md.
Polars offers significant performance improvements over pandas with a cleaner API. Key differences:
| Operation | Pandas | Polars |
|---|---|---|
| Select column | df["col"] | df.select("col") |
| Filter | df[df["col"] > 10] | df.filter(pl.col("col") > 10) |
| Add column | df.assign(x=...) | df.with_columns(x=...) |
| Group by | df.groupby("col").agg(...) | df.group_by("col").agg(...) |
| Window | df.groupby("col").transform(...) | df.with_columns(...).over("col") |
Pandas sequential (slow):
df.assign(
col_a=lambda df_: df_.value * 10,
col_b=lambda df_: df_.value * 100
)
Polars parallel (fast):
df.with_columns(
col_a=pl.col("value") * 10,
col_b=pl.col("value") * 100,
)
For a comprehensive migration guide, load references/pandas_migration.md.
Use lazy evaluation for large datasets:
lf = pl.scan_csv("large.csv") # Don't use read_csv
result = lf.filter(...).select(...).collect()
Avoid Python functions in hot paths:
- Use .map_elements() only when necessary

Use streaming for very large data:
lf.collect(streaming=True)
Select only needed columns early:
# Good: Select columns early
lf.select("col1", "col2").filter(...)
# Bad: Filter on all columns first
lf.filter(...).select("col1", "col2")
Use appropriate data types: for example, pl.Categorical for low-cardinality strings and the smallest integer width that holds your values.
Conditional operations:
pl.when(condition).then(value).otherwise(other_value)
Column operations across multiple columns:
df.select(pl.col("^.*_value$") * 2) # Regex pattern
Null handling:
pl.col("x").fill_null(0)
pl.col("x").is_null()
pl.col("x").drop_nulls()
For additional best practices and patterns, load references/best_practices.md.
This skill includes comprehensive reference documentation:
- core_concepts.md - Detailed explanations of expressions, lazy evaluation, and the type system
- operations.md - Comprehensive guide to all common operations with examples
- pandas_migration.md - Complete migration guide from pandas to Polars
- io_guide.md - Data I/O operations for all supported formats
- transformations.md - Joins, concatenation, pivots, and reshaping operations
- best_practices.md - Performance optimization tips and common patterns

Load these references as needed when users require detailed information about specific topics.
Weekly Installs: 274
GitHub Stars: 23.5K
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on: opencode (236), gemini-cli (223), codex (213), cursor (204), claude-code (203), github-copilot (203)