polars by k-dense-ai/claude-scientific-skills

npx skills add https://github.com/k-dense-ai/claude-scientific-skills --skill polars
Polars is a lightning-fast DataFrame library for Python and Rust built on Apache Arrow. Work with Polars' expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities for efficient data processing, pandas migration, and data pipeline optimization.
Install Polars:
uv pip install polars
Basic DataFrame creation and operations:
import polars as pl
# Create DataFrame
df = pl.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"city": ["NY", "LA", "SF"]
})
# Select columns
df.select("name", "age")
# Filter rows
df.filter(pl.col("age") > 25)
# Add computed columns
df.with_columns(
age_plus_10=pl.col("age") + 10
)
Expressions are the fundamental building blocks of Polars operations. They describe transformations on data and can be composed, reused, and optimized.
Key principles:
pl.col("column_name") to reference columnsExample:
# Expression-based computation
df.select(
pl.col("name"),
(pl.col("age") * 12).alias("age_in_months")
)
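Because an expression is just a value, it can be bound to a name and reused anywhere an expression is accepted. A minimal sketch, reusing the df defined above:
# Build the expression once, reuse it across contexts
age_in_months = (pl.col("age") * 12).alias("age_in_months")
df.select(age_in_months)          # as a projection
df.with_columns(age_in_months)    # as an added column
df.filter(age_in_months > 300)    # inside a predicate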
Eager (DataFrame): Operations execute immediately
df = pl.read_csv("file.csv") # Reads immediately
result = df.filter(pl.col("age") > 25) # Executes immediately
Lazy (LazyFrame): Operations build a query plan, optimized before execution
lf = pl.scan_csv("file.csv") # Doesn't read yet
result = lf.filter(pl.col("age") > 25).select("name", "age")
df = result.collect() # Now executes optimized query
When to use lazy: prefer LazyFrame for large files and multi-step pipelines.
Benefits of lazy evaluation: the query optimizer can prune unused columns (projection pushdown), push filters into the scan (predicate pushdown), and avoid materializing intermediate results. A sketch of inspecting the plan follows.
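To verify what the optimizer will actually run, print the plan before collecting. A minimal sketch using LazyFrame.explain() on the query above:
lf = pl.scan_csv("file.csv")
query = lf.filter(pl.col("age") > 25).select("name", "age")
print(query.explain())  # optimized plan: filter and projection pushed into the scan
df = query.collect()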
For detailed concepts, load references/core_concepts.md.
Select and manipulate columns:
# Select specific columns
df.select("name", "age")
# Select with expressions
df.select(
pl.col("name"),
(pl.col("age") * 2).alias("double_age")
)
# Select all columns matching a pattern
df.select(pl.col("^.*_id$"))
Filter rows by conditions:
# Single condition
df.filter(pl.col("age") > 25)
# Multiple conditions (cleaner than using &)
df.filter(
pl.col("age") > 25,
pl.col("city") == "NY"
)
# Complex conditions
df.filter(
(pl.col("age") > 25) | (pl.col("city") == "LA")
)
Add or modify columns while preserving existing ones:
# Add new columns
df.with_columns(
age_plus_10=pl.col("age") + 10,
name_upper=pl.col("name").str.to_uppercase()
)
# Parallel computation (all columns computed in parallel)
df.with_columns(
pl.col("value") * 10,
pl.col("value") * 100,
)
Group data and compute aggregations:
# Basic grouping
df.group_by("city").agg(
pl.col("age").mean().alias("avg_age"),
pl.len().alias("count")
)
# Multiple group keys
df.group_by("city", "department").agg(
pl.col("salary").sum()
)
# Conditional aggregations
df.group_by("city").agg(
(pl.col("age") > 30).sum().alias("over_30")
)
For detailed operation patterns, load references/operations.md.
Common aggregations within group_by context:
* pl.len() - count rows
* pl.col("x").sum() - sum values
* pl.col("x").mean() - average
* pl.col("x").min() / pl.col("x").max() - extremes
* pl.first() / pl.last() - first/last values

Window functions with over() apply aggregations while preserving row count:
# Add group statistics to each row
df.with_columns(
avg_age_by_city=pl.col("age").mean().over("city"),
rank_in_city=pl.col("salary").rank().over("city")
)
# Multiple grouping columns
df.with_columns(
group_avg=pl.col("value").mean().over("category", "region")
)
Mapping strategies (a sketch follows the list):
* group_to_rows (default): preserves original row order
* explode: faster, but groups rows together
* join: creates list columns
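A minimal sketch of switching strategies, assuming a df with hypothetical "city" and "salary" columns:
# group_to_rows (default) maps each result back to its source row;
# "join" instead packs every group's results into a list column
df.with_columns(
    rank_rows=pl.col("salary").rank().over("city"),
    rank_list=pl.col("salary").rank().over("city", mapping_strategy="join"),
)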
Polars supports reading and writing multiple formats.

CSV:
# Eager
df = pl.read_csv("file.csv")
df.write_csv("output.csv")
# Lazy (preferred for large files)
lf = pl.scan_csv("file.csv")
result = lf.filter(...).select(...).collect()
Parquet (recommended for performance):
df = pl.read_parquet("file.parquet")
df.write_parquet("output.parquet")
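For results too large to materialize in memory, the lazy API can also stream straight to disk. A minimal sketch using scan_parquet and sink_parquet (hypothetical file names):
lf = pl.scan_parquet("input.parquet")
lf.filter(pl.col("age") > 25).sink_parquet("filtered.parquet")  # streams to disk without collecting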
JSON:
df = pl.read_json("file.json")
df.write_json("output.json")
For comprehensive I/O documentation, load references/io_guide.md.
Combine DataFrames:
# Inner join
df1.join(df2, on="id", how="inner")
# Left join
df1.join(df2, on="id", how="left")
# Join on different column names
df1.join(df2, left_on="user_id", right_on="id")
Stack DataFrames:
# Vertical (stack rows)
pl.concat([df1, df2], how="vertical")
# Horizontal (add columns)
pl.concat([df1, df2], how="horizontal")
# Diagonal (union with different schemas)
pl.concat([df1, df2], how="diagonal")
Reshape data:
# Pivot (wide format)
df.pivot(on="product", index="date", values="sales")
# Unpivot (long format)
df.unpivot(index="id", on=["col1", "col2"])
For detailed transformation examples, load references/transformations.md.
Polars offers significant performance improvements over pandas with a cleaner API. Key differences:
| Operation | Pandas | Polars |
|---|---|---|
| Select column | df["col"] | df.select("col") |
| Filter | df[df["col"] > 10] | df.filter(pl.col("col") > 10) |
| Add column | df.assign(x=...) | df.with_columns(x=...) |
| Group by | df.groupby("col").agg(...) | df.group_by("col").agg(...) |
| Window | df.groupby("col").transform(...) | df.with_columns(...).over("col") |
Pandas sequential (slow):
df.assign(
col_a=lambda df_: df_.value * 10,
col_b=lambda df_: df_.value * 100
)
Polars parallel (fast):
df.with_columns(
col_a=pl.col("value") * 10,
col_b=pl.col("value") * 100,
)
For comprehensive migration guide, load references/pandas_migration.md.
1. Use lazy evaluation for large datasets:
lf = pl.scan_csv("large.csv")  # Don't use read_csv
result = lf.filter(...).select(...).collect()
2. Avoid Python functions in hot paths:
* Stay within the expression API for parallelization
* Use .map_elements() only when necessary
* Prefer native Polars operations
3. Use streaming for very large data:
lf.collect(streaming=True)
4. Select only needed columns early:
# Good: Select columns early
lf.select("col1", "col2").filter(...)
# Bad: Filter on all columns first
lf.filter(...).select("col1", "col2")
5. Use appropriate data types:
* Categorical for low-cardinality strings
* Appropriate integer sizes (i32 vs i64)
* Date types for temporal data
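A minimal sketch applying these types, assuming hypothetical columns "city", "age", and "signup" (ISO date strings):
df = df.with_columns(
    pl.col("city").cast(pl.Categorical),       # low-cardinality strings
    pl.col("age").cast(pl.Int32),              # i32 is plenty here
    pl.col("signup").str.to_date("%Y-%m-%d"),  # parse strings into a Date column
)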
Conditional operations:
pl.when(condition).then(value).otherwise(other_value)
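For example, bucketing rows on the age column from earlier:
df.with_columns(
    age_group=pl.when(pl.col("age") > 30)
    .then(pl.lit("senior"))
    .otherwise(pl.lit("junior"))
)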
Column operations across multiple columns:
df.select(pl.col("^.*_value$") * 2) # Regex pattern
Null handling:
pl.col("x").fill_null(0)
pl.col("x").is_null()
pl.col("x").drop_nulls()
For additional best practices and patterns, load references/best_practices.md.
This skill includes comprehensive reference documentation:
* core_concepts.md - Detailed explanations of expressions, lazy evaluation, and the type system
* operations.md - Comprehensive guide to all common operations with examples
* pandas_migration.md - Complete migration guide from pandas to Polars
* io_guide.md - Data I/O operations for all supported formats
* transformations.md - Joins, concatenation, pivots, and reshaping operations
* best_practices.md - Performance optimization tips and common patterns

Load these references as needed when users require detailed information about specific topics.
Weekly Installs: 58
GitHub Stars: 17.3K
First Seen: Jan 20, 2026
Security Audits: Pass (Gen Agent Trust Hub, Socket, Snyk)
Installed on: opencode (49), codex (48), gemini-cli (48), claude-code (47), cursor (46), github-copilot (44)
df.group_by("col").agg(...) |
| Window | df.groupby("col").transform(...) | df.with_columns(...).over("col") |