polars by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill polars
Polars is a lightning-fast DataFrame library for Python and Rust built on Apache Arrow. Work with Polars' expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities for efficient data processing, pandas migration, and data pipeline optimization.
Install Polars:
uv pip install polars
Basic DataFrame creation and operations:
import polars as pl
# Create DataFrame
df = pl.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"city": ["NY", "LA", "SF"]
})
# Select columns
df.select("name", "age")
# Filter rows
df.filter(pl.col("age") > 25)
# Add computed columns
df.with_columns(
age_plus_10=pl.col("age") + 10
)
Expressions are the fundamental building blocks of Polars operations. They describe transformations on data and can be composed, reused, and optimized.
Key principles:
- Use pl.col("column_name") to reference columns

Example:
# Expression-based computation
df.select(
pl.col("name"),
(pl.col("age") * 12).alias("age_in_months")
)
Eager (DataFrame): Operations execute immediately
df = pl.read_csv("file.csv") # Reads immediately
result = df.filter(pl.col("age") > 25) # Executes immediately
Lazy (LazyFrame): Operations build a query plan, optimized before execution
lf = pl.scan_csv("file.csv") # Doesn't read yet
result = lf.filter(pl.col("age") > 25).select("name", "age")
df = result.collect() # Now executes optimized query
When to use lazy: prefer scan_* / LazyFrame for large files and multi-step pipelines.
Benefits of lazy evaluation: the query optimizer can push filters and column projections down to the scan, combine operations, and avoid materializing intermediate results.
For detailed concepts, load references/core_concepts.md.
Select and manipulate columns:
# Select specific columns
df.select("name", "age")
# Select with expressions
df.select(
pl.col("name"),
(pl.col("age") * 2).alias("double_age")
)
# Select all columns matching a pattern
df.select(pl.col("^.*_id$"))
Filter rows by conditions:
# Single condition
df.filter(pl.col("age") > 25)
# Multiple conditions (cleaner than using &)
df.filter(
pl.col("age") > 25,
pl.col("city") == "NY"
)
# Complex conditions
df.filter(
(pl.col("age") > 25) | (pl.col("city") == "LA")
)
Add or modify columns while preserving existing ones:
# Add new columns
df.with_columns(
age_plus_10=pl.col("age") + 10,
name_upper=pl.col("name").str.to_uppercase()
)
# Parallel computation (all columns computed in parallel)
df.with_columns(
pl.col("value") * 10,
pl.col("value") * 100,
)
Group data and compute aggregations:
# Basic grouping
df.group_by("city").agg(
pl.col("age").mean().alias("avg_age"),
pl.len().alias("count")
)
# Multiple group keys
df.group_by("city", "department").agg(
pl.col("salary").sum()
)
# Conditional aggregations
df.group_by("city").agg(
(pl.col("age") > 30).sum().alias("over_30")
)
For detailed operation patterns, load references/operations.md.
Common aggregations within group_by context:
- pl.len() - count rows
- pl.col("x").sum() - sum values
- pl.col("x").mean() - average
- pl.col("x").min() / pl.col("x").max() - extremes
- pl.first() / pl.last() - first/last values

Window functions with over() apply aggregations while preserving row count:
# Add group statistics to each row
df.with_columns(
avg_age_by_city=pl.col("age").mean().over("city"),
rank_in_city=pl.col("salary").rank().over("city")
)
# Multiple grouping columns
df.with_columns(
group_avg=pl.col("value").mean().over("category", "region")
)
Mapping strategies:
- group_to_rows (default): preserves original row order
- explode: faster, but groups rows together
- join: creates list columns

Polars supports reading and writing:
CSV:
# Eager
df = pl.read_csv("file.csv")
df.write_csv("output.csv")
# Lazy (preferred for large files)
lf = pl.scan_csv("file.csv")
result = lf.filter(...).select(...).collect()
Parquet (recommended for performance):
df = pl.read_parquet("file.parquet")
df.write_parquet("output.parquet")
JSON:
df = pl.read_json("file.json")
df.write_json("output.json")
For comprehensive I/O documentation, load references/io_guide.md.
Combine DataFrames:
# Inner join
df1.join(df2, on="id", how="inner")
# Left join
df1.join(df2, on="id", how="left")
# Join on different column names
df1.join(df2, left_on="user_id", right_on="id")
Stack DataFrames:
# Vertical (stack rows)
pl.concat([df1, df2], how="vertical")
# Horizontal (add columns)
pl.concat([df1, df2], how="horizontal")
# Diagonal (union with different schemas)
pl.concat([df1, df2], how="diagonal")
Reshape data:
# Pivot (wide format)
df.pivot(on="product", index="date", values="sales")
# Unpivot (long format)
df.unpivot(index="id", on=["col1", "col2"])
For detailed transformation examples, load references/transformations.md.
Polars offers significant performance improvements over pandas with a cleaner API. Key differences:
| Operation | Pandas | Polars |
|---|---|---|
| Select column | df["col"] | df.select("col") |
| Filter | df[df["col"] > 10] | df.filter(pl.col("col") > 10) |
| Add column | df.assign(x=...) | df.with_columns(x=...) |
| Group by | df.groupby("col").agg(...) | df.group_by("col").agg(...) |
| Window | df.groupby("col").transform(...) | df.with_columns(...).over("col") |
Pandas sequential (slow):
df.assign(
col_a=lambda df_: df_.value * 10,
col_b=lambda df_: df_.value * 100
)
Polars parallel (fast):
df.with_columns(
col_a=pl.col("value") * 10,
col_b=pl.col("value") * 100,
)
For a comprehensive migration guide, load references/pandas_migration.md.
Use lazy evaluation for large datasets:
lf = pl.scan_csv("large.csv") # Don't use read_csv
result = lf.filter(...).select(...).collect()
Avoid Python functions in hot paths:
- Use .map_elements() only when necessary

Use streaming for very large data:
lf.collect(streaming=True)
Select only needed columns early:
# Good: Select columns early
lf.select("col1", "col2").filter(...)
# Bad: Filter on all columns first
lf.filter(...).select("col1", "col2")
Use appropriate data types: for example, pl.Categorical for low-cardinality strings and the smallest integer width that holds your values.
Conditional operations:
pl.when(condition).then(value).otherwise(other_value)
Column operations across multiple columns:
df.select(pl.col("^.*_value$") * 2) # Regex pattern
Null handling:
pl.col("x").fill_null(0)
pl.col("x").is_null()
pl.col("x").drop_nulls()
For additional best practices and patterns, load references/best_practices.md.
This skill includes comprehensive reference documentation:
- core_concepts.md - Detailed explanations of expressions, lazy evaluation, and the type system
- operations.md - Comprehensive guide to all common operations with examples
- pandas_migration.md - Complete migration guide from pandas to Polars
- io_guide.md - Data I/O operations for all supported formats
- transformations.md - Joins, concatenation, pivots, and reshaping operations
- best_practices.md - Performance optimization tips and common patterns

Load these references as needed when users require detailed information about specific topics.
Weekly Installs: 274
GitHub Stars: 23.5K
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on: opencode (236), gemini-cli (223), codex (213), cursor (204), claude-code (203), github-copilot (203)