Rust Performance Optimization Guide: the m10-performance skill explained, covering algorithm, data structure, and memory optimization strategies

m10-performance by zhanghandong/rust-skills

704 weekly installs

912 GitHub Stars

Install command

npx skills add https://github.com/zhanghandong/rust-skills --skill m10-performance

Development, Performance Optimization, Rust

🇺🇸English

Performance Optimization

Layer 2: Design Choices

Core Question

What's the bottleneck, and is optimization worth it?

Before optimizing:

Have you measured? (Don't guess)
What level of performance is acceptable?
Will optimization add complexity?

Performance Decision → Implementation

Goal	Design Choice	Implementation
Reduce allocations	Pre-allocate, reuse	`with_capacity`, object pools
Improve cache locality	Contiguous data	`Vec`, `SmallVec`
Parallelize	Data parallelism	`rayon`, threads
Avoid copies	Zero-copy	References, `Cow<T>`
Reduce indirection	Inline data	`smallvec`, arrays
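
To make the "Avoid copies" row concrete, here is a minimal sketch of the `Cow` pattern using only the standard library. The function name `normalize_tabs` is hypothetical and chosen for illustration; the point is that the input is borrowed unless a change is actually required.

```rust
use std::borrow::Cow;

/// Replaces tabs with single spaces.
/// Hypothetical helper: allocates a new String only when the input contains a tab.
fn normalize_tabs(input: &str) -> Cow<'_, str> {
    if input.contains('\t') {
        // Slow path: the text must change, so an owned copy is unavoidable.
        Cow::Owned(input.replace('\t', " "))
    } else {
        // Fast path: hand the caller's slice back without copying.
        Cow::Borrowed(input)
    }
}
```

Callers can treat the result as a `&str` through deref, so the borrowed and owned cases look the same at the call site.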

Thinking Prompt

Before optimizing:

Have you measured?
- Profile first → flamegraph, perf
- Benchmark → criterion, cargo bench
- Identify actual hotspots
What's the priority?
- Algorithm (10x-1000x improvement)
- Data structure (2x-10x)
- Allocation (2x-5x)
- Cache (1.5x-3x)
What's the trade-off?
- Complexity vs speed
- Memory vs CPU
- Latency vs throughput
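
As a rough illustration of "measure, don't guess", the sketch below times a closure with `std::time::Instant`. It is a crude stand-in for `criterion`, not a replacement: there is no warm-up, outlier handling, or statistics. The names `time_it` and `sum_to` are hypothetical, and the numbers are only meaningful in a `--release` build.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` `iters` times and returns the total elapsed wall-clock time.
/// `black_box` keeps the optimizer from discarding the unused result.
fn time_it<F: FnMut() -> u64>(iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        black_box(f());
    }
    start.elapsed()
}

/// Hypothetical workload: sums the integers in 0..n.
fn sum_to(n: u64) -> u64 {
    (0..n).sum()
}
```

A call such as `time_it(1_000, || sum_to(black_box(10_000)))` gives a first impression of cost; for decisions that matter, a statistical harness like `criterion` is the safer choice.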

Trace Up ↑

To domain constraints (Layer 3):

"How fast does this need to be?"
    ↑ Ask: What's the performance SLA?
    ↑ Check: domain-* (latency requirements)
    ↑ Check: Business requirements (acceptable response time)

Question	Trace To	Ask
Latency requirements	domain-*	What's the acceptable response time?
Throughput needs	domain-*	How many requests per second?
Memory constraints	domain-*	What's the memory budget?

Trace Down ↓

To implementation (Layer 1):

"Need to reduce allocations"
    ↓ m01-ownership: Use references, avoid clone
    ↓ m02-resource: Pre-allocate with `with_capacity`

"Need to parallelize"
    ↓ m07-concurrency: Choose rayon or threads
    ↓ m07-concurrency: Consider async for I/O-bound work

"Need cache efficiency"
    ↓ Data layout: Prefer Vec over HashMap when possible
    ↓ Access patterns: Sequential over random access
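
The "rayon or threads" choice can be sketched with plain threads. The example below assumes the standard library's scoped threads rather than `rayon`, so it needs no extra crate; `parallel_sum` and its `workers` parameter are hypothetical names. Each worker scans one contiguous chunk, which also keeps access sequential.

```rust
use std::thread;

/// Sums `data` by splitting it into contiguous chunks, one per worker thread.
fn parallel_sum(data: &[u64], workers: usize) -> u64 {
    let workers = workers.max(1);
    // Ceiling division so every element lands in a chunk; never zero,
    // because `chunks(0)` would panic.
    let chunk_len = ((data.len() + workers - 1) / workers).max(1);

    thread::scope(|s| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<u64>()))
            .collect();

        // Then collect the partial sums.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<u64>()
    })
}
```

For small inputs the thread start-up cost will outweigh any gain, so this is only worth trying once profiling shows the loop is a hotspot.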

Quick Reference

Tool	Purpose
`cargo bench`	Micro-benchmarks
`criterion`	Statistical benchmarks
`perf` / `flamegraph`	CPU profiling
`heaptrack`	Allocation tracking
`valgrind` / `cachegrind`	Cache analysis

Optimization Priority

1. Algorithm choice     (10x - 1000x)
2. Data structure       (2x - 10x)
3. Allocation reduction (2x - 5x)
4. Cache optimization   (1.5x - 3x)
5. SIMD/Parallelism     (2x - 8x)
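
Algorithm choice sits at the top of the list because it changes the growth rate rather than the constant factor. As a hedged illustration (both function names are hypothetical), here is the same duplicate check written with a quadratic scan and with a hash set:

```rust
use std::collections::HashSet;

/// O(n^2): compares every element against all later elements.
fn has_duplicate_quadratic(items: &[u32]) -> bool {
    for (i, a) in items.iter().enumerate() {
        if items[i + 1..].contains(a) {
            return true;
        }
    }
    false
}

/// Expected O(n): `insert` returns false when the value was already present.
fn has_duplicate_hashed(items: &[u32]) -> bool {
    let mut seen = HashSet::with_capacity(items.len());
    items.iter().any(|x| !seen.insert(*x))
}
```

For very short slices the quadratic version may still win because it allocates nothing, which is one more reason to benchmark with realistic input sizes.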

Common Techniques

Technique	When	How
Pre-allocation	Known size	`Vec::with_capacity(n)`
Avoid cloning	Hot paths	Use references or `Cow<T>`
Batch operations	Many small ops	Collect then process
SmallVec	Usually small	`smallvec::SmallVec<[T; N]>`
Inline buffers	Fixed-size data	Arrays over Vec
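
Two of the rows above, pre-allocation and inline buffers, can be sketched with the standard library alone. The function names `squares` and `byte_histogram` are hypothetical.

```rust
/// Pre-allocation: the final length is known, so reserve it once
/// instead of letting the Vec grow and reallocate while pushing.
fn squares(n: usize) -> Vec<u64> {
    let mut out = Vec::with_capacity(n);
    for i in 0..n {
        let i = i as u64;
        out.push(i * i);
    }
    out
}

/// Inline buffer: a byte has exactly 256 possible values, so a
/// fixed-size array on the stack replaces a heap-allocated collection.
fn byte_histogram(bytes: &[u8]) -> [usize; 256] {
    let mut counts = [0usize; 256];
    for &b in bytes {
        counts[b as usize] += 1;
    }
    counts
}
```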

Common Mistakes

Mistake	Why Wrong	Better
Optimize without profiling	Wrong target	Profile first
Benchmark in debug mode	Unoptimized timings are meaningless	Always build with `--release`
Use LinkedList	Cache unfriendly	`Vec` or `VecDeque`
Hidden `.clone()`	Unnecessary allocs	Use references
Premature optimization	Wasted effort	Make it work first
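
For the `LinkedList` row, a queue is the usual case where a linked list looks tempting. The sketch below uses `VecDeque` for a sliding-window sum; `moving_sums` is a hypothetical name. The ring buffer keeps elements contiguous and still pops from the front in O(1).

```rust
use std::collections::VecDeque;

/// Returns the sum of every full window of `window` consecutive values.
fn moving_sums(values: &[i64], window: usize) -> Vec<i64> {
    if window == 0 {
        return Vec::new();
    }
    // One extra slot: the queue briefly holds window + 1 items before popping.
    let mut queue = VecDeque::with_capacity(window + 1);
    let mut sum = 0i64;
    let mut out = Vec::new();

    for &v in values {
        queue.push_back(v);
        sum += v;
        if queue.len() > window {
            // Drop the oldest value from the front of the ring buffer.
            sum -= queue.pop_front().expect("queue is non-empty here");
        }
        if queue.len() == window {
            out.push(sum);
        }
    }
    out
}
```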

Anti-Patterns

Anti-Pattern	Why Bad	Better
Clone to avoid lifetimes	Performance cost	Proper ownership
Box everything	Indirection cost	Stack when possible
HashMap for small sets	Overhead	Vec with linear search
String concat in loop	O(n^2) when each iteration rebuilds the string	`String::with_capacity` plus `push_str` or `write!`
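
The last two rows can be sketched as follows, using only the standard library. `join_ids` and `lookup` are hypothetical names, and the capacity estimate of four bytes per id is an assumption for illustration, not a tuned value.

```rust
use std::fmt::Write;

/// Builds a comma-separated list by appending into one pre-sized buffer,
/// instead of creating a fresh String on every iteration.
fn join_ids(ids: &[u32]) -> String {
    // Assumed estimate: about four bytes per id including the separator.
    let mut out = String::with_capacity(ids.len() * 4);
    for (i, id) in ids.iter().enumerate() {
        if i > 0 {
            out.push(',');
        }
        // `write!` formats directly into `out` with no temporary String.
        write!(out, "{id}").expect("writing to a String cannot fail");
    }
    out
}

/// Small-set lookup: a linear scan over a slice of pairs avoids hashing
/// and keeps the data contiguous.
fn lookup(table: &[(&str, u32)], key: &str) -> Option<u32> {
    table.iter().find(|(k, _)| *k == key).map(|&(_, v)| v)
}
```

Where the crossover between a linear scan and a `HashMap` lies depends on key type and hardware, so it is worth measuring before switching either way.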

Related Skills

When	See
Reducing clones	m01-ownership
Concurrency options	m07-concurrency
Smart pointer choice	m02-resource
Domain requirements	domain-*

First Seen

Jan 20, 2026

Security Audits

Gen Agent Trust Hub: Fail, Socket: Pass, Snyk: Pass

Installed on

opencode: 641

codex: 629

gemini-cli: 611

github-copilot: 600

amp: 527

kimi-cli: 525
