Performance Engineer Skill Guide: eBPF Flamegraph Analysis, Load Testing, and Kernel Tuning in Practice

performance-engineer by 404kidwiz/claude-supercode-skills

131 weekly installs

63 GitHub Stars

GitHub

Install command

npx skills add https://github.com/404kidwiz/claude-supercode-skills --skill performance-engineer

DevOps, Monitoring, Performance Optimization

🇺🇸English

Performance Engineer

Purpose

Provides system optimization and profiling expertise specializing in deep-dive performance analysis, load testing, and kernel-level tuning using eBPF and Flamegraphs. Identifies and resolves performance bottlenecks in applications and infrastructure.

When to Use

Investigating high latency (P99 spikes) or low throughput
Analyzing CPU/Memory profiles (Flamegraphs)
Conducting Load Tests (K6, Gatling, Locust)
Tuning Linux Kernel parameters (sysctl)
Implementing Continuous Profiling (Parca, Pyroscope)
Debugging "it works on my machine but is slow in prod" issues

2. Decision Framework

Profiling Strategy

What is the bottleneck?
│
├─ **CPU High?**
│  ├─ User Space? → **Language Profiler** (pprof, async-profiler)
│  └─ Kernel Space? → **perf / eBPF** (System calls, Context switches)
│
├─ **Memory High?**
│  ├─ Leak? → **Heap Dump Analysis** (Eclipse MAT, heaptrack)
│  └─ Fragmentation? → **Allocator tuning** (jemalloc, tcmalloc)
│
├─ **I/O Wait?**
│  ├─ Disk? → **iostat / biotop**
│  └─ Network? → **tcpdump / Wireshark**
│
└─ **Latency (Wait Time)?**
   └─ Distributed? → **Tracing** (OpenTelemetry, Jaeger)

Load Testing Tools

Tool	Language	Best For
K6	JS	Developer-friendly, CI/CD integration.
Gatling	Scala/Java	High concurrency, complex scenarios.
Locust	Python	Rapid prototyping, code-based tests.
Wrk2	C	Raw HTTP throughput benchmarking (simple).

Optimization Hierarchy

Algorithm: O(n^2) → O(n log n). Biggest wins.
Architecture: Caching, Async processing.
Code/Language: Memory allocation, loop unrolling.
System/Kernel: TCP stack tuning, CPU affinity.
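
As a rough illustration of the top tier of this hierarchy, the sketch below contrasts a quadratic duplicate check with a sort-based one. The function names `hasDuplicateQuadratic` and `hasDuplicateSorted` are hypothetical and chosen only for this example; they are not part of the skill.

```javascript
// Hypothetical example: does an array of numbers contain a duplicate?

// O(n^2): compares every pair of elements.
function hasDuplicateQuadratic(values) {
  for (let i = 0; i < values.length; i++) {
    for (let j = i + 1; j < values.length; j++) {
      if (values[i] === values[j]) return true;
    }
  }
  return false;
}

// O(n log n): sort a copy, then compare neighbours in a single pass.
function hasDuplicateSorted(values) {
  const sorted = [...values].sort((a, b) => a - b);
  for (let i = 1; i < sorted.length; i++) {
    if (sorted[i] === sorted[i - 1]) return true;
  }
  return false;
}
```

Both functions return the same answer; only the growth rate differs, which is why this tier tends to pay off before any code-level or kernel-level tuning.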

Red Flags → Escalate to database-optimizer:

"Slow performance" turns out to be a single SQL query missing an index
Database locks/deadlocks causing application stalls
Disk I/O saturation on the DB server

3. Core Workflows

Workflow 1: CPU Profiling with Flamegraphs

Goal: Identify which function is consuming 80% CPU.

Steps:

Capture Profile (Linux perf)

# Record stack traces at 99Hz for 30 seconds
perf record -F 99 -a -g -- sleep 30

Generate Flamegraph

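# Dump the recorded samples as text, fold each stack onto one line, then render the SVG.
# stackcollapse-perf.pl and flamegraph.pl come from the FlameGraph repository.
perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > profile.svg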

Analysis
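- Open profile.svg in a browser.
- Look for wide frames: width is proportional to the share of samples in which a function was on the stack, so the widest towers show where CPU time goes.
- Example: json_parse spans 40% of the width → optimize JSON handling.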
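
The same reading can be done numerically on the folded file. The sketch below is a minimal, hedged example: it assumes the usual folded format of one stack per line (`frameA;frameB;frameC <count>`), and `summarizeFoldedStacks` plus the sample data are hypothetical names and values invented for illustration.

```javascript
// Summarize folded stacks: for every frame, the share of samples in which it
// appears anywhere on the stack (the frame's width in the flamegraph).
function summarizeFoldedStacks(foldedText) {
  const inclusive = new Map(); // frame name -> inclusive sample count
  let total = 0;

  for (const line of foldedText.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    const lastSpace = trimmed.lastIndexOf(" ");
    if (lastSpace === -1) continue; // no count column
    const count = Number(trimmed.slice(lastSpace + 1));
    if (!Number.isFinite(count)) continue; // skip malformed lines
    total += count;
    // Count each frame once per stack so recursion is not double counted.
    const frames = new Set(trimmed.slice(0, lastSpace).split(";"));
    for (const frame of frames) {
      inclusive.set(frame, (inclusive.get(frame) ?? 0) + count);
    }
  }

  return [...inclusive.entries()]
    .map(([frame, samples]) => ({
      frame,
      samples,
      percent: total === 0 ? 0 : (samples / total) * 100,
    }))
    .sort((a, b) => b.samples - a.samples);
}

// Made-up sample data, for illustration only.
const sample = [
  "main;handle_request;json_parse 40",
  "main;handle_request;db_query 35",
  "main;idle 25",
].join("\n");

const summary = summarizeFoldedStacks(sample);
console.log(summary);
```

In Node.js the input could come from the file produced above, for example by reading `out.folded` as UTF-8 text and passing it to the function.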

Workflow 3: Interaction to Next Paint (INP)

Goal: Improve Frontend responsiveness (Core Web Vital).

Steps:

Measure
- Use Chrome DevTools Performance tab.
- Look for "Long Tasks" (Red blocks > 50ms).
Identify
- Is it hydration? Event handlers?
- Example: A click handler forcing a synchronous layout recalculation.
Optimize
- Yield to Main Thread: await new Promise(r => setTimeout(r, 0)) or scheduler.postTask().
- Web Workers: Move heavy logic off-thread.
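
One way to apply the "yield to main thread" step is to process work in slices and yield whenever a slice exceeds a time budget. This is a sketch under assumptions: `yieldToMain`, `processInChunks`, and the `budgetMs` default (50, matching the long-task threshold above) are hypothetical names and values, and the yield uses the `setTimeout` form quoted in the text.

```javascript
// Yield control so pending input handling and rendering can run.
const yieldToMain = () => new Promise((resolve) => setTimeout(resolve, 0));

// Apply processItem to every item, yielding whenever the current slice has
// run for at least budgetMs milliseconds.
async function processInChunks(items, processItem, budgetMs = 50) {
  const results = [];
  let sliceStart = performance.now();
  for (const item of items) {
    results.push(processItem(item));
    if (performance.now() - sliceStart >= budgetMs) {
      await yieldToMain();
      sliceStart = performance.now(); // start timing the next slice
    }
  }
  return results;
}
```

A caller would write something like `const rows = await processInChunks(records, parseRecord)`, where `records` and `parseRecord` are placeholders. Work that cannot be sliced this way is a better fit for the Web Worker option.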

Workflow 5: Interaction to Next Paint (INP) Optimization

Goal: Fix "Laggy Click" (INP > 200ms) on a React button.

Steps:

Identify Interaction
- Use React DevTools Profiler (Interaction Tracing).
- Find the click handler duration.

Break Up Long Tasks

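async function handleClick() {
  // 1. UI update (immediate): show the loading state
  setLoading(true);

  // 2. Yield to the main thread so the browser can paint the loading state
  await new Promise(r => setTimeout(r, 0));

  try {
    // 3. Heavy logic runs after the paint
    await heavyCalculation();
  } finally {
    // Clear the loading state even if the calculation throws
    setLoading(false);
  }
}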

Verify
- Use Web Vitals extension. Check if INP drops below 200ms.

5. Anti-Patterns & Gotchas

❌ Anti-Pattern 1: Premature Optimization

What it looks like:

Replacing a readable map() with a complex for loop because "it's faster" without measuring.

Why it fails:

Wasted dev time.
Code becomes unreadable.
Usually negligible impact compared to I/O.

Correct approach:

Measure First: Only optimize hot paths identified by a profiler.
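
Before rewriting a readable `map()`, a quick measurement can show whether the change matters at all. The harness below is a minimal sketch, not a substitute for a profiler: `medianRuntimeMs`, its `warmup` and `runs` defaults, and the two candidate functions are all hypothetical and exist only for this illustration.

```javascript
// Time a synchronous function: run warmup iterations first (JIT, caches),
// then return the median of the measured runs in milliseconds.
function medianRuntimeMs(fn, { warmup = 5, runs = 20 } = {}) {
  for (let i = 0; i < warmup; i++) fn();
  const samples = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  const mid = Math.floor(samples.length / 2);
  return samples.length % 2 === 1
    ? samples[mid]
    : (samples[mid - 1] + samples[mid]) / 2;
}

// Hypothetical candidates: the readable map() and the hand-written loop.
const data = Array.from({ length: 10000 }, (_, i) => i);
const withMap = () => data.map((x) => x * 2);
const withLoop = () => {
  const out = new Array(data.length);
  for (let i = 0; i < data.length; i++) out[i] = data[i] * 2;
  return out;
};

console.log("map():", medianRuntimeMs(withMap), "ms");
console.log("for loop:", medianRuntimeMs(withLoop), "ms");
```

If the difference is a fraction of a millisecond and the request spends most of its time on I/O, the rewrite is unlikely to be worth the lost readability.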

❌ Anti-Pattern 2: Testing "localhost" vs Production

What it looks like:

"It handles 10k req/s on my MacBook."

Why it fails:

Network latency (0ms on localhost).
Database dataset size (tiny on local).
Cloud limits (CPU credits, I/O bursts).

Correct approach:

Test in a Staging Environment that mirrors Prod capacity (or a scaled-down ratio).

❌ Anti-Pattern 3: Ignoring Tail Latency (Averages)

What it looks like:

"Average latency is 200ms, we are fine."

Why it fails:

P99 could be 10 seconds. 1% of users are suffering.
In microservices, tail latencies compound: a request that fans out to many services is far more likely to hit at least one slow call.

Correct approach:

Always measure P50, P95, and P99. Optimize for P99.
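
The gap between an average and the tail can be shown with a few lines of arithmetic. The sketch below uses the nearest-rank percentile definition (other definitions interpolate and give slightly different values) and assumes backend calls are independent when estimating fan-out risk. `percentile`, `chanceOfSlowCall`, and the latency samples are hypothetical and made up for illustration.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Probability that a request making `calls` independent backend calls sees at
// least one response from the slowest `tailFraction` of a backend's latencies.
function chanceOfSlowCall(calls, tailFraction = 0.01) {
  return 1 - Math.pow(1 - tailFraction, calls);
}

// Made-up latencies in ms: 98 fast requests and 2 very slow ones.
const latencies = [...Array(98).fill(100), 5000, 10000];
const mean = latencies.reduce((sum, x) => sum + x, 0) / latencies.length;

console.log("mean:", mean);
console.log("P50:", percentile(latencies, 50));
console.log("P95:", percentile(latencies, 95));
console.log("P99:", percentile(latencies, 99));
console.log("slow-call chance at 15 calls:", chanceOfSlowCall(15));
```

With these samples the mean is 248 ms while P99 is 5,000 ms, and under the independence assumption a request touching 15 backends has roughly a 14% chance of hitting at least one P99-or-worse call.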

Examples

Example 1: CPU Performance Optimization Using Flamegraphs

Scenario: Production API experiencing 80% CPU utilization causing latency spikes.

Investigation Approach:

Profile Collection: Used perf to capture CPU stack traces
Flamegraph Generation: Created visualization of CPU usage
Analysis: Identified hot functions consuming most CPU
Optimization: Targeted the top 3 functions

Key Findings:

Function	CPU %	Optimization Action
json_serialize	35%	Switch to binary format
crypto_hash	25%	Batch hashing operations
regex_match	20%	Pre-compile patterns

Results:

CPU utilization: 80% → 35%
P99 latency: 1.2s → 150ms
Throughput: 500 RPS → 2,000 RPS

Example 2: Distributed Tracing for Microservices Latency

Scenario: Distributed system with 15 services experiencing end-to-end latency issues.

Investigation Approach:

Trace Collection: Deployed OpenTelemetry collectors
Latency Analysis: Identified service with highest latency contribution
Dependency Analysis: Mapped service dependencies and data flows
Root Cause: Database connection pool exhaustion

Trace Analysis:

Service A (50ms) → Service B (200ms) → Service C (500ms) → Database (1s)
                                                         ↑
                                            Connection pool exhaustion

Resolution:

Increased connection pool size
Implemented query optimization
Added read replicas for heavy queries

Results:

End-to-end P99: 2.5s → 300ms
Database CPU: 95% → 60%
Error rate: 5% → 0.1%

Example 3: Load Testing for Capacity Planning

Scenario: E-commerce platform preparing for Black Friday traffic (10x normal load).

Load Testing Approach:

Test Design: Created realistic user journey scenarios
Test Execution: Gradual ramp-up to target load
Bottleneck Identification: Found breaking points
Capacity Planning: Determined required resources

Load Test Results:

Virtual Users	RPS	P95 Latency	Error Rate
1,000	500	150ms	0.1%
5,000	2,400	280ms	0.3%
10,000	4,800	550ms	1.2%
15,000	6,200	1.2s	5.8%

Capacity Recommendations:

Scale to 12,000 concurrent users
Add 3 more application servers
Increase database read replicas to 5
Implement rate limiting at 10,000 RPS
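
The results table above can be turned into two simple signals: throughput per 1,000 virtual users, and the highest tested load that still meets a service-level objective. This is a hedged sketch; the rows are copied from the table, but the function names and the SLO thresholds in the usage line are assumptions made for illustration and are not stated in the example.

```javascript
// Rows copied from the load test results table (error rate as a fraction).
const results = [
  { users: 1000, rps: 500, p95Ms: 150, errorRate: 0.001 },
  { users: 5000, rps: 2400, p95Ms: 280, errorRate: 0.003 },
  { users: 10000, rps: 4800, p95Ms: 550, errorRate: 0.012 },
  { users: 15000, rps: 6200, p95Ms: 1200, errorRate: 0.058 },
];

// Throughput per 1,000 virtual users. A falling value means added users no
// longer convert into proportional throughput.
function throughputPerThousandUsers(rows) {
  return rows.map((row) => ({
    users: row.users,
    rpsPerThousand: (row.rps * 1000) / row.users,
  }));
}

// Highest tested load that satisfies both SLO limits, or null if none does.
function highestLoadWithinSlo(rows, { maxP95Ms, maxErrorRate }) {
  const passing = rows.filter(
    (row) => row.p95Ms <= maxP95Ms && row.errorRate <= maxErrorRate
  );
  if (passing.length === 0) return null;
  return passing.reduce((best, row) => (row.users > best.users ? row : best));
}

console.log(throughputPerThousandUsers(results));
// Assumed SLO for illustration: P95 at most 600 ms, errors at most 2%.
console.log(highestLoadWithinSlo(results, { maxP95Ms: 600, maxErrorRate: 0.02 }));
```

Throughput per 1,000 users stays between 480 and 500 RPS up to 10,000 users and falls to about 413 RPS at 15,000, which is consistent with a breaking point somewhere between those two load levels.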

Best Practices

Profiling and Analysis

Measure First: Always profile before optimizing
Comprehensive Coverage: Analyze CPU, memory, I/O, and network
Production Safe: Use low-overhead profiling in production
Regular Baselines: Establish performance baselines for comparison

Load Testing

Realistic Scenarios: Model actual user behavior and workflows
Progressive Ramp-up: Start low, increase gradually
Bottleneck Identification: Find limiting factors systematically
Repeatability: Maintain consistent test environments

Performance Optimization

Algorithm First: Optimize algorithms before micro-optimizations
Caching Strategy: Implement appropriate caching layers
Database Optimization: Indexes, queries, connection pooling
Resource Management: Efficient allocation and pooling

Monitoring and Observability

Comprehensive Metrics: CPU, memory, disk, network, application
Distributed Tracing: End-to-end visibility in microservices
Alerting: Proactive identification of performance degradation
Dashboarding: Real-time visibility into system health

Quality Checklist

Profiling:

Symbols: Debug symbols available for accurate stack traces.
Overhead: Profiler overhead verified (< 1-2% for production).
Scope: Both CPU and Wall-clock time analyzed.
Context: Profile includes full request lifecycle.

Load Testing:

Scenarios: Realistic user behavior (not just hitting one endpoint).
Warmup: System warmed up before measurement (JIT/Caches).
Bottleneck: Identified the limiting factor (CPU, DB, Bandwidth).
Repeatable: Tests can be run consistently.

Optimization:

Validation: Benchmark run after fix to confirm improvement.
Regression: Ensured optimization didn't break functionality.
Documentation: Documented why the optimization was done.
Monitoring: Added metrics to track optimization impact.

Repository

404kidwiz/claud…e-skills

First Seen

Jan 24, 2026

Security Audits

Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Warn

Installed on

opencode: 68
codex: 62
gemini-cli: 62
cursor: 59
claude-code: 59
github-copilot: 55

