npx skills add https://github.com/wondelai/skills --skill system-design
A structured approach to designing large-scale distributed systems. Apply these principles when architecting new services, reviewing system designs, estimating capacity, or preparing for system design discussions.
Start with requirements, not solutions. Every system design begins by clarifying what you are building, for whom, and at what scale. Jumping to architecture before understanding constraints produces over-engineered or under-engineered systems.
The foundation: Scalable systems are not invented from scratch -- they are assembled from well-understood building blocks (load balancers, caches, queues, databases, CDNs) connected by clear data flows. The skill lies in choosing the right blocks, sizing them correctly, and understanding the tradeoffs each choice introduces. A four-step process -- scope, high-level design, deep dive, wrap-up -- keeps the design focused and communicable.
Goal: 10/10. When reviewing or creating system designs, rate them 0-10 based on adherence to the principles below. A 10/10 means the design clearly states requirements, includes back-of-the-envelope estimates, uses appropriate building blocks, addresses scaling and reliability, and acknowledges tradeoffs. Lower scores indicate gaps to address. Always provide the current score and specific improvements needed to reach 10/10.
Six areas for building reliable, scalable distributed systems:
Core concept: Every system design follows four stages: (1) understand the problem and establish design scope, (2) propose a high-level design and get buy-in, (3) dive deep into critical components, (4) wrap up with tradeoffs and future improvements.
Why it works: Without a structured process, designs either stay too abstract or get lost in premature detail. The four-step approach ensures you invest time proportionally -- broad strokes first, depth where it matters.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| New service kickoff | Write a one-page design doc with all four steps before coding | Requirements, API contract, data model, capacity estimate, then implementation |
| Architecture review | Walk reviewers through the four steps sequentially | Present scope, high-level diagram, deep-dive on the riskiest component, open questions |
| Incident postmortem | Trace the failure back through the four-step lens | Which requirement was missed? Which building block failed? What tradeoff bit us? |
See: references/four-step-process.md
Core concept: Use powers of two, latency numbers, and simple arithmetic to estimate QPS, storage, bandwidth, and server count before committing to an architecture.
Why it works: Estimation prevents two failure modes: over-provisioning (wasting money) and under-provisioning (outages under load). A 2-minute calculation can save weeks of rework.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Capacity planning | Estimate QPS then multiply by growth factor | 100M DAU x 5 actions / 86400 = ~5,800 QPS avg, ~30K QPS peak |
| Storage budgeting | Estimate per-record size and multiply by volume and retention | 500M tweets/day x 300 bytes x 365 days = ~55 TB/year |
| SLA definition | Convert availability nines to allowed downtime | Four nines (99.99%) = ~52 minutes downtime per year |
See: references/estimation-numbers.md
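The three rows above reduce to one-line formulas. A minimal sketch in Python, using the illustrative figures from the examples (100M DAU, 500M records/day; not measured values):

```python
# Back-of-the-envelope estimates from the table above, written out as
# plain arithmetic with the example inputs.

SECONDS_PER_DAY = 86_400

def avg_qps(dau: int, actions_per_user_per_day: float) -> float:
    """Average queries per second for a given daily active user count."""
    return dau * actions_per_user_per_day / SECONDS_PER_DAY

def yearly_storage_tb(records_per_day: int, bytes_per_record: int) -> float:
    """Raw storage per year in terabytes, before replication overhead."""
    return records_per_day * bytes_per_record * 365 / 1e12

def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for a given availability target."""
    return (1 - availability) * 365 * 24 * 60

print(round(avg_qps(100_000_000, 5)))               # 5787 average QPS
print(round(yearly_storage_tb(500_000_000, 300)))   # 55 TB/year
print(round(downtime_minutes_per_year(0.9999)))     # 53 minutes/year
```

Peak QPS is then a multiplier on the average (the table uses roughly 5x) rather than a separate calculation.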
Core concept: Scalable systems are assembled from a standard toolkit: DNS, CDN, load balancers, reverse proxies, application servers, caches, message queues, and consistent hashing.
Why it works: Each block solves a specific scaling or reliability problem. Knowing when and why to introduce each block prevents both premature complexity and avoidable bottlenecks.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Read-heavy workload | Add cache-aside with Redis in front of the database | Cache user profiles with TTL; invalidate on write |
| Traffic spikes | Insert a message queue between API and workers | Enqueue image-resize jobs; workers pull at their own pace |
| Global users | Place a CDN in front of static assets | Serve JS/CSS/images from edge; origin only serves API |
| Uneven load | Use consistent hashing for shard assignment | Add a node and only ~1/n keys need to move |
See: references/building-blocks.md
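The consistent-hashing row above can be demonstrated with a small hash ring. This is a sketch, not a production implementation; the node names and virtual-node count are made up:

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    # A stable hash; MD5 is fine here because this is placement, not security.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes: int = 100):
        # Each node appears at many positions ("virtual nodes") to even out load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        # The first ring position clockwise from the key's hash owns the key.
        idx = bisect.bisect(self._keys, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
before = {k: ring.node_for(k) for k in (f"user:{i}" for i in range(10_000))}

# Add a fourth node: only the keys the new node takes over should move.
bigger = HashRing(["cache-a", "cache-b", "cache-c", "cache-d"])
moved = sum(1 for k, n in before.items() if bigger.node_for(k) != n)
print(f"{moved / len(before):.0%} of keys moved")  # roughly 1/4, not 3/4
```

With naive `hash(key) % n` placement, adding a node would remap roughly (n-1)/n of all keys; the ring keeps it near 1/n.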
Core concept: Choose SQL vs NoSQL based on data shape and access patterns, then scale vertically first and move to horizontal scaling (replication and sharding) when vertical limits are reached.
Why it works: The database is usually the first bottleneck. Understanding replication, sharding strategies, and denormalization tradeoffs lets you delay expensive re-architectures and plan growth deliberately.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Read-heavy API | Leader-follower replication with read replicas | Route reads to replicas, writes to leader; accept slight replication lag |
| User data at scale | Hash-based sharding on user_id | Shard key = hash(user_id) % num_shards; even distribution, each shard independent |
| Analytics dashboard | Denormalize into read-optimized materialized views | Pre-join and aggregate nightly; serve dashboards from the materialized table |
| Multi-region app | Multi-leader replication with conflict resolution | Each region has a leader; last-write-wins or application-level merge |
See: references/database-scaling.md
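The hash-based sharding row above amounts to a deterministic routing function. A sketch, assuming placeholder connection strings and SHA-256 as the stable hash (Python's built-in `hash()` is salted per process, so it cannot serve as a shard key):

```python
import hashlib

NUM_SHARDS = 4
# Placeholder connection strings, not real hosts.
SHARDS = [f"postgres://shard-{i}.internal/users" for i in range(NUM_SHARDS)]

def shard_for(user_id: int) -> str:
    """Map a user_id to its shard: shard key = stable_hash(user_id) % num_shards."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    index = int.from_bytes(digest[:8], "big") % NUM_SHARDS
    return SHARDS[index]

# Every caller computes the same shard for the same user, so each shard
# can be scaled, backed up, and failed over independently.
print(shard_for(42) == shard_for(42))  # True: routing is deterministic
```

The tradeoff the section warns about applies here too: changing `NUM_SHARDS` remaps most keys, which is why resharding is planned in advance or done with consistent hashing.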
Core concept: Most systems are variations of a small set of well-known designs: URL shortener, rate limiter, notification system, news feed, chat system, search autocomplete, web crawler, and unique ID generator.
Why it works: Studying common designs builds a mental library of patterns and tradeoffs. When a new problem arrives, you recognize which known design it most resembles and adapt rather than invent from scratch.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Short link service | Base62 encode an auto-increment ID or hash | https://short.ly/a1B2c3 maps to row in key-value store |
| API protection | Token bucket rate limiter at gateway | 100 tokens/min per API key; refill at steady rate; reject with 429 |
| Social feed | Hybrid fanout: push for normal users, pull for celebrities | Pre-compute feeds for accounts with < 10K followers; merge at read time for celebrity posts |
| Distributed IDs | Snowflake: timestamp + datacenter + machine + sequence | 64-bit, time-sortable, no coordination required between generators |
See: references/common-designs.md
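The token-bucket limiter from the API-protection row can be sketched in a few lines. In a real gateway the bucket state would live in a shared store such as Redis; the process-local dict here is only for illustration:

```python
import time

class TokenBucket:
    """100 tokens per key, refilled at a steady 100/minute."""

    def __init__(self, capacity: int = 100, refill_per_sec: float = 100 / 60):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should respond 429, ideally with Retry-After

buckets: dict[str, TokenBucket] = {}

def check(api_key: str) -> bool:
    return buckets.setdefault(api_key, TokenBucket()).allow()

allowed = sum(check("key-1") for _ in range(150))
print(allowed)  # the 100-token burst passes, the remaining calls are rejected
```

The bucket absorbs bursts up to its capacity while enforcing the steady rate over time, which is exactly the behavior the table describes.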
Core concept: A system is only as good as its ability to stay up, recover from failures, and be observed. Health checks, monitoring, logging, and deployment strategies are not afterthoughts -- they are first-class design concerns.
Why it works: Production systems fail in ways that design diagrams never predict. Operational readiness -- metrics, alerts, rollback plans, and redundancy -- determines whether a failure becomes a minor blip or a major outage.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Zero-downtime deploy | Blue-green with health check gates | Route traffic to green after health checks pass; keep blue as instant rollback |
| Gradual rollout | Canary deploy with metric comparison | Send 5% of traffic to new version; compare error rate and latency; promote or rollback |
| Failure detection | Liveness and readiness probes | /healthz returns 200 if alive; /ready returns 200 if database connected and cache warm |
| Data safety | Define RPO/RTO and implement accordingly | RPO = 1 hour means hourly backups; RTO = 5 min means automated failover |
See: references/reliability-operations.md
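The liveness/readiness split from the failure-detection row can be expressed as two small handlers. The endpoint names and dependency flags are illustrative, not tied to any particular framework:

```python
def liveness() -> tuple[int, str]:
    """/healthz: the process is up and able to answer at all.
    A failing liveness probe tells the orchestrator to restart the instance."""
    return 200, "alive"

def readiness(db_connected: bool, cache_warm: bool) -> tuple[int, str]:
    """/ready: the instance may receive traffic only when dependencies are up.
    A failing readiness probe removes the instance from the load balancer
    without restarting it."""
    if db_connected and cache_warm:
        return 200, "ready"
    return 503, "not ready"

print(readiness(db_connected=True, cache_warm=False))  # (503, 'not ready')
```

Keeping the two probes separate matters: restarting an instance because its database is down (a liveness-only setup) just turns a dependency outage into a restart loop.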
| Mistake | Why It Fails | Fix |
|---|---|---|
| Jumping to architecture without clarifying requirements | You solve the wrong problem or miss critical constraints | Spend the first 5-10 minutes on scope: features, scale, SLA |
| No back-of-the-envelope estimation | Over-provision or under-provision by orders of magnitude | Estimate QPS, storage, and bandwidth before choosing components |
| Single point of failure | One component failure takes down the entire system | Add redundancy at every layer: multi-server, multi-AZ, multi-region |
| Premature sharding | Adds enormous operational complexity before it is needed | Scale vertically first, add read replicas, cache aggressively, shard last |
| Caching without invalidation strategy | Stale data causes bugs and user confusion | Define TTL, cache-aside with explicit invalidation on writes |
| Synchronous calls everywhere | One slow downstream service cascades latency to all callers | Use message queues for non-latency-critical paths; set timeouts on sync calls |
| Ignoring the celebrity/hotspot problem | One shard or cache key gets hammered, others idle | Detect hot keys, add secondary partitioning, or use local caches |
| No monitoring or alerting | You find out about failures from users, not dashboards | Instrument metrics, logs, and traces from day one |
| Question | If No | Action |
|---|---|---|
| Are functional and non-functional requirements explicitly listed? | Design is based on assumptions | Write down features, DAU, QPS, storage, latency SLA, availability SLA |
| Do you have a back-of-the-envelope estimate for QPS and storage? | Capacity is a guess | Calculate: DAU x actions / 86400 for QPS; records x size x retention for storage |
| Is every component in the diagram redundant? | Single points of failure exist | Add replicas, failover, or multi-AZ for each component |
| Is the database scaling strategy defined? | You will hit a wall under growth | Plan: vertical first, then read replicas, then sharding with a clear shard key |
| Is there a caching layer for read-heavy paths? | Database takes unnecessary load | Add Redis/Memcached with cache-aside and a defined TTL |
| Are async paths using message queues? | Tight coupling, cascading failures | Decouple with Kafka/SQS for background jobs, notifications, analytics |
| Is there a monitoring and alerting plan? | Blind to failures in production | Define metrics, log aggregation, tracing, and alert thresholds |
| Is the deployment strategy defined? | Risky all-at-once releases | Choose rolling, blue-green, or canary with automated rollback |
This skill is based on Alex Xu's practical system design methodology. For the complete guides with detailed diagrams and walkthroughs:
Alex Xu is a software engineer and the creator of ByteByteGo, one of the most popular platforms for learning system design. His two-volume System Design Interview series has become the de facto preparation resource for engineers at all levels, with over 500,000 copies sold. Xu's approach emphasizes structured thinking, back-of-the-envelope estimation, and clear communication of design decisions. Before ByteByteGo, he worked at Twitter, Apple, and Oracle. His visual explanations and step-by-step frameworks have made system design accessible to a broad engineering audience, transforming what was traditionally an opaque topic into a learnable, repeatable skill.
Weekly Installs
251
Repository
GitHub Stars
260
First Seen
Feb 23, 2026
Security Audits
Gen Agent Trust Hub: Pass; Socket: Pass; Snyk: Pass
Installed on
opencode (242)
codex (242)
gemini-cli (241)
github-copilot (241)
amp (241)
kimi-cli (241)