npx skills add https://github.com/wondelai/skills --skill system-design
A structured approach to designing large-scale distributed systems. Apply these principles when architecting new services, reviewing system designs, estimating capacity, or preparing for system design discussions.
Start with requirements, not solutions. Every system design begins by clarifying what you are building, for whom, and at what scale. Jumping to architecture before understanding constraints produces over-engineered or under-engineered systems.
The foundation: Scalable systems are not invented from scratch -- they are assembled from well-understood building blocks (load balancers, caches, queues, databases, CDNs) connected by clear data flows. The skill lies in choosing the right blocks, sizing them correctly, and understanding the tradeoffs each choice introduces. A four-step process -- scope, high-level design, deep dive, wrap-up -- keeps the design focused and communicable.
Goal: 10/10. When reviewing or creating system designs, rate them 0-10 based on adherence to the principles below. A 10/10 means the design clearly states requirements, includes back-of-the-envelope estimates, uses appropriate building blocks, addresses scaling and reliability, and acknowledges tradeoffs. Lower scores indicate gaps to address. Always provide the current score and specific improvements needed to reach 10/10.
Six areas for building reliable, scalable distributed systems:
Core concept: Every system design follows four stages: (1) understand the problem and establish design scope, (2) propose a high-level design and get buy-in, (3) dive deep into critical components, (4) wrap up with tradeoffs and future improvements.
Why it works: Without a structured process, designs either stay too abstract or get lost in premature detail. The four-step approach ensures you invest time proportionally -- broad strokes first, depth where it matters.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| New service kickoff | Write a one-page design doc with all four steps before coding | Requirements, API contract, data model, capacity estimate, then implementation |
| Architecture review | Walk reviewers through the four steps sequentially | Present scope, high-level diagram, deep-dive on the riskiest component, open questions |
| Incident postmortem | Trace the failure back through the four-step lens | Which requirement was missed? Which building block failed? What tradeoff bit us? |
See: references/four-step-process.md
Core concept: Use powers of two, latency numbers, and simple arithmetic to estimate QPS, storage, bandwidth, and server count before committing to an architecture.
Why it works: Estimation prevents two failure modes: over-provisioning (wasting money) and under-provisioning (outages under load). A 2-minute calculation can save weeks of rework.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Capacity planning | Estimate QPS then multiply by growth factor | 100M DAU x 5 actions / 86400 = ~5,800 QPS avg, ~30K QPS peak |
| Storage budgeting | Estimate per-record size and multiply by volume and retention | 500M tweets/day x 300 bytes x 365 days = ~55 TB/year |
| SLA definition | Convert availability nines to allowed downtime | Four nines (99.99%) = ~52 minutes downtime per year |
See: references/estimation-numbers.md
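The three rows above reduce to one-line formulas. A minimal sketch in Python, using the illustrative figures from the examples (100M DAU, 500M records/day; not measured values):

```python
# Back-of-the-envelope estimates from the table above, written out as
# plain arithmetic with the example inputs.

SECONDS_PER_DAY = 86_400

def avg_qps(dau: int, actions_per_user_per_day: float) -> float:
    """Average queries per second for a given daily active user count."""
    return dau * actions_per_user_per_day / SECONDS_PER_DAY

def yearly_storage_tb(records_per_day: int, bytes_per_record: int) -> float:
    """Raw storage per year in terabytes, before replication overhead."""
    return records_per_day * bytes_per_record * 365 / 1e12

def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for a given availability target."""
    return (1 - availability) * 365 * 24 * 60

print(round(avg_qps(100_000_000, 5)))               # 5787 average QPS
print(round(yearly_storage_tb(500_000_000, 300)))   # 55 TB/year
print(round(downtime_minutes_per_year(0.9999)))     # 53 minutes/year
```

Peak QPS is then a multiplier on the average (the table uses roughly 5x) rather than a separate calculation.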
Core concept: Scalable systems are assembled from a standard toolkit: DNS, CDN, load balancers, reverse proxies, application servers, caches, message queues, and consistent hashing.
Why it works: Each block solves a specific scaling or reliability problem. Knowing when and why to introduce each block prevents both premature complexity and avoidable bottlenecks.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Read-heavy workload | Add cache-aside with Redis in front of the database | Cache user profiles with TTL; invalidate on write |
| Traffic spikes | Insert a message queue between API and workers | Enqueue image-resize jobs; workers pull at their own pace |
| Global users | Place a CDN in front of static assets | Serve JS/CSS/images from edge; origin only serves API |
| Uneven load | Use consistent hashing for shard assignment | Add a node and only ~1/n keys need to move |
See: references/building-blocks.md
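The consistent-hashing row above can be demonstrated with a small hash ring. This is a sketch, not a production implementation; the node names and virtual-node count are made up:

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    # A stable hash; MD5 is fine here because this is placement, not security.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes: int = 100):
        # Each node appears at many positions ("virtual nodes") to even out load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        # The first ring position clockwise from the key's hash owns the key.
        idx = bisect.bisect(self._keys, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
before = {k: ring.node_for(k) for k in (f"user:{i}" for i in range(10_000))}

# Add a fourth node: only the keys the new node takes over should move.
bigger = HashRing(["cache-a", "cache-b", "cache-c", "cache-d"])
moved = sum(1 for k, n in before.items() if bigger.node_for(k) != n)
print(f"{moved / len(before):.0%} of keys moved")  # roughly 1/4, not 3/4
```

With naive `hash(key) % n` placement, adding a node would remap roughly (n-1)/n of all keys; the ring keeps it near 1/n.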
Core concept: Choose SQL vs NoSQL based on data shape and access patterns, then scale vertically first and move to horizontal scaling (replication and sharding) when vertical limits are reached.
Why it works: The database is usually the first bottleneck. Understanding replication, sharding strategies, and denormalization tradeoffs lets you delay expensive re-architectures and plan growth deliberately.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Read-heavy API | Leader-follower replication with read replicas | Route reads to replicas, writes to leader; accept slight replication lag |
| User data at scale | Hash-based sharding on user_id | Shard key = hash(user_id) % num_shards; even distribution, each shard independent |
| Analytics dashboard | Denormalize into read-optimized materialized views | Pre-join and aggregate nightly; serve dashboards from the materialized table |
| Multi-region app | Multi-leader replication with conflict resolution | Each region has a leader; last-write-wins or application-level merge |
See: references/database-scaling.md
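The hash-based sharding row above amounts to a deterministic routing function. A sketch, assuming placeholder connection strings and SHA-256 as the stable hash (Python's built-in `hash()` is salted per process, so it cannot serve as a shard key):

```python
import hashlib

NUM_SHARDS = 4
# Placeholder connection strings, not real hosts.
SHARDS = [f"postgres://shard-{i}.internal/users" for i in range(NUM_SHARDS)]

def shard_for(user_id: int) -> str:
    """Map a user_id to its shard: shard key = stable_hash(user_id) % num_shards."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    index = int.from_bytes(digest[:8], "big") % NUM_SHARDS
    return SHARDS[index]

# Every caller computes the same shard for the same user, so each shard
# can be scaled, backed up, and failed over independently.
print(shard_for(42) == shard_for(42))  # True: routing is deterministic
```

The tradeoff the section warns about applies here too: changing `NUM_SHARDS` remaps most keys, which is why resharding is planned in advance or done with consistent hashing.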
Core concept: Most systems are variations of a small set of well-known designs: URL shortener, rate limiter, notification system, news feed, chat system, search autocomplete, web crawler, and unique ID generator.
Why it works: Studying common designs builds a mental library of patterns and tradeoffs. When a new problem arrives, you recognize which known design it most resembles and adapt rather than invent from scratch.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Short link service | Base62 encode an auto-increment ID or hash | https://short.ly/a1B2c3 maps to row in key-value store |
| API protection | Token bucket rate limiter at gateway | 100 tokens/min per API key; refill at steady rate; reject with 429 |
| Social feed | Hybrid fanout: push for normal users, pull for celebrities | Pre-compute feeds for accounts with < 10K followers; merge at read time for celebrity posts |
| Distributed IDs | Snowflake: timestamp + datacenter + machine + sequence | 64-bit, time-sortable, no coordination required between generators |
See: references/common-designs.md
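The token-bucket limiter from the API-protection row can be sketched in a few lines. In a real gateway the bucket state would live in a shared store such as Redis; the process-local dict here is only for illustration:

```python
import time

class TokenBucket:
    """100 tokens per key, refilled at a steady 100/minute."""

    def __init__(self, capacity: int = 100, refill_per_sec: float = 100 / 60):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should respond 429, ideally with Retry-After

buckets: dict[str, TokenBucket] = {}

def check(api_key: str) -> bool:
    return buckets.setdefault(api_key, TokenBucket()).allow()

allowed = sum(check("key-1") for _ in range(150))
print(allowed)  # the 100-token burst passes, the remaining calls are rejected
```

The bucket absorbs bursts up to its capacity while enforcing the steady rate over time, which is exactly the behavior the table describes.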
Core concept: A system is only as good as its ability to stay up, recover from failures, and be observed. Health checks, monitoring, logging, and deployment strategies are not afterthoughts -- they are first-class design concerns.
Why it works: Production systems fail in ways that design diagrams never predict. Operational readiness -- metrics, alerts, rollback plans, and redundancy -- determines whether a failure becomes a minor blip or a major outage.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Zero-downtime deploy | Blue-green with health check gates | Route traffic to green after health checks pass; keep blue as instant rollback |
| Gradual rollout | Canary deploy with metric comparison | Send 5% of traffic to new version; compare error rate and latency; promote or rollback |
| Failure detection | Liveness and readiness probes | /healthz returns 200 if alive; /ready returns 200 if database connected and cache warm |
| Data safety | Define RPO/RTO and implement accordingly | RPO = 1 hour means hourly backups; RTO = 5 min means automated failover |
See: references/reliability-operations.md
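The liveness/readiness split from the failure-detection row can be expressed as two small handlers. The endpoint names and dependency flags are illustrative, not tied to any particular framework:

```python
def liveness() -> tuple[int, str]:
    """/healthz: the process is up and able to answer at all.
    A failing liveness probe tells the orchestrator to restart the instance."""
    return 200, "alive"

def readiness(db_connected: bool, cache_warm: bool) -> tuple[int, str]:
    """/ready: the instance may receive traffic only when dependencies are up.
    A failing readiness probe removes the instance from the load balancer
    without restarting it."""
    if db_connected and cache_warm:
        return 200, "ready"
    return 503, "not ready"

print(readiness(db_connected=True, cache_warm=False))  # (503, 'not ready')
```

Keeping the two probes separate matters: restarting an instance because its database is down (a liveness-only setup) just turns a dependency outage into a restart loop.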
| Mistake | Why It Fails | Fix |
|---|---|---|
| Jumping to architecture without clarifying requirements | You solve the wrong problem or miss critical constraints | Spend the first 5-10 minutes on scope: features, scale, SLA |
| No back-of-the-envelope estimation | Over-provision or under-provision by orders of magnitude | Estimate QPS, storage, and bandwidth before choosing components |
| Single point of failure | One component failure takes down the entire system | Add redundancy at every layer: multi-server, multi-AZ, multi-region |
| Premature sharding | Adds enormous operational complexity before it is needed | Scale vertically first, add read replicas, cache aggressively, shard last |
| Caching without invalidation strategy | Stale data causes bugs and user confusion | Define TTL, cache-aside with explicit invalidation on writes |
| Synchronous calls everywhere | One slow downstream service cascades latency to all callers | Use message queues for non-latency-critical paths; set timeouts on sync calls |
| Ignoring the celebrity/hotspot problem | One shard or cache key gets hammered, others idle | Detect hot keys, add secondary partitioning, or use local caches |
| No monitoring or alerting | You find out about failures from users, not dashboards | Instrument metrics, logs, and traces from day one |
| Question | If No | Action |
|---|---|---|
| Are functional and non-functional requirements explicitly listed? | Design is based on assumptions | Write down features, DAU, QPS, storage, latency SLA, availability SLA |
| Do you have a back-of-the-envelope estimate for QPS and storage? | Capacity is a guess | Calculate: DAU x actions / 86400 for QPS; records x size x retention for storage |
| Is every component in the diagram redundant? | Single points of failure exist | Add replicas, failover, or multi-AZ for each component |
| Is the database scaling strategy defined? | You will hit a wall under growth | Plan: vertical first, then read replicas, then sharding with a clear shard key |
| Is there a caching layer for read-heavy paths? | Database takes unnecessary load | Add Redis/Memcached with cache-aside and a defined TTL |
| Are async paths using message queues? | Tight coupling, cascading failures | Decouple with Kafka/SQS for background jobs, notifications, analytics |
| Is there a monitoring and alerting plan? | Blind to failures in production | Define metrics, log aggregation, tracing, and alert thresholds |
| Is the deployment strategy defined? | Risky all-at-once releases | Choose rolling, blue-green, or canary with automated rollback |
This skill is based on Alex Xu's practical system design methodology. For the complete guides with detailed diagrams and walkthroughs:
Alex Xu is a software engineer and the creator of ByteByteGo, one of the most popular platforms for learning system design. His two-volume System Design Interview series has become the de facto preparation resource for engineers at all levels, with over 500,000 copies sold. Xu's approach emphasizes structured thinking, back-of-the-envelope estimation, and clear communication of design decisions. Before ByteByteGo, he worked at Twitter, Apple, and Oracle. His visual explanations and step-by-step frameworks have made system design accessible to a broad engineering audience, transforming what was traditionally an opaque topic into a learnable, repeatable skill.
Weekly Installs
251
Repository
GitHub Stars
260
First Seen
Feb 23, 2026
Security Audits
Gen Agent Trust Hub: Pass; Socket: Pass; Snyk: Pass
Installed on
opencode (242)
codex (242)
gemini-cli (241)
github-copilot (241)
amp (241)
kimi-cli (241)