数据密集型应用设计框架：DDIA系统核心原则与架构指南

ddia-systems by wondelai/skills

201 周安装量

255 GitHub Stars

GitHub

安装命令

npx skills add https://github.com/wondelai/skills --skill ddia-systems

数据库系统架构分布式系统

🇨🇳中文介绍

设计数据密集型应用框架

构建可靠、可扩展、可维护的数据系统的原则性方法。在选择数据库、设计模式、架构分布式系统或推理一致性和容错性时，应用这些原则。

核心原则

数据比代码更持久。 应用程序被重写，语言发生变化，框架来来去去——但数据及其结构会持续数十年。每个架构决策都必须优先考虑数据层的长期正确性、持久性和可演进性，高于一切。

基础： 大多数应用程序是数据密集型的，而不是计算密集型的。困难的问题在于数据的数量、复杂性及其变化速度。理解一致性、可用性、分区容错性、延迟和吞吐量之间的权衡，是区分健壮系统与脆弱系统的关键。

评分

目标：10/10。 在评审或设计数据架构时，根据对以下原则的遵循程度，将其评为 0-10 分。10/10 意味着对数据模型、存储引擎、复制、分区、事务和处理流水线做出了深思熟虑的权衡选择；较低的分数表明存在意外复杂性或忽略了故障模式。始终提供当前分数以及达到 10/10 所需的具体改进措施。

DDIA 框架

推理数据密集型系统的七个领域：

1. 数据模型和查询语言

核心概念： 数据模型塑造了你对问题的思考方式。关系模型、文档模型和图模型各自施加不同的约束，并支持不同的查询模式。

为何有效： 选择错误的数据模型会迫使应用程序代码去弥补表示不匹配，从而增加意外复杂性，并随着时间的推移而加剧。

关键见解：

关系模型擅长处理多对多关系和即席查询
文档模型擅长处理一对多关系和数据局部性
图模型擅长处理高度互连的数据和递归遍历
写时模式（关系型）能及早发现错误；读时模式（文档型）提供灵活性
多语言持久化——针对不同的访问模式使用不同的存储——通常是正确的答案
对象和关系之间的阻抗不匹配是真实存在的成本；文档模型为自包含的聚合减少了这种不匹配

代码应用：

上下文	模式	示例

广告位招租

在这里展示您的产品或服务

触达数万 AI 开发者，精准高效

联系我们

上下文	模式	示例
高写入吞吐量	LSM 树引擎	Cassandra 或 RocksDB，用于以每秒 10 万次以上的写入速率进行时间序列数据摄取
混合读写 OLTP	B 树引擎	PostgreSQL B 树索引，用于具有点查找的事务性工作负载
大型数据集的分析查询	面向列的存储	ClickHouse 或 Parquet 文件，用于扫描数十亿行但列数较少的数据
低延迟缓存	内存存储	Redis 用于亚毫秒级查找；Memcached 用于简单的键值缓存

上下文	模式	示例
读密集型 Web 应用	单主复制配合读取副本	PostgreSQL 主库 + 通过 pgBouncer 后的读取副本，用于读取扩展
多区域写入	多主复制	CockroachDB 或 Spanner，用于具有有限过时性的地理分布式写入
购物车可用性	无主复制配合合并	DynamoDB 使用最后写入者胜出或应用级合并来处理购物车冲突
协同编辑	使用 CRDT 进行无冲突合并	Yjs 或 Automerge，用于实时协同文档编辑

上下文	模式	示例
时间序列数据	按时间 + 源进行键范围分区	按 `(sensor_id, date)` 分区，以避免当前日期出现写入热点
大规模用户数据	按用户 ID 进行哈希分区	Cassandra 在 `user_id` 上使用一致性哈希以实现均匀分布
全局搜索索引	全局二级索引	Elasticsearch 索引独立于主数据存储进行分片
名人/热点键问题	带随机后缀的键分裂	为热点分区键附加随机数字，将读取分散到 10 个子分区

5. 事务和一致性

核心概念： 事务提供安全性保证（ACID），通过让你假装在事务范围内不存在故障和并发，从而简化应用程序代码。

为何有效： 没有事务，每一段应用程序代码都必须处理部分故障、竞争条件和并发修改。事务将这种复杂性转移到数据库中，在那里可以一次性正确处理。

隔离级别是一个谱系：读未提交、读已提交、快照隔离（可重复读）、可序列化
大多数数据库默认使用读已提交或快照隔离——而不是可序列化——应用程序开发人员必须理解这允许的异常情况
写偏斜发生在两个事务读取相同数据，基于此做出决策，并写入不同记录时——行级锁无法防止这种情况
可序列化快照隔离（SSI）通过乐观并发提供完全的可序列化——无阻塞，但在冲突时中止
两阶段锁定提供可序列化，但在高并发下会导致争用和死锁
分布式事务（两阶段提交）昂贵且脆弱；尽可能通过围绕单分区操作进行设计来避免它们

上下文	模式	示例
账户余额转账	可序列化事务	`BEGIN; UPDATE accounts SET balance = balance - 100 WHERE id = 1; UPDATE accounts SET balance = balance + 100 WHERE id = 2; COMMIT;`
库存预留	使用 SELECT FOR UPDATE 防止写偏斜	在递减前执行 `SELECT stock FROM items WHERE id = X FOR UPDATE`
读密集型仪表板	用于一致读取的快照隔离	PostgreSQL MVCC 提供时间点快照，而不会阻塞写入者
跨服务操作	使用 Saga 模式替代分布式事务	补偿事务：扣款、预留库存，失败时退款

6. 批处理和流处理

核心概念： 批处理批量转换有界数据集；流处理连续转换无界事件流。两者都是派生数据计算的形式。

为何有效： 将记录系统（事实来源）与派生数据（缓存、索引、物化视图）分离，允许各自独立优化，并在需求变化时从源重建。

MapReduce 概念简单但操作不便；数据流引擎（Spark、Flink）通过任意 DAG 对其进行了泛化
事件溯源将每个状态变化存储为不可变事件，从而实现完整的审计跟踪和时间查询
变更数据捕获（CDC）将数据库写入转换为下游系统可以消费的流
流表二元性：流是表的变更日志；表是流的物化状态
流处理中的恰好一次语义需要幂等操作或事务性输出
时间窗口（滚动、跳跃、会话）对于聚合无界流至关重要

上下文	模式	示例
每日分析流水线	使用 Spark 进行批处理	从 S3 读取当日事件，聚合指标，写入数据仓库
实时欺诈检测	使用 Flink 进行流处理	从 Kafka 消费支付事件，在 5 秒滚动窗口内应用规则
同步搜索索引	变更数据捕获	Debezium 捕获 PostgreSQL WAL 变更，发布到 Kafka，Elasticsearch 消费者更新索引
审计跟踪 / 事件重放	事件溯源	存储 `OrderPlaced`、`OrderShipped`、`OrderRefunded` 事件；通过重放重建当前状态

7. 可靠性和容错性

核心概念： 故障是不可避免的；失效不是。可靠的系统即使在单个组件发生故障时也能继续正确运行。为故障而设计，而不是对抗故障。

为何有效： 硬件会故障，软件有缺陷，人会犯错。假设完美运行的系统是脆弱的。预期并优雅处理故障的系统是有韧性的。

故障是单个组件偏离规范；失效是整个系统停止。容错防止故障演变为失效
硬件故障是随机且独立的；软件故障是相关且系统性的（更危险）
人为错误是导致停机的主要原因——设计系统以最小化出错机会并最大化恢复能力
超时是分布式系统中的基本故障检测器——但选择合适的超时很难（太短导致误报，太长延迟恢复）
安全性属性（坏事不会发生）必须始终成立；活性属性（好事最终会发生）可能暂时被违反
拜占庭容错在区块链之外很少需要——大多数系统假设非拜占庭（崩溃-停止或崩溃-恢复）模型

上下文	模式	示例
服务通信	超时 + 带指数退避的重试	`retry(max=3, backoff=exponential(base=1s, max=30s))` 并添加抖动
领导者选举	共识算法（Raft/Paxos）	etcd 或 ZooKeeper 用于分布式锁和领导者选举
数据流水线可靠性	幂等操作 + 检查点	Kafka 消费者仅在成功处理后提交偏移量
优雅降级	断路器模式	Hystrix/Resilience4j：在 10 秒窗口内出现 50% 故障后打开断路器

错误	为何失败	修复方法
根据流行度选择数据库	不同的引擎有根本不同的权衡	将存储引擎特性与你实际的读写模式相匹配
忽略复制延迟	用户看到过时数据、幻读或丢失更新	实现读己之写一致性；使用单调读保证
到处使用分布式事务	两阶段提交缓慢且脆弱；协调者是单点故障	为单分区操作设计；使用 Saga 模式进行跨服务协调
对所有数据进行哈希分区	破坏了范围查询能力；某些工作负载需要排序访问	对时间序列使用键范围分区；对局部性使用复合键
假设可序列化隔离	大多数数据库默认使用较弱的隔离级别；写偏斜错误在生产中出现	检查数据库的实际默认隔离级别；在需要的地方使用显式锁定
混淆批处理和流处理	在流数据上使用批处理工具会增加延迟；在有界数据上使用流处理工具浪费复杂性	使处理模型与数据有界性和延迟要求相匹配
将所有故障视为可恢复的	某些故障（数据损坏、拜占庭）需要根本不同的处理方式	对故障进行分类，并为每类设计特定的恢复策略

问题	如果答案为否	行动
你能解释为什么选择这个数据库而不是其他替代品吗？	决策基于熟悉度，而非需求	评估数据模型匹配度、读写比例、一致性需求和扩展路径
你知道数据库的默认隔离级别吗？	你可能存在尚未发现的并发错误	查阅文档；测试写偏斜和幻读场景
你的复制策略是明确选择的（不是默认的）吗？	你对一致性和持久性有隐含假设	记录权衡：同步 vs 异步、故障转移行为、延迟容忍度
你的系统能处理热点分区键吗？	单个热门实体可能导致集群宕机	为热点键添加键分裂策略或应用级负载分流
你是否将记录系统与派生数据分离？	模式变更或新功能需要迁移所有内容	引入 CDC 或事件溯源，将源与派生存储解耦
你的超时和重试是调优过的，而不是默认的吗？	你会遇到级联故障或不必要的延迟	测量 p99 延迟；设置超时高于 p99 但低于级联阈值
你在生产条件下测试过故障转移吗？	你的恢复计划是理论上的，未经验证	运行混沌工程实验：杀死领导者、分区网络、填满磁盘

data-models.md：关系模型 vs 文档模型 vs 图模型，读时模式 vs 写时模式，查询语言，多语言持久化
storage-engines.md：LSM 树 vs B 树，写入放大，压缩，面向列的存储，内存数据库
replication.md：单主复制、多主复制、无主复制，复制延迟，冲突解决，CRDT
partitioning.md：键范围 vs 哈希分区，二级索引，再平衡，请求路由，热点
transactions.md：ACID，隔离级别，写偏斜，两阶段锁定，SSI，分布式事务
batch-stream.md：MapReduce，数据流引擎，事件溯源，CDC，流表二元性，恰好一次语义
fault-tolerance.md：故障 vs 失效，可靠性指标，超时，共识，安全性和活性保证

此技能基于 Martin Kleppmann 关于数据系统原理和实践的全面指南。如需包含详细图表和研究参考文献的完整论述：

Martin Kleppmann 是分布式系统领域的研究员，曾任 LinkedIn 和 Rapportive 的软件工程师。他是剑桥大学的高级副研究员，在 CRDT、拜占庭容错和本地优先软件方面有深入研究。Designing Data-Intensive Applications（2017）已成为工程师构建数据系统的权威参考书，因其通过清晰的解释和实际例子使复杂的分布式系统概念易于理解而广受赞誉。Kleppmann 的研究专注于数据一致性、去中心化协作以及确保分布式系统的正确性。他还以其会议演讲和教育性写作而闻名，这些工作弥合了学术研究与工业实践之间的鸿沟。

2026 年 2 月 23 日

🇺🇸English

Designing Data-Intensive Applications Framework

A principled approach to building reliable, scalable, and maintainable data systems. Apply these principles when choosing databases, designing schemas, architecting distributed systems, or reasoning about consistency and fault tolerance.

Core Principle

Data outlives code. Applications are rewritten, languages change, frameworks come and go -- but data and its structure persist for decades. Every architectural decision must prioritize the long-term correctness, durability, and evolvability of the data layer above all else.

The foundation: Most applications are data-intensive, not compute-intensive. The hard problems are the amount of data, its complexity, and the speed at which it changes. Understanding the trade-offs between consistency, availability, partition tolerance, latency, and throughput is what separates robust systems from fragile ones.

Scoring

Goal: 10/10. When reviewing or designing data architectures, rate them 0-10 based on adherence to the principles below. A 10/10 means deliberate trade-off choices for data models, storage engines, replication, partitioning, transactions, and processing pipelines; lower scores indicate accidental complexity or ignored failure modes. Always provide the current score and specific improvements needed to reach 10/10.

The DDIA Framework

Seven domains for reasoning about data-intensive systems:

1. Data Models and Query Languages

Core concept: The data model shapes how you think about the problem. Relational, document, and graph models each impose different constraints and enable different query patterns.

Why it works: Choosing the wrong data model forces application code to compensate for representational mismatch, adding accidental complexity that compounds over time.

Key insights:

Relational models excel at many-to-many relationships and ad-hoc queries
Document models excel at one-to-many relationships and data locality
Graph models excel at highly interconnected data with recursive traversals
Schema-on-write (relational) catches errors early; schema-on-read (document) offers flexibility
Polyglot persistence -- use different stores for different access patterns -- is often the right answer
Impedance mismatch between objects and relations is a real cost; document models reduce it for self-contained aggregates

Code applications:

Context	Pattern	Example
User profiles with nested data	Document model for self-contained aggregates	Store profile, addresses, and preferences in one MongoDB document
Social network connections	Graph model for relationship traversal	Neo4j Cypher query: `MATCH (a)-[:FOLLOWS*2]->(b)` for friend-of-friend
Financial ledger with joins	Relational model for referential integrity	PostgreSQL with foreign keys between accounts, transactions, and entries
Mixed access patterns	Polyglot persistence	PostgreSQL for transactions + Elasticsearch for full-text search + Redis for caching

See: references/data-models.md

2. Storage Engines

Core concept: Storage engines make a fundamental trade-off between read performance and write performance. Log-structured engines (LSM trees) optimize writes; page-oriented engines (B-trees) balance reads and writes.

Why it works: Understanding the internals of your database's storage engine lets you predict performance characteristics, choose appropriate indexes, and avoid pathological workloads.

Key insights:

LSM trees: append-only writes, periodic compaction, excellent write throughput, higher read amplification
B-trees: in-place updates, predictable read latency, write amplification from page splits
Write amplification means one logical write causes multiple physical writes -- critical for SSDs with limited write cycles
Column-oriented storage dramatically improves analytical query performance through compression and vectorized processing
In-memory databases are fast not because they avoid disk, but because they avoid encoding overhead

Code applications:

Context	Pattern	Example
High write throughput	LSM-tree engine	Cassandra or RocksDB for time-series ingestion at 100K+ writes/sec
Mixed read/write OLTP	B-tree engine	PostgreSQL B-tree indexes for transactional workloads with point lookups
Analytical queries on large datasets	Column-oriented storage	ClickHouse or Parquet files for scanning billions of rows with few columns
Low-latency caching	In-memory store	Redis for sub-millisecond lookups; Memcached for simple key-value caching

See: references/storage-engines.md

3. Replication

Core concept: Replication keeps copies of data on multiple machines for fault tolerance, scalability, and latency reduction. The core challenge is handling changes to replicated data consistently.

Why it works: Every replication strategy trades off between consistency, availability, and latency. Making this trade-off explicit prevents subtle data anomalies that surface only under load or failure.

Key insights:

Single-leader replication: simple, strong consistency possible, but the leader is a bottleneck and single point of failure
Multi-leader replication: better write availability across data centers, but conflict resolution is complex
Leaderless replication: highest availability, uses quorum reads/writes, but requires careful conflict handling
Replication lag causes read-your-writes violations, monotonic read violations, and causality violations
Synchronous replication guarantees durability but increases latency; asynchronous replication risks data loss on leader failure
CRDTs and last-writer-wins are conflict resolution strategies with very different correctness guarantees

Code applications:

Context	Pattern	Example
Read-heavy web app	Single-leader with read replicas	PostgreSQL primary + read replicas behind pgBouncer for read scaling
Multi-region writes	Multi-leader replication	CockroachDB or Spanner for geo-distributed writes with bounded staleness
Shopping cart availability	Leaderless with merge	DynamoDB with last-writer-wins or application-level merge for cart conflicts
Collaborative editing	CRDTs for conflict-free merging	Yjs or Automerge for real-time collaborative document editing

See: references/replication.md

4. Partitioning

Core concept: Partitioning (sharding) distributes data across multiple nodes so that each node handles a subset of the total data, enabling horizontal scaling beyond a single machine.

Why it works: Without partitioning, a single node becomes the bottleneck for storage capacity and throughput. Effective partitioning distributes load evenly and avoids hotspots.

Key insights:

Key-range partitioning supports efficient range scans but risks hotspots on sequential keys
Hash partitioning distributes load evenly but destroys sort order and makes range queries expensive
Secondary indexes can be partitioned locally (each partition has its own index) or globally (index partitioned separately)
Local secondary indexes require scatter-gather queries; global secondary indexes require cross-partition updates
Hotspots can occur even with hash partitioning if a single key is extremely popular (celebrity problem)
Rebalancing strategies: fixed number of partitions, dynamic splitting, or proportional to node count

Code applications:

Context	Pattern	Example
Time-series data	Key-range partitioning by time + source	Partition by `(sensor_id, date)` to avoid write hotspot on current day
User data at scale	Hash partitioning on user ID	Cassandra consistent hashing on `user_id` for even distribution
Global search index	Global secondary index	Elasticsearch index sharded independently from primary data store
Celebrity/hot-key problem	Key splitting with random suffix	Append random digit to hot partition key, fan-out reads across 10 sub-partitions

See: references/partitioning.md

5. Transactions and Consistency

Core concept: Transactions provide safety guarantees (ACID) that simplify application code by letting you pretend failures and concurrency don't exist -- within the transaction's scope.

Why it works: Without transactions, every piece of application code must handle partial failures, race conditions, and concurrent modifications. Transactions move this complexity into the database where it can be handled correctly once.

Key insights:

Isolation levels are a spectrum: read uncommitted, read committed, snapshot isolation (repeatable read), serializable
Most databases default to read committed or snapshot isolation -- not serializable -- and application developers must understand the anomalies this permits
Write skew occurs when two transactions read the same data, make decisions based on it, and write different records -- no row-level lock prevents this
Serializable snapshot isolation (SSI) provides full serializability with optimistic concurrency -- no blocking, but aborts on conflict
Two-phase locking provides serializability but causes contention and deadlocks under high concurrency
Distributed transactions (two-phase commit) are expensive and fragile; avoid them when possible by designing around single-partition operations

Code applications:

Context	Pattern	Example
Account balance transfer	Serializable transaction	`BEGIN; UPDATE accounts SET balance = balance - 100 WHERE id = 1; UPDATE accounts SET balance = balance + 100 WHERE id = 2; COMMIT;`
Inventory reservation	SELECT FOR UPDATE to prevent write skew	`SELECT stock FROM items WHERE id = X FOR UPDATE` before decrementing
Read-heavy dashboards	Snapshot isolation for consistent reads	PostgreSQL MVCC provides point-in-time snapshot without blocking writers
Cross-service operations	Saga pattern instead of distributed transactions	Compensating transactions: charge card, reserve inventory, on failure refund card

See: references/transactions.md

6. Batch and Stream Processing

Core concept: Batch processing transforms bounded datasets in bulk; stream processing transforms unbounded event streams continuously. Both are forms of derived data computation.

Why it works: Separating the system of record (source of truth) from derived data (caches, indexes, materialized views) allows each to be optimized independently and rebuilt from the source when requirements change.

Key insights:

MapReduce is conceptually simple but operationally awkward; dataflow engines (Spark, Flink) generalize it with arbitrary DAGs
Event sourcing stores every state change as an immutable event, enabling full audit trails and temporal queries
Change data capture (CDC) turns database writes into a stream that downstream systems can consume
Stream-table duality: a stream is the changelog of a table; a table is the materialized state of a stream
Exactly-once semantics in stream processing require idempotent operations or transactional output
Time windowing (tumbling, hopping, session) is essential for aggregating unbounded streams

Code applications:

Context	Pattern	Example
Daily analytics pipeline	Batch processing with Spark	Read day's events from S3, aggregate metrics, write to data warehouse
Real-time fraud detection	Stream processing with Flink	Consume payment events from Kafka, apply rules within 5-second tumbling windows
Syncing search index	Change data capture	Debezium captures PostgreSQL WAL changes, publishes to Kafka, Elasticsearch consumer updates index
Audit trail / event replay	Event sourcing	Store `OrderPlaced`, `OrderShipped`, `OrderRefunded` events; rebuild current state by replaying

See: references/batch-stream.md

7. Reliability and Fault Tolerance

Core concept: Faults are inevitable; failures are not. A reliable system continues operating correctly even when individual components fail. Design for faults, not against them.

Why it works: Hardware fails, software has bugs, humans make mistakes. Systems that assume perfect operation are brittle. Systems that expect and handle faults gracefully are resilient.

Key insights:

A fault is one component deviating from spec; a failure is the system as a whole stopping. Fault tolerance prevents faults from becoming failures
Hardware faults are random and independent; software faults are correlated and systematic (more dangerous)
Human error is the leading cause of outages -- design systems that minimize opportunity for mistakes and maximize ability to recover
Timeouts are the fundamental fault detector in distributed systems -- but choosing the right timeout is hard (too short causes false positives, too long delays recovery)
Safety properties (nothing bad happens) must always hold; liveness properties (something good eventually happens) may be temporarily violated
Byzantine fault tolerance is rarely needed outside blockchain -- most systems assume non-Byzantine (crash-stop or crash-recovery) models

Code applications:

Context	Pattern	Example
Service communication	Timeouts + retries with exponential backoff	`retry(max=3, backoff=exponential(base=1s, max=30s))` with jitter
Leader election	Consensus algorithm (Raft/Paxos)	etcd or ZooKeeper for distributed lock and leader election
Data pipeline reliability	Idempotent operations + checkpointing	Kafka consumer commits offset only after successful processing
Graceful degradation	Circuit breaker pattern	Hystrix/Resilience4j: open circuit after 50% failures in 10-second window

See: references/fault-tolerance.md

Common Mistakes

Mistake	Why It Fails	Fix
Choosing a database based on popularity	Different engines have fundamentally different trade-offs	Match storage engine characteristics to your actual read/write patterns
Ignoring replication lag	Users see stale data, phantom reads, or lost updates	Implement read-your-writes consistency; use monotonic read guarantees
Using distributed transactions everywhere	Two-phase commit is slow and fragile; coordinator is a single point of failure	Design for single-partition operations; use sagas for cross-service coordination
Hash partitioning everything	Destroys range query ability; some workloads need sorted access	Use key-range partitioning for time-series; composite keys for locality
Assuming serializable isolation	Most databases default to weaker isolation; write skew bugs appear in production	Check your database's actual default isolation level; use explicit locking where needed
Conflating batch and stream	Batch tools on streaming data add latency; stream tools on bounded data waste complexity

Quick Diagnostic

Question	If No	Action
Can you explain why you chose this database over alternatives?	Decision was based on familiarity, not requirements	Evaluate data model fit, read/write ratio, consistency needs, and scaling path
Do you know your database's default isolation level?	You may have concurrency bugs you haven't found yet	Check documentation; test for write skew and phantom read scenarios
Is your replication strategy explicitly chosen (not defaulted)?	You have implicit assumptions about consistency and durability	Document trade-offs: sync vs async, failover behavior, lag tolerance
Can your system handle a hot partition key?	A single popular entity can bring down the cluster	Add key-splitting strategy or application-level load shedding for hot keys
Do you separate your system of record from derived data?	Schema changes or new features require migrating everything	Introduce CDC or event sourcing to decouple source from derived stores
Are your timeouts and retries tuned, not defaulted?	You get cascading failures or unnecessary delays	Measure p99 latency; set timeouts above p99 but below cascade threshold
Have you tested failover in production conditions?

Reference Files

data-models.md: Relational vs document vs graph models, schema-on-read vs schema-on-write, query languages, polyglot persistence
storage-engines.md: LSM trees vs B-trees, write amplification, compaction, column-oriented storage, in-memory databases
replication.md: Single-leader, multi-leader, leaderless replication, replication lag, conflict resolution, CRDTs
partitioning.md: Key-range vs hash partitioning, secondary indexes, rebalancing, request routing, hotspots
transactions.md: ACID, isolation levels, write skew, two-phase locking, SSI, distributed transactions
batch-stream.md: MapReduce, dataflow engines, event sourcing, CDC, stream-table duality, exactly-once semantics
fault-tolerance.md: Faults vs failures, reliability metrics, timeouts, consensus, safety and liveness guarantees

About the Author

Martin Kleppmann is a researcher in distributed systems and a former software engineer at LinkedIn and Rapportive. He is a Senior Research Associate at the University of Cambridge and has worked extensively on CRDTs, Byzantine fault tolerance, and local-first software. Designing Data-Intensive Applications (2017) has become the definitive reference for engineers building data systems, praised for making complex distributed systems concepts accessible through clear explanations and practical examples. Kleppmann's research focuses on data consistency, decentralized collaboration, and ensuring correctness in distributed systems. He is also known for his conference talks and educational writing that bridge the gap between academic research and industrial practice.

Weekly Installs

201

Repository

wondelai/skills

GitHub Stars

255

First Seen

Feb 23, 2026

Security Audits

Gen Agent Trust HubPass SocketPass SnykPass

Installed on

codex192

gemini-cli191

kimi-cli191

cursor191

opencode191

github-copilot191

Flutter状态管理教程：MVVM与Provider实现单向数据流和架构模式

1,100 周安装

数据密集型应用设计框架：DDIA系统核心原则与架构指南

🇨🇳中文介绍

设计数据密集型应用框架

核心原则

评分

DDIA 框架

1. 数据模型和查询语言

相关 Skills

2. 存储引擎

3. 复制

4. 分区

5. 事务和一致性

6. 批处理和流处理

7. 可靠性和容错性

常见错误

快速诊断

参考文件

延伸阅读

关于作者

🇺🇸English

Designing Data-Intensive Applications Framework

Core Principle

Scoring

The DDIA Framework

1. Data Models and Query Languages

2. Storage Engines

3. Replication

4. Partitioning

5. Transactions and Consistency

6. Batch and Stream Processing

7. Reliability and Fault Tolerance

Common Mistakes

Quick Diagnostic

Reference Files

Further Reading

About the Author

最新 Skills