Book Review: Designing Data-Intensive Applications

Notes on DDIA – Kleppmann

543 pages

3 parts, 12 chapters

P1 – Foundations of Data Systems

1 – RSM

database, cache, search index, stream processing, batch processing

Reliability

fault tolerant / resilient

Scalability

load parameters
response time – p95 p99 p999
tail latencies
SLO (service level objective) SLA (service level agreement)
scaling up (vertical – more powerful machine 🤖)
scaling out (horizontal)
elastic systems

Maintainability

majority of the cost of software is maintenance – 🐞 bugs, looking into failures, adapting it, paying tech debt, adding new features

Operability

Simplicity – big ball of mud

symptoms – explosion of state space, tight coupling of modules, tangled dependencies, inconsistent naming, hacks etc.
accidental complexity (Mosely and Marks, 2006) link
abstraction 👍

Evolvability / Plasticity

High-Level Takeaway – RSM 😎

2 – Data Models, Query Languages

SQL (based on Codd 1970)

data stored in relations (tables), where each relation is an unordered collection of tuples (rows)
query optimizer 🌈

NoSQL (2010s) 🌲

impedance mismatch (between application code + storage layer)
CODASYL model
better for document-like structure (tree of one-to-many relationships)
declarative vs imperative languages

MapReduce

map (aka collect) 🗾
reduce (aka fold/inject) 💉

Graph-like Data Models 〽️

graph consists of vertices/nodes and edges/relationships

Triple Store

SPO (subject predicate/verb object) (jim, likes, bananas)
RDF
SPARQL
Datalog
unusual data models – GenBank (genome), particle physics (LHC), information retrieval/full text search

High-Level Takeaway 😎

3 types – relational, document (NoSQL), graph (NoSQL)

3 – Internals of Storage Systems

optimize for transactional workloads, analytics
log-structured and page-oriented storage engines (B-trees 🌲)
compaction ♻️
SSTable (sorted string table)
red-black trees, AVL trees
memtable
Google’s BigTable paper (2006) link
Log-Structured Merge-Tree (LSM-Tree)
Bloom filter 🌹

B-tree (most widely used indexing structure) 🌲

depth of O(log n)
write-ahead log (WAL)
fuzzy search
Memcached (in-memory KV stores)
OLAP (online analytic processing)
The process of getting data into a data warehouse is called Extract-Transform-Load (ETL)
Google Dremel (2010) link
bitmap encoding

High-Level Takeaway 😎

OLTP system – user facing
OLAP system – BA facing
log-structured school (append only) e.g. Cassandra, LSM tree
update-in place – B-trees

4 – Serialization

encoding/serialization/marshalling
decoding/parsing/deserialization/unmarshalling
JSON, XML, CSV
Apache Thrift (Facebook) / Protocol Buffers (Google) (protobuf) – binary encoding libraries
protobuf does not have list/array datatype but has repeated marker
Apache Avro
data flows 🌊
data outlives code
database snapshots 📸
SOA – service-oriented architecture aka microservices
2 popular approaches to web services: REST 😌 and SOAP 🧼
OpenAPI Specification (OAS)
RPC (remote procedure call) – tries to make a request look like calling a function/method
message broker aka message queue
actor model 👨‍🎤

High-Level Takeaway

rolling upgrades – evolvability
textual formats (e.g. JSON, XML, CSV) are widespread
binary schema-driven formats like Thrift, Protocol Buffers, Avro allow compact encoding but the downside is the data needs to be decoded for humans
data-flows include databases, RPC/REST APIs, and async message passing using message brokers/actors

P2 – Distributed Data

why distribute data – scalability, fault tolerance/high availability, latency
shared-memory architecture 🗼
shared-disk architecture
shared-nothing architecture (horizontal scaling)
- each machine is a node
2 common ways:
- replication (provides redundancy)
- partitioning/sharding

5 – Replication

difficulty is in handling changes to data
3 popular algorithms:
- single-leader
- multi-leader
- leaderless
each node that stores a copy of the database is called a replica
leader-based replication aka active/passive aka master/slave
leader/master/primary → (replication log/change stream) → followers/read replicas/slaves/secondaries/hot standbys
synchronous vs asynchronous
chain replication (Azure storage)
log sequence number (Postgres), binlog coordinates (MySQL)
handling node outages:
- follower failure – catch-up recovery 🏃
- leader failure – failover 🛑
  - GitHub 2012-09-14, out-of-date MySQL follower promoted to leader, link
replication logs implementation:
- write-ahead log (WAL) shipping
- logical (row-based) log replication
- trigger-based replication
  - triggers, stored procedures
eventual consistency
replication lag 🎮
read-after-write consistency aka read-your-writes consistency 📕
monotonic reads – won’t see time ⌚ go backwards
consistent prefix reads

Multi-Leader Replication

aka master-master / active-active replication
conflict resolution
CRDT – conflict-free replicated datatypes Riak 2.0
mergeable persistent data structures
operational transformation – Etherpad / Google Docs
replication topology – circular 🟡, star/tree ⭐🌲, all-to-all
version vectors

Leaderless Replication

Amazon Dynamo 🧨 link
ACID – (atomicity, consistency, isolation, durability)
read request sent to several nodes in parallel
quorum condition:
- w + r > n (write nodes, read nodes, n replicas 🖥️)
leaderless replication ↔ use case requires HA and low latency (tolerate occasional stale reads)
sloppy quorum
last write wins (LWW) – discard concurrent writes
- achieves eventual convergence at the cost of durability

High-Level Takeaway

high availability (HA)
disconnected operation
latency
scalability
3 main approaches:
- single-leader replication (pros – no conflict resolution)
- multi-leader replication (cons – weak consistency guarantee)
- leaderless
replication can be synchronous or asynchronous
strange effects of replication lag:
- read-after-write consistency
- monotonic reads
- consistent prefix reads

6 – Partitioning

shard (MongoDB), region (HBase), tablet (Bigtable), vnode (Cassandra), vBucket (Couchbase)
unfair partitioning – skewed
partition with high load – hot spot ☀️
hash function to get partition for a key 🔑
consistent hashing aka hash partitioning (Karger et al) link
rebalancing (moving load from one node in cluster to another)
request routing/service discovery
Apache ZooKeeper (keep track of cluster metadata)

High-Level Takeaway

key range partitioning
hash partitioning
document-partitioned indexes (local indexes)
term-partitioned indexes (global indexes)

7 – Transactions

ACID criteria
BASE (basically available, soft state, eventual consistency)
Atomicity – abortability
Consistency – invariants
Isolation – serializability
Durability
single vs multi-object transactions
transaction isolation
- read committed – no dirty reads, no dirty writes
databases prevent dirty writes by using row level locks
nonrepeatable read/read skew – snapshot isolation 📸
atomic write operation ⚛️
write skew

Serializability

isolation levels are hard to understand
solution is serializable isolation
- literal transaction execution in order (VoltDB, Redis, Datomic)
- two-phase locking
  - predicate locks
  - index-range locks
- serializable snapshot isolation (SSI)

High-Level Takeaway

transaction abort
for complex access patterns, transactions can reduce errors
concurrency control:
- isolation levels include read committed, snapshot isolation aka repeatable read, serializable
examples of race conditions 🏎️:
- dirty reads
- dirty writes
- read skew (nonrepeatable reads)
- lost updates
- write skew
- phantom reads
implementing serializable transactions:
- literal transaction execution in order
- 2PL
- SSI

8 – Trouble with Distributed Systems

partial failures – nondeterministic
TCP flow control aka congestion avoidance aka backpressure
time-of-day clock (NTP) and monotonic clock ⏰
Byzantine faults
Byzantine fault-tolerant – even if some nodes are malicious

High-Level Takeaway

first step is to detect fault

9 – Consistency and Consensus

eventual consistency / convergence
linearizability aka atomic consistency
- recency guarantee on reads/writes of a register
CAP theorem 🧢
- consistency, availability, partition tolerance – pick 2 out of 3
Lamport timestamp (1978)
total order broadcast
consensus
- leader election
- atomic commit
FLP result
2PC (two-phase commit)

High-Level Takeaway

linearizability
causality
consensus

P3 – Derived Data

10 – Batch Processing

Unix philosophy
MapReduce

High-Level Takeaway

awk, grep, sort
2 main problems:
- partitioning
- fault tolerance
join algorithms:
- sort-merge joins
- broadcast hash joins
- partitioned hash joins

11 – Stream Processing

(unbounded inputs)
pub/sub model
message broker/message queue
types of windows 🪟:
- tumbling window
- hopping window
- sliding window
- session window

High-Level Takeaway

AMQP/JMS-style message broker
log-based message broker
3 joins:
- stream-stream joins
- stream-table joins
- table-table joins

12 – Future of Data Systems

lambda architecture 🇬🇷
dataflow

High-Level Takeaway

loosely coupled components
ethical aspects of building DIA

Conclusion

good book to read; highly recommend ★★★★☆

Additional Sources

A Primer on Database Replication

drew's blog