Notes on DDIA – Kleppmann
543 pages
3 parts, 12 chapters
P1 – Foundations of Data Systems
1 – RSM
- database, cache, search index, stream processing, batch processing
Reliability
- fault tolerant / resilient
Scalability
- load parameters
- response time – p95 p99 p999
- tail latencies
- SLO (service level objective) SLA (service level agreement)
- scaling up (vertical – more powerful machine ๐ค)
- scaling out (horizontal)
- elastic systems
Maintainability
- majority of the cost of software is maintenance – ๐ bugs, looking into failures, adapting it, paying tech debt, adding new features
Operability
Simplicity – big ball of mud
- symptoms – explosion of state space, tight coupling of modules, tangled dependencies, inconsistent naming, hacks etc.
- accidental complexity (Mosely and Marks, 2006) link
- abstraction ๐
Evolvability / Plasticity
High-Level Takeaway – RSM ๐
2 – Data Models, Query Languages
SQL (based on Codd 1970)
- data stored in relations (tables), where each relation is an unordered collection of tuples (rows)
- query optimizer ๐
NoSQL (2010s) ๐ฒ
- impedance mismatch (between application code + storage layer)
- CODASYL model
- better for document-like structure (tree of one-to-many relationships)
- declarative vs imperative languages
MapReduce
- map (aka collect) ๐พ
- reduce (aka fold/inject) ๐
Graph-like Data Models ใฝ๏ธ
- graph consists of vertices/nodes and edges/relationships
Triple Store
- SPO (subject predicate/verb object) (jim, likes, bananas)
- RDF
- SPARQL
- Datalog
- unusual data models – GenBank (genome), particle physics (LHC), information retrieval/full text search
High-Level Takeaway ๐
- 3 types – relational, document (NoSQL), graph (NoSQL)
3 – Internals of Storage Systems
- optimize for transactional workloads, analytics
- log-structured and page-oriented storage engines (B-trees ๐ฒ)
- compaction โป๏ธ
- SSTable (sorted string table)
- red-black trees, AVL trees
- memtable
- Googleโs BigTable paper (2006) link
- Log-Structured Merge-Tree (LSM-Tree)
- Bloom filter ๐น
B-tree (most widely used indexing structure) ๐ฒ
- depth of O(log n)
- write-ahead log (WAL)
- fuzzy search
- Memcached (in-memory KV stores)
- OLAP (online analytic processing)
- The process of getting data into a data warehouse is called Extract-Transform-Load (ETL)
- Google Dremel (2010) link
- bitmap encoding
High-Level Takeaway ๐
- OLTP system – user facing
- OLAP system – BA facing
- log-structured school (append only) e.g. Cassandra, LSM tree
- update-in place – B-trees
4 – Serialization
- encoding/serialization/marshalling
- decoding/parsing/deserialization/unmarshalling
- JSON, XML, CSV
- Apache Thrift (Facebook) / Protocol Buffers (Google) (protobuf) – binary encoding libraries
- protobuf does not have list/array datatype but has
repeated
marker - Apache Avro
- data flows ๐
- data outlives code
- database snapshots ๐ธ
- SOA – service-oriented architecture aka microservices
- 2 popular approaches to web services: REST ๐ and SOAP ๐งผ
- OpenAPI Specification (OAS)
- RPC (remote procedure call) – tries to make a request look like calling a function/method
- message broker aka message queue
- actor model ๐จโ๐ค
High-Level Takeaway
- rolling upgrades – evolvability
- textual formats (e.g. JSON, XML, CSV) are widespread
- binary schema-driven formats like Thrift, Protocol Buffers, Avro allow compact encoding but the downside is the data needs to be decoded for humans
- data-flows include databases, RPC/REST APIs, and async message passing using message brokers/actors
P2 – Distributed Data
- why distribute data – scalability, fault tolerance/high availability, latency
- shared-memory architecture ๐ผ
- shared-disk architecture
- shared-nothing architecture (horizontal scaling)
- each machine is a node
- 2 common ways:
- replication (provides redundancy)
- partitioning/sharding
5 – Replication
- difficulty is in handling changes to data
- 3 popular algorithms:
- single-leader
- multi-leader
- leaderless
- each node that stores a copy of the database is called a
replica
- leader-based replication aka active/passive aka master/slave
- leader/master/primary โ (replication log/change stream) โ followers/read replicas/slaves/secondaries/hot standbys
- synchronous vs asynchronous
- chain replication (Azure storage)
- log sequence number (Postgres), binlog coordinates (MySQL)
- handling node outages:
- follower failure – catch-up recovery ๐
- leader failure – failover ๐
- GitHub 2012-09-14, out-of-date MySQL follower promoted to leader, link
- replication logs implementation:
- write-ahead log (WAL) shipping
- logical (row-based) log replication
- trigger-based replication
- triggers, stored procedures
- eventual consistency
- replication lag ๐ฎ
- read-after-write consistency aka read-your-writes consistency ๐
- monotonic reads – wonโt see time โ go backwards
- consistent prefix reads
Multi-Leader Replication
- aka master-master / active-active replication
- conflict resolution
- CRDT – conflict-free replicated datatypes Riak 2.0
- mergeable persistent data structures
- operational transformation – Etherpad / Google Docs
- replication topology – circular ๐ก, star/tree โญ๐ฒ, all-to-all
- version vectors
Leaderless Replication
- Amazon Dynamo ๐งจ link
- ACID – (atomicity, consistency, isolation, durability)
- read request sent to several nodes in parallel
- quorum condition:
w + r > n
(write nodes, read nodes, n replicas ๐ฅ๏ธ)
- leaderless replication โ use case requires HA and low latency (tolerate occasional stale reads)
- sloppy quorum
- last write wins (LWW) – discard concurrent writes
- achieves eventual convergence at the cost of durability
High-Level Takeaway
- high availability (HA)
- disconnected operation
- latency
- scalability
- 3 main approaches:
- single-leader replication (pros – no conflict resolution)
- multi-leader replication (cons – weak consistency guarantee)
- leaderless
- replication can be synchronous or asynchronous
- strange effects of replication lag:
- read-after-write consistency
- monotonic reads
- consistent prefix reads
6 – Partitioning
- shard (MongoDB), region (HBase), tablet (Bigtable), vnode (Cassandra), vBucket (Couchbase)
- unfair partitioning – skewed
- partition with high load – hot spot โ๏ธ
- hash function to get partition for a key ๐
- consistent hashing aka hash partitioning (Karger et al) link
- rebalancing (moving load from one node in cluster to another)
- request routing/service discovery
- Apache ZooKeeper (keep track of cluster metadata)
High-Level Takeaway
- key range partitioning
- hash partitioning
- document-partitioned indexes (local indexes)
- term-partitioned indexes (global indexes)
7 – Transactions
- ACID criteria
- BASE (basically available, soft state, eventual consistency)
- Atomicity – abortability
- Consistency – invariants
- Isolation – serializability
- Durability
- single vs multi-object transactions
- transaction isolation
- read committed – no dirty reads, no dirty writes
- databases prevent dirty writes by using row level locks
- nonrepeatable read/read skew – snapshot isolation ๐ธ
- atomic write operation โ๏ธ
- write skew
Serializability
- isolation levels are hard to understand
- solution is serializable isolation
- literal transaction execution in order (VoltDB, Redis, Datomic)
- two-phase locking
- predicate locks
- index-range locks
- serializable snapshot isolation (SSI)
High-Level Takeaway
- transaction abort
- for complex access patterns, transactions can reduce errors
- concurrency control:
- isolation levels include read committed, snapshot isolation aka repeatable read, serializable
- examples of race conditions ๐๏ธ:
- dirty reads
- dirty writes
- read skew (nonrepeatable reads)
- lost updates
- write skew
- phantom reads
- implementing serializable transactions:
- literal transaction execution in order
- 2PL
- SSI
8 – Trouble with Distributed Systems
- partial failures – nondeterministic
- TCP flow control aka congestion avoidance aka backpressure
- time-of-day clock (NTP) and monotonic clock โฐ
- Byzantine faults
- Byzantine fault-tolerant – even if some nodes are malicious
High-Level Takeaway
- first step is to detect fault
9 – Consistency and Consensus
- eventual consistency / convergence
- linearizability aka atomic consistency
- recency guarantee on reads/writes of a register
- CAP theorem ๐งข
- consistency, availability, partition tolerance – pick 2 out of 3
- Lamport timestamp (1978)
- total order broadcast
- consensus
- leader election
- atomic commit
- FLP result
- 2PC (two-phase commit)
High-Level Takeaway
- linearizability
- causality
- consensus
P3 – Derived Data
10 – Batch Processing
- Unix philosophy
- MapReduce
High-Level Takeaway
awk, grep, sort
- 2 main problems:
- partitioning
- fault tolerance
- join algorithms:
- sort-merge joins
- broadcast hash joins
- partitioned hash joins
11 – Stream Processing
- (unbounded inputs)
- pub/sub model
- message broker/message queue
- types of windows ๐ช:
- tumbling window
- hopping window
- sliding window
- session window
High-Level Takeaway
- AMQP/JMS-style message broker
- log-based message broker
- 3 joins:
- stream-stream joins
- stream-table joins
- table-table joins
12 – Future of Data Systems
- lambda architecture ๐ฌ๐ท
- dataflow
High-Level Takeaway
- loosely coupled components
- ethical aspects of building DIA
Conclusion
- good book to read; highly recommend โ โ โ โ โ
Leave a Reply