Book Review: Designing Data-Intensive Applications

ddia

Notes on DDIA – Kleppmann

543 pages

3 parts, 12 chapters

P1 – Foundations of Data Systems

1 – RSM

  • database, cache, search index, stream processing, batch processing

Reliability

  • fault tolerant / resilient

Scalability

  • load parameters
  • response time – p95 p99 p999
  • tail latencies
  • SLO (service level objective) SLA (service level agreement)
  • scaling up (vertical – more powerful machine ๐Ÿค–)
  • scaling out (horizontal)
  • elastic systems

Maintainability

  • majority of the cost of software is maintenance – ๐Ÿž bugs, looking into failures, adapting it, paying tech debt, adding new features

Operability

Simplicity – big ball of mud

  • symptoms – explosion of state space, tight coupling of modules, tangled dependencies, inconsistent naming, hacks etc.
  • accidental complexity (Mosely and Marks, 2006) link
  • abstraction ๐Ÿ‘

Evolvability / Plasticity

High-Level Takeaway – RSM ๐Ÿ˜Ž

2 – Data Models, Query Languages

SQL (based on Codd 1970)

  • data stored in relations (tables), where each relation is an unordered collection of tuples (rows)
  • query optimizer ๐ŸŒˆ

NoSQL (2010s) ๐ŸŒฒ

  • impedance mismatch (between application code + storage layer)
  • CODASYL model
  • better for document-like structure (tree of one-to-many relationships)
  • declarative vs imperative languages

MapReduce

  • map (aka collect) ๐Ÿ—พ
  • reduce (aka fold/inject) ๐Ÿ’‰

Graph-like Data Models ใ€ฝ๏ธ

  • graph consists of vertices/nodes and edges/relationships

Triple Store

  • SPO (subject predicate/verb object) (jim, likes, bananas)
  • RDF
  • SPARQL
  • Datalog
  • unusual data models – GenBank (genome), particle physics (LHC), information retrieval/full text search

High-Level Takeaway ๐Ÿ˜Ž

  • 3 types – relational, document (NoSQL), graph (NoSQL)

3 – Internals of Storage Systems

  • optimize for transactional workloads, analytics
  • log-structured and page-oriented storage engines (B-trees ๐ŸŒฒ)
  • compaction โ™ป๏ธ
  • SSTable (sorted string table)
  • red-black trees, AVL trees
  • memtable
  • Googleโ€™s BigTable paper (2006) link
  • Log-Structured Merge-Tree (LSM-Tree)
  • Bloom filter ๐ŸŒน

B-tree (most widely used indexing structure) ๐ŸŒฒ

  • depth of O(log n)
  • write-ahead log (WAL)
  • fuzzy search
  • Memcached (in-memory KV stores)
  • OLAP (online analytic processing)
  • The process of getting data into a data warehouse is called Extract-Transform-Load (ETL)
  • Google Dremel (2010) link
  • bitmap encoding

High-Level Takeaway ๐Ÿ˜Ž

  • OLTP system – user facing
  • OLAP system – BA facing
  • log-structured school (append only) e.g. Cassandra, LSM tree
  • update-in place – B-trees

4 – Serialization

  • encoding/serialization/marshalling
  • decoding/parsing/deserialization/unmarshalling
  • JSON, XML, CSV
  • Apache Thrift (Facebook) / Protocol Buffers (Google) (protobuf) – binary encoding libraries
  • protobuf does not have list/array datatype but has repeated marker
  • Apache Avro
  • data flows ๐ŸŒŠ
  • data outlives code
  • database snapshots ๐Ÿ“ธ
  • SOA – service-oriented architecture aka microservices
  • 2 popular approaches to web services: REST ๐Ÿ˜Œ and SOAP ๐Ÿงผ
  • OpenAPI Specification (OAS)
  • RPC (remote procedure call) – tries to make a request look like calling a function/method
  • message broker aka message queue
  • actor model ๐Ÿ‘จโ€๐ŸŽค

High-Level Takeaway

  • rolling upgrades – evolvability
  • textual formats (e.g. JSON, XML, CSV) are widespread
  • binary schema-driven formats like Thrift, Protocol Buffers, Avro allow compact encoding but the downside is the data needs to be decoded for humans
  • data-flows include databases, RPC/REST APIs, and async message passing using message brokers/actors

P2 – Distributed Data

  • why distribute data – scalability, fault tolerance/high availability, latency
  • shared-memory architecture ๐Ÿ—ผ
  • shared-disk architecture
  • shared-nothing architecture (horizontal scaling)
    • each machine is a node
  • 2 common ways:
    • replication (provides redundancy)
    • partitioning/sharding

5 – Replication

  • difficulty is in handling changes to data
  • 3 popular algorithms:
    • single-leader
    • multi-leader
    • leaderless
  • each node that stores a copy of the database is called a replica
  • leader-based replication aka active/passive aka master/slave
  • leader/master/primary โ†’ (replication log/change stream) โ†’ followers/read replicas/slaves/secondaries/hot standbys
  • synchronous vs asynchronous
  • chain replication (Azure storage)
  • log sequence number (Postgres), binlog coordinates (MySQL)
  • handling node outages:
    • follower failure – catch-up recovery ๐Ÿƒ
    • leader failure – failover ๐Ÿ›‘
      • GitHub 2012-09-14, out-of-date MySQL follower promoted to leader, link
  • replication logs implementation:
    • write-ahead log (WAL) shipping
    • logical (row-based) log replication
    • trigger-based replication
      • triggers, stored procedures
  • eventual consistency
  • replication lag ๐ŸŽฎ
  • read-after-write consistency aka read-your-writes consistency ๐Ÿ“•
  • monotonic reads – wonโ€™t see time โŒš go backwards
  • consistent prefix reads

Multi-Leader Replication

  • aka master-master / active-active replication
  • conflict resolution
  • CRDT – conflict-free replicated datatypes Riak 2.0
  • mergeable persistent data structures
  • operational transformation – Etherpad / Google Docs
  • replication topology – circular ๐ŸŸก, star/tree โญ๐ŸŒฒ, all-to-all
  • version vectors

Leaderless Replication

  • Amazon Dynamo ๐Ÿงจ link
  • ACID – (atomicity, consistency, isolation, durability)
  • read request sent to several nodes in parallel
  • quorum condition:
    • w + r > n (write nodes, read nodes, n replicas ๐Ÿ–ฅ๏ธ)
  • leaderless replication โ†” use case requires HA and low latency (tolerate occasional stale reads)
  • sloppy quorum
  • last write wins (LWW) – discard concurrent writes
    • achieves eventual convergence at the cost of durability

High-Level Takeaway

  • high availability (HA)
  • disconnected operation
  • latency
  • scalability
  • 3 main approaches:
    • single-leader replication (pros – no conflict resolution)
    • multi-leader replication (cons – weak consistency guarantee)
    • leaderless
  • replication can be synchronous or asynchronous
  • strange effects of replication lag:
    • read-after-write consistency
    • monotonic reads
    • consistent prefix reads

6 – Partitioning

  • shard (MongoDB), region (HBase), tablet (Bigtable), vnode (Cassandra), vBucket (Couchbase)
  • unfair partitioning – skewed
  • partition with high load – hot spot โ˜€๏ธ
  • hash function to get partition for a key ๐Ÿ”‘
  • consistent hashing aka hash partitioning (Karger et al) link
  • rebalancing (moving load from one node in cluster to another)
  • request routing/service discovery
  • Apache ZooKeeper (keep track of cluster metadata)

High-Level Takeaway

  • key range partitioning
  • hash partitioning
  • document-partitioned indexes (local indexes)
  • term-partitioned indexes (global indexes)

7 – Transactions

  • ACID criteria
  • BASE (basically available, soft state, eventual consistency)
  • Atomicity – abortability
  • Consistency – invariants
  • Isolation – serializability
  • Durability
  • single vs multi-object transactions
  • transaction isolation
    • read committed – no dirty reads, no dirty writes
  • databases prevent dirty writes by using row level locks
  • nonrepeatable read/read skew – snapshot isolation ๐Ÿ“ธ
  • atomic write operation โš›๏ธ
  • write skew

Serializability

  • isolation levels are hard to understand
  • solution is serializable isolation
    • literal transaction execution in order (VoltDB, Redis, Datomic)
    • two-phase locking
      • predicate locks
      • index-range locks
    • serializable snapshot isolation (SSI)

High-Level Takeaway

  • transaction abort
  • for complex access patterns, transactions can reduce errors
  • concurrency control:
    • isolation levels include read committed, snapshot isolation aka repeatable read, serializable
  • examples of race conditions ๐ŸŽ๏ธ:
    • dirty reads
    • dirty writes
    • read skew (nonrepeatable reads)
    • lost updates
    • write skew
    • phantom reads
  • implementing serializable transactions:
    • literal transaction execution in order
    • 2PL
    • SSI

8 – Trouble with Distributed Systems

  • partial failures – nondeterministic
  • TCP flow control aka congestion avoidance aka backpressure
  • time-of-day clock (NTP) and monotonic clock โฐ
  • Byzantine faults
  • Byzantine fault-tolerant – even if some nodes are malicious

High-Level Takeaway

  • first step is to detect fault

9 – Consistency and Consensus

  • eventual consistency / convergence
  • linearizability aka atomic consistency
    • recency guarantee on reads/writes of a register
  • CAP theorem ๐Ÿงข
    • consistency, availability, partition tolerance – pick 2 out of 3
  • Lamport timestamp (1978)
  • total order broadcast
  • consensus
    • leader election
    • atomic commit
  • FLP result
  • 2PC (two-phase commit)

High-Level Takeaway

  • linearizability
  • causality
  • consensus

P3 – Derived Data

10 – Batch Processing

  • Unix philosophy
  • MapReduce

High-Level Takeaway

  • awk, grep, sort
  • 2 main problems:
    • partitioning
    • fault tolerance
  • join algorithms:
    • sort-merge joins
    • broadcast hash joins
    • partitioned hash joins

11 – Stream Processing

  • (unbounded inputs)
  • pub/sub model
  • message broker/message queue
  • types of windows ๐ŸชŸ:
    • tumbling window
    • hopping window
    • sliding window
    • session window

High-Level Takeaway

  • AMQP/JMS-style message broker
  • log-based message broker
  • 3 joins:
    • stream-stream joins
    • stream-table joins
    • table-table joins

12 – Future of Data Systems

  • lambda architecture ๐Ÿ‡ฌ๐Ÿ‡ท
  • dataflow

High-Level Takeaway

  • loosely coupled components
  • ethical aspects of building DIA

Conclusion

  • good book to read; highly recommend โ˜…โ˜…โ˜…โ˜…โ˜†

Additional Sources

A Primer on Database Replication

Leave a Reply

Your email address will not be published. Required fields are marked *