
DDIA Chapter 2 Notes: Scalability, Reliability & Maintainability

These are my supplementary notes on Chapter 2 of Designing Data-Intensive Applications, translating concepts of scalability, reliability, and maintainability using intuitive analogies.

While Designing Data-Intensive Applications (DDIA) is essential reading for understanding backend architecture, its complex concepts and distributed system terminology can be challenging. This post is a collection of my supplementary reading notes focusing on Chapter 2. Rather than replacing the book, the goal is to translate foundational concepts like scalability, reliability, and maintainability into real-world scenarios.

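The chapter title and epigraph, translated here from the Chinese edition of the book:

2. Defining Nonfunctional Requirements
The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free? Alan Kay, in an interview with Dr Dobb's Journal (2012)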


Case Study: Social Network Timeline

The biggest challenge in designing a social platform like Twitter or Instagram is the massive asymmetry between reads and writes: people view timelines far more often than they post.

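(The biggest challenge in designing a social platform is the extreme imbalance between reads and writes: far more people scroll the feed than post to it.)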

Polling vs. Push (Materialized Views)

Early systems relied on Polling, where the client app constantly queries the server at fixed intervals asking, "Any new posts?". This wastes massive network bandwidth and server CPU on empty responses. A better optimization is Fan-out on write (Push): when a user posts, the server immediately writes that post into all their followers' dedicated cache timelines.

In database terminology, this pre-computed and continuously updated query result is called a Materialized View. Instead of executing a complex SQL query on every read, the system "materializes" the result into a physical list stored in the cache, so a read becomes a single cache lookup rather than a join across followers and posts.

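(Rather than have clients poll constantly, push each post into followers' caches at the moment it is written. That precomputed, continuously updated result is a materialized view, and it keeps read latency to a minimum.)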
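To make fan-out on write concrete, here is a minimal in-memory sketch. The names `Post`, `TimelineCache`, `Follow`, `Publish`, and `Timeline` are hypothetical and chosen for illustration; a real system would use a distributed cache rather than a Go map, and nothing here is taken from the book.

```go
package main

import "sync"

// Post is a hypothetical, minimal post record.
type Post struct {
	Author string
	Text   string
}

// TimelineCache is an in-memory stand-in for per-user timeline caches.
type TimelineCache struct {
	mu        sync.Mutex
	followers map[string][]string // author -> follower IDs
	timelines map[string][]Post   // user -> materialized timeline, newest first
}

// NewTimelineCache returns an empty cache with its maps initialized.
func NewTimelineCache() *TimelineCache {
	return &TimelineCache{
		followers: make(map[string][]string),
		timelines: make(map[string][]Post),
	}
}

// Follow records that follower wants to see author's posts.
func (c *TimelineCache) Follow(follower, author string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.followers[author] = append(c.followers[author], follower)
}

// Publish does the expensive work at write time: it copies the post
// to the front of every follower's precomputed timeline.
func (c *TimelineCache) Publish(p Post) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, f := range c.followers[p.Author] {
		c.timelines[f] = append([]Post{p}, c.timelines[f]...)
	}
}

// Timeline is the cheap read path: one map lookup, no joins.
// It returns a copy so callers cannot mutate the cache.
func (c *TimelineCache) Timeline(user string) []Post {
	c.mu.Lock()
	defer c.mu.Unlock()
	return append([]Post(nil), c.timelines[user]...)
}
```

The cost has moved rather than disappeared: `Publish` loops over every follower, which is exactly why accounts with huge follower counts need special handling.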

Handling Traffic Spikes with Queues (Asynchronous Processing)

What happens during extreme traffic spikes, like New Year's Eve? If the system tries to push every post to every follower synchronously, the write path can be overwhelmed.

To survive, the system introduces a Message Queue (like Kafka) to act as a buffer (Asynchronous Processing):

  • Write (Queueing): When you post, it is saved to the main database, but the heavy task of "distributing to followers" is put into a queue. This introduces Eventual Consistency: it might take a few seconds or a minute for the post to actually appear in followers' timelines.
  • Read (Still Fast): Despite the queueing delay, reads stay fast. Because read requests still hit the cache directly, the timeline loads quickly. The timeline a user sees may be about 10 seconds stale, but the app does not freeze.

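(During a traffic spike the system enqueues first and processes asynchronously. Posts reach followers' timelines more slowly (eventual consistency), but because reads still come from the cache, the app still opens almost instantly. This peak-shaving mechanism gives up a little freshness in order to protect availability.)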
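The buffering idea can be sketched with a buffered Go channel standing in for the message queue. This is an in-process illustration only, not how Kafka works; `FanOutJob`, `Dispatcher`, `Enqueue`, and `Close` are hypothetical names, and appending to a slice stands in for writing to follower caches.

```go
package main

import "sync"

// FanOutJob is a hypothetical unit of deferred work:
// "deliver this post to its author's followers".
type FanOutJob struct {
	PostID int
}

// Dispatcher buffers jobs and lets one background worker drain them
// at its own pace.
type Dispatcher struct {
	queue     chan FanOutJob
	wg        sync.WaitGroup
	mu        sync.Mutex
	delivered []int // stand-in for writes to follower caches
}

// NewDispatcher starts the worker with a queue of the given capacity.
func NewDispatcher(buffer int) *Dispatcher {
	d := &Dispatcher{queue: make(chan FanOutJob, buffer)}
	d.wg.Add(1)
	go func() {
		defer d.wg.Done()
		for job := range d.queue {
			d.mu.Lock()
			d.delivered = append(d.delivered, job.PostID)
			d.mu.Unlock()
		}
	}()
	return d
}

// Enqueue returns as soon as the job is buffered, so the write path
// stays fast. It reports false when the buffer is full, which lets the
// caller decide what to do instead of blocking.
// It must not be called after Close.
func (d *Dispatcher) Enqueue(job FanOutJob) bool {
	select {
	case d.queue <- job:
		return true
	default:
		return false
	}
}

// Close stops accepting jobs, waits for the backlog to drain, and
// returns the IDs that were delivered, in delivery order.
func (d *Dispatcher) Close() []int {
	close(d.queue)
	d.wg.Wait()
	return d.delivered
}
```

The gap between `Enqueue` returning and the worker processing the job is the eventual-consistency window described above.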

The Hybrid Approach for Celebrities

While the Push model works well for regular users, it breaks down for mega-celebrities. If a celebrity with 50 million followers posts, that single post triggers 50 million cache writes, which can swamp the fan-out workers and delay delivery for everyone (known as fan-out delay).

To solve this, Twitter adopted a Hybrid Approach that splits the workload:

  1. Regular Friends (Push / Materialized): When your normal friends post, the system pushes it to your dedicated cache in the background.
  2. Celebrities (Isolated Storage): When Elon Musk posts, the system does not push. The post is stored globally in the database.
  3. App Open (Merge on Read): When you open the app, the server instantly fetches your pre-computed cache (from Step 1), quickly queries the database for any new posts from the few celebrities you follow (from Step 2), and merges them by timestamp.

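(Posts from ordinary users are pushed straight into followers' caches; posts from mega-celebrities are stored separately. When a follower opens the app, the server takes the ready-made cache, pulls in the celebrities' new posts on the fly, and merges the two. This balances read and write performance.)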
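The merge-on-read step (Step 3) can be sketched as follows. `TimedPost` and `MergeTimeline` are hypothetical names, and timestamps are plain Unix seconds for simplicity. The sketch sorts the combined list, which is the simplest correct option; a production system would more likely do a linear merge of two already-sorted lists.

```go
package main

import "sort"

// TimedPost is a hypothetical post record with a Unix timestamp.
type TimedPost struct {
	Author string
	Text   string
	Unix   int64
}

// MergeTimeline combines the pushed (pre-computed) timeline with the
// celebrity posts pulled at read time, newest first. A negative limit
// means "no limit".
func MergeTimeline(cached, celebrity []TimedPost, limit int) []TimedPost {
	merged := make([]TimedPost, 0, len(cached)+len(celebrity))
	merged = append(merged, cached...)
	merged = append(merged, celebrity...)
	// Stable sort keeps the input order for posts with equal timestamps.
	sort.SliceStable(merged, func(i, j int) bool {
		return merged[i].Unix > merged[j].Unix
	})
	if limit >= 0 && len(merged) > limit {
		merged = merged[:limit]
	}
	return merged
}
```

Because a user follows only a handful of celebrities, the pulled list stays small and the extra read cost is modest.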

Describing Performance

To understand the relationship between system load, throughput, and response time, imagine a popular Ramen Shop:

  • Load: The number of customers entering the shop per second (Requests Per Second / RPS).
  • Throughput: How many bowls of ramen the kitchen can actually cook and serve per second.
  • Response Time: The total time a customer waits from placing an order to eating the first bite.

When the load approaches the kitchen's throughput limit, orders pile up in a Queue, and response times rise sharply and nonlinearly. If unchecked, frustrated clients will time out and automatically retry, triggering a Retry Storm that worsens the load. This can push the system into a Metastable Failure: a state in which the system stays overloaded even after the initial traffic spike subsides.

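(As load nears the limit, requests start to queue and response times shoot up. If clients time out and retry aggressively, the result is a retry storm and a metastable failure that keeps the system stuck.)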
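To get a feel for how quickly waiting grows near saturation, here is a back-of-the-envelope formula. It assumes the simplest textbook queueing model (one server, random arrivals, random service times, usually called M/M/1). That model is an assumption of this note, not something the chapter prescribes. With average service time $S$ and utilization $\rho$ (load divided by the throughput limit, $0 \le \rho < 1$), the mean response time $R$ is:

$$R = \frac{S}{1 - \rho}$$

If one bowl takes $S = 2$ minutes to prepare, a customer waits about 4 minutes in total at $\rho = 0.5$, about 20 minutes at $\rho = 0.9$, and about 200 minutes at $\rho = 0.99$. Real kitchens and real servers will not match these numbers, but the shape of the curve is the point: the last few percent of utilization cost far more than the first fifty.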

Defense Mechanisms

  • Client-Side: Implement Exponential Backoff (doubling the wait time between retries) paired with Jitter (adding randomness) to scatter retry bursts. Alternatively, use a Circuit Breaker to temporarily stop sending requests when errors spike. (Clients spread out their retries with exponential backoff and random jitter, or pause requests with a circuit breaker.)
  • Server-Side: Execute Load Shedding by dropping non-essential incoming requests instantly (like hanging a "Sold Out" sign) or apply Backpressure to explicitly tell the client to slow down. (Servers can reject excess requests outright, which is load shedding, or use backpressure to ask callers to slow down.)
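The client-side defense can be sketched in a few lines. This is one common variant (the delay is picked uniformly below a doubling ceiling), not the only way to add jitter. `BackoffDelay` and its parameters are hypothetical names; the random source is passed in as a function so the behavior is easy to test.

```go
package main

import "time"

// BackoffDelay returns how long to wait before retry number attempt
// (0-based). The ceiling starts at base, doubles with each attempt,
// and is capped at maxDelay. The actual delay is drawn from
// [0, ceiling) by the caller-supplied jitter function.
//
// jitter(n) must return a value in [0, n); rand.Int63n from math/rand
// has exactly this shape.
func BackoffDelay(attempt int, base, maxDelay time.Duration, jitter func(n int64) int64) time.Duration {
	ceiling := base
	for i := 0; i < attempt; i++ {
		// Cap before doubling so the multiplication cannot overflow.
		if ceiling >= maxDelay/2 {
			ceiling = maxDelay
			break
		}
		ceiling *= 2
	}
	if ceiling > maxDelay {
		ceiling = maxDelay
	}
	if ceiling <= 0 {
		return 0
	}
	return time.Duration(jitter(int64(ceiling)))
}
```

A caller would typically write something like `time.Sleep(BackoffDelay(attempt, 100*time.Millisecond, 10*time.Second, rand.Int63n))` between retries. The randomness matters as much as the doubling: without it, clients that failed together retry together.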

Percentiles & Monitoring (SLIs, SLOs & SLAs)

Averages are a poor way to monitor backend performance because they hide the slowest requests. Instead, focus on high percentiles like p99.9 (Tail Latency), the threshold that only the slowest 1 in 1000 requests exceed. The slowest requests often belong to users who have the largest accounts and the most data. Keeping them waiting directly impacts the business.

To understand how we measure and enforce these metrics, think of a Food Delivery Platform:

  • SLI (Service Level Indicator - The Thermometer): The objective measurement. E.g., The total minutes from order placement to delivery. In systems, this is the actual Response Time.
  • SLO (Service Level Objective - The Target): The internal goal. E.g., 99% of our orders must be delivered within 30 minutes. If performance drops below this, a common policy is for engineers to pause feature development and fix reliability problems first.
  • SLA (Service Level Agreement - The Contract): The commercial promise with financial penalties. E.g., If your food takes over 30 minutes, you get a $50 refund. In cloud systems, breaching the SLA means refunding paying customers.

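(Monitor high percentiles such as p99 tail latency rather than averages. Three terms manage this: the SLI is the objective measurement (the thermometer); the SLO is the team's internal passing mark (keep the room at 25 degrees); the SLA is the commercial contract with penalties owed to paying customers (pay out 50 if it goes above 27 degrees).)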

To monitor this effectively, systems maintain a rolling window (e.g., last 10 minutes) to continuously calculate these percentiles for real-time dashboards (like Grafana) and automated alerting.
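As a rough illustration of why percentiles tell a different story than averages, the sketch below computes a percentile with the nearest-rank method over whatever samples are currently in the window. `Percentile` is a hypothetical helper. Real monitoring systems usually rely on approximate streaming data structures instead of sorting every sample, so treat this as a teaching aid.

```go
package main

import (
	"math"
	"sort"
)

// Percentile returns the p-th percentile (0 < p <= 100) of samples
// using the nearest-rank method. It returns 0 for an empty slice and
// does not modify its input.
func Percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	// Nearest rank: the smallest rank covering at least p percent.
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}
```

For ten response times of 10, 20, ..., 90 ms plus one 1000 ms outlier, the mean is 145 ms, a value no request actually experienced. The median is 50 ms and the p99 is 1000 ms, which together describe both the typical and the worst case.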

Reliability & Fault Tolerance

Fault vs. Failure

  • Fault: A specific component deviating from its spec (e.g., one headlight burns out on your car).
  • Failure: The entire system stops providing service to the user (e.g., the car's engine dies, leaving you stranded).

Fault-tolerant systems aim to prevent a single fault from escalating into a total failure. If a single component's fault immediately breaks the whole system, that component is a Single Point of Failure (SPOF).

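(Fault-tolerant design aims to stop a local fault from becoming a system-wide failure. A component whose breakdown takes the whole site down is a single point of failure.)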
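One simple way to keep a single fault from becoming a failure is redundancy with failover. The sketch below tries each replica in turn and only gives up when all of them fail. `Replica`, `CallWithFailover`, and `ErrAllReplicasFailed` are hypothetical names; a real client would also add timeouts and health checks.

```go
package main

import "errors"

// Replica is a hypothetical handle to one copy of a service.
type Replica func(request string) (string, error)

// ErrAllReplicasFailed is returned only when every replica has failed,
// which is the point where faults have become a failure.
var ErrAllReplicasFailed = errors.New("all replicas failed")

// CallWithFailover tries each replica in order and returns the first
// successful response. A fault in one replica is tolerated as long as
// another one can answer.
func CallWithFailover(replicas []Replica, request string) (string, error) {
	for _, call := range replicas {
		resp, err := call(request)
		if err == nil {
			return resp, nil
		}
	}
	return "", ErrAllReplicasFailed
}
```

With a single replica in the list, that replica is a single point of failure: its fault is immediately the caller's failure.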

Chaos Engineering & Netflix's Chaos Monkey

Counterintuitively, a practical way to ensure error-handling code works is to inject faults on purpose. This methodology, known as Chaos Engineering, was popularized by Netflix with their "Chaos Monkey" tool. By randomly terminating server instances in the production environment, Chaos Monkey continuously forces the system to exercise and validate its automated recovery and resilience mechanisms.

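(Rather than hoping disaster never strikes, do what Netflix's Chaos Monkey does and take down servers in production at random and without warning. Only this kind of fault-injection drill truly validates a system's automatic failover and fault-tolerance mechanisms.)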

Hardware Faults: Regions vs. Availability Zones (AZs)

To understand cloud infrastructure, we must distinguish between a Region and an Availability Zone (AZ):

  • Region: A broad geographical area, such as Tokyo or US-East.
  • AZ (Availability Zone): A distinct physical data center (or a group of data centers) located inside a Region.

Resources within the same AZ typically share physical facilities such as the building, power supply, and network switches. Because of this, they have a shared fate: they are likely to fail together during a local disaster, like a fire or a fiber cable cut. Achieving High Availability (HA) requires designing systems to be redundant across multiple AZs, or even across entire Regions to survive large-scale geographic catastrophes.

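(In cloud architecture, a Region is a broad geographic area and an AZ is a physical facility inside it. Machines in the same AZ share power and networking, so they share a fate and tend to fail together in an outage. Modern architectures therefore spread a system across several AZs, or even across Regions, to tolerate hardware faults and achieve high availability.)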

Scalability

When hardware resources are no longer enough, there are two primary scaling strategies:

Scale Up vs. Scale Out

  • Scale Up (Vertical Scaling / Shared Memory): Moving to a more powerful machine with faster CPUs and massive RAM. This is like getting a giant office desk where multiple threads share the same memory space. However, hardware costs grow super-linearly and eventually hit physical limits.
  • Scale Out (Horizontal Scaling / Shared Nothing): Connecting a cluster of multiple smaller, regular machines. Each node has its own CPU, RAM, and disk storage, communicating entirely over the network.

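(Scaling up is like buying one enormous desk: expensive, with a physical limit. Scaling out is like buying many ordinary desks and grouping them into a cluster, which is the mainstream approach today.)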

Evolution of the Storage Layer

Traditional shared-disk architectures rely on NAS (File-level access) or SAN (Block-level access). This is like multiple employees sharing one single massive filing cabinet; everyone has to pull out raw binders (data blocks) over the network, leading to lock contention and network bottlenecks.

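(Traditional architectures such as NAS and SAN are like one big shared filing cabinet: hauling raw data blocks around easily leads to lock contention and network congestion.)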

Modern Cloud-Native Databases decouple storage and compute. Instead of shipping heavy raw blocks, the compute nodes send lightweight requests (such as log records) through a purpose-built storage API to an independent storage layer, which avoids the old scalability bottlenecks.

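(Cloud-native databases separate storage from compute and pass compact API instructions instead, leaving the storage layer to apply the changes itself.)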

Maintainability

One of the best tools against software complexity is Abstraction. A good abstraction hides low-level implementation details behind a clean, simple API. For example, SQL abstracts away complex on-disk data structures and the handling of concurrent requests from other clients.

In Go, the interface is a brilliant manifestation of this design philosophy:

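package storage

import "os"

// Storage is the abstraction: we only care about the behavior, not the location.
type Storage interface {
    Save(data string) error
}

// MemoryStorage is low-level detail A: writing to memory.
type MemoryStorage struct {
    cache map[string]string
}

func (m *MemoryStorage) Save(data string) error {
    if m.cache == nil {
        m.cache = make(map[string]string) // writing to a nil map would panic
    }
    m.cache["data"] = data
    return nil
}

// DiskStorage is low-level detail B: handling heavy disk I/O.
type DiskStorage struct {
    filePath string
}

// Save opens the file, writes, syncs, and closes.
// Assumption: each call overwrites the file; adjust the flags to append instead.
func (d *DiskStorage) Save(data string) error {
    f, err := os.OpenFile(d.filePath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o644)
    if err != nil {
        return err
    }
    defer f.Close()
    if _, err := f.WriteString(data); err != nil {
        return err
    }
    return f.Sync() // flush to disk before reporting success
}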

By coding against the Storage interface, your core business logic remains oblivious to whether data is written to a volatile memory map or a persistent disk. Just like a car's accelerator pedal masks the intricate physics happening under the hood, software interfaces empower us to build maintainable applications by shielding us from underlying complexity.

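(By defining a Go interface, the business logic is shielded from whether the layer underneath works with memory or with disk I/O. An interface is like a car's accelerator pedal: it hides the complexity of the engine, and that is the power of abstraction.)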
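One practical payoff of coding against an interface is that business logic can be exercised without touching memory maps or disks at all. The sketch below restates the `Storage` interface so it compiles on its own. `SaveAll` and `FakeStorage` are hypothetical names introduced for illustration.

```go
package main

import "errors"

// Storage is restated here so the example is self-contained.
type Storage interface {
	Save(data string) error
}

// SaveAll is hypothetical business logic. It depends only on the
// interface, so it works with any implementation. It returns how many
// items were saved before the first error, if any.
func SaveAll(s Storage, items []string) (int, error) {
	for i, item := range items {
		if err := s.Save(item); err != nil {
			return i, err
		}
	}
	return len(items), nil
}

// FakeStorage is a test double: it records every write and can be told
// to fail on one specific item.
type FakeStorage struct {
	Saved  []string
	FailOn string // empty means "never fail"
}

// Save records data, or fails if data matches FailOn.
func (f *FakeStorage) Save(data string) error {
	if f.FailOn != "" && data == f.FailOn {
		return errors.New("simulated write failure")
	}
	f.Saved = append(f.Saved, data)
	return nil
}
```

`SaveAll` never learns which implementation it was handed. Swapping the fake for a memory-backed or disk-backed implementation requires no change to the calling code.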

Wrapping Up

System design is rarely about finding a perfect, one-size-fits-all solution. Instead, it is the constant art of making the right trade-offs.

As we've seen in this chapter:

  • Solving Scalability often means sacrificing strict real-time updates for eventual consistency (like our message queue example).
  • Achieving Reliability means embracing failure rather than hiding from it (like Netflix's Chaos Monkey).
  • Ensuring Maintainability requires building good abstractions to keep underlying complexity under control.

There is no single "best" architecture, only the right architecture for your current load, team, and constraints. Hopefully, these everyday analogies make the foundational concepts of DDIA a bit less daunting. Happy reading!

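(System architecture never has a standard answer, only endless trade-offs. To scale, we may give up a little freshness; to be highly reliable, we must embrace failure and rehearse it; and good abstractions are the ultimate weapon for protecting maintainability and engineers' sanity. I hope these plain-language analogies make DDIA easier to picture as you work through it!)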