NOTE

CAP theorem and what partitions force

#software-engineering #distributed-systems

People lean on CAP as shorthand for tradeoffs in distributed storage. The useful core is a narrow impossibility claim about what a system can promise when the network drops traffic.

It is impossible in the asynchronous network model to implement a read/write data object that guarantees availability and atomic consistency in all fair executions (including those in which messages are lost).

That is Theorem 1 in Seth Gilbert and Nancy Lynch’s proof of Brewer’s conjecture, in their note Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. The definitions in that paper are the ones that make the statement true.

Three properties drive the story:

  • Consistency (in this sense) is atomic or linearizable behavior. Operations behave as if they ran on a single copy. A read that starts after a write finishes must see that write (or something newer).
  • Availability means every request to a non-failing node eventually gets a response. Under their partition model that bar stays up even when the network loses arbitrary messages between components.
  • Partition tolerance is modeled as those arbitrary losses. A temporary split between two groups is indistinguishable from “all cross-group messages dropped right now.”
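The consistency rule above has a checkable real-time consequence. A minimal sketch (hypothetical helper and field names, not a full linearizability checker): given a history of timed operations on one register, flag any read that starts after a write has finished yet still returns an older version.

```python
def stale_reads(history):
    """Flag reads that break the real-time rule: a read that starts
    after a write ends must return that write's version or a newer one.
    Each op is a dict with "op", "version", "start", "end" (wall-clock).
    Necessary for linearizability, not sufficient on its own."""
    bad = []
    for r in (op for op in history if op["op"] == "read"):
        # Versions written by writes that completed before this read began.
        finished = [w["version"] for w in history
                    if w["op"] == "write" and w["end"] < r["start"]]
        if finished and r["version"] < max(finished):
            bad.append(r)
    return bad

history = [
    {"op": "write", "version": 1, "start": 0.0, "end": 1.0},
    {"op": "read",  "version": 0, "start": 2.0, "end": 2.5},  # stale
    {"op": "read",  "version": 1, "start": 3.0, "end": 3.5},  # fine
]
print(stale_reads(history))  # flags only the stale read
```

The second read satisfies the rule even though it saw the same value as the first; the violation is about what had already completed when the read began, not about values in isolation.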

The proof is a partition argument with two non-empty sides. If writers on one side commit new state while readers on the other side answer without hearing them, you cannot keep the linearizability rule and the all-requests-complete rule at the same time. Something has to give for that execution.
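The fork in that argument can be made concrete with a toy model (all names hypothetical). Replica A accepts a write; replica B never hears about it because cross-partition messages are dropped. When a read reaches B, it must either answer with stale state (giving up atomic consistency) or refuse until the partition heals (giving up availability for that request):

```python
class Replica:
    """A single-register replica tracking a value and its version."""
    def __init__(self):
        self.value, self.version = None, 0

def replicate(src, dst, partitioned):
    """Ship src's state to dst; the partition drops the message."""
    if partitioned:
        return False                       # message lost by the network
    dst.value, dst.version = src.value, src.version
    return True

a, b = Replica(), Replica()
a.value, a.version = "new", 1              # write commits on side A
replicate(a, b, partitioned=True)          # cross-group message dropped

# Side B's fork when a read arrives: there is no third option.
stale_answer = b.value                     # answer anyway: consistency lost
refusal = "retry after partition heals"    # block/refuse: availability lost
print(stale_answer, "| or |", refusal)
```

Nothing about B's local state tells it a newer write exists, which is why the choice cannot be dodged by being cleverer on one side.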

What builders should take home is narrower than the usual three-property triangle chart.

  • “Pick two” tracks the proof’s assumptions. Gilbert and Lynch treat partition tolerance as part of the failure model for a wide-area service, not a menu option; if you run on an unreliable network, you plan for partitions, and the real choice is between C and A while one lasts. Brewer’s later IEEE retrospective on CAP stresses that partitions are rare, that many systems run in a CA regime inside a site, and that the hard fork between C and A bites only once you detect a partition or run out of time waiting (in real operations, timeouts are where CAP shows up).
  • Consistency and availability are not single bits. You can degrade gracefully (read-only mode, shed load, partial quorum), tune latency bounds, and recover with reconciliation. The theorem targets a specific formal pair (atomicity plus their availability rule in the asynchronous model). It does not replace product judgment about how stale a response may be or how long a user waits.
  • Scope your guarantees. Same datacenter, same Raft group, and “client cannot reach any replica” are different failure surfaces. Match your consistency story to the boundary where you actually run consensus or accept divergence.
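The "not single bits" point can be sketched as one degraded read path (hypothetical names, a sketch rather than a real client): try for a quorum within a latency budget, and past the deadline return a possibly-stale local answer labeled as such instead of erroring out.

```python
import time

def read_with_fallback(fetch_peers, local_value, quorum=2, budget_s=0.1):
    """Collect peer answers until quorum or deadline. On quorum, return
    the newest value as ("fresh", v); past the budget, degrade to the
    possibly-stale local copy as ("stale", v) rather than failing."""
    deadline = time.monotonic() + budget_s
    answers = []
    for get in fetch_peers:                # each peer is a zero-arg callable
        if time.monotonic() >= deadline:
            break
        v = get()                          # None models "no reply"
        if v is not None:
            answers.append(v)
        if len(answers) >= quorum:
            return ("fresh", max(answers)) # newest version wins
    return ("stale", local_value)

# Partitioned peers never answer; we degrade within the budget.
silent = [lambda: None, lambda: None, lambda: None]
print(read_with_fallback(silent, local_value=41))  # ('stale', 41)
```

Tagging the response lets the caller make the product judgment the theorem does not: whether a stale answer is acceptable here, and for how long.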

Use CAP as a reminder that a partition forces a decision: keep answering with possibly stale state, or stop honoring some requests until you can reconcile. Invoking Murphy’s law here just means planning as if those partitions and partial failures will show up in production, not only in proofs. And reach for the Gilbert and Lynch definitions when someone argues from the acronym alone.