The Art of Multiprocessor Programming

The Art of Multiprocessor Programming · Maurice Herlihy & Nir Shavit · 530 pages

The definitive textbook on concurrent programming theory — linearizability, mutual exclusion algorithms, lock-free data structures, hardware primitives, and transactional memory. Combines rigorous theory with practical Java implementations.

Capabilities (10)
  • Classify algorithm correctness using linearizability vs sequential consistency
  • Identify progress guarantees: wait-free, lock-free, obstruction-free, deadlock-free, starvation-free
  • Implement Peterson, Filter, and Bakery locks from first principles
  • Build lock-free stack (Treiber) with ABA-safe AtomicStampedReference
  • Understand Michael-Scott queue helping mechanism
  • Apply CAS/TAS/FAA hardware primitives correctly
  • Use consensus numbers to determine what CAS can implement (answer: anything)
  • Design TATAS, CLH, MCS queue locks with appropriate cache behavior
  • Apply hazard pointers, RCU, and epoch-based reclamation for lock-free memory safety
  • Evaluate when STM is appropriate (low contention, short transactions, read-heavy)
How to use

Install this skill and Claude can reason about concurrent algorithm correctness using linearizability, classify progress guarantees from wait-free to starvation-free, review lock-based code for deadlock and livelock, and apply lock-free patterns with correct ABA prevention and memory reclamation.

Why it matters

Provides the theoretical foundations needed to write provably correct concurrent code rather than code that merely passes tests — understanding linearizability, consensus numbers, and progress conditions is essential when building high-performance lock-free data structures.

Example use cases
  • Auditing a CAS-based stack implementation, identifying an ABA hazard, and rewriting it using AtomicStampedReference with a stamped version counter
  • Classifying a spinlock implementation as starvation-free or only deadlock-free and proposing an MCS queue lock for fairness under high contention
  • Determining whether volatile memory ordering in a Java class is sufficient for linearizability or whether a full synchronized block or AtomicReference is required

The Art of Multiprocessor Programming Skill

Correctness Properties

Linearizability

The gold standard for concurrent object correctness.

A concurrent execution is linearizable if:

  1. Each operation appears to take effect atomically at some point between its invocation and response
  2. The resulting sequence is consistent with the object’s sequential specification

Why it matters: a linearizable object is composable — a system of linearizable objects is linearizable.

Linearization point: the single instant at which the operation “takes effect.” One must be identified for each execution path of each operation.

Sequential Consistency

Weaker than linearizability: operations appear in some legal sequential order consistent with each thread’s program order. Does NOT require real-time order across threads.

Rule of thumb: use linearizability for concurrent objects; sequential consistency is too weak for composition.

Progress Conditions (Strongest to Weakest)

Wait-free: every thread completes in a finite number of steps

Lock-free: some thread always completes (starvation possible)

Obstruction-free: thread completes if it runs in isolation

Deadlock-free: some thread eventually acquires the lock

Starvation-free: every thread eventually acquires the lock
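The contrast between the two strongest conditions can be made concrete with a counter. A minimal sketch (the class and method names are mine, for illustration): a CAS retry loop is lock-free, while a plain atomic read is wait-free.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrates two progress guarantees on one object.
class Counter {
    private final AtomicInteger value = new AtomicInteger();

    // Lock-free: when several threads race, at least one CAS succeeds,
    // so some thread always completes — but any particular thread can
    // lose the race indefinitely (starvation is possible).
    int lockFreeIncrement() {
        while (true) {
            int v = value.get();
            if (value.compareAndSet(v, v + 1)) return v + 1;
        }
    }

    // Wait-free: a single read finishes in a bounded number of steps
    // regardless of what every other thread is doing.
    int waitFreeRead() {
        return value.get();
    }
}
```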

Mutual Exclusion: Foundational Algorithms

Peterson Lock (2 Threads)

class Peterson {
    // Caveat: "volatile" on an array makes the *reference* volatile,
    // not its elements; a faithful Java version needs AtomicBoolean[]
    // or AtomicIntegerArray. Kept as boolean[] to match the book.
    private volatile boolean[] flag = new boolean[2];
    private volatile int victim;

    public void lock() {
        int i = ThreadID.get();
        int j = 1 - i;
        flag[i] = true;        // I want in
        victim = i;            // I'll wait if needed
        while (flag[j] && victim == i) {} // spin until it's my turn
    }

    public void unlock() {
        int i = ThreadID.get();
        flag[i] = false;
    }
}

Satisfies mutual exclusion and starvation-freedom — but only with correct memory ordering (see the volatile caveat in the code above).

Filter Lock (N Threads)

Generalizes Peterson to N threads using a “filter” of N−1 levels. A thread must pass through every level; at most N−l threads can be at level l or beyond, so exactly one thread reaches the critical section.

Bakery Algorithm

Each thread takes a ticket number; threads enter in (ticket, thread-ID) order. Satisfies first-come-first-served fairness, and hence starvation-freedom.
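The ticket-taking scheme above can be sketched in the book's style. This is a sketch, not a faithful implementation: the `ThreadID` helper is a stand-in for the book's, and plain Java arrays lack per-element visibility guarantees (a real version would use `AtomicIntegerArray`).

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal stand-in for the book's ThreadID helper (hypothetical).
class ThreadID {
    private static final AtomicInteger next = new AtomicInteger();
    private static final ThreadLocal<Integer> id =
        ThreadLocal.withInitial(next::getAndIncrement);
    static int get() { return id.get(); }
}

// Sketch of Lamport's Bakery lock for n threads.
class Bakery {
    private final boolean[] flag;  // flag[i]: thread i wants the lock
    private final int[] label;     // label[i]: thread i's ticket

    Bakery(int n) {
        flag = new boolean[n];
        label = new int[n];
    }

    void lock() {
        int i = ThreadID.get();
        flag[i] = true;
        label[i] = max(label) + 1;  // take the next ticket
        for (int k = 0; k < flag.length; k++) {
            // wait while thread k holds a smaller (ticket, id) pair
            while (k != i && flag[k]
                   && (label[k] < label[i]
                       || (label[k] == label[i] && k < i))) {}
        }
    }

    void unlock() {
        flag[ThreadID.get()] = false;
    }

    private static int max(int[] a) {
        int m = 0;
        for (int x : a) m = Math.max(m, x);
        return m;
    }
}
```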

Theoretical result: any deadlock-free mutual exclusion algorithm for N threads requires at least N distinct memory locations.


Concurrent Queue and Stack Designs

Lock-Based Queue (Two-Lock)

Use separate head and tail locks — allows concurrent enqueue and dequeue:

void enq(T x) {
    tailLock.lock();
    try { /* append to tail */ }
    finally { tailLock.unlock(); }
}

T deq() {
    headLock.lock();
    try { /* remove from head */ }
    finally { headLock.unlock(); }
}

Lock-Free Stack (Treiber’s Stack)

void push(T x) {
    Node node = new Node(x);
    do {
        node.next = top.get();
    } while (!top.compareAndSet(node.next, node));
}

T pop() {
    Node node, next;
    do {
        node = top.get();
        if (node == null) throw new EmptyException();
        next = node.next;
    } while (!top.compareAndSet(node, next));
    return node.value;
}

ABA problem: between reading top and CAS, another thread could pop + push the same node. Use stamped references: AtomicStampedReference pairs value with version counter.
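The stamped-reference fix can be sketched as follows (class names are mine; this variant returns null on empty where the book throws). Every successful CAS bumps the stamp, so a node that was popped and re-pushed no longer matches the old (reference, stamp) pair.

```java
import java.util.concurrent.atomic.AtomicStampedReference;

// Treiber stack with an ABA-safe top pointer.
class StampedStack<T> {
    private static class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
    }

    // Pairs the top reference with a version counter (the stamp).
    private final AtomicStampedReference<Node<T>> top =
        new AtomicStampedReference<>(null, 0);

    public void push(T x) {
        Node<T> node = new Node<>(x);
        int[] stamp = new int[1];
        while (true) {
            Node<T> cur = top.get(stamp);   // read reference + stamp
            node.next = cur;
            if (top.compareAndSet(cur, node, stamp[0], stamp[0] + 1))
                return;
        }
    }

    public T pop() {
        int[] stamp = new int[1];
        while (true) {
            Node<T> cur = top.get(stamp);
            if (cur == null) return null;   // empty
            if (top.compareAndSet(cur, cur.next, stamp[0], stamp[0] + 1))
                return cur.value;
        }
    }
}
```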

Michael-Scott Queue (Lock-Free)

Two-pointer (head/tail) lock-free queue. Key trick: tail may lag one step behind, so each operation “helps” complete pending tail updates.
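The helping step shows up in the enqueue retry loop. A simplified sketch (names are mine; the dequeue omits the tail-consistency check of the full algorithm, which is acceptable under Java's garbage collection):

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the Michael-Scott lock-free queue, highlighting helping.
class MSQueue<T> {
    private static class Node<T> {
        final T value;
        final AtomicReference<Node<T>> next = new AtomicReference<>();
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head, tail;

    MSQueue() {
        Node<T> sentinel = new Node<>(null);
        head = new AtomicReference<>(sentinel);
        tail = new AtomicReference<>(sentinel);
    }

    void enq(T x) {
        Node<T> node = new Node<>(x);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (next == null) {
                // Tail really is last: try to link the new node...
                if (last.next.compareAndSet(null, node)) {
                    // ...then swing tail; failure is fine — it means
                    // another thread already helped.
                    tail.compareAndSet(last, node);
                    return;
                }
            } else {
                // Tail lags one step: help finish the pending enqueue
                // before retrying. This is the helping mechanism.
                tail.compareAndSet(last, next);
            }
        }
    }

    T deq() {
        while (true) {
            Node<T> first = head.get();
            Node<T> next = first.next.get();
            if (next == null) return null;  // empty
            if (head.compareAndSet(first, next)) return next.value;
        }
    }
}
```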


Hardware Primitives

Compare-And-Swap (CAS)

CAS(addr, expected, new) → bool:
    atomically:
        if *addr == expected:
            *addr = new
            return true
        return false

Available essentially everywhere: x86 (CMPXCHG), ARM (via LL/SC pairs, or a native CAS instruction from ARMv8.1), and in Java through the java.util.concurrent.atomic classes.

Test-And-Set

TAS(addr) → old_value:
    atomically:
        old = *addr; *addr = true; return old

Primitive for basic spinlocks. Generates cache traffic under contention.

Fetch-And-Add

FAA(addr, delta) → old_value:
    atomically: old = *addr; *addr += delta; return old

Used for counters without full CAS overhead.

Consensus Number

The consensus number of a primitive determines how many threads it can coordinate:

  • Read/Write registers: consensus number 1
  • Test-And-Set: consensus number 2
  • Compare-And-Swap: consensus number ∞

Herlihy’s Universality: any object with consensus number ∞ can implement any concurrent object in a wait-free manner.
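Why CAS solves consensus for any number of threads can be shown in a few lines (a sketch; the class is mine). The first successful `compareAndSet` decides, and every later caller reads the same decision — a wait-free protocol, since no thread ever retries.

```java
import java.util.concurrent.atomic.AtomicReference;

// Wait-free n-thread consensus from CAS.
// (Uses null as "undecided", so null proposals are not allowed.)
class CASConsensus<T> {
    private final AtomicReference<T> decision = new AtomicReference<>(null);

    public T decide(T proposal) {
        decision.compareAndSet(null, proposal);  // first proposer wins
        return decision.get();                   // everyone agrees
    }
}
```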


Spin Lock Design

Basic Spinlock Issues

TAS locks: every thread spinning reads AND writes the lock → generates cache traffic.

Test-And-Test-And-Set (TATAS)

Spin on read (cache-friendly), only TAS when it looks free:

while (true) {
    while (lock.get());  // spin reading (in cache)
    if (!lock.getAndSet(true)) return; // try to acquire
}

CLH Queue Lock

Each thread enqueues a node and spins on its predecessor’s node; when the predecessor releases, the thread proceeds. A release invalidates only the successor’s cached copy, and total space is O(L + N) for L locks and N threads — versus O(LN) for array-based queue locks.
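A sketch of the CLH lock in Java (following the book's design; node recycling in `unlock` lets each thread reuse its predecessor's node the next time it locks):

```java
import java.util.concurrent.atomic.AtomicReference;

// CLH queue lock: spin on the predecessor's "locked" flag.
class CLHLock {
    static class QNode { volatile boolean locked; }

    private final AtomicReference<QNode> tail =
        new AtomicReference<>(new QNode());
    private final ThreadLocal<QNode> myNode =
        ThreadLocal.withInitial(QNode::new);
    private final ThreadLocal<QNode> myPred = new ThreadLocal<>();

    public void lock() {
        QNode node = myNode.get();
        node.locked = true;                 // announce intent
        QNode pred = tail.getAndSet(node);  // enqueue atomically
        myPred.set(pred);
        while (pred.locked) {}              // spin on predecessor only
    }

    public void unlock() {
        QNode node = myNode.get();
        node.locked = false;                // release my successor
        myNode.set(myPred.get());           // recycle predecessor's node
    }
}
```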

MCS Queue Lock

Like CLH but with an explicit next pointer in each node: a thread spins on a flag in its own node (local memory) rather than its predecessor’s, which makes MCS better suited to NUMA (Non-Uniform Memory Access) architectures.


Lock-Free Data Structures: Key Patterns

Helping

When a lock-free operation finds a partially-completed operation, it helps complete the other operation before proceeding. Ensures lock-freedom (some thread always makes progress).

Backoff

Under high contention, sleep for a randomized delay drawn from a window that doubles on each retry, up to a cap:

class Backoff {
    final int maxDelay;
    int limit;  // current delay window; doubles up to maxDelay

    Backoff(int min, int max) { limit = min; maxDelay = max; }

    void backoff() throws InterruptedException {
        int delay = ThreadLocalRandom.current().nextInt(limit);
        limit = Math.min(2 * limit, maxDelay);
        Thread.sleep(delay);
    }
}

Memory Management in Lock-Free Code

Problem: thread reads a pointer, gets preempted, pointer gets freed + reallocated.

Solutions:

  1. Hazard pointers: before accessing node, publish pointer in hazard list; GC defers freeing hazardous pointers
  2. RCU (Read-Copy-Update): readers never blocked; updaters wait for all current readers to finish (grace period)
  3. Epoch-based reclamation: divide time into epochs; free objects only when no thread is in the epoch when object was retired

Transactional Memory

Software Transactional Memory (STM)

atomic {
    x = x + 1;
    y = y - 1;
}
  • Runtime detects conflicts between transactions (read/write sets overlap)
  • On conflict: one transaction aborts and retries
  • Advantages: programmers don’t manage locks
  • Disadvantages: runtime overhead; hardware transactional memory support remains limited in practice

Conditions for Good STM Performance

  • Low contention between transactions
  • Short transactions
  • Read-heavy workloads
  • Avoid I/O inside transactions (can’t abort I/O)

Memory Model Foundations

Happens-Before (Java Memory Model)

  • If thread A performs action x before action y in program order, then x happens-before y
  • A write to a volatile field happens-before every subsequent read of that same field
  • A lock release happens-before all subsequent lock acquisitions on the same lock

Consequence: without synchronization, a thread may not see writes from other threads.
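The volatile rule above is what makes the flag-based publication idiom work. A sketch (class and field names are mine): because the write to `ready` happens-before a read that observes it as true, the reader is then guaranteed to see `data = 42` too, even though `data` itself is not volatile.

```java
// Happens-before via a volatile publication flag.
class Publisher {
    int data;                // plain field
    volatile boolean ready;  // publication flag

    void writer() {
        data = 42;           // (1) ordinary write
        ready = true;        // (2) volatile write publishes (1)
    }

    Integer reader() {
        if (ready) {         // volatile read
            return data;     // guaranteed to see 42
        }
        return null;         // not yet published
    }
}
```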

Safe Publication

To safely publish a reference to a new object, use any one of the following:

  1. Initialize it via a static initializer
  2. Store in a volatile field
  3. Store in an AtomicReference
  4. Store in a final field (after constructor completes)
  5. Store in a field properly guarded by a lock