C++ Concurrency in Action
Complete guide to C++11 multithreading — from basic thread management and mutex patterns through lock-free data structures and the memory model. Williams is the primary author of the Boost Thread Library and C++11 concurrency proposals.
- Manage std::thread lifecycle with RAII guards — join, detach, transfer ownership
- Apply mutex hierarchy and std::lock() to prevent deadlock
- Build thread-safe queues and stacks with condition variables and futures
- Select correct memory_order for atomic operations (relaxed/acquire/release/seq_cst)
- Design lock-free data structures with hazard pointers and reference counting
- Partition work using data parallelism, recursive decomposition, and task pipelines
- Apply Amdahl's Law to predict concurrency speedup limits
- Build thread pools with work stealing for reduced contention
- Identify and eliminate false sharing, cache ping-pong, and oversubscription
- Write exception-safe parallel algorithms and test for data races
Install this skill and Claude can audit multithreaded C++ code for data races, deadlocks, and incorrect memory orderings, then design thread-safe data structures using the correct synchronization primitives
Writing correct concurrent C++ is one of the hardest tasks in systems programming — data races cause undefined behavior that compilers won't catch, and this skill provides the memory model and synchronization patterns needed to get it right
- Diagnosing a deadlock where two functions acquire two mutexes in opposite order and rewriting them using std::lock()
- Selecting the correct memory_order for atomic operations on a reference-counted object, choosing release/acquire over relaxed
- Designing a bounded thread-safe producer-consumer queue using condition_variable with predicate waits and spurious-wakeup protection
C++ Concurrency in Action Skill
Core Philosophy
Concurrency in C++ has two legitimate motivations: separation of concerns (keeping UI responsive while background work runs) and performance (using available hardware cores). Everything else is one of these in disguise. Never add concurrency because you can — the complexity cost is real.
The fundamental tradeoff: multiple threads share an address space (cheap communication, complex correctness) vs. multiple processes (expensive communication, strong isolation). C++11 standardized multithreading; everything before was platform-specific.
Thread Management
Launching Threads
std::thread t(callable); // launches immediately
t.join(); // wait for completion
t.detach(); // fire and forget (daemon)
RAII guard pattern — always join or detach before destructor:
class thread_guard {
    std::thread& t;
public:
    explicit thread_guard(std::thread& t_) : t(t_) {}
    thread_guard(thread_guard const&) = delete;            // copying would double-join
    thread_guard& operator=(thread_guard const&) = delete;
    ~thread_guard() { if(t.joinable()) t.join(); }
};
Decision: join vs detach
- join: when you need the result or must ensure cleanup before moving on
- detach: long-running background tasks (logging, monitoring) with no shared state
- Never: let a std::thread destruct while joinable → std::terminate()
Passing Arguments
std::thread t(func, arg1, arg2); // copies args by default
std::thread t(func, std::ref(x)); // use std::ref for references
std::thread t(func, std::move(p)); // move-only types
Pitfall: passing a pointer to a local variable then detaching — the thread outlives the stack frame.
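A minimal sketch of the safe pattern (`process`, `received`, and `safe_launch` are illustrative names, not from the book): force the conversion to std::string before the std::thread constructor returns, so the new thread never touches the caller's stack buffer.

```cpp
#include <string>
#include <thread>

std::string received;

void process(const std::string& s) { received = s; }

// Buggy version (for illustration only):
//   char buffer[64] = "data";
//   std::thread t(process, buffer);  // conversion to std::string happens in
//   t.detach();                      // the new thread -- buffer may be gone
//
// Safe version: the copy is made before launch returns.
void safe_launch() {
    char buffer[64] = "data";
    std::thread t(process, std::string(buffer)); // convert up front
    t.join();
}
```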
Thread Count at Runtime
unsigned hw_threads = std::thread::hardware_concurrency();
// Returns 0 if unknown — default to 2
Protecting Shared Data
Mutex Hierarchy
std::mutex // basic exclusive lock
std::recursive_mutex // same thread can lock multiple times
std::timed_mutex // try_lock_for / try_lock_until
std::shared_mutex (C++17) // reader-writer lock
Lock Wrappers
std::lock_guard<std::mutex> lk(m); // RAII, no manual unlock
std::unique_lock<std::mutex> lk(m); // movable, can unlock early
std::shared_lock<std::shared_mutex> lk(m); // reader lock
Deadlock Prevention Rules (in priority order)
- Always lock in the same order — document and enforce globally
- Use std::lock() to acquire multiple locks atomically:
  std::lock(m1, m2);
  std::lock_guard<std::mutex> lk1(m1, std::adopt_lock);
  std::lock_guard<std::mutex> lk2(m2, std::adopt_lock);
- Avoid calling user code while holding a lock — callbacks may try to acquire the same lock
- Use lock hierarchies — assign numeric levels, always lock high-to-low
- Avoid nested locks — if you need two locks simultaneously, use rule 2
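Rule 4 can be enforced mechanically. The sketch below is a pared-down version of the book's hierarchical_mutex idea, assuming one thread_local level per thread: locking a mutex whose level is not strictly lower than the one already held throws.

```cpp
#include <climits>
#include <mutex>
#include <stdexcept>

class hierarchical_mutex {
    std::mutex internal;
    unsigned const level;
    unsigned previous_level = 0;
    static thread_local unsigned this_thread_level; // level currently held

    void check() const {
        if (this_thread_level <= level)             // must lock high-to-low
            throw std::logic_error("mutex hierarchy violated");
    }
public:
    explicit hierarchical_mutex(unsigned lvl) : level(lvl) {}
    void lock() {
        check();
        internal.lock();
        previous_level = this_thread_level;         // remember for unlock
        this_thread_level = level;
    }
    void unlock() {
        this_thread_level = previous_level;         // restore on LIFO unlock
        internal.unlock();
    }
};
thread_local unsigned hierarchical_mutex::this_thread_level = UINT_MAX;
```

Because it exposes lock()/unlock(), it works with std::lock_guard like any mutex.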
Race Conditions in Interfaces
Even with a mutex-protected container, this is still a race:
if(!stack.empty()) { // check
stack.top(); // use ← another thread may have popped between check and use
}
Solution: combine the check-and-use into a single atomic operation returning by value.
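A sketch of that combined operation, in the spirit of the book's threadsafe_stack (here an empty stack is signaled by a null shared_ptr rather than the book's exception): pop() holds the lock across both the emptiness check and the removal.

```cpp
#include <memory>
#include <mutex>
#include <stack>

template <typename T>
class threadsafe_stack {
    std::stack<T> data;
    mutable std::mutex m;
public:
    void push(T value) {
        std::lock_guard<std::mutex> lk(m);
        data.push(std::move(value));
    }
    // Check-and-use fused under one lock; no separate racy empty() call.
    std::shared_ptr<T> pop() {
        std::lock_guard<std::mutex> lk(m);
        if (data.empty()) return nullptr;
        auto res = std::make_shared<T>(std::move(data.top()));
        data.pop();
        return res;
    }
};
```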
Initialization Protection
std::once_flag flag;
std::call_once(flag, [](){ /* init once */ }); // thread-safe singleton init
Avoid double-checked locking — it’s broken without proper memory ordering.
Synchronizing Operations
Condition Variables
std::mutex m;
std::condition_variable cv;
std::queue<T> data;
// Producer
{
std::lock_guard<std::mutex> lk(m);
data.push(item);
}
cv.notify_one();
// Consumer
std::unique_lock<std::mutex> lk(m);
cv.wait(lk, []{ return !data.empty(); }); // spurious-wakeup safe
auto val = data.front(); data.pop();
Always use the predicate form of wait() — spurious wakeups are real.
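The two fragments above can be folded into a small reusable class, a sketch along the lines of the book's threadsafe_queue:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class threadsafe_queue {
    std::queue<T> data;
    std::mutex m;
    std::condition_variable cv;
public:
    void push(T value) {
        { std::lock_guard<std::mutex> lk(m); data.push(std::move(value)); }
        cv.notify_one();   // notify after unlocking so the waiter can run at once
    }
    T wait_and_pop() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !data.empty(); }); // predicate guards wakeups
        T res = std::move(data.front());
        data.pop();
        return res;
    }
};
```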
Futures and Promises
// async task
std::future<int> f = std::async(std::launch::async, compute);
int result = f.get(); // blocks until ready
// explicit promise/future pair
std::promise<int> p;
std::future<int> f = p.get_future();
// in another thread:
p.set_value(42);
// or on exception:
p.set_exception(std::current_exception());
std::packaged_task — Wrapping Callables
std::packaged_task<int(int,int)> task(add);
std::future<int> f = task.get_future();
std::thread t(std::move(task), 3, 4);
int result = f.get(); // 7
std::shared_future — Multiple Consumers
std::shared_future<int> sf = f.share(); // from std::future
// multiple threads can call sf.get() safely
Memory Model and Atomics
The Six Memory Orders
memory_order_relaxed // no ordering guarantees, only atomicity
memory_order_consume // dependency-ordered (rarely useful)
memory_order_acquire // load: see all prior releases
memory_order_release // store: visible to subsequent acquires
memory_order_acq_rel // read-modify-write: both
memory_order_seq_cst // total order across all threads (default, safest)
Decision Tree for Memory Order
- Prototyping/correctness first: always use seq_cst
- Hot path optimization: release/acquire pairs for producer-consumer
- Counters, statistics with no ordering needs: relaxed
- Never use consume: compiler support is unreliable
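The relaxed leaf of the tree in code (names are illustrative): a hit counter needs atomicity but no ordering, and the join() calls supply the synchronization that makes the final load safe.

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<unsigned long> hits{0};

void worker(int n) {
    for (int i = 0; i < n; ++i)
        hits.fetch_add(1, std::memory_order_relaxed); // count only, no ordering
}

unsigned long count_in_parallel(int threads, int per_thread) {
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i)
        pool.emplace_back(worker, per_thread);
    for (auto& t : pool) t.join();   // join() synchronizes-with each thread's end
    return hits.load(std::memory_order_relaxed);
}
```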
Synchronizes-With Relationship
A release store on atomic A synchronizes-with an acquire load that reads the stored value. Everything the storing thread did before the release is visible to the loading thread after the acquire.
std::atomic<bool> ready{false};
std::string data;
// Thread 1
data = "hello"; // (1)
ready.store(true, memory_order_release); // (2)
// Thread 2
while(!ready.load(memory_order_acquire)); // (3) — sees (2)
std::cout << data; // (4) — sees (1) via synchronizes-with
Fences
std::atomic_thread_fence(memory_order_release); // portable barrier
std::atomic_signal_fence(memory_order_acquire); // signal handlers only
ABA Problem
In lock-free CAS loops, a value can change A→B→A between your load and CAS. Solutions:
- Tagged pointers (store a version number alongside the pointer)
- Hazard pointers
- Reference-counted nodes
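A sketch of the version-counter fix, assuming 32-bit indices into a node pool rather than raw pointers: packing the index and a version into one 64-bit word means an A→B→A sequence no longer compares equal, because the version has advanced.

```cpp
#include <atomic>
#include <cstdint>

struct tagged {
    std::uint32_t index;    // the "pointer" (slot in a node pool)
    std::uint32_t version;  // bumped on every successful update
};

std::atomic<tagged> head{{0, 0}};

void update(std::uint32_t new_index) {
    tagged expected = head.load();
    tagged desired;
    do {
        // The version always advances, so a stale 'expected' fails the CAS
        // even if the index has cycled back to its old value.
        desired = {new_index, expected.version + 1};
    } while (!head.compare_exchange_weak(expected, desired));
}
```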
Lock-Free Data Structures
Lock-Free Stack (Reference-Counted)
Core pattern (note: std::atomic<std::shared_ptr<T>> is C++20; in C++11, use the std::atomic_load / std::atomic_compare_exchange_weak free-function overloads for shared_ptr instead):
template<typename T>
class lock_free_stack {
    struct node {
        T data;
        std::shared_ptr<node> next;
    };
    std::atomic<std::shared_ptr<node>> head; // C++20 atomic smart pointer
public:
    void push(T val) {
        auto new_node = std::make_shared<node>(std::move(val));
        new_node->next = head.load();
        // retry until head is unchanged between the load and the CAS
        while(!head.compare_exchange_weak(new_node->next, new_node));
    }
};
Guidelines for Lock-Free Code
- Start with seq_cst, optimize later — verify correctness first
- Use a lock-free memory reclamation scheme — hazard pointers or epoch-based reclamation
- Watch for ABA — use version counters or split-reference counting
- Identify busy-wait loops and help — if thread B is in the middle of an operation thread A needs, A should help B complete it, not spin
When to Use Lock-Free
- Lock-free: progress guarantee — at least one thread makes progress
- Wait-free: stronger — every thread makes progress in bounded steps
- Lock-based is usually fine; reach for lock-free only when profiling shows contention
Designing Concurrent Code
Work Division Strategies
- Data parallelism: partition the dataset before processing (parallel for_each)
- Recursive decomposition: divide-and-conquer (parallel merge sort, quicksort)
- Task type separation: pipeline stages, producer-consumer
Amdahl’s Law
If fraction p of the work can be parallelized:
Speedup = 1 / ((1-p) + p/N)
If 5% is serial, max speedup with infinite cores is 20×. The serial fraction dominates.
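The formula as a one-liner, handy for sanity-checking expected gains (p is the parallel fraction, N the core count):

```cpp
// Amdahl's Law: speedup = 1 / ((1-p) + p/N).
// As N grows, speedup approaches 1/(1-p): the serial fraction dominates.
double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

For p = 0.95, the limit at very large N is 20x, matching the text above.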
Performance Killers
- False sharing: two threads write to different variables in the same cache line (64 bytes on x86). Fix: alignas(64) or pad structs.
- Data contention: cache ping-pong from frequently-written atomics shared across cores
- Oversubscription: more threads than hardware — context switch overhead dominates
- Cache thrashing: large working sets accessed by multiple threads
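The alignas fix for false sharing as a sketch (assuming a 64-byte cache line): each counter is forced onto its own line, so writes by one thread no longer invalidate the other's.

```cpp
#include <atomic>

// Without alignas(64), two adjacent counters would typically share one
// 64-byte cache line and every write would ping-pong the line between cores.
struct alignas(64) padded_counter {
    std::atomic<long> value{0};
};

padded_counter counters[2];   // each array element occupies its own cache line
```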
Exception Safety in Parallel Code
// parallel_for_each pattern — collect exceptions
std::vector<std::exception_ptr> exceptions;
// in each thread: catch and store via std::current_exception()
// after join: rethrow first exception if any
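That comment-sketch fleshed out (run_in_parallel and the std::function job type are illustrative, not the book's exact interface): each worker traps its own exception, and after all joins the first failure is rethrown on the calling thread.

```cpp
#include <cstddef>
#include <exception>
#include <functional>
#include <stdexcept>
#include <thread>
#include <vector>

void run_in_parallel(std::vector<std::function<void()>> const& jobs) {
    std::vector<std::exception_ptr> errors(jobs.size());
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < jobs.size(); ++i)
        threads.emplace_back([&, i] {
            try { jobs[i](); }
            catch (...) { errors[i] = std::current_exception(); } // trap per thread
        });
    for (auto& t : threads) t.join();     // join everyone before reading errors
    for (auto& e : errors)
        if (e) std::rethrow_exception(e); // first stored failure wins
}
```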
Thread Pools
Minimal Thread Pool
class thread_pool {
    std::atomic<bool> done{false};
    thread_safe_queue<std::function<void()>> work_queue;
    std::vector<std::thread> threads;

    void worker_thread() {
        while(!done) {
            std::function<void()> task;
            if(work_queue.try_pop(task)) task();
            else std::this_thread::yield(); // busy poll; a blocking pop wastes less CPU
        }
    }
public:
    thread_pool() {
        unsigned n = std::thread::hardware_concurrency();
        // production code should wrap this loop in try/catch, set done,
        // and join already-started threads if a launch throws
        for(unsigned i = 0; i < n; ++i)
            threads.emplace_back(&thread_pool::worker_thread, this);
    }
    ~thread_pool() { done = true; for(auto& t : threads) t.join(); }

    template<typename F>
    std::future<std::result_of_t<F()>> submit(F f) {
        // std::result_of_t is deprecated in C++17; use std::invoke_result_t there
        auto task = std::make_shared<std::packaged_task<std::result_of_t<F()>()>>(std::move(f));
        work_queue.push([task]{ (*task)(); });
        return task->get_future();
    }
};
Work Stealing
Each thread has its own deque. Threads pop from the front; stealing threads pop from the back. Reduces contention vs. a single shared queue.
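A minimal sketch of the per-thread queue, lock-based for clarity (production implementations typically use a lock-free Chase-Lev deque): the owner pushes and pops at the front for cache-warm LIFO order, while thieves take the oldest task from the back.

```cpp
#include <deque>
#include <mutex>

template <typename Task>
class work_stealing_queue {
    std::deque<Task> q;
    mutable std::mutex m;
public:
    void push(Task t) {                  // owner only
        std::lock_guard<std::mutex> lk(m);
        q.push_front(std::move(t));
    }
    bool try_pop(Task& t) {              // owner: newest task, still cache-warm
        std::lock_guard<std::mutex> lk(m);
        if (q.empty()) return false;
        t = std::move(q.front()); q.pop_front();
        return true;
    }
    bool try_steal(Task& t) {            // thief: oldest task, opposite end
        std::lock_guard<std::mutex> lk(m);
        if (q.empty()) return false;
        t = std::move(q.back()); q.pop_back();
        return true;
    }
};
```

Owner and thieves work opposite ends, so they rarely contend on the same tasks even though this version still shares one mutex.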
Testing Multithreaded Code
Bug Categories
- Data races: unsynchronized concurrent access with at least one write
- Deadlock: circular wait
- Livelock: threads keep responding to each other without progress
- Starvation: low-priority thread never gets scheduled
- Spurious failure: timing-dependent bugs that pass most runs
Testing Strategies
- Review synchronization points — every shared variable access must be protected
- Stress testing — run tests millions of times, vary thread counts
- Vary processor counts — bugs that hide on a single core can surface on multiple cores, and vice versa
- Helgrind / ThreadSanitizer — dynamic race detectors (instrument, don’t rely on)
- Design for testability: minimize shared state, prefer message-passing, make synchronization explicit
Code Review Checklist
- Every shared variable protected by a mutex or is atomic
- Lock order consistent and documented
- No user code called while holding a lock
- All threads joined or detached before destructor
- Condition variable waits use predicate form
- Memory orders match the synchronization intent