C++ Concurrency in Action
Complete guide to C++11 multithreading — from basic thread management and mutex patterns through lock-free data structures and the memory model. Williams is the primary author of the Boost Thread Library and C++11 concurrency proposals.
- Manage std::thread lifecycle with RAII guards — join, detach, transfer ownership
- Apply mutex hierarchy and std::lock() to prevent deadlock
- Build thread-safe queues and stacks with condition variables and futures
- Select correct memory_order for atomic operations (relaxed/acquire/release/seq_cst)
- Design lock-free data structures with hazard pointers and reference counting
- Partition work using data parallelism, recursive decomposition, and task pipelines
- Apply Amdahl's Law to predict concurrency speedup limits
- Build thread pools with work stealing for reduced contention
- Identify and eliminate false sharing, cache ping-pong, and oversubscription
- Write exception-safe parallel algorithms and test for data races
Install this skill and Claude can audit multithreaded C++ code for data races, deadlocks, and incorrect memory orderings, then design thread-safe data structures using the correct synchronization primitives
Writing correct concurrent C++ is one of the hardest tasks in systems programming — data races cause undefined behavior that compilers won't catch, and this skill provides the memory model and synchronization patterns needed to get it right
- Diagnosing a deadlock where two functions acquire two mutexes in opposite order and rewriting them using std::lock()
- Selecting the correct memory_order for atomic operations on a reference-counted object, choosing release/acquire over relaxed
- Designing a bounded thread-safe producer-consumer queue using condition_variable with predicate waits and spurious-wakeup protection
C++ Concurrency in Action Skill
Core Philosophy
Concurrency in C++ has two legitimate motivations: separation of concerns (keeping UI responsive while background work runs) and performance (using available hardware cores). Everything else is one of these in disguise. Never add concurrency because you can — the complexity cost is real.
The fundamental tradeoff: multiple threads share an address space (cheap communication, complex correctness) vs. multiple processes (expensive communication, strong isolation). C++11 standardized multithreading; everything before was platform-specific.
Thread Management
Launching Threads
std::thread t(callable); // launches immediately
t.join(); // wait for completion
t.detach(); // fire and forget (daemon)
RAII guard pattern — always join or detach before destructor:
class thread_guard {
    std::thread& t;
public:
    explicit thread_guard(std::thread& t_) : t(t_) {}
    thread_guard(thread_guard const&) = delete;            // copying would double-join
    thread_guard& operator=(thread_guard const&) = delete;
    ~thread_guard() { if(t.joinable()) t.join(); }
};
Decision: join vs detach
- join: when you need the result or must ensure cleanup before moving on
- detach: long-running background tasks (logging, monitoring) with no shared state
- Never: let a std::thread destruct while joinable → std::terminate()
Passing Arguments
std::thread t(func, arg1, arg2); // copies args by default
std::thread t(func, std::ref(x)); // use std::ref for references
std::thread t(func, std::move(p)); // move-only types
Pitfall: passing a pointer to a local variable then detaching — the thread outlives the stack frame.
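A minimal sketch of the safe pattern (`process`, `received`, and `safe_launch` are illustrative names, not from the book): force the conversion to std::string before the std::thread constructor returns, so the new thread never touches the caller's stack buffer.

```cpp
#include <string>
#include <thread>

std::string received;

void process(const std::string& s) { received = s; }

// Buggy version (for illustration only):
//   char buffer[64] = "data";
//   std::thread t(process, buffer);  // conversion to std::string happens in
//   t.detach();                      // the new thread -- buffer may be gone
//
// Safe version: the copy is made before launch returns.
void safe_launch() {
    char buffer[64] = "data";
    std::thread t(process, std::string(buffer)); // convert up front
    t.join();
}
```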
Thread Count at Runtime
unsigned hw_threads = std::thread::hardware_concurrency();
// Returns 0 if unknown — default to 2
Protecting Shared Data
Mutex Hierarchy
std::mutex // basic exclusive lock
std::recursive_mutex // same thread can lock multiple times
std::timed_mutex // try_lock_for / try_lock_until
std::shared_mutex (C++17) // reader-writer lock
Lock Wrappers
std::lock_guard<std::mutex> lk(m); // RAII, no manual unlock
std::unique_lock<std::mutex> lk(m); // movable, can unlock early
std::shared_lock<std::shared_mutex> lk(m); // reader lock
Deadlock Prevention Rules (in priority order)
- Always lock in the same order — document and enforce globally
- Use std::lock() to acquire multiple locks atomically:
  std::lock(m1, m2);
  std::lock_guard<std::mutex> lk1(m1, std::adopt_lock);
  std::lock_guard<std::mutex> lk2(m2, std::adopt_lock);
- Avoid calling user code while holding a lock — callbacks may try to acquire the same lock
- Use lock hierarchies — assign numeric levels, always lock high-to-low
- Avoid nested locks — if you need two locks simultaneously, use rule 2
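Rule 4 can be enforced mechanically. The sketch below is a pared-down version of the book's hierarchical_mutex idea, assuming one thread_local level per thread: locking a mutex whose level is not strictly lower than the one already held throws.

```cpp
#include <climits>
#include <mutex>
#include <stdexcept>

class hierarchical_mutex {
    std::mutex internal;
    unsigned const level;
    unsigned previous_level = 0;
    static thread_local unsigned this_thread_level; // level currently held

    void check() const {
        if (this_thread_level <= level)             // must lock high-to-low
            throw std::logic_error("mutex hierarchy violated");
    }
public:
    explicit hierarchical_mutex(unsigned lvl) : level(lvl) {}
    void lock() {
        check();
        internal.lock();
        previous_level = this_thread_level;         // remember for unlock
        this_thread_level = level;
    }
    void unlock() {
        this_thread_level = previous_level;         // restore on LIFO unlock
        internal.unlock();
    }
};
thread_local unsigned hierarchical_mutex::this_thread_level = UINT_MAX;
```

Because it exposes lock()/unlock(), it works with std::lock_guard like any mutex.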
Race Conditions in Interfaces
Even with a mutex-protected container, this is still a race:
if(!stack.empty()) { // check
stack.top(); // use ← another thread may have popped between check and use
}
Solution: combine the check-and-use into a single atomic operation returning by value.
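A sketch of that combined operation, in the spirit of the book's threadsafe_stack (here an empty stack is signaled by a null shared_ptr rather than the book's exception): pop() holds the lock across both the emptiness check and the removal.

```cpp
#include <memory>
#include <mutex>
#include <stack>

template <typename T>
class threadsafe_stack {
    std::stack<T> data;
    mutable std::mutex m;
public:
    void push(T value) {
        std::lock_guard<std::mutex> lk(m);
        data.push(std::move(value));
    }
    // Check-and-use fused under one lock; no separate racy empty() call.
    std::shared_ptr<T> pop() {
        std::lock_guard<std::mutex> lk(m);
        if (data.empty()) return nullptr;
        auto res = std::make_shared<T>(std::move(data.top()));
        data.pop();
        return res;
    }
};
```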
Initialization Protection
std::once_flag flag;
std::call_once(flag, [](){ /* init once */ }); // thread-safe singleton init
Avoid double-checked locking — it’s broken without proper memory ordering.
Synchronizing Operations
Condition Variables
std::mutex m;
std::condition_variable cv;
std::queue<T> data;
// Producer
{
std::lock_guard<std::mutex> lk(m);
data.push(item);
}
cv.notify_one();
// Consumer
std::unique_lock<std::mutex> lk(m);
cv.wait(lk, []{ return !data.empty(); }); // spurious-wakeup safe
auto val = data.front(); data.pop();
Always use the predicate form of wait() — spurious wakeups are real.
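The two fragments above can be folded into a small reusable class, a sketch along the lines of the book's threadsafe_queue:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class threadsafe_queue {
    std::queue<T> data;
    std::mutex m;
    std::condition_variable cv;
public:
    void push(T value) {
        { std::lock_guard<std::mutex> lk(m); data.push(std::move(value)); }
        cv.notify_one();   // notify after unlocking so the waiter can run at once
    }
    T wait_and_pop() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !data.empty(); }); // predicate guards wakeups
        T res = std::move(data.front());
        data.pop();
        return res;
    }
};
```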
Futures and Promises
// async task
std::future<int> f = std::async(std::launch::async, compute);
int result = f.get(); // blocks until ready
// explicit promise/future pair
std::promise<int> p;
std::future<int> f = p.get_future();
// in another thread:
p.set_value(42);
// or on exception:
p.set_exception(std::current_exception());
std::packaged_task — Wrapping Callables
std::packaged_task<int(int,int)> task(add);
std::future<int> f = task.get_future();
std::thread t(std::move(task), 3, 4);
int result = f.get(); // 7
std::shared_future — Multiple Consumers
std::shared_future<int> sf = f.share(); // from std::future
// multiple threads can call sf.get() safely
Memory Model and Atomics
The Six Memory Orders
memory_order_relaxed // no ordering guarantees, only atomicity
memory_order_consume // dependency-ordered (rarely useful)
memory_order_acquire // load: see all prior releases
memory_order_release // store: visible to subsequent acquires
memory_order_acq_rel // read-modify-write: both
memory_order_seq_cst // total order across all threads (default, safest)
Decision Tree for Memory Order
- Prototyping/correctness first: always use seq_cst
- Hot path optimization: release/acquire pairs for producer-consumer
- Counters, statistics with no ordering needs: relaxed
- Never use consume: compiler support is unreliable
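The relaxed leaf of the tree in code (names are illustrative): a hit counter needs atomicity but no ordering, and the join() calls supply the synchronization that makes the final load safe.

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<unsigned long> hits{0};

void worker(int n) {
    for (int i = 0; i < n; ++i)
        hits.fetch_add(1, std::memory_order_relaxed); // count only, no ordering
}

unsigned long count_in_parallel(int threads, int per_thread) {
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i)
        pool.emplace_back(worker, per_thread);
    for (auto& t : pool) t.join();   // join() synchronizes-with each thread's end
    return hits.load(std::memory_order_relaxed);
}
```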
Synchronizes-With Relationship
A release store on atomic A synchronizes-with an acquire load that reads the stored value. Everything the storing thread did before the release is visible to the loading thread after the acquire.
std::atomic<bool> ready{false};
std::string data;
// Thread 1
data = "hello"; // (1)
ready.store(true, memory_order_release); // (2)
// Thread 2
while(!ready.load(memory_order_acquire)); // (3) — sees (2)
std::cout << data; // (4) — sees (1) via synchronizes-with
Fences
std::atomic_thread_fence(memory_order_release); // portable barrier
std::atomic_signal_fence(memory_order_acquire); // signal handlers only
ABA Problem
In lock-free CAS loops, a value can change A→B→A between your load and CAS. Solutions:
- Tagged pointers (store a version number alongside the pointer)
- Hazard pointers
- Reference-counted nodes
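A sketch of the version-counter fix, assuming 32-bit indices into a node pool rather than raw pointers: packing the index and a version into one 64-bit word means an A→B→A sequence no longer compares equal, because the version has advanced.

```cpp
#include <atomic>
#include <cstdint>

struct tagged {
    std::uint32_t index;    // the "pointer" (slot in a node pool)
    std::uint32_t version;  // bumped on every successful update
};

std::atomic<tagged> head{{0, 0}};

void update(std::uint32_t new_index) {
    tagged expected = head.load();
    tagged desired;
    do {
        // The version always advances, so a stale 'expected' fails the CAS
        // even if the index has cycled back to its old value.
        desired = {new_index, expected.version + 1};
    } while (!head.compare_exchange_weak(expected, desired));
}
```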
Lock-Free Data Structures
Lock-Free Stack (Reference-Counted)
Core pattern (note: std::atomic<std::shared_ptr<T>> is C++20; in C++11, use the std::atomic_load / std::atomic_compare_exchange_weak free-function overloads for shared_ptr instead):
template<typename T>
class lock_free_stack {
    struct node {
        T data;
        std::shared_ptr<node> next;
    };
    std::atomic<std::shared_ptr<node>> head; // C++20 atomic smart pointer
public:
    void push(T val) {
        auto new_node = std::make_shared<node>(std::move(val));
        new_node->next = head.load();
        // retry until head is unchanged between the load and the CAS
        while(!head.compare_exchange_weak(new_node->next, new_node));
    }
};
Guidelines for Lock-Free Code
- Start with seq_cst, optimize later — verify correctness first
- Use a lock-free memory reclamation scheme — hazard pointers or epoch-based reclamation
- Watch for ABA — use version counters or split-reference counting
- Identify busy-wait loops and help — if thread B is in the middle of an operation thread A needs, A should help B complete it, not spin
When to Use Lock-Free
- Lock-free: progress guarantee — at least one thread makes progress
- Wait-free: stronger — every thread makes progress in bounded steps
- Lock-based is usually fine; reach for lock-free only when profiling shows contention
Designing Concurrent Code
Work Division Strategies
- Data parallelism: partition the dataset before processing (parallel for_each)
- Recursive decomposition: divide-and-conquer (parallel merge sort, quicksort)
- Task type separation: pipeline stages, producer-consumer
Amdahl’s Law
If fraction p of the work can be parallelized:
Speedup = 1 / ((1-p) + p/N)
If 5% is serial, max speedup with infinite cores is 20×. The serial fraction dominates.
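The formula as a one-liner, handy for sanity-checking expected gains (p is the parallel fraction, N the core count):

```cpp
// Amdahl's Law: speedup = 1 / ((1-p) + p/N).
// As N grows, speedup approaches 1/(1-p): the serial fraction dominates.
double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

For p = 0.95, the limit at very large N is 20x, matching the text above.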
Performance Killers
- False sharing: two threads write to different variables in the same cache line (64 bytes on x86). Fix: alignas(64) or pad structs.
- Data contention: cache ping-pong from frequently-written atomics shared across cores
- Oversubscription: more threads than hardware — context switch overhead dominates
- Cache thrashing: large working sets accessed by multiple threads
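The alignas fix for false sharing as a sketch (assuming a 64-byte cache line): each counter is forced onto its own line, so writes by one thread no longer invalidate the other's.

```cpp
#include <atomic>

// Without alignas(64), two adjacent counters would typically share one
// 64-byte cache line and every write would ping-pong the line between cores.
struct alignas(64) padded_counter {
    std::atomic<long> value{0};
};

padded_counter counters[2];   // each array element occupies its own cache line
```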
Exception Safety in Parallel Code
// parallel_for_each pattern — collect exceptions
std::vector<std::exception_ptr> exceptions;
// in each thread: catch and store via std::current_exception()
// after join: rethrow first exception if any
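That comment-sketch fleshed out (run_in_parallel and the std::function job type are illustrative, not the book's exact interface): each worker traps its own exception, and after all joins the first failure is rethrown on the calling thread.

```cpp
#include <cstddef>
#include <exception>
#include <functional>
#include <stdexcept>
#include <thread>
#include <vector>

void run_in_parallel(std::vector<std::function<void()>> const& jobs) {
    std::vector<std::exception_ptr> errors(jobs.size());
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < jobs.size(); ++i)
        threads.emplace_back([&, i] {
            try { jobs[i](); }
            catch (...) { errors[i] = std::current_exception(); } // trap per thread
        });
    for (auto& t : threads) t.join();     // join everyone before reading errors
    for (auto& e : errors)
        if (e) std::rethrow_exception(e); // first stored failure wins
}
```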
Thread Pools
Minimal Thread Pool
class thread_pool {
    std::atomic<bool> done{false};
    thread_safe_queue<std::function<void()>> work_queue;
    std::vector<std::thread> threads;

    void worker_thread() {
        while(!done) {
            std::function<void()> task;
            if(work_queue.try_pop(task)) task();
            else std::this_thread::yield(); // busy poll; a blocking pop wastes less CPU
        }
    }
public:
    thread_pool() {
        unsigned n = std::thread::hardware_concurrency();
        // production code should wrap this loop in try/catch, set done,
        // and join already-started threads if a launch throws
        for(unsigned i = 0; i < n; ++i)
            threads.emplace_back(&thread_pool::worker_thread, this);
    }
    ~thread_pool() { done = true; for(auto& t : threads) t.join(); }

    template<typename F>
    std::future<std::result_of_t<F()>> submit(F f) {
        // std::result_of_t is deprecated in C++17; use std::invoke_result_t there
        auto task = std::make_shared<std::packaged_task<std::result_of_t<F()>()>>(std::move(f));
        work_queue.push([task]{ (*task)(); });
        return task->get_future();
    }
};
Work Stealing
Each thread has its own deque. Threads pop from the front; stealing threads pop from the back. Reduces contention vs. a single shared queue.
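A minimal sketch of the per-thread queue, lock-based for clarity (production implementations typically use a lock-free Chase-Lev deque): the owner pushes and pops at the front for cache-warm LIFO order, while thieves take the oldest task from the back.

```cpp
#include <deque>
#include <mutex>

template <typename Task>
class work_stealing_queue {
    std::deque<Task> q;
    mutable std::mutex m;
public:
    void push(Task t) {                  // owner only
        std::lock_guard<std::mutex> lk(m);
        q.push_front(std::move(t));
    }
    bool try_pop(Task& t) {              // owner: newest task, still cache-warm
        std::lock_guard<std::mutex> lk(m);
        if (q.empty()) return false;
        t = std::move(q.front()); q.pop_front();
        return true;
    }
    bool try_steal(Task& t) {            // thief: oldest task, opposite end
        std::lock_guard<std::mutex> lk(m);
        if (q.empty()) return false;
        t = std::move(q.back()); q.pop_back();
        return true;
    }
};
```

Owner and thieves work opposite ends, so they rarely contend on the same tasks even though this version still shares one mutex.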
Testing Multithreaded Code
Bug Categories
- Data races: unsynchronized concurrent access with at least one write
- Deadlock: circular wait
- Livelock: threads keep responding to each other without progress
- Starvation: low-priority thread never gets scheduled
- Spurious failure: timing-dependent bugs that pass most runs
Testing Strategies
- Review synchronization points — every shared variable access must be protected
- Stress testing — run tests millions of times, vary thread counts
- Vary processor counts — bugs that hide on a single core can surface on multiple cores, and vice versa
- Helgrind / ThreadSanitizer — dynamic race detectors (instrument, don’t rely on)
- Design for testability: minimize shared state, prefer message-passing, make synchronization explicit
Code Review Checklist
- Every shared variable protected by a mutex or is atomic
- Lock order consistent and documented
- No user code called while holding a lock
- All threads joined or detached before destructor
- Condition variable waits use predicate form
- Memory orders match the synchronization intent