Understanding the Linux Kernel

Understanding the Linux Kernel, 3rd Edition · Daniel P. Bovet & Marco Cesati · 950 pages

Comprehensive deep-dive into Linux 2.6 kernel internals — memory addressing (paging/segmentation), process management (task_struct, context switch, CoW), scheduling, synchronization primitives, memory management (buddy/slab), VFS, device drivers, page cache, IPC, and ELF execution.

Capabilities (10)
  • Explain x86 memory addressing: segmentation → paging → physical address translation
  • Trace process creation from fork() through copy-on-write to exec()
  • Explain kernel synchronization: spinlocks, semaphores, RCU, atomic ops — and when each applies
  • Describe Linux scheduling: priority levels, O(1) scheduler, SMP load balancing
  • Explain buddy system and slab allocator for kernel memory management
  • Trace page fault handling from exception to page allocation or file read
  • Explain system call entry path (int 0x80/syscall) and argument passing
  • Describe VFS object hierarchy: superblock, inode, dentry, file and their operation tables
  • Explain page cache writeback, dirty page management, and sync vs fsync
  • Describe ELF binary format and exec() flow including dynamic linking
How to use

Install this skill and Claude can trace any system call from userspace through the kernel entry point into the relevant subsystem, explain page fault and memory allocation behavior, reason through kernel synchronization deadlocks, describe the VFS call chain from read() to the filesystem, and analyze the full process lifecycle from fork() to exec().

Why it matters

Kernel internals knowledge is the foundation for writing correct device drivers, diagnosing hard-to-reproduce system bugs (OOM kills, deadlocks, page faults), and understanding the true cost of operations that appear cheap at the syscall boundary.

Example use cases
  • Identify why a kernel module locks up under concurrent access by tracing which locks are held, in what order, and whether any code path sleeps while holding a spinlock
  • Explain why a process is being killed by the OOM killer on a system with available swap by tracing buddy allocator state, zone watermarks, and GFP flags
  • Trace what happens in the kernel when mmap(MAP_ANONYMOUS) is called, including VMA creation, page fault deferral, and when physical pages are actually allocated

Understanding the Linux Kernel Skill

Memory Addressing (Chapter 2)

Three-Level Address Translation (x86)

Logical address → [Segmentation] → Linear address → [Paging] → Physical address

Segmentation in Linux: mostly disabled (flat memory model). CS, DS, SS all map to same base 0, limit 4GB. Only segment descriptor privilege levels (CPL/DPL) are actively used for kernel/user separation.

Paging (2-level on 32-bit):

Linear address: [PGD index | PTE index | Offset]
Page Global Directory (PGD) → Page Table Entry (PTE) → Physical frame
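With 4 KB pages and no PAE, the 32-bit split is 10 bits of PGD index, 10 bits of page-table index, and a 12-bit offset. A minimal userspace sketch of the decomposition (illustrative helper functions, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit x86, 4 KB pages, no PAE: 10-bit PGD index | 10-bit PTE index | 12-bit offset */
uint32_t pgd_index(uint32_t linear)   { return linear >> 22; }
uint32_t pte_index(uint32_t linear)   { return (linear >> 12) & 0x3ff; }
uint32_t page_offset(uint32_t linear) { return linear & 0xfff; }
```

For example, PAGE_OFFSET (0xC0000000) lands in PGD slot 768, which is why the kernel owns the top quarter of every process's page directory.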

TLB: Translation Lookaside Buffer caches recent translations. Context switch flushes TLB (or uses PCID on modern x86 to tag entries per process).

High memory (32-bit): memory above 896MB can’t be permanently mapped. Kernel accesses it via temporary kmap() mappings.


Processes (Chapter 3)

task_struct (Process Descriptor)

Key fields:

  • state: TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, TASK_STOPPED, TASK_TRACED (plus exit states EXIT_ZOMBIE, EXIT_DEAD)
  • pid / tgid: process ID / thread group ID
  • mm: pointer to mm_struct (address space), NULL for kernel threads
  • files: open file descriptors table
  • fs: filesystem context (cwd, root)
  • signal: signal handlers
  • policy / static_prio: scheduling policy (SCHED_NORMAL, SCHED_FIFO, SCHED_RR) and base priority

Process Creation: fork() / clone()

  1. do_fork() allocates new task_struct (from slab cache)
  2. Copy-on-write: both parent and child get read-only page mappings; page fault triggers actual copy
  3. clone() flags control what is shared: CLONE_VM, CLONE_FS, CLONE_FILES, CLONE_THREAD
  4. Threads = processes sharing same mm_struct, files, etc.
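The CoW semantics can be observed from userspace: after fork(), a write in the child faults and gets a private copy of the page, leaving the parent's value untouched. A small sketch (cow_demo is our own illustrative helper):

```c
#include <assert.h>
#include <sys/wait.h>
#include <unistd.h>

/* After fork(), parent and child share read-only page mappings; the child's
 * write below faults and gets a private copy, so the parent still sees 42. */
int cow_demo(void) {
    int value = 42;
    pid_t pid = fork();
    if (pid == 0) {              /* child: this write triggers the actual copy */
        value = 99;
        _exit(value == 99 ? 0 : 1);
    }
    int status = -1;
    waitpid(pid, &status, 0);
    return (value == 42 && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
}
```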

Process Switch (Context Switch)

  1. Save hardware context (registers) to thread_struct in task_struct
  2. Switch page table root (CR3 register on x86) — changes address space
  3. Load new process’s hardware context
  4. Executed via schedule() → context_switch() → switch_to()

Interrupts and Exceptions (Chapter 4)

Interrupt Descriptor Table (IDT)

256 entries. Entries 0-31: CPU exceptions (divide error, page fault, etc.). Entries 32-255: hardware IRQs and software interrupts (vector 0x80 is the legacy system-call gate).

Interrupt Handling Flow

  1. CPU saves registers, switches to kernel stack (if in user mode)
  2. Calls interrupt handler registered in IDT
  3. Handler runs with local IRQs disabled
  4. Returns via iret

Bottom Halves: Deferring Work

  • Softirqs: statically allocated at compile time, run with interrupts enabled, cannot sleep. Used by network subsystem, block layer.
  • Tasklets: built on softirqs, serialized per-tasklet, easier to use. For most drivers.
  • Work queues: deferred work in kernel thread context — can sleep. For slow driver operations.
// Declare a tasklet bound to its handler (runs in softirq context, must not sleep)
DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 0);

// Schedule it for later execution in softirq context
tasklet_schedule(&my_tasklet);

// Schedule work that may sleep (runs in a kernel thread)
queue_work(my_workqueue, &my_work);

Kernel Synchronization (Chapter 5)

Synchronization Primitives

Primitive              Sleeps?         Use case
---------------------  --------------  -------------------------------------------
Atomic ops (atomic_t)  No              Counters, flags
Spin lock              No (busy wait)  Short critical sections, interrupt handlers
Read-write spin lock   No              Reader-heavy data
Semaphore              Yes             Longer critical sections
Mutex                  Yes             Single-owner lock, simpler than a semaphore
RCU                    No (readers)    Read-mostly data, list traversal
Seqlock                No              Fast readers, rare writes

Spin Lock Rules

  • Never sleep while holding a spin lock — waiters on other CPUs spin forever, and sleeping with preemption disabled deadlocks
  • Use spin_lock_irqsave() for locks shared with interrupt handlers — it disables local interrupts and saves the previous IRQ state
  • Spin locks are not recursive — re-acquiring a lock already held on the same CPU deadlocks

RCU (Read-Copy-Update)

// Reader (no lock taken — just marks a read-side critical section):
rcu_read_lock();
p = rcu_dereference(head);
// use p safely
rcu_read_unlock();

// Writer (publish the new version, then reclaim the old one):
new_node = kmalloc(sizeof(*new_node), GFP_KERNEL);
old_node = head;                     // remember the old version before replacing
rcu_assign_pointer(head, new_node);
synchronize_rcu();  // wait for all pre-existing readers to finish
kfree(old_node);

Process Scheduling (Chapter 7)

Scheduling Classes (Linux 2.6)

  • SCHED_NORMAL: ordinary time-shared processes (handled by the O(1) scheduler in the 2.6 kernels covered here; CFS replaced it in 2.6.23)
  • SCHED_FIFO: real-time, run until preempted or blocked
  • SCHED_RR: real-time with round-robin time quantum

O(1) Scheduler (2.6 era)

  • 140 priority levels (0-99 RT, 100-139 nice-adjusted)
  • Two runqueues per CPU: active + expired
  • Bitmap of non-empty priority queues → O(1) next task selection
  • Interactive processes get dynamic priority boost; CPU-hogs get penalties
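The O(1) selection step can be sketched in plain C: one bit per priority level, and the lowest set bit (bit 0 = highest priority) names the queue to run next. The helper below mirrors the idea of the kernel's sched_find_first_bit(), not its exact implementation:

```c
#include <assert.h>

#define MAX_PRIO 140

/* One bit per priority level across five 32-bit words; scanning for the
 * first set bit is a constant-time operation regardless of task count. */
int find_highest_prio(const unsigned int bitmap[5]) {
    for (int w = 0; w < 5; w++)
        if (bitmap[w])
            return w * 32 + __builtin_ctz(bitmap[w]);
    return MAX_PRIO;   /* no runnable task */
}
```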

Load Balancing (SMP)

  • Per-CPU runqueues; scheduler balances load across CPUs
  • load_balance() called periodically or when CPU goes idle
  • Domain hierarchy: hyper-thread siblings → physical cores → NUMA nodes

Memory Management (Chapter 8)

Buddy System (Page Frame Allocator)

Manages contiguous physical page frames in powers of 2:

// Allocate 2^order contiguous pages
struct page *page = alloc_pages(GFP_KERNEL, order);
// Free:
__free_pages(page, order);
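Splitting and coalescing rely on the fact that a block's buddy differs from it in exactly one bit of its page frame number; a tiny illustrative helper (buddy_pfn is our own name):

```c
#include <assert.h>

/* The buddy of a free block of 2^order pages is found by
 * flipping bit 'order' of the block's page frame number. */
unsigned long buddy_pfn(unsigned long pfn, unsigned int order) {
    return pfn ^ (1UL << order);
}
```

Two buddies of order n that are both free merge into one block of order n+1, which is why the allocator can rebuild large contiguous runs.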

Slab Allocator (Object Cache)

For frequently allocated kernel objects (task_struct, inode, dentry):

// Create a cache for objects of size sizeof(struct my_struct):
struct kmem_cache *cache = kmem_cache_create("name", sizeof(struct my_struct), 0, 0, NULL);
// Allocate:
struct my_struct *p = kmem_cache_alloc(cache, GFP_KERNEL);
// Free:
kmem_cache_free(cache, p);

GFP Flags

  • GFP_KERNEL: can sleep, normal kernel allocation
  • GFP_ATOMIC: cannot sleep (interrupt context)
  • GFP_DMA: allocate from DMA-accessible memory
  • __GFP_HIGHMEM: can use high memory

vmalloc vs kmalloc

  • kmalloc: physically contiguous, max 128KB typically, fastest
  • vmalloc: virtually contiguous only, larger allocations, slower (TLB misses)

Process Address Space (Chapter 9)

Key Structures

  • mm_struct: per-process memory descriptor (page table root, list of VMAs, etc.)
  • vm_area_struct (VMA): describes one contiguous region (code, stack, heap, mmap’d file)

Page Fault Handler

  1. CPU raises the page fault exception (vector 14); the faulting address is in CR2
  2. Kernel checks if address is in a valid VMA
  3. If valid VMA + anonymous: allocate new page, zero-fill, map
  4. If valid VMA + file-backed: read from page cache or file
  5. If invalid: SIGSEGV to process

Memory Maps

// In user space, mmap() creates a new VMA (check for MAP_FAILED on error):
// Anonymous: private zero-filled memory, pages allocated lazily on first fault
void *mem = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// File-backed: map file contents into the address space
void *data = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, offset);

System Calls (Chapter 10)

System Call Entry (32-bit)

  1. User code: int 0x80 or sysenter (faster)
  2. CPU switches to ring 0, uses kernel stack
  3. System call number in eax; arguments in ebx, ecx, edx, esi, edi, ebp
  4. Kernel dispatches via sys_call_table[eax]
  5. Returns to user space via iret or sysexit

64-bit (x86-64)

  • Uses syscall instruction
  • Arguments in rdi, rsi, rdx, r10, r8, r9
  • Return value in rax
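From userspace the same entry path can be exercised through glibc's syscall() wrapper, which loads the number and arguments into the registers listed above before executing the entry instruction:

```c
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Issue getpid by raw syscall number; on x86-64 the wrapper places
 * the number in rax and arguments in rdi, rsi, ... before 'syscall'. */
long raw_getpid(void) {
    return syscall(SYS_getpid);
}
```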

Virtual Filesystem (VFS) (Chapter 12)

VFS Object Hierarchy

Superblock → describes mounted filesystem
    └── Inode → represents a file/dir on disk
         └── Dentry → represents a path component (cached)
              └── File → open file instance (per open() call)

VFS Operations

Each object has an operations pointer with function pointers:

struct inode_operations {
    int (*create)(struct inode*, struct dentry*, ...);
    struct dentry* (*lookup)(struct inode*, struct dentry*, ...);
    int (*link)(...); int (*unlink)(...);
    // ...
};

struct file_operations {
    ssize_t (*read)(struct file*, char __user*, size_t, loff_t*);
    ssize_t (*write)(struct file*, const char __user*, size_t, loff_t*);
    int (*mmap)(struct file*, struct vm_area_struct*);
    // ...
};

Dentry Cache

The kernel caches recently accessed path components in the dentry cache (dcache). path_lookup() walks the pathname component by component, checking the dcache before touching the filesystem.


Device Drivers (Chapter 13)

Character vs Block Devices

  • Character device: byte stream (serial ports, terminals, sensors). No seek.
  • Block device: fixed-size blocks, random access (disk). Has I/O scheduler.

Device Driver Registration

// Register char device
static struct file_operations my_fops = {
    .open = my_open,
    .read = my_read,
    .write = my_write,
    .release = my_release,
};
register_chrdev(major, "mydev", &my_fops);

I/O Scheduler

Merges and reorders block I/O requests to minimize disk head movement:

  • CFQ (Completely Fair Queuing): default, fair sharing among processes
  • Deadline: prevents starvation with hard deadlines
  • Noop: no reordering (for SSDs)

Page Cache (Chapter 15)

The page cache holds disk data in RAM. All file I/O passes through it:

  1. read(): check page cache → cache hit returns directly; miss reads from disk
  2. write(): write to page cache, mark page dirty → writeback daemon flushes asynchronously

Writeback: the pdflush kernel threads write dirty pages to disk when:

  • dirty_expire_centisecs exceeded (default 30s)
  • Memory pressure
  • sync(), fsync() called explicitly
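The write()-then-fsync() pattern above can be sketched in userspace (write_durably is our own illustrative helper; the path in the usage note is an example):

```c
#include <fcntl.h>
#include <unistd.h>

/* write() only dirties pages in the page cache; fsync() blocks until the
 * file's dirty pages (and metadata) reach stable storage. */
int write_durably(const char *path, const char *buf, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Without the fsync(), a crash after write() returns can still lose the data, because only the in-memory page cache was updated.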

IPC and Signals (Chapters 11, 19)

Signals

  • Delivered asynchronously to a process
  • Kernel sets pending bit in task_struct; checked on return from kernel
  • User registers handler via sigaction(); kernel sets up signal frame on user stack
  • Default actions: terminate, core dump, ignore, stop

Pipes and FIFOs

  • Pipe: anonymous, two file descriptors (read/write ends), kernel buffer (64KB default)
  • FIFO (named pipe): filesystem entry, same mechanism, accessible by name
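A minimal round trip through a pipe's kernel buffer (pipe_roundtrip is our own illustrative helper):

```c
#include <string.h>
#include <unistd.h>

/* fd[0] is the read end, fd[1] the write end; data passes
 * through a kernel buffer, no disk or filesystem involved. */
int pipe_roundtrip(void) {
    int fd[2];
    char buf[8] = {0};
    if (pipe(fd) != 0)
        return -1;
    write(fd[1], "hello", 5);
    read(fd[0], buf, sizeof(buf) - 1);
    close(fd[0]);
    close(fd[1]);
    return strcmp(buf, "hello");
}
```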

System V IPC

  • Message queues: kernel-buffered messages, typed
  • Semaphores: counting semaphores for process synchronization
  • Shared memory: fastest IPC — map same physical pages into multiple processes

Program Execution (Chapter 20)

ELF Binary Format

ELF header → program headers (PT_LOAD segments) → sections

Key segments:

  • PT_LOAD with R-X: text (code)
  • PT_LOAD with RW-: data + BSS
  • PT_INTERP: dynamic linker path
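The loader's first step — validating the ELF identification bytes — can be reproduced from userspace (is_elf is our own illustrative helper):

```c
#include <elf.h>
#include <fcntl.h>
#include <unistd.h>

/* The kernel's ELF loader begins the same way: read e_ident
 * and check the magic bytes 0x7f 'E' 'L' 'F'. */
int is_elf(const char *path) {
    unsigned char ident[EI_NIDENT];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;
    ssize_t n = read(fd, ident, EI_NIDENT);
    close(fd);
    return n == EI_NIDENT &&
           ident[EI_MAG0] == ELFMAG0 && ident[EI_MAG1] == ELFMAG1 &&
           ident[EI_MAG2] == ELFMAG2 && ident[EI_MAG3] == ELFMAG3;
}
```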

exec() Flow

  1. execve() syscall
  2. Kernel reads ELF header, finds interpreter (ld.so)
  3. Maps segments into new address space
  4. Transfers control to ELF entry point (dynamic linker bootstraps, then main())