Understanding the Linux Kernel

Understanding the Linux Kernel, 3rd Edition · Daniel P. Bovet & Marco Cesati · 950 pages

Comprehensive deep-dive into Linux 2.6 kernel internals — memory addressing (paging/segmentation), process management (task_struct, context switch, CoW), scheduling, synchronization primitives, memory management (buddy/slab), VFS, device drivers, page cache, IPC, and ELF execution.

Capabilities (10)
  • Explain x86 memory addressing: segmentation → paging → physical address translation
  • Trace process creation from fork() through copy-on-write to exec()
  • Explain kernel synchronization: spinlocks, semaphores, RCU, atomic ops — and when each applies
  • Describe Linux scheduling: priority levels, O(1) scheduler, SMP load balancing
  • Explain buddy system and slab allocator for kernel memory management
  • Trace page fault handling from exception to page allocation or file read
  • Explain system call entry path (int 0x80/syscall) and argument passing
  • Describe VFS object hierarchy: superblock, inode, dentry, file and their operation tables
  • Explain page cache writeback, dirty page management, and sync vs fsync
  • Describe ELF binary format and exec() flow including dynamic linking
How to use

Install this skill and Claude can trace any system call from userspace through the kernel entry point into the relevant subsystem, explain page fault and memory allocation behavior, reason through kernel synchronization deadlocks, describe the VFS call chain from read() to the filesystem, and analyze the full process lifecycle from fork() to exec().

Why it matters

Kernel internals knowledge is the foundation for writing correct device drivers, diagnosing hard-to-reproduce system bugs (OOM kills, deadlocks, page faults), and understanding the true cost of operations that appear cheap at the syscall boundary.

Example use cases
  • Identify why a kernel module locks up under concurrent access by tracing which locks are held, in what order, and whether any code path sleeps while holding a spinlock
  • Explain why a process is being killed by the OOM killer on a system with available swap by tracing buddy allocator state, zone watermarks, and GFP flags
  • Trace what happens in the kernel when mmap(MAP_ANONYMOUS) is called, including VMA creation, page fault deferral, and when physical pages are actually allocated

Understanding the Linux Kernel Skill

Memory Addressing (Chapter 2)

Three-Level Address Translation (x86)

Logical address → [Segmentation] → Linear address → [Paging] → Physical address

Segmentation in Linux: mostly disabled (flat memory model). CS, DS, SS all map to same base 0, limit 4GB. Only segment descriptor privilege levels (CPL/DPL) are actively used for kernel/user separation.

Paging (2-level on 32-bit):

Linear address: [PGD index | PTE index | Offset]
Page Global Directory (PGD) → Page Table Entry (PTE) → Physical frame
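With 4 KB pages and no PAE, the 32-bit split is 10 bits of PGD index, 10 bits of page-table index, and a 12-bit offset. A minimal userspace sketch of the decomposition (illustrative helper functions, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit x86, 4 KB pages, no PAE: 10-bit PGD index | 10-bit PTE index | 12-bit offset */
uint32_t pgd_index(uint32_t linear)   { return linear >> 22; }
uint32_t pte_index(uint32_t linear)   { return (linear >> 12) & 0x3ff; }
uint32_t page_offset(uint32_t linear) { return linear & 0xfff; }
```

For example, PAGE_OFFSET (0xC0000000) lands in PGD slot 768, which is why the kernel owns the top quarter of every process's page directory.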

TLB: Translation Lookaside Buffer caches recent translations. Context switch flushes TLB (or uses PCID on modern x86 to tag entries per process).

High memory (32-bit): memory above 896MB can’t be permanently mapped. Kernel accesses it via temporary kmap() mappings.


Processes (Chapter 3)

task_struct (Process Descriptor)

Key fields:

  • state: TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, TASK_STOPPED, TASK_TRACED (plus exit states EXIT_ZOMBIE, EXIT_DEAD)
  • pid / tgid: process ID / thread group ID
  • mm: pointer to mm_struct (address space), NULL for kernel threads
  • files: open file descriptors table
  • fs: filesystem context (cwd, root)
  • signal: signal handlers
  • policy / static_prio: scheduling policy (SCHED_NORMAL, SCHED_FIFO, SCHED_RR) and base priority

Process Creation: fork() / clone()

  1. do_fork() allocates new task_struct (from slab cache)
  2. Copy-on-write: both parent and child get read-only page mappings; page fault triggers actual copy
  3. clone() flags control what is shared: CLONE_VM, CLONE_FS, CLONE_FILES, CLONE_THREAD
  4. Threads = processes sharing same mm_struct, files, etc.
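The CoW semantics can be observed from userspace: after fork(), a write in the child faults and gets a private copy of the page, leaving the parent's value untouched. A small sketch (cow_demo is our own illustrative helper):

```c
#include <assert.h>
#include <sys/wait.h>
#include <unistd.h>

/* After fork(), parent and child share read-only page mappings; the child's
 * write below faults and gets a private copy, so the parent still sees 42. */
int cow_demo(void) {
    int value = 42;
    pid_t pid = fork();
    if (pid == 0) {              /* child: this write triggers the actual copy */
        value = 99;
        _exit(value == 99 ? 0 : 1);
    }
    int status = -1;
    waitpid(pid, &status, 0);
    return (value == 42 && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
}
```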

Process Switch (Context Switch)

  1. Save hardware context (registers) to thread_struct in task_struct
  2. Switch page table root (CR3 register on x86) — changes address space
  3. Load new process’s hardware context
  4. Executed via schedule() → context_switch() → switch_to()

Interrupts and Exceptions (Chapter 4)

Interrupt Descriptor Table (IDT)

256 entries. Entries 0-31: CPU exceptions (divide error, page fault, etc.). Entries 32-255: hardware IRQs and software interrupts (vector 0x80 is the legacy system-call gate).

Interrupt Handling Flow

  1. CPU saves registers, switches to kernel stack (if in user mode)
  2. Calls interrupt handler registered in IDT
  3. Handler runs with local IRQs disabled
  4. Returns via iret

Bottom Halves: Deferring Work

  • Softirqs: statically allocated at compile time, run with interrupts enabled, cannot sleep. Used by network subsystem, block layer.
  • Tasklets: built on softirqs, serialized per-tasklet, easier to use. For most drivers.
  • Work queues: deferred work in kernel thread context — can sleep. For slow driver operations.
// Declare a tasklet bound to its handler (runs in softirq context, must not sleep)
DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 0);

// Schedule it for later execution in softirq context
tasklet_schedule(&my_tasklet);

// Schedule work that may sleep (runs in a kernel thread)
queue_work(my_workqueue, &my_work);

Kernel Synchronization (Chapter 5)

Synchronization Primitives

Primitive              Sleeps?         Use case
---------------------  --------------  -------------------------------------------
Atomic ops (atomic_t)  No              Counters, flags
Spin lock              No (busy wait)  Short critical sections, interrupt handlers
Read-write spin lock   No              Reader-heavy data
Semaphore              Yes             Longer critical sections
Mutex                  Yes             Single-owner lock, simpler than a semaphore
RCU                    No (readers)    Read-mostly data, list traversal
Seqlock                No              Fast readers, rare writes

Spin Lock Rules

  • Never sleep while holding a spin lock — waiters on other CPUs spin forever, and sleeping with preemption disabled deadlocks
  • Use spin_lock_irqsave() for locks shared with interrupt handlers — it disables local interrupts and saves the previous IRQ state
  • Spin locks are not recursive — re-acquiring a lock already held on the same CPU deadlocks

RCU (Read-Copy-Update)

// Reader (no lock taken — just marks a read-side critical section):
rcu_read_lock();
p = rcu_dereference(head);
// use p safely
rcu_read_unlock();

// Writer (publish the new version, then reclaim the old one):
new_node = kmalloc(sizeof(*new_node), GFP_KERNEL);
old_node = head;                     // remember the old version before replacing
rcu_assign_pointer(head, new_node);
synchronize_rcu();  // wait for all pre-existing readers to finish
kfree(old_node);

Process Scheduling (Chapter 7)

Scheduling Classes (Linux 2.6)

  • SCHED_NORMAL: ordinary time-shared processes (handled by the O(1) scheduler in the 2.6 kernels covered here; CFS replaced it in 2.6.23)
  • SCHED_FIFO: real-time, run until preempted or blocked
  • SCHED_RR: real-time with round-robin time quantum

O(1) Scheduler (2.6 era)

  • 140 priority levels (0-99 RT, 100-139 nice-adjusted)
  • Two runqueues per CPU: active + expired
  • Bitmap of non-empty priority queues → O(1) next task selection
  • Interactive processes get dynamic priority boost; CPU-hogs get penalties
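The O(1) selection step can be sketched in plain C: one bit per priority level, and the lowest set bit (bit 0 = highest priority) names the queue to run next. The helper below mirrors the idea of the kernel's sched_find_first_bit(), not its exact implementation:

```c
#include <assert.h>

#define MAX_PRIO 140

/* One bit per priority level across five 32-bit words; scanning for the
 * first set bit is a constant-time operation regardless of task count. */
int find_highest_prio(const unsigned int bitmap[5]) {
    for (int w = 0; w < 5; w++)
        if (bitmap[w])
            return w * 32 + __builtin_ctz(bitmap[w]);
    return MAX_PRIO;   /* no runnable task */
}
```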

Load Balancing (SMP)

  • Per-CPU runqueues; scheduler balances load across CPUs
  • load_balance() called periodically or when CPU goes idle
  • Domain hierarchy: hyper-thread siblings → physical cores → NUMA nodes

Memory Management (Chapter 8)

Buddy System (Page Frame Allocator)

Manages contiguous physical page frames in powers of 2:

// Allocate 2^order contiguous pages
struct page *page = alloc_pages(GFP_KERNEL, order);
// Free:
__free_pages(page, order);
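Splitting and coalescing rely on the fact that a block's buddy differs from it in exactly one bit of its page frame number; a tiny illustrative helper (buddy_pfn is our own name):

```c
#include <assert.h>

/* The buddy of a free block of 2^order pages is found by
 * flipping bit 'order' of the block's page frame number. */
unsigned long buddy_pfn(unsigned long pfn, unsigned int order) {
    return pfn ^ (1UL << order);
}
```

Two buddies of order n that are both free merge into one block of order n+1, which is why the allocator can rebuild large contiguous runs.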

Slab Allocator (Object Cache)

For frequently allocated kernel objects (task_struct, inode, dentry):

// Create a cache for objects of size sizeof(struct my_struct):
struct kmem_cache *cache = kmem_cache_create("name", sizeof(struct my_struct), 0, 0, NULL);
// Allocate:
struct my_struct *p = kmem_cache_alloc(cache, GFP_KERNEL);
// Free:
kmem_cache_free(cache, p);

GFP Flags

  • GFP_KERNEL: can sleep, normal kernel allocation
  • GFP_ATOMIC: cannot sleep (interrupt context)
  • GFP_DMA: allocate from DMA-accessible memory
  • __GFP_HIGHMEM: can use high memory

vmalloc vs kmalloc

  • kmalloc: physically contiguous, max 128KB typically, fastest
  • vmalloc: virtually contiguous only, larger allocations, slower (TLB misses)

Process Address Space (Chapter 9)

Key Structures

  • mm_struct: per-process memory descriptor (page table root, list of VMAs, etc.)
  • vm_area_struct (VMA): describes one contiguous region (code, stack, heap, mmap’d file)

Page Fault Handler

  1. CPU raises the page fault exception (vector 14); the faulting address is in CR2
  2. Kernel checks if address is in a valid VMA
  3. If valid VMA + anonymous: allocate new page, zero-fill, map
  4. If valid VMA + file-backed: read from page cache or file
  5. If invalid: SIGSEGV to process

Memory Maps

// In user space, mmap() creates a new VMA (check for MAP_FAILED on error):
// Anonymous: private zero-filled memory, pages allocated lazily on first fault
void *mem = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// File-backed: map file contents into the address space
void *data = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, offset);

System Calls (Chapter 10)

System Call Entry (32-bit)

  1. User code: int 0x80 or sysenter (faster)
  2. CPU switches to ring 0, uses kernel stack
  3. System call number in eax; arguments in ebx, ecx, edx, esi, edi, ebp
  4. Kernel dispatches via sys_call_table[eax]
  5. Returns to user space via iret or sysexit

64-bit (x86-64)

  • Uses syscall instruction
  • Arguments in rdi, rsi, rdx, r10, r8, r9
  • Return value in rax
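From userspace the same entry path can be exercised through glibc's syscall() wrapper, which loads the number and arguments into the registers listed above before executing the entry instruction:

```c
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Issue getpid by raw syscall number; on x86-64 the wrapper places
 * the number in rax and arguments in rdi, rsi, ... before 'syscall'. */
long raw_getpid(void) {
    return syscall(SYS_getpid);
}
```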

Virtual Filesystem (VFS) (Chapter 12)

VFS Object Hierarchy

Superblock → describes mounted filesystem
    └── Inode → represents a file/dir on disk
         └── Dentry → represents a path component (cached)
              └── File → open file instance (per open() call)

VFS Operations

Each object has an operations pointer with function pointers:

struct inode_operations {
    int (*create)(struct inode*, struct dentry*, ...);
    struct dentry* (*lookup)(struct inode*, struct dentry*, ...);
    int (*link)(...); int (*unlink)(...);
    // ...
};

struct file_operations {
    ssize_t (*read)(struct file*, char __user*, size_t, loff_t*);
    ssize_t (*write)(struct file*, const char __user*, size_t, loff_t*);
    int (*mmap)(struct file*, struct vm_area_struct*);
    // ...
};

Dentry Cache

The kernel caches recently accessed path components in the dentry cache (dcache). path_lookup() walks the pathname component by component, checking the dcache before touching the filesystem.


Device Drivers (Chapter 13)

Character vs Block Devices

  • Character device: byte stream (serial ports, terminals, sensors). No seek.
  • Block device: fixed-size blocks, random access (disk). Has I/O scheduler.

Device Driver Registration

// Register char device
static struct file_operations my_fops = {
    .open = my_open,
    .read = my_read,
    .write = my_write,
    .release = my_release,
};
register_chrdev(major, "mydev", &my_fops);

I/O Scheduler

Merges and reorders block I/O requests to minimize disk head movement:

  • CFQ (Completely Fair Queuing): default, fair sharing among processes
  • Deadline: prevents starvation with hard deadlines
  • Noop: no reordering (for SSDs)

Page Cache (Chapter 15)

The page cache holds disk data in RAM. All file I/O passes through it:

  1. read(): check page cache → cache hit returns directly; miss reads from disk
  2. write(): write to page cache, mark page dirty → writeback daemon flushes asynchronously

Writeback: the pdflush kernel threads write dirty pages to disk when:

  • dirty_expire_centisecs exceeded (default 30s)
  • Memory pressure
  • sync(), fsync() called explicitly
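The write()-then-fsync() pattern above can be sketched in userspace (write_durably is our own illustrative helper; the path in the usage note is an example):

```c
#include <fcntl.h>
#include <unistd.h>

/* write() only dirties pages in the page cache; fsync() blocks until the
 * file's dirty pages (and metadata) reach stable storage. */
int write_durably(const char *path, const char *buf, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Without the fsync(), a crash after write() returns can still lose the data, because only the in-memory page cache was updated.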

IPC and Signals (Chapters 11, 19)

Signals

  • Delivered asynchronously to a process
  • Kernel sets pending bit in task_struct; checked on return from kernel
  • User registers handler via sigaction(); kernel sets up signal frame on user stack
  • Default actions: terminate, core dump, ignore, stop

Pipes and FIFOs

  • Pipe: anonymous, two file descriptors (read/write ends), kernel buffer (64KB default)
  • FIFO (named pipe): filesystem entry, same mechanism, accessible by name
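A minimal round trip through a pipe's kernel buffer (pipe_roundtrip is our own illustrative helper):

```c
#include <string.h>
#include <unistd.h>

/* fd[0] is the read end, fd[1] the write end; data passes
 * through a kernel buffer, no disk or filesystem involved. */
int pipe_roundtrip(void) {
    int fd[2];
    char buf[8] = {0};
    if (pipe(fd) != 0)
        return -1;
    write(fd[1], "hello", 5);
    read(fd[0], buf, sizeof(buf) - 1);
    close(fd[0]);
    close(fd[1]);
    return strcmp(buf, "hello");
}
```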

System V IPC

  • Message queues: kernel-buffered messages, typed
  • Semaphores: counting semaphores for process synchronization
  • Shared memory: fastest IPC — map same physical pages into multiple processes

Program Execution (Chapter 20)

ELF Binary Format

ELF header → program headers (PT_LOAD segments) → sections

Key segments:

  • PT_LOAD with R-X: text (code)
  • PT_LOAD with RW-: data + BSS
  • PT_INTERP: dynamic linker path
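The loader's first step — validating the ELF identification bytes — can be reproduced from userspace (is_elf is our own illustrative helper):

```c
#include <elf.h>
#include <fcntl.h>
#include <unistd.h>

/* The kernel's ELF loader begins the same way: read e_ident
 * and check the magic bytes 0x7f 'E' 'L' 'F'. */
int is_elf(const char *path) {
    unsigned char ident[EI_NIDENT];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;
    ssize_t n = read(fd, ident, EI_NIDENT);
    close(fd);
    return n == EI_NIDENT &&
           ident[EI_MAG0] == ELFMAG0 && ident[EI_MAG1] == ELFMAG1 &&
           ident[EI_MAG2] == ELFMAG2 && ident[EI_MAG3] == ELFMAG3;
}
```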

exec() Flow

  1. execve() syscall
  2. Kernel reads ELF header, finds interpreter (ld.so)
  3. Maps segments into new address space
  4. Transfers control to ELF entry point (dynamic linker bootstraps, then main())