Understanding the Linux Kernel
Comprehensive deep-dive into Linux 2.6 kernel internals — memory addressing (paging/segmentation), process management (task_struct, context switch, CoW), scheduling, synchronization primitives, memory management (buddy/slab), VFS, device drivers, page cache, IPC, and ELF execution.
- Explain x86 memory addressing: segmentation → paging → physical address translation
- Trace process creation from fork() through copy-on-write to exec()
- Explain kernel synchronization: spinlocks, semaphores, RCU, atomic ops, and when each applies
- Describe Linux scheduling: priority levels, O(1) scheduler, SMP load balancing
- Explain the buddy system and slab allocator for kernel memory management
- Trace page fault handling from exception to page allocation or file read
- Explain the system call entry path (int 0x80/syscall) and argument passing
- Describe the VFS object hierarchy: superblock, inode, dentry, file and their operation tables
- Explain page cache writeback, dirty page management, and sync vs fsync
- Describe the ELF binary format and exec() flow, including dynamic linking
Install this skill and Claude can trace any system call from userspace through the kernel entry point into the relevant subsystem, explain page fault and memory allocation behavior, reason through kernel synchronization deadlocks, describe the VFS call chain from read() to the filesystem, and analyze the full process lifecycle from fork() to exec().
Kernel internals knowledge is the foundation for writing correct device drivers, diagnosing hard-to-reproduce system bugs (OOM kills, deadlocks, page faults), and understanding the true cost of operations that appear cheap at the syscall boundary.
- Identify why a kernel module locks up under concurrent access by tracing which locks are held, in what order, and whether any code path sleeps while holding a spinlock
- Explain why a process is being killed by the OOM killer on a system with available swap by tracing buddy allocator state, zone watermarks, and GFP flags
- Trace what happens in the kernel when mmap(MAP_ANONYMOUS) is called, including VMA creation, page fault deferral, and when physical pages are actually allocated
Understanding the Linux Kernel Skill
Memory Addressing (Chapter 2)
Three-Level Address Translation (x86)
Logical address → [Segmentation] → Linear address → [Paging] → Physical address
Segmentation in Linux: mostly disabled (flat memory model). CS, DS, SS all map to same base 0, limit 4GB. Only segment descriptor privilege levels (CPL/DPL) are actively used for kernel/user separation.
Paging (2-level on 32-bit):
Linear address: [PGD index | PTE index | Offset]
Page Global Directory (PGD) → Page Table Entry (PTE) → Physical frame
TLB: Translation Lookaside Buffer caches recent translations. Context switch flushes TLB (or uses PCID on modern x86 to tag entries per process).
High memory (32-bit): memory above 896MB can’t be permanently mapped. Kernel accesses it via temporary kmap() mappings.
Processes (Chapter 3)
task_struct (Process Descriptor)
Key fields:
- `state`: `TASK_RUNNING`, `TASK_INTERRUPTIBLE`, `TASK_UNINTERRUPTIBLE`, `TASK_ZOMBIE`, `TASK_STOPPED`
- `pid`/`tgid`: process ID / thread group ID
- `mm`: pointer to `mm_struct` (address space), NULL for kernel threads
- `files`: open file descriptor table
- `fs`: filesystem context (cwd, root)
- `signal`: signal handlers
- `sched_class`/`policy`: scheduler class (CFS, RT)
Process Creation: fork() / clone()
- `do_fork()` allocates a new `task_struct` (from the slab cache)
- Copy-on-write: both parent and child get read-only page mappings; a page fault triggers the actual copy
- `clone()` flags control what is shared: `CLONE_VM`, `CLONE_FS`, `CLONE_FILES`, `CLONE_THREAD`
- Threads = processes sharing the same `mm_struct`, `files`, etc.
Process Switch (Context Switch)
- Save the hardware context (registers) to `thread_struct` in `task_struct`
- Switch the page table root (the `CR3` register on x86), which changes the address space
- Load the new process's hardware context
- Executed via `schedule()` → `context_switch()` → `switch_to()`
Interrupts and Exceptions (Chapter 4)
Interrupt Descriptor Table (IDT)
256 entries. First 32: CPU exceptions (divide by zero, page fault, etc.). 32-255: hardware IRQs + software interrupts.
Interrupt Handling Flow
- CPU saves registers, switches to kernel stack (if in user mode)
- Calls interrupt handler registered in IDT
- Handler runs with local IRQs disabled
- Returns via `iret`
Bottom Halves: Deferring Work
- Softirqs: statically defined at compile time, run with interrupts enabled, cannot sleep. Used by the network subsystem and block layer.
- Tasklets: built on softirqs, serialized per-tasklet, easier to use. For most drivers.
- Work queues: deferred work in kernel thread context — can sleep. For slow driver operations.
// Schedule work for later execution in softirq context
tasklet_schedule(&my_tasklet);
// Schedule work that may sleep
queue_work(my_workqueue, &my_work);
Kernel Synchronization (Chapter 5)
Synchronization Primitives
| Primitive | Sleeps? | Use case |
|---|---|---|
| Atomic ops (`atomic_t`) | No | Counters, flags |
| Spin lock | No (busy wait) | Short critical sections, interrupt handlers |
| Read-write spinlock | No | Reader-heavy data |
| Semaphore | Yes | Longer critical sections |
| Mutex | Yes | Sleeping lock with a single owner; simpler and cheaper than a semaphore |
| RCU | No (readers) | Read-mostly data, list traversal |
| Seqlock | No | Fast readers, rare writes |
Spin Lock Rules
- Never sleep while holding a spin lock: the sleeping task keeps the lock while other CPUs spin on it, which can deadlock the system
- Use `spin_lock_irqsave()` when the lock is also taken in interrupt context (it saves and disables the local interrupt state)
- Spin locks are not recursive: the same CPU re-acquiring a lock it already holds deadlocks
RCU (Read-Copy-Update)
// Reader (no lock needed beyond the read-side critical section):
rcu_read_lock();
p = rcu_dereference(head);
// use p safely; it will not be freed while we are inside the read section
rcu_read_unlock();
// Writer: copy, update, publish, then wait before freeing the old copy
old_node = head;
new_node = kmalloc(sizeof(*new_node), GFP_KERNEL);
*new_node = *old_node;               // copy
new_node->field = new_value;         // update the copy
rcu_assign_pointer(head, new_node);  // publish atomically
synchronize_rcu();                   // wait for all pre-existing readers to finish
kfree(old_node);
Process Scheduling (Chapter 7)
Scheduling Classes (Linux 2.6)
- SCHED_NORMAL: ordinary interactive and batch processes, i.e. most processes (O(1) scheduler in early 2.6; CFS from 2.6.23)
- SCHED_FIFO: real-time; runs until it blocks, yields, or is preempted by a higher-priority real-time task
- SCHED_RR: real-time with round-robin time quantum
O(1) Scheduler (2.6 era)
- 140 priority levels (0-99 RT, 100-139 nice-adjusted)
- Two runqueues per CPU: active + expired
- Bitmap of non-empty priority queues → O(1) next task selection
- Interactive processes get dynamic priority boost; CPU-hogs get penalties
Load Balancing (SMP)
- Per-CPU runqueues; scheduler balances load across CPUs
- `load_balance()` is called periodically or when a CPU goes idle
- Domain hierarchy: hyper-thread siblings → physical cores → NUMA nodes
Memory Management (Chapter 8)
Buddy System (Page Frame Allocator)
Manages contiguous physical page frames in powers of 2:
// Allocate 2^order contiguous pages
struct page *page = alloc_pages(GFP_KERNEL, order);
// Free:
__free_pages(page, order);
Slab Allocator (Object Cache)
For frequently allocated kernel objects (task_struct, inode, dentry):
// Create cache for objects of size sizeof(my_struct):
struct kmem_cache *cache = kmem_cache_create("name", sizeof(my_struct), 0, 0, NULL);
// Allocate:
my_struct *p = kmem_cache_alloc(cache, GFP_KERNEL);
// Free:
kmem_cache_free(cache, p);
GFP Flags
- `GFP_KERNEL`: normal kernel allocation; may sleep
- `GFP_ATOMIC`: cannot sleep (interrupt context)
- `GFP_DMA`: allocate from DMA-accessible memory
- `__GFP_HIGHMEM`: may use high memory
vmalloc vs kmalloc
- `kmalloc`: physically contiguous, typically limited to 128KB, fastest
- `vmalloc`: only virtually contiguous, supports larger allocations, slower (extra TLB pressure)
Process Address Space (Chapter 9)
Key Structures
- `mm_struct`: per-process memory descriptor (page table root, list of VMAs, etc.)
- `vm_area_struct` (VMA): describes one contiguous region (code, stack, heap, mmap'd file)
Page Fault Handler
- CPU raises the page fault exception (vector 14)
- Kernel checks if address is in a valid VMA
- If valid VMA + anonymous: allocate new page, zero-fill, map
- If valid VMA + file-backed: read from page cache or file
- If invalid: SIGSEGV to process
Memory Maps
// In user space, mmap() creates a new VMA:
// Anonymous: private memory
mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// File-backed: map file into address space
mmap(NULL, size, PROT_READ, MAP_SHARED, fd, offset);
System Calls (Chapter 10)
System Call Entry (32-bit)
- User code executes `int 0x80` or `sysenter` (faster)
- CPU switches to ring 0 and the kernel stack
- System call number in `eax`; arguments in `ebx`, `ecx`, `edx`, `esi`, `edi`, `ebp`
- Kernel dispatches via `sys_call_table[eax]`
- Returns to user space via `iret` or `sysexit`
64-bit (x86-64)
- Uses the `syscall` instruction
- System call number in `rax`; arguments in `rdi`, `rsi`, `rdx`, `r10`, `r8`, `r9`
- Return value in `rax`
Virtual Filesystem (VFS) (Chapter 12)
VFS Object Hierarchy
Superblock → describes mounted filesystem
└── Inode → represents a file/dir on disk
└── Dentry → represents a path component (cached)
└── File → open file instance (per open() call)
VFS Operations
Each object has an operations pointer with function pointers:
struct inode_operations {
int (*create)(struct inode*, struct dentry*, ...);
struct dentry* (*lookup)(struct inode*, struct dentry*, ...);
int (*link)(...); int (*unlink)(...);
// ...
};
struct file_operations {
ssize_t (*read)(struct file*, char __user*, size_t, loff_t*);
ssize_t (*write)(struct file*, const char __user*, size_t, loff_t*);
int (*mmap)(struct file*, struct vm_area_struct*);
// ...
};
Dentry Cache
The kernel caches recently accessed path components in the dentry cache (dcache). Pathname lookup (`path_lookup()` in 2.6) walks the path component by component, checking the dcache before touching the filesystem.
Device Drivers (Chapter 13)
Character vs Block Devices
- Character device: byte stream (serial ports, terminals, sensors). No seek.
- Block device: fixed-size blocks, random access (disk). Has I/O scheduler.
Device Driver Registration
// Register char device
static struct file_operations my_fops = {
.open = my_open,
.read = my_read,
.write = my_write,
.release = my_release,
};
register_chrdev(major, "mydev", &my_fops);
I/O Scheduler
Merges and reorders block I/O requests to minimize disk head movement:
- CFQ (Completely Fair Queuing): default, fair sharing among processes
- Deadline: prevents starvation with hard deadlines
- Noop: merging only, no reordering (for SSDs and smart controllers)
Page Cache (Chapter 15)
The page cache holds disk data in RAM. All file I/O passes through it:
- `read()`: check the page cache; a hit returns data directly, a miss reads from disk into the cache
- `write()`: write to the page cache and mark the page dirty; the writeback daemon flushes asynchronously
Writeback: the pdflush threads (2.6) write dirty pages when:
- `dirty_expire_centisecs` is exceeded (default 3000, i.e. 30 seconds)
- Memory pressure requires reclaiming pages
- `sync()` or `fsync()` is called explicitly
IPC and Signals (Chapters 11, 19)
Signals
- Delivered asynchronously to a process
- Kernel sets a pending bit in `task_struct`; checked on return from kernel mode
- User registers a handler via `sigaction()`; the kernel sets up a signal frame on the user stack
- Default actions: terminate, core dump, ignore, stop
Pipes and FIFOs
- Pipe: anonymous, two file descriptors (read/write ends), kernel buffer (64KB default)
- FIFO (named pipe): filesystem entry, same mechanism, accessible by name
System V IPC
- Message queues: kernel-buffered messages, typed
- Semaphores: counting semaphores for process synchronization
- Shared memory: fastest IPC — map same physical pages into multiple processes
Program Execution (Chapter 20)
ELF Binary Format
ELF header → program headers (PT_LOAD segments) → sections
Key segments:
- `PT_LOAD` with `R-X`: text (code)
- `PT_LOAD` with `RW-`: data + BSS
- `PT_INTERP`: path of the dynamic linker
exec() Flow
execve()syscall- Kernel reads ELF header, finds interpreter (
ld.so) - Maps segments into new address space
- Transfers control to ELF entry point (dynamic linker bootstraps, then
main())