Understanding Linux Network Internals
Deep dive into Linux kernel networking internals: sk_buff socket buffer lifecycle (allocation, pointer manipulation, stack traversal), net_device NIC driver registration, softirq receive/transmit path (NAPI), IPv4 forwarding and fragmentation, ARP/neighboring subsystem state machine, and routing table FIB lookup with route cache.
- Allocate and manipulate sk_buff socket buffers using skb_reserve/skb_put/skb_push/skb_pull for zero-copy packet construction
- Register NIC drivers via alloc_etherdev/register_netdev and implement the net_device function pointer interface
- Implement NAPI poll-based packet receive to reduce interrupt overhead on high-throughput NICs
- Trace the IPv4 receive path from NIC ISR through ip_rcv → ip_route_input → ip_forward/ip_local_deliver
- Understand ARP NUD state machine (INCOMPLETE/REACHABLE/STALE/DELAY/PROBE/FAILED) and neighbor resolution
- Navigate the routing subsystem: FIB lookup, route cache (dst_entry), and procfs tuning knobs
Install this skill and Claude can trace a network packet's complete journey through the Linux kernel from NIC ISR to socket delivery, write correct sk_buff manipulation code for zero-copy packet construction, implement NAPI-compliant receive paths, debug ARP resolution failures via the NUD state machine, and analyze FIB routing decisions.
Issues that don't surface at the socket API level (sk_buff ownership bugs, NAPI scheduling errors, ARP state machine deadlocks) require kernel internals knowledge to diagnose. This skill helps avoid whole classes of kernel panics and memory corruption in network driver and packet processing code.
- Implement a NAPI-compliant receive path for a simulated NIC driver including ISR scheduling, the poll() function with budget enforcement, and netif_rx_complete on drain
- Identify why a custom protocol driver is corrupting packet data by tracing incorrect use of skb_push vs. skb_put and missing skb_reserve headroom alignment
- Explain why packets destined for a specific subnet are being forwarded out the wrong interface by tracing FIB lookup priority and dst_entry output function selection
Linux Network Internals Skill
sk_buff: The Kernel Socket Buffer
Every network packet in the Linux kernel is represented by an sk_buff structure. It has two separate memory allocations:
- The sk_buff header (metadata)
- The data buffer (actual packet bytes)
Buffer lifecycle
// Allocate buffer — use in process context
struct sk_buff *skb = alloc_skb(MAX_HEADER + payload_size, GFP_KERNEL);
// Allocate buffer — use in interrupt context (driver ISR)
struct sk_buff *skb = dev_alloc_skb(length);
// dev_alloc_skb adds 16 bytes headroom and uses GFP_ATOMIC
// Free buffer (decrements users refcount; frees when reaches 0)
kfree_skb(skb); // called by protocol layers
dev_kfree_skb(skb); // alias for use in device drivers
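The reference-counted free described above can be modeled in user space. The sketch below is a single-threaded approximation: the names `ref_skb`, `ref_alloc_skb`, `ref_skb_get`, `ref_kfree_skb` and the `ref_live_buffers` counter are hypothetical, and the real kernel keeps `users` in an `atomic_t`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space model; the kernel field is an atomic_t named users. */
struct ref_skb {
    int users;              /* number of holders of this buffer */
    unsigned char *head;    /* separately allocated data buffer */
};

static int ref_live_buffers;    /* demo-only counter of buffers not yet freed */

static struct ref_skb *ref_alloc_skb(size_t size)
{
    struct ref_skb *skb = malloc(sizeof(*skb));

    if (!skb)
        return NULL;
    skb->head = malloc(size);
    if (!skb->head) {
        free(skb);
        return NULL;
    }
    skb->users = 1;             /* the allocator holds the first reference */
    ref_live_buffers++;
    return skb;
}

/* Take an extra reference (models skb_get). */
static struct ref_skb *ref_skb_get(struct ref_skb *skb)
{
    skb->users++;
    return skb;
}

/* Drop one reference; release both allocations when the last one goes. */
static void ref_kfree_skb(struct ref_skb *skb)
{
    if (!skb)
        return;
    assert(skb->users > 0);
    if (--skb->users > 0)
        return;
    free(skb->head);
    free(skb);
    ref_live_buffers--;
}
```

Each holder calls the free function exactly once; the memory goes away only when the last holder lets go, which is why a missing or doubled free shows up as a leak or as corruption far from the faulty call site.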
Buffer pointer manipulation (Figure 2-4 idiom)
sk_buff pointers:
head → start of allocated buffer
data → start of current packet data (moves as headers added/removed)
tail → end of current packet data
end → end of allocated buffer (skb_shared_info lives here)
len → data length (tail - data for a linear buffer; also counts any paged fragments)
// Reserve headroom before writing (must call before any data written)
// Shifts data and tail pointers forward by len bytes
skb_reserve(skb, NET_IP_ALIGN); // typically 2 bytes for IP alignment
// Add data to the TAIL of the buffer (returns pointer to new space)
void *ptr = skb_put(skb, payload_size);
memcpy(ptr, user_data, payload_size);
// Add header to the HEAD (returns pointer to new space)
struct iphdr *iph = (struct iphdr *)skb_push(skb, sizeof(struct iphdr));
// Remove header from the HEAD (move data pointer forward)
skb_pull(skb, sizeof(struct ethhdr));
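To see how the four operations move the pointers, here is a minimal user-space sketch of the linear part of a socket buffer. The `mini_skb` type and `mini_*` functions are hypothetical stand-ins, not kernel code; the assertions play the role of the kernel's over/under-run panics.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the linear part of an sk_buff. */
struct mini_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int len;
};

static struct mini_skb *mini_alloc_skb(unsigned int size)
{
    struct mini_skb *skb = malloc(sizeof(*skb));

    if (!skb)
        return NULL;
    skb->head = malloc(size);
    if (!skb->head) {
        free(skb);
        return NULL;
    }
    skb->data = skb->tail = skb->head;  /* empty: no headroom yet */
    skb->end = skb->head + size;
    skb->len = 0;
    return skb;
}

static void mini_free_skb(struct mini_skb *skb)
{
    free(skb->head);
    free(skb);
}

/* Open headroom; only valid while the buffer is still empty. */
static void mini_skb_reserve(struct mini_skb *skb, unsigned int n)
{
    assert(skb->len == 0 && (unsigned int)(skb->end - skb->tail) >= n);
    skb->data += n;
    skb->tail += n;
}

/* Extend the data area at the tail; returns the start of the new space. */
static unsigned char *mini_skb_put(struct mini_skb *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;

    assert((unsigned int)(skb->end - skb->tail) >= n);   /* tailroom check */
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Extend the data area at the head; returns the new data pointer. */
static unsigned char *mini_skb_push(struct mini_skb *skb, unsigned int n)
{
    assert((unsigned int)(skb->data - skb->head) >= n);  /* headroom check */
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Strip n bytes from the head; NULL if fewer than n bytes are present. */
static unsigned char *mini_skb_pull(struct mini_skb *skb, unsigned int n)
{
    if (n > skb->len)
        return NULL;
    skb->data += n;
    skb->len -= n;
    return skb->data;
}
```

Calling the push function without a prior reserve trips the headroom assertion at once, which is the user-space analogue of the corruption described for drivers that skip skb_reserve.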
Stack traversal pattern (TX direction)
TCP layer:
1. alloc_skb sized for payload + MAX_TCP_HEADER, then skb_reserve(MAX_TCP_HEADER) to leave headroom (worst-case for all layers)
2. Copy payload data to tail (skb_put)
3. skb_push → write TCP header
IP layer:
4. skb_push → write IP header
Ethernet driver:
5. skb_push → write Ethernet header
6. DMA to NIC
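The six steps above can be sketched with a flat byte array, to show why the headroom must be sized for every lower layer before the payload is written. The sizes (14/20/20 bytes, no options), the `tx_buf` type, and the fill bytes that stand in for real header contents are all assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

/* Illustrative header sizes (no IP or TCP options). */
enum { ETH_HDR = 14, IP_HDR = 20, TCP_HDR = 20,
       HEADROOM = ETH_HDR + IP_HDR + TCP_HDR };

/* Hypothetical flat transmit buffer with data/tail offsets. */
struct tx_buf {
    unsigned char bytes[256];
    unsigned int data;   /* offset of the first valid byte */
    unsigned int tail;   /* offset one past the last valid byte */
};

/* Steps 1-2: leave worst-case headroom, then copy the payload to the tail. */
static int tx_put_payload(struct tx_buf *b, const void *src, unsigned int n)
{
    b->data = b->tail = HEADROOM;
    if (n > sizeof(b->bytes) - b->tail)
        return -1;
    memcpy(b->bytes + b->tail, src, n);
    b->tail += n;
    return 0;
}

/* Steps 3-5: each layer prepends its header in front of the current data. */
static unsigned char *tx_push(struct tx_buf *b, unsigned int n, unsigned char fill)
{
    if (n > b->data)
        return NULL;             /* not enough headroom left */
    b->data -= n;
    memset(b->bytes + b->data, fill, n);   /* placeholder for a real header */
    return b->bytes + b->data;
}

/* Runs the whole TX walk; returns the frame length or -1 on failure. */
static int tx_build_frame(struct tx_buf *b, const void *payload, unsigned int n)
{
    if (tx_put_payload(b, payload, n) < 0)
        return -1;
    if (!tx_push(b, TCP_HDR, 'T'))   /* TCP layer */
        return -1;
    if (!tx_push(b, IP_HDR, 'I'))    /* IP layer */
        return -1;
    if (!tx_push(b, ETH_HDR, 'E'))   /* Ethernet driver */
        return -1;
    return (int)(b->tail - b->data);
}
```

Because the payload is written once and every header lands directly in front of it, no layer has to copy the data it received from the layer above.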
Key sk_buff fields
struct sk_buff {
struct sk_buff *next, *prev; // sk_buff list
struct net_device *dev; // device that received/will send
unsigned char *head, *data, // buffer pointers
*tail, *end;
unsigned int len; // data length
unsigned char pkt_type; // PACKET_HOST/BROADCAST/MULTICAST/etc.
unsigned short protocol; // L3 protocol (ETH_P_IP, ETH_P_ARP, ...)
unsigned int priority; // QoS class
unsigned char ip_summed; // checksum status
union { struct iphdr *iph; ... } nh; // L3 header pointer
union { struct tcphdr *th; ... } h; // L4 header pointer
unsigned char cb[48]; // protocol control block (layer-specific scratch space)
};
pkt_type values
| Value | Meaning |
|---|---|
| PACKET_HOST | L2 destination is this interface; process it |
| PACKET_MULTICAST | Destination is a registered multicast group |
| PACKET_BROADCAST | Broadcast to all on this LAN |
| PACKET_OTHERHOST | L2 destination is another host (e.g. seen in promiscuous mode); ip_rcv drops it. Forwarding applies to PACKET_HOST frames whose IP destination is not local |
| PACKET_OUTGOING | Packet being sent out |
| PACKET_LOOPBACK | Sent to loopback device |
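For received Ethernet frames, the first four values follow from the destination MAC address. The sketch below mirrors that decision in user space; the `sim_*` names are hypothetical, and the hard-coded all-ones broadcast address is a simplification of the per-device broadcast address.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mirror of the receive-side pkt_type values. */
enum sim_pkt_type {
    SIM_PACKET_HOST,
    SIM_PACKET_BROADCAST,
    SIM_PACKET_MULTICAST,
    SIM_PACKET_OTHERHOST
};

/* Classify a frame by its destination MAC relative to our own address. */
static enum sim_pkt_type sim_classify(const unsigned char dest[6],
                                      const unsigned char dev_addr[6])
{
    static const unsigned char bcast[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

    if (dest[0] & 0x01) {                    /* group bit set */
        if (memcmp(dest, bcast, 6) == 0)
            return SIM_PACKET_BROADCAST;
        return SIM_PACKET_MULTICAST;
    }
    if (memcmp(dest, dev_addr, 6) != 0)      /* unicast, but not ours */
        return SIM_PACKET_OTHERHOST;
    return SIM_PACKET_HOST;
}
```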
net_device: NIC Driver Registration
Allocation and registration
// Allocate net_device with driver private data
// "eth%d" → kernel assigns eth0, eth1, etc.
struct net_device *dev = alloc_netdev(sizeof(struct my_priv), "eth%d", ether_setup);
// ether_setup initializes Ethernet-common fields
// Convenience wrapper for Ethernet devices ("eth%d" + ether_setup):
struct net_device *dev = alloc_etherdev(sizeof(struct my_priv));
// Register with kernel
int err = register_netdev(dev);
// Unregister + free
unregister_netdev(dev); // takes the RTNL lock itself; unregister_netdevice() expects the caller to hold it
free_netdev(dev);
net_device function pointers (driver fills these in probe())
struct net_device {
// Device driver must set (in xxx_probe):
int (*open)(struct net_device *dev); // ifconfig up
int (*stop)(struct net_device *dev); // ifconfig down
int (*hard_start_xmit)(struct sk_buff *skb, // transmit a packet
struct net_device *dev);
void (*tx_timeout)(struct net_device *dev); // called if TX hangs
int watchdog_timeo; // TX timeout interval, in jiffies
// Set by xxx_setup (Ethernet-common):
int (*change_mtu)(...);
int (*set_mac_address)(...);
int (*rebuild_header)(...);
// Maintained by the kernel once the device is registered:
unsigned long state; // device state flags (__LINK_STATE_*)
// Provided by the driver:
struct net_device_stats *(*get_stats)(...); // ifconfig stats
};
Driver state flags
// Test/set device state with netif_* API:
netif_carrier_on(dev); // cable connected
netif_carrier_off(dev); // cable disconnected
netif_start_queue(dev); // allow TX
netif_stop_queue(dev); // stop TX (buffer full)
netif_wake_queue(dev); // re-enable TX after stop
netif_running(dev); // test whether the device is up (__LINK_STATE_START in dev->state)
netif_queue_stopped(dev); // test if TX queue is stopped
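The stop/wake calls are normally paired around a fixed-size TX ring: the transmit routine stops the queue when it fills the last slot, and the TX-completion handler wakes it once a slot is free. A minimal user-space sketch of that flow control follows; the ring size, the `sim_nic` type, and the 0/1 return codes are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_TX_RING 4   /* illustrative ring size */

/* Hypothetical NIC state: free-running head/tail counters plus queue flag. */
struct sim_nic {
    unsigned int tx_head;     /* slots handed to the hardware */
    unsigned int tx_tail;     /* slots the hardware has completed */
    bool queue_stopped;       /* models the stopped-queue state bit */
};

static unsigned int sim_tx_in_flight(const struct sim_nic *nic)
{
    return nic->tx_head - nic->tx_tail;
}

/* Transmit one frame: returns 0 on success, 1 if the ring was already full. */
static int sim_hard_start_xmit(struct sim_nic *nic)
{
    if (sim_tx_in_flight(nic) == SIM_TX_RING) {
        nic->queue_stopped = true;           /* should not happen if stopped in time */
        return 1;
    }
    nic->tx_head++;
    if (sim_tx_in_flight(nic) == SIM_TX_RING)
        nic->queue_stopped = true;           /* netif_stop_queue analogue */
    return 0;
}

/* TX-done interrupt: reclaim one slot and let the stack transmit again. */
static void sim_tx_complete(struct sim_nic *nic)
{
    if (sim_tx_in_flight(nic) == 0)
        return;
    nic->tx_tail++;
    if (nic->queue_stopped)
        nic->queue_stopped = false;          /* netif_wake_queue analogue */
}
```

Stopping the queue as soon as the last slot is taken, rather than on the next failed transmit, keeps the busy return path a rare fallback instead of the normal signal.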
Interrupt and Softirq Architecture
Top half (ISR) vs. bottom half (softirq)
Hardware IRQ fires
└→ ISR (top half): MINIMAL work — disable IRQ, enqueue skb, schedule softirq
└→ mark_bh(NET_BH) [2.2] or raise_softirq(NET_RX_SOFTIRQ) [2.4+]
└→ net_rx_action softirq handler: process skb queue in bulk
Rule: ISR must be short. Complex protocol processing happens in the softirq.
softirq types for networking (2.4+)
enum {
NET_TX_SOFTIRQ, // triggered by hard_start_xmit failures / TX done
NET_RX_SOFTIRQ, // triggered by netif_rx / NAPI poll
};
// Multiple instances can run concurrently on different CPUs
// But only ONE instance per CPU at a time
netif_rx: old interrupt-driven receive path
// Driver ISR calls this after receiving a frame
netif_rx(skb); // enqueues to backlog, raises NET_RX_SOFTIRQ
// Processed later in net_rx_action softirq
NAPI: New API (poll-based, reduces interrupt overhead)
// For high-throughput NICs: driver registers a poll function
struct net_device {
int (*poll)(struct net_device *dev, int *budget); // NAPI poll
int quota; // max packets to process per poll call
};
// Driver ISR disables the NIC's RX interrupts itself, then calls instead of netif_rx:
netif_rx_schedule(dev); // puts device on the per-CPU poll list and raises NET_RX_SOFTIRQ
// net_rx_action calls dev->poll() in a loop
// poll() reads up to min(*budget, dev->quota) packets from NIC and calls netif_receive_skb() for each
// When no more packets: netif_rx_complete(dev), then the driver re-enables RX interrupts
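The budget and quota bookkeeping in a poll routine is easy to get wrong, so here is a user-space sketch of the pattern described above. The `sim_napi_dev` type and its fields are hypothetical; the return convention (0 when the ring is drained, 1 when work remains) is modeled on the `poll(dev, budget)` signature shown in this section.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a NAPI-style device. */
struct sim_napi_dev {
    int rx_pending;        /* frames waiting in the simulated RX ring */
    int quota;             /* per-device share for this poll round */
    bool rx_irq_enabled;   /* RX interrupt mask state on the NIC */
    bool on_poll_list;     /* set by the scheduling step, cleared on complete */
    int delivered;         /* frames handed to the stack so far */
};

/* ISR analogue: frames arrived; mask RX interrupts and schedule polling. */
static void sim_rx_interrupt(struct sim_napi_dev *dev, int frames)
{
    dev->rx_pending += frames;
    if (dev->rx_irq_enabled) {
        dev->rx_irq_enabled = false;   /* driver masks RX interrupts */
        dev->on_poll_list = true;      /* netif_rx_schedule analogue */
    }
}

/* poll analogue: returns 0 when drained, 1 if more work remains. */
static int sim_poll(struct sim_napi_dev *dev, int *budget)
{
    int limit = *budget < dev->quota ? *budget : dev->quota;
    int done = 0;

    while (done < limit && dev->rx_pending > 0) {
        dev->rx_pending--;             /* netif_receive_skb analogue */
        dev->delivered++;
        done++;
    }
    *budget -= done;
    dev->quota -= done;

    if (dev->rx_pending == 0) {
        dev->on_poll_list = false;     /* netif_rx_complete analogue */
        dev->rx_irq_enabled = true;    /* driver unmasks RX interrupts */
        return 0;
    }
    return 1;                          /* stay on the poll list */
}
```

A poll routine that returns 0 without completing, or completes while frames remain, leaves the device either polled forever or silent with interrupts masked; the model makes both invariants explicit.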
IPv4 Receive Path
NIC ISR → netif_rx(skb) → NET_RX_SOFTIRQ
└→ net_rx_action → netif_receive_skb
└→ deliver to protocol handler by skb->protocol
└→ ip_rcv (ETH_P_IP) → IP header validation
└→ ip_route_input: route lookup
├→ Forward: ip_forward → ip_output
└→ Local: ip_local_deliver
└→ deliver to TCP/UDP/ICMP by the IP header's protocol field (iph->protocol)
Key IPv4 forwarding functions
ip_rcv() // entry point from L2; validates IP header
ip_route_input() // route lookup: local delivery or forward?
ip_forward() // decrement TTL, check MTU and options, send via the route chosen by ip_route_input
ip_local_deliver() // deliver to L4 (TCP/UDP/etc.)
ip_output() // outbound packet: fragment if needed, pass to L2
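Two of these steps are plain arithmetic and can be shown in user space: the header validation done on entry and the TTL decrement done when forwarding. The sketch below works on raw bytes to stay independent of host byte order. The `sim_*` names are hypothetical, and the checksum is recomputed in full after the TTL change for clarity, whereas the kernel updates it incrementally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 one's-complement checksum over an IPv4 header (hdr_len is even). */
static uint16_t sim_ip_checksum(const uint8_t *hdr, size_t hdr_len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < hdr_len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Zero the checksum field, recompute it, and store it big-endian. */
static void sim_ip_set_checksum(uint8_t *pkt)
{
    uint16_t c;

    pkt[10] = pkt[11] = 0;
    c = sim_ip_checksum(pkt, (size_t)(pkt[0] & 0x0f) * 4);
    pkt[10] = (uint8_t)(c >> 8);
    pkt[11] = (uint8_t)(c & 0xff);
}

/* Sanity checks in the spirit of ip_rcv: returns 1 if acceptable, else 0. */
static int sim_ip_header_ok(const uint8_t *pkt, size_t len)
{
    unsigned int ihl, tot_len;

    if (len < 20)
        return 0;
    ihl = pkt[0] & 0x0f;
    if ((pkt[0] >> 4) != 4 || ihl < 5 || len < ihl * 4)
        return 0;
    if (sim_ip_checksum(pkt, ihl * 4) != 0)   /* valid header sums to zero */
        return 0;
    tot_len = (unsigned int)((pkt[2] << 8) | pkt[3]);
    if (tot_len < ihl * 4 || tot_len > len)
        return 0;
    return 1;
}

/* Forwarding step: returns -1 if the TTL would expire, else 0. */
static int sim_ip_forward_ttl(uint8_t *pkt)
{
    if (pkt[8] <= 1)
        return -1;        /* a router drops and reports time exceeded */
    pkt[8]--;
    sim_ip_set_checksum(pkt);
    return 0;
}
```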
ARP / Neighboring Subsystem
Neighbor states (NUD: Neighbor Unreachability Detection)
INCOMPLETE → sent ARP request, waiting for reply
REACHABLE → recently confirmed reachable (validity timer running)
STALE → reachable timer expired; still usable but will re-probe
DELAY → in grace period before probing
PROBE → sending ARP requests to confirm reachability
FAILED → ARP failed; packet dropped
NOARP → ARP not needed (e.g., loopback, point-to-point)
PERMANENT → static entry, never expires
ARP flow
1. ip_output needs MAC address for next hop
2. Looks up neighbour entry: neigh_lookup()
3. If STALE/DELAY → send packet but schedule reachability probe
4. If INCOMPLETE → queue packet, send ARP REQUEST
5. On ARP REPLY: update neighbour, flush queued packets
6. Gratuitous ARP: device sends ARP with own IP to update neighbors' caches
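The states and flow above can be condensed into a transition function. This is a simplified sketch under stated assumptions: the event set (send, confirm, timeout, retries exhausted) and all `sim_*` names are hypothetical, timers are reduced to a single timeout event, and NONE stands for an entry that has not started resolution yet.

```c
#include <assert.h>

/* Hypothetical mirror of the neighbor states listed above. */
enum sim_nud {
    SIM_NUD_NONE, SIM_NUD_INCOMPLETE, SIM_NUD_REACHABLE, SIM_NUD_STALE,
    SIM_NUD_DELAY, SIM_NUD_PROBE, SIM_NUD_FAILED, SIM_NUD_NOARP,
    SIM_NUD_PERMANENT
};

/* Simplified event set (an assumption of this sketch). */
enum sim_nud_event {
    SIM_EV_SEND,               /* a packet needs this neighbor */
    SIM_EV_CONFIRM,            /* ARP reply or other reachability proof */
    SIM_EV_TIMEOUT,            /* the timer for the current state expired */
    SIM_EV_RETRIES_EXHAUSTED   /* all solicitations went unanswered */
};

static enum sim_nud sim_nud_next(enum sim_nud s, enum sim_nud_event ev)
{
    switch (s) {
    case SIM_NUD_NONE:
    case SIM_NUD_FAILED:
        return ev == SIM_EV_SEND ? SIM_NUD_INCOMPLETE : s;
    case SIM_NUD_INCOMPLETE:
        if (ev == SIM_EV_CONFIRM)
            return SIM_NUD_REACHABLE;
        return ev == SIM_EV_RETRIES_EXHAUSTED ? SIM_NUD_FAILED : s;
    case SIM_NUD_REACHABLE:
        return ev == SIM_EV_TIMEOUT ? SIM_NUD_STALE : s;
    case SIM_NUD_STALE:
        if (ev == SIM_EV_CONFIRM)
            return SIM_NUD_REACHABLE;
        return ev == SIM_EV_SEND ? SIM_NUD_DELAY : s;
    case SIM_NUD_DELAY:
        if (ev == SIM_EV_CONFIRM)
            return SIM_NUD_REACHABLE;
        return ev == SIM_EV_TIMEOUT ? SIM_NUD_PROBE : s;
    case SIM_NUD_PROBE:
        if (ev == SIM_EV_CONFIRM)
            return SIM_NUD_REACHABLE;
        return ev == SIM_EV_RETRIES_EXHAUSTED ? SIM_NUD_FAILED : s;
    case SIM_NUD_NOARP:
    case SIM_NUD_PERMANENT:
        return s;              /* never change through NUD events */
    }
    return s;
}

/* States in which a cached L2 address may be used to transmit right away. */
static int sim_nud_can_transmit(enum sim_nud s)
{
    return s == SIM_NUD_REACHABLE || s == SIM_NUD_STALE ||
           s == SIM_NUD_DELAY || s == SIM_NUD_PROBE ||
           s == SIM_NUD_NOARP || s == SIM_NUD_PERMANENT;
}
```

When debugging a resolution failure, walking the observed `ip neigh show` states through a table like this shows which event never arrived: an entry stuck in INCOMPLETE never saw a reply, while one cycling STALE → DELAY → PROBE → FAILED lost reachability after having resolved once.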
User-space ARP management
arp -n # show ARP cache
arp -s 192.168.1.1 aa:bb:cc:dd:ee:ff # static entry
ip neigh show # modern equivalent
ip neigh add 192.168.1.1 lladdr aa:bb:cc:dd:ee:ff dev eth0 nud permanent
Routing Subsystem
Routing lookup flow
ip_route_input(skb, dst, src, tos, dev)
└→ rt_hash_code(dst, src, tos) → check route cache
├→ Cache hit: use cached route (dst_entry)
└→ Cache miss: fib_lookup (FIB = Forwarding Information Base)
└→ fib_rules → select FIB table
└→ fib_table_lookup → find best route (longest prefix)
└→ insert into route cache
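The "best route" step is a longest-prefix match. The sketch below does it with a linear scan over a small table, which is a simplification chosen for readability rather than the kernel's data structure; the `sim_route` type, the `SIM_IP` macro, and the device names are hypothetical, and addresses are kept in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a host-order IPv4 address from dotted-quad parts. */
#define SIM_IP(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical route entry: prefix/len → outgoing device name. */
struct sim_route {
    uint32_t prefix;
    unsigned int prefix_len;    /* 0..32 */
    const char *dev;
};

static uint32_t sim_mask(unsigned int len)
{
    return len == 0 ? 0 : 0xffffffffu << (32 - len);
}

/* Return the matching route with the longest prefix, or NULL if none match. */
static const struct sim_route *sim_fib_lookup(const struct sim_route *tbl,
                                              size_t n, uint32_t dst)
{
    const struct sim_route *best = NULL;

    for (size_t i = 0; i < n; i++) {
        uint32_t mask = sim_mask(tbl[i].prefix_len);

        if ((dst & mask) != (tbl[i].prefix & mask))
            continue;
        if (!best || tbl[i].prefix_len > best->prefix_len)
            best = &tbl[i];
    }
    return best;
}
```

A packet leaving through an unexpected interface is often explained by this rule alone: a more specific prefix wins over a shorter one regardless of the order in which the routes were added.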
Route cache entry (dst_entry / rtable)
struct rtable {
struct dst_entry dst; // nexthop, output function, etc.
__u32 rt_src; // source address
__u32 rt_dst; // destination address
struct in_device *idev; // IPv4 configuration of the egress device (the device itself is dst.dev)
};
// dst_entry contains:
struct dst_entry {
int (*input)(struct sk_buff *skb); // RX side: ip_local_deliver or ip_forward
int (*output)(struct sk_buff *skb); // TX side: ip_output (ip_mc_output for multicast)
struct neighbour *neighbour; // L2 next hop (ARP resolved)
u32 metrics[RTAX_MAX]; // route metrics, including the path MTU (RTAX_MTU)
};
Procfs routing controls
# Enable IP forwarding (router mode)
echo 1 > /proc/sys/net/ipv4/ip_forward
# View routing table
ip route show
route -n
# View the route cache (cached lookup results, not the FIB itself)
cat /proc/net/rt_cache
# ARP tuning
/proc/sys/net/ipv4/neigh/eth0/gc_stale_time # how often to check for stale entries (seconds)
/proc/sys/net/ipv4/neigh/eth0/base_reachable_time # base value for the randomized NUD reachable timer (seconds)
Kernel Coding Patterns
do_something vs __do_something
// Convention: double-underscore version = no locks, no checks
// plain (no-underscore) version = wrapper that adds locking, sanity checks
// Call the __version directly only when you already hold the lock
kfree_skb(skb); // safe, checks refcount
__kfree_skb(skb); // internal, assumes refcount == 1
dev_queue_xmit(skb); // public transmit with all checks
__dev_queue_xmit(skb); // internal variant
Notification chains (event callbacks)
// Register for device events
register_netdevice_notifier(&my_notifier);
// Notification events:
NETDEV_UP // interface brought up
NETDEV_DOWN // interface going down
NETDEV_REGISTER // new device registered
NETDEV_UNREGISTER // device being removed
NETDEV_CHANGEADDR // MAC address changed
NETDEV_CHANGEMTU // MTU changed
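A notification chain is a priority-ordered linked list of callbacks that the event source walks on every event. The user-space sketch below shows the mechanism; the `sim_*` names, the event constants, and the counting subscriber are hypothetical, and locking is left out.

```c
#include <assert.h>
#include <stddef.h>

enum { SIM_NOTIFY_OK = 0, SIM_NOTIFY_STOP = 1 };
enum { SIM_NETDEV_UP = 1, SIM_NETDEV_DOWN = 2 };

/* Hypothetical counterpart of a notifier block: callback, link, priority. */
struct sim_notifier {
    int (*call)(struct sim_notifier *nb, unsigned long event, void *data);
    struct sim_notifier *next;
    int priority;
};

/* Insert so that higher priorities run first; equal priorities keep order. */
static void sim_notifier_register(struct sim_notifier **chain,
                                  struct sim_notifier *nb)
{
    while (*chain && (*chain)->priority >= nb->priority)
        chain = &(*chain)->next;
    nb->next = *chain;
    *chain = nb;
}

/* Deliver an event to every subscriber until one asks to stop. */
static int sim_notifier_call_chain(struct sim_notifier *chain,
                                   unsigned long event, void *data)
{
    int ret = SIM_NOTIFY_OK;

    for (struct sim_notifier *nb = chain; nb; nb = nb->next) {
        ret = nb->call(nb, event, data);
        if (ret == SIM_NOTIFY_STOP)
            break;
    }
    return ret;
}

/* Example subscriber: counts interface-up events. */
static int sim_up_count;

static int sim_count_up(struct sim_notifier *nb, unsigned long event, void *data)
{
    (void)nb;
    (void)data;
    if (event == SIM_NETDEV_UP)
        sim_up_count++;
    return SIM_NOTIFY_OK;
}
```

Subscribers see every event on the chain and filter for the ones they care about, which is why a handler typically starts with a switch on the event code.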
procfs/sysctl for configuration
Pattern: every subsystem exposes tunables via /proc/sys/net/
- /proc/sys/net/ipv4/ — IPv4 tuning
- /proc/sys/net/core/ — core network settings
- /proc/net/ — runtime statistics (read-only)