Understanding Linux Network Internals

Understanding Linux Network Internals · Christian Benvenuti · 1280 pages

Deep dive into Linux kernel networking internals: sk_buff socket buffer lifecycle (allocation, pointer manipulation, stack traversal), net_device NIC driver registration, softirq receive/transmit path (NAPI), IPv4 forwarding and fragmentation, ARP/neighboring subsystem state machine, and routing table FIB lookup with route cache.

Capabilities (6)
  • Allocate and manipulate sk_buff socket buffers using skb_reserve/skb_put/skb_push/skb_pull for zero-copy packet construction
  • Register NIC drivers via alloc_etherdev/register_netdev and implement the net_device function pointer interface
  • Implement NAPI poll-based packet receive to reduce interrupt overhead on high-throughput NICs
  • Trace the IPv4 receive path from NIC ISR through ip_rcv → ip_route_input → ip_forward/ip_local_deliver
  • Understand ARP NUD state machine (INCOMPLETE/REACHABLE/STALE/DELAY/PROBE/FAILED) and neighbor resolution
  • Navigate the routing subsystem: FIB lookup, route cache (dst_entry), and procfs tuning knobs

How to use

Install this skill and Claude can trace a network packet's complete journey through the Linux kernel from NIC ISR to socket delivery, write correct sk_buff manipulation code for zero-copy packet construction, implement NAPI-compliant receive paths, debug ARP resolution failures via the NUD state machine, and analyze FIB routing decisions.

Why it matters

Issues that don't surface at the socket API level (sk_buff ownership bugs, NAPI scheduling errors, ARP state machine deadlocks) require kernel internals knowledge to diagnose. This skill helps avoid whole classes of kernel panics and memory corruption in network driver and packet processing code.

Example use cases
  • Implement a NAPI-compliant receive path for a simulated NIC driver including ISR scheduling, the poll() function with budget enforcement, and netif_rx_complete on drain
  • Identify why a custom protocol driver is corrupting packet data by tracing incorrect use of skb_push vs. skb_put and missing skb_reserve headroom alignment
  • Explain why packets destined for a specific subnet are being forwarded out the wrong interface by tracing FIB lookup priority and dst_entry output function selection

Linux Network Internals Skill

sk_buff: The Kernel Socket Buffer

Every network packet in the Linux kernel is represented by an sk_buff structure. It has two separate memory allocations:

  1. The sk_buff header (metadata)
  2. The data buffer (actual packet bytes)

Buffer lifecycle

// Allocate buffer — use in process context
struct sk_buff *skb = alloc_skb(MAX_HEADER + payload_size, GFP_KERNEL);

// Allocate buffer — use in interrupt context (driver ISR)
struct sk_buff *skb = dev_alloc_skb(length);
// dev_alloc_skb adds 16 bytes headroom and uses GFP_ATOMIC

// Free buffer (decrements users refcount; frees when reaches 0)
kfree_skb(skb);        // called by protocol layers
dev_kfree_skb(skb);    // alias for use in device drivers

Buffer pointer manipulation (Figure 2-4 idiom)

sk_buff pointers:
  head    → start of allocated buffer
  data    → start of current packet data (moves as headers added/removed)
  tail    → end of current packet data
  end     → end of allocated buffer (skb_shared_info lives here)
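  len     → data length (tail - data for a linear buffer; also counts paged fragments when present)

// Reserve headroom before writing (must be called while the buffer is still empty)
// Shifts the data and tail pointers forward by the requested number of bytes
skb_reserve(skb, NET_IP_ALIGN);  // typically 2 bytes, so the IP header after a 14-byte Ethernet header is 4-byte aligned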

// Add data to the TAIL of the buffer (returns pointer to new space)
void *ptr = skb_put(skb, payload_size);
memcpy(ptr, user_data, payload_size);

// Prepend a header at the HEAD: moves data back into the headroom and returns the new data pointer
struct iphdr *iph = (struct iphdr *)skb_push(skb, sizeof(struct iphdr));

// Remove header from the HEAD (move data pointer forward)
skb_pull(skb, sizeof(struct ethhdr));

Stack traversal pattern (TX direction)

TCP layer:
  1. alloc_skb, then skb_reserve(skb, MAX_TCP_HEADER) to leave worst-case headroom for all lower-layer headers
  2. Copy payload data to tail (skb_put)
  3. skb_push → write TCP header
IP layer:
  4. skb_push → write IP header
Link layer (dev->hard_header, e.g. eth_header):
  5. skb_push → write Ethernet header
Driver (hard_start_xmit):
  6. DMA to NIC
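
The pointer operations and the TX ordering above can be tried out in user space. The following is a minimal sketch, not kernel code: struct mini_skb and every mini_* function are hypothetical stand-ins that only mimic the head/data/tail/end arithmetic of a linear sk_buff, and the header sizes (14/20/20 bytes) assume Ethernet, IPv4 without options, and TCP without options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space stand-in for a linear struct sk_buff. */
struct mini_skb {
    unsigned char *head, *data, *tail, *end;
    size_t len;
};

/* Allocate `size` bytes; data and tail both start at head (empty buffer). */
struct mini_skb *mini_alloc(size_t size)
{
    struct mini_skb *skb = malloc(sizeof(*skb));
    if (!skb)
        return NULL;
    skb->head = malloc(size ? size : 1);
    if (!skb->head) {
        free(skb);
        return NULL;
    }
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    skb->len = 0;
    return skb;
}

void mini_free(struct mini_skb *skb)
{
    if (skb) {
        free(skb->head);
        free(skb);
    }
}

size_t mini_headroom(const struct mini_skb *skb) { return (size_t)(skb->data - skb->head); }
size_t mini_tailroom(const struct mini_skb *skb) { return (size_t)(skb->end - skb->tail); }

/* Like skb_reserve: only meaningful on an empty buffer. */
void mini_reserve(struct mini_skb *skb, size_t n)
{
    assert(skb->len == 0 && mini_tailroom(skb) >= n);
    skb->data += n;
    skb->tail += n;
}

/* Like skb_put: extend the data area at the tail, return the old tail. */
unsigned char *mini_put(struct mini_skb *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;
    assert(mini_tailroom(skb) >= n);   /* the kernel panics on overrun */
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Like skb_push: extend the data area into the headroom, return new data. */
unsigned char *mini_push(struct mini_skb *skb, size_t n)
{
    assert(mini_headroom(skb) >= n);   /* the kernel panics on underrun */
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Like skb_pull: strip n bytes from the front; NULL if the data is shorter. */
unsigned char *mini_pull(struct mini_skb *skb, size_t n)
{
    if (n > skb->len)
        return NULL;
    skb->data += n;
    skb->len -= n;
    return skb->data;
}

/* TX order: reserve headroom, put the payload, then push headers inside-out. */
struct mini_skb *mini_build_frame(const void *payload, size_t n)
{
    enum { SIM_ETH_HLEN = 14, SIM_IP_HLEN = 20, SIM_TCP_HLEN = 20 };
    size_t hdrs = SIM_ETH_HLEN + SIM_IP_HLEN + SIM_TCP_HLEN;
    struct mini_skb *skb = mini_alloc(hdrs + n);

    if (!skb)
        return NULL;
    mini_reserve(skb, hdrs);
    memcpy(mini_put(skb, n), payload, n);
    memset(mini_push(skb, SIM_TCP_HLEN), 0, SIM_TCP_HLEN);  /* TCP header */
    memset(mini_push(skb, SIM_IP_HLEN), 0, SIM_IP_HLEN);    /* IP header */
    memset(mini_push(skb, SIM_ETH_HLEN), 0, SIM_ETH_HLEN);  /* Ethernet header */
    return skb;
}
```

Calling mini_build_frame with a 5-byte payload leaves a 59-byte frame with no headroom left; a receiver would then walk the same buffer in the opposite direction with mini_pull. Swapping a push for a put (or forgetting the reserve) trips the assertions, which is the user-space analogue of the corruption bugs described earlier.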

Key sk_buff fields

struct sk_buff {
    struct sk_buff *next, *prev;   // sk_buff list
    struct net_device *dev;        // device that received/will send
    unsigned char *head, *data,    // buffer pointers
                  *tail, *end;
    unsigned int   len;            // data length
    unsigned char  pkt_type;       // PACKET_HOST/BROADCAST/MULTICAST/etc.
    unsigned short protocol;       // L3 protocol (ETH_P_IP, ETH_P_ARP, ...)
    unsigned int   priority;       // QoS class
    unsigned char  ip_summed;      // checksum status
    union { struct iphdr *iph; ... } nh;  // L3 header pointer
    union { struct tcphdr *th; ... } h;   // L4 header pointer
    unsigned char  cb[48];         // protocol control block (layer-specific scratch space)
};

pkt_type values

Value             Meaning
PACKET_HOST       L2 destination is this interface; process it
PACKET_MULTICAST  L2 destination is a multicast address the interface is registered for
PACKET_BROADCAST  L2 destination is the broadcast address of this LAN
PACKET_OTHERHOST  L2 destination is another host (e.g. seen in promiscuous mode); ip_rcv drops it
PACKET_OUTGOING   Packet being sent out
PACKET_LOOPBACK   Sent to loopback device

net_device: NIC Driver Registration

Allocation and registration

// Allocate net_device with driver private data
// "eth%d" → kernel assigns eth0, eth1, etc.
struct net_device *dev = alloc_netdev(sizeof(struct my_priv), "eth%d", ether_setup);
// ether_setup initializes Ethernet-common fields

// Convenience wrapper for Ethernet devices (same effect as the call above):
struct net_device *dev = alloc_etherdev(sizeof(struct my_priv));

// Register with kernel
int err = register_netdev(dev);

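// Unregister + free (unregister_netdev takes the RTNL lock itself;
// unregister_netdevice is the variant for callers that already hold it)
unregister_netdev(dev);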
free_netdev(dev);

net_device function pointers (driver fills these in probe())

struct net_device {
    // Device driver must set (in xxx_probe):
    int   (*open)(struct net_device *dev);        // ifconfig up
    int   (*stop)(struct net_device *dev);        // ifconfig down
    int   (*hard_start_xmit)(struct sk_buff *skb, // transmit a packet
                             struct net_device *dev);
    void  (*tx_timeout)(struct net_device *dev);  // called if TX hangs
    int   watchdog_timeo;                         // TX timeout interval (in jiffies)

    // Set by xxx_setup (Ethernet-common defaults, e.g. ether_setup):
    int   (*change_mtu)(...);
    int   (*set_mac_address)(...);
    int   (*rebuild_header)(...);

    // Initialized by the core during registration:
    unsigned long state;                          // device state flags
    // Supplied by the driver, read for ifconfig stats:
    struct net_device_stats *(*get_stats)(...);
};

Driver state flags

// Test/set device state with netif_* API:
netif_carrier_on(dev);     // cable connected
netif_carrier_off(dev);    // cable disconnected
netif_start_queue(dev);    // allow TX
netif_stop_queue(dev);     // stop TX (buffer full)
netif_wake_queue(dev);     // re-enable TX after stop
netif_running(dev);        // test whether the device is up (__LINK_STATE_START bit in dev->state)
netif_queue_stopped(dev);  // test if TX queue is stopped

Interrupt and Softirq Architecture

Top half (ISR) vs. bottom half (softirq)

Hardware IRQ fires
  └→ ISR (top half): MINIMAL work — disable IRQ, enqueue skb, schedule softirq
         └→ mark_bh(NET_BH)  [2.2] or raise_softirq(NET_RX_SOFTIRQ)  [2.4+]
             └→ net_rx_action softirq handler: process skb queue in bulk

Rule: ISR must be short. Complex protocol processing happens in the softirq.

softirq types for networking (2.4+)

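enum {
    /* excerpt: HI_SOFTIRQ and TIMER_SOFTIRQ precede these in the full enum */
    NET_TX_SOFTIRQ,   // raised by netif_schedule (queue restarted) and dev_kfree_skb_irq (TX completion cleanup)
    NET_RX_SOFTIRQ,   // raised by netif_rx / netif_rx_schedule (NAPI)
};
// The same softirq can run concurrently on different CPUs,
// but only ONE instance runs per CPU at a time, so handlers work on per-CPU data (softnet_data)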

netif_rx: old interrupt-driven receive path

// Driver ISR calls this after receiving a frame
netif_rx(skb);   // enqueues to backlog, raises NET_RX_SOFTIRQ
// Processed later in net_rx_action softirq

NAPI: New API (poll-based, reduces interrupt overhead)

// For high-throughput NICs: driver registers a poll function
// (interface of 2.6 kernels before 2.6.24, which moved these fields into struct napi_struct)
struct net_device {
    int   (*poll)(struct net_device *dev, int *budget);  // NAPI poll
    int   quota;     // max packets to process per poll call
};

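// Driver ISR masks the NIC's RX interrupts itself, then calls (instead of netif_rx):
netif_rx_schedule(dev);  // adds the device to this CPU's poll list and raises NET_RX_SOFTIRQ

// net_rx_action calls dev->poll() for each device on the poll list
// poll() reads up to min(*budget, dev->quota) packets from the NIC and calls netif_receive_skb() for each
// When no more packets: netif_rx_complete(dev) removes the device from the poll list,
// and the driver re-enables RX interrupts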
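
The schedule/poll/complete cycle and its budget accounting can be sketched as a user-space simulation. Everything here is hypothetical (struct sim_nic, sim_isr, sim_poll, sim_net_rx_action); frames are just a counter, and the softirq loop is reduced to a single device, whereas the real net_rx_action also rotates between devices and enforces a time limit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simulated NIC: frames are counted, not stored. */
struct sim_nic {
    int rx_pending;   /* frames waiting in the RX ring */
    int quota;        /* per-device limit for one poll call */
    int irq_enabled;  /* 1 = RX interrupts unmasked */
    int scheduled;    /* 1 = device is on the poll list */
    int delivered;    /* frames handed to the stack so far */
};

/* ISR: mask RX interrupts and put the device on the poll list (once). */
void sim_isr(struct sim_nic *nic)
{
    if (!nic->scheduled) {        /* role of netif_rx_schedule_prep() */
        nic->irq_enabled = 0;     /* the driver masks its own RX interrupt */
        nic->scheduled = 1;       /* role of __netif_rx_schedule() */
    }
}

/* poll(): returns 0 when the ring is drained, 1 if work remains. */
int sim_poll(struct sim_nic *nic, int *budget)
{
    int limit, done = 0;

    assert(budget != NULL);
    limit = *budget < nic->quota ? *budget : nic->quota;

    while (done < limit && nic->rx_pending > 0) {
        nic->rx_pending--;        /* take one frame from the ring */
        nic->delivered++;         /* stands in for netif_receive_skb() */
        done++;
    }
    *budget -= done;              /* charge the global budget */
    nic->quota -= done;           /* charge the per-device quota */

    if (nic->rx_pending == 0) {
        nic->scheduled = 0;       /* role of netif_rx_complete() */
        nic->irq_enabled = 1;     /* driver unmasks RX interrupts again */
        return 0;
    }
    return 1;                     /* stay on the poll list */
}

/* One softirq run for a single device; returns the number of poll calls.
 * Simplification: the quota is refilled to `weight` before every call. */
int sim_net_rx_action(struct sim_nic *nic, int budget, int weight)
{
    int calls = 0;

    while (nic->scheduled && budget > 0) {
        nic->quota = weight;
        sim_poll(nic, &budget);
        calls++;
    }
    return calls;
}
```

With 100 pending frames, a budget of 64, and a weight of 16, the first run stops after four poll calls with interrupts still masked; the second run drains the remaining 36 frames and only then unmasks interrupts. Returning 0 from poll without completing, or completing while frames remain, are the kinds of NAPI scheduling errors the sketch makes easy to reproduce.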

IPv4 Receive Path

NIC ISR → netif_rx(skb) → NET_RX_SOFTIRQ
  └→ net_rx_action → netif_receive_skb
       └→ deliver to protocol handler by skb->protocol
           └→ ip_rcv (ETH_P_IP) → IP header validation
                └→ ip_route_input: route lookup
                     ├→ Forward: ip_forward → ip_output
                     └→ Local: ip_local_deliver
                          └→ deliver to TCP/UDP/ICMP by the protocol field of the IP header

Key IPv4 forwarding functions

ip_rcv()              // entry point from L2; validates IP header
ip_route_input()      // route lookup: local delivery or forward?
ip_forward()          // check and decrement TTL, handle options/redirects, pass to the route's output function
ip_local_deliver()    // deliver to L4 (TCP/UDP/etc.)
ip_output()           // outbound packet: fragment if needed, pass to L2
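
The header validation that ip_rcv performs before any routing decision can be approximated in user space. This is a sketch under assumptions, not the kernel's code: ipv4_checksum, ipv4_validate, and the SIM_IP_* verdicts are hypothetical names, the packet is a plain byte array rather than an sk_buff, and only the basic checks (minimum length, version, header length, header checksum, total length) are covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement checksum over an even number of header bytes.
 * Over a valid header (checksum field included) the result is 0. */
uint16_t ipv4_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    assert(hdr != NULL && (len & 1) == 0);
    for (size_t i = 0; i < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

enum sim_ip_verdict {
    SIM_IP_OK = 0,
    SIM_IP_TOO_SHORT,
    SIM_IP_BAD_VERSION,
    SIM_IP_BAD_IHL,
    SIM_IP_BAD_CSUM,
    SIM_IP_BAD_LEN
};

/* Sanity checks in roughly the order ip_rcv applies them. */
enum sim_ip_verdict ipv4_validate(const uint8_t *pkt, size_t len)
{
    size_t ihl, tot_len;

    if (len < 20)
        return SIM_IP_TOO_SHORT;            /* no room for a minimal header */
    if ((pkt[0] >> 4) != 4)
        return SIM_IP_BAD_VERSION;
    ihl = (size_t)(pkt[0] & 0x0f) * 4;      /* header length in bytes */
    if (ihl < 20)
        return SIM_IP_BAD_IHL;
    if (len < ihl)
        return SIM_IP_TOO_SHORT;            /* options run past the buffer */
    if (ipv4_checksum(pkt, ihl) != 0)
        return SIM_IP_BAD_CSUM;
    tot_len = ((size_t)pkt[2] << 8) | pkt[3];
    if (tot_len < ihl || tot_len > len)
        return SIM_IP_BAD_LEN;              /* truncated or inconsistent */
    return SIM_IP_OK;
}
```

A packet that passes these checks would then go to the route lookup; one that fails is dropped before ip_route_input is ever called, which is why a corrupted header never shows up as a routing problem.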

ARP / Neighboring Subsystem

Neighbor states (NUD: Neighbor Unreachability Detection)

INCOMPLETE → sent ARP request, waiting for reply
REACHABLE  → recently confirmed reachable (validity timer running)
STALE      → reachable timer expired; still usable but will re-probe
DELAY      → in grace period before probing
PROBE      → sending ARP requests to confirm reachability
FAILED     → ARP failed; packet dropped
NOARP      → ARP not needed (e.g., loopback, point-to-point)
PERMANENT  → static entry, never expires

ARP flow

1. ip_output needs the MAC address of the next hop
2. Looks up the neighbour entry with neigh_lookup(); creates one if none exists
3. If REACHABLE (or PERMANENT/NOARP) → send immediately with the cached address
4. If STALE/DELAY → send packet but schedule reachability probe
5. If INCOMPLETE → queue packet, send ARP REQUEST
6. On ARP REPLY: update neighbour, flush queued packets
7. Gratuitous ARP (independent of the steps above): device sends ARP with own IP to update neighbors' caches
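
The state list and the flow above can be condensed into a small transition function. The sketch below is a simplified model, not the kernel's neighbour code: the sim_* names, the three events (a packet to send, a reply or confirmation, a timer expiry), and the fixed limit of three probes are all assumptions made for illustration, and NOARP/PERMANENT entries are left out because they never change state.

```c
#include <assert.h>
#include <stddef.h>

enum sim_nud {
    SIM_NUD_NONE, SIM_NUD_INCOMPLETE, SIM_NUD_REACHABLE, SIM_NUD_STALE,
    SIM_NUD_DELAY, SIM_NUD_PROBE, SIM_NUD_FAILED
};

enum sim_neigh_event {
    SIM_EV_SEND,   /* a packet needs this neighbour's L2 address */
    SIM_EV_REPLY,  /* ARP reply or other reachability confirmation */
    SIM_EV_TIMER   /* the entry's timer expired */
};

#define SIM_MAX_PROBES 3   /* assumed retry limit */

struct sim_neigh {
    enum sim_nud state;
    int probes;            /* requests sent in the current attempt */
};

/* Apply one event and return the resulting state. */
enum sim_nud sim_neigh_event(struct sim_neigh *n, enum sim_neigh_event ev)
{
    assert(n != NULL);

    switch (ev) {
    case SIM_EV_REPLY:
        if (n->state != SIM_NUD_NONE) {     /* confirmed: address is valid */
            n->state = SIM_NUD_REACHABLE;
            n->probes = 0;
        }
        break;
    case SIM_EV_SEND:
        if (n->state == SIM_NUD_NONE || n->state == SIM_NUD_FAILED) {
            n->state = SIM_NUD_INCOMPLETE;  /* queue packet, send request */
            n->probes = 1;
        } else if (n->state == SIM_NUD_STALE) {
            n->state = SIM_NUD_DELAY;       /* send now, verify later */
        }
        break;
    case SIM_EV_TIMER:
        switch (n->state) {
        case SIM_NUD_REACHABLE:
            n->state = SIM_NUD_STALE;       /* confirmation aged out */
            break;
        case SIM_NUD_DELAY:
            n->state = SIM_NUD_PROBE;       /* grace period over: probe */
            n->probes = 1;
            break;
        case SIM_NUD_INCOMPLETE:
        case SIM_NUD_PROBE:
            if (n->probes < SIM_MAX_PROBES)
                n->probes++;                /* retransmit the request */
            else
                n->state = SIM_NUD_FAILED;  /* give up, drop queued packets */
            break;
        default:
            break;
        }
        break;
    }
    return n->state;
}
```

Feeding the function a send followed by three timer expiries ends in the failed state, while a reply at any point after the first request moves the entry to reachable. Walking a failing host through these events by hand is a quick way to tell an entry that never gets replies from one that is only aging normally.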

User-space ARP management

arp -n                     # show ARP cache
arp -s 192.168.1.1 aa:bb:cc:dd:ee:ff  # static entry
ip neigh show              # modern equivalent
ip neigh add 192.168.1.1 lladdr aa:bb:cc:dd:ee:ff dev eth0 nud permanent

Routing Subsystem

Routing lookup flow

ip_route_input(skb, dst, src, tos, dev)
  └→ rt_hash_code(dst, src, tos) → check route cache
       ├→ Cache hit: use cached route (dst_entry)
       └→ Cache miss: fib_lookup (FIB = Forwarding Information Base)
              └→ fib_rules → select FIB table
                   └→ fib_table_lookup → find best route (longest prefix)
                        └→ insert into route cache
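
The "longest prefix" rule at the bottom of this flow decides which interface a packet leaves through. A minimal user-space sketch follows, assuming a flat array of routes scanned linearly; struct sim_route, sim_mask, sim_fib_lookup, and the SIM_IP4 macro are hypothetical, and the kernel's FIB uses more elaborate data structures to reach the same answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a host-order IPv4 address from its four octets. */
#define SIM_IP4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical routing entry: prefix/plen → output device name. */
struct sim_route {
    uint32_t prefix;   /* network address, host byte order */
    int plen;          /* prefix length, 0..32 */
    const char *dev;   /* output interface */
};

/* Netmask for a prefix length; plen 0 is handled separately because
 * shifting a 32-bit value by 32 is undefined in C. */
uint32_t sim_mask(int plen)
{
    assert(plen >= 0 && plen <= 32);
    return plen == 0 ? 0 : (uint32_t)(0xffffffffu << (32 - plen));
}

/* Return the matching route with the longest prefix, or NULL if none. */
const struct sim_route *sim_fib_lookup(const struct sim_route *tbl,
                                       size_t n, uint32_t dst)
{
    const struct sim_route *best = NULL;

    for (size_t i = 0; i < n; i++) {
        uint32_t m = sim_mask(tbl[i].plen);

        if ((dst & m) != (tbl[i].prefix & m))
            continue;                      /* destination not in this prefix */
        if (!best || tbl[i].plen > best->plen)
            best = &tbl[i];                /* more specific match wins */
    }
    return best;
}
```

With routes for 0.0.0.0/0, 10.0.0.0/8, 10.1.0.0/16, and 10.1.2.0/24 on different interfaces, a packet to 10.1.2.3 picks the /24 even though all four prefixes match. A packet leaving through an unexpected interface is often explained by a more specific prefix that the operator overlooked.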

Route cache entry (dst_entry / rtable)

struct rtable {
    struct dst_entry dst;        // nexthop, output function, etc.
    __u32 rt_src;                // source address (plain 32-bit field in 2.6 kernels)
    __u32 rt_dst;                // destination address
    struct in_device *idev;      // IP configuration (in_device) of the route's device
};

// dst_entry contains:
struct dst_entry {
    int    (*input)(struct sk_buff *skb);   // ingress handler: ip_forward or ip_local_deliver
    int    (*output)(struct sk_buff *skb);  // egress handler: ip_output for unicast
    struct neighbour *neighbour;            // L2 next hop (ARP resolved)
    unsigned short   pmtu;                  // path MTU
};

Procfs routing controls

# Enable IP forwarding (router mode)
echo 1 > /proc/sys/net/ipv4/ip_forward

# View routing table
ip route show
route -n

# View route cache (the cache of FIB lookup results; removed from the kernel in 3.6)
cat /proc/net/rt_cache

# ARP tuning
/proc/sys/net/ipv4/neigh/eth0/gc_stale_time   # seconds an unused entry is kept before garbage collection may remove it
/proc/sys/net/ipv4/neigh/eth0/base_reachable_time  # base value (seconds) for the randomized REACHABLE lifetime

Kernel Coding Patterns

do_something vs __do_something

// Convention: double-underscore version = no locks, no checks
//             plain-named version = wrapper that adds locking and sanity checks
// Call the __ version directly only when you already hold the lock or have done the checks
kfree_skb(skb);          // safe: drops one reference, frees only when the users count reaches 0
__kfree_skb(skb);        // internal: frees unconditionally, caller guarantees no other users

dev_queue_xmit(skb);     // public transmit with all checks
__dev_queue_xmit(skb);   // internal variant
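
The naming convention can be illustrated with a reference-counted buffer in user space. This is a sketch under assumptions: struct sim_buf and all sim_* functions are hypothetical, the counter is a plain int because the example is single-threaded (the kernel uses atomic operations), and the double-underscore name is used only to mirror the kernel style, since such identifiers are reserved in ordinary user-space C.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted buffer. */
struct sim_buf {
    int users;            /* number of holders */
};

int sim_frees;            /* counts releases so the behaviour is observable */

/* A fresh buffer starts with one reference, held by the caller. */
struct sim_buf *sim_buf_alloc(void)
{
    struct sim_buf *b = malloc(sizeof(*b));

    if (b)
        b->users = 1;
    return b;
}

/* Take an additional reference. */
void sim_buf_get(struct sim_buf *b)
{
    b->users++;
}

/* Internal variant: no NULL check, no refcount check. The caller must
 * already know that nobody else holds a reference. */
void __sim_buf_free(struct sim_buf *b)
{
    free(b);
    sim_frees++;
}

/* Public variant: NULL-safe, drops one reference, and calls the internal
 * variant only when the last reference is gone. */
void sim_buf_put(struct sim_buf *b)
{
    if (!b)
        return;
    assert(b->users > 0);     /* catches a put without a matching get */
    if (--b->users > 0)
        return;               /* someone else still uses the buffer */
    __sim_buf_free(b);
}
```

Calling the internal variant while a second holder still exists would free memory that is still in use, which is the shape of many sk_buff ownership bugs. The public variant makes the same call safe at the cost of one check.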

Notification chains (event callbacks)

// Register for device events
register_netdevice_notifier(&my_notifier);

// Notification events:
NETDEV_UP         // interface brought up
NETDEV_DOWN       // interface has been brought down (NETDEV_GOING_DOWN is sent just before)
NETDEV_REGISTER   // new device registered
NETDEV_UNREGISTER // device being removed
NETDEV_CHANGEADDR // MAC address changed
NETDEV_CHANGEMTU  // MTU changed
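
How a chain delivers one of these events to its subscribers can be sketched in user space. The code below is a simplified model with hypothetical names (struct sim_notifier, sim_notifier_register, sim_notifier_call_chain, and the SIM_* constants); it keeps the chain sorted by descending priority and lets a callback stop the walk, but it omits the locking the kernel needs.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_NOTIFY_OK   0   /* keep walking the chain */
#define SIM_NOTIFY_STOP 1   /* stop after this callback */

enum sim_netdev_event { SIM_NETDEV_UP = 1, SIM_NETDEV_DOWN };

/* Hypothetical counterpart of a notifier block. */
struct sim_notifier {
    int (*call)(struct sim_notifier *self, unsigned long event, void *data);
    struct sim_notifier *next;
    int priority;             /* higher priority runs first */
};

/* Insert nb so the chain stays sorted by descending priority;
 * entries with equal priority keep their registration order. */
void sim_notifier_register(struct sim_notifier **head, struct sim_notifier *nb)
{
    assert(head != NULL && nb != NULL && nb->call != NULL);
    while (*head && (*head)->priority >= nb->priority)
        head = &(*head)->next;
    nb->next = *head;
    *head = nb;
}

/* Deliver the event to every subscriber until one asks to stop.
 * Returns the value of the last callback that ran. */
int sim_notifier_call_chain(struct sim_notifier *head,
                            unsigned long event, void *data)
{
    int ret = SIM_NOTIFY_OK;

    for (struct sim_notifier *nb = head; nb; nb = nb->next) {
        ret = nb->call(nb, event, data);
        if (ret == SIM_NOTIFY_STOP)
            break;
    }
    return ret;
}

/* Example subscribers: they log their own priority in call order. */
int sim_log[8];
int sim_log_len;

int sim_record_cb(struct sim_notifier *self, unsigned long event, void *data)
{
    (void)event;
    (void)data;
    if (sim_log_len < 8)
        sim_log[sim_log_len++] = self->priority;
    return SIM_NOTIFY_OK;
}

int sim_veto_cb(struct sim_notifier *self, unsigned long event, void *data)
{
    sim_record_cb(self, event, data);
    return SIM_NOTIFY_STOP;       /* later subscribers are not called */
}
```

Registering subscribers with priorities 0 and 10 and then raising SIM_NETDEV_UP calls the priority-10 callback first regardless of registration order. Subsystems such as routing and ARP rely on the same mechanism to react when a device goes down.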

procfs/sysctl for configuration

Pattern: every subsystem exposes tunables via /proc/sys/net/
- /proc/sys/net/ipv4/    — IPv4 tuning
- /proc/sys/net/core/    — core network settings
- /proc/net/             — runtime statistics (read-only)