Understanding the Cilium Datapath: How Packets Actually Move Between Pods

I traced every packet path through Cilium's datapath: veth pairs, eBPF hooks, native routing with auto injection, static routes, live BGP with FRR and ECMP, VXLAN encapsulation, and Geneve. Four kind clusters, real tcpdump captures, all from my lab.


I wanted to understand exactly what happens to a packet after it leaves a pod. Not the high-level "Cilium uses eBPF" explanation you find in every overview talk. The actual path. Which interfaces does the packet touch? What eBPF programs fire? When does encapsulation happen, and when doesn't it? How do routes get populated, and what breaks when they don't?

So I built it. Four different kind clusters, four different routing configurations, packet captures on every one of them. This post covers intranode forwarding through veth pairs, internode native routing with three different route propagation methods (auto injection, static routes, and live BGP with FRR), VXLAN tunnel encapsulation with full packet dissection, and Geneve as an alternative overlay. Everything runs on Docker Desktop with kind on Apple Silicon. Every command in this post produced real output from my lab.


The Linux Primitives That Make This Work

Before tracing any packets, you need to understand two kernel constructs that Cilium builds on top of.

Network namespaces give each pod an isolated copy of the networking stack: its own interfaces, routing table, and firewall rules. The host node also has a network namespace shared by system processes, including the Cilium agent.

Veth pairs are virtual Ethernet devices that come in linked pairs. Whatever enters one end comes out the other. Cilium uses them to connect each pod's namespace to the host namespace. Inside the pod, you see eth0. On the host, you see the paired lxc* interface. Cilium attaches eBPF programs to the host side of every veth pair, so packets are inspected and forwarded the moment they cross the namespace boundary.
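The per-pod plumbing is easy to reproduce by hand. A minimal sketch of what Cilium sets up for each pod, using plain iproute2 (requires root; the names `demo-pod`, `lxc-demo`, and the address are made up for illustration, and the eBPF attachment itself is left out):

```shell
# Create a "pod" namespace and a veth pair bridging it to the host namespace
ip netns add demo-pod
ip link add lxc-demo type veth peer name eth0 netns demo-pod

# Host side up: this is the end where Cilium would attach cil_from_container
ip link set lxc-demo up

# Pod side up, with a /32 address like Cilium assigns from the node's pool
ip -n demo-pod link set eth0 up
ip -n demo-pod address add 10.244.1.99/32 dev eth0
```

Anything the pod transmits on its `eth0` appears on `lxc-demo` in the host namespace, which is why attaching an eBPF program there catches every packet at the namespace boundary.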

Intranode: Two Pods on the Same Node

I started with the simplest case. Two pods, same node, one ping. The cluster runs Cilium with VXLAN tunnel mode (the default), but intranode traffic never touches the tunnel. It stays entirely within the host namespace.

Phase 2: Intranode Connectivity
Verify pods land on the same node
$ kubectl get pods -o wide
NAME    READY   STATUS    AGE   IP             NODE
pod-a   1/1     Running   31s   10.244.1.98    kind-worker
pod-b   1/1     Running   31s   10.244.1.200   kind-worker
$ kubectl exec pod-a -- ping -c 3 10.244.1.200
64 bytes from 10.244.1.200: icmp_seq=1 ttl=63 time=0.085 ms
64 bytes from 10.244.1.200: icmp_seq=2 ttl=63 time=0.062 ms
64 bytes from 10.244.1.200: icmp_seq=3 ttl=63 time=0.179 ms
3 packets transmitted, 3 received, 0% packet loss
Cilium interfaces and eBPF hooks on kind-worker
$ docker exec kind-worker ip address show cilium_host
13: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP> mtu 65520
    inet 10.244.1.77/32 scope global cilium_host
$ bpftool net show | grep -E 'cilium_host|cilium_net|lxc_health'
cilium_net(12)  tcx/ingress  cil_to_host
cilium_host(13) tcx/ingress  cil_to_host
cilium_host(13) tcx/egress   cil_from_host
lxc_health(16)  tcx/ingress  cil_from_container
$ docker exec kind-worker ip route | grep 10.244
10.244.0.0/24 via 10.244.1.77 dev cilium_host # overlay: routes → cilium_host
10.244.1.0/24 via 10.244.1.77 dev cilium_host
10.244.2.0/24 via 10.244.1.77 dev cilium_host

Every lxc* interface has cil_from_container attached at ingress. "Ingress on the host-side veth" means "packets leaving the pod." The moment a packet exits pod-a, it hits the eBPF program before it even enters the host namespace. Cilium makes its forwarding decision right there, at the earliest possible point in the stack.

The path for intranode traffic is: pod-a eth0 → host lxc*(pod-a) → eBPF cil_from_container → host lxc*(pod-b) → pod-b eth0. No tunnel, no encapsulation, no extra headers. Pure in-kernel forwarding through the host namespace.

Internode: Native Routing

When pods are on different nodes, the packet has to cross the physical (or virtual) network between them. Cilium supports two models for this: native routing and encapsulation. Native routing means the packet leaves the node with its pod IP visible on the wire. The underlay has to know how to forward it.

There are three ways nodes can learn about each other's PodCIDRs in native routing mode. I labbed all three.

3A: Auto Route Injection

This is the simplest option. You set autoDirectNodeRoutes: true in the Helm values, and Cilium automatically populates every node's routing table with routes to remote PodCIDRs, using node IPs as next hops. The catch is that all nodes must be on the same L2 network.
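For reference, a minimal Helm values sketch for this mode. The value names follow recent Cilium charts and are my reconstruction, not the exact lab config:

```yaml
# Assumed Helm values for native routing with auto route injection
routingMode: native                  # no VXLAN/Geneve encapsulation
autoDirectNodeRoutes: true           # inject routes to remote PodCIDRs via node IPs
ipv4NativeRoutingCIDR: 10.10.0.0/16  # pod range the underlay must carry unmasqueraded
ipam:
  mode: multi-pool
```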

I deployed Cilium with multi-pool IPAM (10.10.0.0/16, /27 per node) in native routing mode:

Phase 3A: Auto Route Injection (Native Routing)
CiliumNode /27 pool allocations
$ kubectl get ciliumnodes kind-worker -o yaml | yq .spec.ipam.pools
allocated:
  - cidrs:
      - 10.10.0.0/27
    pool: default
Routes point to node IPs via eth0 (not cilium_host)
$ docker exec kind-worker ip route | grep 10.10
10.10.0.13 dev lxc3c2ed4711743 proto kernel scope link
10.10.0.32/27 via 172.18.0.3 dev eth0 proto kernel # → kind-worker2
10.10.0.64/27 via 172.18.0.4 dev eth0 proto kernel # → control-plane
No tunnel interfaces in native mode
$ docker exec kind-worker ip link show cilium_vxlan
Device "cilium_vxlan" does not exist.
tcpdump: raw pod IPs on the wire, no encapsulation
$ tcpdump -n -i eth0 icmp   (during cross-node ping)
17:12:17 10.10.0.13 > 10.10.0.42: ICMP echo request
17:12:17 10.10.0.42 > 10.10.0.13: ICMP echo reply
17:12:18 10.10.0.13 > 10.10.0.42: ICMP echo request
17:12:18 10.10.0.42 > 10.10.0.13: ICMP echo reply
No VXLAN. No UDP/8472. No outer IP. Native routing.
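The /27 allocations shown above are plain CIDR math: a 10.10.0.0/16 pool carved into /27 blocks yields 2048 possible node allocations of 32 addresses each, handed out in order. Python's ipaddress module reproduces the exact pools the lab nodes received:

```python
import ipaddress

# The cluster-wide pod pool, split into per-node /27 allocations
pool = ipaddress.ip_network("10.10.0.0/16")
node_pools = list(pool.subnets(new_prefix=27))

print(len(node_pools))                      # 2048 possible node allocations
print([str(n) for n in node_pools[:3]])     # first three, as seen on the nodes
# → ['10.10.0.0/27', '10.10.0.32/27', '10.10.0.64/27']
```

Those first three match the lab exactly: kind-worker got 10.10.0.0/27, kind-worker2 got 10.10.0.32/27, and the control plane got 10.10.0.64/27.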

Compare these routes to the Phase 2 overlay routing table. In overlay mode, remote PodCIDRs point to cilium_host. In native mode, they point to node IPs via eth0. That routing table difference tells you instantly which mode you're in.
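The distinction is mechanical enough to check in code. A toy classifier over `ip route` output lines (my illustration, not anything Cilium ships):

```python
def routing_mode(route_line: str) -> str:
    """Classify a pod-CIDR route as overlay or native from its output device."""
    if "dev cilium_host" in route_line:
        return "overlay"   # remote PodCIDRs funnel through cilium_host
    if " via " in route_line and "dev eth0" in route_line:
        return "native"    # next hop is the remote node's IP on the underlay
    return "unknown"       # e.g. a local per-endpoint lxc* route

# Lines taken from the Phase 2 and Phase 3A routing tables above
print(routing_mode("10.244.2.0/24 via 10.244.1.77 dev cilium_host"))        # overlay
print(routing_mode("10.10.0.32/27 via 172.18.0.3 dev eth0 proto kernel"))   # native
```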

3B: Static Routing (Breaking Things on Purpose)

I deployed a fresh cluster with autoDirectNodeRoutes: false to prove what happens when no one tells the nodes about remote PodCIDRs:

Phase 3B: Static Routing (Prove the Failure)
No remote PodCIDR routes without autoDirectNodeRoutes
$ kubectl exec pod-worker -- ping -c 2 -W 3 10.10.0.44
PING 10.10.0.44 (10.10.0.44) 56(84) bytes of data.

--- 10.10.0.44 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1056ms
Add static routes manually, then retest
$ docker exec kind-worker ip route add 10.10.0.32/27 via 172.18.0.3
(route added)
$ kubectl exec pod-worker -- ping -c 3 10.10.0.44
64 bytes from 10.10.0.44: icmp_seq=1 ttl=62 time=0.229 ms
64 bytes from 10.10.0.44: icmp_seq=2 ttl=62 time=0.371 ms
64 bytes from 10.10.0.44: icmp_seq=3 ttl=62 time=0.237 ms
3 packets transmitted, 3 received, 0% packet loss

The point of this exercise is to feel why static routing doesn't scale. Every node needs routes to every other node's PodCIDR. Adding or removing a node means updating every routing table in the cluster by hand.
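Assuming each node holds one route per remote node's PodCIDR, the bookkeeping grows quadratically with cluster size:

```python
def static_routes_needed(nodes: int) -> int:
    # Each of the N nodes needs a route to every one of the other N-1 PodCIDRs
    return nodes * (nodes - 1)

print(static_routes_needed(3))    # 6 routes for this three-node lab
print(static_routes_needed(50))   # 2450 routes to maintain by hand
```

Adding one node to a 50-node cluster means touching all 50 existing routing tables. That is exactly the problem BGP exists to solve, which is where the next phase picks up.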
