Replacing iptables with eBPF in Kubernetes with Cilium

Cilium, eBPF, Envoy, Istio, Hubble
Michal Rostecki
Software Engineer
mrostecki@suse.com
mrostecki@opensuse.org
Swaminathan Vasudevan
Software Engineer
svasudevan@suse.com

22
What’s wrong with iptables?

3
What’s wrong with legacy iptables?
iptables runs into several significant problems:
● Updates must be made by recreating and updating all rules in a single transaction.
● Chains of rules are implemented as a linked list, so almost all operations are O(n).
● Access control lists (ACLs) are conventionally implemented in iptables as a sequential list of rules.
● Matching is based on IPs and ports, with no awareness of L7 protocols.
● Every time there is a new IP or port to match, rules need to be added and the chain changed.
● Resource consumption is high on Kubernetes.

4
Complexity of iptables
● Linked list.
● All rules in the chain have to be replaced as a whole.
Rule 1 → Rule 2 → ... → Rule n
Search O(n), Insert O(1), Delete O(n)
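The cost of a sequential rule chain can be sketched in a few lines. This is only an illustration of first-match evaluation over a list, not how iptables is implemented in the kernel; the `Rule` fields and the `first_match` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    # Hypothetical, simplified rule: match on destination IP and port only.
    dst_ip: str
    dst_port: int
    action: str  # "ACCEPT" or "DROP"


def first_match(chain, dst_ip, dst_port):
    """Walk the chain in order and return (action, rules_examined).

    The number of rules examined grows linearly with the chain length,
    which is the O(n) search cost from the slide.
    """
    for examined, rule in enumerate(chain, start=1):
        if rule.dst_ip == dst_ip and rule.dst_port == dst_port:
            return rule.action, examined
    return None, len(chain)


# 200 rules, one per destination; the last one is the most expensive to reach.
chain = [Rule(f"10.0.0.{i}", 80, "ACCEPT") for i in range(1, 201)]
action, examined = first_match(chain, "10.0.0.200", 80)
```

A packet that matches no rule is the worst case: every rule in the chain is examined before the lookup gives up.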

5
Kubernetes uses iptables for...
● kube-proxy: the component which implements Services and load balancing through DNAT iptables rules
● most CNI plugins use iptables for Network Policies

7
Linux Network Stack
Diagram layers, from hardware up to userspace: HW / Bridge / OVS; Netdevice / Drivers; Traffic Shaping; Ethernet; IPv4 / IPv6; Netfilter; TCP / UDP / Raw; Sockets; System Call Interface; Processes.
● The Linux kernel stack is split into multiple abstraction layers.
● Linux has kept strong userspace API compatibility for years.
● This shows how complex the Linux kernel is after years of evolution.
● It cannot be replaced in the short term.
● It is very hard to bypass the layers.
● The Netfilter module has been supported by Linux for more than two decades, and packet filtering has to be applied to packets that move up and down the stack.

8
Diagram: the same stack (HW / Bridge / OVS; Netdevice / Drivers; Traffic Shaping; Ethernet; IPv4 / IPv6; Netfilter; TCP / UDP / Raw; Sockets; System Call Interface; Processes), annotated with BPF attachment points:
● BPF system calls
● BPF sockmap and sockops
● BPF TC hooks
● BPF XDP
● BPF kernel hooks
● BPF cgroups

10
BPF replaces iptables
Diagram: the packet path between two netdevs (physical or virtual devices) and the local processes. iptables attaches to the netfilter hooks (PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING), which apply the FILTER and NAT tables around the routing decisions. eBPF code attaches at the TC hooks and the XDP hooks instead.

11
BPF based filtering architecture
Diagram: a packet from a netdev (physical or virtual device) enters the TC/XDP ingress hook, where an ingress chain selector sends it to the INGRESS chain [local dst] or the FORWARD chain [remote dst] before it continues to the Linux stack (NetFilter). Packets coming from the Linux stack (NetFilter) pass the TC egress hook, where an egress chain selector separates [local src] traffic, handled by the OUTPUT chain, from [remote src] traffic before the packet leaves through the netdev. The chains label packets and store or update sessions in a shared connection tracking component.

12
BPF based tail calls
Diagram: a chain of eBPF programs (#1, #2, #3, ... #N) connected by tail calls, running between "Packet in" (from eBPF hook) and "Packet out" (to eBPF hook). Stages in order:
● Headers parsing.
● IP.dst lookup: IP1 → bitv1, IP2 → bitv2, IP3 → bitv3.
● IP.proto lookup: * → bitv1, udp → bitv2, tcp → bitv3.
● ...
● Bitwise AND of the bit-vectors.
● Search for the first matching rule: rule1 → act1, rule2 → act2, rule3 → act3.
● Update counters: rule1 → cnt1, rule2 → cnt2.
● ACTION (drop/accept).
The packet header offsets and the bitvector with the temporary result live in a per-CPU array shared across the entire program chain.
● Each eBPF program can exploit a different matching algorithm (e.g., exact match, longest prefix match, etc.).
● Each eBPF program is injected only if there are rules operating on that field.
● LBVS (Linear Bit Vector Search) is implemented with a chain of eBPF programs, connected through tail calls.
● Header parsing is done once and results are kept in a shared map for performance reasons.
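The bit-vector search from this slide can be sketched in userspace Python. This is a simplified model under stated assumptions, not the eBPF implementation: each per-field lookup is a plain dict instead of a BPF map, the stages are function calls instead of tail calls, and the rule set, field names (`dst`, `proto`) and helper names are hypothetical. A field value of `None` stands for the wildcard `*`.

```python
def build_bitvectors(rules, field):
    """Map each concrete value of `field` to a bit-vector of matching rules.

    Bit i is set when rule i accepts that value. Wildcard rules (None)
    are folded into every entry and also returned separately, so that
    values absent from the table still match the wildcard rules.
    """
    table, wildcard = {}, 0
    for i, rule in enumerate(rules):
        value = rule[field]
        if value is None:
            wildcard |= 1 << i
        else:
            table[value] = table.get(value, 0) | (1 << i)
    for value in table:
        table[value] |= wildcard
    return table, wildcard


def classify(packet, rules, tables, default="drop"):
    """AND the per-field bit-vectors, then pick the first matching rule."""
    result = (1 << len(rules)) - 1  # start with every rule as a candidate
    for field, (table, wildcard) in tables.items():
        result &= table.get(packet[field], wildcard)
    if result == 0:
        return default
    first = (result & -result).bit_length() - 1  # index of lowest set bit
    return rules[first]["action"]


rules = [
    {"dst": "10.0.0.1", "proto": "tcp", "action": "accept"},
    {"dst": "10.0.0.2", "proto": None, "action": "drop"},
    {"dst": None, "proto": "udp", "action": "accept"},
]
tables = {field: build_bitvectors(rules, field) for field in ("dst", "proto")}
verdict = classify({"dst": "10.0.0.2", "proto": "udp"}, rules, tables)
```

Rules 2 and 3 both match the example packet; the lowest set bit selects rule 2, which preserves the first-match semantics of a sequential rule list while each field costs a single lookup.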

13
BPF goes into...
● Load balancers - katran
● perf
● systemd
● Suricata
● Open vSwitch - AF_XDP
● And many many others

17
CNI Functionality
CNI is a CNCF (Cloud Native Computing Foundation) project for Linux containers.
It consists of a specification and libraries for writing plugins.
It only cares about the network connectivity of containers:
● ADD/DEL
General container runtime considerations for CNI:
The container runtime must
● create a new network namespace for the container before invoking any plugins.
● determine the networks the container belongs to and add the container to each network by calling the corresponding plugins.
● not invoke parallel operations for the same container.
● order ADD and DEL operations for a container, such that ADD is always eventually followed by a corresponding DEL.
● not call ADD twice (without a corresponding DEL) for the same (network name, container id, name of the interface inside the container).
When a CNI ADD call is invoked, the plugin tries to add the container to the network by setting up the respective veth pair and assigning an IP address from the respective IPAM plugin or using the host scope.
When a CNI DEL call is invoked, the plugin tries to remove the container from the network, release the IP address to the IPAM manager, and clean up the veth pair.
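The ADD/DEL ordering rules above can be sketched as a small bookkeeping class on the runtime side. This is a hypothetical illustration, not part of any CNI library: the `AttachmentTracker` name and its methods are invented for the example, and treating a DEL without a prior ADD as a no-op is an assumption of this sketch.

```python
class AttachmentTracker:
    """Tracks which (network, container, interface) triples are attached."""

    def __init__(self):
        self._attached = set()

    def add(self, network, container_id, ifname):
        """Record an ADD; refuse a second ADD without a DEL in between."""
        key = (network, container_id, ifname)
        if key in self._attached:
            raise RuntimeError(f"ADD called twice without DEL for {key}")
        self._attached.add(key)

    def delete(self, network, container_id, ifname):
        """Record a DEL; assumed here to be harmless if nothing is attached."""
        self._attached.discard((network, container_id, ifname))

    def is_attached(self, network, container_id, ifname):
        return (network, container_id, ifname) in self._attached


tracker = AttachmentTracker()
tracker.add("cilium", "container-1", "eth0")
try:
    tracker.add("cilium", "container-1", "eth0")
    duplicate_rejected = False
except RuntimeError:
    duplicate_rejected = True
tracker.delete("cilium", "container-1", "eth0")
tracker.add("cilium", "container-1", "eth0")  # allowed again after DEL
```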

18
Cilium CNI plugin control flow
Diagram: kubectl → Kubernetes API Server → Kubelet → CRI-Containerd → CNI-Plugin (Cilium), invoked via cni-add() → Cilium Agent. The agent crosses from userspace into the kernel with bpf_syscall() to set up the BPF maps and the BPF hook in the Linux kernel network stack. The K8s pod contains Container1 and Container2 and has an eth0 interface.

19
Cilium components, with BPF hook points and BPF maps shown in the Linux stack
(Diagram label: Orchestrator)

20
Diagram: containers A, B and C each have an eth0 interface, connected to host-side devices lxc0, lxc0 and lxc1; the two nodes each have their own eth0.

21
Networking modes
Encapsulation
Use case: Cilium handling routing between nodes
Diagram: Node A, Node B and Node C connected to each other over VXLAN.
Direct routing
Use case: Using cloud provider routers, using BGP routing daemon
Diagram: Node A, Node B and Node C connected through cloud or BGP routing.

24
L3 filtering – label based, ingress
Diagram: ingress to the pod with labels role=backend (IP: 10.0.0.3) is allowed from the pods with labels role=frontend (IP: 10.0.0.1 and IP: 10.0.0.2) and denied from the pods without that label (IP: 10.0.0.4 and IP: 10.0.0.5).

25
L3 filtering – label based, ingress
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "Allow frontends to access backends"
metadata:
  name: "frontend-backend"
spec:
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        role: frontend
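How a label-based ingress rule is evaluated can be sketched as follows, using the role=frontend and role=backend labels from the diagram slide. This is a minimal model of `matchLabels` semantics (every listed key/value must be present on the pod), not Cilium's implementation; the dict layout and function names are assumptions made for the example.

```python
def selector_matches(match_labels, pod_labels):
    """True when every key/value in the selector is present on the pod."""
    return all(pod_labels.get(k) == v for k, v in match_labels.items())


def ingress_allowed(policy, src_labels, dst_labels):
    """Return None if the policy does not select the destination pod,
    otherwise True/False depending on whether the source is allowed."""
    if not selector_matches(policy["endpointSelector"], dst_labels):
        return None
    return any(selector_matches(sel, src_labels)
               for sel in policy["fromEndpoints"])


# Hypothetical in-memory form of the policy shown above.
policy = {
    "endpointSelector": {"role": "backend"},
    "fromEndpoints": [{"role": "frontend"}],
}
frontend = {"role": "frontend"}
backend = {"role": "backend"}
unlabeled = {}
allowed = ingress_allowed(policy, frontend, backend)
```

Because the decision depends on labels rather than addresses, a new frontend pod with a new IP is covered without touching the rule.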

26
L3 filtering – CIDR based, egress
Diagram: egress from the pod in Cluster A (IP: 10.0.0.1) is allowed to IP: 10.0.1.1 in subnet 10.0.1.0/24 and denied to IP: 10.0.2.1 in subnet 10.0.2.0/24, or to any IP not belonging to 10.0.1.0/24.

27
L3 filtering – CIDR based, egress
description: "Allow backends to access 10.0.1.0/24"
metadata:
spec:
  endpointSelector:
    matchLabels:
      role: backend
  egress:
  - toCIDR:
    - "10.0.1.0/24"
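The CIDR check behind such an egress rule can be sketched with Python's standard `ipaddress` module. The addresses come from the diagram slide; the `egress_allowed` helper is a hypothetical name, and the sketch only models the membership test, not how Cilium stores or evaluates CIDRs in BPF maps.

```python
import ipaddress


def egress_allowed(dst_ip, allowed_cidrs):
    """True when the destination address falls inside any allowed CIDR."""
    addr = ipaddress.ip_address(dst_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


allowed_cidrs = ["10.0.1.0/24"]
to_allowed_subnet = egress_allowed("10.0.1.1", allowed_cidrs)
to_other_subnet = egress_allowed("10.0.2.1", allowed_cidrs)
```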

28
L4 filtering
Diagram: traffic to the pod (IP: 10.0.0.1) is allowed on TCP/80 and denied on any other port.

29
L4 filtering
description: "Allow access to backends only on TCP/80"
metadata:
spec:
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - toPorts:
    - ports:
      - port: "80"
        protocol: "TCP"

30
L7 filtering – API Aware Security
Diagram: a pod (IP: 10.0.0.5) sends HTTP requests to the pod with labels role=api (IP: 10.0.0.1): GET /articles/{id} and GET /private.

31
L7 filtering – API Aware Security
description: "L7 policy to restrict access to specific HTTP endpoints"
metadata:
spec:
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - toPorts:
    - ports:
      - port: "80"
        protocol: "TCP"
      rules:
        http:
        - method: "GET"
          path: "/article/$"
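An L7 HTTP rule boils down to matching the request method and path against patterns. The sketch below models that with Python's `re` module; it is not the Envoy filter Cilium uses. The path pattern `/articles/[0-9]+` is a hypothetical stand-in chosen to mirror the `GET /articles/{id}` request from the diagram slide, and requiring a full match of both fields is an assumption of this sketch.

```python
import re


def http_allowed(rules, method, path):
    """True when some rule matches both the method and the whole path."""
    return any(
        re.fullmatch(rule["method"], method) and re.fullmatch(rule["path"], path)
        for rule in rules
    )


# Hypothetical rule set: only GET requests for a numeric article id.
rules = [{"method": "GET", "path": "/articles/[0-9]+"}]
article_ok = http_allowed(rules, "GET", "/articles/42")
private_ok = http_allowed(rules, "GET", "/private")
```

A request that reaches the right port but uses another method or path is rejected, which is the difference from the L4 rule on the previous slide.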

32
Standalone proxy, L7 filtering
Diagram: Node A (Pod A + BPF, Envoy) and Node B (Pod B + BPF, Envoy), connected over VXLAN. On each node:
● Generating BPF programs for L3/L4 filtering
● Generating BPF programs for L7 filtering through libcilium.so (Envoy)

34
Cluster Mesh
Diagram: Cluster A and Cluster B connected through an external etcd. Each cluster has nodes (Node A, Node B) running pods + BPF (Pod A, Pod B, Pod C), each pod with a container and an eth0 interface.

35
Diagram: two services communicating across the network with iptables. Each hop between a service socket and its neighbouring socket passes through TCP/IP, Ethernet and iptables over loopback; traffic between the two sides then goes through TCP/IP, Ethernet and eth0 onto the network.

36
Diagram: the same two services with the Cilium CNI on each side. The sockets next to each service are connected directly; only traffic between the two sides goes through TCP/IP, Ethernet and eth0 onto the network.

37
Service A Service B Service C

39
Diagram: Service A, Service B, an external GitHub service, an external cloud network.

40
Kubernetes Services
BPF, Cilium
● Hash table (Key → Value).
Search O(1), Insert O(1), Delete O(1)
Iptables, kube-proxy
● Linked list (Rule 1 → Rule 2 → ... → Rule n).
● All rules in the chain have to be replaced as a whole.
Search O(n), Insert O(1), Delete O(n)
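The hash-table side of this comparison can be sketched with a Python dict keyed by the service frontend. This is an illustration of the data-structure choice only: the key layout, the addresses and the `pick_backend` helper are hypothetical and do not reflect the layout of Cilium's BPF maps.

```python
def pick_backend(services, vip, port, proto, flow_hash):
    """Look up a service frontend in one hash access and pick a backend.

    `flow_hash` is any integer derived from the connection, used here to
    spread flows across backends.
    """
    backends = services.get((vip, port, proto))
    if not backends:
        return None
    return backends[flow_hash % len(backends)]


# Hypothetical services table: frontend (VIP, port, protocol) -> backends.
services = {
    ("10.96.0.1", 443, "TCP"): ["10.0.0.3:6443", "10.0.0.4:6443"],
    ("10.96.0.10", 53, "UDP"): ["10.0.1.7:53"],
}
backend = pick_backend(services, "10.96.0.1", 443, "TCP", flow_hash=7)

# Adding or removing one service touches one key, not the whole table.
services[("10.96.0.20", 80, "TCP")] = ["10.0.2.9:8080"]
del services[("10.96.0.10", 53, "UDP")]
```

The lookup cost stays flat as the number of services grows, whereas a sequential chain has to be walked rule by rule.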

41
Chart: usec (y-axis) against number of services in cluster (x-axis).

42
CNI chaining
● Policy enforcement, load balancing, multi-cluster connectivity
● IP allocation, configuring network interface, encapsulation/routing inside the cluster


50
Why is Cilium awesome?
● It removes the disadvantages of iptables and always gets the best from the Linux kernel.
● Cluster Mesh / multi-cluster.
● Makes Istio faster.
● Offers L7 API-aware filtering as a Kubernetes resource.
● Integrates with other popular CNI plugins: Calico, Flannel, Weave, Lyft, AWS CNI.
