The document discusses using eBPF filters to optimize Netlink performance in SONiC by filtering unnecessary Netlink messages. It proposes:
1. Using eBPF/CBPF socket filtering to drop unwanted Netlink messages in the kernel before they are sent to applications.
2. Implementing filters using either eBPF assembly or Clang/LLVM for easier development and debugging.
3. Developing a customized eBPF library for SONiC with predefined filter rules and actions to simplify application-specific filtering.
3. Netlink Messaging Framework
• SONiC mainly uses the NETLINK_ROUTE family for interface notifications
• It is a broadcast domain
● All network interface updates are grouped under the NETLINK_ROUTE family.
● Each netdevice notifies the NETLINK subsystem about changes in its port properties.
● The NETLINK subsystem posts a message (a packet holding a “struct nlmsghdr”) to the socket recv-Q of every registered application.
● The application then reads the message from the recv-Q.
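The read path in the bullets above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not SONiC code: open_link_socket() and count_link_msgs() are hypothetical helper names, and the sketch relies only on the standard RTMGRP_LINK multicast group and the NLMSG_OK/NLMSG_NEXT macros.

```c
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: open a NETLINK_ROUTE socket subscribed to link
 * notifications. Returns the socket fd, or -1 on error. */
int open_link_socket(void)
{
    struct sockaddr_nl sa;
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (fd < 0)
        return -1;
    memset(&sa, 0, sizeof(sa));
    sa.nl_family = AF_NETLINK;
    sa.nl_groups = RTMGRP_LINK;   /* RTM_NEWLINK / RTM_DELLINK multicast group */
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: count the RTM_NEWLINK/RTM_DELLINK messages in one
 * receive buffer of len bytes, as returned by recv() on the socket above. */
int count_link_msgs(void *buf, int len)
{
    struct nlmsghdr *nlh;
    int count = 0;

    for (nlh = buf; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
        if (nlh->nlmsg_type == RTM_NEWLINK || nlh->nlmsg_type == RTM_DELLINK)
            count++;
    }
    return count;
}
```

Every process that binds to the group this way receives every link message, which is why the per-application cost grows with the number of netdevices.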
4. Application Interaction with Netlink
[Diagram]
User Space: applications creating/updating the NetDevice properties (Vlanmgr, Portmgrd, Other Apps); applications interested in NETLINK_ROUTE family updates (Teamd, STPd, Other Apps).
Kernel Space: Network Interfaces (Bridge, Vlan, Eth, PO etc.), Device Driver, NETLINK subsystem.
RTM_NEWLINK/RTM_DELLINK messages are multicast to all registered apps.
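5. Without Filter
Ex: Ethernet0 is added to 4K VLANs
<<config vlan member range add 2 4094 Ethernet0>>
[Diagram]
User Space: Vlanmgrd, Teamsyncd, STPd, UDLD.
Kernel Space: NetDevices (4K Vlans, Ethernet0), NETLINK subsystem.
The NetDevices generate 8190 messages, and the NETLINK subsystem delivers 8190 to each of Teamsyncd, STPd and UDLD.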
6. With eBPF Filter
Ethernet0 is added to 4K VLANs
<<config vlan member range add 2 4094 Ethernet0>>
Teamsyncd & STPd: bound with an eBPF filter to drop all VLAN-member adds.
[Diagram]
User Space: Vlanmgrd, Teamsyncd, STPd, UDLD.
Kernel Space: NetDevices (4K Vlans, Ethernet0), NETLINK subsystem.
The NetDevices generate 8190 messages. For Teamsyncd and for STPd, all 8190 are dropped in the kernel; UDLD, which has no filter, still receives 8190.
7. Netlink Message Format
sk_buff->data
nlmsghdr (for nlmsg_type == RTM_NEWLINK/RTM_DELLINK; carries the attributes of a single netdevice)
  ifinfomsg
  Attribute-1: T = IFLA_ADDRESS, V = MAC
  Attribute-2: T = IFLA_IFNAME, V = if_name
  Attribute-3: T = IFLA_LINKINFO, V = Nested TLVs
    T = IFLA_INFO_KIND, V = Team/Vlan
    T = IFLA_INFO_SLAVE_KIND, V = Team/Vlan
    TLV-N
  Attribute-N
Every attribute change on the interface generates an RTM_NEWLINK message carrying all the attributes.
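The TLV layout above can be walked in user space with the standard rtnetlink macros. The sketch below is illustrative only: find_attr() and link_kind() are hypothetical helper names, and the code assumes the buffer holds one complete RTM_NEWLINK/RTM_DELLINK message.

```c
#include <linux/if_link.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <stddef.h>

/* Hypothetical helper: linear search for attribute `type` in a TLV area
 * of `len` bytes. Returns the attribute, or NULL when it is absent. */
struct rtattr *find_attr(struct rtattr *rta, int len, unsigned short type)
{
    for (; RTA_OK(rta, len); rta = RTA_NEXT(rta, len)) {
        /* Mask off the NLA_F_NESTED / NLA_F_NET_BYTEORDER flag bits. */
        if ((rta->rta_type & NLA_TYPE_MASK) == type)
            return rta;
    }
    return NULL;
}

/* Hypothetical helper: return the IFLA_INFO_KIND string ("team", "vlan", ...)
 * nested inside IFLA_LINKINFO, or NULL when the message does not carry it. */
const char *link_kind(struct nlmsghdr *nlh)
{
    struct ifinfomsg *ifi = NLMSG_DATA(nlh);
    int len = IFLA_PAYLOAD(nlh);          /* bytes of attributes after ifinfomsg */
    struct rtattr *info, *kind;

    info = find_attr(IFLA_RTA(ifi), len, IFLA_LINKINFO);
    if (!info)
        return NULL;
    /* The value of IFLA_LINKINFO is itself a sequence of TLVs. */
    kind = find_attr(RTA_DATA(info), RTA_PAYLOAD(info), IFLA_INFO_KIND);
    return kind ? RTA_DATA(kind) : NULL;
}
```

Because attributes are located by a linear search rather than at fixed offsets, a filter that matches on them needs an attribute-lookup primitive, not just a byte load.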
8. NetLink Dump
[Diagram] A dump reply spans several buffers. Each sk_buff->data carries multiple messages: nlmsghdr-1, nlmsghdr-2, nlmsghdr-3, ..., nlmsghdr-N.
A Netlink dump continues until the complete DB has been sent to the application.
Each dump reply carries the NLM_F_MULTI flag, and the last dump message is NLMSG_DONE. The filter uses these to trap all dump replies.
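The two dump markers can be checked with a few lines of C. This is a user-space sketch for illustration; is_dump_reply() and dump_finished() are hypothetical names, not part of any library.

```c
#include <linux/netlink.h>

/* Hypothetical helper: 1 when the message is part of a dump reply, i.e. it
 * carries NLM_F_MULTI or is the terminating NLMSG_DONE message. */
int is_dump_reply(const struct nlmsghdr *nlh)
{
    return (nlh->nlmsg_flags & NLM_F_MULTI) || nlh->nlmsg_type == NLMSG_DONE;
}

/* Hypothetical helper: scan one receive buffer of len bytes. Returns 1 once
 * NLMSG_DONE is seen (the dump is complete), 0 when more reads are needed. */
int dump_finished(void *buf, int len)
{
    struct nlmsghdr *nlh;

    for (nlh = buf; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
        if (nlh->nlmsg_type == NLMSG_DONE)
            return 1;
    }
    return 0;
}
```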
9. SONiC Netlink Message - Scaling Issue
• Every net device has multiple attributes
• Any attribute change generates a Netlink message notification
• An application has to process all the Netlink messages generated by all the net devices.
• There is no way to register only for a specific interface or a specific attribute change.
• When 4K VLANs are configured per port
  • It generates ~8K Netlink messages
• On a scaled system
  • Each process registers for kernel link notifications
  • Each process suffers from the same bursty notification issue as seen with Teamd
  • Easily more than 1M unnecessary messages are broadcast across the system.
• Applications are not able to process all the messages during config reload and system reboot
• When the socket queue fills up, messages are dropped with an ENOBUFS error. There is no way to retrieve the lost notifications
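On the receive side, the overflow shows up as recv() failing with errno set to ENOBUFS. The sketch below shows how an application might detect it; recv_nl() is a hypothetical wrapper, and what the application does after an overrun (for example, re-reading full state) is left open here.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical wrapper around a non-blocking recv(). Returns the number of
 * bytes read, 0 when the queue is empty, or -1 on error. *overrun is set to
 * 1 when the kernel reports ENOBUFS, meaning messages were dropped because
 * the socket receive queue was full. */
ssize_t recv_nl(int fd, void *buf, size_t len, int *overrun)
{
    ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);

    *overrun = (n < 0 && errno == ENOBUFS);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;               /* nothing pending */
    return n;
}
```

The overrun flag only says that something was lost, not what, which is why dropping unwanted messages before they reach the queue is the more robust fix.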
10. Netlink Filter
• Berkeley Packet Filter (BPF)
• Interface to execute micro-assembly code in the kernel on a minimal VM
• The ASM filter code is executed for every packet received
• The return value decides whether to accept or drop the packet
• The filter is executed in the context of the Netlink message sender
• Filter execution has little impact on CPU performance
11. Netlink socket filtering – CBPF/EBPF
• CBPF /EBPF
• Micro code assembly
• Performance – Optimized flow
• Easy to attach filter
• Limitations
• No loops
• Limited set of registers
• Jump tracing is very hard to debug
• No local storage (arrays/maps) in CBPF
• No NLATTR helper function in EBPF
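fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE)
struct bpf_insn prog[] = {
    BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),    /* R6 = skb, as BPF_LD_ABS requires */
    BPF_LD_ABS(BPF_B, 14 + 9),              /* R0 = byte at the protocol offset */
    BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 17, 2), /* UDP (17): jump to ACCEPT */
    BPF_MOV64_IMM(BPF_REG_0, 0),            /* 0: DROP */
    BPF_EXIT_INSN(),
    BPF_MOV64_IMM(BPF_REG_0, 0xFFFF),       /* 0xFFFF: ACCEPT */
    BPF_EXIT_INSN(),
};
setsockopt(fd, SOL_SOCKET, SO_ATTACH_BPF, ..)
recvmsg(fd..)
[Diagram]
User: the application opens the socket fd, attaches the program with setsockopt() and receives netlink messages with recvmsg().
Kernel: BPF verifier, BPF JIT compiler, BPF in native code; the filter runs between the Netlink subsystem and the socket fd.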
12. Netlink socket filtering – Clang/LLVM
• Clang/LLVM
• Restricted C
• Array and hash map support
• Easy to write and debug the filter code
• Limitations
• Not an optimized instruction flow
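fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE)
SEC("socket") int bpf_prog1(struct __sk_buff *skb)
{
    /* Assuming load_half() is the LLVM BPF_LD_ABS intrinsic: it converts from
     * network byte order, while nlmsg_flags is host-endian, so convert back
     * (bpf_htons() comes from libbpf's bpf_endian.h). */
    uint16_t flags = bpf_htons(load_half(skb, offsetof(struct nlmsghdr, nlmsg_flags)));
    if (flags & NLM_F_MULTI)
        return ACCEPT_PKT;
    else
        return DROP_PKT;
}
load_and_attach(fd, SO_ATTACH_BPF..)
recvmsg(fd..)
[Diagram]
Build: Clang/LLVM compilation turns the filter source into filter-obj.bpf.
User: load_and_attach() loads filter-obj.bpf and attaches it to the socket fd; the application receives netlink messages with recvmsg().
Kernel: BPF verifier, BPF JIT compiler, BPF in native code; the filter runs between the Netlink subsystem and the socket fd.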
13. PoC with TeamD
• Arlo [ JIRA-7122 ] is fixed
• Verified the ENOBUFS issue is not seen with the 4K VLAN sanity suite.
• Thanks to Madhukar
• Helping to understand the teamd filter requirements
• Validating the PoC filter
FILTER DROP COUNT
Process                    Dropped in Kernel   Trapped to Application   Dropped %
Teamd (Per port-channel)   79814               238                      99.7%
teamsyncd                  214510              42696                    83.4%
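The Dropped % column is consistent with dropped / (dropped + trapped). A small sketch of that arithmetic follows; drop_percent() is a hypothetical helper, and the formula is an assumption inferred from the table values rather than stated on the slide.

```c
/* Hypothetical helper: share of messages dropped in the kernel, in percent.
 * dropped = messages discarded by the filter, trapped = messages delivered. */
double drop_percent(unsigned long dropped, unsigned long trapped)
{
    unsigned long total = dropped + trapped;

    return total ? 100.0 * (double)dropped / (double)total : 0.0;
}
```

For example, drop_percent(79814, 238) evaluates to roughly 99.7 and drop_percent(214510, 42696) to roughly 83.4, matching the two rows.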
14. Design for PoC verification
• Added kernel patch for nlattr and nested nlattr search helper functions
• Customized EBPF filter logic for TeamD
• Clang/LLVM compiler integration
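fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE)
[Diagram]
User: the application receives netlink messages on the socket fd and accesses the hash map DB from user space.
Kernel: the BPF filter runs between the Netlink subsystem and the socket fd and uses the hash map DB.
Hash MAP DB
KEY / IFINDEX   VALUE / Attributes
1               [ s:1, f:2, v:3 ]
64              [ s:1, f:3, v:7 ]
23              [ s:1, f:5, v:6 ]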
16. EBPF assembly filter
• 11 Register set
• Kernel helper functions
• Kernel trace printk
• Array/Hash map APIs
• Tail calls
• Redirects
EBPF Register   Description
R0              Return value from in-kernel function, and exit value for eBPF program
R1 ~ R5         Arguments from eBPF program to in-kernel function
R6 ~ R9         Callee saved registers that in-kernel function will preserve
R10             Read-only frame pointer to access stack
17. Clang/LLVM
• Clang/LLVM compiler integration
• Build infra for compilation of application specific filter
• libsbpf.so - library
• Application interface
• Loads the ebpf object into kernel
• Attaches the ebpf filter code into application socket
• Application
• App User will write custom filter for their needs
[Diagram]
Application: attach_filter(fd, "myfilter.o")
libsbpf.so: attach_filter(fd, fobj), load_filter(fd, fobj), filter callback
BPF Filter build framework: the BPF bytecode compiler builds MyFilter [ My filter logic – myfilter.c ] into myfilter.o
18. Customized EBPF library for SONiC (Idea)
• Set of BPF filter rules and actions
• Rules can be
• Offset lookup and match
• Attribute lookup and match
• Nested attribute lookup and match
• Save result into a variable
• Action can be
• Accept
• Drop
• Jump to Nth rule
Label       Rule      Offset   Mask   Exp    Action
FCHECK      OFFSET    0x20     0xFF   0xaa   ACCEPT
NLCHECK     NLMATCH   0x56     0xFE   0xbb   GOTO NESTCHECK
DROP        DROP      0x00     0x00   0x00   DROP
NESTCHECK   NAMATCH   0x89     0xAF   0xcc   ACCEPT
RETURN      DROP      0x00     0x00   0x00   DROP
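One way to read the rule table is as a small interpreter. The sketch below is a user-space model for illustration only, not the proposed library. The struct layout, all names, and the semantics (a matching rule takes its action, a non-matching rule falls through to the next one, jumps go forward only) are assumptions. Only the offset rule and an unconditional rule are modeled; attribute and nested-attribute matching are omitted.

```c
#include <stddef.h>
#include <stdint.h>

enum rule_kind { RULE_OFFSET, RULE_ALWAYS };        /* hypothetical rule types */
enum rule_action { ACT_ACCEPT, ACT_DROP, ACT_GOTO };

struct rule {
    enum rule_kind kind;
    size_t offset;            /* byte offset into the message (RULE_OFFSET) */
    uint8_t mask;             /* applied to the byte before comparing */
    uint8_t expect;           /* expected value after masking */
    enum rule_action action;  /* taken when the rule matches */
    size_t target;            /* rule index for ACT_GOTO */
};

/* Evaluate the table against one message.
 * Returns 1 to accept the message, 0 to drop it. */
int run_rules(const struct rule *rules, size_t nrules,
              const uint8_t *msg, size_t len)
{
    size_t i = 0;

    while (i < nrules) {
        const struct rule *r = &rules[i];
        int match = r->kind == RULE_ALWAYS ||
                    (r->offset < len &&
                     (msg[r->offset] & r->mask) == r->expect);

        if (!match) {
            i++;                    /* fall through to the next rule */
            continue;
        }
        if (r->action == ACT_ACCEPT)
            return 1;
        if (r->action == ACT_DROP)
            return 0;
        if (r->target <= i)         /* forward jumps only, so no loops */
            return 0;
        i = r->target;              /* ACT_GOTO */
    }
    return 0;                       /* ran off the end of the table: drop */
}
```

Restricting jumps to later rules keeps evaluation bounded by the table length, in line with the no-loops limitation listed for the filter programs.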
19. BPF Possibilities
• Time critical protocol packets can be generated from kernel.
• Statistics collection
• Custom user code injection
• And Much more …