
When I need to capture some packets using tcpdump, I use a command like:

tcpdump -i eth0 "dst host 192.168.1.0"

I always thought the dst host 192.168.1.0 part is something called BPF, the Berkeley Packet Filter. To me, it's a simple language for filtering network packets. But today my roommate told me that BPF can be used to capture performance info. According to his description, it's like the perfmon tool on Windows. Is that true? Is it the same BPF I mentioned at the beginning of the question?


2 Answers


What is BPF?

BPF (or more commonly, the extended version, eBPF) is a language that was originally used exclusively for filtering packets, but it is capable of quite a lot more. On Linux, it can be used for many other things, including system call filters for security, choosing processes to kill when the system runs out of memory, and sophisticated performance monitoring, as you pointed out. While Windows did add eBPF support, that is not what Windows' perfmon utility uses. Windows only added support for compatibility with non-Windows utilities that rely on OS support for eBPF.

The eBPF programs are not executed in userspace. Instead, the application creates and sends an eBPF program to the kernel, which executes it. The program is actually machine code for a virtual processor, implemented as an interpreter in the kernel, although the kernel can also JIT-compile it to improve performance considerably. The program has access to some basic interfaces in the kernel, including those related to performance and networking. It then communicates its results (such as a decision to drop a packet) back to the kernel.
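To make that hand-off concrete, here is a minimal sketch (names and structure are my own, and it uses the classic-BPF socket-filter interface rather than the eBPF bpf(2) syscall) of a userspace program giving a filter to the kernel, which is essentially what libpcap/tcpdump does after compiling your filter expression:

/* Minimal sketch: attach a hand-written classic BPF program to a raw socket.
 * The one-instruction filter simply accepts every packet. Requires CAP_NET_RAW. */
#include <stdio.h>
#include <sys/socket.h>
#include <arpa/inet.h>        /* htons() */
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/filter.h>     /* struct sock_filter, struct sock_fprog */

int main(void)
{
    struct sock_filter accept_all[] = {
        { BPF_RET | BPF_K, 0, 0, 0xffffffff },  /* ret #-1: accept the whole packet */
    };
    struct sock_fprog prog = { .len = 1, .filter = accept_all };

    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    /* This is the hand-off: the kernel checks the program and, from now on,
     * runs it against every packet before deciding to queue it on this socket. */
    if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0) {
        perror("setsockopt(SO_ATTACH_FILTER)");
        return 1;
    }
    return 0;
}

eBPF programs reach the kernel through the bpf(2) syscall (BPF_PROG_LOAD) instead, typically via a library such as libbpf, but the idea is the same: the bytecode is handed over once and then runs entirely in kernel context.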

Restrictions on eBPF programs

In order to protect from denial-of-service attacks or accidental crashes, the kernel first verifies the code before it is compiled. Before being run, the code is subject to several important checks:

  • The program consists of no more than 4096 instructions in total for unprivileged users.

  • Backwards jumps cannot occur, with the exception of bounded loops and function calls.

  • There are no instructions that are always unreachable.

The upshot is that the verifier must be able to prove that the eBPF program halts. It hasn't found a solution to the halting problem, of course, which is why it only accepts programs that it knows will halt. To do this, it represents the program as a directed acyclic graph. In addition to this, it tries to prevent information leaks and out-of-bounds memory access by preventing the actual value of a pointer from being revealed while still allowing limited operations to be performed on it:

  • Pointers cannot be compared, stored, or returned as a value that can be examined.

  • Pointer arithmetic can only be done against a scalar (a value not derived from a pointer).

  • No pointer arithmetic can result in pointing outside the designated memory map.

The verifier is rather complex and does far more, although it has itself been the source of serious security bugs, at least when the bpf(2) syscall is not disabled for unprivileged users.
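To see how the pointer rules look in practice, here is a minimal, hypothetical eBPF program in C (the kind you would compile with clang and load at the XDP hook; function and variable names are made up). The explicit comparison against data_end is what lets the verifier prove the packet access stays in bounds:

/* Sketch of verifier-friendly packet access in an XDP program.
 * Assumes a clang/libbpf toolchain; purely illustrative. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_check_ipv4(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    /* Without this check the verifier rejects the program, because it
     * cannot prove that reading eth->h_proto stays inside the packet. */
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    if (eth->h_proto == bpf_htons(ETH_P_IP)) {
        /* ... a real program would act on IPv4 packets here ... */
    }
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";

Note that the pointer arithmetic here (eth + 1) is against a constant, and the result is only ever compared against data_end, never stored or returned where it could be examined: exactly the kind of limited pointer use the verifier allows.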

Viewing the code

The dst host 192.168.1.0 component of the command is not BPF. That is just the filter expression syntax used by tcpdump. However, the expression you give it is used to generate a BPF program, which is then sent to the kernel. Note that it is not eBPF that is used in this case, but the older cBPF. There are several important differences between the two (although the kernel internally converts cBPF into eBPF). The -d flag can be used to see the cBPF code that will be sent to the kernel:

# tcpdump -i eth0 "dst host 192.168.1.0" -d
(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 4
(002) ld       [30]
(003) jeq      #0xc0a80100      jt 8    jf 9
(004) jeq      #0x806           jt 6    jf 5
(005) jeq      #0x8035          jt 6    jf 9
(006) ld       [38]
(007) jeq      #0xc0a80100      jt 8    jf 9
(008) ret      #262144
(009) ret      #0

More complicated filters result in more complicated bytecode. Try some of the examples in the manpage and append the -d flag to see what bytecode would be loaded into the kernel. In order to understand how to read the disassembly, review the BPF filter documentation. If you're reading an eBPF program, you should take a look at the eBPF instruction set for the virtual CPU.
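tcpdump can also print the same program in forms that are easier to reuse: -dd emits it as a C array fragment and -ddd as decimal numbers (the format used with iptables later in this answer). For the filter above, the C form should look roughly like this (the fields are code, jt, jf, k, and the jump targets become offsets relative to the next instruction), wrapped here in an array definition of my own so it is a complete fragment:

#include <linux/filter.h>

/* Approximate `tcpdump -dd` output for "dst host 192.168.1.0". */
struct sock_filter dst_host_filter[] = {
    { 0x28, 0, 0, 0x0000000c },  /* (000) ldh [12]                       */
    { 0x15, 0, 2, 0x00000800 },  /* (001) jeq #0x800        jt 2   jf 4  */
    { 0x20, 0, 0, 0x0000001e },  /* (002) ld  [30]                       */
    { 0x15, 4, 5, 0xc0a80100 },  /* (003) jeq #0xc0a80100   jt 8   jf 9  */
    { 0x15, 1, 0, 0x00000806 },  /* (004) jeq #0x806        jt 6   jf 5  */
    { 0x15, 0, 3, 0x00008035 },  /* (005) jeq #0x8035       jt 6   jf 9  */
    { 0x20, 0, 0, 0x00000026 },  /* (006) ld  [38]                       */
    { 0x15, 0, 1, 0xc0a80100 },  /* (007) jeq #0xc0a80100   jt 8   jf 9  */
    { 0x06, 0, 0, 0x00040000 },  /* (008) ret #262144                    */
    { 0x06, 0, 0, 0x00000000 },  /* (009) ret #0                         */
};

An array like this is exactly what you would point a struct sock_fprog at before calling setsockopt(SO_ATTACH_FILTER), as in the earlier sketch.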

Understanding the code

For simplicity, I'll assume you specified a destination IP of 192.168.1.1 instead of 192.168.1.0 and wanted to match IPv4 only, which shrinks the code quite a bit as it no longer has to handle ARP and RARP (the 0x806 and 0x8035 EtherType checks above):

# tcpdump -i eth0 "dst host 192.168.1.1 and ip" -d
(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 5
(002) ld       [30]
(003) jeq      #0xc0a80101      jt 4    jf 5
(004) ret      #262144
(005) ret      #0

Let's walk through what the above bytecode actually does. Each time a packet is received on the interface specified, the BPF bytecode is run. The packet contents (including the Ethernet header, if applicable) are put in a buffer that the BPF code has access to. If the packet matches the filter, the code will return the size of the capture buffer (262144 bytes by default), otherwise it returns 0.

Let's assume you are running this filter and it receives a packet carrying an ICMP message with an empty payload from 192.168.1.142 to 192.168.1.1. The destination MAC is aa:aa:aa:aa:aa:aa and the source MAC is bb:bb:bb:bb:bb:bb (an Ethernet frame starts with the destination address). The contents of the Ethernet frame, in hexadecimal, are:

aa aa aa aa aa aa bb bb bb bb bb bb 08 00 45 00
00 1c 77 71 40 00 40 01 3f 92 c0 a8 01 8e c0 a8
01 01 08 00 c1 c0 36 0e 00 01

The first instruction is ldh [12]. This loads the half-word (two bytes) at byte offset 12 of the packet into the A register. Here that is the value 0x0800 (remember that network data is always big-endian). The second instruction is jeq #0x800, which compares an immediate with the value in the A register. If they are equal, it jumps to instruction 2, otherwise to 5. The value 0x800 at that offset in the Ethernet frame specifies the IPv4 protocol. Because the comparison evaluates true, the code now jumps to instruction 2. If the payload were not IPv4, it would have jumped to 5.

Instruction 2 (the third) is ld [30]. This loads an entire 4-byte word at an offset of 30 into the A register. In our Ethernet frame, this is 0xc0a80101. The next instruction, jeq #0xc0a80101, will compare an immediate against the contents of the A register and will jump to 4 if true, otherwise 5. This value is the destination address (0xc0a80101 is the big-endian representation of 192.168.1.1). The values do indeed match, so the program counter is now set to 4.

Instruction 4 is ret #262144. This terminates the BPF program and returns the integer 262144 to the calling program. This tells the calling program, tcpdump in this case, that the packet was caught by the filter, so it requests the contents of the packet from the kernel, decodes it more thoroughly, and writes the information to your terminal. If the destination address did not match what the filter was looking for or the protocol type was not IPv4, the code would have jumped to instruction 5 instead, where it would have been met with ret #0. This would have terminated without a match.

This is all just a way to return 262144 if the half-word at offset 12 into the packet is 0x800 AND the word at offset 30 is 0xc0a80101, and return 0 otherwise. Because this is all done in the kernel (optionally after being converted into native machine code by the JIT engine), no expensive context switches or copying of buffers between kernelspace and userspace are required, so the filter is fast.
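If it helps, the logic of this filter can be restated as ordinary C over the raw frame bytes (purely illustrative; this is not how the kernel executes it, and it skips the length checks the real BPF machine performs automatically):

#include <stdint.h>

/* Plain-C restatement of the filter: pkt points at the start of the
 * Ethernet frame, just like the buffer the BPF program reads from. */
static uint32_t match(const uint8_t *pkt)
{
    /* ldh [12]: the EtherType, big-endian on the wire */
    uint16_t ethertype = (uint16_t)((pkt[12] << 8) | pkt[13]);

    /* ld [30]: the IPv4 destination address, big-endian on the wire */
    uint32_t dst = ((uint32_t)pkt[30] << 24) | ((uint32_t)pkt[31] << 16)
                 | ((uint32_t)pkt[32] << 8)  |  (uint32_t)pkt[33];

    if (ethertype == 0x0800 && dst == 0xc0a80101)
        return 262144;  /* matched: tell the kernel how many bytes to capture */
    return 0;           /* no match: capture nothing */
}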

More advanced examples

The BPF code is not limited to being used by tcpdump. A number of other utilities can use it. You can even create an iptables rule with a BPF filter by using the xt_bpf module! However, you have to be careful when generating the bytecode with tcpdump -ddd, because the generated filter expects the packet to begin with a layer 2 header, which iptables does not provide. To make them compatible, you have to adjust the offsets.

Furthermore, a number of auxiliary functions provide information that can't be obtained by reading the raw packet contents, such as the packet length, the payload start offset, the CPU the packet was received on, the Netfilter mark, etc. From the filter documentation:

The Linux kernel also has a couple of BPF extensions that are used along with the class of load instructions by “overloading” the k argument with a negative offset + a particular extension offset. The result of such BPF extensions are loaded into A.

The supported BPF extensions are:

Extension    Description
len          skb->len
proto        skb->protocol
type         skb->pkt_type
poff         Payload start offset
ifidx        skb->dev->ifindex
nla          Netlink attribute of type X with offset A
nlan         Nested Netlink attribute of type X with offset A
mark         skb->mark
queue        skb->queue_mapping
hatype       skb->dev->type
rxhash       skb->hash
cpu          raw_smp_processor_id()
vlan_tci     skb_vlan_tag_get(skb)
vlan_avail   skb_vlan_tag_present(skb)
vlan_tpid    skb->vlan_proto
rand         prandom_u32()

For example, to match all packets that are received on CPU 3, you could do:

    ld #cpu
    jneq #3, drop
    ret #262144
drop:
    ret #0

Note that this is using BPF assembly syntax compatible with bpf_asm, whereas the other assembly listings here use tcpdump syntax. The main difference is that the former uses named labels whereas the latter labels each instruction with a line number. This assembly translates to the following bytecode (commas delimit instructions):

4,32 0 0 4294963236,21 0 1 1,6 0 0 262144,6 0 0 0,

This can then be used with iptables using the xt_bpf module:

iptables -A INPUT -m bpf --bytecode "4,32 0 0 4294963236,21 0 1 1,6 0 0 262144,6 0 0 0," -j CPU3

This will jump to target chain CPU3 for any packets received on that CPU.

If this seems powerful, remember that this is all cBPF. Although cBPF is translated into eBPF internally, all this is nothing compared to what raw eBPF can do!

For more information

I highly recommend you read this article to understand how tcpdump uses cBPF.

After reading that, read this explanation of how tcpdump turns expressions into bytecode.

If you want to learn everything else about it, you can always check out the source code!

  • Nice answer! The bytecode generated by tcpdump is however cBPF (classic BPF), not eBPF. They are two different bytecodes even if eBPF originated from cBPF. The documentation you linked to only discussed eBPF, but that one discusses both.
    – pchaigno
    Commented Apr 19, 2022 at 7:09
  • You allude to this in a reference to "security" purposes - but to elaborate on that: It can be used to describe/implement full sandboxing - look for seccomp-bpf.
    – davidbak
    Commented Apr 19, 2022 at 15:15
  • @pchaigno Good catch! Thank you. I've corrected my answer.
    – forest
    Commented Apr 19, 2022 at 19:11
  • @davidbak I did mention that it was used for syscall filtering. Edited the answer to link to the seccomp documentation. Thanks.
    – forest
    Commented Apr 19, 2022 at 19:21
  • The 4096-instruction limit is no longer relevant for privileged users; instead there are limits on complexity (in particular, a max of 1M insns that the verifier can check in total, when going through the different branches of the DAG). Function calls can also result in backward jumps. This page can give an overview of projects relying on eBPF. For performance tracing, I'd suggest having a look at BCC or bpftrace.
    – Qeole
    Commented Apr 25, 2022 at 8:36

The eBPF programs are not executed in userspace. Instead, the application creates and sends an eBPF program to the kernel, which executes it.

To complement @forest's good answer, we can maybe elaborate a little on how those programs are executed.

cBPF, as used by tcpdump, has few hooks: programs can be attached to sockets, in order to run when a packet arrives (this is what tcpdump does, to filter the packets received on the socket and pass only the desired ones to userspace), or to the seccomp hooks, so as to do some filtering on system calls and their arguments.
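For the seccomp case, a minimal sketch (illustrative only, not a real sandbox policy; the specific syscall chosen here is arbitrary) shows how the very same struct sock_filter programs are pointed at syscall metadata instead of packet bytes:

/* Minimal seccomp-bpf sketch: make the raw getpid() syscall fail with EPERM
 * in this process. A real sandbox also needs architecture checks and a
 * carefully chosen default policy. */
#include <stdio.h>
#include <stddef.h>
#include <errno.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

int main(void)
{
    struct sock_filter filter[] = {
        /* A = syscall number from struct seccomp_data */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* if (A == __NR_getpid) return ERRNO(EPERM); else fall through */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);   /* required for unprivileged use */
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
        perror("prctl(SECCOMP_MODE_FILTER)");
        return 1;
    }

    printf("getpid() -> %ld (errno %d)\n", (long)syscall(SYS_getpid), errno);
    return 0;
}

Once the filter is installed, the getpid syscall in this process returns -1 with EPERM, while every other syscall is allowed through.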

One of the important features of eBPF is that it can be attached to a wider selection of hooks in the kernel (although it doesn't do seccomp). For networking, there are sockets, but also TC (traffic control) hooks, XDP (driver-level hooks for fast networking), or a few others. With regard to your question: programs can also be attached to tracepoints in the kernel (pre-defined hooks on some specific functions, e.g. syscalls or “important” functions in the kernel), or on kernel probes (kprobes), making them able to trace any function in the kernel (provided it was not inlined at compilation time). Then other types exist, for example LSM for security use cases.

Tracing usually relies on tracepoints or kprobes to attach an eBPF program to a function and run it every time that function is called in the kernel. The program can access the arguments of the function or (if it's attached at the exit) the return value. Through the use of maps, special kernel memory areas such as arrays or hash maps dedicated to sharing data between eBPF programs and/or user space, the programs can collect metrics or share state between consecutive runs.
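As a rough sketch of what such a program looks like (assuming a clang/libbpf-style toolchain; the map and function names here are made up for illustration), this counts openat() calls per PID in a hash map that userspace can later read:

/* Sketch: count openat() syscalls per PID in a BPF hash map. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, __u32);    /* PID */
    __type(value, __u64);  /* number of openat() calls seen */
} openat_counts SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_openat")
int count_openat(void *ctx)
{
    __u32 pid = bpf_get_current_pid_tgid() >> 32;
    __u64 one = 1, *count;

    count = bpf_map_lookup_elem(&openat_counts, &pid);
    if (count)
        __sync_fetch_and_add(count, 1);   /* shared state across runs */
    else
        bpf_map_update_elem(&openat_counts, &pid, &one, BPF_ANY);

    return 0;
}

char LICENSE[] SEC("license") = "GPL";

Userspace would then iterate the map (for example with bpf_map_get_next_key() and bpf_map_lookup_elem() from libbpf) to dump the counters, which is essentially what the BCC tool described next does in a more elaborate way.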

For example, opensnoop from BCC will attach to the tracepoints at the entry and the exit of the open() and openat() syscalls. At the entry, it collects the path of the file being opened and the PID of the process opening it, and stores them in a hash map. When the syscall exits, the second probe collects the return value and, based on the PID, updates the relevant entry in the hash map. Then user space can collect and dump all entries from the hash map to show what files have been opened by what processes, and what the return values were.

https://ebpf.io/ is a nice place to get started with eBPF.
