What is BPF?
BPF (or more commonly, the extended version, eBPF) is a language that was originally used exclusively for filtering packets, but it is capable of quite a lot more. On Linux, it can be used for many other things, including system call filters for security, choosing processes to kill when the system runs out of memory, and sophisticated performance monitoring, as you pointed out. While Windows did add eBPF support, that is not what Windows' perfmon utility uses. Windows only added support for compatibility with non-Windows utilities that rely on OS support for eBPF.
The eBPF programs are not executed in userspace. Instead, the application creates an eBPF program and sends it to the kernel, which executes it. The program is machine code for a virtual processor, which the kernel implements as an interpreter, although it can also use JIT compilation to improve performance considerably. The program has access to some basic interfaces in the kernel, including those related to performance and networking. It then hands its result back to the kernel (such as a decision to drop a packet).
Restrictions on eBPF programs
In order to protect against denial-of-service attacks or accidental crashes, the kernel verifies the code before it is compiled or run. The code is subject to several important checks:
The program consists of no more than 4096 instructions in total for unprivileged users.
Backwards jumps cannot occur, with the exception of bounded loops and function calls.
There are no instructions that are always unreachable.
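As a rough illustration of the structural checks in this list, here is a toy checker in Python. It is only a sketch under stated assumptions: the (op, jt, jf) instruction format and the check_program name are made up for this example, and it rejects every backward jump, which is stricter than the rule quoted above. It is not how the kernel verifier is implemented.

```python
# Toy structural checks, NOT the kernel verifier. The instruction format
# (op, jt, jf) is a hypothetical simplification: "jmp" carries two absolute
# target indices, "ret" ends the program, anything else falls through.
MAX_INSNS = 4096  # limit quoted above for unprivileged users


def check_program(prog):
    """Return True if prog passes the toy checks, else raise ValueError."""
    if not prog or len(prog) > MAX_INSNS:
        raise ValueError("program is empty or too long")

    reachable = set()
    worklist = [0]
    while worklist:
        pc = worklist.pop()
        if pc in reachable:
            continue
        if pc >= len(prog):
            raise ValueError("control flow leaves the program")
        reachable.add(pc)

        op, jt, jf = prog[pc]
        if op == "ret":
            continue
        if op == "jmp":
            targets = (jt, jf)
            if any(t <= pc for t in targets):
                raise ValueError(f"backward jump at instruction {pc}")
        else:
            targets = (pc + 1,)
        worklist.extend(targets)

    if len(reachable) != len(prog):
        raise ValueError("program contains unreachable instructions")
    return True


# Shape of a small six-instruction filter: load, test, load, test, ret, ret.
FILTER = [
    ("ld", None, None),
    ("jmp", 2, 5),
    ("ld", None, None),
    ("jmp", 4, 5),
    ("ret", None, None),
    ("ret", None, None),
]
print(check_program(FILTER))
```

Because every edge in this toy model goes to a higher instruction index, the control flow graph is acyclic and any accepted program must halt.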
The upshot is that the verifier must be able to prove that the eBPF program halts. It hasn't found a solution to the halting problem, of course, which is why it only accepts programs that it knows will halt. To do this, it represents the program as a directed acyclic graph. In addition to this, it tries to prevent information leaks and out-of-bounds memory access by preventing the actual value of a pointer from being revealed while still allowing limited operations to be performed on it:
Pointers cannot be compared, stored, or returned as a value that can be examined.
Pointer arithmetic can only be done against a scalar (a value not derived from a pointer).
No pointer arithmetic can result in pointing outside the designated memory map.
The verifier is rather complex and does far more, although it has itself been the source of serious security bugs, at least when the bpf(2) syscall is not disabled for unprivileged users.
Viewing the code
The dst host 192.168.1.0 component of the command is not BPF. That is just syntax which is used by tcpdump. However, the command you give it is used to generate a BPF program which is then sent to the kernel. Note that it is not eBPF which is used in this case, but the older cBPF. There are several important differences between the two (although the kernel internally converts cBPF into eBPF). The -d flag can be used to see the cBPF code that is to be sent to the kernel:
# tcpdump -i eth0 "dst host 192.168.1.0" -d
(000) ldh [12]
(001) jeq #0x800 jt 2 jf 4
(002) ld [30]
(003) jeq #0xc0a80100 jt 8 jf 9
(004) jeq #0x806 jt 6 jf 5
(005) jeq #0x8035 jt 6 jf 9
(006) ld [38]
(007) jeq #0xc0a80100 jt 8 jf 9
(008) ret #262144
(009) ret #0
More complicated filters result in more complicated bytecode. Try some of the examples in the manpage and append the -d flag to see what bytecode would be loaded into the kernel. In order to understand how to read the disassembly, review the BPF filter documentation. If you're reading an eBPF program, you should take a look at the eBPF instruction set for the virtual CPU.
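To make the constants in the listing above less opaque, the following Python sketch reproduces them. The helper name host_immediate and the constant names are my own. The offsets assume a 14-byte Ethernet header with no VLAN tag, an IPv4 header whose destination address sits 16 bytes in, and an ARP payload whose target protocol address sits 24 bytes in.

```python
import ipaddress

# EtherType values tested at offset 12 in the listing above.
ETHERTYPES = {0x0800: "IPv4", 0x0806: "ARP", 0x8035: "RARP"}

ETH_HLEN = 14          # destination MAC (6) + source MAC (6) + EtherType (2)
IPV4_DST_OFFSET = 16   # destination address within the IPv4 header
ARP_TPA_OFFSET = 24    # target protocol address within an ARP/RARP payload


def host_immediate(addr):
    """Return the 32-bit value an IPv4 address is compared against."""
    return int(ipaddress.IPv4Address(addr))


print(hex(host_immediate("192.168.1.0")))   # 0xc0a80100
print(ETH_HLEN + IPV4_DST_OFFSET)           # 30, as in "ld [30]"
print(ETH_HLEN + ARP_TPA_OFFSET)            # 38, as in "ld [38]"
for value, name in ETHERTYPES.items():
    print(f"{value:#06x} {name}")
```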
Understanding the code
For simplicity, I'll assume you specified a destination IP of 192.168.1.1 instead of 192.168.1.0 and wanted to match IPv4 only, which shrinks the code quite a bit as it no longer has to handle ARP (0x806) and RARP (0x8035):
# tcpdump -i eth0 "dst host 192.168.1.1 and ip" -d
(000) ldh [12]
(001) jeq #0x800 jt 2 jf 5
(002) ld [30]
(003) jeq #0xc0a80101 jt 4 jf 5
(004) ret #262144
(005) ret #0
Let's walk through what the above bytecode actually does. Each time a packet is received on the interface specified, the BPF bytecode is run. The packet contents (including the Ethernet header, if applicable) are put in a buffer that the BPF code has access to. If the packet matches the filter, the code returns the snapshot length, which is the maximum number of bytes of the packet to capture (262144 by default); otherwise it returns 0.
Let's assume you are running this filter and it receives a packet sending an ICMP message with an empty payload from 192.168.1.142 to 192.168.1.1. The destination MAC is aa:aa:aa:aa:aa:aa and the source MAC is bb:bb:bb:bb:bb:bb (the destination comes first in an Ethernet header). The contents of the Ethernet frame, in hexadecimal, are:
aa aa aa aa aa aa bb bb bb bb bb bb 08 00 45 00
00 1c 77 71 40 00 40 01 3f 92 c0 a8 01 8e c0 a8
01 01 08 00 c1 c0 36 0e 00 01
The first instruction is ldh [12]. This loads a half-word (two bytes) located at an offset of 12 bytes into the packet into the A register. This is the value 0x0800 (remember that multi-byte header fields are in network byte order, which is big-endian). The second instruction is jeq #0x800, which will compare an immediate with the value in the A register. If they are equal, it will jump to instruction 2, otherwise 5. The value 0x800 at that offset in the Ethernet frame specifies the IPv4 protocol. Because the comparison evaluates true, the code now jumps to instruction 2. If the payload was not IPv4, it would have jumped to 5.
Instruction 2 (the third) is ld [30]. This loads an entire 4-byte word at an offset of 30 into the A register. In our Ethernet frame, this is 0xc0a80101. The next instruction, jeq #0xc0a80101, will compare an immediate against the contents of the A register and will jump to 4 if true, otherwise 5. This value is the destination address (0xc0a80101 is the big-endian representation of 192.168.1.1). The values do indeed match, so the program counter is now set to 4.
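The two loads described so far can be reproduced outside the kernel. This is a small Python sketch rather than anything tcpdump itself does: it applies struct.unpack_from to the example frame, and the FRAME name is my own. The last line shows what a little-endian read of the same two bytes would give, which is why byte order matters here.

```python
import struct

# The example Ethernet frame from above, 42 bytes in total.
FRAME = bytes.fromhex(
    "aa aa aa aa aa aa bb bb bb bb bb bb 08 00 45 00 "
    "00 1c 77 71 40 00 40 01 3f 92 c0 a8 01 8e c0 a8 "
    "01 01 08 00 c1 c0 36 0e 00 01"
)

# "!" selects network (big-endian) byte order, H is 2 bytes, I is 4 bytes.
(ethertype,) = struct.unpack_from("!H", FRAME, 12)   # like ldh [12]
(dst_addr,) = struct.unpack_from("!I", FRAME, 30)    # like ld [30]

# The same two bytes read as little-endian give a different number.
(misread,) = struct.unpack_from("<H", FRAME, 12)

print(hex(ethertype))   # 0x800
print(hex(dst_addr))    # 0xc0a80101
print(hex(misread))     # 0x8
```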
Instruction 4 is ret #262144. This terminates the BPF program and returns the integer 262144 to the kernel. Any non-zero return value means the packet was caught by the filter, so the kernel passes up to that many bytes of the packet to the capturing program, tcpdump in this case, which decodes it more thoroughly and writes the information to your terminal. If the destination address did not match what the filter was looking for or the protocol type was not IPv4, the code would have jumped to instruction 5 instead, where it would have been met with ret #0. This would have terminated without a match.
This is all just a way to return 262144 if the half-word at offset 12 into the packet is 0x800 AND the word at offset 30 is 0xc0a80101, and return 0 otherwise. Because this is all done in the kernel (optionally after being converted into native machine code by the JIT engine), no expensive context switches or passing buffers between kernelspace and userspace are required, so the filter is fast.
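To tie the walkthrough together, here is a minimal Python sketch of an interpreter for just the three kinds of instruction this filter uses. It is an illustration, not the kernel's implementation: the tuple layout (op, k, jt, jf) and the run_filter name are assumptions of mine, and jump targets are absolute instruction indices as printed by tcpdump -d rather than the relative offsets of the real encoding.

```python
import struct

FRAME = bytes.fromhex(
    "aa aa aa aa aa aa bb bb bb bb bb bb 08 00 45 00 "
    "00 1c 77 71 40 00 40 01 3f 92 c0 a8 01 8e c0 a8 "
    "01 01 08 00 c1 c0 36 0e 00 01"
)

# (op, k, jt, jf), mirroring the six-line listing above.
PROGRAM = [
    ("ldh", 12, None, None),
    ("jeq", 0x800, 2, 5),
    ("ld", 30, None, None),
    ("jeq", 0xC0A80101, 4, 5),
    ("ret", 262144, None, None),
    ("ret", 0, None, None),
]

LOADS = {"ldh": ("!H", 2), "ld": ("!I", 4)}  # big-endian half-word and word


def run_filter(program, packet):
    """Run the toy program on one packet and return its ret value."""
    a = 0    # the A register (accumulator)
    pc = 0   # program counter
    while True:
        op, k, jt, jf = program[pc]
        if op in LOADS:
            fmt, size = LOADS[op]
            if k + size > len(packet):
                return 0  # a load past the end of the packet rejects it
            (a,) = struct.unpack_from(fmt, packet, k)
            pc += 1
        elif op == "jeq":
            pc = jt if a == k else jf
        elif op == "ret":
            return k
        else:
            raise ValueError(f"unsupported instruction: {op}")


print(run_filter(PROGRAM, FRAME))  # 262144
```

Changing a byte of the destination address, or truncating the frame before offset 34, makes the same call return 0.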
More advanced examples
The BPF code is not limited to being used by tcpdump. A number of other utilities can use it. You can even create an iptables rule with a BPF filter by using the xt_bpf module! However, you have to be careful when generating the bytecode with tcpdump -ddd because the generated code expects to consume a layer 2 header, whereas iptables hands the filter a packet without one. To make them compatible, you have to adjust the offsets.
Furthermore, a number of auxiliary functions supply information that can't be obtained by reading the raw packet contents, such as the packet length, the payload start offset, the CPU the packet was received on, the NetFilter mark, etc. From the filter documentation:
The Linux kernel also has a couple of BPF extensions that are used along with the class of load instructions by “overloading” the k argument with a negative offset + a particular extension offset. The result of such BPF extensions are loaded into A.
The supported BPF extensions are:
| Extension | Description |
|---|---|
| len | skb->len |
| proto | skb->protocol |
| type | skb->pkt_type |
| poff | Payload start offset |
| ifidx | skb->dev->ifindex |
| nla | Netlink attribute of type X with offset A |
| nlan | Nested Netlink attribute of type X with offset A |
| mark | skb->mark |
| queue | skb->queue_mapping |
| hatype | skb->dev->type |
| rxhash | skb->hash |
| cpu | raw_smp_processor_id() |
| vlan_tci | skb_vlan_tag_get(skb) |
| vlan_avail | skb_vlan_tag_present(skb) |
| vlan_tpid | skb->vlan_proto |
| rand | prandom_u32() |
For example, to match all packets that are received on CPU 3, you could do:
ld #cpu
jneq #3, drop
ret #262144
drop:
ret #0
Note that this is using BPF assembly syntax compatible with bpf_asm, whereas the other assembly listings here are using tcpdump syntax. The main difference is that the former's syntax uses named labels whereas the latter's BPF syntax labels each instruction with its index. This assembly translates to the following bytecode (commas delimit instructions, and the leading 4 is the instruction count):
4,32 0 0 4294963236,21 0 1 3,6 0 0 262144,6 0 0 0,
This can then be used with iptables using the xt_bpf module:
iptables -A INPUT -m bpf --bytecode "4,32 0 0 4294963236,21 0 1 3,6 0 0 262144,6 0 0 0," -j CPU3
This will jump to target chain CPU3 for any packets received on that CPU.
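If you are curious where the numbers in that bytecode come from, the following Python sketch encodes the four-instruction CPU filter by hand. The function name cpu_filter is my own. The opcode and SKF_AD_* values are the ones I understand the kernel's UAPI headers (linux/bpf_common.h and linux/filter.h) to define, so treat them as assumptions to check against your own headers. Note that jneq #3, drop is encoded as a jeq with k=3 that falls through on a match (jt=0) and skips one instruction otherwise (jf=1).

```python
# Opcode building blocks, assumed to match <linux/bpf_common.h>.
BPF_LD, BPF_JMP, BPF_RET = 0x00, 0x05, 0x06
BPF_W, BPF_ABS = 0x00, 0x20
BPF_JEQ, BPF_K = 0x10, 0x00

# Extension addressing, assumed to match <linux/filter.h>.
SKF_AD_OFF = -0x1000   # negative base offset that selects an extension
SKF_AD_CPU = 36        # the "cpu" extension


def cpu_filter(cpu, accept=262144):
    """Encode 'ld #cpu; jneq #cpu, drop; ret #accept; drop: ret #0'."""
    insns = [
        # k is stored as an unsigned 32-bit value, hence the mask.
        (BPF_LD | BPF_W | BPF_ABS, 0, 0, (SKF_AD_OFF + SKF_AD_CPU) & 0xFFFFFFFF),
        (BPF_JMP | BPF_JEQ | BPF_K, 0, 1, cpu),
        (BPF_RET | BPF_K, 0, 0, accept),
        (BPF_RET | BPF_K, 0, 0, 0),
    ]
    # "count,code jt jf k,code jt jf k,..." with a trailing comma.
    body = "".join(f"{code} {jt} {jf} {k}," for code, jt, jf, k in insns)
    return f"{len(insns)},{body}"


print(cpu_filter(3))
# 4,32 0 0 4294963236,21 0 1 3,6 0 0 262144,6 0 0 0,
```

The printed string has the same shape as the argument given to --bytecode, so passing a different CPU number only changes the k field of the second instruction.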
If this seems powerful, remember that this is all cBPF. Although cBPF is translated into eBPF internally, all this is nothing compared to what raw eBPF can do!
For more information
I highly recommend you read this article to understand how tcpdump uses cBPF.
After reading that, read this explanation of how tcpdump turns expressions into bytecode.
If you want to learn everything else about it, you can always check out the source code!