
Analyzing KVM Hypercalls with eBPF Tracing

I still visit my research lab quite often. It’s always nice to be in the zone where the boundaries of knowledge are being pushed further and excitement is in the air. I like this ritual, as this is a place where one can discuss Linux kernel code and the philosophy of life all in a single sentence while we quietly sip coffee from our hipster cups.

In the lab, Abder is working these days on some cool hypercall stuff. The exact details of his work are quite complex, but he calls his tool hypertrace. I think I am sold on the name already. Hypercalls are like syscalls, but instead of going to the kernel, they are made from the guest to the hypervisor, with arguments and an identifier value (just like syscalls). The hypervisor then brokers the privileged operation for the guest. Pretty neat. Here is some kernel documentation on hypercalls.

So Abder steered one recent discussion towards the internals of KVM and wanted to know the latency caused by a hypercall he was introducing for his experiments (as I said, he is introducing a new hypercall with an id – for example, 42). His analysis use case was quite specific – he wanted to trace the kvm_exit -> kvm_hypercall -> kvm_entry sequence to know the exact latency he was causing over a given duration. In addition, the tracing overhead needs to be minimal and the tracing is only for a short duration. This part is somewhat trivial: these 3 tracepoints already exist in the kernel and he could latch on to them. Essentially, he needs to look at the exit_reason argument of the kvm_exit tracepoint and check that it is a VMCALL (18), which would denote that a hypercall is coming up next. Then he could look at the next kvm_entry event and compute the time delta to get the latency.

Even though it is possible to record these events with traditional tracers such as LTTng and Ftrace, Abder was only interested in his specific hypercall (nr = 42), along with the kvm_exit that happened just before it (with exit_reason = 18) and the kvm_entry just after. This is not straightforward: it’s not possible to do such specific tracing with traditional tracers at high speed and low overhead. The selection of events should not just be a simple filter, but should be stateful. Just when Abder was about to embark on a journey of kprobes and kernel modules, I once more got the opportunity of being Morpheus and said, “What if I told you…”

The rest is history. (dramatic pause)

Jokes apart, here is a small demo eBPF/BCC script that allows us to hook onto the 3 aforementioned tracepoints in the Linux kernel and conditionally record the trace events:

from __future__ import print_function
from bcc import BPF

# load BPF program
b = BPF(text="""
#define EXIT_REASON 18
// flag map: set when a kvm_exit with reason VMCALL is seen
BPF_HASH(start, u8, u8);

TRACEPOINT_PROBE(kvm, kvm_exit) {
    u8 e = EXIT_REASON;
    u8 one = 1;
    if (args->exit_reason == EXIT_REASON) {
        bpf_trace_printk("KVM_EXIT exit_reason : %d\\n", args->exit_reason);
        start.update(&e, &one);
    }
    return 0;
}

TRACEPOINT_PROBE(kvm, kvm_entry) {
    u8 e = EXIT_REASON;
    u8 zero = 0;
    u8 *s = start.lookup(&e);
    if (s != NULL && *s == 1) {
        bpf_trace_printk("KVM_ENTRY vcpu_id : %u\\n", args->vcpu_id);
        start.update(&e, &zero);
    }
    return 0;
}

TRACEPOINT_PROBE(kvm, kvm_hypercall) {
    u8 e = EXIT_REASON;
    u8 *s = start.lookup(&e);
    if (s != NULL && *s == 1) {
        bpf_trace_printk("HYPERCALL nr : %d\\n", args->nr);
    }
    return 0;
}
""")

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "EVENT"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue

    print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))

The TRACEPOINT_PROBE() interface in BCC allows us to use static tracepoints in the kernel. For example, whenever a kvm_exit occurs in the kernel, the first probe is executed and it records the event if the exit reason was VMCALL. At the same time it updates a BPF hash map, which basically acts as a flag here for the other events. I recommend checking out Lesson 12 from the BCC Python Developer Tutorial if this seems interesting to you 🙂 In addition, the reference guide lists the most important C and Python APIs for BCC.
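
Since Abder’s real goal is the kvm_exit -> kvm_entry latency rather than the raw events, the same pattern can be pushed a bit further: store a timestamp in the hash map on the VMCALL exit and print the delta when the matching kvm_entry fires. Here is a minimal, untested sketch of that idea (the map name ts_start and the single constant key are my own choices, and for brevity it times every VMCALL exit rather than only nr = 42; keying the map by vcpu_id would be safer on multi-vCPU guests):

from __future__ import print_function
from bcc import BPF

b = BPF(text="""
#define EXIT_REASON 18
// timestamp of the last VMCALL kvm_exit, keyed by a single constant
BPF_HASH(ts_start, u8, u64);

TRACEPOINT_PROBE(kvm, kvm_exit) {
    u8 key = 0;
    if (args->exit_reason == EXIT_REASON) {
        u64 ts = bpf_ktime_get_ns();
        ts_start.update(&key, &ts);
    }
    return 0;
}

TRACEPOINT_PROBE(kvm, kvm_entry) {
    u8 key = 0;
    u64 *tsp = ts_start.lookup(&key);
    if (tsp != NULL) {
        u64 delta = bpf_ktime_get_ns() - *tsp;
        bpf_trace_printk("VMCALL exit->entry latency : %llu ns\\n", delta);
        ts_start.delete(&key);
    }
    return 0;
}
""")

print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "EVENT"))
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue
    print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))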

To test the above example out, we can introduce our own hypercall in the VM using this small test program:

/* 0x0F 0x01 0xC1 is the VMCALL instruction */
#define do_hypercall(nr, p1, p2, p3, p4) \
__asm__ __volatile__(".byte 0x0F,0x01,0xC1\n"::"a"(nr), \
    "b"(p1), \
    "c"(p2), \
    "d"(p3), \
    "S"(p4))

int main(void)
{
    do_hypercall(42, 0, 0, 0, 0);
    return 0;
}
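
To try it out, compile this with gcc inside the guest and run it while the BCC script is active on the host. Hypercall number 42 is of course not implemented, so KVM just returns an error to the guest – but the VMCALL still triggers the exit -> hypercall -> entry sequence shown below.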

While the BPF program is running, if we do a hypercall, we get the following output:

TIME(s)            COMM             PID    EVENT
1795.472072000     CPU 0/KVM        7288   KVM_EXIT exit_reason : 18
1795.472112000     CPU 0/KVM        7288   HYPERCALL nr : 42
1795.472119000     CPU 0/KVM        7288   KVM_ENTRY vcpu_id : 0

So we see how, in a few minutes, we could precisely gather only those events that were of interest to us – saving us from the hassle of setting up other traces or kernel modules. eBPF/BCC based analysis allowed us to conditionally trace only a small subset of events instead of the huge flow of events that we would have had to analyze offline. KVM internals are like a dark dungeon and I feel as if I am embarking on a quest here. There is a lot more upcoming KVM-related analysis we are doing with eBPF/BCC. Stay tuned for updates! If you find any more interesting use cases for eBPF in the meantime, let me know. I would love to try them out! As always, comments, corrections and criticisms are welcome.


Fast Tracing with GDB

Even though GDB is a traditional debugger, it provides support for dynamic, fast user-space tracing. Tracing, in simple terms, is super fast data logging from a target application or the kernel. The data is usually a superset of what a user would normally want from debugging but cannot get because of the debugger overhead. The traditional debugging approach can indeed alter the application’s output or its behavior. Thus the need for tracing arises. GDB is in fact one of the first projects which tried to have an integrated approach to debugging and tracing in the same tool. It has been designed in a manner such that sufficient decoupling is maintained – allowing it to expand and be flexible. An example is the use of the In-Process Agent (IPA), which is crucial to fast tracing in GDB but is not necessary for TRAP-based normal tracepoints.

GDB’s Tracing Infrastructure

Tracing is performed with the trace and collect commands. The location where the user wants to collect some data is called a tracepoint. It is just a special type of breakpoint, without support for running GDB commands on a tracepoint hit. As the program progresses and passes the tracepoint, data (such as register values, memory values etc.) gets collected based on certain conditions (if so desired). The collected data is stored in a trace buffer when the tracepoint is hit. Later, that data can be examined from the collected trace snapshot using tfind. However, tracing for now is restricted to remote targets (such as gdbserver). Apart from this type of dynamic tracing, there is also support for static tracepoints, in which instrumentation points known as markers are embedded in the target and can be activated or deactivated. The process of installing these static tracepoints is known as probing a marker. Assuming that you have started GDB and your binary is loaded, a sample trace session may look something like this:

(gdb) trace foo
(gdb) actions
Enter actions for tracepoint #1, one per line.
> collect $regs,$locals
> while-stepping 9
  > collect $regs
  > end
> end
(gdb) tstart
[program executes/continues...]
(gdb) tstop

This puts a tracepoint at foo, collects all register values and local variables when the tracepoint is hit, and then collects all register values for each of the subsequent 9 single-stepped instructions. We can now analyze the data using tfind or tdump.

(gdb) tfind 0
Found trace frame 0, tracepoint 1
54    bar    = (bar & 0xFF000000) >> 24;

(gdb) tdump
Data collected at tracepoint 1, trace frame 0:
rax    0x2000028 33554472
rbx    0x0 0
rcx    0x33402e76b0 220120118960
rdx    0x1 1
rsi    0x33405bca30 220123089456
rdi    0x2000028 33554472
rbp    0x7fffffffdca0 0x7fffffffdca0
rsp    0x7fffffffdca0 0x7fffffffdca0
.
.
rip    0x4006f1 0x4006f1 <foo+7>
[and so on...]

(gdb) tfind 4
Found trace frame 4, tracepoint 1
0x0000000000400700 55    r1 = (bar & 0x00F00000) >> 20;

(gdb) p $rip
$1 = (void (*)()) 0x400700 <foo+22>

So one can observe the data collected from different trace frames in this manner, and even write it out to a separate file if required. Going more in depth to see how tracing works, let’s look at GDB’s two tracing mechanisms:

Normal Tracepoints

These are the basic, default type of tracepoints. The idea of their use is similar to breakpoints, where GDB replaces the target instruction with a TRAP or any other exception-causing instruction. On x86, this is usually an int 3, which has a special single-byte opcode (0xCC) reserved for it. Since it is just 1 byte, it can replace the start of the target instruction without corrupting the instructions that follow. So, during the execution of the process, the CPU hits the int 3, the process halts and its state is saved. The OS sends a SIGTRAP signal to the process. As GDB is attached to (or is running) the process, it receives a SIGCHLD as a notification that something happened with a child. It does a wait(), which tells it that the child has received a SIGTRAP. Thus the SIGTRAP never reaches the process, as GDB intercepts it. The original instruction is first restored, or executed out-of-line for non-stop multi-threaded debugging. GDB transfers control to the trace collection code, which does the data collection part upon evaluating any conditions that were set. The data is stored in a trace buffer. Then, the original instruction is replaced again with the tracepoint and normal execution continues. This is all fine and good, but there is a catch – the TRAP mechanism alters the flow of the application and control is passed to the OS, which leads to some delay and a compromise in speed. But even with that, because of a very restrictive conditional tracing design and the better interaction of interrupt-driven approaches with instruction caches, normal interrupt-based tracing in GDB is a robust technique. A faster solution would indeed be a pure user-space approach, where everything is done at the application level.
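
As a side note, the conditions mentioned above use the regular GDB expression syntax, so a conditional (normal) tracepoint session could look something like the following – the variable name here is just for illustration:

(gdb) trace foo if (bar > 100)
(gdb) actions
Enter actions for tracepoint #1, one per line.
> collect bar, $regs
> end
(gdb) tstart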

Fast Tracepoints

Owing to the limitations stated above, a fast tracepoint approach was developed. This special type of tracepoint uses a dynamic tracing approach. Instead of using the breakpoint approach, GDB uses a mix of the IPA and the remote target (gdbserver) to replace the target instruction with a 5-byte jump to a special section in memory called a jump pad. This jump pad, also known as a trampoline, first pushes all registers onto the stack (saving the program state). Then it calls the collector function for trace data collection, executes the displaced instruction out-of-line, and jumps back to the original instruction stream. I will probably write more about how trampolines work, and about some techniques used in dynamic instrumentation frameworks like Dyninst, in a subsequent post.

[Figure: gdb-working]

Fast tracepoints can be set with the ftrace command, which is used almost exactly like the trace command. gdbserver sends a special command in the following format to the In-Process Agent:

FastTrace:<tracepoint_object><jump_pad>

where <tracepoint_object> is the object containing the bytecode for conditional tracing, the address, type, action etc., and <jump_pad> is the 8-byte head of the jump-pad location in memory. The IPA prepares all of that and, if all goes well, responds to such a query with:

OK<target_address><jump_pad><fjump_size><fjump>

where <target_address> is the address where the tracepoint is put in the inferior, <jump_pad> is the updated address of the jump-pad head, and <fjump> and <fjump_size> are the jump instruction sequence and its size, copied to the command buffer and sent back by the IPA. The remote target (gdbserver) then modifies the memory of the target process. Much more fine-grained information about fast tracepoint installation is available in the GDB documentation. This is a purely user-space approach to tracing. However, there is a catch – the target instruction to be replaced should be at least 5 bytes long for this to work, as the jump itself is 5 bytes long (on Intel x86). This means that fast tracepoints in GDB cannot be put just anywhere. How code is modified when patching a 5-byte instruction is a discussion for another time. This is probably the fastest way yet to perform dynamic tracing and is a good candidate for such work.
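
From the user’s side, a fast tracepoint session looks almost the same as the earlier one; the notable differences are the ftrace command itself and the fact that the IPA library (libinproctrace.so, shipped with gdbserver) has to be loaded into the target. Something along these lines – the library path and port here are just placeholders for whatever your setup uses:

$ LD_PRELOAD=/usr/lib/libinproctrace.so gdbserver :2345 ./a.out

(gdb) target remote :2345
(gdb) ftrace foo
(gdb) actions
Enter actions for tracepoint #1, one per line.
> collect $regs
> end
(gdb) tstart
[program executes/continues...]
(gdb) tstop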

GDB traces can be saved with tsave and, with the -ctf switch, can also be exported to CTF. I have not tried this, but hopefully TraceCompass should then at least be able to open the traces for further analysis. GDB’s fast tracepoint mechanism is quite fast I must say – but in terms of usability, LTTng is a far better and more advanced option. GDB however allows dynamic insertion of tracepoints, and the tracing features are well integrated into your friendly neighborhood debugger. Happy Tracing!
