Kernel, Linux, Perf, Qt

Deconstructing Perf’s Data File

It is no mystery that Perf is like a giant organism written in C with an infinitely complex design. Of course, there is no such thing. Complexity is just a state of mind they would say and yes, it starts fading away as soon as you get enlightened. So, one fine day, I woke up and decided to understand how the perf.data file works because I wanted to extract the Intel PT binary data from it. I approached Francis and we started off on an amazing adventure (which is still underway). If you are of the impatient kind, here is the code.

A Gentle Intro to Perf

I would not delve deep into Perf right now. However, the basics are simple to grasp. It is like a Swiss army knife which contains tools to understand your system from either a very coarse to a quite fine granularity level. It can go all the way from profiling, static/dynamic tracing to custom analyses build up on hardware performance counters. With custom scripts, you can generate call-stacks, Flame Graphs and what not! Many tracing tools such as LTTng also support adding perf contexts to their own traces. My personal experience with Perf has usually been just to profile small piece of code. Sometimes I use its annotate feature to look at the disassembly to see instruction profiling right from my terminal. Occasionally, I use it to get immediate stats on system events such as syscalls etc. Its fantastic support with the Linux kernel owing to the fact that it is tightly bound to each release, means that you can always have reliable information. Brendan Gregg has written so much about it as part of his awesome Linux performance tools posts. He has some some actual useful stuff you can do with Perf. My posts here just talks about some of its internals. So, if Perf was a dinosaur, I am just talking about its toe in this post.

Perf contains a kernel part and a userspace part. The userspace part of Perf is located in the kernel directory tools/perf. The perf command that we use is compiled here. It reads kernel data from the Perf buffer based on the events you select for recording. For a list of all events you can use, do perf list or sudo perf list. The data from the Perf’s buffer is then written to the perf.data file. For hardware traces such as in Intel PT, the extra data is written in auxiliary buffers and saved to the data file. So to get your own custom stuff out from Perf, just read its data file. There are multiple ways like using scripts too, but reading a binary directly allows for a better learning experience. But the perf.data is like a magical output file that contains a plethora of information based on what events you selected, how the perf record command was configured. With hardware trace enabled, it can generate a 200MB+ file in 3-4 seconds (yes, seriously!). We need to first know how it is organized and how the binary is written.

Dissection Begins

Rather than going deep and trying to understand scripted ways to decipher this, we went all in and opened the file with a hex editor. The goal here was to learn how the Intel PT data can be extracted from the AUX buffers that Perf used and wrote in the perf.data file. By no means is this the only correct way to do this. There are more elegant solutions I think, esp. if you see some kernel documentation and the uapi perf_event.h file or see these scripts for custom analysis. Even then, this can surely be a good example to tinker around more with Perf. Here is the workflow:

  1. Open the file as hex. I use either Vim with :%!xxd command or Bless. This will come in handly later.
  2. Use perf report -D to keep track of how Perf is decoding and visualizing events in the data file in hex format.
  3. Open the above command with GDB along with the whole Perf source code. It is in the tools/perf directory in kernel source code.

If you setup your IDE to debug, you would also have imported the Perf source code. Now, we just start moving incrementally – looking at the bytes in the hex editor and correlating them with the magic perf report is doing in the debugger. You’ll see lots of bytes like these :

Screenshot from 2016-06-16 19-01-42

A cursory looks tells us that the file starts with a magic – PERFFILE2. Searching it in the source code eventually leads to the structure that defines the file header:

 struct perf_file_header {
   u64 magic;
   u64 size;
   u64 attr_size;
   struct perf_file_section attrs;
   struct perf_file_section data;
   /* event_types is ignored */
   struct perf_file_section event_types;
   DECLARE_BITMAP(adds_features, HEADER_FEAT_BITS);
};

So we start by mmaping the whole file to buf and just typecasting it to this. The header->data element is an interesting thing. It contains an offset and size as part of perf_file_section struct. We observe, that the offset is near the start of some strings – probably some event information? Hmm.. so lets try to typecast this offset position in the mmap buffer (pos + buf) to perf_event_header struct :

struct perf_event_header {
   __u32 type;
   __u16 misc;
   __u16 size;
};

For starters, lets further print this h->type and see what the first event is. With our perf.data file, the perf report -D command as a reference tells us that it may be the event type 70 (0x46) with 136 (0x88) bytes of data in it. Well, the hex says its the same thing at (buf + pos) offset. This in interesting! Probably we just found our event. Lets just iterate over the whole buffer while adding the h->size. We will print the event types as well.

while (pos < file.size()) {
    struct perf_event_header *h = (struct perf_event_header *) (buf + pos);
    qDebug() << "Event Type" <type;
    qDebug() << "Event Size" <size;
    pos += h->size;
}

Nice! We have so many events. Who knew? Perhaps the data file is not a mystery anymore. What are these event types though? The perf_event.h file has a big enum with event types and some very useful documentation. Some more mucking around leads us to the following enum :

enum perf_user_event_type { /* above any possible kernel type */
    PERF_RECORD_USER_TYPE_START = 64,
    PERF_RECORD_HEADER_ATTR = 64, 
    PERF_RECORD_HEADER_EVENT_TYPE = 65, /* depreceated */
    PERF_RECORD_HEADER_TRACING_DATA = 66,
    PERF_RECORD_HEADER_BUILD_ID = 67,
    PERF_RECORD_FINISHED_ROUND = 68,
    PERF_RECORD_ID_INDEX = 69,
    PERF_RECORD_AUXTRACE_INFO = 70,
    PERF_RECORD_AUXTRACE = 71,
    PERF_RECORD_AUXTRACE_ERROR = 72,
    PERF_RECORD_HEADER_MAX
};

So event 70 was PERF_RECORD_AUXTRACE_INFO. Well, the Intel PT folks indicate in the documentation that they store the hardware trace data in an AUX buffer. And perf report -D also shows event 71 with some decoded PT data. Perhaps, that is what we want. A little more fun with GDB on perf tells us that while iterating perf itself uses the union perf_event from event.h which contains an auxtrace_event struct as well.

struct auxtrace_event {
    struct perf_event_header header;
    u64 size;
    u64 offset;
    u64 reference;
    u32 idx;
    u32 tid;
    u32 cpu;
    u32 reserved__; /* For alignment */
};

So, this is how they lay out the events in the file. Interesting. Well, it seems we can just look for event type 71 and then typecast it to this struct. Then extract the size amount of bytes from this and move on. Intel PT documentation further says that the aux buffer was per-CPU so we may need to extract separate files for each CPU based on the cpu field in the struct. We do just that and get our extracted bytes as raw PT packets which the CPUs generated when the intel_pt event was used with Perf.

A Working Example

The above exercise was surprisingly easy once we figured out stuff so we just did a small prototype for our lab’s research purposes.  There are lot of things we learnt. For example, the actual bytes for the header (containing event stats etc. – usually the thing that Perf prints on top if you do perf report --header) are actually written at the end of the Perf’s data file. How endianness of file is determined by magic. Just before the header in the end, there are some bytes which I still have not figured out (near event 68) how they can be handled. Perhaps it is too easy, and I just don’t know the big picture yet. We just assume there are no more events if the event size is 0 😉 Works for now. A more convenient way that this charade is to use scripts such as this for doing custom analyses. But I guess it is less fun that going all l33t on the data file.

I’ll try to get some more events out along with the Intel PT data and see what all stuff is hidden inside. Also, Perf is quite tightly bound to the kernel for various reasons. Custom userspace APIs may not always be the safest solution. There is no guarantee that analyzing binary from newer versions of Perf would always work with the approach of our experimental tool. I’ll keep you folks posted as I discover more about Perf internals.

Advertisement
Standard