Linux

Analyzing KVM Hypercalls with eBPF Tracing

I still visit my research lab quite often. It’s always nice to be in the zone where boundaries of knowledge are being pushed further and excitement is in the air. I like this ritual as this is a place where one can discuss Linux kernel code and the philosophy of life all in a single sentence while we quietly sip our hipster coffee.

In the lab, Abder is working these days on some cool hypercall stuff. The exact details of his work are quite complex, but he calls his tool hypertrace. I think I am sold on the name already. Hypercalls are like syscalls, but instead of such calls going to a kernel, they are made from guest to the hypervisor with arguments and an identifier value (just like syscalls). The hypervisor then brokers the privileged operation for the guest. Pretty neat. Here is some kernel documentation on hypercalls.

So Abder steered one recent discussion towards the internals of KVM and wanted to know the latency caused by a hypercall he was introducing for his experiments (as I said, he is introducing a new hypercall with an id – for example 42). His analysis use case was quite specific – he wanted to trace the kvm_exit -> kvm_hypercall -> kvm_entry sequence to know the exact latency he was causing over a given duration. In addition, the tracing overhead needs to be minimal and the tracing runs for a short duration only. At first glance this is somewhat trivial. These 3 tracepoints are there in the kernel already and he could latch on to them. Essentially, he needs to look at the exit_reason argument of the kvm_exit tracepoint and it should be a VMCALL (18), which would denote that a hypercall is coming up next. Then he could look at the next kvm_entry event and find the time-delta to get the latency. Even though it is possible to record these events with traditional tracers such as LTTng and Ftrace, Abder was only interested in his specific hypercall (nr = 42) along with the kvm_exit that happened before it (with exit_reason = 18) and the kvm_entry after it. This is not straightforward. It’s not possible to do such specific tracing with traditional tracers at a high speed and low overhead. This means the selection of events should not just be a simple filter, but should be stateful. Just when Abder was about to embark on a journey of kprobes and kernel modules, I once more got the opportunity of being Morpheus and said “What if I told you…”

The rest is history. (dramatic pause)

Jokes apart, here is a small demo eBPF/BCC script that allows us to hook onto the 3 aforementioned tracepoints in the Linux kernel and conditionally record the trace events:

from __future__ import print_function
from bcc import BPF

# load BPF program
b = BPF(text="""
#define EXIT_REASON 18
BPF_HASH(start, u8, u8);

TRACEPOINT_PROBE(kvm, kvm_exit) {
    u8 e = EXIT_REASON;
    u8 one = 1;
    if (args->exit_reason == EXIT_REASON) {
        bpf_trace_printk("KVM_EXIT exit_reason : %d\\n", args->exit_reason);
        start.update(&e, &one);
    }
    return 0;
}

TRACEPOINT_PROBE(kvm, kvm_entry) {
    u8 e = EXIT_REASON;
    u8 zero = 0;
    u8 *s = start.lookup(&e);
    if (s != NULL && *s == 1) {
        bpf_trace_printk("KVM_ENTRY vcpu_id : %u\\n", args->vcpu_id);
        start.update(&e, &zero);
    }
    return 0;
}

TRACEPOINT_PROBE(kvm, kvm_hypercall) {
    u8 e = EXIT_REASON;
    u8 *s = start.lookup(&e);
    /* only report hypercalls that follow a VMCALL exit; the flag is cleared in kvm_entry */
    if (s != NULL && *s == 1) {
        bpf_trace_printk("HYPERCALL nr : %d\\n", args->nr);
    }
    return 0;
}
""")

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "EVENT"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue

    print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))

The TRACEPOINT_PROBE() interface in BCC allows us to use static tracepoints in the kernel. For example, whenever a kvm_exit occurs in the kernel, the first probe is executed and it records the event if the exit reason was VMCALL. At the same time it updates a BPF hash map, which basically acts like a flag here for the other two probes. I recommend that you check out Lesson 12 from the BCC Python Developer Tutorial if this seems interesting to you 🙂 In addition, the reference guide lists the most important C and Python APIs for BCC.

To test the above example out, we can introduce our own hypercall in the VM using this small test program :

#define do_hypercall(nr, p1, p2, p3, p4) \
__asm__ __volatile__(".byte 0x0F,0x01,0xC1\n"::"a"(nr), \
    "b"(p1), \
    "c"(p2), \
    "d"(p3), \
    "S"(p4))

int main(void)
{
    /* issue hypercall number 42 with four zero arguments */
    do_hypercall(42, 0, 0, 0, 0);
    return 0;
}

While the BPF program is running, if we do a hypercall, we get the following output :

TIME(s)            COMM             PID    EVENT
1795.472072000     CPU 0/KVM        7288   KVM_EXIT exit_reason : 18
1795.472112000     CPU 0/KVM        7288   HYPERCALL nr : 42
1795.472119000     CPU 0/KVM        7288   KVM_ENTRY vcpu_id : 0
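
The latency Abder was after falls straight out of these three lines: it is the gap between the KVM_EXIT and the KVM_ENTRY timestamps. As a rough offline sketch of the same stateful selection (plain Python rather than BPF; the function name, the tuple layout and the keying by thread id are my own assumptions, not part of hypertrace or BCC), the matching could look like this:

```python
def hypercall_latencies(events, exit_reason=18, hypercall_nr=42):
    """Match kvm_exit -> kvm_hypercall -> kvm_entry per thread.

    events: iterable of (timestamp_s, pid, name, value) tuples in time order,
    where value is exit_reason, nr or vcpu_id depending on the event name.
    Returns a list of (pid, latency_s) for every complete sequence.
    """
    pending = {}   # pid -> [exit timestamp, hypercall seen?]
    latencies = []
    for ts, pid, name, value in events:
        if name == "kvm_exit":
            if value == exit_reason:
                pending[pid] = [ts, False]
            else:
                pending.pop(pid, None)
        elif name == "kvm_hypercall":
            if pid in pending and value == hypercall_nr:
                pending[pid][1] = True
            else:
                pending.pop(pid, None)   # some other hypercall, drop it
        elif name == "kvm_entry":
            state = pending.pop(pid, None)
            if state is not None and state[1]:
                latencies.append((pid, ts - state[0]))
    return latencies


# The three events from the output above
sample = [
    (1795.472072, 7288, "kvm_exit", 18),
    (1795.472112, 7288, "kvm_hypercall", 42),
    (1795.472119, 7288, "kvm_entry", 0),
]

for pid, latency in hypercall_latencies(sample):
    print("pid %d: %.1f us" % (pid, latency * 1e6))
```

For the output above this works out to roughly 47 microseconds between the exit and the re-entry. Keeping one state per thread rather than a single global flag avoids mixing up sequences from different vCPUs.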

So we see how in a few minutes we could precisely gather only those events that were of interest to us – saving us from the hassle of setting up other traces or kernel modules. eBPF/BCC based analysis allowed us to conditionally trace only a certain subsection of events instead of the huge flow of events that we would have had to analyze offline. KVM internals are like a dark dungeon and I feel as if I am embarking on a quest here. There is a lot more KVM related analysis that we are doing with eBPF/BCC. Stay tuned for updates! If you find any more interesting use cases for eBPF in the meantime, let me know. I would love to try them out! As always, comments, corrections and criticisms are welcome.

Kernel, Linux, Perf, Qt

Deconstructing Perf’s Data File

It is no mystery that Perf is like a giant organism written in C with an infinitely complex design. Of course, there is no such thing. Complexity is just a state of mind they would say and yes, it starts fading away as soon as you get enlightened. So, one fine day, I woke up and decided to understand how the perf.data file works because I wanted to extract the Intel PT binary data from it. I approached Francis and we started off on an amazing adventure (which is still underway). If you are of the impatient kind, here is the code.

A Gentle Intro to Perf

I would not delve deep into Perf right now. However, the basics are simple to grasp. It is like a Swiss army knife which contains tools to understand your system at anything from a very coarse to a quite fine level of granularity. It can go all the way from profiling and static/dynamic tracing to custom analyses built on hardware performance counters. With custom scripts, you can generate call-stacks, Flame Graphs and what not! Many tracing tools such as LTTng also support adding perf contexts to their own traces. My personal experience with Perf has usually been just to profile small pieces of code. Sometimes I use its annotate feature to look at the disassembly and see instruction profiling right from my terminal. Occasionally, I use it to get immediate stats on system events such as syscalls etc. Its fantastic support in the Linux kernel, owing to the fact that it is tightly bound to each release, means that you can always have reliable information. Brendan Gregg has written so much about it as part of his awesome Linux performance tools posts. He has some actually useful stuff you can do with Perf. My post here just talks about some of its internals. So, if Perf was a dinosaur, I am just talking about its toe in this post.

Perf contains a kernel part and a userspace part. The userspace part of Perf is located in the kernel source tree under tools/perf. The perf command that we use is compiled here. It reads kernel data from the Perf buffer based on the events you select for recording. For a list of all events you can use, do perf list or sudo perf list. The data from Perf’s buffer is then written to the perf.data file. For hardware traces such as Intel PT, the extra data is written to auxiliary buffers and saved to the data file. So to get your own custom stuff out of Perf, just read its data file. There are multiple ways to do this, such as using scripts, but reading the binary directly allows for a better learning experience. The perf.data file is like a magical output file that contains a plethora of information based on what events you selected and how the perf record command was configured. With hardware trace enabled, it can generate a 200MB+ file in 3-4 seconds (yes, seriously!). We need to first know how it is organized and how the binary is written.

Dissection Begins

Rather than going deep and trying to understand scripted ways to decipher this, we went all in and opened the file with a hex editor. The goal here was to learn how the Intel PT data can be extracted from the AUX buffers that Perf used and wrote in the perf.data file. By no means is this the only correct way to do this. There are more elegant solutions I think, esp. if you see some kernel documentation and the uapi perf_event.h file or see these scripts for custom analysis. Even then, this can surely be a good example to tinker around more with Perf. Here is the workflow:

  1. Open the file as hex. I use either Vim with the :%!xxd command or Bless. This will come in handy later.
  2. Use perf report -D to keep track of how Perf is decoding and visualizing events in the data file in hex format.
  3. Open the above command with GDB along with the whole Perf source code. It is in the tools/perf directory in kernel source code.

If you set up your IDE to debug, you would also have imported the Perf source code. Now, we just start moving incrementally – looking at the bytes in the hex editor and correlating them with the magic perf report is doing in the debugger. You’ll see lots of bytes like these :

[Screenshot: hex view of the perf.data file]

A cursory look tells us that the file starts with a magic – PERFILE2 (exactly 8 bytes, so it fits the u64 magic field). Searching for it in the source code eventually leads to the structure that defines the file header:

struct perf_file_header {
   u64 magic;
   u64 size;
   u64 attr_size;
   struct perf_file_section attrs;
   struct perf_file_section data;
   /* event_types is ignored */
   struct perf_file_section event_types;
   DECLARE_BITMAP(adds_features, HEADER_FEAT_BITS);
};

So we start by mmaping the whole file to buf and just typecasting it to this struct. The header->data element is an interesting thing. It contains an offset and a size as part of the perf_file_section struct. We observe that the offset is near the start of some strings – probably some event information? Hmm... so let’s try to typecast this offset position in the mmap buffer (buf + pos) to the perf_event_header struct :

struct perf_event_header {
   __u32 type;
   __u16 misc;
   __u16 size;
};
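
The same two typecasts can be tried out without C. Below is a minimal sketch using Python's struct module; it assumes a little-endian file, reads only the fixed fields in front of the feature bitmap, and the helper names are mine rather than anything from Perf:

```python
import struct

# magic, size, attr_size, then three perf_file_section {offset, size} pairs
FILE_HEADER = struct.Struct("<8sQQ" + "QQ" * 3)
# struct perf_event_header: __u32 type, __u16 misc, __u16 size
EVENT_HEADER = struct.Struct("<IHH")


def parse_file_header(buf):
    """Decode the fixed part of perf_file_header from the start of buf."""
    (magic, size, attr_size,
     attrs_off, attrs_size,
     data_off, data_size,
     types_off, types_size) = FILE_HEADER.unpack_from(buf, 0)
    if magic != b"PERFILE2":
        raise ValueError("not a little-endian perf.data v2 file: %r" % magic)
    return {
        "size": size,
        "attr_size": attr_size,
        "attrs": (attrs_off, attrs_size),
        "data": (data_off, data_size),
        "event_types": (types_off, types_size),
    }


def parse_event_header(buf, pos):
    """Decode a perf_event_header at byte offset pos; returns (type, misc, size)."""
    return EVENT_HEADER.unpack_from(buf, pos)
```

With a real file one would do something like `buf = open("perf.data", "rb").read()`, take `pos` from `parse_file_header(buf)["data"][0]` and then call `parse_event_header(buf, pos)` to see the first event.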

For starters, let’s print h->type and see what the first event is. With our perf.data file, the perf report -D command as a reference tells us that it may be event type 70 (0x46) with 136 (0x88) bytes of data in it. Well, the hex says it’s the same thing at the (buf + pos) offset. This is interesting! Probably we just found our event. Let’s just iterate over the whole buffer, adding h->size each time. We will print the event types as well.

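/* pos starts at header->data.offset */
while (pos < file.size()) {
    struct perf_event_header *h = (struct perf_event_header *) (buf + pos);
    if (h->size == 0)   /* a zero-sized event would loop forever; treat it as the end */
        break;
    qDebug() << "Event Type" << h->type;
    qDebug() << "Event Size" << h->size;
    pos += h->size;
}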

Nice! We have so many events. Who knew? Perhaps the data file is not a mystery anymore. What are these event types though? The perf_event.h file has a big enum with event types and some very useful documentation. Some more mucking around leads us to the following enum :

enum perf_user_event_type { /* above any possible kernel type */
    PERF_RECORD_USER_TYPE_START = 64,
    PERF_RECORD_HEADER_ATTR = 64,
    PERF_RECORD_HEADER_EVENT_TYPE = 65, /* deprecated */
    PERF_RECORD_HEADER_TRACING_DATA = 66,
    PERF_RECORD_HEADER_BUILD_ID = 67,
    PERF_RECORD_FINISHED_ROUND = 68,
    PERF_RECORD_ID_INDEX = 69,
    PERF_RECORD_AUXTRACE_INFO = 70,
    PERF_RECORD_AUXTRACE = 71,
    PERF_RECORD_AUXTRACE_ERROR = 72,
    PERF_RECORD_HEADER_MAX
};

So event 70 was PERF_RECORD_AUXTRACE_INFO. Well, the Intel PT folks indicate in the documentation that they store the hardware trace data in an AUX buffer. And perf report -D also shows event 71 with some decoded PT data. Perhaps that is what we want. A little more fun with GDB on perf tells us that, while iterating, perf itself uses the union perf_event from event.h, which contains an auxtrace_event struct as well.

struct auxtrace_event {
    struct perf_event_header header;
    u64 size;
    u64 offset;
    u64 reference;
    u32 idx;
    u32 tid;
    u32 cpu;
    u32 reserved__; /* For alignment */
};

So, this is how they lay out the events in the file. Interesting. Well, it seems we can just look for event type 71 and then typecast it to this struct. Then extract the size amount of bytes from this and move on. Intel PT documentation further says that the aux buffer was per-CPU so we may need to extract separate files for each CPU based on the cpu field in the struct. We do just that and get our extracted bytes as raw PT packets which the CPUs generated when the intel_pt event was used with Perf.
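
Here is a rough Python sketch of that extraction loop. It is not the code of our prototype; it assumes a little-endian file, and it assumes that the raw AUX bytes sit directly after each auxtrace_event struct and are not counted in header.size, so the walk has to skip them separately. The function name is made up for illustration:

```python
import struct
from collections import defaultdict

EVENT_HEADER = struct.Struct("<IHH")           # type, misc, size
AUXTRACE_EVENT = struct.Struct("<IHHQQQIIII")  # header + size, offset, reference, idx, tid, cpu, reserved
PERF_RECORD_AUXTRACE = 71


def extract_aux_per_cpu(buf, data_offset, data_size):
    """Walk the events in the data section and collect raw AUX bytes per CPU."""
    per_cpu = defaultdict(bytearray)
    pos = data_offset
    end = min(data_offset + data_size, len(buf))
    while pos + EVENT_HEADER.size <= end:
        ev_type, _misc, ev_size = EVENT_HEADER.unpack_from(buf, pos)
        if ev_size == 0:
            break  # same stop condition as the prototype: size 0 means no more events
        if ev_type == PERF_RECORD_AUXTRACE and ev_size >= AUXTRACE_EVENT.size:
            fields = AUXTRACE_EVENT.unpack_from(buf, pos)
            aux_size, cpu = fields[3], fields[8]
            start = pos + ev_size
            per_cpu[cpu] += buf[start:start + aux_size]
            pos = start + aux_size  # skip the trace bytes that follow the event
        else:
            pos += ev_size
    return {cpu: bytes(data) for cpu, data in per_cpu.items()}
```

Each value in the returned dictionary can then be written to its own file, one per CPU, and fed to a PT packet decoder.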

A Working Example

The above exercise was surprisingly easy once we figured stuff out, so we just did a small prototype for our lab’s research purposes. There are a lot of things we learnt. For example, the actual bytes for the header (containing event stats etc. – usually the thing that Perf prints on top if you do perf report --header) are actually written at the end of Perf’s data file. We also learnt how the endianness of the file is determined by the magic. Just before the header at the end, there are some bytes (near event 68) which I still have not figured out how to handle. Perhaps it is too easy, and I just don’t know the big picture yet. We just assume there are no more events if the event size is 0 😉 Works for now. A more convenient way than this charade is to use scripts such as this for doing custom analyses. But I guess it is less fun than going all l33t on the data file.

I’ll try to get some more events out along with the Intel PT data and see what all stuff is hidden inside. Also, Perf is quite tightly bound to the kernel for various reasons. Custom userspace APIs may not always be the safest solution. There is no guarantee that analyzing binary from newer versions of Perf would always work with the approach of our experimental tool. I’ll keep you folks posted as I discover more about Perf internals.

Linux

FUDCon Pune 2015

[Photo: the FUDCon venue]

This year’s FUDCon for the APAC region was held once more in the same city of Pune. Attending FUDCon reminded me of 2011 – the last time this event was in Pune. I had submitted some talks and sessions as I still feel more of an APAC guy even though I have changed zones for some time now. Hoping that there would be enough folks interested to know what I have been working on for the last couple of years, I submitted a talk “Kernel and Userspace tracing with LTTng and friends”. You can see the slides here. Of course, systems performance consumes most of my waking hours and I thought that it would benefit Fedora as well. I was happy when I saw that the talk was selected and there was an opportunity for me to share my experiences with others in Pune. Along with this talk I was also going to run a Kernel module workshop and an AskFedora UX/UI hackfest that Sarup and I had decided to organize. I knew that my FUDCon would be packed 🙂

I arrived on 24th night, all jetlagged and tired from a long journey. I met Izhar and Somvannda at Mumbai and we all set out for Pune. To our surprise, Siddhesh and Kushal were waiting for our arrival at 3am in the hotel. Thanks guys for your seamless efforts in co-ordinating travel for speakers! (and of course a whole lot of other things you did). We quickly hit the sack. Most of the next day was spent in doing some chores for FUDCon organization – packing the goodie bags with Ani and Danishka at Siddhesh’s house. We went to the Red Hat Pune office subsequently where I met Jared, Shreyank, Prasad, Harish, Sinny et al.

[Photo: at the Red Hat Pune office]

Also, as you can see, Izhar was not afraid of some fizzy-drinks fireworks in the RH office as well. Chillax. It was just a photo-op 🙂

Day 1

I had a very small selection of talks to attend. The day started with Harish’s keynote and then an Education panel discussion. I soon diverted to some other talks. I started with the kdump and kernel crash analysis workshop by Buland Singh and Gopal Tiwari. Their slides and explanation were good but unfortunately the demo failed. I moved on to Sinny’s presentation on ABI compatibility. This one was delivered quite well IMO. I wanted to attend Vaidik’s Vagrant talk but settled for Samikshan’s talk on his “spartakus” tool to detect kernel ABI breakages. It was something done based on the “sparse” tool. I went to the FUDCon APAC BoF next to see how planning was being done. I don’t remember exactly but probably the day ended with a visit to a local microbrewery.

Day 2

I met Sankarshan after a long time. He was manning the Fedora booth like a soldier in the vanguard. Today I also saw the FUDCon T-shirts that I designed. They looked quite well done, which of course made me happy. I picked up some FUDCoins (aka Fedora pin-badges). Legend (me) says that you can not buy worldly stuff but just pure emotions with such coins. I soon moved to the opening keynote by Jiri, which was nice – mostly because he told us that the mp3 patent was expiring soon and possibly Fedora would support mp3 out of the box. Next was my talk on tracing. Dunno how that went, but some folks met me in the end demanding a copy of Brendan’s performance tools cheat-sheet. Felt nice that people there cared about this 🙂 HasGeek folks tell me that the videos will be available soon. Until then, here are the slides. I continued to Pravin’s talk on Internationalization – quite nice, and then to an old friend Kiran’s talk on Wifi internals. This one was sufficiently detailed and quite informative. I then went on to deliver a workshop on Kernel module programming where I basically started with a simple hello world module and ended with a small packet filter based on netfilter hooks. Some first year students from Amrita University looked very enthusiastic. They even met me and asked me how to begin kernel programming. I was impressed by how pumped up they were about kernel programming even in their first year!

Look who’s trying to bore people to death

This day ended with the customary FUDPub. We also spent the night talking till late about life, the universe and everything with Sinny and Charul – while seeing a buzzed Sarup struggling to make coffee and tea for us as he intermittently poured in his inputs 🙂 This was somewhat like the famous pink slippers incident of FUDCon 2011.

Best. FUDPub. Ever.

I don’t think I can explain how awesome a FUDPub can be when you have awesome food, drinks and a whole bowling alley booked for the volunteers and speakers. It was truly awesome. We all agreed that this has set a threshold for all the future FUDPubs now!

[Photo: FUDPub at the bowling alley]

Day 3

The last day was more about hackfests and some workshops, such as the Docker workshop by Aditya, Lalatendu and Shivprasad (which I did not attend, but have been told was really good). I however attended a really good workshop on Inkscape by Sirko and then a small part of the Blender workshop by Ryan Lerch. It was nice seeing some folks pouring in with their Blender model renders at Harish’s keysigning party and looking content with their dancing cube 🙂 I am sure Ryan did an awesome job in showing them the power of Blender! I was tired by this time and the attendance was thinning, but Sarup and I still managed the AskFedora hackfest. There were only a few folks, but we still managed to get some good feedback from participants Charul and Sinny on the UI done till now by our GSoC student Anuradha. I have to prepare feedback for her soon so that she can make changes. We ended the day with yet another long night of discussions with Siddhesh, Kushal, Charul, Sarup and Sinny.

In the end, I would say – it was an awesome event. The quality of talks was really good. I hope it benefited students and the industry folks that attended these. Also, Sarup is an all round awesome guy and a nice roommate. I will update this if I remember something more and if I manage to get some more photos from the event.

EDIT: Added photos. Venue and my talk photo shamelessly taken from Sinny’s photostream on Flickr.

Embedded, Kernel, Linux

BPF Internals – I

A recent post by Brendan Gregg inspired me to write my own blog post about my findings on how the Berkeley Packet Filter (BPF) evolved, its interesting history and the immense powers it holds – the way Brendan calls it ‘brutal’. I came across this while studying interpreters and small process virtual machines like the proposed KTap’s VM. I was looking at some known papers on register vs stack based VMs, their performance and the various code dispatch mechanisms used in these small VMs. The review of the state of the art soon moved to native code compilation and a discussion on LWN caught my eye. The benefits of JIT were too good to be overlooked, and BPF’s application in things like filtering, tracing and seccomp (used in Chrome as well) made me interested. I knew that the kernel devs were on to something here. This is when I started digging through the BPF background.

Background

Network packet analysis requires an interesting bunch of tech, right from the time a packet reaches the embedded controller on the network hardware in your PC (hardware/data link layer) to the point where it does something useful in your system, such as displaying something in your browser (application layer). For the connected systems evolving these days, the amount of data transfer is huge, and the support infrastructure for network analysis needed a way to filter things out pretty fast. The initial concept of packet filtering was developed keeping such needs in mind, and there were many strategies discussed with every filter, such as the CMU/Stanford Packet Filter (CSPF), Sun’s NIT filter and so on. For example, some earlier filtering approaches (as in CSPF) used a tree based model to represent filters and evaluated them using predicate-tree walking. This earlier approach was also inherited in the Linux kernel’s old filter in the net subsystem.

Consider an engineer’s need to have a probably simple and unrealistic filter on the network packets with the predicates P1, P2, P3 and P4 :

[Figure: boolean filter expression over the predicates P1, P2, P3 and P4]

A filtering approach like that of CSPF would have represented this filter in an expression tree structure as follows:

[Figure: expression tree for the filter]

It is then trivial to walk the tree, evaluating each expression and performing operations on each of them. But this would mean there can be extra costs associated with evaluating predicates which do not necessarily have to be evaluated. For example, what if the packet is neither an ARP packet nor an IP packet? Knowing that the P1 and P2 predicates are untrue, we need not evaluate the other 2 predicates and perform 2 other boolean operations on them to determine the outcome.

In 1992-93, McCanne et al. proposed a BSD Packet Filter with a new CFG-bytecode based filter design. This was an in-kernel approach where a tiny interpreter would evaluate expressions represented as BPF bytecodes. Instead of simple expression trees, they proposed a CFG based filter design. One control flow graph representation of the same filter above can be:

[Figure: control flow graph for the filter]

The evaluation can start from P1, where the right edge is for FALSE and the left is for TRUE, with each predicate being evaluated in this fashion until the evaluation reaches the final result of TRUE or FALSE. The CFG has an inherent property of ‘remembering’, i.e., if P1 and P2 are false, the fact that the path reaches a final FALSE is remembered and P3 and P4 need not be evaluated. This was then easy to represent in bytecode form, where a minimal BPF VM can be designed to evaluate these predicates with jumps to TRUE or FALSE targets.
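
To see the difference in the number of predicate evaluations, here is a small Python sketch. The original figures are not reproduced here, so I am assuming the filter is (P1 OR P2) AND P3 AND P4, with P1 = "is IP", P2 = "is ARP", P3 = "source address matches" and P4 = "length below 0x400"; the packet is a plain dictionary and all names are made up for illustration:

```python
def make_counting_predicates(pkt):
    """Return the predicates P1..P4 for pkt plus a list logging each evaluation."""
    evaluated = []

    def pred(name, test):
        def run():
            evaluated.append(name)
            return test(pkt)
        return run

    preds = {
        "P1": pred("P1", lambda p: p["ethertype"] == 0x0800),  # IP
        "P2": pred("P2", lambda p: p["ethertype"] == 0x0806),  # ARP
        "P3": pred("P3", lambda p: p["src"] == "10.0.0.1"),
        "P4": pred("P4", lambda p: p["len"] < 0x400),
    }
    return preds, evaluated


# Expression tree: every leaf is evaluated, then the operators are applied
TREE = ("and", ("and", ("or", "P1", "P2"), "P3"), "P4")


def walk_tree(node, preds):
    if isinstance(node, str):
        return preds[node]()
    op, left, right = node
    lhs = walk_tree(left, preds)
    rhs = walk_tree(right, preds)  # evaluated even if lhs already decides the result
    return (lhs or rhs) if op == "or" else (lhs and rhs)


# CFG: node -> (next node if TRUE, next node if FALSE); booleans are the exits
CFG = {
    "P1": ("P3", "P2"),
    "P2": ("P3", False),
    "P3": ("P4", False),
    "P4": (True, False),
}


def walk_cfg(preds, node="P1"):
    while not isinstance(node, bool):
        on_true, on_false = CFG[node]
        node = on_true if preds[node]() else on_false
    return node
```

For a packet that is neither IP nor ARP, `walk_tree` touches all four predicates while `walk_cfg` stops after P1 and P2, which is exactly the saving described above.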

The BPF Machine

A pseudo-instruction representation of the same filter described above for earlier versions of BPF in Linux kernel can be shown as,

l0:	ldh [12]
l1:	jeq #0x800, l3, l2
l2:	jeq #0x806, l3, l8
l3:	ld [26]
l4:	jeq #SRC, l5, l8
l5:	ld len
l6:	jlt #0x400, l7, l8
l7:	ret #0xffff
l8:	ret #0

To know how to read these BPF instructions, look at the filter documentation in the kernel source and see what each line does. Each of these instructions is actually just a bytecode which the BPF machine interprets. Like all real machines, this requires a definition of what the VM internals look like. In the Linux kernel’s version of the BPF based in-kernel filtering technique, there were initially just 2 important registers, A and X, with another 16-register ‘scratch space’ M[0-15]. The instruction format and some sample instructions for this earlier version of BPF are shown below:

/* Instruction format: { OP, JT, JF, K }
 * OP: opcode, 16 bit
 * JT: Jump target for TRUE
 * JF: Jump target for FALSE
 * K: 32 bit constant
 */

/* Sample instructions*/
{ 0x28,  0,  0, 0x0000000c },     /* 0x28 is opcode for ldh */
{ 0x15,  1,  0, 0x00000800 },     /* jump next to next instr if A = 0x800 */
{ 0x15,  0,  5, 0x00000806 },     /* jump to FALSE (offset 5) if A != 0x806 */
..

There were some radical changes done to the BPF infrastructure recently – extensions to its instruction set and registers, the addition of things like BPF maps etc. We shall discuss those changes in detail, probably in the next post in this series. For now we’ll just see the good ol’ way of how BPF worked.

Interpreter

Each of the instructions seen above is represented as an array of these 4 values and each program is an array of such instructions. The program first goes through a verifier for a sanity check, to make sure the filter code is secure and would not cause harm. It then passes through a dispatch routine, where the BPF interpreter sees each opcode and performs the operations on the registers or data accordingly. As an example, here is a small snippet from the BPF instruction dispatch for the instruction ‘add’ before it was restructured in Linux kernel v3.15 onwards,

127         u32 A = 0;                      /* Accumulator */
128         u32 X = 0;                      /* Index Register */
129         u32 mem[BPF_MEMWORDS];          /* Scratch Memory Store */
130         u32 tmp;
131         int k;
132
133         /*
134          * Process array of filter instructions.
135          */
136         for (;; fentry++) {
137 #if defined(CONFIG_X86_32)
138 #define K (fentry->k)
139 #else
140                 const u32 K = fentry->k;
141 #endif
142 
143                 switch (fentry->code) {
144                 case BPF_S_ALU_ADD_X:
145                         A += X;
146                         continue;
147                 case BPF_S_ALU_ADD_K:
148                         A += K;
149                         continue;
150 ..

The above snippet is taken from net/core/filter.c in Linux kernel v3.14. Here, fentry points to a sock_filter structure and the filter is applied to the sk_buff data element. The dispatch loop (136) runs till all the instructions are exhausted. The dispatch is basically a huge switch-case, with each opcode being tested (143) and the necessary action being taken. For example, here an ‘add’ operation on registers would add A+X and store it in A. Yes, this is simple, isn’t it? Let us take it a level above.
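
To get a feel for the whole loop, here is a toy version of such a dispatch routine in Python. It is only a sketch and not the kernel's code: it handles just the handful of opcodes needed for the filter shown earlier, it uses 0x0806 (the ARP EtherType) for the second comparison, and it expresses the length test with jge and swapped branch targets. The SRC address, the helper names and the test packet layout are my own assumptions:

```python
import struct

# Classic BPF opcode values for the few instructions used below
LD_W_ABS, LD_H_ABS, LD_W_LEN = 0x20, 0x28, 0x80
ALU_ADD_K, ALU_ADD_X = 0x04, 0x0C
JMP_JEQ_K, JMP_JGE_K = 0x15, 0x35
RET_K = 0x06


def run_filter(prog, pkt):
    """Interpret a list of (code, jt, jf, k) tuples against the bytes in pkt."""
    A, X = 0, 0  # accumulator and index register
    pc = 0
    while pc < len(prog):
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == LD_H_ABS:
            if k + 2 > len(pkt):
                return 0  # load past the end of the packet rejects it
            A = struct.unpack_from(">H", pkt, k)[0]
        elif code == LD_W_ABS:
            if k + 4 > len(pkt):
                return 0
            A = struct.unpack_from(">I", pkt, k)[0]
        elif code == LD_W_LEN:
            A = len(pkt)
        elif code == ALU_ADD_K:
            A = (A + k) & 0xFFFFFFFF
        elif code == ALU_ADD_X:
            A = (A + X) & 0xFFFFFFFF
        elif code == JMP_JEQ_K:
            pc += jt if A == k else jf  # offsets are relative to the next instruction
        elif code == JMP_JGE_K:
            pc += jt if A >= k else jf
        elif code == RET_K:
            return k
        else:
            raise ValueError("unsupported opcode 0x%02x" % code)
    return 0


SRC = 0xC0A80001  # 192.168.0.1, an arbitrary example address

FILTER = [
    (LD_H_ABS, 0, 0, 12),        # l0: ldh [12]
    (JMP_JEQ_K, 1, 0, 0x0800),   # l1: IP? skip to l3
    (JMP_JEQ_K, 0, 5, 0x0806),   # l2: ARP? else jump to l8
    (LD_W_ABS, 0, 0, 26),        # l3: ld [26]
    (JMP_JEQ_K, 0, 3, SRC),      # l4: source matches? else jump to l8
    (LD_W_LEN, 0, 0, 0),         # l5: ld len
    (JMP_JGE_K, 1, 0, 0x400),    # l6: len >= 0x400? jump to l8
    (RET_K, 0, 0, 0xFFFF),       # l7: accept
    (RET_K, 0, 0, 0),            # l8: reject
]


def make_packet(ethertype, src, total_len=64):
    """Build a zero-filled frame with only the fields the filter looks at."""
    pkt = bytearray(total_len)
    struct.pack_into(">H", pkt, 12, ethertype)
    struct.pack_into(">I", pkt, 26, src)
    return bytes(pkt)


print(hex(run_filter(FILTER, make_packet(0x0800, SRC))))
```

The if/elif chain plays the role of the kernel's switch statement, and the jump offsets are added to a program counter that already points at the next instruction, as in classic BPF.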

JIT Compilation

This is nothing new. JIT compilation of bytecodes has been there for a long time. I think it is one of those eventual steps taken once an interpreted language decides to look for optimizing bytecode execution speed. Interpreter dispatches can be a bit costly once the size of the filter/code and the execution time increases. With high frequency packet filtering, we need to save as much time as possible and a good way is to convert the bytecode to native machine code by Just-In-Time compiling it and then executing the native code from the code cache. For BPF, JIT was discussed first in the BPF+ research paper by Begel et al. in 1999. Along with other optimizations (redundant predicate elimination, peephole optimizations etc.), a JIT assembler for BPF bytecodes was also discussed. They showed improvements from 3.5x to 9x in certain cases. I quickly started looking to see if the Linux kernel had done something similar. And behold, here is what the JIT looks like for the ‘add’ instruction we discussed before (Linux kernel v3.14),

switch (filter[i].code) {
case BPF_S_ALU_ADD_X: /* A += X; */
        seen |= SEEN_XREG;
        EMIT2(0x01, 0xd8);              /* add %ebx,%eax */
        break;
case BPF_S_ALU_ADD_K: /* A += K; */
        if (!K)
                break;
        if (is_imm8(K))
                EMIT3(0x83, 0xc0, K);   /* add imm8,%eax */
        else
                EMIT1_off32(0x05, K);   /* add imm32,%eax */
        break;

As seen above in arch/x86/net/bpf_jit_comp.c for v3.14, instead of performing operations during the code dispatch directly, the JIT compiler emits the native code to a memory area and keeps it ready for execution. The JITed filter image is built like a function call, so we add some prologue and epilogue to it as well,

/* JIT image prologue */
EMIT4(0x55, 0x48, 0x89, 0xe5); /* push %rbp; mov %rsp,%rbp */
EMIT4(0x48, 0x83, 0xec, 96);   /* subq  $96,%rsp */

There are rules to BPF (such as no loops etc.) which the verifier checks before the image is built, as we are now in the dangerous waters of executing external machine code inside the Linux kernel. In those days, all this would have been done by bpf_jit_compile which, upon completion, would point the filter function to the filter image,

fp->bpf_func = (void *)image;

Smooooooth… Upon execution of the filter function, instead of interpreting, the filter will now start executing the native code. Even though things have changed a bit recently, this has indeed been a fun way to learn how interpreters and JIT compilers work in general and the kind of optimizations that can be done. In the next part of this post series, I will look into what changes have been done recently, the restructuring and extension efforts to BPF and its evolution to eBPF along with BPF maps and the very recent and ongoing efforts in hist-triggers. I will discuss my experimental userspace eBPF library, its use for LTTng’s UST event filtering and how it compares to LTTng’s bytecode interpreter. Brendan’s blog post is highly recommended and so are the links in ‘More Reading’ in that post.
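
To make the emit step from the JIT snippet a bit more tangible, here is a Python sketch that produces the same x86 byte sequences for the two ‘add’ cases. It only builds the bytes and never executes them; the function names are mine and merely mirror the kernel's EMIT macros:

```python
import struct


def to_s32(k):
    """Interpret a 32-bit constant as a signed value, as the kernel's is_imm8() does."""
    k &= 0xFFFFFFFF
    return k - (1 << 32) if k & 0x80000000 else k


def is_imm8(k):
    return -128 <= to_s32(k) <= 127


def emit_add_x():
    """A += X  ->  add %ebx,%eax"""
    return bytes([0x01, 0xD8])


def emit_add_k(k):
    """A += K  ->  nothing for K == 0, short form for small K, else add imm32,%eax"""
    if k == 0:
        return b""
    if is_imm8(k):
        return bytes([0x83, 0xC0, k & 0xFF])
    return bytes([0x05]) + struct.pack("<I", k & 0xFFFFFFFF)


image = emit_add_x() + emit_add_k(1) + emit_add_k(0x1000)
print(image.hex())
```

The point of the two encodings is size: a constant that fits in a signed byte costs 3 bytes of machine code instead of 5, and adding zero costs nothing at all.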

Thanks to Alexei Starovoitov, Eric Dumazet and all the other kernel contributors to BPF that I may have missed. They are doing awesome work and are the direct source for my learnings as well. It seems, looking at the versatility of eBPF, its adoption in newer tools like shark, and Brendan’s views and first experiments, that this may indeed be the next big thing in tracing.

Embedded, Kernel, Linux

Jumping the Kernel-Userspace Boundary – Procfs and Ioctl

I recently needed a very fast and scalable way to share moderate chunks of data between my experimental kernel module and a userspace application. Of course, there are many ways already available. Some of them are documented very nicely here. Over a few blog posts, I will be sharing the mechanisms I have used to transfer data and provide such interfaces.

Procfs

I have used Procfs before (with the seq_file API) when I needed to read my experimental results back in userspace and perform aggregation and further analysis there. It usually consisted of a stream of data which I sent to my /proc/foo file. From a userspace perspective, it is essentially a trivial read-only operation in my case,

/* init stuff */
static struct proc_dir_entry *proc_entry;

/* Use seq_printf to provide access to some value from the module.
 * val is assumed to be an int here; adjust the format string otherwise. */
static int foo_print(struct seq_file *m, void *v) {
    seq_printf(m, "%d\n", val);
    return 0;
}

static int foo_open(struct inode *inode, struct file *file) {
    return single_open(file, foo_print, NULL);
}

/* The operations (kernels from 5.6 onwards take a struct proc_ops instead) */
static const struct file_operations foo_fops = {
    .owner = THIS_MODULE,
    .open = foo_open,
    .read = seq_read,
    .llseek = seq_lseek,
    .release = single_release,
};

/* Create procfs entry in module init */
proc_entry = proc_create("foo", 0, NULL, &foo_fops);

/* Remove procfs entry in module exit */
remove_proc_entry("foo", NULL);
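
On the userspace side, reading the entry is ordinary file I/O. Here is a minimal sketch, assuming the module above has created /proc/foo. The helper name read_proc_entry is hypothetical and only used for illustration:

```c
#include <stdio.h>

/*
 * Read the content of a procfs entry (e.g. "/proc/foo") into buf.
 * Returns the number of bytes read, or -1 if the file could not be
 * opened or the buffer has no room at all.
 * read_proc_entry is a hypothetical helper, not part of any API.
 */
static long read_proc_entry(const char *path, char *buf, size_t len)
{
    FILE *f;
    size_t n;

    if (len == 0)
        return -1;

    f = fopen(path, "r");
    if (!f)
        return -1;

    /* seq_file output is plain text; leave room for the terminator */
    n = fread(buf, 1, len - 1, f);
    buf[n] = '\0';
    fclose(f);
    return (long)n;
}
```

Calling read_proc_entry("/proc/foo", buf, sizeof(buf)) and printing buf would roughly be the equivalent of a cat /proc/foo.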

Ioctl

I have also used ioctls before (more importantly, I call them eye-awk-till. *grins*). They are used in situations where the interaction between your userspace application and the module resembles actual commands on which the kernel has to act. With each command, the userspace can send a message containing some data which the module can use to take actions. As an example, consider a device driver for a device which measures temperature from 2 sensors in a cold room. The driver can provide certain commands which are executed when the userspace makes ioctls. Each command is associated with a number, called the ioctl number, which the device developer chooses. In a similar fashion to the Procfs interface, a file_operations struct can be defined with a new entry and the initializations are done in the module,

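/* The ioctl (unlocked_ioctl handlers return long) */
static long temp_ioctl(struct file *filep, unsigned int cmd, unsigned long arg) {
	switch (cmd) {
	case READ_TEMP_SENSOR_1:
		/* copy_to_user() returns the number of bytes it could NOT copy */
		if (copy_to_user((void __user *)arg, temp_buff, 8))
			return -EFAULT;
		break;
	case READ_TEMP_SENSOR_2:
		if (copy_to_user((void __user *)arg, temp_buff, 8))
			return -EFAULT;
		break;
	default:
		return -ENOTTY;	/* unknown command */
	}
	return 0;
}
/* File operations */
static const struct file_operations temp_fops = {
	.owner = THIS_MODULE,
	.unlocked_ioctl = temp_ioctl,
};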

There are other complexities involved as well, such as using the _IO() and _IOR() macros to define safe ioctl numbers. To know more about the ioctl() call and how it is used, I suggest you read Chapter 7 from LKMPG. Note that newer kernels have some minor changes in code, hence refer to some device drivers using ioctls in the latest kernel releases. Each ioctl in our case means we have to copy data from user to kernel or from kernel to user using the copy_from_user() or copy_to_user() functions. There is also no way to avoid the context switch. For small readings done occasionally, this is an OK mechanism I would say. Consider that in a parallel universe, this sensor system aggregates temperature as well as a high quality thermal image in addition to each measurement. Also, there are thousands of such sensors spread across a lego factory and they are being read each second from a common terminal. For such huge chunks of data accessed very frequently, each additional copy is a performance penalty. For such scenarios, I used the mmap() functionality to share a part of memory between the kernel and userspace. I shall discuss more about mmap in my next post.
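
As a rough sketch of the _IO()/_IOR() idea, the two sensor commands could be defined as below in a header shared by the module and the application. The magic character 't', the sequence numbers, the struct temp_reading payload and the read_temp helper are all assumptions made for illustration; they are not taken from a real driver.

```c
#include <linux/ioctl.h>   /* _IOR() and the _IOC_*() decoding macros */
#include <sys/ioctl.h>     /* ioctl() */

/* Hypothetical magic number identifying this driver's commands */
#define TEMP_IOC_MAGIC 't'

/* Payload the kernel copies back to userspace (8 bytes, like temp_buff) */
struct temp_reading {
    char data[8];
};

/* _IOR: userspace *reads* a struct temp_reading from the driver */
#define READ_TEMP_SENSOR_1 _IOR(TEMP_IOC_MAGIC, 1, struct temp_reading)
#define READ_TEMP_SENSOR_2 _IOR(TEMP_IOC_MAGIC, 2, struct temp_reading)

/*
 * Userspace side: ask the driver behind fd for one reading.
 * Returns 0 on success and -1 on failure (errno is set by ioctl()).
 */
static int read_temp(int fd, unsigned long cmd, struct temp_reading *out)
{
    return ioctl(fd, cmd, out) < 0 ? -1 : 0;
}
```

The application would open the device node (say, a hypothetical /dev/temp) and call read_temp(fd, READ_TEMP_SENSOR_1, &r). Since the direction, magic and payload size are encoded in the number, the driver can sanity-check a command with _IOC_TYPE() and _IOC_SIZE() before acting on it.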

Linux, UX/UI

Ask Fedora UX Redesign Updates #2

So taking into consideration most of the suggestions I had obtained during the recent Design FAD at Red Hat, the initial slides of desktop version are also ready. Maybe these are enough to start off with a basic CSS template which we can build upon iteratively while we work on more mockups for other major pages in parallel. Here is the updated mockup:

[Image: home-desktop mockup]

Just a reminder that the color palette used here is not standard and is only indicative. The standard Fedora color palette will be used in the code. The next mockups to be done are the views for individual question pages, the contributor/people page, the ‘ask question’ page and maybe the registration page if required. Most of the other stuff, I think, would be best done directly in the code. I was anticipating a bigger challenge in streamlining the design but personally I think that, as Askbot apes the UX of StackOverflow and similar Q&A websites to a large extent, the users are already familiar with the flow of AskFedora. In any case, later on we can have something like this : http://askubuntu.com/tour – an Ask Fedora Tour page for beginners/new users so that the initial user-inertia reduces.

Contributions and comments welcome.

India, Linux, UX/UI

Sad State of Sarkari Sites

I still remember the day when I first accessed the internet in India. It was a hot Delhi summer day in August 1995 and I was in my dad’s lab. VSNL had rolled out internet first in government research labs, I assume. My dad had specially brought me to the office on a Saturday to show me this wonder. I fired up Netscape Navigator (many of you may remember it as the precursor to Mozilla’s web technology) and an IT assistant told me to type http://www.yahoo.com in the browser. Some lights blinked and approximately 10 seconds later some images appeared from a far off land with the help of some tired electrons that made this journey to my screen. I typed “Australlia” – yes, with the typo – in the search bar and half an hour later I knew a lot about Aborigines, Uluru, their art, culture, destinations to visit and such. I never had an encyclopedia at home and this was magic for my tiny brain. Soon, the Indian government started making its web presence known and developed websites for its departments, ministries etc. They called them portals (not the one which Valve makes). It was all amazing. Such a giant leap for us.

Except for the fact that those portals with their ancient technologies have pretty much remained the same till now. Duh.

To give you a glimpse, here is the website of the Indian Meteorological Department (IMD). I urge you to go to http://imd.gov.in and explore the ancient wisdom of golden era engineers in their full glory.

Screenshot from 2014-07-26 19:27:07

IMD website. Such beauty, much wow!

Here are some more screenshots of famous websites that people visit :

Screenshot from 2014-07-26 19:30:16

Indian Railways website – Best viewed on Internet Explorer 1024×768. Wow!

Screenshot from 2014-07-26 19:31:00

CBSE website – Teaching future generations HTML 1.0 tech

Screenshot from 2014-07-26 19:43:57

University of Mumbai – A flood of NEW

Screenshot from 2014-07-27 13:42:01

MTNL Delhi – We provide 4G services and online bill payment – but have the worst website on the planet

Screenshot from 2014-07-27 13:42:12

Ministry of RT and Highways – I’ll let you know who built all this stuff

Oh yes, that is a marquee and those tiny “NEW” things are so fabulous. It’s like web fashion from the 1990s that has never changed since 1996. Wait, don’t go. Here is a list of websites I urge you to try :

http://mtnldelhi.in/

http://www.indianrail.gov.in/

http://cbse.nic.in/welcome.htm

https://negp.gov.in/index.php   <– security certificate expired

http://www.mu.ac.in/

http://www.isro.org/

EDIT:

http://www.nift.ac.in/ <– This is a design school. I know, borderline technical, but still, it’s a design school!

There are numerous others with suffixes such as .gov.in or .ac.in, run by the Government of India in some form or other but absolutely unusable by the general public. I don’t think babus ever visit the websites, and I wonder if anyone actually cares about them from a UX/UI perspective.

I have categorized some observations and here goes the actual rant..

The “New” GIF Syndrome

You may start crying that there is no uniformity in .gov.in domains or any government website. Well, you are wrong. Most of them suffer from this syndrome which makes the UI puke-worthy. It’s the “New” gif. A remnant from the dinosaur era, these small gifs were created by web enthusiasts when gif was cool (the first time). It was an amazing feature of that era (no jokes) but it’s (really)^infinity annoying in the age of HTML5 and CSS. All government websites have made it a point to annoy people while making announcements by putting up these tiny devilish GIFs. People may see it as just a blinking “New”. But here is what it actually looks like to me :

[Image: animated GIF]

The Marquee Dance

Remember the <marquee> tag? Oh, this wonderful tool to have fun with in the 90s. There is no shortage of this piece of archaic technology in these websites too. I think it’s the worst tag ever designed. I don’t know how others see it, but it’s just so annoying to wait for 2 minutes for the relevant announcement to scroll like a snail across the screen and then try to catch it with the mouse, like a cat chasing a mouse.

Those “Notice” PDF Links

Many times on such websites, some link right on the home page (which, for example, says “Bus Schedule”) will point you to a shabbily scanned, incomprehensible 10 page document of a schedule with all the official signatures and rubber stamps still intact. There would even be a chai stain or staple-pin mark somewhere if you looked closely. I mean seriously guys? Make a schedule in Excel -> print it -> get it approved by 3 people -> page crumbles -> scan it -> convert to pdf using a shitty free PDF converter -> put it up as a hyperlink on a govt website?? Even Jackie Chan’s logic would get screwed here. This is the worst form of information delivery. But at least it’s better than no information. When will the process get streamlined?

The Babu photu effect

“Madame, ismyle pleej!”. Many websites also have images of ministers right in the headers. Refer to the images above. This is not so common, but not so uncommon either. Remember the fiasco over the @PMOIndia twitter handle when it was gonna be archived? Well, ministers will not take any chances here.

The tiny multi-coloured serif fonts

This is another annoying thing. Minuscule serif fonts scattered all over in red, green and blue – creating an information overload. Along with the marquee and the NEW GIF (my soul cracks once each time I imagine that) the effect is amplified 100 times. The person just gets lost. Into the bottom of the ocean. Sometimes I feel they design the websites like this knowingly – just to hide some incompetency they may have.

The low-res myopic vision

So many aspiring photographers with their flashy DSLRs on Facebook and not a single one clicking professional photos for Government websites? It seems they have to do their best with 340×480 images, blow them up and plaster them on their websites 😦 Look at the glory of those pixels. Such beauty in pixelated images. (/me weeps again). And those horrible logos. Probably done in MS Paint 20 years ago and stretched using the <img> tag all these years, again and again and again, with the wrong aspect ratio. Even if there is a logo change, it’s still done in MS Paint I think. Such dedication you see.

Unquestionably, India is a global IT leader. They do amazing things. There is no dearth of good web engineers and designers for sure. The websites of TCS, Wipro, Infosys are all well done. All private entities have wonderful websites – Airtel, Tata, Birla, ICICI, even PSUs like ONGC etc. BUT WHY NOT GOVT WEBSITES? Why not even websites of central or regional academic institutions? Just look at the website of any government engineering college in any state. The things which the general public needs most are in the worst state. People would argue that internet penetration is not good and hardly anyone uses it. Well, you are wrong folks. Don’t judge a book by its cover. Just see the amount of bookings done on the IRCTC website each day. The indigenous technology which CRIS has put up is commendable. Thousands of students use CBSE websites to search for study material and see results. Remarkable too is the work done by organizations such as ISRO – they may not have launched a beautiful and informative website but they have launched a spacecraft to Mars 🙂 If you dig deeper in the IMD website, there are hourly images from the Kalpana satellite and quite accurate doppler radar maps of weather in major cities. It’s just that nobody knows about it or ever uses it because it’s hidden beneath layers of obscurity and bad UI. This is the department that prevented a recent catastrophe in AP. We have engineers who can do this : http://railradar.railyatri.in/ A flightradar like app for Indian Railways (I don’t think many railways in the world – except DB – provide such APIs to get moderately accurate running info on the web). But we don’t have engineers to design an efficient Indian Railways website? There are 3 separate websites related to Indian Railways – all confusing. I cry silently in the nights seeing these websites, tears rolling while booking railway tickets, looking at election results, so on and so forth. The pain is unbearable.
Even if I want to help, maybe I won’t be allowed to, as I am not in a sarkari job like the awesome guys there. They are very good and highly apt – but in UI/UX technologies of the 90s, I think.

All this does not absolve the government of its leniency and of providing arguably the worst user experience on the web. There is no uniformity in the government websites and no way to actually make head or tail of anything that goes on inside. National Informatics Center (NIC), which I presume handles these major websites, has at least redone its website (http://www.nic.in) so it’s not an eyesore anymore. (Thanks a lot guys) Same is the case with the recently launched Mygov website (http://mygov.nic.in/), http://www.data.gov.in and the Department of Electronics and IT website. I think they even tried to organize a hackathon! Good job. I just hope all the Govt websites start getting a facelift like these ones soon. I don’t know if it’s gonna be effective or not.

I showed the miserable websites to my friends here and we all enjoyed a few laughs. Then I saw beautiful govt websites of many countries. How can I explain to them that there is a high chance that some of these websites may have been designed by Indians, but the Indian Government itself is incompetent at utilizing its own engineers to give better service to its citizens? How and what kind of people do they hire in institutions like NIC? Is an entrance exam enough to show aptitude in design and development? Is there some in-house training being delivered to engineers to bring them up to date? Foreign investors surely visit these websites to see infrastructure support. Imagine what they will think when they visit the Ministry of Highways website and see the badly “Photoshopped” fake looking images. Sigh.

Lastly, an open request to any government IT officials and policy makers who may have reached here while randomly surfing the web. Contact me folks, I will do designs for free. Trust me. For FREE! I will take out time from my research, sit in a foreign land and do it for free, just for you India – only if you keep your egos aside and invite the open world to help you. Start a major UX/UI redesign project for web and open source it. You will be surprised how many talented young minds will help you out.

Over and out. Spread the word.

EDIT :

I’ll just record some exceptions to the above here :

http://gectcr.ac.in/ <– This should be the example for all Govt academic institutions. They took the initiative. Kudos!

Government has a web design policy and guidelines doc. Saved for reading later : http://web.guidelines.gov.in/
