Soramichi's blog

Some seek complex solutions to simple problems; it is better to find simple solutions to complex problems

Debian 9 uses Kernel 4.9 that Supports PEBS Better

Preface

In the previous post I installed Debian 8 (jessie) into Thinkpad X260, but I actually changed my mind and re-installed Debian 9 (stretch), because it supports the wifi equipped in Thinkpad X260. A good thing is Debian 9 is already freezed so I can expect there're only a few critical bugs remained (well there're actually one to two hundreds of them as of today, but it's relatively a small number given that it has over 40K packages).

One big difference between Debian 9 and 8 is the kernel versions they use (4.9 vs 3.16), and especially the support for Intel PEBS (Precise Event Based Sampling) is way better (or I have to say way more proper) in kernel 4.9. This post explains what PEBS is a bit and how its support gets better if you use kernel 4.9.

Precise Event Based Sampling (PEBS)

PEBS is an extension of the performance counters, which is a mechanism to measure various hardware events such as number of cache misses, number of branch prediction misses, and many many others. If you're not familiar with the performance counters, please refer another site like this.

PEBS can be used from linux perf tool by specifying pp suffix after the counter name, such as:

# specify a counter by the name
$ perf record -e cpu/mem-loads/pp -- workload
# specify a counter by its number
$ perf record -e r20D1:pp -- workload

An advantage of PEBS against the normal performance counters is that, as the name suggests, PEBS is more precise because it's all hardware-based. For example, a result of measuring r20D1 without pp might look like this (the result is rendered by perf annotate): f:id:sorami_chi:20171217205830p:plain

Because r20D1 measures the number of "Retired load instructions missed L3", it can never happen on instructions other than the ones accessing memory addresses. However this result shows that 2.32% of them occured in a sub between two registers, 9.21% in a mov between two registers, etc etc. (An excuse for this is that, for performance analysis in function-level, this accuracy might be enough. Even if some events are drifted by a few instructions, if you look at them in function-level granularity the outcome can be the same.)

For the explanation of each counter, you can refer section 19 of the volume 3 of the super thick manual from Intel. Note that the event number and the umask have to be specified to perf in the reversed order of how they appear in the manual. For example if you measure a counter whose event number is AA and the umask is BB, you have to do perf record -e rBBAA (not rAABB).

Using PEBS by specifying :pp for the same workload gets a result like this: f:id:sorami_chi:20171217205833p:plain

Now you can see that no r20D1 occurs on any instructions without memory accesses.

Another huge advantage of PEBS is it supports retrieving the register values, the instruction pointer, the memory address accessed, and the source of data at the time the instruction triggering the event occurs. However explaining these requires a whole new long post so I just leave it to another manual from Intel.

How PEBS is handled in the kernel

The Linux kernel holds a list of counters that support PEBS, because not all counters support PEBS so the kernel has to know which ones are PEBS-capable. For Skylake and Kabylake, PEBS is supported for the counters which have "PS" or "PSDLA" in the comment column of the manual. For Broadwell or older CPUs the manual says "Supports PEBS" in the comment column for PEBS-capable counters.

This list is defined in arch/x86/kernel/cpu/perf_event_intel_ds.c in kernel 3.16 and arch/x86/events/intel/ds.c in kernel 4.9. The problem is the list in kernel 3.16 at the time Debian 8 was released was not complete. For a concrete example, r20D1 (event number=0xD1, umask=0x20) used in the above example is PEBS-capable, but it is not listed in linux-source-3.16 of Debian 8. (Note that it is listed in the newest version of kernel 3.16 in kernel.org, which means it was fixed at some point after Debian 8 was released.)

In perf_event_intel_ds.c from linux-source-3.16 package of Debian 8, the list is defined as follows:

struct event_constraint intel_hsw_pebs_event_constraints[] = {
        INTEL_UEVENT_CONSTRAINT(0x01c0, 0x2), /* INST_RETIRED.PRECDIST */
        INTEL_PST_HSW_CONSTRAINT(0x01c2, 0xf), /* UOPS_RETIRED.ALL */
        INTEL_UEVENT_CONSTRAINT(0x02c2, 0xf), /* UOPS_RETIRED.RETIRE_SLOTS */
        INTEL_EVENT_CONSTRAINT(0xc4, 0xf),    /* BR_INST_RETIRED.* */
        INTEL_UEVENT_CONSTRAINT(0x01c5, 0xf), /* BR_MISP_RETIRED.CONDITIONAL */
        INTEL_UEVENT_CONSTRAINT(0x04c5, 0xf), /* BR_MISP_RETIRED.ALL_BRANCHES */
        INTEL_UEVENT_CONSTRAINT(0x20c5, 0xf), /* BR_MISP_RETIRED.NEAR_TAKEN */
        INTEL_PLD_CONSTRAINT(0x01cd, 0x8),    /* MEM_TRANS_RETIRED.* */
        /* MEM_UOPS_RETIRED.STLB_MISS_LOADS */
        INTEL_UEVENT_CONSTRAINT(0x11d0, 0xf),
        /* MEM_UOPS_RETIRED.STLB_MISS_STORES */
        INTEL_UEVENT_CONSTRAINT(0x12d0, 0xf),
        INTEL_UEVENT_CONSTRAINT(0x21d0, 0xf), /* MEM_UOPS_RETIRED.LOCK_LOADS */
        INTEL_UEVENT_CONSTRAINT(0x41d0, 0xf), /* MEM_UOPS_RETIRED.SPLIT_LOADS */
        /* MEM_UOPS_RETIRED.SPLIT_STORES */
        INTEL_UEVENT_CONSTRAINT(0x42d0, 0xf),
        INTEL_UEVENT_CONSTRAINT(0x81d0, 0xf), /* MEM_UOPS_RETIRED.ALL_LOADS */
        INTEL_PST_HSW_CONSTRAINT(0x82d0, 0xf), /* MEM_UOPS_RETIRED.ALL_STORES */
        INTEL_UEVENT_CONSTRAINT(0x01d1, 0xf), /* MEM_LOAD_UOPS_RETIRED.L1_HIT */
        INTEL_UEVENT_CONSTRAINT(0x02d1, 0xf), /* MEM_LOAD_UOPS_RETIRED.L2_HIT */
        INTEL_UEVENT_CONSTRAINT(0x04d1, 0xf), /* MEM_LOAD_UOPS_RETIRED.L3_HIT */
        /* MEM_LOAD_UOPS_RETIRED.HIT_LFB */
        INTEL_UEVENT_CONSTRAINT(0x40d1, 0xf),
        /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS */
        INTEL_UEVENT_CONSTRAINT(0x01d2, 0xf),
        /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT */
        INTEL_UEVENT_CONSTRAINT(0x02d2, 0xf),
        /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM */
        INTEL_UEVENT_CONSTRAINT(0x01d3, 0xf),
        INTEL_UEVENT_CONSTRAINT(0x04c8, 0xf), /* HLE_RETIRED.Abort */
        INTEL_UEVENT_CONSTRAINT(0x04c9, 0xf), /* RTM_RETIRED.Abort */

        EVENT_CONSTRAINT_END
};

I don't explain what each INTEL_* macro means, but the point here is the kernel defines a counter rXXYY is PEBS-capable if there's a line like INTEL_SOMETHING_CONSTRAINT(0xXXYY, 0xf).

You can see there are r01D1, r02D1, r04D1, and r40D1, but no r20D1, even though r20D1 is described to be PEBS-capable in Haswell in the intel manual. Note that Haswell was the latest core generation at the time of kernel 3.16 release, and for newer versions of CPUs such as Skylake the linux kernel just treats them as Haswell.

Therefore, if you try to measure r20D1:pp in Debian 8, it yields an error:

$ perf record -e r20D1:pp -- workload
'precise' request may not be supported. Try removing 'p' modifier.

This issue has been already fixed in kernel 4.9. Therefore Debian 9 that uses kernel 4.9 can properly handle r20D1 as PEBS-capable and it allows perf to measure r20D1:pp.

The Linux kernel 4.9 defines the list of PEBS-supported counters in arch/x86/events/intel/ds.c (only the relavant part is extracted):

struct event_constraint intel_hsw_pebs_event_constraints[] = {
       ...
       INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_XLD(0xd1, 0xf),    /* MEM_LOAD_UOPS_RETIRED.* */
       ...
}

This macro specifies that any counters ending with D1 are PEBS-capable.

Summary

If you use special hadware functionalities such as PEBS, I do recommend to upgrade your distro and the kernel. PEBS has existed since Pentium 4, but the supported counters are ever growing and changing (actually r20D1 was the number of micro operations until Broadwell, but it was changed to the number of instructions since Skylake). So you'd better use a near-latest kernel as long as you can to get a proper support, and using the latest distro might be an easy way to go.