I am trying to reproduce the Intel white paper "How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures". This white paper provides a kernel module for accurately measuring the execution time of a piece of code by disabling preemption, serializing with CPUID, reading the timestamp counter with RDTSC/RDTSCP, and so on.
However, when running the benchmark code I cannot get the low variance reported in the white paper, so either I am doing something wrong or the technique no longer works as described. I cannot figure out what is wrong.
The core of the kernel module is just a couple of lines:
unsigned long flags;   /* raw_local_irq_save() expects unsigned long, not unsigned int */

preempt_disable();            /* keep the scheduler from preempting or migrating us */
raw_local_irq_save(flags);    /* mask hard interrupts on this CPU */
asm volatile(
    "CPUID\n\t"               /* serialize: no earlier instruction may pass */
    "RDTSC\n\t"               /* read TSC: high 32 bits in EDX, low 32 bits in EAX */
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    : "=r"(cycles_high), "=r"(cycles_low)
    :: "%rax", "%rbx", "%rcx", "%rdx");

/* call the function to measure here */

asm volatile(
    "RDTSCP\n\t"              /* read TSC only after all prior instructions complete */
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    "CPUID\n\t"               /* serialize: keep later instructions from moving up */
    : "=r"(cycles_high1), "=r"(cycles_low1)
    :: "%rax", "%rbx", "%rcx", "%rdx");
raw_local_irq_restore(flags);
preempt_enable();
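For reference, this is roughly how the two 32-bit halves are stitched back into 64-bit timestamps once the two asm blocks have run (a minimal sketch; the names start, end, and elapsed are mine, not the white paper's):

u32 cycles_high, cycles_low, cycles_high1, cycles_low1;
u64 start, end, elapsed;

/* ... the two asm blocks above run here ... */

/* RDTSC/RDTSCP leave the high 32 bits of the TSC in EDX and the low 32 bits in EAX */
start = ((u64)cycles_high << 32) | cycles_low;
end   = ((u64)cycles_high1 << 32) | cycles_low1;

if (end >= start)
        elapsed = end - start;   /* elapsed cycles for this one sample */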
The code is copied directly from the white paper, with the suggested optimizations applied. Per the white paper, the expected output is:
loop_size:995 >>>> variance(cycles): 0; max_deviation: 0 ;min time: 2216
loop_size:996 >>>> variance(cycles): 28; max_deviation: 4 ;min time: 2216
loop_size:997 >>>> variance(cycles): 0; max_deviation: 112 ;min time: 2216
loop_size:998 >>>> variance(cycles): 28; max_deviation: 116 ;min time: 2220
loop_size:999 >>>> variance(cycles): 0; max_deviation: 0 ;min time: 2224
total number of spurious min values = 0
total variance = 1
absolute max deviation = 220
variance of variances = 2
variance of minimum values = 335757
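For anyone who hasn't read the white paper: under my reading, each loop_size line summarizes an ensemble of repeated measurements of the same code, with per-ensemble statistics computed along these lines (a simplified sketch in plain C; SIZE_OF_STAT mirrors the white paper's naming but the value shown is illustrative, and the helper names are mine):

#include <stdint.h>

#define SIZE_OF_STAT 100000   /* samples per ensemble (illustrative value) */

/* sample variance over one ensemble of cycle counts */
static uint64_t var_calc(const uint64_t *t, int n)
{
        uint64_t sum = 0, sum_sq = 0;
        int i;

        for (i = 0; i < n; i++) {
                sum    += t[i];
                sum_sq += t[i] * t[i];
        }
        return (sum_sq - sum * sum / n) / (n - 1);
}

/* the per-ensemble figures printed on each "loop_size" line */
static void ensemble_stats(const uint64_t *t, int n,
                           uint64_t *variance, uint64_t *min_time,
                           uint64_t *max_dev)
{
        uint64_t min = t[0], max = t[0];
        int i;

        for (i = 1; i < n; i++) {
                if (t[i] < min)
                        min = t[i];
                if (t[i] > max)
                        max = t[i];
        }
        *variance = var_calc(t, n);
        *min_time = min;
        *max_dev  = max - min;   /* worst sample relative to the best */
}

The summary figures ("variance of variances", "variance of minimum values") are then var_calc applied across the per-ensemble variances and minima.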
However, what I get is:
[1418048.049032] loop_size:42 >>>> variance(cycles): 104027;max_deviation: 92312 ;min time: 17
[1418048.049222] loop_size:43 >>>> variance(cycles): 18694;max_deviation: 43238 ;min time: 17
[1418048.049413] loop_size:44 >>>> variance(cycles): 1;max_deviation: 60 ;min time: 17
[1418048.049602] loop_size:45 >>>> variance(cycles): 1;max_deviation: 106 ;min time: 17
[1418048.049792] loop_size:46 >>>> variance(cycles): 69198;max_deviation: 83188 ;min time: 17
[1418048.049985] loop_size:47 >>>> variance(cycles): 1;max_deviation: 60 ;min time: 17
[1418048.050179] loop_size:48 >>>> variance(cycles): 1;max_deviation: 61 ;min time: 17
[1418048.050373] loop_size:49 >>>> variance(cycles): 1;max_deviation: 58 ;min time: 17
[1418048.050374] total number of spurious min values = 2
[1418048.050374] total variance = 28714
[1418048.050375] absolute max deviation = 101796
[1418048.050375] variance of variances = 1308070648
The max_deviation and variance(cycles) are much higher than in the white paper.
(Please ignore the different min time: the white paper is presumably benchmarking some real function, whereas my code measures an empty section.)
Is there anything I missed from the white paper? Or is the white paper out of date, and are there additional techniques I should apply on modern x86 CPUs? How can I measure the execution time of a piece of code with the highest possible precision on a modern Intel x86 CPU?
P.S. The code I run is available here.