From my research and various attempts so far, I'm fairly sure the only solution to this problem is to use assembly. I'm posting this question to document an existing problem, perhaps get the attention of compiler developers, and help anyone searching for similar problems find it.
If anything changes in the future, I will accept it as an answer.
Here is a closely related question for MSVC.
On x86-64 machines, div/idiv is faster with a 32-bit operand than with a 64-bit operand. When the dividend is 64-bit, the divisor is 32-bit, and you know the quotient fits in 32 bits, you don't have to use the 64-bit div/idiv. div r/m32 divides the 64-bit value in edx:eax by its operand, so you can split the 64-bit dividend across those two 32-bit registers; even with that overhead, the 32-bit div is faster than a 64-bit div on a full 64-bit register.
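For reference (my summary of the precondition, not part of the question's original code): the 32-bit form is safe exactly when the upper half of the dividend is smaller than the divisor, since floor(a/b) < 2^32 iff a < b * 2^32. A minimal sketch, with a helper name of my own choosing:

#include <stdint.h>

/* Sketch: a / b fits in 32 bits exactly when a < b * 2^32,
   i.e. when the dividend's upper half is below the divisor. */
static inline int div32_is_safe(uint64_t a, uint32_t b) {
    return (uint32_t)(a >> 32) < b;
}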
The compiler produces a 64-bit div for this function, and that is correct: with a 32-bit div, if the quotient of the division does not fit in 32 bits, a hardware exception (#DE) is raised.
uint32_t div_c(uint64_t a, uint32_t b) {
    return a / b;
}
However, if the quotient is known to fit in 32 bits, the full 64-bit division is unnecessary. I used __builtin_unreachable to give the compiler this information, but it makes no difference.
uint32_t div_c_ur(uint64_t a, uint32_t b) {
    uint64_t q = a / b;
    if (q >= 1ull << 32) __builtin_unreachable();
    return q;
}
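For completeness, on clang the same assumption can also be spelled with __builtin_assume; this variant is just my restatement of the same hint, with no guarantee it fares any better:

uint32_t div_c_assume(uint64_t a, uint32_t b) {
    uint64_t q = a / b;
    __builtin_assume(q < (1ull << 32));  /* clang-only; same hint as above */
    return q;
}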
For both div_c and div_c_ur, the output from gcc is:
        mov     rax, rdi
        mov     esi, esi
        xor     edx, edx
        div     rsi
        ret
clang does an interesting optimization: it checks whether the upper 32 bits of the dividend are zero and uses a 32-bit div when they are. But when the dividend doesn't fit in 32 bits, it still falls back to a 64-bit div, even if the quotient would fit.
        mov     rax, rdi
        mov     ecx, esi
        mov     rdx, rdi
        shr     rdx, 32
        je      .LBB0_1
        xor     edx, edx
        div     rcx
        ret
.LBB0_1:
        xor     edx, edx
        div     ecx
        ret
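In C, that codegen corresponds to something like the following (my hand-written equivalent, not clang's actual source transformation):

uint32_t div_clang_like(uint64_t a, uint32_t b) {
    if ((a >> 32) == 0)          /* dividend fits in 32 bits */
        return (uint32_t)a / b;  /* compiles to a 32-bit div */
    return a / b;                /* falls back to a 64-bit div */
}

Note that the branch tests whether the whole dividend fits in 32 bits, which is stricter than the condition that actually matters: the quotient fitting in 32 bits.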
I had to write it directly in assembly to achieve what I want. I couldn't find any other way to do this.
__attribute__((naked, sysv_abi))
uint32_t div_asm(uint64_t, uint32_t) {__asm__(
    "mov eax, edi\n\t"  // low half of the dividend -> eax
    "mov rdx, rdi\n\t"
    "shr rdx, 32\n\t"   // high half of the dividend -> edx
    "div esi\n\t"       // divide edx:eax by the 32-bit divisor
    "ret\n\t"           // quotient is already in eax
);}
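For what it's worth, the same instruction sequence can also be written as GNU extended asm instead of a naked function. This is my own sketch (AT&T syntax, so it doesn't depend on -masm=intel), and it has the advantage that the compiler can inline it; the same #DE caveat applies if the quotient doesn't fit in 32 bits:

uint32_t div_asm_inline(uint64_t a, uint32_t b) {
    uint32_t q, r;
    __asm__("divl %4"                   /* divide edx:eax by b         */
            : "=a"(q), "=d"(r)          /* quotient -> eax, rem -> edx */
            : "0"((uint32_t)a),         /* low half of a in eax        */
              "1"((uint32_t)(a >> 32)), /* high half of a in edx       */
              "rm"(b));
    return q;
}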
Was it worth it? perf reports 49.47% overhead for div_c versus 24.88% for div_asm, so on my computer (Tiger Lake), div r32 is about twice as fast as div r64.
This is the benchmark code. (Since the inline assembly uses Intel syntax, it needs to be built with -masm=intel on gcc.)
#include <stdint.h>
#include <stdio.h>

__attribute__((noinline))
uint32_t div_c(uint64_t a, uint32_t b) {
    uint64_t q = a / b;
    if (q >= 1ull << 32) __builtin_unreachable();
    return q;
}

__attribute__((noinline, naked, sysv_abi))
uint32_t div_asm(uint64_t, uint32_t) {__asm__(
    "mov eax, edi\n\t"
    "mov rdx, rdi\n\t"
    "shr rdx, 32\n\t"
    "div esi\n\t"
    "ret\n\t"
);}

static uint64_t rdtscp() {
    uint32_t _;
    return __builtin_ia32_rdtscp(&_);
}

int main() {
#define n 500000000ll
    uint64_t c;

    c = rdtscp();
    for (int i = 1; i <= n; ++i) {
        volatile uint32_t _ = div_c(i + n * n, i + n);
    }
    printf("  c%15llu\n", (unsigned long long)(rdtscp() - c));

    c = rdtscp();
    for (int i = 1; i <= n; ++i) {
        volatile uint32_t _ = div_asm(i + n * n, i + n);
    }
    printf("asm%15llu\n", (unsigned long long)(rdtscp() - c));
}