For this fragment of code (https://godbolt.org/z/s4PY44dha):
#include <immintrin.h>

int foo(unsigned long long x)
{
    return _lzcnt_u64(x);
}
GCC generates 3 asm instructions:
xorl %eax, %eax
lzcntq %rdi, %rax
ret
while clang generates only 2:
lzcntq %rdi, %rax
retq
Is it possible to change the implementation/signature of foo to help GCC understand that this xor instruction is useless? Why can't GCC perform such a simple optimization itself?
The answer to the question Why does breaking the "output dependency" of LZCNT matter? explains that this xor may be useful on some older architectures to break the so-called "false dependency" on the destination register. It even mentions that the issue it is meant to work around is absent from modern Intel architectures, starting with "Skylake-S (client)". I tried passing newer architectures to GCC (for example -march=rocketlake, -march=icelake-client), but it still inserts the "useless" xor.
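For reference, this is the kind of invocation I mean (the -O2 level is my assumption; the godbolt link shows the exact setup):
gcc -O2 -march=icelake-client -S foo.c   # still emits xorl before lzcntq
gcc -O2 -march=rocketlake -S foo.c       # same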
In contrast, clang doesn't insert the xor even for older architectures like Haswell. This means that if one wants to squeeze every last bit of performance out of a particular architecture, the insertion of the xor has to be controlled manually.
For example, with the following inline assembly I managed to get code without the xor:
int xorless_lzcntq(unsigned long long x) {
    unsigned long long res;
    /* lzcnt also writes FLAGS, so declare the condition-code clobber. */
    asm ("lzcntq %1, %0" : "=r"(res) : "r"(x) : "cc");
    return res;
}
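As a quick sanity check (this main is my own sketch, assuming foo and xorless_lzcntq from above are in the same file), the wrapper agrees with the intrinsic:
#include <stdio.h>

int main(void)
{
    unsigned long long tests[] = { 1ULL, 0xFFULL, 0x8000000000000000ULL };
    for (int i = 0; i < 3; i++) {
        /* Both calls should report the same count of leading zero bits. */
        printf("%016llx: intrinsic=%d inline-asm=%d\n",
               tests[i], foo(tests[i]), xorless_lzcntq(tests[i]));
    }
    return 0;
}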