When cross-compiling with clang and the -target option, targeting the same architecture and hardware as the native system, I've noticed that clang generates worse-optimized code than its natively built counterpart when the <sys> component of the triple is none.
Consider this simple code example:
int square(int num) {
    return num * num;
}
When optimized at -O3 with -target x86_64-linux-elf, matching the native x86_64 system, code generation yields:
square(int):
        mov     eax, edi
        imul    eax, edi
        ret
The code generated with -target x86_64-none-elf yields:
square(int):
        push    rbp
        mov     rbp, rsp
        mov     eax, edi
        imul    eax, edi
        pop     rbp
        ret
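For reference, both listings came from invocations along these lines (assuming the snippet above is saved as square.cpp; the file name is only for illustration):

clang++ -O3 -S -masm=intel -target x86_64-linux-elf square.cpp -o -
clang++ -O3 -S -masm=intel -target x86_64-none-elf square.cpp -o -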
Despite identical hardware and optimization flags, an optimization is clearly being missed: the extra instructions in the none output are a frame-pointer prologue and epilogue that serve no purpose here. The problem goes away if none is replaced with linux in the target triple, even though no system-specific features are used.
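Since that particular difference is just frame-pointer setup and teardown, one guess is that the two triples default to different frame-pointer behavior. As a sketch of that guess (not a full explanation), forcing it by hand with clang's -fomit-frame-pointer flag:

clang++ -O3 -S -masm=intel -fomit-frame-pointer -target x86_64-none-elf square.cpp -o -

I'd expect this to restore the three-instruction version of square, but it doesn't account for the other missed optimizations described below.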
At first glance it may look as though it simply isn't optimizing at all, but other code segments show that some optimizations are still performed, just not all of them; loop unrolling, for example, still occurs.
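As an illustration (this function is a made-up stand-in, not code from my project), a fixed-trip-count loop like the following still has its loop structure fully unrolled away at -O3 under both triples:

int sum16(const int *a) {
    int s = 0;
    // constant trip count: clang unrolls this loop entirely at -O3
    for (int i = 0; i < 16; ++i)
        s += a[i];
    return s;
}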
Although the examples above use x86_64, in practice this issue is producing code bloat for a resource-constrained armv7-based embedded system, where I've noticed several missed optimizations such as (a sketch of the cross invocation follows this list):

- Not removing unnecessary setup/cleanup instructions (same as in the x86_64 example)
- Not coalescing certain sequential inlined increments into a single add instruction (at -Os, when inlining vector-like push_back calls; this is optimized when built natively on an arm-based system running armv7)
- Not coalescing adjacent small integer values into a single mov (such as merging a 32-bit int with a bool in an optional implementation; again, this is optimized when built natively on an arm-based system running armv7)
- etc.
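For context, the cross invocation for that embedded target looks roughly like this (the CPU and file names are placeholders, not my real build flags):

clang++ -Os -target armv7-none-eabi -mcpu=cortex-a7 -c widget.cpp -o widget.o

whereas the native build on the arm system omits the -target flag and produces the tighter code.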
I would like to know what I can do, if anything, to achieve the same optimizations when cross-compiling as when compiling natively. Are there any relevant flags that tweak tuning and that are somehow implied by the <sys> component of the triple?
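I assume that whatever the driver changes per triple would show up in the cc1 command line that clang's -### flag prints, so diffing the two might expose the differing defaults (the output file names here are placeholders):

clang++ -O3 -target x86_64-linux-elf -### square.cpp 2> linux-cc1.txt
clang++ -O3 -target x86_64-none-elf -### square.cpp 2> none-cc1.txt
diff linux-cc1.txt none-cc1.txt

But I don't know which of the differing options, if any, are the relevant ones.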
If possible, I'd also love some insight into why the cross-compilation target fails to optimize certain things simply because the system differs, despite the architecture and ABI being the same. My understanding is that LLVM's optimizer operates on the IR, which should yield effectively the same optimizations as long as nothing relies on the target system itself.