When cross-compiling with clang and the -target option, targeting the same architecture and hardware as the native system, I've noticed that clang generates worse-optimized code than its natively built counterpart when the <sys> component of the triple is none.
Consider this simple code example:
int square(int num) {
    return num * num;
}
When optimized at -O3 with -target x86_64-linux-elf, matching the native x86_64 system, code generation yields:
square(int):
        mov     eax, edi
        imul    eax, edi
        ret
The code generated with -target x86_64-none-elf yields:
square(int):
        push    rbp
        mov     rbp, rsp
        mov     eax, edi
        imul    eax, edi
        pop     rbp
        ret
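For reference, both listings came from invocations along these lines (assuming the snippet above is saved as square.cpp; the file name is only for illustration):

clang++ -O3 -S -masm=intel -target x86_64-linux-elf square.cpp -o -
clang++ -O3 -S -masm=intel -target x86_64-none-elf square.cpp -o -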
Despite identical hardware and optimization flags, an optimization is clearly being missed: the extra instructions in the none output are a frame-pointer prologue and epilogue that serve no purpose here. The problem goes away if none is replaced with linux in the target triple, even though no system-specific features are used.
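Since that particular difference is just frame-pointer setup and teardown, one guess is that the two triples default to different frame-pointer behavior. As a sketch of that guess (not a full explanation), forcing it by hand with clang's -fomit-frame-pointer flag:

clang++ -O3 -S -masm=intel -fomit-frame-pointer -target x86_64-none-elf square.cpp -o -

I'd expect this to restore the three-instruction version of square, but it doesn't account for the other missed optimizations described below.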
At first glance it may look as though it simply isn't optimizing at all, but other code segments show that some optimizations are still performed, just not all of them; loop unrolling, for example, still occurs.
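As an illustration (this function is a made-up stand-in, not code from my project), a fixed-trip-count loop like the following still has its loop structure fully unrolled away at -O3 under both triples:

int sum16(const int *a) {
    int s = 0;
    // constant trip count: clang unrolls this loop entirely at -O3
    for (int i = 0; i < 16; ++i)
        s += a[i];
    return s;
}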
Although the examples above use x86_64, in practice this issue is producing code bloat for a resource-constrained armv7-based embedded system, where I've noticed several missed optimizations such as (a sketch of the cross invocation follows this list):

- Not removing unnecessary setup/cleanup instructions (same as in the x86_64 example)
- Not coalescing certain sequential inlined increments into a single add instruction (at -Os, when inlining vector-like push_back calls; this is optimized when built natively on an arm-based system running armv7)
- Not coalescing adjacent small integer values into a single mov (such as merging a 32-bit int with a bool in an optional implementation; again, this is optimized when built natively on an arm-based system running armv7)
- etc.
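For context, the cross invocation for that embedded target looks roughly like this (the CPU and file names are placeholders, not my real build flags):

clang++ -Os -target armv7-none-eabi -mcpu=cortex-a7 -c widget.cpp -o widget.o

whereas the native build on the arm system omits the -target flag and produces the tighter code.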
I would like to know what I can do, if anything, to achieve the same optimizations when cross-compiling as when compiling natively. Are there any relevant flags that tweak tuning and that are somehow implied by the <sys> component of the triple?
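I assume that whatever the driver changes per triple would show up in the cc1 command line that clang's -### flag prints, so diffing the two might expose the differing defaults (the output file names here are placeholders):

clang++ -O3 -target x86_64-linux-elf -### square.cpp 2> linux-cc1.txt
clang++ -O3 -target x86_64-none-elf -### square.cpp 2> none-cc1.txt
diff linux-cc1.txt none-cc1.txt

But I don't know which of the differing options, if any, are the relevant ones.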
If possible, I'd also love some insight into why the cross-compilation target fails to optimize certain things simply because the system differs, despite the architecture and ABI being the same. My understanding is that LLVM's optimizer operates on the IR, which should yield effectively the same optimizations as long as nothing relies on the target system itself.