TL:DR: only use inline asm if you're going to hand-optimize the whole thing and beat the compiler. Otherwise use intrinsics like __builtin_clzll, or the GCC-only __builtin_ia32_bsrdi.
Your actual problem was unrelated to inline asm.
1 has type int, and unlike most operators, << doesn't promote the left side to match the right side. So you're shifting an int by more than 31 bits, which is undefined behaviour. In practice you get a 32-bit operand-size shift, which you could have noticed by looking at the compiler's asm output. (Generally a good idea whenever you're using inline asm.)
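For example, the difference looks something like this (a hypothetical reconstruction of the bug, not your exact code):

uint64_t bad  = 1 << bit;      // int shifted by up to 63: UB, and in practice a 32-bit operand-size shift
uint64_t good = 1ULL << bit;   // 64-bit shift, well-defined for bit = 0..63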
You don't need inline asm for this; it's easy to express using GNU C builtins.
But if performance is critical, you might want to use inline asm to work around compiler missed-optimizations for current microarchitectures in the shift part, especially if you're compiling without BMI2 for efficient variable-count shifts on Intel CPUs.
Note that bsr is quite slow on AMD CPUs, while lzcnt is fast on all CPUs that support it. On Intel, both are fast. But unlike bsf/tzcnt, the two instructions produce different results even for non-zero inputs, so using rep bsr to get a fast lzcnt on newer CPUs wouldn't be useful the way it sometimes is for tzcnt. (tzcnt runs as bsf on older CPUs, and lzcnt runs as bsr, because their encodings are just a REP prefix in front of bsf / bsr.)
Unfortunately bsr is also quite slow on Atom/Silvermont, e.g. on Goldmont Plus: 11 uops, 9 cycle latency, and one per 8 cycle throughput.
See my answer on Find most significant bit (left-most) that is set in a bit array for a rundown of how various compilers are dumb with 63-__builtin_clzll(x), which could optimize back to just a bsr but doesn't.
GCC specifically (not clang) has an intrinsic __builtin_ia32_bsrdi for 64-bit bsr, and both support _bit_scan_reverse for 32-bit.
The obvious way to write this is with portable GNU C that works on any target GCC supports:
#include <stdint.h>
#include <limits.h>    // CHAR_BIT
#include <assert.h>    // static_assert (C11)

uint64_t LeadingBitIsolate_gnuc(uint64_t a)
{
    static_assert( CHAR_BIT * sizeof(unsigned long long) == 64, "__builtin_clzll isn't 64-bit operand size");
    uint64_t bit = 63-__builtin_clzll(a);   // BSR.  __builtin_clzll(0) is undefined, but the result is only used when a != 0
    return a ? (1ULL << bit) : 0;  // ULL is guaranteed to be at least a 64-bit type
}
If a is ever a compile-time constant, constant-propagation through this function works. Using inline asm always defeats that unless you use something like if(__builtin_constant_p(a)) { non-asm version } else { asm version}. (https://gcc.gnu.org/wiki/DontUseInlineAsm)
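A wrapper along these lines keeps constant-propagation working (just a sketch; LeadingBitIsolate_asm is the inline-asm version defined further down in this answer):

static inline uint64_t LeadingBitIsolate(uint64_t a)
{
    if (__builtin_constant_p(a))
        return a ? (1ULL << (63 - __builtin_clzll(a))) : 0;   // pure C: folds at compile time
    else
        return LeadingBitIsolate_asm(a);    // inline-asm version from later in this answer
}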
I used uint64_t for portability to x86-64 targets where unsigned long is a 32-bit type. (Linux x32 (an ILP32 ABI), and MS Windows x64). Also to non-x86 targets since this doesn't use any inline asm.
Unfortunately this compiles to pretty bad asm with gcc and clang (Godbolt)
# gcc9.2 -O3 -mtune=skylake
LeadingBitIsolate_gnuc(unsigned long):
        movq    %rdi, %rax
        testq   %rdi, %rdi
        je      .L4
        bsrq    %rdi, %rax
        movl    $63, %ecx
        xorq    $63, %rax      # bit = 63 - bit in the low 6 bits, 2's complement bithack
        subl    %eax, %ecx     # bit = 63 - bit
        movl    $1, %eax       # 1 uop
        salq    %cl, %rax      # 3 uops
.L4:
        ret
Using BSR64 = __builtin_ia32_bsrdi from my answer on Find most significant bit (left-most) that is set in a bit array, we get this from GCC (and similar using cmov from clang and ICC). ICC/MSVC provide intrinsics that return a bool for the ZF output from BSR. test/je macro-fuses into a single uop on modern x86, so it costs about the same as je alone, but saving a separate test instruction in front of cmov does have value.
// See the linked answer for the ICC/MSVC part of this
#ifdef __GNUC__
#ifdef __clang__
static inline unsigned BSR64(uint64_t x) {
    return 63-__builtin_clzll(x);
    // gcc/ICC can't optimize this back to just BSR, but clang can and doesn't provide alternate intrinsics
    // update: clang8.0 regressed here but still doesn't do __builtin_ia32_bsrdi
}
#else
#define BSR64 __builtin_ia32_bsrdi
#endif
#include <x86intrin.h>
#define BSR32(x) _bit_scan_reverse(x)
#endif

uint64_t LeadingBitIsolate_gnuc_v2(uint64_t a)
{
    uint64_t bit = BSR64(a);
    return a ? (1ULL << bit) : 0;
}
This compiles better with GCC, at least for the BSR part. But still not for the shift part.
# gcc -O3 -mtune=skylake
LeadingBitIsolate_gnuc_v2(unsigned long):
        movq    %rdi, %rax
        testq   %rdi, %rdi
        je      .L9
        bsrq    %rdi, %rcx
        movl    $1, %eax
        salq    %cl, %rax      # missed optimization: xor-zero + bts would be much better especially with tune=skylake
.L9:
        ret
On the plus side here, we've optimized out most of the BSR overhead by using a BSR intrinsic. And it can still compile to LZCNT on CPUs where that's a better choice, e.g. with clang -march=znver1 (Ryzen). GCC still uses BSR but clang uses 63-lzcnt which will run faster on AMD.
63-__builtin_clzll is portable to non-x86 but __builtin_ia32_bsrdi isn't. (DI = double-word integer = 64-bit).
But the compiler does still know what's happening, and can optimize this into surrounding code, and for compile-time-constant inputs.
E.g. it can branch over other things, too, if the input=0, only testing that once. And it knows the result has only a single bit set (or is zero), so in theory if you're using it as a divisor or modulus, a compiler could be smart enough to use a shift or an AND with res-1 instead of a slow div instruction. (Division by zero is UB, so the compiler can assume that didn't happen, maybe even optimizing away the a==0 ? part of this after inlining.)
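For example, a hypothetical caller like this could in principle end up with an AND instead of a div after inlining (whether current compilers actually manage it is another question):

uint64_t mod_by_leading_bit(uint64_t x, uint64_t a)
{
    uint64_t m = LeadingBitIsolate_gnuc_v2(a);   // single bit set, or 0 (division by 0 would be UB)
    return x % m;            // for a power-of-2 m, x % m == (x & (m-1))
}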
Compiling the builtin version with BMI2 available, so the shift can use SHLX, would make it efficient on modern Intel.
Using inline asm to work around missed optimizations for Sandybridge-family
SnB-family has slower-than-usual variable-count shifts: instead of potential stalls if the flag-result is read, it includes 2 extra uops to check if the count is zero and conditionally merge the shift's flag-result into FLAGS. (shl %cl, %reg has to leave FLAGS unmodified if cl=0. This is the kind of legacy baggage that's part of the "x86 tax" that high performance x86 CPUs have to pay to do superscalar out-of-order execution.)
AMD apparently manages shifts as a single uop without limitations / penalties.
Thus bts into a zeroed register is a better choice for implementing 1ULL << count, especially on Intel CPUs where xor-zeroing is extra cheap, and bts reg,reg is a single uop (instead of 2 on Ryzen).
Compared to the version in @DavidWohlferd's answer (based on my comments), this saves instructions by using a as the zero in the case where a==0, instead of needing an extra zeroed register. And the comments talk about performance implications.
#include <stdint.h>
#include <limits.h>
uint64_t LeadingBitIsolate_asm(uint64_t a)
{
    uint64_t bit;
    uint64_t bts_target = 0;
    asm (
        "BSR   %[a], %[bit]    \n\t"  // ZF = (a==0)
        "BTS   %[bit], %[res]  \n\t"  // sets CF = old bit = 0 but not ZF
          // possible partial-flag stall on some CPUs for reading ZF after writing CF
          // but not SnB-family
        "CMOVZ %[a], %[res]"          // res = a (= 0) if BSR detected a==0
        : [bit] "=r" (bit), [res] "+r"(bts_target)
          // no early clobber needed: it's fine (and even desirable) for %[bit] to share a register with %[a].
          // The trailing CMOVZ does re-read %[a] after %[bit] is written, but it only uses that value when a==0,
          // and in that case BSR left the register unmodified anyway (see below).
        : [a] "r" (a)
        : "cc"  // redundant but not harmful on x86: flags clobber is always implied
    );
    return bts_target;
}
Reading ZF after BTS writes CF is dangerous for older Intel CPUs, where it will cause a partial-flag stall. See also Problems with ADC/SBB and INC/DEC in tight loops on some CPUs
GCC 9.2 gives us a sequence that's only 4 uops on Skylake. (https://agner.org/optimize/). The xor-zeroing needs a front-end uop, but on Sandybridge-family doesn't need a back-end execution unit (0 unfused-domain uops).
It's 5 uops on Haswell and earlier where cmov is 2 uops / 2c latency. Since you say your CPU doesn't support lzcnt, you might have an IvyBridge or earlier (2 uop cmov), or an old AMD.
Branching might be worth it if input=0 is expected to be rare or never happen, especially if this is part of a critical path for latency, and especially on older Intel where cmov is 2 uops.
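A branchy version might look something like this (just a sketch, not tuned or benchmarked): handle a==0 in C, so the asm only ever sees a non-zero input and only needs bsr + bts.

uint64_t LeadingBitIsolate_branchy(uint64_t a)
{
    if (a == 0)
        return 0;                   // the compiler can combine / hoist this test after inlining
    uint64_t bit, res = 0;
    asm("bsr  %[a], %[bit]  \n\t"   // a != 0 here, so the result is well-defined
        "bts  %[bit], %[res]"
        : [bit] "=r"(bit), [res] "+r"(res)
        : [a] "r"(a)
        : "cc");
    return res;
}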
# gcc9.2 -O3 -mtune=skylake
LeadingBitIsolate_asm(unsigned long):
        xorl    %eax, %eax
        BSR %rdi, %rdi
        BTS %rdi, %rax
        CMOVZ %rdi, %rax
        ret
A "rm"(a) input constraint would be possible, but
- we read it twice so it's better to have the compiler load into a register once
- clang is dumb and always picks
m when it's an option, even if that means spilling a value that was already in a register. clang (LLVM) inline assembly - multiple constraints with useless spills / reloads
bsr has a "false" dependency on existing CPUs, so it's better if the compiler happens to pick the same register for res and a instead of writing a new register. This is what usually happens if a isn't needed later.
BSR's output dependency isn't really false: if the input is zero, the destination is left unmodified. AMD even documents this, while Intel implements it in hardware but leaves their documentation saying the contents are "undefined".
We can't take advantage of that here anyway: we have 65 possible outputs, but BTS into a zeroed reg can only produce 64 different values.
You might be tempted to use rcl (rotate-through-carry) on a register holding 1, but first of all rcl %cl, %reg is quite slow, and second it still masks the count with & 63, so it can't shift a 1 all the way out into CF.
No partial-flag stalls, and shorter critical path on older Intel
Using BTC and taking advantage of BSR zero-input behaviour we can make a version that's possibly better on older Intel P6-family, like Nehalem and Core 2 (where cmov is 2 uops and partial-flag stalls are a problem).
It doesn't save uops though because it requires an extra xor-zeroed register. It does shorten the critical path latency, and could shorten it even more if we used test/setz in parallel with 3-cycle-latency bsr, instead of using the ZF output from BSR. Or if the compiler already did math on the input, it might already have ZF set appropriately. Unfortunately there's no way to query that.
btc = complement = flip a bit, instead of unconditionally setting one. That can produce a zero result when the destination register holds 1 instead of 0, as long as we know which bit-index the a==0 case will flip (bit 0 here).
xor-zero / set flags / setcc is a standard idiom, generally at least as efficient as set flags / setcc / movzx because xor-zeroing can be even cheaper than movzx, and it's off the critical path for latency. setcc is a 1-uop instruction on all x86-64 CPUs.
// depends on BSR behaviour that only AMD documents, but all real CPUs implement (for now?)
// it seems unlikely that this will change, though.
// The main advantage of this version is on P6-family and early Sandybridge
// where it definitely does work safely.
uint64_t LeadingBitIsolate_asm_p6(uint64_t a)
{
    uint64_t btc_target;
    uint64_t bit = 0;
    //bool input_was_zero;
    asm (
        "xor  %k[res], %k[res]\n\t"  // make sure we avoid P6-family partial-reg stalls (setz + reading the full reg) by forcing xor-zeroing, not MOV
        "bsr  %[a], %[bit]    \n\t"  // bit=count, or unmodified (if a==0)
        "setz %b[res]\n\t"           // res = (a==0)
        "btc  %[bit], %[res]  \n\t"  // flip a bit.  For a==0, flips bit 0 which SETZ set to 1
        : [bit] "+r" (bit), [res] "=&q"(btc_target)  // q = reg accessible as low-8, like RAX has AL.  Any x86-64 reg
          // early-clobber on [res]: it's written (xor-zeroed) before %[a] is read.  [bit] isn't written until BSR has read %[a], so it doesn't need &.
          // ,"=@ccz" (input_was_zero)   // optional GCC6 flag output.  Or "=@ccc" to read from CF (btc result) and avoid partial-flag stalls
        : [a] "r" (a)
        : "cc"
    );
    return btc_target;
    //return input_was_zero;
}
GCC9 and current trunk have a weird register-allocation missed optimization, producing res in r8 and then having to mov it back to RAX.
# gcc8.3 -O3 -mtune=nehalem
LeadingBitIsolate_asm_p6(unsigned long):
        xorl    %edx, %edx     # compiler-generated
        xor %eax, %eax         # inline asm
        bsr %rdi, %rdx
        setz %al
        btc %rdx, %rax
        ret
On Intel CPUs like Core2, Nehalem, and Sandybridge-family, this is 5 uops with no stalls. BSR has 3 cycle latency, the rest have 1 cycle latency. From RDI input to RAX output, the latency is 3 + 1 + 1 = 5 cycles. (setz has to wait for BSR's output. As I mentioned above, test/setz on a before the bsr would allow instruction-level parallelism, spending an extra uop on test to shorten critical-path latency by another cycle.)
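Something like this might get that extra cycle back (an untested sketch, with the same reliance on BSR leaving its destination unmodified for a==0 as the version above):

uint64_t LeadingBitIsolate_asm_p6_ilp(uint64_t a)
{
    uint64_t btc_target;
    uint64_t bit = 0;
    asm (
        "xor  %k[res], %k[res]\n\t"   // xor-zero (also clears ZF, but test rewrites it)
        "test %[a], %[a]      \n\t"   // ZF = (a==0), independent of bsr's 3c latency
        "setz %b[res]\n\t"
        "bsr  %[a], %[bit]    \n\t"   // runs in parallel with test/setz
        "btc  %[bit], %[res]"
        : [bit] "+r" (bit), [res] "=&q"(btc_target)
        : [a] "r" (a)
        : "cc"
    );
    return btc_target;
}

That's 6 uops instead of 5 (counting the compiler's xor-zeroing of the bit register), but the critical path from a to the result is roughly max(3c bsr, 2c test+setz) + 1c btc = 4 cycles instead of 5.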
On AMD Bulldozer-family / Ryzen, the btc version is a lot more expensive, mostly because of bsr. The setz is still 1 uop, but the btc is 2 uops.
On Atom/Silvermont, it's also expensive because of BSR.
There's no workaround for slow BSR unless you make a version using lzcnt and do runtime dispatching or something. (Probably for a loop that calls this function; call overhead might still be worse than using bsr on a CPU with 8 to 10 uop bsr. Especially Ryzen with its high uop throughput.)
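For reference, a sketch of what the lzcnt version could look like (needs -mlzcnt or a -march that implies it; _lzcnt_u64 is from <immintrin.h> and isn't used anywhere else in this answer):

#include <immintrin.h>
uint64_t LeadingBitIsolate_lzcnt(uint64_t a)
{
    // _lzcnt_u64(0) == 64 is well-defined, but 1ULL << 64 isn't,
    // so the a==0 special case is still needed.
    return a ? (1ULL << (63 - _lzcnt_u64(a))) : 0;
}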
The btc version works fairly obviously for the a!=0 case, doing BTC into a zeroed register to flip the bit at the bit-index found by BSR.
For the a==0 case:
        # starting: rdx=rax=0
        bsr  %rdi, %rdx     # leaves RDX=0 (unmodified) and sets ZF
        setz %al            # sets RAX=1
        btc  %rdx, %rax     # flips bit 0 of RAX, changing it back to 0
        # rax = 0 (result), rdx = 0 (bitpos)
When I used two "+r" inputs both set to zero in C, GCC decided to xor-zero EAX/RAX, but used a mov %rax, %rdx to copy that to RDX! (Wasting a REX prefix). That's clearly worse on Sandybridge which has elimination for xor-zeroing but not mov-elimination. It's better on AMD Ryzen, though, which has mov-elimination but still needs a back-end uop for xor-zeroing.
Copying a 0 with mov vs. xor-zeroing is basically neutral on P6-family (Core2 / Nehalem), except that only xor-zeroing avoids partial-register stalls for writing AL with setcc and reading RAX with btc. So I put the xor-zeroing inside inline asm to make sure it's actually xor-zeroing, not a mov, that gcc picks for res. (Why doesn't GCC use partial registers?).
Note that the BSR behaviour of leaving the destination unmodified for the input=0 case is only documented by AMD, not officially documented by Intel. But it is implemented by all Intel's hardware, at least any that supports 64-bit mode. IDK about Via Nano, or ancient Intel.