I keep seeing these palindrome questions with clunky code from beginners, which got me interested in writing a version that's actually efficient. My byte-at-a-time loop should run at one iteration per clock on Intel Haswell, but slower on earlier Intel CPUs, where limited macro-fusion for back-to-back cmp/jcc pairs pushes the loop above 4 fused-domain uops.
Also see below for a way to make it case-insensitive.
Checking more than one byte at a time works, even with overlap at the crossover in the middle:
abcdcba
abcd
dcba ; overlap by one
bcdc
cdcb ; overlap by two
Load a chunk of bytes, reverse the order with bswap, with movbe (Silvermont / Haswell), with SSSE3 pshufb, or even with rol ax,8 / rol eax,16 / rol ax,8 for a 4-byte chunk in eax. Then compare against the same number of bytes loaded from the position where the mirror-image chunk should be.
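For example, here's a minimal sketch of how that rol sequence reverses the 4 bytes of EAX (labeling the bytes b3.b2.b1.b0 from most to least significant); bswap eax does the same thing in one instruction on anything 486 or later:
rol ax, 8 ; swap the two low bytes: b3.b2.b1.b0 -> b3.b2.b0.b1
rol eax, 16 ; swap the 16-bit halves: b3.b2.b0.b1 -> b0.b1.b3.b2
rol ax, 8 ; swap the two low bytes again: b0.b1.b3.b2 -> b0.b1.b2.b3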
Add in printing of results however you like. I just return an exit status.
All of this works the same in 32-bit mode, but I used 64-bit because the register-arg calling convention avoids cluttering the code with stack manipulation.
Note the use of local labels (.label) to avoid conflicts between similar blocks of code that use the same label names.
DEFAULT rel
section .text
ALIGN 16
global check_palindrome
; AMD64 SysV calling convention
check_palindrome: ; (size_t len /*rdi*/, const char *str /*rsi*/)
;;returns bool
cmp rdi, 8
jb check_palindrome_byte_at_a_time ; tailcall the version that handles small inputs
add rdi, rsi ; rdi = end pointer
;ALIGN 16 ; probably not worth it, depending on the CPU
.palin_loop:
lodsq ; rax = [rsi], rsi+=8. lodsd/q is only 2 uops on Haswell
bswap rax
sub rdi, 8
cmp rax, [rdi]
jne .not_palin
cmp rsi, rdi
jb .palin_loop ; stop looping when the pointers cross
;; Loop has 7 uops on Haswell, so it can sustain one iteration (8 bytes in each direction) per 2 cycles.
;; 9 uops on SnB, where lodsq is 3 uops and only one of the cmp/jcc pairs can macro-fuse. For SnB, use mov / add instead of lodsq.
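;; (e.g. mov rax, [rsi] / add rsi, 8: a 2-uop pair that does the same thing as lodsq with DF=0)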
;; with unrolling, we might get this down closer to 2x8B per clock, instead of per 2 clocks.
;; or with SSE/AVX, 2x16B per clock. Probably not 2x32B per clock with AVX, due to needing two shuffles. (no cross-lane byte shuffle until AVX512)
; input was a palindrome
mov eax, 1 ; return true
ret
.not_palin:
xor eax,eax ; return false
ret
ALIGN 16
;; helper function with the same signature as the main version
; only needed for small inputs: the main version handles unaligned or non-multiple-of-8 lengths itself via the overlap trick
; assume that rdi < 2^32 so we can use edi interchangeably, for smaller code-size.
; If our prototype was (unsigned int, const char*), we'd have to mov ecx, edi or something.
; (moving to a different reg is preferable to mov edi,edi because mov-elimination never works on mov same,same)
check_palindrome_byte_at_a_time:
cmp edi, 1
jbe .is_palin ; the empty string is a palindrome, as is length 1
;ALIGN 16
.palin_loop: ; do{
movzx eax, byte [rsi + rdi - 1] ; 2-register addresses can't micro-fuse on SnB-family, but this is a pure load that doesn't need to micro-fuse with anything. MOVZX writes the full register, avoiding the false dependency / partial-register merging that mov al, [mem] would create on some CPUs.
cmp al, [rsi]
jne check_palindrome.not_palin
inc rsi ; pointer moves forward
sub edi, 2 ; index counts down towards zero twice as fast
ja .palin_loop ; }while((idx-=2) > 0U); Ideally we'd also leave the loop when EDI hits 1, since the single middle byte always equals itself, but the extra iteration is harmless.
;;; Haswell and later can fuse both branches even when they hit the decoders in the same cycle
;;; so this loop should hopefully be 4 fused-domain uops and run at one iteration per clock
.is_palin:
mov eax, 1 ; return true
ret
; .not_palin: ; shared code with other check_palindrome version
global main
extern strlen
ALIGN 16
main:
;; return !check_palindrome(strlen(argv[1]), argv[1])
;mov edi, inputstr_len
;mov esi, inputstr
mov rdi, [rsi+8] ; argv[1]
push rdi ; save it, and align the stack
call strlen
mov rdi, rax
mov rsi, [rsp] ; could pop here and take advantage of the fact that we know check_palindrome doesn't care about the stack
call check_palindrome
pop rdi ; shorter than add rsp,8 and faster on Intel CPUs with a stack engine, given the surrounding instructions
xor eax, 1 ; we know check_palin returns a _Bool, so flip it
ret
;;; Unused, but improved in case you do want to use it
section .rodata
inputstr db "mommom"
inputstr_len equ $-inputstr
msg_palin db "is palindrome",10,0
msg_nopalin db "is not palindrome" ; newline and terminating zero are on the next line, with their own label
msg_newline db 10,0
(Inspiration for how to write some of main() came from gcc output on Godbolt; compiler output is often a good starting point.)
Tested and works:
$ yasm -Worphan-labels -gdwarf2 -felf64 palindrome.asm && gcc palindrome.o -o palindrome
$ ./palindrome '' && echo "true" || echo "false"
true
$ ./palindrome 1 && echo "true" || echo false
true
$ ./palindrome abccba && echo "true" || echo "false"
true # even size: pair of middle chars
$ ./palindrome abcdcba && echo "true" || echo "false"
true # odd size: single middle char
$ ./palindrome abcdeba && echo "true" || echo "false"
false
$ ./palindrome 'ab bcdcb 1234 bcdcb baab bcdcb 4321 bcdcb ba' && echo "true" || echo "false"
true
$ ./palindrome 'ab bcdcb 1234 bcdcb bab bcdcb 4321 bcdcb ba' && echo "true" || echo "false"
true
$ ./palindrome 'ab bcdcb 1234 bcdcb baab bcdcb 4321 bcdcb baa' && echo "true" || echo "false"
false
$ ./palindrome && echo "true" || echo "false"
Segmentation fault (core dumped) # main doesn't check argc
Case-insensitive:
As well as byte-reversing one chunk, force the alphabetic characters of both chunks to lower case before comparing. With ASCII, this is easy. That answer I linked even has an SSE vector implementation of case-flipping, which can be modified to just force all alphabetic characters to the same case (either replace the final pxor with por, or use a blend with the compare result as the control, instead of pandn/por.)
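For example, here's a minimal SSE2 sketch of forcing one 16-byte chunk to lower case (not taken from that linked answer; the constant names and the force_lowercase_xmm label are made up here). It sets the 0x20 bit only in bytes that are in 'A'..'Z', so digits, punctuation, and non-ASCII bytes still have to match exactly; you'd run it on both chunks before the pcmpeqb:
DEFAULT rel
section .rodata
ALIGN 16
ucase_low times 16 db 'A'-1 ; bound for (byte > 'A'-1), i.e. byte >= 'A'
ucase_high times 16 db 'Z' ; bound for (byte > 'Z')
lcase_bit times 16 db 0x20 ; ASCII upper -> lower just sets this bit
; the three constants are 16 bytes each after ALIGN 16, so the non-VEX memory operands stay aligned
section .text
force_lowercase_xmm: ; in/out: xmm0 = 16 input bytes. Clobbers xmm1, xmm2.
movdqa xmm1, xmm0
movdqa xmm2, xmm0
pcmpgtb xmm1, [ucase_low] ; 0xFF where byte >= 'A' (signed compare is fine: 'A'..'Z' are below 0x80)
pcmpgtb xmm2, [ucase_high] ; 0xFF where byte > 'Z'
pandn xmm2, xmm1 ; (byte >= 'A') AND NOT (byte > 'Z') = is-uppercase mask
pand xmm2, [lcase_bit] ; keep the 0x20 bit only in the uppercase positions
por xmm0, xmm2 ; set the lowercase bit in those bytes
ret
The same range-check-then-OR idea works per byte in the scalar loop, too.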
Unrolling:
As well as using displacements ([rsi+8], [rdi-8], etc) to save add/sub instructions, you can reduce the number of branches by ORing or ANDing some compare results together. This may or may not save anything. On a CPU that doesn't suffer from partial-register merging slowdowns at all (e.g. AMD), this might be a win:
.palin_loop:
xor eax,eax ; inside the loop to break the false dependency
xor edx,edx
movbe ... / cmp ... ; movbe is a load and endian swap in one insn. On Haswell it just saves code-size, not uops. On Silvermont it's a win
setne dl
movbe rcx, [rsi+rdi-16] / cmp [rsi+8], rcx
setne dh
shl edx, 16 ; DH merging takes a cycle on Haswell
... repeat setting dl/dh so edx holds 4 results from four 8B checks
movbe ... / cmp ...
setne al
movbe ... / cmp ...
setne ah
shl eax, 16
... repeat setting al/ah
or eax, edx ; or unroll twice as much, and use rax,rdx
jnz .not_palin
Perhaps a better strategy would be to or some xor results, like
movbe rdx, [rsi+rdi-8]
xor rdx, [rsi] ; all-zero if they match, otherwise not
movbe rcx, [rsi+rdi-16]
xor rcx, [rsi+8]
or rcx, rdx ; ZF=0 if there are any mismatches in either XOR result
With SSE, where compare results already are boolean 0/-1 elements, you can pand to check that every element matched in both pcmpeqb results. Then one pmovmskb / cmp eax, 0xffff at the end can check multiple vectors of compare results. (See glibc memcmp and strlen unrolled loops for this and other tricks.)
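For example, checking two 16-byte pairs per step could look something like this rough sketch (assuming, as in the AVX version below, that rsi is the forward pointer, rdi the remaining length, and xmm5 holds the byte-reverse shuffle control; the pointer/length updates, loop branch, and cleanup are left out):
movdqu xmm0, [rsi + rdi - 16] ; last 16 remaining bytes
movdqu xmm1, [rsi + rdi - 32] ; the 16 bytes before those
pshufb xmm0, xmm5 ; reverse each chunk
pshufb xmm1, xmm5
movdqu xmm2, [rsi] ; first 16 remaining bytes
movdqu xmm3, [rsi + 16] ; next 16
pcmpeqb xmm0, xmm2 ; 0xFF in every byte position that matches
pcmpeqb xmm1, xmm3
pand xmm0, xmm1 ; still all-ones only if both pairs matched completely
pmovmskb eax, xmm0
cmp eax, 0xffff
jne .not_palin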
AVX:
It looks like the loop structure from the byte-at-a-time version is ideal here, and we use pshufb to reverse the order of a whole vector. Using pmovmskb and testing the bitmask is a pretty standard idiom for dealing with vector-compare results. It's faster than ptest, because ptest is 2 uops and can't macro-fuse.
DEFAULT rel
section .rodata
ALIGN 16
bytereverse_shuffle db 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
section .text
ALIGN 16
global check_palindrome_AVX1
check_palindrome_AVX1:
cmp rdi, 32
jb check_palindrome_byte_at_a_time ; tailcall the version that handles small inputs
vmovdqa xmm5, [bytereverse_shuffle]
;;ALIGN 16 ; NASM syntax doesn't have an equivalent for the GNU assembler's align-if-padding-will-be-less-than-N-bytes .p2align 4,,8 (4 meaning 2^4 for power-of-2 align)
;; loop alignment is less important for tiny loops that should fit in the loop buffer
.palin_loop:
vmovdqu xmm0, [rsi + rdi - 16]
vpshufb xmm0, xmm0, xmm5 ; bswap
;;; for AVX2 ymm: vperm2i128 ymm0, ymm0, ymm0, 1 ; swap hi and lo lanes
vpcmpeqb xmm0, xmm0, [rsi] ; all we're gaining from AVX is folding this potentially-unaligned memory operand. Everything else works with SSE
vpmovmskb eax, xmm0
cmp eax, 0xffff ; check that all 16 bits are set -> all 16 bytes matched
jne .not_palin
add rsi, 16 ; pointer moves forward
sub rdi, 32 ; index counts down towards zero twice as fast
jae .palin_loop ; not ja: we still need to check 16 reversed bytes against the same 16 bytes if we exactly meet in the middle
;;; fused-domain uops: 7 for Haswell, 8 for SnB, so it should run one iteration per 2c
;;; An SSSE3 version that needs an unaligned load separate from pcmpeq will be 8 uops (or 9 on pre-Haswell)
.is_palin:
mov eax, 1
ret
.not_palin:
xor eax,eax
ret
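For the AVX2 ymm version hinted at above, the loop body could look something like this untested sketch, using vpshufb within lanes plus the vperm2i128 lane swap (vbroadcasti128 ymm5, [bytereverse_shuffle] would put the same 16-byte control in both lanes); you'd only enter it for lengths of 64 and up, tailcalling the xmm/byte versions otherwise:
.palin_loop_ymm:
vmovdqu ymm0, [rsi + rdi - 32]
vpshufb ymm0, ymm0, ymm5 ; reverse bytes within each 128-bit lane
vperm2i128 ymm0, ymm0, ymm0, 1 ; swap the lanes: now all 32 bytes are reversed
vpcmpeqb ymm0, ymm0, [rsi]
vpmovmskb eax, ymm0
cmp eax, -1 ; all 32 bits set -> all 32 bytes matched
jne .not_palin
add rsi, 32
sub rdi, 64
jae .palin_loop_ymm
A real function built around this would also want vzeroupper before returning, and as noted earlier the two shuffles probably keep it from sustaining 2x32B per clock.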