Common wisdom is that rep movsb is much slower than rep movsd (or on 64-bit, rep movsq) when performing identical operations. However, I've been testing on a few modern machines, and the run times are coming out identical (up to measurement noise) across a huge range of buffer sizes (10 bytes to 2 MB). So far I have only tested on 2 machines (32-bit Intel Atom D510 and 64-bit AMD FX 8120).
Are there any modern x86 (32- or 64-bit) machines where rep movsb is slower than rep movsd (or rep movsq)?
If not, what was the last machine where the difference was significant, and how significant was it?
I'm asking this question from a standpoint of wanting to avoid cargo-culting a bunch of tests to break memory up into unaligned head/tail and aligned middle for the sake of using rep movsd or rep movsq if there's no actual benefit to doing this...
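For concreteness, here is a minimal sketch of the two approaches being compared, assuming x86-64 and GCC/Clang extended inline asm. The function names copy_movsb and copy_movsq_split are hypothetical, the split aligns only the destination to 8 bytes (one of several possible choices), and the memcpy fallback exists only so the sketch still builds on other targets. It is an illustration of the head/middle/tail split, not a tuned implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes with a single rep movsb.
 * Assumes the direction flag is clear, as the x86-64 ABIs require. */
static void copy_movsb(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n); /* fallback for non-x86-64 builds */
#endif
}

/* Hypothetical helper: unaligned head, 8-byte middle, unaligned tail.
 * Only the destination is aligned here; the source may stay unaligned. */
static void copy_movsq_split(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) && defined(__GNUC__)
    /* Bytes needed to bring dst up to the next 8-byte boundary. */
    size_t head = (size_t)(-(uintptr_t)dst & 7);
    if (head > n)
        head = n;
    size_t rest = n - head;
    size_t qwords = rest / 8; /* aligned middle, in 8-byte units */
    size_t tail = rest % 8;   /* leftover bytes */

    /* Each rep advances rdi/rsi, so the three pieces chain naturally.
     * A rep with a zero count copies nothing. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(head) : : "memory");
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(qwords) : : "memory");
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(tail) : : "memory");
#else
    memcpy(dst, src, n); /* fallback for non-x86-64 builds */
#endif
}
```

Both functions should produce the same result for any non-overlapping buffers; the question is only whether the extra bookkeeping in the second one buys any speed.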