The first routine I worked on was memcpy(). Comparison of stock memcpy() of a single page on FreeBSD vs Linux on a Westmere (values are TSC deltas):

{{{
x fbsd/westmere/builtin
+ linux/builtin
    N           Min           Max        Median           Avg        Stddev
x 1000           336         18444           340       361.628     573.11483
+ 1000           276          9996           280       288.924     307.34136
Difference at 95.0% confidence
        -72.704 +/- 40.3074
        -20.1046% +/- 11.1461%
        (Student's t, pooled s = 459.847)
}}}

||= Idea =||= Westmere =||= Sandy Bridge =||= Ivy Bridge =||
|| Replace `dec` with `sub` || none || none || none ||
|| Use `movsd` instead of `movsq` || slightly slower || slightly slower || 6% faster ||
|| Simple `movdqa` loop || 138% slower || 58% slower || 46% slower ||
|| `movdqa` 32 at a time (old) || 27% slower || 14% faster || 17% faster ||
|| `movdqa` 32 at a time (new) || 27% slower || 15% faster || 18% faster ||
|| `movdqa` 64 at a time || 224% slower || 131% slower || 116% slower ||
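
For readers who want to collect comparable TSC deltas, here is a minimal sketch of how one page-sized memcpy() can be timed. It is not the harness used for the numbers above. The names `read_ticks`, `time_page_copy`, `collect_samples`, `median_ticks` and `PAGE_BYTES` are made up for illustration, the 4096-byte page size is an assumption, and the volatile function pointer is just one way to keep the compiler from inlining or dropping the call. On non-x86 machines the sketch falls back to nanoseconds from `clock_gettime()`, so the values are only TSC deltas on x86.

```c
#define _POSIX_C_SOURCE 200809L  /* posix_memalign(), clock_gettime() */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>           /* __rdtsc() on GCC and Clang */
#else
#include <time.h>
#endif

#define PAGE_BYTES 4096          /* assumed page size */

/* Call memcpy() through a volatile pointer so the library routine is
 * actually invoked instead of being inlined or optimized away. */
static void *(*volatile copy_fn)(void *, const void *, size_t) = memcpy;

/* Read a tick counter: the TSC on x86, nanoseconds elsewhere. */
static uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Time one page-sized copy and return the tick delta. */
static uint64_t time_page_copy(void *dst, const void *src)
{
    uint64_t start = read_ticks();
    copy_fn(dst, src, PAGE_BYTES);
    return read_ticks() - start;
}

/* Fill samples[0..n-1] with tick deltas. Returns 0 on success, -1 if
 * allocation fails or the final copy does not match the source. */
static int collect_samples(uint64_t *samples, size_t n)
{
    void *src = NULL, *dst = NULL;
    int rc = -1;

    assert(samples != NULL && n > 0);
    if (posix_memalign(&src, PAGE_BYTES, PAGE_BYTES) != 0)
        return -1;
    if (posix_memalign(&dst, PAGE_BYTES, PAGE_BYTES) != 0) {
        free(src);
        return -1;
    }
    /* Touch both pages up front so page faults are not timed. */
    memset(src, 0xA5, PAGE_BYTES);
    memset(dst, 0, PAGE_BYTES);

    for (size_t i = 0; i < n; i++)
        samples[i] = time_page_copy(dst, src);

    if (memcmp(dst, src, PAGE_BYTES) == 0)
        rc = 0;
    free(src);
    free(dst);
    return rc;
}

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sort the samples in place and return the middle element
 * (the upper median when n is even). */
static uint64_t median_ticks(uint64_t *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    qsort(samples, n, sizeof(samples[0]), cmp_u64);
    return samples[n / 2];
}
```

To use it, call `collect_samples()` on an array of 1000 `uint64_t` values, print one value per line, and feed the files from two systems to ministat. The Max column above shows occasional large outliers, which is why the median is usually a better single number to look at than the average.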