wiki:LibCSSE/memcpy

Version 1 (modified by john, 12 years ago)

The first routine I worked on was memcpy().

Comparison of the stock memcpy() copying a single page on FreeBSD vs. Linux, both on a Westmere CPU (values are TSC deltas, so lower is better):

x fbsd/westmere/builtin
+ linux/builtin
    N           Min           Max        Median           Avg        Stddev
x 1000           336         18444           340       361.628     573.11483
+ 1000           276          9996           280       288.924     307.34136
Difference at 95.0% confidence
        -72.704 +/- 40.3074
        -20.1046% +/- 11.1461%
        (Student's t, pooled s = 459.847)
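The samples above are TSC deltas taken around a single-page memcpy(). A minimal sketch of such a measurement loop follows; it is not the harness used for these numbers. The names `PAGE_BYTES`, `read_ticks`, `time_page_copies` and `median_u64` are hypothetical, a 4096-byte page is assumed, and the TSC reads are not serialized, so individual samples will be noisy.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

#define PAGE_BYTES 4096 /* assumed page size */

/* Read a tick counter: the TSC on x86, otherwise nanoseconds from
 * clock_gettime() as a portable stand-in. */
static uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* qsort comparator for uint64_t samples. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Time n single-page memcpy() calls and store one tick delta per call.
 * dst and src must each point at PAGE_BYTES of memory.
 * Note: an optimizing compiler may inline or reorder the memcpy() call. */
static void time_page_copies(void *dst, const void *src,
                             uint64_t *deltas, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t start = read_ticks();
        memcpy(dst, src, PAGE_BYTES);
        deltas[i] = read_ticks() - start;
    }
}

/* Sort the samples in place and return the median
 * (the upper middle element when n is even). */
static uint64_t median_u64(uint64_t *v, size_t n)
{
    assert(n > 0);
    qsort(v, n, sizeof v[0], cmp_u64);
    return v[n / 2];
}
```

Calling `time_page_copies()` with n = 1000 and printing one delta per line gives a file in the one-value-per-line form that ministat reads.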

Idea                                Westmere          Sandy Bridge      Ivy Bridge        Penryn
Replace dec with sub                none              none              none
Use movsd instead of movsq          slightly slower   slightly slower   6% faster
Simple movdqa loop                  138% slower       58% slower        46% slower
movdqa 32 at a time (old)           27% slower        14% faster        17% faster
movdqa 32 at a time (new)           27% slower        15% faster        18% faster
movdqa 32 at a time (reorder)       27% slower        16% faster        19% faster
movdqa 64 at a time (old)           224% slower       131% slower       116% slower
movdqa 64 at a time (new)           4 cycles slower   21% faster        24% faster
Intermix SSE and backwards tests    slightly slower   slightly slower   slightly slower
movaps 32 at a time                 24% slower        18% faster        23% faster        52% faster
movaps 64 at a time                 17% faster        23% faster        25% faster        48% faster

Takeaways from this trial:

  • Use movaps instead of movdqa: movdqa carries an operand-size (0x66) prefix, so each instruction is one byte longer.
  • Minimize branches on the common path.
  • Unroll the copy loop.
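As an illustration of these takeaways, here is a minimal C sketch of a copy loop unrolled to 64 bytes per iteration. It is not the assembly that was benchmarked. It uses the `_mm_load_ps`/`_mm_store_ps` intrinsics, which compilers typically lower to movaps (or vmovaps when AVX is enabled). The name `copy64_fwd` is hypothetical, and the sketch assumes 16-byte aligned, non-overlapping buffers whose length is a multiple of 64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Forward copy, 64 bytes per iteration.
 * Assumes dst and src are 16-byte aligned, len is a multiple of 64,
 * and the buffers do not overlap. */
static void copy64_fwd(void *dst, const void *src, size_t len)
{
    assert(((uintptr_t)dst | (uintptr_t)src) % 16 == 0);
    assert(len % 64 == 0);
#if defined(__SSE__)
    float *d = dst;
    const float *s = src;

    /* One loop branch per 64 bytes; 16 floats == 64 bytes. */
    for (; len != 0; len -= 64, s += 16, d += 16) {
        /* Four aligned 16-byte loads, then four aligned stores. */
        __m128 a = _mm_load_ps(s);
        __m128 b = _mm_load_ps(s + 4);
        __m128 c = _mm_load_ps(s + 8);
        __m128 e = _mm_load_ps(s + 12);
        _mm_store_ps(d, a);
        _mm_store_ps(d + 4, b);
        _mm_store_ps(d + 8, c);
        _mm_store_ps(d + 12, e);
    }
#else
    /* Fallback for targets without SSE, so the sketch still builds. */
    memcpy(dst, src, len);
#endif
}
```

A real memcpy() would also need a path for unaligned pointers and for the tail bytes when the length is not a multiple of 64; those are left out here.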

Now testing the overlap case:

Idea                   Westmere     Sandy Bridge   Ivy Bridge   Penryn
movaps 64 at a time    56% faster   56% faster     56% faster   48% faster
Above using leaq       50% faster   56% faster     60% faster   52% faster

Notes:

  • leaq seems to be in the noise: it was about 4 cycles faster on Ivy Bridge and possibly Sandy Bridge, 4-8 cycles slower on Westmere, and 8 cycles faster on Penryn.
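For illustration, here is a sketch of how an overlapping copy can be handled in C. It assumes the overlap case means the destination starts inside the source range, so the copy has to run from the end of the buffers backwards. The names `needs_backward_copy` and `copy64_bwd` are hypothetical, the same alignment and length assumptions apply as for a 64-byte forward loop, and this is not the assembly that was measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* True when dst starts inside [src, src + len), where a forward copy would
 * overwrite source bytes before reading them. One unsigned compare covers
 * both safe cases: dst below src (the difference wraps to a huge value) and
 * dst at or past the end. dst == src also reports true, which is harmless. */
static int needs_backward_copy(const void *dst, const void *src, size_t len)
{
    return (uintptr_t)dst - (uintptr_t)src < len;
}

/* Backward copy, 64 bytes per iteration, starting at the end of the buffers.
 * Assumes 16-byte aligned pointers and len a multiple of 64. */
static void copy64_bwd(void *dst, const void *src, size_t len)
{
    assert(((uintptr_t)dst | (uintptr_t)src) % 16 == 0);
    assert(len % 64 == 0);
#if defined(__SSE__)
    float *d = (float *)dst + len / 4;
    const float *s = (const float *)src + len / 4;

    while (len != 0) {
        s -= 16;
        d -= 16;
        len -= 64;
        /* Load the whole 64-byte chunk before storing any of it. */
        __m128 a = _mm_load_ps(s);
        __m128 b = _mm_load_ps(s + 4);
        __m128 c = _mm_load_ps(s + 8);
        __m128 e = _mm_load_ps(s + 12);
        _mm_store_ps(d + 12, e);
        _mm_store_ps(d + 8, c);
        _mm_store_ps(d + 4, b);
        _mm_store_ps(d, a);
    }
#else
    /* Fallback for targets without SSE; memmove() handles overlap. */
    memmove(dst, src, len);
#endif
}
```

The single unsigned comparison keeps the common non-overlapping path down to one branch before the forward loop is chosen.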