wiki:LibCSSE/memcpy

Version 3 (modified by john, 12 years ago)

memcpy

Variants

Name          Description
stock         MD (machine-dependent) amd64 version: rep movsq
SSE2          movdqu for block copy
SSE2 aligned  align the source so loads always use movaps; use movaps stores for an aligned destination and movdqu stores for an unaligned destination
AVX           256-bit vmovdqu for block copy, with a 128-byte block as the common loop
ERMS          rep movsb for machines with ERMS
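
To make the SSE2 variant concrete, here is a minimal sketch of a movdqu-style block copy written with compiler intrinsics rather than assembly. It is not the LibCSSE routine: the function name copy_sse2_sketch is hypothetical, the 16-byte step and byte-wise tail are assumptions, and builds without SSE2 fall back to the byte loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper, not the LibCSSE code: copy n bytes from src to dst.
 * The buffers must not overlap.
 */
static void *copy_sse2_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    /* Block loop: one unaligned 16-byte load and store per iteration.
     * These intrinsics typically compile to movdqu. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
#endif
    /* Tail bytes (or the whole copy when SSE2 is unavailable). */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Calling copy_sse2_sketch(dst + 7, src + 7, 97) exercises both the 16-byte loop (six iterations) and the one-byte tail.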

Machines Tested

CPU                    Speed (GHz)  Notes
AMD FX-8120            3.11         1 x 8 zoo.freebsd.org
AMD Opteron 6328       3.20         2 x 8 Supermicro H8DG6/H8DGi
Intel Xeon X5365       3.00         2 x 4 Supermicro X7DBU
Intel Xeon X5482       3.20         2 x 4 Supermicro X7DWN+
Intel Xeon X5675       3.07         Westmere 2 x 6 Supermicro X8DTU
Intel Core i5-2520M    2.50         Sandy Bridge 1 x 4 Thinkpad X220 (4286)
Intel Core i5-2500K    3.30         Sandy Bridge 1 x 4 MSI Z77A-G45 (MS-7752)
Intel Xeon E5-2680     2.70         Romley 2 x 8 Supermicro X9DRW
Intel Xeon E5-2667 v2  3.30         Romley V2 2 x 8 Supermicro X9DRW (supports ERMS)

Test Cases

Name     Description
page     copy aligned page to aligned page
overlap  overlapping copy of page - 16 bytes within a page
short    aligned copy of 15 bytes
short2   aligned copy of 32 bytes
short3   aligned copy of 48 bytes
offset   4 byte offset copy of 128 bytes
offset2  7 byte offset copy of 97 bytes

Results

Each number is the minimum of a distribution of TSC deltas, where each delta is measured across a single invocation of the test.
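
A minimal sketch of that measurement, assuming GCC or Clang on x86 (for __rdtsc() from x86intrin.h) and a coarse clock() fallback elsewhere. The names read_ticks and min_ticks_memcpy are hypothetical, and the sketch does not serialize the instruction stream around the timestamp reads, which a real harness may want to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Read a tick counter: the TSC on x86, clock() elsewhere (assumption). */
static uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return (uint64_t)clock();
#endif
}

/*
 * Time `runs` single invocations of memcpy() and return the smallest
 * tick delta seen, i.e. the "min" of the distribution.
 */
static uint64_t min_ticks_memcpy(void *dst, const void *src, size_t len,
    int runs)
{
    uint64_t best = UINT64_MAX;

    assert(runs > 0);
    for (int i = 0; i < runs; i++) {
        uint64_t start = read_ticks();
        memcpy(dst, src, len);
        uint64_t delta = read_ticks() - start;
        if (delta < best)
            best = delta;
    }
    return best;
}
```

Taking the minimum rather than the mean discards runs disturbed by interrupts or cache misses, which is why the Max and Stddev columns in raw output can be large while Min stays stable.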

Bold indicates the lowest time among the variants for a given test and CPU combination. Green text marks times faster than the stock implementation; red text marks times slower than it.

Test / Variant columns: each test (page, overlap, short, short2, short3, offset, offset2) has one column per variant (stock, SSE2, SSE2 aligned, AVX, ERMS).

CPU
AMD FX-8120
AMD Opteron 6328
Intel Xeon X5365
Intel Xeon X5482
Intel Xeon X5675
Intel Core i5-2520M
Intel Core i5-2500K
Intel Xeon E5-2680
Intel Xeon E5-2667 v2

Conclusions

Early Notes

The first routine I worked on was memcpy().

Comparison of stock memcpy() of a single page on FreeBSD vs Linux on a Westmere (values are TSC deltas):

x fbsd/westmere/builtin
+ linux/builtin
    N           Min           Max        Median           Avg        Stddev
x 1000           336         18444           340       361.628     573.11483
+ 1000           276          9996           280       288.924     307.34136
Difference at 95.0% confidence
        -72.704 +/- 40.3074
        -20.1046% +/- 11.1461%
        (Student's t, pooled s = 459.847)
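
The summary lines in that output can be reproduced from the per-sample rows. The sketch below shows the arithmetic, assuming the usual pooled-variance formula for Student's t; the struct and function names are hypothetical, and the square root is left out so the code needs no math library (pooled s is the square root of pooled_variance).

```c
#include <assert.h>

/* One row of the summary table: sample count, mean, standard deviation. */
struct sample_summary {
    int n;
    double avg;
    double stddev;
};

/* Difference of means, second sample minus the first (the baseline). */
static double mean_diff(struct sample_summary x, struct sample_summary y)
{
    return y.avg - x.avg;
}

/* The same difference as a percentage of the baseline mean. */
static double mean_diff_pct(struct sample_summary x, struct sample_summary y)
{
    return 100.0 * mean_diff(x, y) / x.avg;
}

/* Pooled variance s^2 of the two samples. */
static double pooled_variance(struct sample_summary x, struct sample_summary y)
{
    assert(x.n + y.n > 2);
    return ((x.n - 1) * x.stddev * x.stddev +
        (y.n - 1) * y.stddev * y.stddev) / (x.n + y.n - 2);
}
```

With the FreeBSD row (avg 361.628) as the baseline and the Linux row (avg 288.924) as the second sample, mean_diff gives -72.704 and mean_diff_pct gives about -20.1046. The +/- 11.1461% figure is the +/- 40.3074 half-width divided by the same baseline mean.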

Idea                              Westmere         Sandy Bridge     Ivy Bridge       Penryn
Replace dec with sub              none             none             none
Use movsd instead of movsq        slightly slower  slightly slower  6% faster
Simple movdqa loop                138% slower      58% slower       46% slower
movdqa 32 at a time (old)         27% slower       14% faster       17% faster
movdqa 32 at a time (new)         27% slower       15% faster       18% faster
movdqa 32 at a time (reorder)     27% slower       16% faster       19% faster
movdqa 64 at a time (old)         224% slower      131% slower      116% slower
movdqa 64 at a time (new)         4 cycles slower  21% faster       24% faster
Intermix SSE and backwards tests  slightly slower  slightly slower  slightly slower
movaps 32 at a time               24% slower       18% faster       23% faster       52% faster
movaps 64 at a time               17% faster       23% faster       25% faster       48% faster

Takeaways from this trial:

  • Use movaps instead of movdqa, as movdqa carries an operand-size (0x66) prefix and so encodes one byte longer.
  • Minimize branches on the common path.
  • Unroll the copy loop.
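
These takeaways can be sketched in C with intrinsics. The example below copies 64 bytes per iteration using four aligned 16-byte loads followed by four aligned stores, with a single loop branch on the common path. It is an illustration under stated assumptions, not the routine that was benchmarked: copy_aligned64_sketch is a hypothetical name, both pointers must be 16-byte aligned, the length must be a multiple of 64, and builds without SSE fall back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/*
 * Hypothetical helper: copy n bytes, 64 at a time, between non-overlapping
 * 16-byte-aligned buffers. n must be a multiple of 64.
 */
static void copy_aligned64_sketch(void *dst, const void *src, size_t n)
{
    assert(((uintptr_t)dst & 15) == 0 && ((uintptr_t)src & 15) == 0);
    assert(n % 64 == 0);
#if defined(__SSE__)
    float *d = dst;
    const float *s = src;

    /* _mm_load_ps/_mm_store_ps typically compile to movaps, which has
     * no 0x66 prefix. Each step moves 16 floats, i.e. 64 bytes. */
    for (; n != 0; n -= 64, s += 16, d += 16) {
        __m128 r0 = _mm_load_ps(s);
        __m128 r1 = _mm_load_ps(s + 4);
        __m128 r2 = _mm_load_ps(s + 8);
        __m128 r3 = _mm_load_ps(s + 12);
        _mm_store_ps(d, r0);
        _mm_store_ps(d + 4, r1);
        _mm_store_ps(d + 8, r2);
        _mm_store_ps(d + 12, r3);
    }
#else
    memcpy(dst, src, n);
#endif
}
```

A page copy would call copy_aligned64_sketch(dst, src, 4096), which runs the loop body 64 times.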

Now testing the overlap case:

Idea                 Westmere    Sandy Bridge  Ivy Bridge  Penryn
movaps 64 at a time  56% faster  56% faster    56% faster  48% faster
Above using leaq     50% faster  56% faster    60% faster  52% faster

Notes:

  • leaq seems to be in the noise: it was about 4 cycles faster on Ivy Bridge and possibly Sandy Bridge, 4-8 cycles slower on Westmere, and 8 cycles faster on Penryn.