Version 12 (modified by john, 12 years ago)

memcpy

Variants

Name          Description
stock         MD amd64 version: rep movsq
SSE2          movdqu for block copy
SSE2 aligned  align the source so that loads always use movaps; store with movaps for an aligned destination and movdqu for an unaligned destination
AVX           256-bit vmovdqu for block copy, with a 128-byte block as the common loop
ERMS          rep movsb for machines with ERMS

Machines Tested

CPU                    Speed (GHz)  Notes
AMD FX-8120            3.11         1 x 8, zoo.freebsd.org
AMD Opteron 6328       3.20         2 x 8, Supermicro H8DG6/H8DGi
Intel Xeon X5365       3.00         2 x 4, Supermicro X7DBU
Intel Xeon X5482       3.20         2 x 4, Supermicro X7DWN+
Intel Xeon X5675       3.07         Westmere, 2 x 6, Supermicro X8DTU
Intel Core i5-2520M    2.50         Sandy Bridge, 1 x 4, Thinkpad X220 (4286)
Intel Core i5-2500K    3.30         Sandy Bridge, 1 x 4, MSI Z77A-G45 (MS-7752)
Intel Xeon E5-2680     2.70         Romley, 2 x 8, Supermicro X9DRW
Intel Xeon E5-2667 v2  3.30         Romley V2, 2 x 8, Supermicro X9DRW (supports ERMS)

Test Cases

Name     Description
page     copy aligned page to aligned page
overlap  overlapping copy of page - 16 bytes within a page
short    aligned copy of 15 bytes
short2   aligned copy of 32 bytes
short3   aligned copy of 48 bytes
offset   4 byte offset source copy of 128 bytes
offset2  7 byte offset source and destination copy of 97 bytes
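
The test cases above can be written down as a small table of offsets and lengths. The following is a minimal sketch, not the harness used for these results. The names `copy_case`, `copy_cases`, and `run_copy_case` are hypothetical, the 4096-byte page size is an assumption, and the direction of the overlap case (destination 16 bytes above the source) is a guess. It uses `memmove` so that the overlapping case is well defined in portable C.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE_ASSUMED 4096  /* assumption: 4 KiB pages */

/* One row of the test-case table, as offsets from page-aligned bases. */
struct copy_case {
    const char *name;
    size_t src_off;   /* source offset from its aligned base */
    size_t dst_off;   /* destination offset from its aligned base */
    size_t len;       /* number of bytes copied */
    int same_buf;     /* nonzero: source and destination share one buffer */
};

static const struct copy_case copy_cases[] = {
    { "page",    0,  0, PAGE_SIZE_ASSUMED,      0 },
    { "overlap", 0, 16, PAGE_SIZE_ASSUMED - 16, 1 },  /* direction is a guess */
    { "short",   0,  0, 15,                     0 },
    { "short2",  0,  0, 32,                     0 },
    { "short3",  0,  0, 48,                     0 },
    { "offset",  4,  0, 128,                    0 },
    { "offset2", 7,  7, 97,                     0 },
};

#define N_COPY_CASES (sizeof(copy_cases) / sizeof(copy_cases[0]))

/* Run one case with memmove and verify the copied bytes; returns 1 on success. */
static int run_copy_case(const struct copy_case *c)
{
    static _Alignas(PAGE_SIZE_ASSUMED) unsigned char a[2 * PAGE_SIZE_ASSUMED];
    static _Alignas(PAGE_SIZE_ASSUMED) unsigned char b[2 * PAGE_SIZE_ASSUMED];
    unsigned char expect[PAGE_SIZE_ASSUMED];
    unsigned char *src = a + c->src_off;
    unsigned char *dst = (c->same_buf ? a : b) + c->dst_off;

    for (size_t i = 0; i < sizeof(a); i++)
        a[i] = (unsigned char)(i * 31 + 7);
    memset(b, 0, sizeof(b));

    memcpy(expect, src, c->len);  /* snapshot before the source may be overwritten */
    memmove(dst, src, c->len);
    return memcmp(dst, expect, c->len) == 0;
}
```

A timing harness would replace the `memmove` call with the variant under test and wrap it in a cycle counter.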

Results

Each number is the minimum of the measured distribution, where each measurement is the TSC delta across a single invocation of the test.

Bold indicates the lowest time among the variants for a given test and CPU combination. Green text is used for times faster than the stock implementation, and red text is used for times slower than the stock implementation.

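page

CPU                    stock  SSE2   SSE2 aligned  AVX    ERMS
AMD FX-8120            1178   1062   1067          3241   1138
AMD Opteron 6328       508    490    448           2533   487
Intel Xeon X5365
Intel Xeon X5482       672    1424   312           --     760
Intel Xeon X5675       336    276    288           --     424
Intel Core i5-2520M    1775   900    937           900    13075
Intel Core i5-2500K
Intel Xeon E5-2680     348    288    304           288    436
Intel Xeon E5-2667 v2  384    288    336           288    292

overlap

CPU                    stock  SSE2   SSE2 aligned  AVX    ERMS
AMD FX-8120            2508   1014   1483          6315   14884
AMD Opteron 6328       1071   434    641           5086   6591
Intel Xeon X5365
Intel Xeon X5482       664    1424   400           --     4232
Intel Xeon X5675       648    280    380           --     4244
Intel Core i5-2520M    2122   900    1312          2463   13362
Intel Core i5-2500K
Intel Xeon E5-2680     676    288    420           776    4276
Intel Xeon E5-2667 v2  708    288    452           740    4268

short

CPU                    stock  SSE2   SSE2 aligned  AVX    ERMS
AMD FX-8120            177    199    195           199    186
AMD Opteron 6328       118    112    112           110    107
Intel Xeon X5365
Intel Xeon X5482       144    112    112           --     112
Intel Xeon X5675       116    96     96            --     92
Intel Core i5-2520M    412    312    312           312    312
Intel Core i5-2500K
Intel Xeon E5-2680     128    100    100           100    100
Intel Xeon E5-2667 v2  84     52     44            52     44

short2

CPU                    stock  SSE2   SSE2 aligned  AVX    ERMS
AMD FX-8120            177    84     122           87     246
AMD Opteron 6328       114    75     103           78     136
Intel Xeon X5365
Intel Xeon X5482       80     48     48            --     128
Intel Xeon X5675       60     46     69            --     112
Intel Core i5-2520M    200    75     125           75     375
Intel Core i5-2500K
Intel Xeon E5-2680     64     24     40            24     120
Intel Xeon E5-2667 v2  100    24     72            24     52

short3

CPU                    stock  SSE2   SSE2 aligned  AVX    ERMS
AMD FX-8120            186    88     131           90     305
AMD Opteron 6328       116    78     105           74     151
Intel Xeon X5365
Intel Xeon X5482       80     48     48            --     144
Intel Xeon X5675       60     46     76            --     128
Intel Core i5-2520M    200    87     125           87     425
Intel Core i5-2500K
Intel Xeon E5-2680     64     28     40            28     136
Intel Xeon E5-2667 v2  100    24     76            24     40

offset

CPU                    stock  SSE2   SSE2 aligned  AVX    ERMS
AMD FX-8120            239    89     135           122    775
AMD Opteron 6328       145    84     101           90     362
Intel Xeon X5365
Intel Xeon X5482       248    120    120           --     448
Intel Xeon X5675       76     46     56            --     360
Intel Core i5-2520M    275    75     137           87     687
Intel Core i5-2500K
Intel Xeon E5-2680     88     28     44            24     296
Intel Xeon E5-2667 v2  120    24     92            24     56

offset2

CPU                    stock  SSE2   SSE2 aligned  AVX    ERMS
AMD FX-8120            212    140    272           149    459
AMD Opteron 6328       134    110    140           106    177
Intel Xeon X5365
Intel Xeon X5482       200    72     200           --     192
Intel Xeon X5675       64     40     192           --     152
Intel Core i5-2520M    212    125    612           125    587
Intel Core i5-2500K
Intel Xeon E5-2680     68     40     196           36     164
Intel Xeon E5-2667 v2  84     60     92            56     60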

Conclusions

Early Notes

The first routine I worked on was memcpy().

Comparison of stock memcpy() of a single page on FreeBSD vs Linux on a Westmere (values are TSC deltas):

x fbsd/westmere/builtin
+ linux/builtin
    N           Min           Max        Median           Avg        Stddev
x 1000           336         18444           340       361.628     573.11483
+ 1000           276          9996           280       288.924     307.34136
Difference at 95.0% confidence
        -72.704 +/- 40.3074
        -20.1046% +/- 11.1461%
        (Student's t, pooled s = 459.847)
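
The figures in the comparison above can be reproduced from the N, Avg, and Stddev columns alone. The following is a minimal sketch of the pooled-variance arithmetic, not the tool that produced the output. The names `sample_summary`, `diff_ci`, `sqrt_newton`, and `pooled_diff` are hypothetical, the critical value 1.960 is an assumption chosen because it reproduces the reported +/- 40.3074, and the square root is done by Newton's method only to keep the sketch free of a libm link dependency.

```c
/* Summary of one sample: count, mean, and standard deviation. */
struct sample_summary { double n, avg, stddev; };

/* Difference of means with a confidence half-width. */
struct diff_ci { double diff, half_width, pooled_s, pct, pct_half_width; };

/* Square root by Newton's method; adequate for the positive values used here. */
static double sqrt_newton(double v)
{
    double x = v;

    if (v <= 0.0)
        return 0.0;
    for (int i = 0; i < 64; i++)
        x = 0.5 * (x + v / x);
    return x;
}

/* Compare sample y against baseline x using a pooled standard deviation. */
static struct diff_ci pooled_diff(struct sample_summary x,
                                  struct sample_summary y, double t_crit)
{
    struct diff_ci r;
    double df = x.n + y.n - 2.0;
    double var = ((x.n - 1.0) * x.stddev * x.stddev +
                  (y.n - 1.0) * y.stddev * y.stddev) / df;

    r.pooled_s = sqrt_newton(var);
    r.diff = y.avg - x.avg;
    r.half_width = t_crit * r.pooled_s * sqrt_newton(1.0 / x.n + 1.0 / y.n);
    r.pct = 100.0 * r.diff / x.avg;
    r.pct_half_width = 100.0 * r.half_width / x.avg;
    return r;
}
```

With the FreeBSD row as `x` and the Linux row as `y`, the pooled standard deviation comes out near 459.847 and the difference near -72.704 with a half-width near 40.307, matching the output above.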

Idea                              Westmere         Sandy Bridge     Ivy Bridge       Penryn
Replace dec with sub              none             none             none
Use movsd instead of movsq        slightly slower  slightly slower  6% faster
Simple movdqa loop                138% slower      58% slower       46% slower
movdqa 32 at a time (old)         27% slower       14% faster       17% faster
movdqa 32 at a time (new)         27% slower       15% faster       18% faster
movdqa 32 at a time (reorder)     27% slower       16% faster       19% faster
movdqa 64 at a time (old)         224% slower      131% slower      116% slower
movdqa 64 at a time (new)         4 cycles slower  21% faster       24% faster
Intermix SSE and backwards tests  slightly slower  slightly slower  slightly slower
movaps 32 at a time               24% slower       18% faster       23% faster       52% faster
movaps 64 at a time               17% faster       23% faster       25% faster       48% faster

Takeaways from this trial:

  • Use movaps instead of movdqa: movdqa carries an operand-size (0x66) prefix, so each instruction is one byte longer to encode.
  • Minimize branches on the common path.
  • Unroll the copy loop.
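
To make these takeaways concrete, here is a minimal sketch of a "movaps 64 at a time" loop written with SSE intrinsics. It is an illustration under stated assumptions, not the routine that was benchmarked. The name `copy64_aligned` is hypothetical, both pointers are assumed to be 16-byte aligned and non-overlapping, any tail shorter than 64 bytes is left uncopied, and whether the compiler emits exactly four movaps loads followed by four movaps stores depends on the compiler and flags.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps / _mm_store_ps are aligned 16-byte moves */

/*
 * Copy len bytes in 64-byte blocks using aligned 128-bit loads and stores.
 * Assumes dst and src are 16-byte aligned and do not overlap.
 * Bytes beyond the last full 64-byte block are not copied.
 */
static void copy64_aligned(void *dst, const void *src, size_t len)
{
    float *d = dst;
    const float *s = src;

    while (len >= 64) {
        /* Unrolled body: four loads, then four stores, one branch per 64 bytes. */
        __m128 r0 = _mm_load_ps(s);
        __m128 r1 = _mm_load_ps(s + 4);
        __m128 r2 = _mm_load_ps(s + 8);
        __m128 r3 = _mm_load_ps(s + 12);

        _mm_store_ps(d, r0);
        _mm_store_ps(d + 4, r1);
        _mm_store_ps(d + 8, r2);
        _mm_store_ps(d + 12, r3);

        s += 16;    /* 16 floats = 64 bytes */
        d += 16;
        len -= 64;
    }
}
```

A complete memcpy() would still need a head and tail path for unaligned or short copies, which is where keeping branches off the common path matters.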

Now testing the overlap case:

Idea                 Westmere    Sandy Bridge  Ivy Bridge  Penryn
movaps 64 at a time  56% faster  56% faster    56% faster  48% faster
Above using leaq     50% faster  56% faster    60% faster  52% faster

Notes:

  • leaq seems to be in the noise: it was about 4 cycles faster on Ivy Bridge and possibly Sandy Bridge, 4-8 cycles slower on Westmere, and 8 cycles faster on Penryn.
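
For readers unfamiliar with why the overlap case needs its own path, here is a minimal sketch of the direction choice an overlap-safe copy has to make. It is a byte-at-a-time illustration, not the benchmarked routine, and the name `copy_overlap_safe` is hypothetical. The single unsigned comparison is one common way to decide between a forward and a backward copy.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy len bytes from src to dst, correct even when the ranges overlap.
 * A forward copy is safe unless dst starts inside [src, src + len);
 * the unsigned subtraction wraps to a large value when dst < src,
 * so one comparison covers both safe cases.
 */
static void *copy_overlap_safe(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if ((uintptr_t)d - (uintptr_t)s >= len) {
        /* No harmful overlap: copy from low addresses to high. */
        for (size_t i = 0; i < len; i++)
            d[i] = s[i];
    } else {
        /* dst overlaps the tail of src: copy from high addresses to low. */
        while (len-- > 0)
            d[len] = s[len];
    }
    return dst;
}
```

The backward path is the one exercised by a copy whose destination sits a few bytes above its source within the same page.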