| Version 7 (modified by john, 12 years ago) |
|---|
memcpy
Variants
| Name | Description |
|---|---|
| stock | machine-dependent (MD) amd64 version using rep movsq |
| SSE2 | movdqu for block-copy |
| SSE2 aligned | align the source so that loads always use movaps; stores use movaps for an aligned destination and movdqu for an unaligned destination |
| AVX | 256-bit vmovdqu for block-copy with 128-byte block as common loop |
| ERMS | rep movsb for machines with ERMS |
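As a rough illustration of the SSE2 variant, the block copy can be sketched in C with intrinsics. This is not the assembly that was benchmarked: the function name `copy_sse2_unaligned` is hypothetical, the byte-wise tail is an assumption, and the compiler is only expected (not guaranteed) to emit movdqu for the unaligned load and store.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: copy n bytes between non-overlapping buffers. */
static void *copy_sse2_unaligned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

#if defined(__SSE2__)
    /* 16 bytes per iteration; neither pointer needs to be aligned. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
#endif
    /* Byte tail (or the whole copy when SSE2 is not available). */
    while (n-- > 0)
        *d++ = *s++;
    return dst;
}
```

The `__SSE2__` guard only keeps the sketch compilable on other targets; on amd64 SSE2 is always present.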
Machines Tested
| CPU | Speed (GHz) | Notes |
|---|---|---|
| AMD FX-8120 | 3.11 | 1 x 8 zoo.freebsd.org |
| AMD Opteron 6328 | 3.20 | 2 x 8 Supermicro H8DG6/H8DGi |
| Intel Xeon X5365 | 3.00 | 2 x 4 Supermicro X7DBU |
| Intel Xeon X5482 | 3.20 | 2 x 4 Supermicro X7DWN+ |
| Intel Xeon X5675 | 3.07 | Westmere 2 x 6 Supermicro X8DTU |
| Intel Core i5-2520M | 2.50 | Sandy Bridge 1 x 4 Thinkpad X220 (4286) |
| Intel Core i5-2500K | 3.30 | Sandy Bridge 1 x 4 MSI Z77A-G45 (MS-7752) |
| Intel Xeon E5-2680 | 2.70 | Romley 2 x 8 Supermicro X9DRW |
| Intel Xeon E5-2667 v2 | 3.30 | Romley V2 2 x 8 Supermicro X9DRW (supports ERMS) |
Test Cases
| Name | Description |
|---|---|
| page | copy aligned page to aligned page |
| overlap | overlapping copy of page - 16 bytes within a page |
| short | aligned copy of 15 bytes |
| short2 | aligned copy of 32 bytes |
| short3 | aligned copy of 48 bytes |
| offset | 4 byte offset source copy of 128 bytes |
| offset2 | 7 byte offset source and destination copy of 97 bytes |
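The test cases above could be written down as a small table of offsets and lengths. The sketch below is an assumption about their shape, not the original harness: the 4096-byte page size, the struct and function names, and the reading of "overlap" as a copy of (page - 16) bytes to a destination 16 bytes past the source in the same page are all guesses. The static buffers are also not forced to page alignment here.

```c
#include <stddef.h>
#include <string.h>

enum { PAGE_BYTES = 4096 };   /* assumed page size */

struct copy_case {
    const char *name;
    size_t src_off;   /* byte offset of the source within its buffer */
    size_t dst_off;   /* byte offset of the destination */
    size_t len;       /* bytes to copy */
    int same_page;    /* 1: source and destination share one buffer */
};

static const struct copy_case copy_cases[] = {
    { "page",    0,  0, PAGE_BYTES,      0 },
    { "overlap", 0, 16, PAGE_BYTES - 16, 1 },  /* assumed: dst = src + 16 */
    { "short",   0,  0, 15,              0 },
    { "short2",  0,  0, 32,              0 },
    { "short3",  0,  0, 48,              0 },
    { "offset",  4,  0, 128,             0 },
    { "offset2", 7,  7, 97,              0 },
};

enum { NUM_COPY_CASES = sizeof(copy_cases) / sizeof(copy_cases[0]) };

/* Run one case with memmove() (overlap-safe) and verify it; 1 on success. */
static int run_copy_case(const struct copy_case *c)
{
    static unsigned char src_buf[2 * PAGE_BYTES];
    static unsigned char dst_buf[2 * PAGE_BYTES];
    static unsigned char expect[PAGE_BYTES];
    unsigned char *src = src_buf + c->src_off;
    unsigned char *dst = (c->same_page ? src_buf : dst_buf) + c->dst_off;

    for (size_t i = 0; i < sizeof(src_buf); i++)
        src_buf[i] = (unsigned char)(i * 31 + 7);
    memcpy(expect, src, c->len);   /* snapshot before a possible overlap */
    memmove(dst, src, c->len);
    return memcmp(dst, expect, c->len) == 0;
}
```

Swapping `memmove()` for the routine under test turns the same table into a correctness check for each variant.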
Results
Each number is the minimum of a distribution of TSC deltas, where each delta is measured across a single invocation of the test.
Bold indicates the lowest time among the given variations in a Test and CPU combination. Green text is used for times faster than the stock implementation, and red text is used for times slower than the stock implementation.
| CPU | page: stock | page: SSE2 | page: SSE2 aligned | page: AVX | page: ERMS | overlap: stock | overlap: SSE2 | overlap: SSE2 aligned | overlap: AVX | overlap: ERMS | short: stock | short: SSE2 | short: SSE2 aligned | short: AVX | short: ERMS | short2: stock | short2: SSE2 | short2: SSE2 aligned | short2: AVX | short2: ERMS | short3: stock | short3: SSE2 | short3: SSE2 aligned | short3: AVX | short3: ERMS | offset: stock | offset: SSE2 | offset: SSE2 aligned | offset: AVX | offset: ERMS | offset2: stock | offset2: SSE2 | offset2: SSE2 aligned | offset2: AVX | offset2: ERMS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AMD FX-8120 | |||||||||||||||||||||||||||||||||||
| AMD Opteron 6328 | 508 | 490 | 448 | 2533 | 487 | 1071 | 434 | 641 | 5086 | 6591 | 118 | 112 | 112 | 110 | 107 | 114 | 75 | 103 | 78 | 136 | 116 | 78 | 105 | 74 | 151 | 145 | 84 | 101 | 90 | 362 | 134 | 110 | 140 | 106 | 177 |
| Intel Xeon X5365 | |||||||||||||||||||||||||||||||||||
| Intel Xeon X5482 | 672 | 1424 | 312 | -- | 760 | 664 | 1424 | 400 | -- | 4232 | 144 | 112 | 112 | -- | 112 | 80 | 48 | 48 | -- | 128 | 80 | 48 | 48 | -- | 144 | 248 | 120 | 120 | -- | 448 | 200 | 72 | 200 | -- | 192 |
| Intel Xeon X5675 | 336 | 276 | 288 | -- | 424 | 648 | 280 | 380 | -- | 4244 | 116 | 96 | 96 | -- | 92 | 60 | 46 | 69 | -- | 112 | 60 | 46 | 76 | -- | 128 | 76 | 46 | 56 | -- | 360 | 64 | 40 | 192 | -- | 152 |
| Intel Core i5-2520M | |||||||||||||||||||||||||||||||||||
| Intel Core i5-2500K | |||||||||||||||||||||||||||||||||||
| Intel Xeon E5-2680 | 348 | 288 | 304 | 288 | 436 | 676 | 288 | 420 | 776 | 4276 | 128 | 100 | 100 | 100 | 100 | 64 | 24 | 40 | 24 | 120 | 64 | 28 | 40 | 28 | 136 | 88 | 28 | 44 | 24 | 296 | 68 | 40 | 196 | 36 | 164 |
| Intel Xeon E5-2667 v2 | 384 | 288 | 336 | 288 | 292 | 708 | 288 | 452 | 740 | 4268 | 84 | 52 | 44 | 52 | 44 | 100 | 24 | 72 | 24 | 52 | 100 | 24 | 76 | 24 | 40 | 120 | 24 | 92 | 24 | 56 | 84 | 60 | 92 | 56 | 60 |
Conclusions
Early Notes
The first routine I worked on was memcpy().
Comparison of stock memcpy() of a single page on FreeBSD vs Linux on a Westmere (values are TSC deltas):

    x fbsd/westmere/builtin
    + linux/builtin
        N           Min           Max        Median           Avg        Stddev
    x 1000           336         18444           340       361.628     573.11483
    + 1000           276          9996           280       288.924     307.34136
    Difference at 95.0% confidence
            -72.704 +/- 40.3074
            -20.1046% +/- 11.1461%
            (Student's t, pooled s = 459.847)

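The summary figures in this output can be reproduced from the two data rows. The working below assumes a pooled-variance Student's t comparison with $t \approx 1.960$ for samples this large; the subscripts $F$ (FreeBSD) and $L$ (Linux) are labels introduced here for illustration.

$$
\begin{aligned}
\Delta &= \bar{x}_L - \bar{x}_F = 288.924 - 361.628 = -72.704 \\
s_p &= \sqrt{\frac{(n_F - 1)\,s_F^2 + (n_L - 1)\,s_L^2}{n_F + n_L - 2}}
     = \sqrt{\frac{573.11483^2 + 307.34136^2}{2}} \approx 459.847 \\
\text{half-width} &= t \, s_p \sqrt{\frac{1}{n_F} + \frac{1}{n_L}}
     \approx 1.960 \times 459.847 \times \sqrt{0.002} \approx 40.307 \\
\frac{\Delta}{\bar{x}_F} &= \frac{-72.704}{361.628} \approx -20.1046\%,
     \qquad \frac{40.3074}{361.628} \approx 11.1461\%
\end{aligned}
$$

Because both samples have $N = 1000$, the pooled variance reduces to the plain average of the two variances.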
| Idea | Westmere | Sandy Bridge | Ivy Bridge | Penryn |
|---|---|---|---|---|
| Replace dec with sub | none | none | none | |
| Use movsd instead of movsq | slightly slower | slightly slower | 6% faster | |
| Simple movdqa loop | 138% slower | 58% slower | 46% slower | |
| movdqa 32 at a time (old) | 27% slower | 14% faster | 17% faster | |
| movdqa 32 at a time (new) | 27% slower | 15% faster | 18% faster | |
| movdqa 32 at a time (reorder) | 27% slower | 16% faster | 19% faster | |
| movdqa 64 at a time (old) | 224% slower | 131% slower | 116% slower | |
| movdqa 64 at a time (new) | 4 cycles slower | 21% faster | 24% faster | |
| Intermix SSE and backwards tests | slightly slower | slightly slower | slightly slower | |
| movaps 32 at a time | 24% slower | 18% faster | 23% faster | 52% faster |
| movaps 64 at a time | 17% faster | 23% faster | 25% faster | 48% faster |
Takeaways from this trial:
- Use movaps instead of movdqa, as movdqa carries an operand-size (0x66) prefix and so encodes one byte longer.
- Minimize branches on the common path.
- Unroll the copy loop.
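The "movaps 64 at a time" idea can be approximated in C with SSE intrinsics. This is a sketch under assumptions, not the assembly that was measured: the name `copy_aligned_64` is hypothetical, both pointers must be 16-byte aligned, the buffers must not overlap, and the compiler is free to pick a different but equivalent aligned move.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: dst and src must both be 16-byte aligned. */
static void *copy_aligned_64(void *dst, const void *src, size_t n)
{
    float *d = dst;
    const float *s = src;

#if defined(__SSE__)
    /* One loop branch per 64 bytes: four aligned loads, four aligned stores. */
    for (; n >= 64; n -= 64, s += 16, d += 16) {
        __m128 a = _mm_load_ps(s);
        __m128 b = _mm_load_ps(s + 4);
        __m128 c = _mm_load_ps(s + 8);
        __m128 e = _mm_load_ps(s + 12);
        _mm_store_ps(d, a);
        _mm_store_ps(d + 4, b);
        _mm_store_ps(d + 8, c);
        _mm_store_ps(d + 12, e);
    }
#endif
    /* Remainder below 64 bytes (or everything when SSE is unavailable). */
    memcpy(d, s, n);
    return dst;
}
```

Issuing all four loads before the stores mirrors the unrolling takeaway: the loop body has no internal branches and only one backward jump per 64 bytes.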
Now testing the overlap case:
| Idea | Westmere | Sandy Bridge | Ivy Bridge | Penryn |
|---|---|---|---|---|
| movaps 64 at a time | 56% faster | 56% faster | 56% faster | 48% faster |
| Above using leaq | 50% faster | 56% faster | 60% faster | 52% faster |
Notes:
- leaq seems to be in the noise: it was about 4 cycles faster on Ivy Bridge and possibly Sandy Bridge, 4-8 cycles slower on Westmere, and 8 cycles faster on Penryn.
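For context on why the overlap case needs its own path, a common memmove-style technique is to pick the copy direction with a single unsigned comparison. The notes above do not say how the tested routines detect overlap, so the byte-wise sketch below is an assumption, and `copy_overlap_safe` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte-wise sketch of an overlap-safe copy. */
static void *copy_overlap_safe(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /*
     * Unsigned (dst - src) < n holds only when dst starts inside
     * [src, src + n), the one case a forward copy would clobber.
     */
    if ((uintptr_t)d - (uintptr_t)s < n) {
        while (n-- > 0)
            d[n] = s[n];         /* copy backwards, last byte first */
    } else {
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];         /* forward copy is safe */
    }
    return dst;
}
```

A vectorized version would keep the same direction test and replace each byte loop with a block loop running in the matching direction.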
