| Version 3 (modified by john, 12 years ago) |
|---|
The first routine I worked on was memcpy().
Comparison of the stock memcpy() copying a single page on FreeBSD vs. Linux on a Westmere CPU (values are TSC deltas):
x fbsd/westmere/builtin
+ linux/builtin
N Min Max Median Avg Stddev
x 1000 336 18444 340 361.628 573.11483
+ 1000 276 9996 280 288.924 307.34136
Difference at 95.0% confidence
-72.704 +/- 40.3074
-20.1046% +/- 11.1461%
(Student's t, pooled s = 459.847)
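For illustration, TSC deltas like those above could be gathered with a small harness along the following lines. This is only a sketch under assumptions, not the harness that produced these numbers: it runs in user space, assumes a 4096-byte page, and the names `PAGE_BYTES`, `read_tsc()`, `time_page_copy()`, and `median_u64()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>	/* __rdtsc(), _mm_lfence(); x86 only */

#define	PAGE_BYTES	4096	/* assumed page size */

/*
 * Read the TSC.  The lfence keeps the read from executing ahead of
 * earlier instructions; the empty asm statements are compiler barriers
 * so the timed copy is not moved across the read.
 */
static inline uint64_t
read_tsc(void)
{
	uint64_t t;

	__asm__ __volatile__("" ::: "memory");
	_mm_lfence();
	t = __rdtsc();
	__asm__ __volatile__("" ::: "memory");
	return (t);
}

/* Return the TSC delta for copying one page from src to dst. */
uint64_t
time_page_copy(void *dst, const void *src)
{
	uint64_t start, end;

	start = read_tsc();
	memcpy(dst, src, PAGE_BYTES);
	end = read_tsc();
	return (end - start);
}

/* qsort() comparator for uint64_t values. */
static int
cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return ((x > y) - (x < y));
}

/* Sort samples in place and return the median (upper middle if n is even). */
uint64_t
median_u64(uint64_t *samples, size_t n)
{
	assert(n > 0);
	qsort(samples, n, sizeof(samples[0]), cmp_u64);
	return (samples[n / 2]);
}
```

Calling `time_page_copy()` 1000 times per system and comparing the two sample sets would yield the kind of min/max/median summary shown above. The median is the more useful figure here because occasional interrupts inflate the maximum and the average.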
| Idea | Westmere | Sandy Bridge | Ivy Bridge | Penryn |
|---|---|---|---|---|
| Replace dec with sub | none | none | none | |
| Use movsd instead of movsq | slightly slower | slightly slower | 6% faster | |
| Simple movdqa loop | 138% slower | 58% slower | 46% slower | |
| movdqa 32 at a time (old) | 27% slower | 14% faster | 17% faster | |
| movdqa 32 at a time (new) | 27% slower | 15% faster | 18% faster | |
| movdqa 32 at a time (reorder) | 27% slower | 16% faster | 19% faster | |
| movdqa 64 at a time (old) | 224% slower | 131% slower | 116% slower | |
| movdqa 64 at a time (new) | 4 cycles slower | 21% faster | 24% faster | |
| Intermix SSE and backwards tests | slightly slower | slightly slower | slightly slower | |
| movaps 32 at a time | 24% slower | 18% faster | 23% faster | 52% faster |
| movaps 64 at a time | 17% faster | 23% faster | 25% faster | 48% faster |
Takeaways from this trial:
- Use movaps instead of movdqa: movdqa carries an operand-size (0x66) prefix, so each instruction is one byte longer.
- Minimize branches on the common path.
- Unroll the copy loop.
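The takeaways above can be sketched in C with SSE intrinsics. This is a minimal illustration, not the routine that was measured: the name `copy64_movaps()` is hypothetical, and the sketch assumes 16-byte-aligned, non-overlapping buffers and a length that is a multiple of 64. Compilers typically emit movaps for `_mm_load_ps()` and `_mm_store_ps()`, though the exact instruction selection is up to the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>	/* SSE: _mm_load_ps(), _mm_store_ps() */

/*
 * Hypothetical helper: copy len bytes, 64 bytes per loop iteration,
 * using four aligned 16-byte loads followed by four aligned stores.
 * Preconditions (checked by assert): dst and src are 16-byte aligned
 * and len is a multiple of 64.  The buffers must not overlap.
 */
void
copy64_movaps(void *dst, const void *src, size_t len)
{
	float *d = dst;
	const float *s = src;

	assert(((uintptr_t)dst & 15) == 0);
	assert(((uintptr_t)src & 15) == 0);
	assert((len & 63) == 0);

	/* One branch per 64 bytes; the loads are bitwise, not numeric. */
	for (; len != 0; len -= 64) {
		__m128 r0 = _mm_load_ps(s);
		__m128 r1 = _mm_load_ps(s + 4);
		__m128 r2 = _mm_load_ps(s + 8);
		__m128 r3 = _mm_load_ps(s + 12);

		_mm_store_ps(d, r0);
		_mm_store_ps(d + 4, r1);
		_mm_store_ps(d + 8, r2);
		_mm_store_ps(d + 12, r3);

		s += 16;	/* 16 floats = 64 bytes */
		d += 16;
	}
}
```

A real routine would also need a path for unaligned pointers and for lengths that are not a multiple of 64. Keeping those checks off the common path is what the second takeaway is about.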
Now testing the overlap case:
| Idea | Westmere | Sandy Bridge | Ivy Bridge | Penryn |
|---|---|---|---|---|
| movaps 64 at a time | 56% faster | 56% faster | 56% faster | 48% faster |
| Above using leaq | 50% faster | 56% faster | 60% faster | 52% faster |
Notes:
- leaq seems to be in the noise: it was about 4 cycles faster on Ivy Bridge and possibly Sandy Bridge, 4-8 cycles slower on Westmere, and 8 cycles faster on Penryn.
