The first routine I worked on was `memcpy()`.

Comparison of the stock `memcpy()` copying a single page on FreeBSD vs. Linux on a Westmere (values are TSC deltas):

{{{
x fbsd/westmere/builtin
+ linux/builtin
    N           Min           Max        Median           Avg        Stddev
x 1000           336         18444           340       361.628     573.11483
+ 1000           276          9996           280       288.924     307.34136
Difference at 95.0% confidence
	-72.704 +/- 40.3074
	-20.1046% +/- 11.1461%
	(Student's t, pooled s = 459.847)
}}}
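
For readers who want to check the "Difference at 95.0% confidence" lines, here is a minimal sketch of the arithmetic behind them. The struct and function names are hypothetical, and the critical value of 1.96 passed by the caller is an assumption (the large-sample Student's t value), not something stated in the output above. A small Newton iteration stands in for `sqrt()` so the sketch builds without linking libm.

```c
#include <stddef.h>

/* One row of the summary above: sample count, mean, standard deviation. */
struct sample {
    double n;
    double avg;
    double stddev;
};

/* Newton's method square root (assumption: used only to avoid -lm). */
static double sqrt_newton(double v)
{
    if (v <= 0.0)
        return 0.0;
    double x = v;
    for (int i = 0; i < 60; i++)
        x = 0.5 * (x + v / x);
    return x;
}

/* Pooled standard deviation of two sample sets. */
static double pooled_stddev(struct sample a, struct sample b)
{
    double num = (a.n - 1.0) * a.stddev * a.stddev +
                 (b.n - 1.0) * b.stddev * b.stddev;
    return sqrt_newton(num / (a.n + b.n - 2.0));
}

/* Difference of the means, b relative to a (negative: b is faster). */
static double mean_diff(struct sample a, struct sample b)
{
    return b.avg - a.avg;
}

/* Half-width of the confidence interval for a given t critical value. */
static double ci_half_width(struct sample a, struct sample b, double t)
{
    return t * pooled_stddev(a, b) * sqrt_newton(1.0 / a.n + 1.0 / b.n);
}
```

Feeding in the two rows above (`{1000, 361.628, 573.11483}` for FreeBSD and `{1000, 288.924, 307.34136}` for Linux) with `t = 1.96` should give a pooled s of about 459.85 and a difference of about -72.704 +/- 40.31 cycles; dividing both by the FreeBSD average gives the percentage line.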

||= Idea =||= Westmere =||= Sandy Bridge =||= Ivy Bridge =||= Penryn =||
|| Replace `dec` with `sub` || no change || no change || no change || ||
|| Use `movsd` instead of `movsq` || slightly slower || slightly slower || 6% faster || ||
|| Simple `movdqa` loop || 138% slower || 58% slower || 46% slower || ||
|| `movdqa` 32 at a time (old) || 27% slower || 14% faster || 17% faster || ||
|| `movdqa` 32 at a time (new) || 27% slower || 15% faster || 18% faster || ||
|| `movdqa` 32 at a time (reorder) || 27% slower || 16% faster || 19% faster || ||
|| `movdqa` 64 at a time (old) || 224% slower || 131% slower || 116% slower || ||
|| `movdqa` 64 at a time (new) || 4 cycles slower || 21% faster || 24% faster || ||
|| Intermix SSE and backwards tests || slightly slower || slightly slower || slightly slower || ||
|| `movaps` 32 at a time || 24% slower || 18% faster || 23% faster || 52% faster ||
|| `movaps` 64 at a time || 17% faster || 23% faster || 25% faster || 48% faster ||

Takeaways from this trial:
- Use `movaps` instead of `movdqa`: `movdqa` carries an operand-size (0x66) prefix, so its encoding is one byte longer.
- Minimize branches on the common path.
- Unroll the copy loop; `movaps` 64 at a time was the only variant that was faster on every CPU tested.
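
The takeaways can be illustrated with a short C sketch using SSE intrinsics. This is not the assembly that was benchmarked: `copy64_aligned` is a hypothetical helper, and it assumes 16-byte-aligned pointers and a length that is a multiple of 64 bytes. Compilers typically turn `_mm_load_ps`/`_mm_store_ps` into `movaps`, but that is up to the compiler. On targets without SSE the sketch falls back to the library `memcpy()`.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/*
 * Hypothetical helper: copy len bytes, 64 per iteration.
 * Assumes dst and src are 16-byte aligned and len is a multiple of 64.
 */
static void copy64_aligned(void *dst, const void *src, size_t len)
{
#if defined(__SSE__)
    float *d = dst;
    const float *s = src;

    /* One loop branch per 64 bytes: four 16-byte loads, then four stores. */
    for (; len >= 64; len -= 64, s += 16, d += 16) {
        __m128 r0 = _mm_load_ps(s);
        __m128 r1 = _mm_load_ps(s + 4);
        __m128 r2 = _mm_load_ps(s + 8);
        __m128 r3 = _mm_load_ps(s + 12);
        _mm_store_ps(d, r0);
        _mm_store_ps(d + 4, r1);
        _mm_store_ps(d + 8, r2);
        _mm_store_ps(d + 12, r3);
    }
#else
    /* No SSE available: defer to the library routine. */
    memcpy(dst, src, len);
#endif
}
```

A single page (4096 bytes, page aligned) satisfies both assumptions, so a call such as `copy64_aligned(dst_page, src_page, 4096)` runs the loop body 64 times with no tail handling.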

Now testing the overlap case:

||= Idea =||= Westmere =||= Sandy Bridge =||= Ivy Bridge =||= Penryn =||
|| `movaps` 64 at a time || 56% faster || 56% faster || 56% faster || 48% faster ||
|| Same, using `leaq` || 50% faster || 56% faster || 60% faster || 52% faster ||

Notes:
- `leaq` seems to be in the noise: it was about 4 cycles faster on Ivy Bridge and possibly Sandy Bridge, 4-8 cycles slower on Westmere, and 8 cycles faster on Penryn.
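
As a rough illustration of what an overlap path has to handle, here is a byte-wise C sketch. It assumes "the overlap case" means the destination starts inside the source range, so a forward copy would overwrite bytes it has not read yet. Both function names are hypothetical, and the tested routine copies 64 at a time rather than byte by byte.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: nonzero when src <= dst < src + len.
 * Unsigned wraparound lets a single compare cover the whole range check.
 */
static int needs_backward_copy(const void *dst, const void *src, size_t len)
{
    return (uintptr_t)dst - (uintptr_t)src < len;
}

/* Copy len bytes, walking backwards when the regions overlap that way. */
static void *copy_overlap(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (needs_backward_copy(dst, src, len)) {
        /* Start at the last byte so unread source bytes are never clobbered. */
        while (len--)
            d[len] = s[len];
    } else {
        for (size_t i = 0; i < len; i++)
            d[i] = s[i];
    }
    return dst;
}
```

The single unsigned compare keeps the common, non-overlapping path down to one extra branch, in line with the earlier takeaway about minimizing branches.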