Changes between Initial Version and Version 1 of LibCSSE/memcpy


Timestamp: Aug 8, 2014, 2:46:15 PM
Author: john
Comment: Move from parent

The first routine I worked on was memcpy().

Comparison of stock memcpy() of a single page on FreeBSD vs Linux on a Westmere (values are TSC deltas):

{{{
x fbsd/westmere/builtin
+ linux/builtin
    N           Min           Max        Median           Avg        Stddev
x 1000           336         18444           340       361.628     573.11483
+ 1000           276          9996           280       288.924     307.34136
Difference at 95.0% confidence
        -72.704 +/- 40.3074
        -20.1046% +/- 11.1461%
        (Student's t, pooled s = 459.847)
}}}
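
The summary lines in the output above follow from the two Avg/Stddev rows. A minimal sketch of that arithmetic, assuming a standard two-sample Student's t comparison; the struct, the function names, and the small Newton square root (used only to avoid linking libm) are hypothetical, not part of ministat:

```c
#include <assert.h>

/* Summary of one sample set, as printed in the table above. */
struct sample_set {
    double n;      /* number of samples */
    double avg;    /* mean TSC delta */
    double stddev; /* sample standard deviation */
};

/* Newton's method square root; here only to avoid a libm dependency. */
static double sqrt_newton(double x)
{
    double r = x;

    if (x <= 0.0)
        return 0.0;
    for (int i = 0; i < 64; i++)
        r = 0.5 * (r + x / r);
    return r;
}

/* Pooled standard deviation of two sample sets. */
static double pooled_stddev(struct sample_set a, struct sample_set b)
{
    assert(a.n + b.n > 2.0);
    double va = (a.n - 1.0) * a.stddev * a.stddev;
    double vb = (b.n - 1.0) * b.stddev * b.stddev;
    return sqrt_newton((va + vb) / (a.n + b.n - 2.0));
}

/* Difference of means, second set minus first. */
static double mean_diff(struct sample_set a, struct sample_set b)
{
    return b.avg - a.avg;
}

/* Difference of means as a percentage of the first set's mean. */
static double mean_diff_pct(struct sample_set a, struct sample_set b)
{
    return 100.0 * (b.avg - a.avg) / a.avg;
}

/* Half-width of the confidence interval for a given critical t value. */
static double ci_half_width(struct sample_set a, struct sample_set b,
                            double t_crit)
{
    return t_crit * pooled_stddev(a, b) * sqrt_newton(1.0 / a.n + 1.0 / b.n);
}
```

Feeding in the FreeBSD row as the first set and the Linux row as the second gives the -72.704 difference, the -20.1046% relative difference, and the 459.847 pooled s. With an assumed large-sample critical value of 1.960, the half-width comes out close to the reported 40.3074.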

||= Idea                          =||= Westmere      =||= Sandy Bridge  =||= Ivy Bridge    =||= Penryn     =||
|| Replace `dec` with `sub`        || none            || none            || none            ||              ||
|| Use movsd instead of movsq      || slightly slower || slightly slower || 6% faster       ||              ||
|| Simple `movdqa` loop            || 138% slower     || 58% slower      || 46% slower      ||              ||
|| `movdqa` 32 at a time (old)     || 27% slower      || 14% faster      || 17% faster      ||              ||
|| `movdqa` 32 at a time (new)     || 27% slower      || 15% faster      || 18% faster      ||              ||
|| `movdqa` 32 at a time (reorder) || 27% slower      || 16% faster      || 19% faster      ||              ||
|| `movdqa` 64 at a time (old)     || 224% slower     || 131% slower     || 116% slower     ||              ||
|| `movdqa` 64 at a time (new)     || 4 cycles slower || 21% faster      || 24% faster      ||              ||
|| Intermix SSE and backwards tests || slightly slower || slightly slower || slightly slower ||              ||
|| `movaps` 32 at a time           || 24% slower      || 18% faster      || 23% faster      || 52% faster   ||
|| `movaps` 64 at a time           || 17% faster      || 23% faster      || 25% faster      || 48% faster   ||

Takeaways from this trial:
- Use `movaps` instead of `movdqa`: `movdqa` carries an operand-size (0x66) prefix, so each instruction encodes one byte longer.
- Minimize branches on the common path.
- Unroll the copy loop.
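
To make the "`movaps` 64 at a time" idea concrete, here is a minimal sketch in C with SSE intrinsics rather than the hand-written assembly that was actually benchmarked. The function name `copy64_movaps` is hypothetical. It assumes an x86 target with SSE, 16-byte-aligned buffers, a length that is a multiple of 64, and no overlap. `_mm_load_ps` and `_mm_store_ps` are the aligned load/store intrinsics that compilers typically lower to `movaps`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE intrinsics */

/*
 * Hypothetical helper: copy len bytes, 64 bytes per loop iteration.
 * Assumes dst and src are 16-byte aligned, len is a multiple of 64,
 * and the two regions do not overlap.
 */
static void copy64_movaps(void *dst, const void *src, size_t len)
{
    float *d = dst;
    const float *s = src;

    assert(((uintptr_t)dst & 15) == 0);
    assert(((uintptr_t)src & 15) == 0);
    assert((len & 63) == 0);

    /* One branch per 64 bytes: four aligned loads, then four aligned stores. */
    for (; len != 0; len -= 64) {
        __m128 a = _mm_load_ps(s);
        __m128 b = _mm_load_ps(s + 4);
        __m128 c = _mm_load_ps(s + 8);
        __m128 e = _mm_load_ps(s + 12);

        _mm_store_ps(d, a);
        _mm_store_ps(d + 4, b);
        _mm_store_ps(d + 8, c);
        _mm_store_ps(d + 12, e);

        s += 16; /* 16 floats = 64 bytes */
        d += 16;
    }
}
```

A page-sized call would look like `copy64_movaps(dst_page, src_page, 4096)`. Whether the compiler keeps this exact instruction order is not guaranteed, so cycle counts from this sketch are not directly comparable to the table.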

Now testing the overlap case:

||= Idea                          =||= Westmere      =||= Sandy Bridge  =||= Ivy Bridge    =||= Penryn     =||
|| `movaps` 64 at a time           || 56% faster      || 56% faster      || 56% faster      || 48% faster   ||
|| Above using leaq                || 50% faster      || 56% faster      || 60% faster      || 52% faster   ||

Notes:
- `leaq` seems to be in the noise: it was about 4 cycles faster on Ivy Bridge and possibly Sandy Bridge, 4-8 cycles slower on Westmere, and 8 cycles faster on Penryn.
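
These notes do not show how the overlap case is detected. One common approach, sketched here as an assumption rather than as what the tested routine does, is a single unsigned comparison: if `dst - src`, taken as an unsigned value, is less than the length, then `dst` lies inside the source range and the copy has to run backwards. Both function names are hypothetical, and the byte loops stand in for the real SSE copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Nonzero when dst falls inside [src, src + len), i.e. a forward copy
 * would overwrite source bytes before reading them. When dst is below
 * src the unsigned subtraction wraps to a huge value, so the test fails
 * and a forward copy is safe.
 */
static int needs_backward_copy(const void *dst, const void *src, size_t len)
{
    return (uintptr_t)dst - (uintptr_t)src < len;
}

/* Overlap-safe byte copy: backwards only when the test above requires it. */
static void copy_overlap_safe(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(len == 0 || (dst != NULL && src != NULL));

    if (needs_backward_copy(dst, src, len)) {
        /* Copy from the last byte down to the first. */
        while (len-- != 0)
            d[len] = s[len];
    } else {
        for (size_t i = 0; i < len; i++)
            d[i] = s[i];
    }
}
```

The point of folding both bounds into one comparison is that it keeps the common, non-overlapping path down to a single branch, in line with the earlier takeaway about minimizing branches.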