Changes between Version 2 and Version 3 of LibCSSE


Timestamp:
May 21, 2014, 5:24:35 PM (12 years ago)
Author:
john

  • LibCSSE

    v2 v3
    15 15  }}}
    16 16
    17     ||= Idea                          =||= Westmere      =||= Sandy Bridge  =||= Ivy Bridge    =||
    18     || Replace `dec` with `sub`        || none            || none            || none            ||
    19     || Use movsd instead of movsq      || slightly slower || slightly slower || 6% faster       ||
    20     || Simple `movdqa` loop            || 138% slower     || 58% slower      || 46% slower      ||
    21     || `movdqa` 32 at a time (old)     || 27% slower      || 14% faster      || 17% faster      ||
    22     || `movdqa` 32 at a time (new)     || 27% slower      || 15% faster      || 18% faster      ||
    23     || `movdqa` 32 at a time (reorder) || 27% slower      || 15% faster      || 19% faster      ||
    24     || `movdqa` 64 at a time           || 224% slower     || 131% slower     || 116% slower     ||
       17  ||= Idea                          =||= Westmere      =||= Sandy Bridge  =||= Ivy Bridge    =||= Penryn    = ||
       18  || Replace `dec` with `sub`        || none            || none            || none            ||              ||
       19  || Use movsd instead of movsq      || slightly slower || slightly slower || 6% faster       ||              ||
       20  || Simple `movdqa` loop            || 138% slower     || 58% slower      || 46% slower      ||              ||
       21  || `movdqa` 32 at a time (old)     || 27% slower      || 14% faster      || 17% faster      ||              ||
       22  || `movdqa` 32 at a time (new)     || 27% slower      || 15% faster      || 18% faster      ||              ||
       23  || `movdqa` 32 at a time (reorder) || 27% slower      || 16% faster      || 19% faster      ||              ||
       24  || `movdqa` 64 at a time (old)     || 224% slower     || 131% slower     || 116% slower     ||              ||
       25  || `movdqa` 64 at a time (new)     || 4 cycles slower || 21% faster      || 24% faster      ||              ||
       26  || Intermix SSE and backwards tests|| slightly slower || slightly slower || slightly slower ||              ||
       27  || `movaps` 32 at a time           || 24% slower      || 18% faster      || 23% faster      || 52% faster   ||
       28  || `movaps` 64 at a time           || 17% faster      || 23% faster      || 25% faster      || 48% faster   ||
       29
       30  Takeaways from this trial:
       31  - Use `movaps` instead of `movdqa`, since `movdqa` carries an operand-size (0x66) prefix and is therefore one byte longer.
       32  - Minimize branches on the common path.
       33  - Unroll the copy loop.
       34
       35  Now testing the overlap case:
       36
       37  ||= Idea                          =||= Westmere      =||= Sandy Bridge  =||= Ivy Bridge    =||= Penryn    = ||
       38  || `movaps` 64 at a time           || 56% faster      || 56% faster      || 56% faster      || 48% faster   ||
       39  || Above using leaq                || 50% faster      || 56% faster      || 60% faster      || 52% faster   ||
       40
       41  Notes:
       42  - `leaq` seems to be in the noise: it was about 4 cycles faster on Ivy Bridge and possibly Sandy Bridge, 4-8 cycles slower on Westmere, and 8 cycles faster on Penryn.
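
As a rough illustration of the first trial's takeaways (prefer `movaps`, keep the common path free of branches, unroll the copy loop), here is a minimal C sketch using SSE intrinsics. It is not LibCSSE's code: the function name `copy64_aligned` is hypothetical, and the sketch assumes both pointers are 16-byte aligned, the length is a multiple of 64, and the buffers do not overlap. `_mm_load_ps` and `_mm_store_ps` typically compile to `movaps`, though the compiler is free to pick another encoding.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load_ps / _mm_store_ps */

/*
 * Hypothetical helper (not part of LibCSSE): copy n bytes, 64 per iteration.
 * Assumptions: dst and src are 16-byte aligned, n is a multiple of 64,
 * and the two buffers do not overlap.
 */
static void copy64_aligned(void *dst, const void *src, size_t n)
{
    float *d = (float *)dst;
    const float *s = (const float *)src;
    size_t blocks = n / 64;

    /* One loop branch per 64 bytes: four aligned 16-byte loads,
     * then four aligned 16-byte stores. */
    while (blocks-- > 0) {
        __m128 r0 = _mm_load_ps(s + 0);
        __m128 r1 = _mm_load_ps(s + 4);
        __m128 r2 = _mm_load_ps(s + 8);
        __m128 r3 = _mm_load_ps(s + 12);
        _mm_store_ps(d + 0, r0);
        _mm_store_ps(d + 4, r1);
        _mm_store_ps(d + 8, r2);
        _mm_store_ps(d + 12, r3);
        s += 16;  /* 16 floats = 64 bytes */
        d += 16;
    }
}
```

A real routine would still need a head and tail path for unaligned or odd-sized copies, and a backwards variant for the overlap case measured above.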