Changes between Version 4 and Version 5 of LibCSSE


Timestamp:
Aug 8, 2014, 2:46:39 PM (12 years ago)
Author:
john
Comment:

--

  • LibCSSE

    v4 v5  
    1 The first routine I worked on was memcpy().
     1= Implementing SSE-optimized variants of string routines in libc for FreeBSD/amd64 =
    22
    3 Comparison of the stock memcpy() copying a single page on FreeBSD vs. Linux on a Westmere CPU (values are TSC deltas):
     3There is a subpage for each routine.  Most routines have multiple variants (e.g. plain SSE2 vs AVX).  For each routine, the variants were evaluated using micro-benchmarks on several processors.
    44
    5 {{{
    6 x fbsd/westmere/builtin
    7 + linux/builtin
    8     N           Min           Max        Median           Avg        Stddev
    9 x 1000           336         18444           340       361.628     573.11483
    10 + 1000           276          9996           280       288.924     307.34136
    11 Difference at 95.0% confidence
    12         -72.704 +/- 40.3074
    13         -20.1046% +/- 11.1461%
    14         (Student's t, pooled s = 459.847)
    15 }}}
     5Routines:
    166
    17 ||= Idea                          =||= Westmere      =||= Sandy Bridge  =||= Ivy Bridge    =||= Penryn     =||
    18 || Replace `dec` with `sub`        || none            || none            || none            ||              ||
    19 || Use movsd instead of movsq      || slightly slower || slightly slower || 6% faster       ||              ||
    20 || Simple `movdqa` loop            || 138% slower     || 58% slower      || 46% slower      ||              ||
    21 || `movdqa` 32 at a time (old)     || 27% slower      || 14% faster      || 17% faster      ||              ||
    22 || `movdqa` 32 at a time (new)     || 27% slower      || 15% faster      || 18% faster      ||              ||
    23 || `movdqa` 32 at a time (reorder) || 27% slower      || 16% faster      || 19% faster      ||              ||
    24 || `movdqa` 64 at a time (old)     || 224% slower     || 131% slower     || 116% slower     ||              ||
    25 || `movdqa` 64 at a time (new)     || 4 cycles slower || 21% faster      || 24% faster      ||              ||
    26 || Intermix SSE and backwards tests|| slightly slower || slightly slower || slightly slower ||              ||
    27 || `movaps` 32 at a time           || 24% slower      || 18% faster      || 23% faster      || 52% faster   ||
    28 || `movaps` 64 at a time           || 17% faster      || 23% faster      || 25% faster      || 48% faster   ||
    29 
    30 Takeaways from this trial:
    31 - Use `movaps` instead of `movdqa`, as `movdqa` needs an operand-size (0x66) prefix and so has a one-byte-longer encoding
    32 - Minimize branches on the common path
    33 - Unroll the copy loop
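The takeaways above can be illustrated with a small C sketch that uses SSE intrinsics. This is not the assembly that was benchmarked: `copy64_aligned` is a hypothetical name, the sketch assumes both pointers are 16-byte aligned and the length is a multiple of 64, and it relies on compilers typically emitting `movaps` for `_mm_load_ps` and `_mm_store_ps`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Copy len bytes, 64 bytes per iteration, using aligned 16-byte loads and
 * stores. Assumptions (checked by the assert): dst and src are 16-byte
 * aligned, len is a multiple of 64, and the buffers do not overlap. */
static void copy64_aligned(void *dst, const void *src, size_t len)
{
    assert((uintptr_t)dst % 16 == 0);
    assert((uintptr_t)src % 16 == 0);
    assert(len % 64 == 0);

    float *d = dst;
    const float *s = src;

    /* One loop branch per 64 bytes: four loads followed by four stores. */
    for (size_t i = 0; i < len / 64; i++, s += 16, d += 16) {
        __m128 r0 = _mm_load_ps(s);
        __m128 r1 = _mm_load_ps(s + 4);
        __m128 r2 = _mm_load_ps(s + 8);
        __m128 r3 = _mm_load_ps(s + 12);
        _mm_store_ps(d, r0);
        _mm_store_ps(d + 4, r1);
        _mm_store_ps(d + 8, r2);
        _mm_store_ps(d + 12, r3);
    }
}
```

A real memcpy() would also need head and tail handling for unaligned or short copies; the sketch covers only the unrolled, aligned inner loop that the table rows labelled "64 at a time" refer to.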
    34 
    35 Now testing the overlap case:
    36 
    37 ||= Idea                          =||= Westmere      =||= Sandy Bridge  =||= Ivy Bridge    =||= Penryn     =||
    38 || `movaps` 64 at a time           || 56% faster      || 56% faster      || 56% faster      || 48% faster   ||
    39 || Above using leaq                || 50% faster      || 56% faster      || 60% faster      || 52% faster   ||
    40 
    41 Notes:
    42 - leaq seems to be in the noise: it was about 4 cycles faster on Ivy Bridge and possibly Sandy Bridge, 4-8 cycles slower on Westmere, and 8 cycles faster on Penryn
     7* [[LibCSSE/memcpy|memcpy]]
     8* [[LibCSSE/memset|memset]]
     9* [[LibCSSE/strlen|strlen]]