Changes between Version 4 and Version 5 of LibCSSE

Timestamp: Aug 8, 2014, 2:46:39 PM
Author: john
Comment: --

LibCSSE

-                      v4
+                      v5
+The first routine I worked on was memcpy().
+= Implementing SSE-optimized variants of string routines in libc for FreeBSD/amd64 =
+Comparison of stock memcpy() of a single page on FreeBSD vs Linux on a Westmere (values are TSC deltas):
+There is a subpage for each routine.  Most routines have multiple variants (e.g. plain SSE2 vs. AVX).  The variants of each routine were evaluated with micro-benchmarks on a variety of processors.
+{{{
+x fbsd/westmere/builtin
++ linux/builtin
+    N           Min           Max        Median           Avg        Stddev
+x 1000           336         18444           340       361.628     573.11483
++ 1000           276          9996           280       288.924     307.34136
+Difference at 95.0% confidence
+        -72.704 +/- 40.3074
+        -20.1046% +/- 11.1461%
+        (Student's t, pooled s = 459.847)
+}}}
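The "Difference at 95.0% confidence" lines can be re-derived from the two summary rows alone. The following is a minimal sketch, not the tool that produced the output: the names `sample_summary`, `diff_result`, `pooled_difference` and `sqrt_newton` are hypothetical, and the critical value 1.960 is an assumption (the large-sample Student's t value) that happens to reproduce the figures shown.

```c
#include <assert.h>
#include <stddef.h>

/* One summary row: sample count, mean and standard deviation. */
struct sample_summary {
    size_t n;
    double avg;
    double stddev;
};

/* Pooled-variance comparison of two summary rows. */
struct diff_result {
    double pooled_s;       /* pooled standard deviation */
    double diff;           /* b.avg - a.avg */
    double half_width;     /* +/- half-width of the interval */
    double diff_pct;       /* diff as a percentage of a.avg */
    double half_width_pct; /* half_width as a percentage of a.avg */
};

/* Newton iteration; keeps the sketch free of a libm link dependency. */
static double
sqrt_newton(double v)
{
    double x = v;

    if (v <= 0.0)
        return 0.0;
    for (int i = 0; i < 64; i++)
        x = 0.5 * (x + v / x);
    return x;
}

/* t_crit is supplied by the caller; 1.960 is an assumed value here. */
static struct diff_result
pooled_difference(struct sample_summary a, struct sample_summary b,
    double t_crit)
{
    struct diff_result r;
    double df, var;

    assert(a.n > 1 && b.n > 1);
    df = (double)(a.n + b.n - 2);
    var = ((double)(a.n - 1) * a.stddev * a.stddev +
        (double)(b.n - 1) * b.stddev * b.stddev) / df;
    r.pooled_s = sqrt_newton(var);
    r.diff = b.avg - a.avg;
    r.half_width = t_crit * r.pooled_s *
        sqrt_newton(1.0 / (double)a.n + 1.0 / (double)b.n);
    r.diff_pct = 100.0 * r.diff / a.avg;
    r.half_width_pct = 100.0 * r.half_width / a.avg;
    return r;
}
```

Feeding in the FreeBSD row as `a` and the Linux row as `b` gives the pooled s, the cycle difference and the percentage difference to the precision printed above.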
+Routines:
+||= Idea                          =||= Westmere      =||= Sandy Bridge  =||= Ivy Bridge    =||= Penryn     =||
+|| Replace `dec` with `sub`        || none            || none            || none            ||              ||
+|| Use movsd instead of movsq      || slightly slower || slightly slower || 6% faster       ||              ||
+|| Simple `movdqa` loop            || 138% slower     || 58% slower      || 46% slower      ||              ||
+|| `movdqa` 32 at a time (old)     || 27% slower      || 14% faster      || 17% faster      ||              ||
+|| `movdqa` 32 at a time (new)     || 27% slower      || 15% faster      || 18% faster      ||              ||
+|| `movdqa` 32 at a time (reorder) || 27% slower      || 16% faster      || 19% faster      ||              ||
+|| `movdqa` 64 at a time (old)     || 224% slower     || 131% slower     || 116% slower     ||              ||
+|| `movdqa` 64 at a time (new)     || 4 cycles slower || 21% faster      || 24% faster      ||              ||
+|| Intermix SSE and backwards tests|| slightly slower || slightly slower || slightly slower ||              ||
+|| `movaps` 32 at a time           || 24% slower      || 18% faster      || 23% faster      || 52% faster   ||
+|| `movaps` 64 at a time           || 17% faster      || 23% faster      || 25% faster      || 48% faster   ||
+Takeaways from this trial:
+- Use `movaps` instead of `movdqa`: `movdqa` carries a 0x66 operand-size prefix, so its encoding is one byte longer.
+- Minimize branches on the common path.
+- Unroll the copy loop.
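As an illustration of the "`movaps` 64 at a time" idea, here is a C sketch using SSE intrinsics rather than the hand-written assembly that was benchmarked. `copy64_aligned` is a hypothetical name. The sketch assumes both pointers are 16-byte aligned, the length is a multiple of 64, and the buffers do not overlap. Compilers typically lower `_mm_load_ps`/`_mm_store_ps` to `movaps`, but that is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifdef __SSE__
#include <xmmintrin.h>
#endif

/*
 * Hypothetical helper: copy len bytes in 64-byte blocks using four
 * aligned 16-byte loads followed by four aligned 16-byte stores.
 * Falls back to memcpy() when SSE is not available at compile time.
 */
static void *
copy64_aligned(void *dst, const void *src, size_t len)
{
    /* Preconditions assumed by this sketch. */
    assert(((uintptr_t)dst & 15) == 0);
    assert(((uintptr_t)src & 15) == 0);
    assert((len & 63) == 0);

#ifdef __SSE__
    for (size_t off = 0; off < len; off += 64) {
        const float *s = (const float *)((const unsigned char *)src + off);
        float *d = (float *)((unsigned char *)dst + off);

        /* Issue all loads before the stores, one block per iteration. */
        __m128 x0 = _mm_load_ps(s);
        __m128 x1 = _mm_load_ps(s + 4);
        __m128 x2 = _mm_load_ps(s + 8);
        __m128 x3 = _mm_load_ps(s + 12);

        _mm_store_ps(d, x0);
        _mm_store_ps(d + 4, x1);
        _mm_store_ps(d + 8, x2);
        _mm_store_ps(d + 12, x3);
    }
#else
    memcpy(dst, src, len);
#endif
    return dst;
}
```

A page-sized copy has only one branch per 64 bytes on its common path, which is the point of the unrolling takeaway. A real routine would also need head and tail handling for unaligned or odd-sized requests.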
+Now testing the overlap case:
+||= Idea                          =||= Westmere      =||= Sandy Bridge  =||= Ivy Bridge    =||= Penryn     =||
+|| `movaps` 64 at a time           || 56% faster      || 56% faster      || 56% faster      || 48% faster   ||
+|| Above using leaq                || 50% faster      || 56% faster      || 60% faster      || 52% faster   ||
+Notes:
+- The `leaq` change seems to be in the noise: it was about 4 cycles faster on Ivy Bridge and possibly Sandy Bridge, 4-8 cycles slower on Westmere, and 8 cycles faster on Penryn.
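For reference, a sketch of what an overlap test can look like in C. This assumes the "overlap case" means a memmove-style copy that must run backwards when the destination starts inside the source range; the page does not spell that out. `needs_backward_copy` and `copy_overlap_safe` are hypothetical names, and the byte loops stand in for the SSE copy loops.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Nonzero when dst lies inside [src, src + len), i.e. a forward copy
 * would overwrite source bytes before reading them. The unsigned
 * subtraction wraps to a large value when dst is below src, so a
 * single compare covers both directions.
 */
static int
needs_backward_copy(const void *dst, const void *src, size_t len)
{
    return (uintptr_t)dst - (uintptr_t)src < len;
}

/* Byte-wise copy that picks its direction from the test above. */
static void *
copy_overlap_safe(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (needs_backward_copy(dst, src, len)) {
        /* Copy from the last byte down to the first. */
        while (len-- > 0)
            d[len] = s[len];
    } else {
        for (size_t i = 0; i < len; i++)
            d[i] = s[i];
    }
    return dst;
}
```

The single unsigned compare keeps the non-overlapping path to one extra branch, in line with the takeaway about minimizing branches on the common path.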
+* [[LibCSSE/memcpy|memcpy]]
+* [[LibCSSE/memset|memset]]
+* [[LibCSSE/strlen|strlen]]