= memcpy =

== Variants ==

||= '''Name''' =||= '''Description''' =||
|| stock || machine-dependent (MD) amd64 version using {{{rep movsq}}} ||
|| SSE2 || {{{movdqu}}} for block-copy ||
|| SSE2 aligned || align the source so that loads always use {{{movaps}}}; stores use {{{movaps}}} for an aligned destination and {{{movdqu}}} for an unaligned destination ||
|| AVX || 256-bit {{{vmovdqu}}} for block-copy, with a 128-byte block as the common loop ||
|| ERMS || {{{rep movsb}}} for machines with ERMS ||
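
As a rough illustration of the SSE2 variant, the block-copy loop might look like the sketch below. It is not the benchmarked routine: `copy_sse2_unaligned` is a hypothetical name, the code uses compiler intrinsics rather than hand-written assembly, and it assumes an x86-64 target with GCC or Clang. The unaligned load and store intrinsics normally compile to `movdqu`.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not the routine measured on this page.
 * Copies n bytes in 16-byte blocks with unaligned loads and stores,
 * then finishes the remaining tail one byte at a time.
 * The regions must not overlap.
 */
static void *copy_sse2_unaligned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);  /* unaligned load */
        _mm_storeu_si128((__m128i *)d, v);                /* unaligned store */
        s += 16;
        d += 16;
        n -= 16;
    }
    while (n-- > 0)  /* tail of fewer than 16 bytes */
        *d++ = *s++;

    return dst;
}
```

A real implementation would likely unroll the loop and handle short copies separately, as the notes further down suggest.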


== Machines Tested ==

||= '''CPU''' =||= '''Speed (GHz)''' =||= '''Notes''' =||
|| AMD FX-8120 || 3.11 || 1 x 8 zoo.freebsd.org ||
|| AMD Opteron 6328 || 3.20 || 2 x 8 Supermicro H8DG6/H8DGi ||
|| Intel Xeon X5365 || 3.00 || 2 x 4 Supermicro X7DBU ||
|| Intel Xeon X5482 || 3.20 || 2 x 4 Supermicro X7DWN+ ||
|| Intel Xeon X5675 || 3.07 || Westmere 2 x 6 Supermicro X8DTU ||
|| Intel Core i5-2520M || 2.50 || Sandy Bridge 1 x 4 Thinkpad X220 (4286) ||
|| Intel Core i5-2500K || 3.30 || Sandy Bridge 1 x 4 MSI Z77A-G45 (MS-7752) ||
|| Intel Xeon E5-2680 || 2.70 || Romley 2 x 8 Supermicro X9DRW ||
|| Intel Xeon E5-2667 v2 || 3.30 || Romley V2 2 x 8 Supermicro X9DRW (supports ERMS) ||
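
Whether a machine supports ERMS can be checked at run time through CPUID leaf 7 (sub-leaf 0), where EBX bit 9 reports the feature. The sketch below assumes GCC or Clang on x86-64 and uses the `<cpuid.h>` helpers; `cpu_has_erms` and `copy_rep_movsb` are hypothetical names and not code from the benchmarked variants.

```c
#include <assert.h>
#include <cpuid.h>   /* GCC/Clang wrappers around the CPUID instruction */
#include <stddef.h>
#include <string.h>

/* Returns 1 if CPUID.(EAX=7,ECX=0):EBX bit 9 (ERMS) is set, else 0. */
static int cpu_has_erms(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid_max(0, NULL) < 7)  /* leaf 7 not implemented */
        return 0;
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return (int)((ebx >> 9) & 1);
}

/*
 * Forward byte copy with "rep movsb". The instruction works on any
 * x86-64 CPU; ERMS only affects how fast it is. Assumes the direction
 * flag is clear, as the ABI requires at function entry.
 */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;

    assert(dst != NULL && src != NULL);
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
}
```

A dispatcher could call `cpu_has_erms()` once at start-up and pick the `rep movsb` path only when it returns 1.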

== Test Cases ==

||= '''Name''' =||= '''Description''' =||
|| page || copy aligned page to aligned page ||
|| overlap || overlapping copy of page - 16 bytes within a page ||
|| short || aligned copy of 15 bytes ||
|| short2 || aligned copy of 32 bytes ||
|| short3 || aligned copy of 48 bytes ||
|| offset || 4 byte offset source copy of 128 bytes ||
|| offset2 || 7 byte offset source and destination copy of 97 bytes ||

== Results ==

Each number is the minimum of a distribution of TSC deltas, where each delta is measured across a single invocation of the test.
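
The actual test harness is not shown on this page. As a minimal sketch of the idea, assuming GCC or Clang on x86-64, the following reads the TSC around one `memcpy()` call and keeps the smallest delta over many runs. `min_tsc_delta` is a hypothetical name, and the empty `asm` statements are only compiler barriers; the sketch does not serialize the pipeline around `rdtsc`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>  /* __rdtsc() on GCC/Clang */

/*
 * Hypothetical measurement loop: time a single memcpy() invocation
 * with the TSC, repeat `runs` times, and return the minimum delta.
 */
static uint64_t min_tsc_delta(void *dst, const void *src, size_t len, int runs)
{
    uint64_t best = UINT64_MAX;

    assert(runs > 0);
    for (int i = 0; i < runs; i++) {
        uint64_t start = __rdtsc();
        __asm__ volatile("" ::: "memory");  /* keep the copy after the first read */
        memcpy(dst, src, len);
        __asm__ volatile("" ::: "memory");  /* keep the copy before the second read */
        uint64_t delta = __rdtsc() - start;

        if (delta < best)
            best = delta;
    }
    return best;
}
```

Taking the minimum rather than the mean filters out runs disturbed by interrupts or cache misses, which show up as large outliers.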

Bold indicates the lowest time among the variants for a given Test and CPU combination.  Green text is used for times faster than the stock implementation, and red text is used for times slower than the stock implementation.

{{{#!th rowspan=3
'''CPU'''
}}}
{{{#!th colspan=35
'''Test / Variant'''
}}}
|--
{{{#!th colspan=5
'''page'''
}}}
{{{#!th colspan=5
'''overlap'''
}}}
{{{#!th colspan=5
'''short'''
}}}
{{{#!th colspan=5
'''short2'''
}}}
{{{#!th colspan=5
'''short3'''
}}}
{{{#!th colspan=5
'''offset'''
}}}
{{{#!th colspan=5
'''offset2'''
}}}
|--
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =|| \
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =|| \
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =|| \
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =|| \
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =|| \
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =|| \
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =||
|| AMD FX-8120 ||
|| AMD Opteron 6328 || \
|| 508|| [[span(490, style=color: green)]]|| '''[[span(448, style=color: green)]]'''|| [[span(2533, style=color:red)]]|| [[span(487, style=color: green)]]|| \
|| 1071|| '''[[span(434, style=color: green)]]'''|| [[span(641, style=color: green)]]|| [[span(5086, style=color:red)]]|| [[span(6591, style=color:red)]]|| \
|| 118|| [[span(112, style=color: green)]]|| [[span(112, style=color: green)]]|| [[span(110, style=color: green)]]|| '''[[span(107, style=color: green)]]'''|| \
|| 114|| '''[[span(75, style=color: green)]]'''|| [[span(103, style=color: green)]]|| [[span(78, style=color: green)]]|| [[span(136, style=color:red)]]|| \
|| 116|| [[span(78, style=color: green)]]|| [[span(105, style=color: green)]]|| '''[[span(74, style=color: green)]]'''|| [[span(151, style=color:red)]]|| \
|| 145|| '''[[span(84, style=color: green)]]'''|| [[span(101, style=color: green)]]|| [[span(90, style=color: green)]]|| [[span(362, style=color:red)]]|| \
|| 134|| [[span(110, style=color: green)]]|| [[span(140, style=color:red)]]|| '''[[span(106, style=color: green)]]'''|| [[span(177, style=color:red)]]||
|| Intel Xeon X5365 ||
|| Intel Xeon X5482 || \
|| 672|| 1424|| '''312'''|| -- || 760|| \
|| 664|| 1424|| '''400'''|| -- || 4232|| \
|| 144|| '''112'''|| '''112'''|| -- || '''112'''|| \
|| 80|| '''48'''|| '''48'''|| -- || 128|| \
|| 80|| '''48'''|| '''48'''|| -- || 144|| \
|| 248|| '''120'''|| '''120'''|| -- || 448|| \
|| 200|| '''72''' || 200|| -- || 192||
|| Intel Xeon X5675 || \
|| 336|| '''276'''|| 288|| -- || 424|| \
|| 648|| '''280'''|| 380|| -- || 4244|| \
|| 116|| 96|| 96|| -- || '''92'''|| \
|| 60|| '''46'''|| 69|| -- || 112|| \
|| 60|| '''46'''|| 76|| -- || 128|| \
|| 76|| '''46'''|| 56|| -- || 360|| \
|| 64|| '''40'''|| 192|| -- || 152||
|| Intel Core i5-2520M ||
|| Intel Core i5-2500K ||
|| Intel Xeon E5-2680 || \
|| 348|| '''288'''|| 304|| '''288'''|| 436|| \
|| 676|| '''288'''|| 420|| 776|| 4276|| \
|| 128|| '''100'''|| '''100'''|| '''100'''|| '''100'''|| \
|| 64|| '''24'''|| 40|| '''24'''|| 120|| \
|| 64|| '''28'''|| 40|| '''28'''|| 136|| \
|| 88|| 28|| 44|| '''24'''|| 296|| \
|| 68|| 40|| 196|| '''36'''|| 164||
|| Intel Xeon E5-2667 v2 || \
|| 384|| '''288'''|| 336|| '''288'''|| 292|| \
|| 708|| '''288'''|| 452|| 740|| 4268|| \
|| 84|| 52|| '''44'''|| 52|| '''44'''|| \
|| 100|| '''24'''|| 72|| '''24'''|| 52|| \
|| 100|| '''24'''|| 76|| '''24'''|| 40|| \
|| 120|| '''24'''|| 92|| '''24'''|| 56|| \
|| 84|| 60|| 92|| '''56'''|| 60||

== Conclusions ==

== Early Notes ==

The first routine I worked on was memcpy().

Comparison of stock memcpy() of a single page on FreeBSD vs Linux on a Westmere (values are TSC deltas):

{{{
x fbsd/westmere/builtin
+ linux/builtin
    N           Min           Max        Median           Avg        Stddev
x 1000           336         18444           340       361.628     573.11483
+ 1000           276          9996           280       288.924     307.34136
Difference at 95.0% confidence
        -72.704 +/- 40.3074
        -20.1046% +/- 11.1461%
        (Student's t, pooled s = 459.847)
}}}
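
The difference lines in the output above can be reproduced from the two summary rows. The sketch below assumes the usual pooled-variance Student's t comparison and a critical value of 1.960 for 95% confidence at large sample sizes; the struct and function names are hypothetical, and a small Newton iteration stands in for `sqrt()` so the example has no libm dependency.

```c
#include <assert.h>

struct sample { double n, avg, stddev; };
struct diff   { double pooled, delta, margin, delta_pct, margin_pct; };

/* Square root by Newton's method; adequate for this illustration. */
static double newton_sqrt(double x)
{
    double r = x > 1.0 ? x : 1.0;

    assert(x >= 0.0);
    for (int i = 0; i < 60; i++)
        r = 0.5 * (r + x / r);
    return r;
}

/* Compare `other` against the reference sample `ref` with critical value t. */
static struct diff compare(struct sample ref, struct sample other, double t)
{
    struct diff d;
    double var = ((ref.n - 1) * ref.stddev * ref.stddev +
                  (other.n - 1) * other.stddev * other.stddev) /
                 (ref.n + other.n - 2);

    d.pooled = newton_sqrt(var);               /* pooled standard deviation */
    d.delta = other.avg - ref.avg;             /* difference of the means */
    d.margin = t * d.pooled * newton_sqrt(1.0 / ref.n + 1.0 / other.n);
    d.delta_pct = 100.0 * d.delta / ref.avg;   /* relative to the reference */
    d.margin_pct = 100.0 * d.margin / ref.avg;
    return d;
}
```

Fed the FreeBSD row as the reference (N = 1000, Avg = 361.628, Stddev = 573.11483) and the Linux row as the other sample (N = 1000, Avg = 288.924, Stddev = 307.34136), this gives a pooled s of about 459.85 and a difference of about -72.70 +/- 40.31 cycles, or roughly -20.10% +/- 11.15%, in line with the output above.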

||= Idea                          =||= Westmere      =||= Sandy Bridge  =||= Ivy Bridge    =||= Penryn     =||
|| Replace `dec` with `sub`        || none            || none            || none            ||              ||
|| Use movsd instead of movsq      || slightly slower || slightly slower || 6% faster       ||              ||
|| Simple `movdqa` loop            || 138% slower     || 58% slower      || 46% slower      ||              ||
|| `movdqa` 32 at a time (old)     || 27% slower      || 14% faster      || 17% faster      ||              ||
|| `movdqa` 32 at a time (new)     || 27% slower      || 15% faster      || 18% faster      ||              ||
|| `movdqa` 32 at a time (reorder) || 27% slower      || 16% faster      || 19% faster      ||              ||
|| `movdqa` 64 at a time (old)     || 224% slower     || 131% slower     || 116% slower     ||              ||
|| `movdqa` 64 at a time (new)     || 4 cycles slower || 21% faster      || 24% faster      ||              ||
|| Intermix SSE and backwards tests|| slightly slower || slightly slower || slightly slower ||              ||
|| `movaps` 32 at a time           || 24% slower      || 18% faster      || 23% faster      || 52% faster   ||
|| `movaps` 64 at a time           || 17% faster      || 23% faster      || 25% faster      || 48% faster   ||

Takeaways from this trial:
- Use `movaps` instead of `movdqa`, as `movdqa` carries an operand-size (0x66) prefix and is therefore one byte longer.
- Minimize branches on the common path.
- Unroll the copy loop.
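
To make the takeaways concrete, here is a sketch of an unrolled "`movaps` 64 at a time" loop written with intrinsics. It is an illustration under stated assumptions and not the assembly that was benchmarked: `copy_movaps_64` is a hypothetical name, both pointers must be 16-byte aligned, the length must be a multiple of 64, and the regions must not overlap. `_mm_load_ps` and `_mm_store_ps` normally compile to `movaps`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps / _mm_store_ps */

/*
 * Hypothetical helper: copy n bytes, 64 bytes per iteration, using
 * four aligned 16-byte loads followed by four aligned 16-byte stores.
 * The loop has a single branch per 64 bytes copied.
 */
static void copy_movaps_64(void *dst, const void *src, size_t n)
{
    float *d = dst;
    const float *s = src;

    assert(((uintptr_t)dst & 15) == 0 && ((uintptr_t)src & 15) == 0);
    assert(n % 64 == 0);

    for (; n != 0; n -= 64, s += 16, d += 16) {
        __m128 r0 = _mm_load_ps(s);       /* bytes  0..15 */
        __m128 r1 = _mm_load_ps(s + 4);   /* bytes 16..31 */
        __m128 r2 = _mm_load_ps(s + 8);   /* bytes 32..47 */
        __m128 r3 = _mm_load_ps(s + 12);  /* bytes 48..63 */

        _mm_store_ps(d, r0);
        _mm_store_ps(d + 4, r1);
        _mm_store_ps(d + 8, r2);
        _mm_store_ps(d + 12, r3);
    }
}
```

Grouping the loads ahead of the stores is one possible ordering; the compiler remains free to reschedule the instructions.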

Now testing the overlap case:

||= Idea                          =||= Westmere      =||= Sandy Bridge  =||= Ivy Bridge    =||= Penryn     =||
|| `movaps` 64 at a time           || 56% faster      || 56% faster      || 56% faster      || 48% faster   ||
|| Above using leaq                || 50% faster      || 56% faster      || 60% faster      || 52% faster   ||

Notes:
- The effect of leaq seems to be in the noise: it was about 4 cycles faster on Ivy Bridge and possibly Sandy Bridge, 4-8 cycles slower on Westmere, and 8 cycles faster on Penryn.