The first routine I worked on was memcpy().

Comparison of stock memcpy() of a single page on FreeBSD vs Linux on a Westmere (values are TSC deltas):

{{{
x fbsd/westmere/builtin
+ linux/builtin
    N           Min           Max        Median           Avg        Stddev
x 1000           336         18444           340       361.628     573.11483
+ 1000           276          9996           280       288.924     307.34136
Difference at 95.0% confidence
        -72.704 +/- 40.3074
        -20.1046% +/- 11.1461%
        (Student's t, pooled s = 459.847)
}}}
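For reference, a TSC delta for a single-page copy can be collected roughly as follows. This is only a minimal sketch, not the harness that produced the numbers above: the helper name `time_page_copy`, the buffer names, and the 4096-byte page size are assumptions, and `__rdtsc()` is the GCC/Clang intrinsic from `<x86intrin.h>`.

```c
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>	/* __rdtsc(); GCC/Clang on x86 only */

#define PAGE_BYTES 4096	/* assumed page size */

/* Page-aligned global buffers so the copy cannot be optimized away. */
unsigned char src_page[PAGE_BYTES] __attribute__((aligned(PAGE_BYTES)));
unsigned char dst_page[PAGE_BYTES] __attribute__((aligned(PAGE_BYTES)));

/*
 * Hypothetical helper: return the TSC delta around one page-sized
 * memcpy().  rdtsc is not a serializing instruction, so a careful
 * harness would add fences and repeat the measurement many times.
 */
uint64_t
time_page_copy(void)
{
	uint64_t start, end;

	start = __rdtsc();
	memcpy(dst_page, src_page, PAGE_BYTES);
	end = __rdtsc();
	return (end - start);
}
```

Calling `time_page_copy()` 1000 times and printing one delta per line would give input in the shape ministat expects.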

||= Idea                          =||= Westmere      =||= Sandy Bridge  =||= Ivy Bridge    =||= Penryn     =||
|| Replace `dec` with `sub`        || none            || none            || none            ||              ||
|| Use `movsd` instead of `movsq`  || slightly slower || slightly slower || 6% faster       ||              ||
|| Simple `movdqa` loop            || 138% slower     || 58% slower      || 46% slower      ||              ||
|| `movdqa` 32 at a time (old)     || 27% slower      || 14% faster      || 17% faster      ||              ||
|| `movdqa` 32 at a time (new)     || 27% slower      || 15% faster      || 18% faster      ||              ||
|| `movdqa` 32 at a time (reorder) || 27% slower      || 16% faster      || 19% faster      ||              ||
|| `movdqa` 64 at a time (old)     || 224% slower     || 131% slower     || 116% slower     ||              ||
|| `movdqa` 64 at a time (new)     || 4 cycles slower || 21% faster      || 24% faster      ||              ||
|| Intermix SSE and backwards tests || slightly slower || slightly slower || slightly slower ||              ||
|| `movaps` 32 at a time           || 24% slower      || 18% faster      || 23% faster      || 52% faster   ||
|| `movaps` 64 at a time           || 17% faster      || 23% faster      || 25% faster      || 48% faster   ||

Takeaways from this trial:
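- Use `movaps` instead of `movdqa`: `movdqa` carries an operand-size (0x66) prefix, so its encoding is one byte longer.
- Minimize branches on the common path.
- Unroll the copy loop; `movaps` 64 at a time was the only variant that was faster on all four CPUs.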
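The takeaways can be illustrated in C with SSE intrinsics. This is a sketch only, not the routine that was benchmarked: it assumes "64 at a time" means 64 bytes per iteration, that both pointers are 16-byte aligned, and that the length is a multiple of 64. The function name `copy64_aligned` is hypothetical. `_mm_load_ps` and `_mm_store_ps` normally compile to `movaps`.

```c
#include <stddef.h>
#include <xmmintrin.h>	/* SSE: _mm_load_ps(), _mm_store_ps() */

/*
 * Hypothetical helper: copy len bytes in 64-byte chunks.
 * Assumes dst and src are 16-byte aligned and len is a multiple of 64.
 */
void
copy64_aligned(void *dst, const void *src, size_t len)
{
	float *d = dst;
	const float *s = src;

	for (; len >= 64; len -= 64, s += 16, d += 16) {
		/* Four aligned 16-byte loads, then four aligned stores. */
		__m128 r0 = _mm_load_ps(s);
		__m128 r1 = _mm_load_ps(s + 4);
		__m128 r2 = _mm_load_ps(s + 8);
		__m128 r3 = _mm_load_ps(s + 12);

		_mm_store_ps(d, r0);
		_mm_store_ps(d + 4, r1);
		_mm_store_ps(d + 8, r2);
		_mm_store_ps(d + 12, r3);
	}
}
```

The loop body has a single branch per 64 bytes, which is the point of the unrolling; handling of unaligned heads and short tails is deliberately left out.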

Now testing the overlap case:

||= Idea                          =||= Westmere      =||= Sandy Bridge  =||= Ivy Bridge    =||= Penryn     =||
|| `movaps` 64 at a time           || 56% faster      || 56% faster      || 56% faster      || 48% faster   ||
|| Above using leaq                || 50% faster      || 56% faster      || 60% faster      || 52% faster   ||

Notes:
- `leaq` seems to be in the noise: it was about 4 cycles faster on Ivy Bridge and possibly Sandy Bridge, 4-8 cycles slower on Westmere, and 8 cycles faster on Penryn.
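For context, assuming the overlap case means a destination that starts inside the source range (so the copy must run backwards), the direction test can be sketched in C as below. The function name `copy_overlap` is hypothetical and the byte-at-a-time loops are for clarity only; this is not the benchmarked routine.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy len bytes, picking the direction with a
 * single unsigned compare.
 */
void *
copy_overlap(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/*
	 * Unsigned wraparound makes this true only when
	 * src <= dst < src + len, i.e. a forward copy would
	 * overwrite source bytes before reading them.
	 */
	if ((uintptr_t)d - (uintptr_t)s < len) {
		/* Copy backwards, last byte first. */
		while (len-- > 0)
			d[len] = s[len];
	} else {
		/* No harmful overlap: copy forwards. */
		for (size_t i = 0; i < len; i++)
			d[i] = s[i];
	}
	return (dst);
}
```

A single subtract-and-compare keeps the common non-overlapping path down to one branch, in line with the earlier takeaway about minimizing branches.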