LibCSSE/memcpy – FreeBSD Tickets

Version 1 (modified by john, 12 years ago) (diff)
Move from parent

The first routine I worked on was memcpy().

Comparison of the stock memcpy() copying a single page on FreeBSD vs. Linux on a Westmere (values are TSC deltas, so lower is faster):

x fbsd/westmere/builtin
+ linux/builtin
    N           Min           Max        Median           Avg        Stddev
x 1000           336         18444           340       361.628     573.11483
+ 1000           276          9996           280       288.924     307.34136
Difference at 95.0% confidence
        -72.704 +/- 40.3074
        -20.1046% +/- 11.1461%
        (Student's t, pooled s = 459.847)

Idea	Westmere	Sandy Bridge	Ivy Bridge	Penryn
Replace `dec` with `sub`	none	none	none
Use `movsd` instead of `movsq`	slightly slower	slightly slower	6% faster
Simple `movdqa` loop	138% slower	58% slower	46% slower
`movdqa` 32 at a time (old)	27% slower	14% faster	17% faster
`movdqa` 32 at a time (new)	27% slower	15% faster	18% faster
`movdqa` 32 at a time (reorder)	27% slower	16% faster	19% faster
`movdqa` 64 at a time (old)	224% slower	131% slower	116% slower
`movdqa` 64 at a time (new)	4 cycles slower	21% faster	24% faster
Intermix SSE and backwards tests	slightly slower	slightly slower	slightly slower
`movaps` 32 at a time	24% slower	18% faster	23% faster	52% faster
`movaps` 64 at a time	17% faster	23% faster	25% faster	48% faster

Takeaways from this trial:

Use movaps instead of movdqa: movdqa carries an operand-size (0x66) prefix, which makes each instruction one byte longer.
Minimize branches on the common path.
Unroll the copy loop.
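The shape of a "`movaps` 64 at a time" loop can be sketched in C with SSE intrinsics. This is an illustration under stated assumptions, not the routine that was benchmarked: the function name `copy64_aligned` is hypothetical, both pointers are assumed to be 16-byte aligned, the length is assumed to be a multiple of 64, and the regions are assumed not to overlap. On compilers without SSE support the sketch falls back to memcpy().

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/*
 * Copy len bytes from src to dst, 64 bytes per loop iteration.
 * Assumptions (not from the original page): dst and src are 16-byte
 * aligned, len is a multiple of 64, and the regions do not overlap.
 */
static void copy64_aligned(void *dst, const void *src, size_t len)
{
#if defined(__SSE__)
    float *d = dst;
    const float *s = src;

    /* One loop branch per 64 bytes copied. */
    for (; len >= 64; len -= 64, s += 16, d += 16) {
        /* Aligned float loads and stores normally compile to movaps. */
        __m128 r0 = _mm_load_ps(s);
        __m128 r1 = _mm_load_ps(s + 4);
        __m128 r2 = _mm_load_ps(s + 8);
        __m128 r3 = _mm_load_ps(s + 12);

        _mm_store_ps(d, r0);
        _mm_store_ps(d + 4, r1);
        _mm_store_ps(d + 8, r2);
        _mm_store_ps(d + 12, r3);
    }
#else
    /* No SSE available: defer to the library routine. */
    memcpy(dst, src, len);
#endif
}
```

Issuing the four loads before the four stores is one possible ordering. The table suggests ordering matters, so any real variant would need to be measured per microarchitecture.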

Now testing the overlap case:

Idea	Westmere	Sandy Bridge	Ivy Bridge	Penryn
`movaps` 64 at a time	56% faster	56% faster	56% faster	48% faster
Same as above, using `leaq`	50% faster	56% faster	60% faster	52% faster

Notes:

leaq seems to be in the noise: it was about 4 cycles faster on Ivy Bridge and possibly Sandy Bridge, 4-8 cycles slower on Westmere, and 8 cycles faster on Penryn.
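For readers unfamiliar with the overlap case, here is a minimal byte-wise sketch of how a copy routine can pick a direction with a single branch. It is an assumption about the general technique, not the code measured above, and the name `copy_overlap_safe` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Byte-wise copy that tolerates overlapping regions (a sketch, not the
 * routine benchmarked above).
 */
static void *copy_overlap_safe(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /*
     * Unsigned wraparound makes this one comparison true only when dst
     * lies inside [src, src + len), the case where a forward copy would
     * overwrite source bytes before reading them.
     */
    if ((uintptr_t)d - (uintptr_t)s < len) {
        while (len-- > 0)
            d[len] = s[len];    /* copy backwards, last byte first */
    } else {
        for (size_t i = 0; i < len; i++)
            d[i] = s[i];        /* copy forwards */
    }
    return dst;
}
```

The forward path is the common one, so keeping the overlap test to a single comparison fits the "minimize branches" takeaway.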
