Version 3 (modified by john, 12 years ago)

memcpy

Variants

Name	Description
stock	machine-dependent (MD) amd64 version using `rep movsq`
SSE2	`movdqu` for block-copy
SSE2 aligned	align the source so that loads always use `movaps`; store with `movaps` for an aligned destination and `movdqu` for an unaligned destination
AVX	256-bit `vmovdqu` for block-copy with 128-byte block as common loop
ERMS	`rep movsb` for machines with ERMS
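
As a rough illustration of the SSE2 variant in the table above, the following C sketch copies 16 bytes at a time with unaligned loads and stores, then finishes the tail bytewise. It is not the routine that was benchmarked: the function name `copy_sse2_unaligned` is hypothetical, the intrinsics `_mm_loadu_si128` and `_mm_storeu_si128` normally compile to `movdqu`, and the code assumes a forward copy of non-overlapping buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: forward copy of non-overlapping buffers. */
static void *copy_sse2_unaligned(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(len == 0 || (dst != NULL && src != NULL));

#if defined(__SSE2__)
    /* Block copy: one unaligned 16-byte load and store per iteration. */
    while (len >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        len -= 16;
    }
#endif

    /* Tail (or the whole copy when SSE2 is unavailable). */
    while (len > 0) {
        *d++ = *s++;
        len--;
    }
    return dst;
}
```

A call such as `copy_sse2_unaligned(dst + 7, src, 97)` corresponds roughly to the shape of the "offset2" test case, assuming the offset applies to the destination.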

Machines Tested

CPU	Speed (GHz)	Notes
AMD FX-8120	3.11	1 x 8 zoo.freebsd.org
AMD Opteron 6328	3.20	2 x 8 Supermicro H8DG6/H8DGi
Intel Xeon X5365	3.00	2 x 4 Supermicro X7DBU
Intel Xeon X5482	3.20	2 x 4 Supermicro X7DWN+
Intel Xeon X5675	3.07	Westmere 2 x 6 Supermicro X8DTU
Intel Core i5-2520M	2.50	Sandy Bridge 1 x 4 Thinkpad X220 (4286)
Intel Core i5-2500K	3.30	Sandy Bridge 1 x 4 MSI Z77A-G45 (MS-7752)
Intel Xeon E5-2680	2.70	Romley 2 x 8 Supermicro X9DRW
Intel Xeon E5-2667 v2	3.30	Romley V2 2 x 8 Supermicro X9DRW (supports ERMS)

Test Cases

Name	Description
page	copy aligned page to aligned page
overlap	overlapping copy of (page size - 16) bytes within a page
short	aligned copy of 15 bytes
short2	aligned copy of 32 bytes
short3	aligned copy of 48 bytes
offset	4 byte offset copy of 128 bytes
offset2	7 byte offset copy of 97 bytes

Results

Each number is the minimum of the sample distribution, where each sample is the TSC delta measured across a single invocation of the test.
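
The measurement described above can be sketched in C as follows. This is a minimal sketch, not the harness that produced the results: the names `read_ticks`, `min_ticks`, `copy_case` and `run_copy` are hypothetical, `__rdtsc()` is the GCC/Clang intrinsic from `<x86intrin.h>`, and no serializing instruction is placed around the TSC reads, which a careful harness would probably add.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Read the TSC on x86; fall back to a nanosecond clock elsewhere. */
static uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

typedef void (*test_fn)(void *ctx);

/* Run fn once per sample and keep the smallest tick delta seen. */
static uint64_t min_ticks(test_fn fn, void *ctx, int samples)
{
    uint64_t best = UINT64_MAX;

    assert(samples > 0);
    for (int i = 0; i < samples; i++) {
        uint64_t start = read_ticks();
        fn(ctx);
        uint64_t delta = read_ticks() - start;
        if (delta < best)
            best = delta;
    }
    return best;
}

/* Example test body: one memcpy() call per invocation. */
struct copy_case {
    void *dst;
    const void *src;
    size_t len;
};

static void run_copy(void *ctx)
{
    struct copy_case *c = ctx;
    memcpy(c->dst, c->src, c->len);
}
```

Taking the minimum rather than the mean discards samples inflated by interrupts or cache misses, which is presumably why the results report it.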

Bold indicates the lowest time among the variants for a given test and CPU combination. Green text marks times faster than the stock implementation, and red text marks times slower than the stock implementation.

CPU	Test / Variant
	page					overlap					short					short2					short3					offset					offset2
	stock	SSE2	SSE2 aligned	AVX	ERMS	stock	SSE2	SSE2 aligned	AVX	ERMS	stock	SSE2	SSE2 aligned	AVX	ERMS	stock	SSE2	SSE2 aligned	AVX	ERMS	stock	SSE2	SSE2 aligned	AVX	ERMS	stock	SSE2	SSE2 aligned	AVX	ERMS	stock	SSE2	SSE2 aligned	AVX	ERMS
AMD FX-8120
AMD Opteron 6328
Intel Xeon X5365
Intel Xeon X5482
Intel Xeon X5675
Intel Core i5-2520M
Intel Core i5-2500K
Intel Xeon E5-2680
Intel Xeon E5-2667 v2

Conclusions

Early Notes

The first routine I worked on was memcpy().

Comparison of stock memcpy() of a single page on FreeBSD vs Linux on a Westmere (values are TSC deltas):

x fbsd/westmere/builtin
+ linux/builtin
    N           Min           Max        Median           Avg        Stddev
x 1000           336         18444           340       361.628     573.11483
+ 1000           276          9996           280       288.924     307.34136
Difference at 95.0% confidence
        -72.704 +/- 40.3074
        -20.1046% +/- 11.1461%
        (Student's t, pooled s = 459.847)

Idea	Westmere	Sandy Bridge	Ivy Bridge	Penryn
Replace `dec` with `sub`	none	none	none
Use `movsd` instead of `movsq`	slightly slower	slightly slower	6% faster
Simple `movdqa` loop	138% slower	58% slower	46% slower
`movdqa` 32 at a time (old)	27% slower	14% faster	17% faster
`movdqa` 32 at a time (new)	27% slower	15% faster	18% faster
`movdqa` 32 at a time (reorder)	27% slower	16% faster	19% faster
`movdqa` 64 at a time (old)	224% slower	131% slower	116% slower
`movdqa` 64 at a time (new)	4 cycles slower	21% faster	24% faster
Intermix SSE and backwards tests	slightly slower	slightly slower	slightly slower
`movaps` 32 at a time	24% slower	18% faster	23% faster	52% faster
`movaps` 64 at a time	17% faster	23% faster	25% faster	48% faster

Takeaways from this trial:

Use `movaps` instead of `movdqa`, because `movdqa` carries an operand-size (0x66) prefix that makes its encoding one byte longer.
Minimize branches on the common path.
Unroll the copy loop.
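
To make these takeaways concrete, here is a hedged C sketch of an unrolled loop that moves 64 bytes per iteration with aligned 16-byte loads and stores and has a single branch per iteration. It is an illustration only, not the assembly that was measured: the name `copy_aligned_64` is hypothetical, the intrinsics `_mm_load_ps` and `_mm_store_ps` typically compile to `movaps`, and the sketch assumes both pointers are 16-byte aligned and the length is a multiple of 64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: aligned, non-overlapping copy in 64-byte blocks. */
static void copy_aligned_64(void *dst, const void *src, size_t len)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(len % 64 == 0);

#if defined(__SSE__)
    float *d = dst;
    const float *s = src;

    /* Four loads, then four stores; 16 floats == 64 bytes per pass. */
    for (size_t i = 0; i < len; i += 64, s += 16, d += 16) {
        __m128 r0 = _mm_load_ps(s);
        __m128 r1 = _mm_load_ps(s + 4);
        __m128 r2 = _mm_load_ps(s + 8);
        __m128 r3 = _mm_load_ps(s + 12);
        _mm_store_ps(d, r0);
        _mm_store_ps(d + 4, r1);
        _mm_store_ps(d + 8, r2);
        _mm_store_ps(d + 12, r3);
    }
#else
    /* Portable fallback when SSE is unavailable. */
    memcpy(dst, src, len);
#endif
}
```

A 4096-byte page copy between two page-aligned buffers satisfies both preconditions and runs the loop body 64 times.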

Now testing the overlap case:

Idea	Westmere	Sandy Bridge	Ivy Bridge	Penryn
`movaps` 64 at a time	56% faster	56% faster	56% faster	48% faster
Same as above, using `leaq`	50% faster	56% faster	60% faster	52% faster

Notes:

The effect of `leaq` seems to be in the noise: it was about 4 cycles faster on Ivy Bridge and possibly Sandy Bridge, 4-8 cycles slower on Westmere, and 8 cycles faster on Penryn.
