wiki:LibCSSE/memcpy

Version 7 (modified by john, 12 years ago)

memcpy

Variants

Name	Description
stock	machine-dependent (MD) amd64 version using `rep movsq`
SSE2	`movdqu` for block copy
SSE2 aligned	align the source so that loads always use `movaps`; store with `movaps` for an aligned destination and `movdqu` for an unaligned destination
AVX	256-bit `vmovdqu` for block copy, with a 128-byte block as the common loop
ERMS	`rep movsb` for machines with ERMS (Enhanced REP MOVSB/STOSB)
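
To make the SSE2 variant concrete, here is a minimal sketch of a block copy built on unaligned 16-byte loads and stores. It is written with compiler intrinsics rather than assembly, and the function name `sse2_copy_sketch` is hypothetical; it is an illustration of the idea, not the LibCSSE routine.

```c
#include <emmintrin.h>	/* SSE2 intrinsics */
#include <stddef.h>

/*
 * Hypothetical helper, not LibCSSE code: copy len bytes using
 * unaligned 16-byte loads and stores (these map to movdqu), then
 * finish any remaining tail one byte at a time.
 */
static void *
sse2_copy_sketch(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (len >= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)s);
		_mm_storeu_si128((__m128i *)d, v);
		d += 16;
		s += 16;
		len -= 16;
	}
	while (len-- > 0)
		*d++ = *s++;
	return (dst);
}
```

Because the loads and stores are unaligned, the same loop handles cases such as offset2 (7-byte offset, 97 bytes) without a separate alignment prologue, at the cost of a byte loop for the tail.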

Machines Tested

CPU	Speed (GHz)	Notes
AMD FX-8120	3.11	1 x 8 zoo.freebsd.org
AMD Opteron 6328	3.20	2 x 8 Supermicro H8DG6/H8DGi
Intel Xeon X5365	3.00	2 x 4 Supermicro X7DBU
Intel Xeon X5482	3.20	2 x 4 Supermicro X7DWN+
Intel Xeon X5675	3.07	Westmere 2 x 6 Supermicro X8DTU
Intel Core i5-2520M	2.50	Sandy Bridge 1 x 4 Thinkpad X220 (4286)
Intel Core i5-2500K	3.30	Sandy Bridge 1 x 4 MSI Z77A-G45 (MS-7752)
Intel Xeon E5-2680	2.70	Romley 2 x 8 Supermicro X9DRW
Intel Xeon E5-2667 v2	3.30	Romley V2 2 x 8 Supermicro X9DRW (supports ERMS)

Test Cases

Name	Description
page	copy aligned page to aligned page
overlap	overlapping copy of (page size - 16) bytes within a single page
short	aligned copy of 15 bytes
short2	aligned copy of 32 bytes
short3	aligned copy of 48 bytes
offset	copy of 128 bytes with the source offset by 4 bytes
offset2	copy of 97 bytes with both source and destination offset by 7 bytes

Results

Each number is the minimum of a distribution of samples, where each sample is the TSC delta measured across a single invocation of the test.
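
A measurement of this kind might look like the sketch below. The helper names `tsc_delta_once` and `min_sample` are hypothetical, `__rdtsc()` is the GCC/Clang intrinsic from `<x86intrin.h>`, and the harness actually used for the tables may differ (for example in how it serializes around the timestamp reads).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>	/* __rdtsc(); GCC/Clang on x86 */

/*
 * Hypothetical harness: TSC delta across one invocation of memcpy().
 * No serializing instruction is used, so out-of-order execution can
 * blur the measured boundaries slightly.
 */
static uint64_t
tsc_delta_once(void *dst, const void *src, size_t len)
{
	uint64_t start, end;

	start = __rdtsc();
	memcpy(dst, src, len);
	end = __rdtsc();
	return (end - start);
}

/* Minimum of n samples; n must be at least 1. */
static uint64_t
min_sample(const uint64_t *samples, size_t n)
{
	uint64_t min = samples[0];

	for (size_t i = 1; i < n; i++)
		if (samples[i] < min)
			min = samples[i];
	return (min);
}
```

Taking the minimum rather than the mean discards samples inflated by interrupts, cache misses, or context switches, which is why a single large outlier does not affect the reported value.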

Bold indicates the lowest time among the variants for a given test and CPU combination. Green text marks times faster than the stock implementation, and red text marks times slower than the stock implementation.

CPU	Test / Variant
	page					overlap					short					short2					short3					offset					offset2
	stock	SSE2	SSE2 aligned	AVX	ERMS	stock	SSE2	SSE2 aligned	AVX	ERMS	stock	SSE2	SSE2 aligned	AVX	ERMS	stock	SSE2	SSE2 aligned	AVX	ERMS	stock	SSE2	SSE2 aligned	AVX	ERMS	stock	SSE2	SSE2 aligned	AVX	ERMS	stock	SSE2	SSE2 aligned	AVX	ERMS
AMD FX-8120
AMD Opteron 6328	508	490	448	2533	487	1071	434	641	5086	6591	118	112	112	110	107	114	75	103	78	136	116	78	105	74	151	145	84	101	90	362	134	110	140	106	177
Intel Xeon X5365
Intel Xeon X5482	672	1424	312	--	760	664	1424	400	--	4232	144	112	112	--	112	80	48	48	--	128	80	48	48	--	144	248	120	120	--	448	200	72	200	--	192
Intel Xeon X5675	336	276	288	--	424	648	280	380	--	4244	116	96	96	--	92	60	46	69	--	112	60	46	76	--	128	76	46	56	--	360	64	40	192	--	152
Intel Core i5-2520M
Intel Core i5-2500K
Intel Xeon E5-2680	348	288	304	288	436	676	288	420	776	4276	128	100	100	100	100	64	24	40	24	120	64	28	40	28	136	88	28	44	24	296	68	40	196	36	164
Intel Xeon E5-2667 v2	384	288	336	288	292	708	288	452	740	4268	84	52	44	52	44	100	24	72	24	52	100	24	76	24	40	120	24	92	24	56	84	60	92	56	60

Conclusions

Early Notes

The first routine I worked on was memcpy().

Comparison of stock memcpy() of a single page on FreeBSD vs Linux on a Westmere (values are TSC deltas):

x fbsd/westmere/builtin
+ linux/builtin
    N           Min           Max        Median           Avg        Stddev
x 1000           336         18444           340       361.628     573.11483
+ 1000           276          9996           280       288.924     307.34136
Difference at 95.0% confidence
        -72.704 +/- 40.3074
        -20.1046% +/- 11.1461%
        (Student's t, pooled s = 459.847)
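
The figures in this output can be checked by hand. The working below assumes the standard pooled two-sample Student's t formulas and a critical value of about 1.96 for samples this large; the subscripts f (FreeBSD) and l (Linux) are labels introduced here for the two data sets.

```latex
\begin{align*}
s_p &= \sqrt{\frac{(n_f - 1)\,s_f^2 + (n_l - 1)\,s_l^2}{n_f + n_l - 2}}
     = \sqrt{\frac{999 \cdot 573.11483^2 + 999 \cdot 307.34136^2}{1998}}
     \approx 459.85 \\
\bar{x}_l - \bar{x}_f &= 288.924 - 361.628 = -72.704 \\
t \, s_p \sqrt{\frac{1}{n_f} + \frac{1}{n_l}}
    &\approx 1.96 \times 459.847 \times \sqrt{0.002} \approx 40.31 \\
\frac{\bar{x}_l - \bar{x}_f}{\bar{x}_f} &= \frac{-72.704}{361.628} \approx -20.10\%
\end{align*}
```

The interval of -72.704 +/- 40.31 does not include zero, so the difference between the two stock implementations is significant at the 95% level.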

Idea	Westmere	Sandy Bridge	Ivy Bridge	Penryn
Replace `dec` with `sub`	none	none	none
Use `movsd` instead of `movsq`	slightly slower	slightly slower	6% faster
Simple `movdqa` loop	138% slower	58% slower	46% slower
`movdqa` 32 at a time (old)	27% slower	14% faster	17% faster
`movdqa` 32 at a time (new)	27% slower	15% faster	18% faster
`movdqa` 32 at a time (reorder)	27% slower	16% faster	19% faster
`movdqa` 64 at a time (old)	224% slower	131% slower	116% slower
`movdqa` 64 at a time (new)	4 cycles slower	21% faster	24% faster
Intermix SSE and backwards tests	slightly slower	slightly slower	slightly slower
`movaps` 32 at a time	24% slower	18% faster	23% faster	52% faster
`movaps` 64 at a time	17% faster	23% faster	25% faster	48% faster

Takeaways from this trial:

Use `movaps` instead of `movdqa`, as `movdqa` carries an operand-size (0x66) prefix and is therefore one byte longer.
Minimize branches on the common path.
Unroll the copy loop.
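
As an illustration of these takeaways, the following is a minimal sketch of a copy loop that moves 64 bytes per iteration with aligned 16-byte loads and stores. It uses intrinsics rather than assembly, and the name `movaps64_copy_sketch` is hypothetical. `_mm_load_ps` and `_mm_store_ps` typically compile to `movaps`, but the compiler is free to choose a different encoding, so this is not a substitute for the hand-written routine.

```c
#include <stddef.h>
#include <xmmintrin.h>	/* SSE: _mm_load_ps / _mm_store_ps */

/*
 * Hypothetical sketch: src and dst must both be 16-byte aligned and
 * len must be a multiple of 64. One loop branch is taken per 64 bytes.
 */
static void
movaps64_copy_sketch(void *dst, const void *src, size_t len)
{
	char *d = dst;
	const char *s = src;

	for (size_t off = 0; off < len; off += 64) {
		/* Issue all four loads before the four stores. */
		__m128 a = _mm_load_ps((const float *)(s + off));
		__m128 b = _mm_load_ps((const float *)(s + off + 16));
		__m128 c = _mm_load_ps((const float *)(s + off + 32));
		__m128 e = _mm_load_ps((const float *)(s + off + 48));

		_mm_store_ps((float *)(d + off), a);
		_mm_store_ps((float *)(d + off + 16), b);
		_mm_store_ps((float *)(d + off + 32), c);
		_mm_store_ps((float *)(d + off + 48), e);
	}
}
```

A real routine would still need a prologue to reach alignment and an epilogue for lengths that are not a multiple of 64; both are omitted here to keep the common path visible.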

Now testing the overlap case:

Idea	Westmere	Sandy Bridge	Ivy Bridge	Penryn
`movaps` 64 at a time	56% faster	56% faster	56% faster	48% faster
Above, using `leaq`	50% faster	56% faster	60% faster	52% faster

Notes:

`leaq` seems to be in the noise: it was about 4 cycles faster on Ivy Bridge and possibly Sandy Bridge, 4-8 cycles slower on Westmere, and 8 cycles faster on Penryn.
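
For context on what the overlap case exercises, here is a minimal sketch of the usual direction test for an overlapping copy. It assumes the routine falls back to a backwards copy when the destination starts inside the source range; the names `needs_backward_copy` and `overlap_copy_sketch` are hypothetical, and the byte loops stand in for the vectorized loops measured above.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: nonzero when a forward copy of len bytes would
 * overwrite source bytes that have not been read yet. Unsigned
 * wraparound turns dst < src into a huge difference, so a single
 * compare covers both orderings.
 */
static int
needs_backward_copy(const void *dst, const void *src, size_t len)
{
	return ((uintptr_t)dst - (uintptr_t)src < len);
}

static void *
overlap_copy_sketch(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	if (needs_backward_copy(dst, src, len)) {
		/* Copy from the last byte down to the first. */
		while (len-- > 0)
			d[len] = s[len];
	} else {
		for (size_t i = 0; i < len; i++)
			d[i] = s[i];
	}
	return (dst);
}
```

Folding both orderings into one unsigned compare keeps the common, non-overlapping path down to a single branch.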
