wiki:LibCSSE/memset

Version 24 (modified by john, 12 years ago) (diff)

memset

Variants

Name	Description
stock	machine-dependent (MD) amd64 version using `rep stosq`
SSE2	`movups` for block-store
SSE2 aligned	`movaps` for aligned block-store and `movups` for unaligned
AVX 128	128-bit `vmovups` for block-store
AVX 256	256-bit `vmovups` for block-store
ERMS	`rep stosb` for machines with ERMS (Enhanced REP MOVSB/STOSB)
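
To make the "SSE2 aligned" row concrete, here is a minimal C sketch of the idea: one unaligned store for the head, aligned stores for the body, and one overlapping unaligned store for the tail. It is not the routine that was benchmarked (that one is assembly); the function name is hypothetical, and the compiler decides the exact store encoding (`movups`/`movaps` or `movdqu`/`movdqa`). On targets without SSE2 the sketch falls back to a byte loop.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical name; illustrates the "SSE2 aligned" strategy only. */
void *memset_sse2_aligned_sketch(void *dst, int c, size_t len)
{
    unsigned char *p = dst;
#if defined(__SSE2__)
    if (len >= 16) {
        const __m128i v = _mm_set1_epi8((char)c);
        unsigned char *end = p + len;
        /* Unaligned store covers the (possibly misaligned) first 16 bytes. */
        _mm_storeu_si128((__m128i *)p, v);
        /* First 16-byte boundary strictly after p. */
        unsigned char *q =
            (unsigned char *)(((uintptr_t)p + 16) & ~(uintptr_t)15);
        /* Aligned stores for the body. */
        for (; (size_t)(end - q) >= 16; q += 16)
            _mm_store_si128((__m128i *)q, v);
        /* Unaligned store covers the last 16 bytes; it may overlap
           bytes that were already written. */
        _mm_storeu_si128((__m128i *)(end - 16), v);
        return dst;
    }
#endif
    /* Short lengths (and non-SSE2 targets): plain byte loop. */
    while (len-- > 0)
        *p++ = (unsigned char)c;
    return dst;
}
```

The extra `len >= 16` test is the kind of additional branch that can cost a few cycles on very short sets.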

Note: clang was too smart and inlined all the short memset calls, so I had to create a copy of the amd64 version called memset_stock() to fool it.
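
The fix used here was a renamed copy of the assembly routine. A different technique with a similar effect, shown only as an illustration and not what was done for these tests, is to call `memset` through a volatile function pointer so the compiler cannot expand the call inline. The names `memset_opaque` and `fill_short` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

typedef void *(*memset_fn)(void *, int, size_t);

/* The pointer is volatile, so it is reloaded at every call and the
   compiler cannot assume the target is memset and inline it. */
static memset_fn volatile memset_opaque = memset;

/* Mirrors the "short" test case: set 15 bytes to 0xa5. */
void fill_short(unsigned char *buf)
{
    memset_opaque(buf, 0xa5, 15);
}
```

Unlike the renamed copy, this adds an indirect call to every invocation, which would itself show up in a cycle-level measurement.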

Machines Tested

CPU	Speed (GHz)	Notes
AMD FX-8120	3.11	1 x 8 zoo.freebsd.org
AMD Opteron 6328	3.20	2 x 8 Supermicro H8DG6/H8DGi
Intel Xeon X5365	3.00	2 x 4 Supermicro X7DBU
Intel Xeon X5482	3.20	2 x 4 Supermicro X7DWN+
Intel Xeon X5675	3.07	Westmere 2 x 6 Supermicro X8DTU
Intel Core i5-2520M	2.50	Sandy Bridge 1 x 4 Thinkpad X220 (4286)
Intel Core i5-2500K	3.30	Sandy Bridge 1 x 4 MSI Z77A-G45 (MS-7752)
Intel Xeon E5-2680	2.70	Romley 2 x 8 Supermicro X9DRW
Intel Xeon E5-2667 v2	3.30	Romley V2 2 x 8 Supermicro X9DRW (supports ERMS)

Test Cases

Name	Description
page	set page to 0xa5
short	set aligned 15 bytes to 0xa5
short2	set aligned 32 bytes to 0xa5
short3	set aligned 48 bytes to 0xa5
offset	set misaligned ( + 4) 128 bytes to 0
offset2	set misaligned ( + 7) 97 bytes to 0

Results

Each number is the minimum of the measured distribution, where each sample is the TSC delta across a single invocation of the test. Lower is better.
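
A minimal sketch of this measurement, assuming a GCC or clang toolchain: read the TSC before and after one invocation and keep the smallest delta over many runs. The actual harness is not shown on this page, so `read_ticks`, `min_ticks`, `test_page` and the 4096-byte page size are assumptions for illustration. On non-x86 targets the sketch falls back to a nanosecond clock instead of the TSC.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
static uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Fallback for non-x86 targets: wall-clock nanoseconds, not a TSC. */
static uint64_t read_ticks(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

typedef void (*test_fn)(void *arg);

/* Returns the smallest tick delta seen across `runs` single invocations. */
uint64_t min_ticks(test_fn fn, void *arg, size_t runs)
{
    uint64_t best = UINT64_MAX;
    for (size_t i = 0; i < runs; i++) {
        uint64_t start = read_ticks();
        fn(arg);
        uint64_t delta = read_ticks() - start;
        if (delta < best)
            best = delta;
    }
    return best;
}

/* The "page" test case, assuming a 4096-byte page. */
static unsigned char page_buf[4096];

void test_page(void *arg)
{
    (void)arg;
    memset(page_buf, 0xa5, sizeof page_buf);
}
```

Calling `min_ticks(test_page, NULL, 1000)` would give one cell of the table. Taking the minimum rather than the mean filters out samples inflated by interrupts or cache misses.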

Bold indicates the lowest time among the given variations in a Test and CPU combination. Green text is used for times faster than the stock implementation, and red text is used for times slower than the stock implementation.

CPU	Test / Variant
	page						short						short2						short3						offset						offset2
	stock	SSE2	SSE2 aligned	AVX 128	AVX 256	ERMS	stock	SSE2	SSE2 aligned	AVX 128	AVX 256	ERMS	stock	SSE2	SSE2 aligned	AVX 128	AVX 256	ERMS	stock	SSE2	SSE2 aligned	AVX 128	AVX 256	ERMS	stock	SSE2	SSE2 aligned	AVX 128	AVX 256	ERMS	stock	SSE2	SSE2 aligned	AVX 128	AVX 256	ERMS
AMD FX-8120	1078	987	972	974	3095	1009	157	161	157	157	157	157	188	99	90	97	91	248	203	89	119	95	119	290	265	89	96	97	148	469	221	122	122	120	144	469
AMD Opteron 6328	490	446	454	454	2485	457	108	106	108	108	108	108	126	90	92	92	94	130	128	91	91	96	94	144	148	90	95	93	103	231	137	93	96	96	99	233
Intel Xeon X5365	657	1206	378	--	--	720	144	144	144	--	--	144	126	90	90	--	--	162	126	99	90	--	--	171	243	135	135	--	--	252	126	108	108	--	--	225
Intel Xeon X5482	624	1144	312	--	--	688	112	112	112	--	--	112	96	64	64	--	--	128	96	72	64	--	--	144	216	120	120	--	--	224	96	96	96	--	--	192
Intel Xeon X5675	352	296	300	--	--	428	100	100	96	--	--	96	76	44	48	--	--	120	76	44	48	--	--	136	208	106	106	--	--	192	76	52	56	--	--	160
Intel Core i5-2520M	1812	962	962	950	1400	13100	337	350	350	350	350	337	237	162	162	150	600	400	237	162	162	150	612	450	687	187	187	187	625	700	237	187	187	187	637	612
Intel Core i5-2500K	321	285	285	282	417	411	81	84	84	84	84	81	57	39	39	36	171	96	57	39	39	36	174	135	192	45	45	45	177	180	57	45	45	45	180	156
Intel Xeon E5-2680	356	308	308	304	448	436	108	112	112	112	112	108	76	52	52	48	196	128	76	52	52	52	196	144	220	60	60	60	204	208	76	60	60	60	208	172
Intel Xeon E5-2667 v2	424	344	340	340	484	292	56	60	60	60	60	56	152	84	80	80	224	56	152	84	84	80	228	56	120	96	96	92	236	60	132	64	64	64	208	56

Conclusions

The short case is basically the same across all variants. The SSE/AVX variants are a few cycles slower due to an extra branch, but the delta is small enough to be in the noise, especially compared with the changes in the other tests.
ERMS is a definite win for machines that have it (of the machines tested, only the Xeon E5-2667 v2).
On the oldest Intel CPUs tested (Core 2: Xeon X5365 and X5482), an unaligned store (`movups`) is very slow even when the address is aligned, which is why the SSE2 variant is slower than stock for the page test.
256-bit AVX is generally slower than the other SSE/AVX variants, but 128-bit AVX often shaves a few cycles.
SSE2 aligned is generally faster or at least no slower, so it is the best choice for the "default" case. The ERMS and AVX 128 versions should probably be included as well. It would be nice to use an ifunc to enable the ERMS version, but an #ifdef for AVX 128 is probably fine.
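
The runtime selection suggested in the last point can be sketched as follows. An ifunc resolver would make the choice once at load time; this sketch uses a plain function pointer instead so that it does not depend on GNU ifunc support. It assumes a GCC or clang toolchain (`<cpuid.h>`, GNU inline asm). ERMS is reported in CPUID leaf 7, subleaf 0, EBX bit 9. All function names are hypothetical, and `memset_default` simply stands in for whichever variant is the default.

```c
#include <stddef.h>
#include <string.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

typedef void *(*memset_fn)(void *, int, size_t);

/* Returns 1 if CPUID.(EAX=7,ECX=0):EBX bit 9 (ERMS) is set, else 0. */
int cpu_has_erms(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid_max(0, NULL) < 7)
        return 0;
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return (int)((ebx >> 9) & 1);
#else
    return 0;
#endif
}

/* ERMS variant: a single rep stosb (direction flag is clear per the ABI). */
static void *memset_erms(void *dst, int c, size_t len)
{
#if defined(__x86_64__) || defined(__i386__)
    void *d = dst;
    __asm__ volatile("rep stosb"
                     : "+D"(d), "+c"(len)
                     : "a"(c)
                     : "memory");
    return dst;
#else
    return memset(dst, c, len);
#endif
}

/* Placeholder for the default variant (e.g. SSE2 aligned). */
static void *memset_default(void *dst, int c, size_t len)
{
    return memset(dst, c, len);
}

static memset_fn memset_impl;

/* Picks the implementation on first use; not thread-safe as written. */
void *memset_dispatch(void *dst, int c, size_t len)
{
    if (memset_impl == NULL)
        memset_impl = cpu_has_erms() ? memset_erms : memset_default;
    return memset_impl(dst, c, len);
}
```

With an ifunc the per-call NULL check disappears, because the dynamic linker binds the symbol to the chosen implementation before the first call.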
