Version 16 (modified by john, 12 years ago)

strlen

Variants

Name    Description
stock   MI (machine-independent) C version
SSE2    pcmpeqb and pmovmskb
SSE4.2  pcmpestri and pcmpestrm
AVX     128-bit vpcmpeqb and vpmovmskb
ERMS    repne scasb for machines with ERMS
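
To make the SSE2 row concrete, here is a minimal sketch of the pcmpeqb/pmovmskb approach written with compiler intrinsics. It is not the routine that was benchmarked: the name strlen_sse2_sketch is hypothetical, __builtin_ctz assumes GCC or clang, and the plain byte loop is only a fallback for targets without SSE2.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical name: a sketch of the pcmpeqb + pmovmskb technique,
 * not the routine that was benchmarked. */
size_t strlen_sse2_sketch(const char *s)
{
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    /* Round s down to a 16-byte boundary: an aligned 16-byte load never
     * crosses a page, so it cannot fault even when it reads past the NUL. */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned skip = (unsigned)((uintptr_t)s & 15);

    /* pcmpeqb sets each byte equal to zero to 0xFF; pmovmskb packs the
     * top bit of every byte into a 16-bit mask. */
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    mask &= 0xFFFFu << skip;   /* ignore bytes that precede s */

    while (mask == 0) {
        p += 16;
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    }
    /* The lowest set bit marks the first NUL byte in this block. */
    return (size_t)((p + __builtin_ctz(mask)) - s);
#else
    /* Fallback for non-SSE2 targets (assumption, for portability only). */
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
#endif
}
```

Building the same intrinsics with -mavx makes GCC and clang emit the VEX-encoded vpcmpeqb and vpmovmskb forms, which is one plausible way to obtain a 128-bit AVX variant from the same source.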

Note: clang was too smart and optimized plain strlen calls away, so I had to create a copy of the C version called strlen_mi() to fool it.
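
The workaround in the note can be sketched roughly as follows. Compilers treat strlen as a builtin and may fold or drop calls whose result they can predict or that goes unused; a copy under another name, kept out of line, avoids that. The byte loop here only stands in for the stock C source, which is not reproduced, and the noinline attribute is an assumption about how such a copy might be protected.

```c
#include <stddef.h>

/* A plain C strlen under a name the compiler does not treat as a builtin.
 * noinline (GCC/clang attribute) keeps calls from being inlined and folded. */
__attribute__((noinline))
size_t strlen_mi(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

Storing each result in a volatile variable also keeps the call alive, and compiling the benchmark with -fno-builtin-strlen is another option on GCC and clang.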

Machines Tested

CPU                    Speed (GHz)  Notes
AMD FX-8120            3.11         1 x 8, zoo.freebsd.org
AMD Opteron 6328       3.20         2 x 8, Supermicro H8DG6/H8DGi
Intel Xeon X5365       3.00         2 x 4, Supermicro X7DBU
Intel Xeon X5482       3.20         2 x 4, Supermicro X7DWN+
Intel Xeon X5675       3.07         Westmere, 2 x 6, Supermicro X8DTU
Intel Core i5-2520M    2.50         Sandy Bridge, 1 x 4, Thinkpad X220 (4286)
Intel Core i5-2500K    3.30         Sandy Bridge, 1 x 4, MSI Z77A-G45 (MS-7752)
Intel Xeon E5-2680     2.70         Romley, 2 x 8, Supermicro X9DRW
Intel Xeon E5-2667 v2  3.30         Romley V2, 2 x 8, Supermicro X9DRW (supports ERMS)

Test Cases

Name     Description
page     aligned string, one page minus one character long
short    aligned string, 14 characters long
short2   aligned string, 32 characters long
short3   aligned string, 48 characters long
offset   string at a 4-byte offset, 126 characters long
offset2  string at a 7-byte offset, 95 characters long
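
One way the test strings above could be built is sketched below. Several details are assumptions rather than facts from this page: a 4096-byte page, "aligned" taken to mean page-aligned, offsets measured from that aligned base, and the names test_cases and make_test_string, which are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE_ASSUMED 4096   /* assumption: 4 KiB pages */

struct test_case {
    const char *name;
    size_t offset;   /* bytes past the aligned base */
    size_t len;      /* string length, excluding the NUL */
};

static const struct test_case test_cases[] = {
    { "page",    0, PAGE_SIZE_ASSUMED - 1 },
    { "short",   0, 14 },
    { "short2",  0, 32 },
    { "short3",  0, 48 },
    { "offset",  4, 126 },
    { "offset2", 7, 95 },
};

/* Returns the test string; *base_out receives the allocation to free().
 * Returns NULL if the allocation fails. */
char *make_test_string(const struct test_case *tc, void **base_out)
{
    /* Two pages, so offset + len + 1 always fits. */
    void *base = aligned_alloc(PAGE_SIZE_ASSUMED, 2 * PAGE_SIZE_ASSUMED);
    if (base == NULL)
        return NULL;
    char *s = (char *)base + tc->offset;
    memset(s, 'a', tc->len);
    s[tc->len] = '\0';
    *base_out = base;
    return s;
}
```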

Results

Each number is the minimum of a distribution of samples, where each sample is the TSC delta across a single invocation of the test. Lower is better; -- marks a variant that the CPU does not support.
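
The measurement described above could look roughly like the sketch below. The actual harness is not shown on this page, so the names read_ticks, min_ticks and sink are hypothetical, no serializing instruction is placed around the TSC reads, and the timespec_get fallback (nanoseconds rather than TSC ticks) exists only so the sketch builds on non-x86 targets.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
static inline uint64_t read_ticks(void)
{
    return __rdtsc();   /* read the time stamp counter */
}
#else
#include <time.h>
/* Assumed fallback for non-x86 targets: wall-clock nanoseconds. */
static inline uint64_t read_ticks(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

typedef size_t (*strlen_fn)(const char *);

/* Volatile sink so the measured call cannot be optimized away. */
static volatile size_t sink;

/* Time `runs` single invocations of fn(s) and return the smallest delta. */
uint64_t min_ticks(strlen_fn fn, const char *s, int runs)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = read_ticks();
        sink = fn(s);
        uint64_t t1 = read_ticks();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Taking the minimum rather than the mean discards samples inflated by interrupts, cache misses and frequency changes, which is presumably why the table reports it.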

Bold indicates the lowest time among the given variations in a Test and CPU combination. Green text is used for times faster than the stock implementation, and red text is used for times slower than the stock implementation.

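Test: page

CPU                     stock   SSE2  SSE4.2    AVX   ERMS
AMD FX-8120              3899   2017    7245   2006  33064
AMD Opteron 6328         1510    821    2932    585  13543
Intel Xeon X5365         1386    864      --     --  16506
Intel Xeon X5482         1608    808      --     --  16464
Intel Xeon X5675         1592    848    2100     --   8252
Intel Core i5-2520M      3525   1950    6463    536  25812
Intel Core i5-2500K      1002    552    1893    573   7350
Intel Xeon E5-2680       1496    632    2048    644   8260
Intel Xeon E5-2667 v2    1296    632    2076    648   8260

Test: short

CPU                     stock   SSE2  SSE4.2    AVX   ERMS
AMD FX-8120               139     81      95     82    318
AMD Opteron 6328           78     78      80     73    143
Intel Xeon X5365           81     72      --     --    180
Intel Xeon X5482           48     40      --     --    136
Intel Xeon X5675           60     24      24     --     92
Intel Core i5-2520M       100     75      75     18    300
Intel Core i5-2500K        21     18      18     18     87
Intel Xeon E5-2680         36     24      24     28     96
Intel Xeon E5-2667 v2      24     24      24     24    100

Test: short2

CPU                     stock   SSE2  SSE4.2    AVX   ERMS
AMD FX-8120               168     82     157     81    430
AMD Opteron 6328           90     70      86     73    202
Intel Xeon X5365           90     72      --     --    252
Intel Xeon X5482           48     40      --     --    140
Intel Xeon X5675           32     46      32     --    124
Intel Core i5-2520M       112     75      87     18    412
Intel Core i5-2500K        33     18      18     18    117
Intel Xeon E5-2680         40     24      28     24    132
Intel Xeon E5-2667 v2      36     24      28     24    132

Test: short3

CPU                     stock   SSE2  SSE4.2    AVX   ERMS
AMD FX-8120               195     86     183     95    551
AMD Opteron 6328           93     73      97     79    253
Intel Xeon X5365           90     81      --     --    315
Intel Xeon X5482           56     40      --     --    264
Intel Xeon X5675           78     46      83     --    156
Intel Core i5-2520M       112     75     112     21    512
Intel Core i5-2500K        36     21      27     21    147
Intel Xeon E5-2680         52     24      32     24    164
Intel Xeon E5-2667 v2      48     24      28     28    164

Test: offset

CPU                     stock   SSE2  SSE4.2    AVX   ERMS
AMD FX-8120               316    150     354    155   1184
AMD Opteron 6328          126     96     166     83    511
Intel Xeon X5365          144    108      --     --    630
Intel Xeon X5482           80     80      --     --    592
Intel Xeon X5675           76     64     104     --    316
Intel Core i5-2520M       212    100     262     21   1012
Intel Core i5-2500K        57     24      69     21    297
Intel Xeon E5-2680         80     40      84     24    324
Intel Xeon E5-2667 v2      80     44      84     24    324

Test: offset2

CPU                     stock   SSE2  SSE4.2    AVX   ERMS
AMD FX-8120               241    137     307    137    938
AMD Opteron 6328          114     83     143     79    412
Intel Xeon X5365          135    108      --     --    504
Intel Xeon X5482           72     72      --     --    464
Intel Xeon X5675           64     56      88     --    256
Intel Core i5-2520M       175    100     212     24    825
Intel Core i5-2500K        45     21      54     24    225
Intel Xeon E5-2680         68     36      64     24    264
Intel Xeon E5-2667 v2      64     28      64     24    264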

Conclusions

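  • The SSE2 version is generally faster than the stock version: it is slower only on short2 on the X5675 (46 vs. 32), and it cuts the page time by roughly 40 to 60 percent on every machine.
  • The SSE4.2 version is generally slower than the SSE2 version, and on the page test it is slower than the stock version on every machine that supports it.
  • The AVX version often matches or outperforms the SSE2 version, but not consistently: it is the fastest variant in every test on the i5-2520M, yet slightly slower than SSE2 on the page test on the i5-2500K and both E5 Xeons.
  • ERMS does not appear to accelerate repne scasb: the E5-2667 v2, the only machine listed as supporting ERMS, posts essentially the same repne scasb times as the E5-2680, and repne scasb is the slowest variant in every test on every machine.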
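
For reference, the repne scasb variant and an ERMS check might be written as in the sketch below. ERMS stands for Enhanced REP MOVSB/STOSB and is reported in CPUID leaf 7 (sub-leaf 0), EBX bit 9, so it is perhaps unsurprising that it leaves scasb unchanged. The names strlen_scasb_sketch and cpu_has_erms are hypothetical, the inline assembly assumes GCC or clang on x86 with the direction flag clear (as the ABI requires), and the non-x86 branches are fallbacks only.

```c
#include <stddef.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* __get_cpuid_count (GCC and clang) */
#endif

/* Hypothetical sketch of a repne scasb strlen, not the benchmarked code. */
size_t strlen_scasb_sketch(const char *s)
{
#if defined(__x86_64__) || defined(__i386__)
    const char *p = s;
    size_t count = (size_t)-1;   /* effectively unbounded scan */
    /* Scan bytes at [p] for AL == 0; p ends one byte past the NUL. */
    __asm__ volatile("repne scasb"
                     : "+D"(p), "+c"(count)
                     : "a"(0)
                     : "cc", "memory");
    return (size_t)(p - s) - 1;
#else
    /* Fallback for non-x86 targets (assumption, for portability only). */
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
#endif
}

/* Returns 1 if CPUID reports ERMS, 0 otherwise. */
int cpu_has_erms(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;            /* leaf 7 not available */
    return (int)((ebx >> 9) & 1u);
#else
    return 0;
#endif
}
```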