Version 13 (modified by john, 12 years ago)

strlen

Variants

Name    Description
stock   machine-independent (MI) C version
SSE2    pcmpeqb and pmovmskb
SSE4.2  pcmpestri and pcmpestrm
AVX     128-bit vpcmpeqb and vpmovmskb
ERMS    repne scasb for machines with ERMS

Note: clang was too smart and optimized plain strlen calls away, so I had to create a copy of the C version called strlen_mi() to fool it.
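
As an illustration of the SSE2 row in the table above, here is a minimal sketch of a pcmpeqb/pmovmskb scan written with compiler intrinsics. It is not the LibCSSE code: the name strlen_sse2_sketch is hypothetical, __builtin_ctz assumes GCC or clang, and the aligned 16-byte loads deliberately read a few bytes outside the string (before its start and past its terminator), which is safe at the page level but outside what the C standard guarantees.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of an SSE2 strlen; not the LibCSSE implementation. */
size_t strlen_sse2_sketch(const char *s)
{
    assert(s != NULL);
    const __m128i zero = _mm_setzero_si128();

    /* Round down to a 16-byte boundary so an aligned load never crosses a page. */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned misalign = (unsigned)((uintptr_t)s & 15);

    /* pcmpeqb + pmovmskb: bit i of mask is set if byte i of the block is NUL. */
    __m128i block = _mm_load_si128((const __m128i *)p);
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(block, zero));

    /* Ignore any bytes that lie before the start of the string. */
    mask &= 0xFFFFu << misalign;

    while (mask == 0) {
        p += 16;
        block = _mm_load_si128((const __m128i *)p);
        mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(block, zero));
    }

    /* The lowest set bit is the offset of the first NUL within the block. */
    return (size_t)((p - s) + (ptrdiff_t)__builtin_ctz(mask));
}
```

Calling strlen_sse2_sketch("hello") returns 5. The 128-bit AVX variant would have the same shape with the VEX-encoded forms of the same two instructions.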

Machines Tested

CPU                    Speed (GHz)  Notes
AMD FX-8120            3.11         1 x 8, zoo.freebsd.org
Intel Xeon X5365       3.00         2 x 4, Supermicro X7DBU
Intel Xeon X5482       3.20         2 x 4, Supermicro X7DWN+
Intel Xeon X5675       3.07         Westmere, 2 x 6, Supermicro X8DTU
Intel Core i5-2520M    2.50         Sandy Bridge, 1 x 4, Thinkpad X220 (4286)
Intel Core i5-2500K    3.30         Sandy Bridge, 1 x 4, MSI Z77A-G45 (MS-7752)
Intel Xeon E5-2680     2.70         Romley, 2 x 8, Supermicro X9DRW
Intel Xeon E5-2667 v2  3.30         Romley V2, 2 x 8, Supermicro X9DRW (supports ERMS)

Test Cases

Name     Description
page     aligned string, one page minus 1 byte long
short    aligned string, 14 characters long
short2   aligned string, 32 characters long
short3   aligned string, 48 characters long
offset   string at a 4-byte offset, 126 characters long
offset2  string at a 7-byte offset, 95 characters long
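
The test strings above could be built along the following lines. This is only a sketch under stated assumptions: the helper name make_test_string is hypothetical, the page size is assumed to be 4096 bytes, "aligned" is taken to mean page aligned, and the filler character is arbitrary. The page does not say how the original harness allocated its inputs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE_ASSUMED 4096u  /* assumption: 4 KiB pages */

/*
 * Hypothetical helper: returns a string of `len` characters that starts
 * `offset` bytes into a page-aligned buffer. *base_out receives the
 * pointer that must later be passed to free().
 */
char *make_test_string(size_t offset, size_t len, void **base_out)
{
    assert(base_out != NULL);

    /* aligned_alloc (C11) wants a size that is a multiple of the alignment. */
    size_t need = offset + len + 1;
    size_t size = (need + PAGE_SIZE_ASSUMED - 1) / PAGE_SIZE_ASSUMED
                  * PAGE_SIZE_ASSUMED;

    char *base = aligned_alloc(PAGE_SIZE_ASSUMED, size);
    if (base == NULL)
        return NULL;

    /* Fill everything with a non-NUL byte, then terminate the string. */
    memset(base, 'x', size);
    base[offset + len] = '\0';

    *base_out = base;
    return base + offset;
}
```

Under these assumptions, make_test_string(0, 4095, &base) would correspond to the page case and make_test_string(4, 126, &base) to the offset case.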

Results

Each number is the minimum of the distribution of samples for that test, where each sample is the TSC delta across a single invocation of the test.
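
A minimal sketch of that measurement, assuming GCC or clang on x86 (for __rdtsc() from x86intrin.h). The helper name min_tsc_delta and the sample count are hypothetical, and no serializing instruction is placed around the reads because the page does not say whether the original harness used one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>     /* strlen, used as the function under test below */
#include <x86intrin.h>  /* __rdtsc() on GCC and clang */

/*
 * Hypothetical helper: time `samples` single invocations of fn(s) with
 * the TSC and return the smallest delta observed.
 */
uint64_t min_tsc_delta(size_t (*fn)(const char *), const char *s,
                       unsigned samples)
{
    assert(fn != NULL && samples > 0);

    uint64_t best = UINT64_MAX;
    volatile size_t sink = 0;  /* keeps the result of each call live */

    for (unsigned i = 0; i < samples; i++) {
        uint64_t start = __rdtsc();
        sink = fn(s);
        uint64_t end = __rdtsc();

        if (end - start < best)
            best = end - start;
    }

    (void)sink;
    return best;
}
```

For example, min_tsc_delta(strlen, "hello world", 1000) returns the smallest TSC delta seen across 1000 calls. Taking the minimum rather than the mean discards samples inflated by interrupts or cache misses.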

Bold indicates the lowest time among the given variations in a Test and CPU combination. Green text is used for times faster than the stock implementation, and red text is used for times slower than the stock implementation.

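page
CPU                    stock  SSE2  SSE4.2   AVX   ERMS
Intel Xeon X5365        1386   864      --    --  16506
Intel Xeon X5482        1608   808      --    --  16464
Intel Xeon X5675        1592   848    2100    --   8252
Intel Core i5-2520M     3525  1950    6463   536  25812
Intel Core i5-2500K     1002   552    1893   573   7350
Intel Xeon E5-2680      1496   632    2048   644   8260
Intel Xeon E5-2667 v2   1296   632    2076   648   8260

short
CPU                    stock  SSE2  SSE4.2   AVX   ERMS
Intel Xeon X5365          81    72      --    --    180
Intel Xeon X5482          48    40      --    --    136
Intel Xeon X5675          60    24      24    --     92
Intel Core i5-2520M      100    75      75    18    300
Intel Core i5-2500K       21    18      18    18     87
Intel Xeon E5-2680        36    24      24    28     96
Intel Xeon E5-2667 v2     24    24      24    24    100

short2
CPU                    stock  SSE2  SSE4.2   AVX   ERMS
Intel Xeon X5365          90    72      --    --    252
Intel Xeon X5482          48    40      --    --    140
Intel Xeon X5675          32    46      32    --    124
Intel Core i5-2520M      112    75      87    18    412
Intel Core i5-2500K       33    18      18    18    117
Intel Xeon E5-2680        40    24      28    24    132
Intel Xeon E5-2667 v2     36    24      28    24    132

short3
CPU                    stock  SSE2  SSE4.2   AVX   ERMS
Intel Xeon X5365          90    81      --    --    315
Intel Xeon X5482          56    40      --    --    264
Intel Xeon X5675          78    46      83    --    156
Intel Core i5-2520M      112    75     112    21    512
Intel Core i5-2500K       36    21      27    21    147
Intel Xeon E5-2680        52    24      32    24    164
Intel Xeon E5-2667 v2     48    24      28    28    164

offset
CPU                    stock  SSE2  SSE4.2   AVX   ERMS
Intel Xeon X5365         144   108      --    --    630
Intel Xeon X5482          80    80      --    --    592
Intel Xeon X5675          76    64     104    --    316
Intel Core i5-2520M      212   100     262    21   1012
Intel Core i5-2500K       57    24      69    21    297
Intel Xeon E5-2680        80    40      84    24    324
Intel Xeon E5-2667 v2     80    44      84    24    324

offset2
CPU                    stock  SSE2  SSE4.2   AVX   ERMS
Intel Xeon X5365         135   108      --    --    504
Intel Xeon X5482          72    72      --    --    464
Intel Xeon X5675          64    56      88    --    256
Intel Core i5-2520M      175   100     212    24    825
Intel Core i5-2500K       45    21      54    24    225
Intel Xeon E5-2680        68    36      64    24    264
Intel Xeon E5-2667 v2     64    28      64    24    264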

Conclusions

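  • The SSE2 version is generally faster than the stock version: it is at least as fast in every Test and CPU combination except short2 on the Xeon X5675.
  • The SSE4.2 version is generally slower than the SSE2 version, and on the page test it is also slower than the stock version on every CPU where it was measured.
  • The AVX version often outperforms the SSE2 version, most consistently on the offset test, where it is faster on every CPU where it was measured.
  • It seems that ERMS does not accelerate repne scasb: the repne scasb variant is the slowest in every test, and the Xeon E5-2667 v2, which supports ERMS, posts essentially the same times as the Xeon E5-2680.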
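
For reference, the repne scasb variant and a check for ERMS support can be sketched as follows. This is not the LibCSSE code: both function names are hypothetical, the inline assembly assumes GCC or clang on x86 with the direction flag clear (as the ABI requires at function entry), and the check relies on the __get_cpuid_count helper from those compilers' cpuid.h. ERMS is reported in CPUID leaf 7, subleaf 0, EBX bit 9.

```c
#include <assert.h>
#include <cpuid.h>    /* __get_cpuid_count (GCC and clang) */
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the CPU advertises ERMS. */
bool cpu_has_erms(void)
{
    unsigned eax, ebx, ecx, edx;

    /* __get_cpuid_count returns 0 when leaf 7 is not supported. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;

    return (ebx >> 9) & 1u;  /* CPUID.(EAX=7,ECX=0):EBX bit 9 */
}

/* Hypothetical sketch of a strlen built on repne scasb. */
size_t strlen_scasb_sketch(const char *s)
{
    assert(s != NULL);

    const char *p = s;
    size_t count = (size_t)-1;  /* effectively no length limit */

    /* Scan forward from p, comparing each byte with AL (0), until a match. */
    __asm__ volatile("repne scasb"
                     : "+D"(p), "+c"(count)
                     : "a"(0)
                     : "cc", "memory");

    /* After the scan, p points one byte past the terminating NUL. */
    return (size_t)(p - s) - 1;
}
```

Calling strlen_scasb_sketch("hello") returns 5 whether or not cpu_has_erms() is true; only the timing would be expected to differ, and the results above suggest it does not.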