| Version 12 (modified by john, 12 years ago) |
|---|
strlen
Variants
| Name | Description |
|---|---|
| stock | machine-independent (MI) C version |
| SSE2 | pcmpeqb and pmovmskb |
| SSE4.2 | pcmpestri and pcmpestrm |
| AVX | 128-bit vpcmpeqb and vpmovmskb |
| ERMS | repne scasb for machines with ERMS |
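
To make the SSE2 row concrete, here is a minimal sketch of a strlen built on pcmpeqb and pmovmskb, written with compiler intrinsics. It is not the routine that was benchmarked: the name strlen_sse2_sketch is hypothetical, and the use of the GCC/clang builtin __builtin_ctz is an assumption. It also reads whole aligned 16-byte blocks, so it may touch bytes before and after the string within those blocks, as such routines commonly do.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the pcmpeqb/pmovmskb technique; not the
 * benchmarked implementation. */
size_t strlen_sse2_sketch(const char *s)
{
    const __m128i zero = _mm_setzero_si128();

    /* Round down to a 16-byte boundary so that every load is aligned
     * and never straddles a page boundary. */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned int skip = (unsigned int)(s - p);

    /* pcmpeqb: each byte lane equal to 0 becomes 0xFF.
     * pmovmskb: gather the top bit of each lane into a 16-bit mask. */
    __m128i eq = _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero);
    unsigned int mask = (unsigned int)_mm_movemask_epi8(eq);

    /* Discard matches in the bytes that precede the start of the string. */
    mask &= ~0u << skip;

    while (mask == 0) {
        p += 16;
        eq = _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero);
        mask = (unsigned int)_mm_movemask_epi8(eq);
    }

    /* The lowest set bit is the position of the first NUL in this block. */
    return (size_t)((p - s) + (ptrdiff_t)__builtin_ctz(mask));
}
```

The first block is handled outside the loop only because the bytes before the string start have to be masked off; every later block is scanned in full.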
Note: clang optimized plain strlen calls away entirely, so I created a copy of the C version named strlen_mi() for the benchmark to call instead.
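
One way to keep a compiler from folding such a call is sketched below, under stated assumptions: the function is given its own name, marked noinline (a GCC/clang attribute), and its result is stored to a volatile variable. The byte-at-a-time body is only a stand-in, since the real MI C version is not reproduced here, and the names strlen_mi_sketch, call_sink and call_once are hypothetical.

```c
#include <stddef.h>

/* Stand-in for a separately named C strlen. noinline keeps the call
 * opaque at the call site, so a call on a constant string cannot be
 * replaced by a compile-time constant. */
__attribute__((noinline))
size_t strlen_mi_sketch(const char *str)
{
    const char *s = str;

    while (*s != '\0')
        s++;
    return (size_t)(s - str);
}

/* Storing the result to a volatile object means the call's result is
 * always used, so the call cannot be discarded as dead code. */
volatile size_t call_sink;

void call_once(const char *s)
{
    call_sink = strlen_mi_sketch(s);
}
```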
Machines Tested
| CPU | Speed (GHz) | Notes |
|---|---|---|
| AMD FX-8120 | 3.11 | 1 x 8 zoo.freebsd.org |
| Intel Xeon X5365 | 3.00 | 2 x 4 Supermicro X7DBU |
| Intel Xeon X5482 | 3.20 | 2 x 4 Supermicro X7DWN+ |
| Intel Xeon X5675 | 3.07 | Westmere 2 x 6 Supermicro X8DTU |
| Intel Core i5-2520M | 2.50 | Sandy Bridge 1 x 4 Thinkpad X220 (4286) |
| Intel Core i5-2500K | 3.30 | Sandy Bridge 1 x 4 MSI Z77A-G45 (MS-7752) |
| Intel Xeon E5-2680 | 2.70 | Romley 2 x 8 Supermicro X9DRW |
| Intel Xeon E5-2667 v2 | 3.30 | Romley V2 2 x 8 Supermicro X9DRW |
Test Cases
| Name | Description |
|---|---|
| page | aligned string one page - 1 long |
| short | aligned string 14 characters long |
| short2 | aligned string 32 characters long |
| short3 | aligned string 48 characters long |
| offset | 4 byte offset string 126 characters long |
| offset2 | 7 byte offset string 95 characters long |
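
The inputs in the table above could be built along the following lines. This is a sketch under assumptions the page does not state: "aligned" is taken to mean page-aligned, the page size is taken to be 4096 bytes, and the fill character is arbitrary. The names struct test_case, make_case and PAGE_SIZE_ASSUMED are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE_ASSUMED 4096 /* assumption: 4 KiB pages */

struct test_case {
    char *base; /* start of the allocation; pass this to free() */
    char *str;  /* the string under test, at base + offset */
};

/* Build a NUL-terminated string of `len` characters that starts
 * `offset` bytes past a page-aligned address. On allocation failure
 * both pointers are NULL. */
struct test_case make_case(size_t offset, size_t len)
{
    struct test_case tc = { NULL, NULL };
    size_t need = offset + len + 1;

    /* C11 aligned_alloc wants a size that is a multiple of the alignment. */
    size_t size = (need + PAGE_SIZE_ASSUMED - 1)
        / PAGE_SIZE_ASSUMED * PAGE_SIZE_ASSUMED;

    tc.base = aligned_alloc(PAGE_SIZE_ASSUMED, size);
    if (tc.base == NULL)
        return tc;

    memset(tc.base, 0, size);
    tc.str = tc.base + offset;
    memset(tc.str, 'x', len);
    return tc;
}
```

With these assumptions, the page case would be make_case(0, PAGE_SIZE_ASSUMED - 1) and the offset case would be make_case(4, 126).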
Results
Each number is the minimum of a distribution of samples, where each sample is the TSC delta (in ticks) across a single invocation of the test.
Bold indicates the lowest time among the given variations in a Test and CPU combination. Green text is used for times faster than the stock implementation, and red text is used for times slower than the stock implementation.
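
The minimum-of-TSC-deltas measurement described above can be sketched as follows. This is not the harness that produced the table: the name min_tsc_delta and the runs parameter are hypothetical, __rdtsc() is taken from the GCC/clang header x86intrin.h, and no serializing instruction is placed around the timestamp reads, which a careful harness might add.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <x86intrin.h> /* __rdtsc() on GCC and clang, x86 only */

/* Keeps each measured call's result live so the call is not removed. */
volatile size_t tsc_sink;

/* Time `runs` single invocations of fn(s) and return the smallest TSC
 * delta observed. Returns UINT64_MAX if runs is not positive. */
uint64_t min_tsc_delta(size_t (*fn)(const char *), const char *s, int runs)
{
    uint64_t best = UINT64_MAX;

    for (int i = 0; i < runs; i++) {
        uint64_t t0 = __rdtsc();
        tsc_sink = fn(s);
        uint64_t t1 = __rdtsc();

        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Taking the minimum rather than the mean discards samples inflated by interrupts, cache misses, and other noise, which is presumably why the table reports it. A call such as min_tsc_delta(strlen, buf, 1000) would time the C library's strlen on a buffer buf.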
| CPU | page / stock | page / SSE2 | page / SSE4.2 | page / AVX | page / ERMS | short / stock | short / SSE2 | short / SSE4.2 | short / AVX | short / ERMS | short2 / stock | short2 / SSE2 | short2 / SSE4.2 | short2 / AVX | short2 / ERMS | short3 / stock | short3 / SSE2 | short3 / SSE4.2 | short3 / AVX | short3 / ERMS | offset / stock | offset / SSE2 | offset / SSE4.2 | offset / AVX | offset / ERMS | offset2 / stock | offset2 / SSE2 | offset2 / SSE4.2 | offset2 / AVX | offset2 / ERMS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Intel Xeon X5365 | 1386 | 864 | -- | -- | 16506 | 81 | 72 | -- | -- | 180 | 90 | 72 | -- | -- | 252 | 90 | 81 | -- | -- | 315 | 144 | 108 | -- | -- | 630 | 135 | 108 | -- | -- | 504 |
| Intel Xeon X5482 | 1608 | 808 | -- | -- | 16464 | 48 | 40 | -- | -- | 136 | 48 | 40 | -- | -- | 140 | 56 | 40 | -- | -- | 264 | 80 | 80 | -- | -- | 592 | 72 | 72 | -- | -- | 464 |
| Intel Xeon X5675 | 1592 | 848 | 2100 | -- | 8252 | 60 | 24 | 24 | -- | 92 | 32 | 46 | 32 | -- | 124 | 78 | 46 | 83 | -- | 156 | 76 | 64 | 104 | -- | 316 | 64 | 56 | 88 | -- | 256 |
| Intel Core i5-2520M | 3525 | 1950 | 6463 | 536 | 25812 | 100 | 75 | 75 | 18 | 300 | 112 | 75 | 87 | 18 | 412 | 112 | 75 | 112 | 21 | 512 | 212 | 100 | 262 | 21 | 1012 | 175 | 100 | 212 | 24 | 825 |
| Intel Core i5-2500K | 1002 | 552 | 1893 | 573 | 7350 | 21 | 18 | 18 | 18 | 87 | 33 | 18 | 18 | 18 | 117 | 36 | 21 | 27 | 21 | 147 | 57 | 24 | 69 | 21 | 297 | 45 | 21 | 54 | 24 | 225 |
| Intel Xeon E5-2680 | 1496 | 632 | 2048 | 644 | 8260 | 36 | 24 | 24 | 28 | 96 | 40 | 24 | 28 | 24 | 132 | 52 | 24 | 32 | 24 | 164 | 80 | 40 | 84 | 24 | 324 | 68 | 36 | 64 | 24 | 264 |
| Intel Xeon E5-2667 v2 | 1296 | 632 | 2076 | 648 | 8260 | 24 | 24 | 24 | 24 | 100 | 36 | 24 | 28 | 24 | 132 | 48 | 24 | 28 | 28 | 164 | 80 | 44 | 84 | 24 | 324 | 64 | 28 | 64 | 24 | 264 |
Conclusions
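- The SSE2 version is generally faster than the stock version. It is faster on the page test on every CPU, and slower in only one case (short2 on the Xeon X5675: 46 vs. 32).
- The SSE4.2 version is generally slower than the SSE2 version. On the page test it is also slower than the stock version on every CPU where it was run.
- The AVX version often outperforms the SSE2 version, most clearly on the offset test, where it is faster on every CPU where it was run. On the page test it is faster only on the Core i5-2520M.
- It seems that ERMS does not accelerate repne scasb: the ERMS variant is the slowest in every test on every CPU in the table.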
