| Version 24 (modified by john, 12 years ago) |
|---|
memset
Variants
| Name | Description |
|---|---|
| stock | MD amd64 version (rep stosq) |
| SSE2 | movups for block-store |
| SSE2 aligned | movaps for aligned block-store and movups for unaligned |
| AVX 128 | 128-bit vmovups for block-store |
| AVX 256 | 256-bit vmovups for block-store |
| ERMS | rep stosb for machines with ERMS |
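To make the "SSE2 aligned" row concrete, here is a minimal sketch written with compiler intrinsics rather than assembly. It is not the code that was benchmarked: the function name `memset_sse2_aligned` and the head/body/tail split are assumptions made for illustration. The unaligned store intrinsic typically compiles to movdqu or movups, and the aligned one to movdqa or movaps. A byte loop handles short lengths and also lets the sketch compile on machines without SSE2.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical illustration of an "SSE2 aligned" block store. */
void *memset_sse2_aligned(void *dst, int c, size_t len)
{
    unsigned char *p = dst;
    unsigned char *end = p + len;

#if defined(__SSE2__)
    if (len >= 16) {
        __m128i v = _mm_set1_epi8((char)c);

        /* Head: one unaligned 16-byte store at the start. */
        _mm_storeu_si128((__m128i *)p, v);
        /* Advance to the next 16-byte boundary (at most 16 bytes ahead). */
        p = (unsigned char *)(((uintptr_t)p + 16) & ~(uintptr_t)15);
        /* Body: aligned 16-byte stores. */
        while (end - p >= 16) {
            _mm_store_si128((__m128i *)p, v);
            p += 16;
        }
        /* Tail: one unaligned store overlapping the last 16 bytes. */
        _mm_storeu_si128((__m128i *)(end - 16), v);
        return dst;
    }
#endif
    /* Short lengths (or no SSE2): plain byte loop. */
    while (p < end)
        *p++ = (unsigned char)c;
    return dst;
}
```

The overlapping head and tail stores avoid a separate byte loop for the unaligned edges, at the cost of writing a few bytes twice.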
Note: clang was too smart and inlined all the short memset calls, so I had to create a copy of the amd64 version called memset_stock() to fool it.
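As a rough illustration of that workaround, the sketch below gives the routine a name the compiler does not treat as a builtin and marks it noinline. The real memset_stock() was a copy of the amd64 assembly; the C body here is only a stand-in, and the `volatile` pointer is an assumption added so the optimizer does not turn the loop back into a memset call.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for memset_stock(): same contract as memset(),
 * but under a name the compiler has no builtin knowledge of.
 */
__attribute__((noinline))
void *memset_stock(void *dst, int c, size_t len)
{
    /* volatile keeps the optimizer from recognizing the memset idiom. */
    volatile unsigned char *p = dst;

    while (len-- > 0)
        *p++ = (unsigned char)c;
    return dst;
}
```

A call such as `memset_stock(buf, 0xa5, 15)` then stays a real function call in the benchmark instead of being expanded into a handful of inline stores.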
Machines Tested
| CPU | Speed (GHz) | Notes |
|---|---|---|
| AMD FX-8120 | 3.11 | 1 x 8 zoo.freebsd.org |
| AMD Opteron 6328 | 3.20 | 2 x 8 Supermicro H8DG6/H8DGi |
| Intel Xeon X5365 | 3.00 | 2 x 4 Supermicro X7DBU |
| Intel Xeon X5482 | 3.20 | 2 x 4 Supermicro X7DWN+ |
| Intel Xeon X5675 | 3.07 | Westmere 2 x 6 Supermicro X8DTU |
| Intel Core i5-2520M | 2.50 | Sandy Bridge 1 x 4 Thinkpad X220 (4286) |
| Intel Core i5-2500K | 3.30 | Sandy Bridge 1 x 4 MSI Z77A-G45 (MS-7752) |
| Intel Xeon E5-2680 | 2.70 | Romley 2 x 8 Supermicro X9DRW |
| Intel Xeon E5-2667 v2 | 3.30 | Romley V2 2 x 8 Supermicro X9DRW (supports ERMS) |
Test Cases
| Name | Description |
|---|---|
| page | set page to 0xa5 |
| short | set aligned 15 bytes to 0xa5 |
| short2 | set aligned 32 bytes to 0xa5 |
| short3 | set aligned 48 bytes to 0xa5 |
| offset | set 128 bytes at a misaligned address (+4) to 0 |
| offset2 | set 97 bytes at a misaligned address (+7) to 0 |
Results
Each number is the minimum value in a distribution of TSC deltas, where each delta is measured across a single invocation of the test.
Bold indicates the lowest time among the given variations in a Test and CPU combination. Green text is used for times faster than the stock implementation, and red text is used for times slower than the stock implementation.
| CPU | page: stock | page: SSE2 | page: SSE2 aligned | page: AVX 128 | page: AVX 256 | page: ERMS | short: stock | short: SSE2 | short: SSE2 aligned | short: AVX 128 | short: AVX 256 | short: ERMS | short2: stock | short2: SSE2 | short2: SSE2 aligned | short2: AVX 128 | short2: AVX 256 | short2: ERMS | short3: stock | short3: SSE2 | short3: SSE2 aligned | short3: AVX 128 | short3: AVX 256 | short3: ERMS | offset: stock | offset: SSE2 | offset: SSE2 aligned | offset: AVX 128 | offset: AVX 256 | offset: ERMS | offset2: stock | offset2: SSE2 | offset2: SSE2 aligned | offset2: AVX 128 | offset2: AVX 256 | offset2: ERMS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AMD FX-8120 | 1078 | 987 | 972 | 974 | 3095 | 1009 | 157 | 161 | 157 | 157 | 157 | 157 | 188 | 99 | 90 | 97 | 91 | 248 | 203 | 89 | 119 | 95 | 119 | 290 | 265 | 89 | 96 | 97 | 148 | 469 | 221 | 122 | 122 | 120 | 144 | 469 |
| AMD Opteron 6328 | 490 | 446 | 454 | 454 | 2485 | 457 | 108 | 106 | 108 | 108 | 108 | 108 | 126 | 90 | 92 | 92 | 94 | 130 | 128 | 91 | 91 | 96 | 94 | 144 | 148 | 90 | 95 | 93 | 103 | 231 | 137 | 93 | 96 | 96 | 99 | 233 |
| Intel Xeon X5365 | 657 | 1206 | 378 | -- | -- | 720 | 144 | 144 | 144 | -- | -- | 144 | 126 | 90 | 90 | -- | -- | 162 | 126 | 99 | 90 | -- | -- | 171 | 243 | 135 | 135 | -- | -- | 252 | 126 | 108 | 108 | -- | -- | 225 |
| Intel Xeon X5482 | 624 | 1144 | 312 | -- | -- | 688 | 112 | 112 | 112 | -- | -- | 112 | 96 | 64 | 64 | -- | -- | 128 | 96 | 72 | 64 | -- | -- | 144 | 216 | 120 | 120 | -- | -- | 224 | 96 | 96 | 96 | -- | -- | 192 |
| Intel Xeon X5675 | 352 | 296 | 300 | -- | -- | 428 | 100 | 100 | 96 | -- | -- | 96 | 76 | 44 | 48 | -- | -- | 120 | 76 | 44 | 48 | -- | -- | 136 | 208 | 106 | 106 | -- | -- | 192 | 76 | 52 | 56 | -- | -- | 160 |
| Intel Core i5-2520M | 1812 | 962 | 962 | 950 | 1400 | 13100 | 337 | 350 | 350 | 350 | 350 | 337 | 237 | 162 | 162 | 150 | 600 | 400 | 237 | 162 | 162 | 150 | 612 | 450 | 687 | 187 | 187 | 187 | 625 | 700 | 237 | 187 | 187 | 187 | 637 | 612 |
| Intel Core i5-2500K | 321 | 285 | 285 | 282 | 417 | 411 | 81 | 84 | 84 | 84 | 84 | 81 | 57 | 39 | 39 | 36 | 171 | 96 | 57 | 39 | 39 | 36 | 174 | 135 | 192 | 45 | 45 | 45 | 177 | 180 | 57 | 45 | 45 | 45 | 180 | 156 |
| Intel Xeon E5-2680 | 356 | 308 | 308 | 304 | 448 | 436 | 108 | 112 | 112 | 112 | 112 | 108 | 76 | 52 | 52 | 48 | 196 | 128 | 76 | 52 | 52 | 52 | 196 | 144 | 220 | 60 | 60 | 60 | 204 | 208 | 76 | 60 | 60 | 60 | 208 | 172 |
| Intel Xeon E5-2667 v2 | 424 | 344 | 340 | 340 | 484 | 292 | 56 | 60 | 60 | 60 | 60 | 56 | 152 | 84 | 80 | 80 | 224 | 56 | 152 | 84 | 84 | 80 | 228 | 56 | 120 | 96 | 96 | 92 | 236 | 60 | 132 | 64 | 64 | 64 | 208 | 56 |
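The measurement behind these numbers, a minimum TSC delta over single invocations, could look roughly like the sketch below. This is not the harness that produced the table: `read_ticks`, `min_ticks`, and the run count are hypothetical names and parameters, no serializing instruction is placed around the TSC reads, and a `clock()` fallback is included only so the sketch compiles on non-x86 machines.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the time-stamp counter (not serialized in this sketch). */
static inline uint64_t read_ticks(void) { return __rdtsc(); }
#else
#include <time.h>
/* Fallback for non-x86 builds: coarse, not a cycle counter. */
static inline uint64_t read_ticks(void) { return (uint64_t)clock(); }
#endif

typedef void *(*memset_fn)(void *, int, size_t);

/* Time `runs` single invocations of fn and return the smallest delta. */
uint64_t min_ticks(memset_fn fn, void *buf, int c, size_t len, int runs)
{
    uint64_t best = UINT64_MAX;

    for (int i = 0; i < runs; i++) {
        uint64_t start = read_ticks();
        fn(buf, c, len);
        uint64_t delta = read_ticks() - start;
        if (delta < best)
            best = delta;
    }
    return best;
}
```

For example, `min_ticks(memset, page, 0xa5, 4096, 1000)` would correspond to the "page" test for whichever variant is passed as `fn`. Taking the minimum rather than the mean filters out runs disturbed by interrupts or cache misses.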
Conclusions
- The short case is essentially the same for all variants. The SSE/AVX variants are a few cycles slower due to an extra branch, but the delta is small enough to be in the noise, especially compared with the differences in the other tests.
- ERMS is a definite win for machines that have it.
- On the oldest Intel CPUs tested (Core 2: Xeon X5365 and X5482), using movdqu on an aligned address is very slow, which is why SSE2 is slower than stock for the page test on those machines.
- 256-bit AVX is generally slower than the other SSE/AVX variants, but 128-bit AVX often shaves a few cycles.
- SSE2 aligned is generally as fast or faster, so it is the best choice for the "default" case. ERMS and AVX 128 versions should probably be included as well. It would be nice to use an ifunc to enable the ERMS version, but an #ifdef is probably fine for AVX 128.
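A minimal sketch of the runtime selection suggested in the last point follows. It swaps the ifunc for a plain function pointer so the idea is visible without toolchain-specific attributes; an ifunc resolver would perform the same check once at load time. ERMS is reported in CPUID leaf 7, subleaf 0, EBX bit 9. The names `cpu_has_erms`, `memset_erms`, `memset_default`, and `select_memset` are hypothetical, and `memset_default` simply calls libc memset where a real build would use the SSE2 aligned variant.

```c
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* CPUID.(EAX=7,ECX=0):EBX bit 9 indicates ERMS. */
int cpu_has_erms(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid_max(0, NULL) < 7)
        return 0;
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return (ebx >> 9) & 1;
}

/* rep stosb: store AL to [RDI], RCX times (direction flag clear per ABI). */
void *memset_erms(void *dst, int c, size_t len)
{
    void *d = dst;

    __asm__ volatile("rep stosb"
                     : "+D"(d), "+c"(len)
                     : "a"(c)
                     : "memory");
    return dst;
}
#else
/* Non-x86 builds: no ERMS, fall through to the default. */
int cpu_has_erms(void) { return 0; }
#define memset_erms memset
#endif

typedef void *(*memset_fn)(void *, int, size_t);

/* Stand-in for the default variant (assumption: libc memset). */
void *memset_default(void *dst, int c, size_t len)
{
    return memset(dst, c, len);
}

/* Pick the implementation once, based on the CPU feature bit. */
memset_fn select_memset(void)
{
    return cpu_has_erms() ? memset_erms : memset_default;
}
```

A caller would store the result of `select_memset()` in a pointer at startup and call through it afterwards. The AVX 128 variant needs no such check if it is chosen at build time with an #ifdef.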
