| | 141 | |
| | 142 | * The short case performs about the same across all variants. The SSE/AVX variants are a few cycles slower because of an extra branch, but the difference is small enough to be lost in the noise, especially compared with the changes in the other tests. |
| | 143 | * ERMS is a definite win for machines that have it. |
| | 144 | * On the oldest Intel CPUs tested (Core 2), using {{{movdqu}}} on an aligned address is very slow, which is why SSE2 is slower for the page set. |
| | 145 | * 256-bit AVX is generally slower than the other SSE/AVX variants, but 128-bit AVX often shaves a few cycles. |
| | 146 | * SSE2 aligned is generally as fast or faster, so it is the best choice for the "default" case. We should probably include ERMS and 128-bit AVX versions as well. It would be nice to use an ifunc to enable the ERMS version, but an #ifdef for 128-bit AVX is probably fine. |
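
The runtime selection suggested in the last bullet could look roughly like the sketch below. It is illustrative only: the names {{{page_fill}}}, {{{fill_default}}}, {{{fill_erms}}} and {{{select_fill}}} are hypothetical, the routine is assumed to be a memset-style fill, and a lazily initialised function pointer stands in for a real ifunc resolver (with GCC on ELF targets the resolver would instead be attached using {{{__attribute__((ifunc("...")))}}}). The ERMS check reads CPUID leaf 7, subleaf 0, EBX bit 9.

```c
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#define HAVE_X86 1
#endif

/* Hypothetical signature: a memset-style fill routine (assumption). */
typedef void *(*fill_fn)(void *dst, int c, size_t n);

/* Baseline variant; stands in for the SSE2 aligned "default" version. */
static void *fill_default(void *dst, int c, size_t n)
{
    return memset(dst, c, n);
}

#ifdef HAVE_X86
/* ERMS variant: a single `rep stosb` (the ABI guarantees DF is clear). */
static void *fill_erms(void *dst, int c, size_t n)
{
    void *d = dst;
    __asm__ volatile("rep stosb"
                     : "+D"(d), "+c"(n)
                     : "a"(c)
                     : "memory");
    return dst;
}

/* ERMS is reported in CPUID.(EAX=7,ECX=0):EBX bit 9. */
static int cpu_has_erms(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx >> 9) & 1;
}
#endif

/* Plays the role an ifunc resolver would play: pick a variant once. */
static fill_fn select_fill(void)
{
#ifdef HAVE_X86
    if (cpu_has_erms())
        return fill_erms;
#endif
    return fill_default;
}

/* Public entry point; resolves on first call (not thread-safe as written). */
void *page_fill(void *dst, int c, size_t n)
{
    static fill_fn impl;
    if (!impl)
        impl = select_fill();
    return impl(dst, c, n);
}
```

With a real ifunc the selection happens once at load time and later calls pay no extra indirection beyond the PLT; the function pointer here only approximates that behaviour. A 128-bit AVX variant chosen at build time would simply replace {{{fill_default}}} under an #ifdef.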