| 17 | | ||= Idea =||= Westmere =||= Sandy Bridge =||= Ivy Bridge =||= Penryn =|| |
| 18 | | || Replace `dec` with `sub` || none || none || none || || |
| 19 | | || Use `movsd` instead of `movsq` || slightly slower || slightly slower || 6% faster || || |
| 20 | | || Simple `movdqa` loop || 138% slower || 58% slower || 46% slower || || |
| 21 | | || `movdqa` 32 at a time (old) || 27% slower || 14% faster || 17% faster || || |
| 22 | | || `movdqa` 32 at a time (new) || 27% slower || 15% faster || 18% faster || || |
| 23 | | || `movdqa` 32 at a time (reorder) || 27% slower || 16% faster || 19% faster || || |
| 24 | | || `movdqa` 64 at a time (old) || 224% slower || 131% slower || 116% slower || || |
| 25 | | || `movdqa` 64 at a time (new) || 4 cycles slower || 21% faster || 24% faster || || |
| 26 | | || Intermix SSE and backwards tests || slightly slower || slightly slower || slightly slower || || |
| 27 | | || `movaps` 32 at a time || 24% slower || 18% faster || 23% faster || 52% faster || |
| 28 | | || `movaps` 64 at a time || 17% faster || 23% faster || 25% faster || 48% faster || |
| 29 | | |
| 30 | | Takeaways from this trial: |
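| 31 | | - Use `movaps` instead of `movdqa`: `movdqa` carries an operand-size (0x66) prefix, so each instruction is one byte longer. |
| 32 | | - Minimize branches on the common path. |
| 33 | | - Unroll the copy loop (`movaps` 64 at a time was the only variant that was faster on every CPU tested). |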
| 34 | | |
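The takeaways above can be illustrated with a minimal C sketch. It is not the LibCSSE code: `copy64_aligned` is a hypothetical helper, and it assumes that "64 at a time" means 64 bytes per iteration, that both pointers are 16-byte aligned, that the length is a multiple of 64, and that the buffers do not overlap. On x86 compilers, `_mm_load_ps` and `_mm_store_ps` usually compile to `movaps`, which is why they are used here instead of the integer `_mm_load_si128` (`movdqa`) forms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE intrinsics; x86 / x86-64 only */

/* Hypothetical helper for illustration, not part of LibCSSE.
 * Copies n bytes from src to dst, 64 bytes per loop iteration.
 * Assumptions: dst and src are 16-byte aligned, n is a multiple
 * of 64, and the two buffers do not overlap. */
static void copy64_aligned(void *dst, const void *src, size_t n)
{
    float *d = (float *)dst;
    const float *s = (const float *)src;

    /* Aligned loads/stores fault on misaligned addresses, so check first. */
    assert(((uintptr_t)dst & 15) == 0);
    assert(((uintptr_t)src & 15) == 0);
    assert(n % 64 == 0);

    /* One branch per 64 bytes: four 16-byte loads, then four stores. */
    for (; n != 0; n -= 64) {
        __m128 r0 = _mm_load_ps(s);
        __m128 r1 = _mm_load_ps(s + 4);
        __m128 r2 = _mm_load_ps(s + 8);
        __m128 r3 = _mm_load_ps(s + 12);
        _mm_store_ps(d, r0);
        _mm_store_ps(d + 4, r1);
        _mm_store_ps(d + 8, r2);
        _mm_store_ps(d + 12, r3);
        s += 16; /* 16 floats = 64 bytes */
        d += 16;
    }
}
```

A real `memcpy` would also need a path for unaligned heads and tails and for lengths that are not a multiple of 64; those are left out so the unrolled loop stays visible. Whether the compiler emits exactly `movaps` depends on the compiler and flags, so check the generated assembly before drawing timing conclusions.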
| 35 | | Now testing the overlap case: |
| 36 | | |
| 37 | | ||= Idea =||= Westmere =||= Sandy Bridge =||= Ivy Bridge =||= Penryn =|| |
| 38 | | || `movaps` 64 at a time || 56% faster || 56% faster || 56% faster || 48% faster || |
| 39 | | || Above, using `leaq` || 50% faster || 56% faster || 60% faster || 52% faster || |
| 40 | | |
| 41 | | Notes: |
| 42 | | - `leaq` seems to be in the noise: it was about 4 cycles faster on Ivy Bridge and possibly Sandy Bridge, 4-8 cycles slower on Westmere, and 8 cycles faster on Penryn. |
| | 7 | * [[LibCSSE/memcpy|memcpy]] |
| | 8 | * [[LibCSSE/memset|memset]] |
| | 9 | * [[LibCSSE/strlen|strlen]] |