= memcpy =

== Variants ==

||= '''Name''' =||= '''Description''' =||
|| stock || machine-dependent (MD) amd64 version using {{{rep movsq}}} ||
|| SSE2 || {{{movdqu}}} for block-copy ||
|| SSE2 aligned || align the source so that loads always use {{{movaps}}}; store with {{{movaps}}} for an aligned destination and {{{movdqu}}} for an unaligned destination ||
|| AVX || 256-bit {{{vmovdqu}}} for block-copy with 128-byte block as common loop ||
|| ERMS || {{{rep movsb}}} for machines with ERMS ||

== Machines Tested ==

||= '''CPU''' =||= '''Speed (GHz)''' =||= '''Notes''' =||
|| AMD FX-8120 || 3.11 || 1 x 8 zoo.freebsd.org ||
|| AMD Opteron 6328 || 3.20 || 2 x 8 Supermicro H8DG6/H8DGi ||
|| Intel Xeon X5365 || 3.00 || 2 x 4 Supermicro X7DBU ||
|| Intel Xeon X5482 || 3.20 || 2 x 4 Supermicro X7DWN+ ||
|| Intel Xeon X5675 || 3.07 || Westmere 2 x 6 Supermicro X8DTU ||
|| Intel Core i5-2520M || 2.50 || Sandy Bridge 1 x 4 Thinkpad X220 (4286) ||
|| Intel Core i5-2500K || 3.30 || Sandy Bridge 1 x 4 MSI Z77A-G45 (MS-7752) ||
|| Intel Xeon E5-2680 || 2.70 || Romley 2 x 8 Supermicro X9DRW ||
|| Intel Xeon E5-2667 v2 || 3.30 || Romley V2 2 x 8 Supermicro X9DRW (supports ERMS) ||

== Test Cases ==

||= '''Name''' =||= '''Description''' =||
|| page || copy aligned page to aligned page ||
|| overlap || overlapping copy of page - 16 bytes within a page ||
|| short || aligned copy of 15 bytes ||
|| short2 || aligned copy of 32 bytes ||
|| short3 || aligned copy of 48 bytes ||
|| offset || 4 byte offset source copy of 128 bytes ||
|| offset2 || 7 byte offset source and destination copy of 97 bytes ||

== Results ==

Each number is the minimum of a distribution of samples, where each sample is the TSC delta across a single invocation of the test. Bold indicates the lowest time among the variants for a given test and CPU combination. Green text marks times faster than the stock implementation, and red text marks times slower than the stock implementation.
{{{#!th rowspan=3
'''CPU'''
}}}
{{{#!th colspan=35
'''Test / Variant'''
}}}
|--
{{{#!th colspan=5
'''page'''
}}}
{{{#!th colspan=5
'''overlap'''
}}}
{{{#!th colspan=5
'''short'''
}}}
{{{#!th colspan=5
'''short2'''
}}}
{{{#!th colspan=5
'''short3'''
}}}
{{{#!th colspan=5
'''offset'''
}}}
{{{#!th colspan=5
'''offset2'''
}}}
|--
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =|| \
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =|| \
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =|| \
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =|| \
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =|| \
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =|| \
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =||
|| AMD FX-8120 ||
|| AMD Opteron 6328 || \
|| 508|| [[span(490, style=color: green)]]|| '''[[span(448, style=color: green)]]'''|| [[span(2533, style=color:red)]]|| [[span(487, style=color: green)]]|| \
|| 1071|| '''[[span(434, style=color: green)]]'''|| [[span(641, style=color: green)]]|| [[span(5086, style=color:red)]]|| [[span(6591, style=color:red)]]|| \
|| 118|| [[span(112, style=color: green)]]|| [[span(112, style=color: green)]]|| [[span(110, style=color: green)]]|| '''[[span(107, style=color: green)]]'''|| \
|| 114|| '''[[span(75, style=color: green)]]'''|| [[span(103, style=color: green)]]|| [[span(78, style=color: green)]]|| [[span(136, style=color:red)]]|| \
|| 116|| [[span(78, style=color: green)]]|| [[span(105, style=color: green)]]|| '''[[span(74, style=color: green)]]'''|| [[span(151, style=color:red)]]|| \
|| 145|| '''[[span(84, style=color: green)]]'''|| [[span(101, style=color: green)]]|| [[span(90, style=color: green)]]|| [[span(362, style=color:red)]]|| \
|| 134|| [[span(110, style=color: green)]]|| [[span(140, style=color:red)]]|| '''[[span(106, style=color: green)]]'''|| [[span(177, style=color:red)]]||
|| Intel Xeon X5365 ||
|| Intel Xeon X5482 || \
|| 672|| 1424|| '''312'''|| -- || 760|| \
|| 664|| 1424|| '''400'''|| -- || 4232|| \
|| 144|| '''112'''|| '''112'''|| -- || '''112'''|| \
|| 80|| '''48'''|| '''48'''|| -- || 128|| \
|| 80|| '''48'''|| '''48'''|| -- || 144|| \
|| 248|| '''120'''|| '''120'''|| -- || 448|| \
|| 200|| '''72''' || 200|| -- || 192||
|| Intel Xeon X5675 || \
|| 336|| '''276'''|| 288|| -- || 424|| \
|| 648|| '''280'''|| 380|| -- || 4244|| \
|| 116|| 96|| 96|| -- || '''92'''|| \
|| 60|| '''46'''|| 69|| -- || 112|| \
|| 60|| '''46'''|| 76|| -- || 128|| \
|| 76|| '''46'''|| 56|| -- || 360|| \
|| 64|| '''40'''|| 192|| -- || 152||
|| Intel Core i5-2520M ||
|| Intel Core i5-2500K ||
|| Intel Xeon E5-2680 || \
|| 348|| '''288'''|| 304|| '''288'''|| 436|| \
|| 676|| '''288'''|| 420|| 776|| 4276|| \
|| 128|| '''100'''|| '''100'''|| '''100'''|| '''100'''|| \
|| 64|| '''24'''|| 40|| '''24'''|| 120|| \
|| 64|| '''28'''|| 40|| '''28'''|| 136|| \
|| 88|| 28|| 44|| '''24'''|| 296|| \
|| 68|| 40|| 196|| '''36'''|| 164||
|| Intel Xeon E5-2667 v2 || \
|| 384|| '''288'''|| 336|| '''288'''|| 292|| \
|| 708|| '''288'''|| 452|| 740|| 4268|| \
|| 84|| 52|| '''44'''|| 52|| '''44'''|| \
|| 100|| '''24'''|| 72|| '''24'''|| 52|| \
|| 100|| '''24'''|| 76|| '''24'''|| 40|| \
|| 120|| '''24'''|| 92|| '''24'''|| 56|| \
|| 84|| 60|| 92|| '''56'''|| 60||

== Conclusions ==

== Early Notes ==

The first routine I worked on was memcpy().

Comparison of stock memcpy() of a single page on FreeBSD vs Linux on a Westmere (values are TSC deltas):

{{{
x fbsd/westmere/builtin
+ linux/builtin
    N           Min           Max        Median           Avg        Stddev
x 1000           336         18444           340       361.628     573.11483
+ 1000           276          9996           280       288.924     307.34136
Difference at 95.0% confidence
        -72.704 +/- 40.3074
        -20.1046% +/- 11.1461%
        (Student's t, pooled s = 459.847)
}}}

||= Idea =||= Westmere =||= Sandy Bridge =||= Ivy Bridge =||= Penryn =||
|| Replace `dec` with `sub` || none || none || none || ||
|| Use `movsd` instead of `movsq` || slightly slower || slightly slower || 6% faster || ||
|| Simple `movdqa` loop || 138% slower || 58% slower || 46% slower || ||
|| `movdqa` 32 at a time (old) || 27% slower || 14% faster || 17% faster || ||
|| `movdqa` 32 at a time (new) || 27% slower || 15% faster || 18% faster || ||
|| `movdqa` 32 at a time (reorder) || 27% slower || 16% faster || 19% faster || ||
|| `movdqa` 64 at a time (old) || 224% slower || 131% slower || 116% slower || ||
|| `movdqa` 64 at a time (new) || 4 cycles slower || 21% faster || 24% faster || ||
|| Intermix SSE and backwards tests || slightly slower || slightly slower || slightly slower || ||
|| `movaps` 32 at a time || 24% slower || 18% faster || 23% faster || 52% faster ||
|| `movaps` 64 at a time || 17% faster || 23% faster || 25% faster || 48% faster ||

Takeaways from this trial:

 - Use `movaps` instead of `movdqa`, as `movdqa` carries an operand-size (0x66) prefix and so has a longer encoding.
 - Minimize branches on the common path.
 - Unroll the copy loop.

Now testing the overlap case:

||= Idea =||= Westmere =||= Sandy Bridge =||= Ivy Bridge =||= Penryn =||
|| `movaps` 64 at a time || 56% faster || 56% faster || 56% faster || 48% faster ||
|| Above, using `leaq` || 50% faster || 56% faster || 60% faster || 52% faster ||

Notes:

 - `leaq` seems to be in the noise: it was about 4 cycles faster on Ivy Bridge and possibly Sandy Bridge, 4-8 cycles slower on Westmere, and 8 cycles faster on Penryn.
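
For illustration, the takeaways from the page-copy trial (`movaps`, few branches on the common path, an unrolled loop) can be sketched in C with SSE intrinsics. This is not the assembly that was benchmarked: the name `copy64_movaps` is hypothetical, the sketch assumes an x86-64 compiler that provides `<xmmintrin.h>`, and it only handles the simple case of 16-byte-aligned, non-overlapping buffers whose length is a multiple of 64. `_mm_load_ps` and `_mm_store_ps` normally compile to `movaps`, but the compiler is free to pick a different encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>	/* SSE: _mm_load_ps / _mm_store_ps (aligned 16-byte moves) */

/*
 * Hypothetical helper, not the routine measured in the tables.
 * Copies len bytes from src to dst, 64 bytes per loop iteration.
 * Assumptions: dst and src are 16-byte aligned, len is a multiple
 * of 64, and the two buffers do not overlap.
 */
static void
copy64_movaps(void *dst, const void *src, size_t len)
{
	float *d = dst;
	const float *s = src;

	assert(((uintptr_t)dst & 15) == 0 && ((uintptr_t)src & 15) == 0);
	assert((len & 63) == 0);

	/* One loop branch per 64 bytes; each __m128 holds 4 floats (16 bytes). */
	for (; len != 0; len -= 64, s += 16, d += 16) {
		/* Issue the four aligned loads first, then the four stores. */
		__m128 x0 = _mm_load_ps(s);
		__m128 x1 = _mm_load_ps(s + 4);
		__m128 x2 = _mm_load_ps(s + 8);
		__m128 x3 = _mm_load_ps(s + 12);

		_mm_store_ps(d, x0);
		_mm_store_ps(d + 4, x1);
		_mm_store_ps(d + 8, x2);
		_mm_store_ps(d + 12, x3);
	}
}
```

A real memcpy() would also need a tail loop for lengths that are not a multiple of 64, an unaligned path, and a backwards copy for the overlap case. Those are left out here to keep the common path visible.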