= memcpy =

== Variants ==

||= '''Name''' =||= '''Description''' =||
|| stock || MD amd64 version using {{{rep movsq}}} ||
|| SSE2 || {{{movdqu}}} for block copy ||
|| SSE2 aligned || aligns the source so that loads always use {{{movaps}}}; stores use {{{movaps}}} for an aligned destination and {{{movdqu}}} for an unaligned destination ||
|| AVX || 256-bit {{{vmovdqu}}} for block copy, with a 128-byte block as the common loop ||
|| ERMS || {{{rep movsb}}} for machines with ERMS ||

== Machines Tested ==

||= '''CPU''' =||= '''Speed (GHz)''' =||= '''Notes''' =||
|| AMD FX-8120 || 3.11 || 1 x 8 zoo.freebsd.org ||
|| AMD Opteron 6328 || 3.20 || 2 x 8 Supermicro H8DG6/H8DGi ||
|| Intel Xeon X5365 || 3.00 || 2 x 4 Supermicro X7DBU ||
|| Intel Xeon X5482 || 3.20 || 2 x 4 Supermicro X7DWN+ ||
|| Intel Xeon X5675 || 3.07 || Westmere 2 x 6 Supermicro X8DTU ||
|| Intel Core i5-2520M || 2.50 || Sandy Bridge 1 x 4 Thinkpad X220 (4286) ||
|| Intel Core i5-2500K || 3.30 || Sandy Bridge 1 x 4 MSI Z77A-G45 (MS-7752) ||
|| Intel Xeon E5-2680 || 2.70 || Romley 2 x 8 Supermicro X9DRW ||
|| Intel Xeon E5-2667 v2 || 3.30 || Romley V2 2 x 8 Supermicro X9DRW (supports ERMS) ||

== Test Cases ==

||= '''Name''' =||= '''Description''' =||
|| page || copy aligned page to aligned page ||
|| overlap || overlapping copy of page - 16 bytes within a page ||
|| short || aligned copy of 15 bytes ||
|| short2 || aligned copy of 32 bytes ||
|| short3 || aligned copy of 48 bytes ||
|| offset || copy of 128 bytes with the source offset by 4 bytes ||
|| offset2 || copy of 97 bytes with the source and destination both offset by 7 bytes ||

== Results ==

Each number is the minimum of a distribution of TSC deltas, where each delta is measured across a single invocation of the test. Bold marks the lowest time among the variants for a given test and CPU combination. Green text marks times faster than the stock implementation, and red text marks times slower than the stock implementation.
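The measurement described above (the minimum TSC delta over many single invocations) could be collected with a harness along these lines. This is only a sketch, not the harness used for the numbers on this page: `read_counter`, `min_tsc_delta`, and `copy_fn` are hypothetical names, and the unserialized `__rdtsc()` read is an assumption about how the TSC was sampled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>              /* __rdtsc() on GCC and Clang */
/* Read the time-stamp counter. No serializing instruction is used here,
 * so very short copies may be reordered around the reads (assumption). */
static uint64_t read_counter(void) { return __rdtsc(); }
#else
#include <time.h>
/* Fallback so the sketch still builds on other CPUs; this is not a TSC. */
static uint64_t read_counter(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* Any routine with the memcpy() signature can be measured. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Time `runs` single invocations of `copy` and return the smallest delta. */
static uint64_t min_tsc_delta(copy_fn copy, void *dst, const void *src,
                              size_t len, unsigned runs)
{
    uint64_t best = UINT64_MAX;

    assert(runs > 0);
    for (unsigned i = 0; i < runs; i++) {
        uint64_t start = read_counter();
        copy(dst, src, len);
        uint64_t delta = read_counter() - start;
        if (delta < best)
            best = delta;
    }
    return best;
}
```

Calling `min_tsc_delta(memcpy, dst, src, 4096, 1000)` on two page-aligned buffers would roughly correspond to one cell of the page column. Taking the minimum rather than the mean discards samples inflated by interrupts and cache misses.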
{{{#!th rowspan=3
'''CPU'''
}}}
{{{#!th colspan=41
'''Test / Variant'''
}}}
|--
{{{#!th colspan=5
'''page'''
}}}
{{{#!th rowspan=11 style="background: gray"
}}}
{{{#!th colspan=5
'''overlap'''
}}}
{{{#!th rowspan=11 style="background: gray"
}}}
{{{#!th colspan=5
'''short'''
}}}
{{{#!th rowspan=11 style="background: gray"
}}}
{{{#!th colspan=5
'''short2'''
}}}
{{{#!th rowspan=11 style="background: gray"
}}}
{{{#!th colspan=5
'''short3'''
}}}
{{{#!th rowspan=11 style="background: gray"
}}}
{{{#!th colspan=5
'''offset'''
}}}
{{{#!th rowspan=11 style="background: gray"
}}}
{{{#!th colspan=5
'''offset2'''
}}}
|--
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =|| \
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =|| \
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =|| \
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =|| \
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =|| \
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =|| \
||= '''stock''' =||= '''SSE2''' =||= '''SSE2 aligned''' =||= '''AVX''' =||= '''ERMS''' =||
|| AMD FX-8120 || \
|| 1178|| '''[[span(1062, style=color: green)]]'''|| [[span(1067, style=color: green)]]|| [[span(3241, style=color:red)]]|| [[span(1138, style=color: green)]]|| \
|| 2508|| '''[[span(1014, style=color: green)]]'''|| [[span(1483, style=color: green)]]|| [[span(6315, style=color:red)]]|| [[span(14884, style=color:red)]]|| \
|| '''177'''|| [[span(199, style=color:red)]]|| [[span(195, style=color:red)]]|| [[span(199, style=color:red)]]|| [[span(186, style=color:red)]]|| \
|| 177|| '''[[span(84, style=color: green)]]'''|| [[span(122, style=color: green)]]|| [[span(87, style=color: green)]]|| [[span(246, style=color:red)]]|| \
|| 186|| '''[[span(88, style=color: green)]]'''|| [[span(131, style=color: green)]]|| [[span(90, style=color: green)]]|| [[span(305, style=color:red)]]|| \
|| 239|| '''[[span(89, style=color: green)]]'''|| [[span(135, style=color: green)]]|| [[span(122, style=color: green)]]|| [[span(775, style=color:red)]]|| \
|| 212|| '''[[span(140, style=color: green)]]'''|| [[span(272, style=color:red)]]|| [[span(149, style=color: green)]]|| [[span(459, style=color:red)]]||
|| AMD Opteron 6328 || \
|| 508|| [[span(490, style=color: green)]]|| '''[[span(448, style=color: green)]]'''|| [[span(2533, style=color:red)]]|| [[span(487, style=color: green)]]|| \
|| 1071|| '''[[span(434, style=color: green)]]'''|| [[span(641, style=color: green)]]|| [[span(5086, style=color:red)]]|| [[span(6591, style=color:red)]]|| \
|| 118|| [[span(112, style=color: green)]]|| [[span(112, style=color: green)]]|| [[span(110, style=color: green)]]|| '''[[span(107, style=color: green)]]'''|| \
|| 114|| '''[[span(75, style=color: green)]]'''|| [[span(103, style=color: green)]]|| [[span(78, style=color: green)]]|| [[span(136, style=color:red)]]|| \
|| 116|| [[span(78, style=color: green)]]|| [[span(105, style=color: green)]]|| '''[[span(74, style=color: green)]]'''|| [[span(151, style=color:red)]]|| \
|| 145|| '''[[span(84, style=color: green)]]'''|| [[span(101, style=color: green)]]|| [[span(90, style=color: green)]]|| [[span(362, style=color:red)]]|| \
|| 134|| [[span(110, style=color: green)]]|| [[span(140, style=color:red)]]|| '''[[span(106, style=color: green)]]'''|| [[span(177, style=color:red)]]||
|| Intel Xeon X5365 || \
|| 711|| [[span(1512, style=color:red)]]|| '''[[span(378, style=color: green)]]'''|| -- || [[span(792, style=color:red)]]|| \
|| 693|| [[span(1476, style=color:red)]]|| '''[[span(423, style=color: green)]]'''|| -- || [[span(4257, style=color:red)]]|| \
|| 180|| '''[[span(144, style=color: green)]]'''|| '''[[span(144, style=color: green)]]'''|| -- || '''[[span(144, style=color: green)]]'''|| \
|| 117|| '''[[span(81, style=color: green)]]'''|| [[span(90, style=color: green)]]|| -- || [[span(162, style=color:red)]]|| \
|| 117|| '''[[span(90, style=color: green)]]'''|| '''[[span(90, style=color: green)]]'''|| -- || [[span(180, style=color:red)]]|| \
|| 261|| '''[[span(135, style=color: green)]]'''|| '''[[span(135, style=color: green)]]'''|| -- || [[span(477, style=color:red)]]|| \
|| 216|| '''[[span(117, style=color: green)]]'''|| [[span(243, style=color:red)]]|| -- || [[span(234, style=color:red)]]||
|| Intel Xeon X5482 || \
|| 672|| [[span(1424, style=color:red)]]|| '''[[span(312, style=color: green)]]'''|| -- || [[span(760, style=color:red)]]|| \
|| 664|| [[span(1424, style=color:red)]]|| '''[[span(400, style=color: green)]]'''|| -- || [[span(4232, style=color:red)]]|| \
|| 144|| '''[[span(112, style=color: green)]]'''|| '''[[span(112, style=color: green)]]'''|| -- || '''[[span(112, style=color: green)]]'''|| \
|| 80|| '''[[span(48, style=color: green)]]'''|| '''[[span(48, style=color: green)]]'''|| -- || [[span(128, style=color:red)]]|| \
|| 80|| '''[[span(48, style=color: green)]]'''|| '''[[span(48, style=color: green)]]'''|| -- || [[span(144, style=color:red)]]|| \
|| 248|| '''[[span(120, style=color: green)]]'''|| '''[[span(120, style=color: green)]]'''|| -- || [[span(448, style=color:red)]]|| \
|| 200|| '''[[span(72, style=color: green)]]'''|| 200|| -- || [[span(192, style=color: green)]]||
|| Intel Xeon X5675 || \
|| 336|| '''[[span(276, style=color: green)]]'''|| [[span(288, style=color: green)]]|| -- || [[span(424, style=color:red)]]|| \
|| 648|| '''[[span(280, style=color: green)]]'''|| [[span(380, style=color: green)]]|| -- || [[span(4244, style=color:red)]]|| \
|| 116|| [[span(96, style=color: green)]]|| [[span(96, style=color: green)]]|| -- || '''[[span(92, style=color: green)]]'''|| \
|| 60|| '''[[span(46, style=color: green)]]'''|| [[span(69, style=color:red)]]|| -- || [[span(112, style=color:red)]]|| \
|| 60|| '''[[span(46, style=color: green)]]'''|| [[span(76, style=color:red)]]|| -- || [[span(128, style=color:red)]]|| \
|| 76|| '''[[span(46, style=color: green)]]'''|| [[span(56, style=color: green)]]|| -- || [[span(360, style=color:red)]]|| \
|| 64|| '''[[span(40, style=color: green)]]'''|| [[span(192, style=color:red)]]|| -- || [[span(152, style=color:red)]]||
|| Intel Core i5-2520M || \
|| 1775|| '''[[span(900, style=color: green)]]'''|| [[span(937, style=color: green)]]|| '''[[span(900, style=color: green)]]'''|| [[span(13075, style=color:red)]]|| \
|| 2122|| '''[[span(900, style=color: green)]]'''|| [[span(1312, style=color: green)]]|| [[span(2463, style=color:red)]]|| [[span(13362, style=color:red)]]|| \
|| 412|| '''[[span(312, style=color: green)]]'''|| '''[[span(312, style=color: green)]]'''|| '''[[span(312, style=color: green)]]'''|| '''[[span(312, style=color: green)]]'''|| \
|| 200|| '''[[span(75, style=color: green)]]'''|| [[span(125, style=color: green)]]|| '''[[span(75, style=color: green)]]'''|| [[span(375, style=color:red)]]|| \
|| 200|| '''[[span(87, style=color: green)]]'''|| [[span(125, style=color: green)]]|| '''[[span(87, style=color: green)]]'''|| [[span(425, style=color:red)]]|| \
|| 275|| '''[[span(75, style=color: green)]]'''|| [[span(137, style=color: green)]]|| '''[[span(87, style=color: green)]]'''|| [[span(687, style=color:red)]]|| \
|| 212|| '''[[span(125, style=color: green)]]'''|| [[span(612, style=color:red)]]|| '''[[span(125, style=color: green)]]'''|| [[span(587, style=color:red)]]||
|| Intel Core i5-2500K ||
|| Intel Xeon E5-2680 || \
|| 348|| '''[[span(288, style=color: green)]]'''|| [[span(304, style=color: green)]]|| '''[[span(288, style=color: green)]]'''|| [[span(436, style=color:red)]]|| \
|| 676|| '''[[span(288, style=color: green)]]'''|| [[span(420, style=color: green)]]|| [[span(776, style=color:red)]]|| [[span(4276, style=color:red)]]|| \
|| 128|| '''[[span(100, style=color: green)]]'''|| '''[[span(100, style=color: green)]]'''|| '''[[span(100, style=color: green)]]'''|| '''[[span(100, style=color: green)]]'''|| \
|| 64|| '''[[span(24, style=color: green)]]'''|| [[span(40, style=color: green)]]|| '''[[span(24, style=color: green)]]'''|| [[span(120, style=color:red)]]|| \
|| 64|| '''[[span(28, style=color: green)]]'''|| [[span(40, style=color: green)]]|| '''[[span(28, style=color: green)]]'''|| [[span(136, style=color:red)]]|| \
|| 88|| [[span(28, style=color: green)]]|| [[span(44, style=color: green)]]|| '''[[span(24, style=color: green)]]'''|| [[span(296, style=color:red)]]|| \
|| 68|| [[span(40, style=color: green)]]|| [[span(196, style=color:red)]]|| '''[[span(36, style=color: green)]]'''|| [[span(164, style=color:red)]]||
|| Intel Xeon E5-2667 v2 || \
|| 384|| '''[[span(288, style=color: green)]]'''|| [[span(336, style=color: green)]]|| '''[[span(288, style=color: green)]]'''|| [[span(292, style=color: green)]]|| \
|| 708|| '''[[span(288, style=color: green)]]'''|| [[span(452, style=color: green)]]|| [[span(740, style=color:red)]]|| [[span(4268, style=color:red)]]|| \
|| 84|| [[span(52, style=color: green)]]|| '''[[span(44, style=color: green)]]'''|| [[span(52, style=color: green)]]|| '''[[span(44, style=color: green)]]'''|| \
|| 100|| '''[[span(24, style=color: green)]]'''|| [[span(72, style=color: green)]]|| '''[[span(24, style=color: green)]]'''|| [[span(52, style=color: green)]]|| \
|| 100|| '''[[span(24, style=color: green)]]'''|| [[span(76, style=color: green)]]|| '''[[span(24, style=color: green)]]'''|| [[span(40, style=color: green)]]|| \
|| 120|| '''[[span(24, style=color: green)]]'''|| [[span(92, style=color: green)]]|| '''[[span(24, style=color: green)]]'''|| [[span(56, style=color: green)]]|| \
|| 84|| [[span(60, style=color: green)]]|| [[span(92, style=color:red)]]|| '''[[span(56, style=color: green)]]'''|| [[span(60, style=color: green)]]||

== Conclusions ==

== Early Notes ==

The first routine I worked on was memcpy().
Comparison of the stock memcpy() copying a single page on FreeBSD vs. Linux on a Westmere (values are TSC deltas):

{{{
x fbsd/westmere/builtin
+ linux/builtin
    N           Min           Max        Median           Avg        Stddev
x 1000           336         18444           340       361.628     573.11483
+ 1000           276          9996           280       288.924     307.34136
Difference at 95.0% confidence
        -72.704 +/- 40.3074
        -20.1046% +/- 11.1461%
        (Student's t, pooled s = 459.847)
}}}

||= Idea =||= Westmere =||= Sandy Bridge =||= Ivy Bridge =||= Penryn =||
|| Replace `dec` with `sub` || none || none || none || ||
|| Use `movsd` instead of `movsq` || slightly slower || slightly slower || 6% faster || ||
|| Simple `movdqa` loop || 138% slower || 58% slower || 46% slower || ||
|| `movdqa` 32 at a time (old) || 27% slower || 14% faster || 17% faster || ||
|| `movdqa` 32 at a time (new) || 27% slower || 15% faster || 18% faster || ||
|| `movdqa` 32 at a time (reorder) || 27% slower || 16% faster || 19% faster || ||
|| `movdqa` 64 at a time (old) || 224% slower || 131% slower || 116% slower || ||
|| `movdqa` 64 at a time (new) || 4 cycles slower || 21% faster || 24% faster || ||
|| Intermix SSE and backwards tests || slightly slower || slightly slower || slightly slower || ||
|| `movaps` 32 at a time || 24% slower || 18% faster || 23% faster || 52% faster ||
|| `movaps` 64 at a time || 17% faster || 23% faster || 25% faster || 48% faster ||

Takeaways from this trial:
 - Use `movaps` instead of `movdqa`, since `movdqa` carries an operand-size (0x66) prefix.
 - Minimize branches on the common path.
 - Unroll the copy loop.

Now testing the overlap case:

||= Idea =||= Westmere =||= Sandy Bridge =||= Ivy Bridge =||= Penryn =||
|| `movaps` 64 at a time || 56% faster || 56% faster || 56% faster || 48% faster ||
|| Above using `leaq` || 50% faster || 56% faster || 60% faster || 52% faster ||

Notes:
 - `leaq` seems to be in the noise: it was about 4 cycles faster on Ivy Bridge and possibly Sandy Bridge, 4-8 cycles slower on Westmere, and 8 cycles faster on Penryn.
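The takeaways from the first trial (aligned `movaps` moves, few branches on the common path, an unrolled loop) might be sketched in C roughly as follows. This is an illustration and not the routine that was benchmarked: `copy64_aligned` is a hypothetical name, the sketch assumes 16-byte-aligned, non-overlapping buffers and a length that is a multiple of 64, and it relies on compilers normally emitting `movaps` for `_mm_load_ps` and `_mm_store_ps`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE__)
#include <xmmintrin.h>      /* _mm_load_ps / _mm_store_ps (aligned moves) */
#define HAVE_MOVAPS 1
#else
#include <string.h>         /* fallback so the sketch builds on non-SSE CPUs */
#define HAVE_MOVAPS 0
#endif

/*
 * Hypothetical helper: forward copy, 64 bytes per loop iteration.
 * Assumes dst and src are 16-byte aligned, len is a multiple of 64,
 * and the buffers do not overlap.
 */
static void copy64_aligned(void *dst, const void *src, size_t len)
{
    float *d = dst;
    const float *s = src;

    assert(((uintptr_t)dst & 15) == 0 && ((uintptr_t)src & 15) == 0);
    assert((len & 63) == 0);

    /* One branch per 64 bytes: four loads followed by four stores. */
    for (; len != 0; len -= 64) {
#if HAVE_MOVAPS
        __m128 r0 = _mm_load_ps(s);
        __m128 r1 = _mm_load_ps(s + 4);
        __m128 r2 = _mm_load_ps(s + 8);
        __m128 r3 = _mm_load_ps(s + 12);
        _mm_store_ps(d, r0);
        _mm_store_ps(d + 4, r1);
        _mm_store_ps(d + 8, r2);
        _mm_store_ps(d + 12, r3);
#else
        memcpy(d, s, 64);
#endif
        s += 16;            /* 16 floats = 64 bytes */
        d += 16;
    }
}
```

A call such as `copy64_aligned(dst, src, 4096)` would correspond to the page test case. The short, offset, and overlap cases need head and tail handling and a backwards path that this sketch leaves out.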