From 96ce8a5210d002c0ab15e13824be373ee04668c4 Mon Sep 17 00:00:00 2001
From: =?utf8?q?G=C3=A9=20van=20Geldorp?=
Date: Sat, 3 Dec 2005 19:40:52 +0000
Subject: [PATCH] Document (failed) attempt to optimize memcpy()

svn path=/trunk/; revision=19843
---
 reactos/lib/string/i386/memcpy_asm.s  |  6 +--
 reactos/media/doc/memcpy_optimize.txt | 55 +++++++++++++++++++++++++++
 2 files changed, 57 insertions(+), 4 deletions(-)
 create mode 100644 reactos/media/doc/memcpy_optimize.txt

diff --git a/reactos/lib/string/i386/memcpy_asm.s b/reactos/lib/string/i386/memcpy_asm.s
index beeccfd5efa..370d9d89eb6 100644
--- a/reactos/lib/string/i386/memcpy_asm.s
+++ b/reactos/lib/string/i386/memcpy_asm.s
@@ -1,9 +1,7 @@
-/*
- * $Id$
- */
-
 /*
  * void *memcpy (void *to, const void *from, size_t count)
+ *
+ * Some optimization research can be found in media/doc/memcpy_optimize.txt
  */
 
 .globl _memcpy
diff --git a/reactos/media/doc/memcpy_optimize.txt b/reactos/media/doc/memcpy_optimize.txt
new file mode 100644
index 00000000000..d9eacf7ed38
--- /dev/null
+++ b/reactos/media/doc/memcpy_optimize.txt
@@ -0,0 +1,55 @@
+Surfing the Internet, I stumbled upon http://www.sciencemark.org where you
+can download a benchmark program that (amongst others) can benchmark different
+x86 memcpy implementations. Running that benchmark on my machine revealed that
+the fastest implementation was roughly twice as fast as the "rep movsl"
+implementation (lib/string/i386/memcpy_asm.s) that ReactOS uses.
+To test the alternate implementations in a ReactOS setting, I first
+instrumented the existing memcpy implementation to log with which arguments
+it was being called. I then booted ReactOS, started a background compile in it
+(to generate some I/O) and played a game of Solitaire (to generate graphics
+operations). After losing the game, I shut down ReactOS. I then extracted
+the memcpy calls roughly between the start of Explorer (to get rid of one-time
+startup effects) and shutdown. The resulting call profile is attached below.
+I then used that profile to make calls to the existing memcpy and an alternate
+implementation (I selected the "MMX register copy with SSE prefetching"),
+taking care to use different source and destination regions to remove caching
+effects. The profile consisted of roughly 250000 calls to memcpy; I found
+that I had to execute the profile 10000 times to get "reasonable" time values.
+To compensate for the overhead of the test program, I also ran a test where
+the whole memcpy routine consisted of a single instruction: "ret". The test
+results, after applying a correction for the overhead:
+
+rep movsl       70.5 sec
+mmx registers   58.3 sec
+Speed increase: 17%
+
+(Test machine: AMD Athlon MP 2800+ running Linux).
+Although the relative speed increase is nice (17%), we also have to look at the
+absolute speed increase. Remember that the 70.5 sec for the "rep movsl" case
+was obtained by running the whole profile 10000 times. This means that all the
+memcpy calls executed during the profiling run of ReactOS together took only
+0.00705 seconds. So the conclusion has to be that we're simply not spending
+a significant amount of time in memcpy (BTW, our memcpy implementation is
+shared between kernel and user mode; of the total of 250000 memcpy calls about
+90% were made from kernel mode and 10% from user mode), so optimizing memcpy
+(although possible) will not result in significantly better performance of
+ReactOS as a whole.
+Just for fun, I then used only the part of the profile where the memory area
+was larger than 128 bytes. The MMX implementation actually only runs for sizes
+over 128 bytes; for smaller sizes it defers to the "rep movsl" implementation.
+According to the profile, the vast majority of memcpy calls are made with a
+size smaller than 128 bytes (96.8%).
+
+rep movsl       52.9 sec
+mmx registers   27.1 sec
+Speed increase: 48%
+
+This is more or less in line with the results I got from the membench benchmark
+from http://www.sciencemark.org.
+
+Final conclusion: Although optimizing memcpy is useful (and feasible) for
+transfer of large blocks, the usage pattern in ReactOS consists mostly of
+small blocks. The resulting absolute speed increase doesn't justify the
+increased code complexity.
+
+2005/12/03 GvG
-- 
2.17.1
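
Reviewer's note on the arithmetic in memcpy_optimize.txt: the short Python sketch below reworks the figures from the two result tables. It is only an illustration and not part of the patch; the function names are hypothetical, and the only inputs are the timings and the repetition count quoted in the text. It suggests that the 17% and 48% figures correspond to the reduction in elapsed time, which is a slightly different measure from the speedup factor.

```python
# Figures below are copied from the result tables in memcpy_optimize.txt.
# Function names are hypothetical, chosen for this illustration only.

REPETITIONS = 10000  # number of times the recorded call profile was replayed


def per_run_seconds(total_seconds, repetitions=REPETITIONS):
    """Time spent in memcpy during a single pass over the call profile."""
    return total_seconds / repetitions


def time_reduction_percent(baseline_seconds, candidate_seconds):
    """Share of the baseline run time that the candidate saves, in percent."""
    return 100.0 * (baseline_seconds - candidate_seconds) / baseline_seconds


def speedup_factor(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    cases = (
        ("full profile", 70.5, 58.3),
        ("blocks larger than 128 bytes", 52.9, 27.1),
    )
    for label, rep_movsl, mmx in cases:
        print(label)
        print("  rep movsl, one pass: %.5f sec" % per_run_seconds(rep_movsl))
        print("  time reduction:      %.1f%%" % time_reduction_percent(rep_movsl, mmx))
        print("  speedup factor:      %.2fx" % speedup_factor(rep_movsl, mmx))
```

For the full profile this gives roughly a 17.3% reduction in run time (a factor of about 1.21), and for the large-block subset roughly 48.8% (a factor of about 1.95), which agrees with the "roughly twice as fast" observation from the benchmark.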