reactos/media/doc/memcpy_optimize.txt

While surfing the Internet, I stumbled upon http://www.sciencemark.org, where
you can download a benchmark program that (among other things) can benchmark
different x86 memcpy implementations. Running that benchmark on my machine
revealed that the fastest implementation was roughly twice as fast as the
"rep movsl" implementation (lib/string/i386/memcpy_asm.s) that ReactOS uses.

To test the alternate implementations in a ReactOS setting, I first
instrumented the existing memcpy implementation to log the arguments with
which it was being called. I then booted ReactOS, started a background compile
in it (to generate some I/O) and played a game of Solitaire (to generate
graphics operations). After losing the game, I shut down ReactOS. I then
extracted the memcpy calls made roughly between the start of Explorer (to get
rid of one-time startup effects) and shutdown. The resulting call profile is
attached below.

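As a rough illustration of the instrumentation step, the sketch below wraps a
copy routine so that every call records its size before doing the copy. It is
a Python model of the idea only, not the change that was made to the ReactOS
assembly routine; the names slice_copy, make_logged_copy and call_log are
hypothetical.

```python
def slice_copy(dst, src, n):
    """Stand-in for memcpy: copy the first n bytes of src into dst."""
    dst[:n] = src[:n]
    return dst


def make_logged_copy(copy, call_log):
    """Return a version of `copy` that appends each call's size to call_log."""
    def logged_copy(dst, src, n):
        call_log.append(n)          # record the argument of interest
        return copy(dst, src, n)    # then behave exactly like the original
    return logged_copy


call_log = []
logged = make_logged_copy(slice_copy, call_log)

src = bytearray(b"abcdefgh")
dst = bytearray(8)
logged(dst, src, 4)
logged(dst, src, 8)
# call_log now holds one entry per call, in call order.
```

Cutting the log down to the window of interest (here, from the start of
Explorer to shutdown) would then be a matter of slicing call_log.
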
I then used that profile to make calls to the existing memcpy and an alternate
implementation (I selected the "MMX register copy with SSE prefetching"),
taking care to use different source and destination regions to remove caching
effects. The profile consisted of roughly 250000 calls to memcpy; I found
that I had to execute the profile 10000 times to get "reasonable" time values.
To compensate for the overhead of the test program, I also ran a test where
the whole memcpy routine consisted of a single instruction: "ret". The test
results, after applying a correction for the overhead:

rep movsl        70.5 sec
mmx registers    58.3 sec
Speed increase:  17%

(Test machine: AMD Athlon MP 2800+ running Linux).

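The measurement procedure (replay the profile many times, then subtract the
time the harness needs when the copy routine does nothing) can be sketched as
follows. This is a minimal Python model under stated assumptions: slice_copy
stands in for the routine under test, stub_copy for the "ret"-only routine,
and all function names are hypothetical. It uses two separate buffers but does
not reproduce the care taken in the real test to avoid caching effects.

```python
import time


def slice_copy(dst, src, n):
    """Stand-in for the routine under test: copy n bytes from src to dst."""
    dst[:n] = src[:n]


def stub_copy(dst, src, n):
    """Analogue of a memcpy consisting only of "ret": does no work."""
    return None


def replay(profile, copy, repeats):
    """Replay all sizes in `profile` `repeats` times; return elapsed seconds."""
    largest = max(profile)
    src = bytearray(largest)    # separate source and destination buffers
    dst = bytearray(largest)
    start = time.perf_counter()
    for _ in range(repeats):
        for n in profile:
            copy(dst, src, n)
    return time.perf_counter() - start


def corrected_time(profile, copy, repeats):
    """Measured time minus the time the harness needs with the stub routine."""
    measured = replay(profile, copy, repeats)
    overhead = replay(profile, stub_copy, repeats)
    return measured - overhead
```

A call such as corrected_time([16, 64, 256], slice_copy, 1000) returns the
overhead-corrected time in seconds. Timings obtained this way say nothing
about the assembly routines; only the structure of the measurement carries
over.
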
Although the relative speed increase is nice (17%), we also have to look at the
absolute speed increase. Remember that the 70.5 sec for the "rep movsl" case
was obtained by running the whole profile 10000 times. This means that all the
memcpy calls executed during the profiling run of ReactOS together took only
0.00705 seconds. So the conclusion has to be that we're simply not spending
a significant amount of time in memcpy, and optimizing memcpy (although
possible) will not result in significantly better performance of ReactOS as a
whole. (BTW, our memcpy implementation is shared between kernel and user mode;
of the total of 250000 memcpy calls, about 90% were made from kernel mode and
10% from user mode.)

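The absolute figures follow from simple division, worked through below using
only the numbers quoted above. The variable names are illustrative, and the
per-call average assumes the approximate count of 250000 calls per pass.

```python
TOTAL_REP_MOVSL = 70.5    # seconds for 10000 passes, from the results above
TOTAL_MMX = 58.3          # seconds for 10000 passes, from the results above
PASSES = 10000
CALLS_PER_PASS = 250000   # approximate size of the profile

# Time spent in memcpy during one profiled ReactOS session.
per_pass = TOTAL_REP_MOVSL / PASSES

# Time the alternate implementation would have saved in that session.
saved_per_pass = (TOTAL_REP_MOVSL - TOTAL_MMX) / PASSES

# Average cost of a single memcpy call, in nanoseconds.
per_call_ns = per_pass / CALLS_PER_PASS * 1e9

print(f"{per_pass:.5f} s in memcpy per session")
print(f"{saved_per_pass:.5f} s saved per session")
print(f"{per_call_ns:.1f} ns per call on average")
```

In other words, the alternate implementation would have saved on the order of
a millisecond over the whole profiled session.
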
Just for fun, I then used only the part of the profile where the memory area
was larger than 128 bytes. The MMX implementation actually only runs for sizes
over 128 bytes; for smaller sizes it defers to the "rep movsl" implementation.
According to the profile, the vast majority of memcpy calls (96.8%) are made
with a size smaller than 128 bytes.

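The size-based fallback described above can be modelled as follows. This is a
sketch under assumptions, not the benchmark's code: both stand-in routines
copy the same way and return a label only so that the chosen path is visible,
and treating a size of exactly 128 bytes as "small" is an assumption, since
the text only says "over 128 bytes".

```python
THRESHOLD = 128  # bytes; sizes above this take the fast path


def simple_copy(dst, src, n):
    """Stand-in for the "rep movsl" routine."""
    dst[:n] = src[:n]
    return "simple"


def fast_copy(dst, src, n):
    """Stand-in for the MMX routine; here it copies the same way."""
    dst[:n] = src[:n]
    return "fast"


def dispatch_copy(dst, src, n):
    """Use the fast routine only for sizes over THRESHOLD, else defer."""
    if n > THRESHOLD:
        return fast_copy(dst, src, n)
    return simple_copy(dst, src, n)


def small_call_fraction(sizes):
    """Fraction of calls in `sizes` that never reach the fast path."""
    small = sum(1 for n in sizes if n <= THRESHOLD)
    return small / len(sizes)
```

Applied to a list of logged call sizes, small_call_fraction gives the share of
calls for which the alternate implementation makes no difference at all.
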
Results for the calls larger than 128 bytes only:

rep movsl        52.9 sec
mmx registers    27.1 sec
Speed increase:  48%

This is more or less in line with the results I got from the membench benchmark
from http://www.sciencemark.org.

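To see why a 48% figure lines up with "roughly twice as fast", it helps to
read the percentages above as reductions in run time, (old - new) / old, which
matches the quoted numbers, and to compare them with the ratio old / new. The
helper names below are hypothetical; the timings are the ones from the two
result tables.

```python
def run_time_reduction(old, new):
    """Fraction of run time saved when going from `old` to `new` seconds."""
    return (old - new) / old


def speed_ratio(old, new):
    """How many times faster the new routine is than the old one."""
    return old / new


# Timings in seconds, taken from the two result tables.
all_sizes = (70.5, 58.3)
large_only = (52.9, 27.1)

for label, (old, new) in (("all sizes", all_sizes),
                          ("over 128 bytes", large_only)):
    print(f"{label}: {run_time_reduction(old, new):.1%} less time, "
          f"{speed_ratio(old, new):.2f}x as fast")
```

For the large-block case the ratio comes out at about 1.95, i.e. close to a
factor of two.
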
Final conclusion: Although optimizing memcpy is useful (and feasible) for
transfers of large blocks, the usage pattern in ReactOS consists mostly of
small blocks. The resulting absolute speed increase doesn't justify the
increased code complexity.

2005/12/03 GvG