Surfing the Internet, I stumbled upon http://www.sciencemark.org where you
can download a benchmark program that, among other things, can benchmark
different x86 memcpy implementations. Running that benchmark on my machine
revealed that the fastest implementation was roughly twice as fast as the
"rep movsl" implementation (lib/string/i386/memcpy_asm.s) that ReactOS uses.
To test the alternate implementations in a ReactOS setting, I first
instrumented the existing memcpy implementation to log the arguments it was
called with. I then booted ReactOS, started a background compile in it
(to generate some I/O) and played a game of Solitaire (to generate graphics
operations). After losing the game, I shut down ReactOS. I then extracted
the memcpy calls roughly between the start of Explorer (to get rid of one-time
startup effects) and shutdown. The resulting call profile is attached below.
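The instrumentation step could look roughly like the following C sketch. It
is only an illustration: the record layout, the fixed-size in-memory log and
the names logged_memcpy, call_log and MAX_LOGGED_CALLS are assumptions, not
taken from the ReactOS source, and the real routine is written in assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record layout; the real log format is not specified. */
struct memcpy_call {
    const void *dst;
    const void *src;
    size_t      len;
};

#define MAX_LOGGED_CALLS 4096

static struct memcpy_call call_log[MAX_LOGGED_CALLS];
static size_t logged_calls;

/* Records the arguments of one call, then performs the actual copy. */
static void *logged_memcpy(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    if (logged_calls < MAX_LOGGED_CALLS) {
        call_log[logged_calls].dst = dst;
        call_log[logged_calls].src = src;
        call_log[logged_calls].len = len;
        logged_calls++;
    }
    /* Calls beyond the log capacity are still copied, just not recorded. */
    return memcpy(dst, src, len);
}
```

After the run, the entries in call_log between two chosen points (here: start
of Explorer and shutdown) would form the profile.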
I then used that profile to make calls to the existing memcpy and an alternate
implementation (I selected the "MMX register copy with SSE prefetching"),
taking care to use different source and destination regions to remove caching
effects. The profile consisted of roughly 250000 calls to memcpy; I found
that I had to execute the profile 10000 times to get "reasonable" time values.
To compensate for the overhead of the test program, I also ran a test where
the whole memcpy routine consisted of a single instruction: "ret". The test
results, after applying a correction for the overhead:

rep movsl      70.5 sec
mmx registers  58.3 sec
Speed increase: 17%

(Test machine: AMD Athlon MP 2800+ running Linux).
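The replay-and-correct procedure described above can be sketched in C as
follows. This is a simplified illustration under stated assumptions: the
profile is reduced to an array of sizes, the same two buffers are reused for
every call (the original test varied the regions to avoid caching effects),
and all names (copy_fn, copy_nothing, copy_libc, replay, corrected_seconds)
are hypothetical. A real harness would also have to make sure the optimizer
does not remove the calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t len);

/* Stand-in for the "ret"-only routine: returns without copying anything. */
static void *copy_nothing(void *dst, const void *src, size_t len)
{
    (void)src;
    (void)len;
    return dst;
}

/* Implementation under test; here simply the C library memcpy. */
static void *copy_libc(void *dst, const void *src, size_t len)
{
    return memcpy(dst, src, len);
}

/* Replays `count` recorded sizes `repeats` times; returns CPU seconds used. */
static double replay(copy_fn fn, const size_t *sizes, size_t count,
                     unsigned repeats, char *dst, const char *src)
{
    assert(fn != NULL && sizes != NULL);
    clock_t start = clock();
    for (unsigned r = 0; r < repeats; r++)
        for (size_t i = 0; i < count; i++)
            fn(dst, src, sizes[i]);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Time for `fn` minus the time the harness needs for an empty routine. */
static double corrected_seconds(copy_fn fn, const size_t *sizes, size_t count,
                                unsigned repeats, char *dst, const char *src)
{
    double raw      = replay(fn, sizes, count, repeats, dst, src);
    double overhead = replay(copy_nothing, sizes, count, repeats, dst, src);
    return raw - overhead;
}
```

With very short runs the corrected value is dominated by timer noise and can
even come out negative, which is why the profile had to be repeated many
times.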
Although the relative speed increase is nice (17%), we also have to look at the
absolute speed increase. Remember that the 70.5 sec for the "rep movsl" case
was obtained by running the whole profile 10000 times. This means that all the
memcpy calls executed during the profiling run of ReactOS together took only
0.00705 seconds. So the conclusion has to be that we're simply not spending
a significant amount of time in memcpy, and optimizing memcpy (although
possible) will not result in significantly better performance of ReactOS as
a whole. (BTW, our memcpy implementation is shared between kernel and user
mode; of the total of 250000 memcpy calls, about 90% were made from kernel
mode and 10% from user mode.)
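The arithmetic behind these figures can be checked with a few lines of C.
The helper names below are made up for illustration. Note that the quoted
percentage corresponds to the fraction of time saved relative to the old
routine; expressed as a ratio of old time to new time, the same measurement
is a factor of about 1.21.

```c
#include <assert.h>

/* Time for one pass over the profile, given the total over `runs` passes. */
static double seconds_per_run(double total_seconds, unsigned runs)
{
    assert(runs > 0);
    return total_seconds / runs;
}

/* Fraction of the old running time that the new routine saves. */
static double time_saved(double old_seconds, double new_seconds)
{
    assert(old_seconds > 0.0);
    return (old_seconds - new_seconds) / old_seconds;
}

/* How many times faster the new routine is (old time / new time). */
static double speedup(double old_seconds, double new_seconds)
{
    assert(new_seconds > 0.0);
    return old_seconds / new_seconds;
}
```

For the measurements above, seconds_per_run(70.5, 10000) gives 0.00705,
time_saved(70.5, 58.3) gives about 0.173 and speedup(70.5, 58.3) gives about
1.21.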
Just for fun, I then used only the part of the profile where the memory area
was larger than 128 bytes. The MMX implementation actually only runs for sizes
over 128 bytes; for smaller sizes it defers to the "rep movsl" implementation.
According to the profile, the vast majority of memcpy calls (96.8%) are made
with a size smaller than 128 bytes.

rep movsl      52.9 sec
mmx registers  27.1 sec
Speed increase: 48%

This is more or less in line with the results I got from the membench benchmark
from http://www.sciencemark.org.
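The size-based dispatch mentioned above (fast path only for copies over 128
bytes, fallback otherwise) can be illustrated with a portable C sketch. It
does not use MMX or prefetch instructions; copy_large merely stands in for
the optimized routine by copying in 64-byte blocks, and all names
(LARGE_COPY_THRESHOLD, uses_large_path, copy_small, copy_large,
dispatch_memcpy) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LARGE_COPY_THRESHOLD 128

/* Nonzero when a copy of `len` bytes would take the block-copy path. */
static int uses_large_path(size_t len)
{
    return len > LARGE_COPY_THRESHOLD;
}

/* Fallback for small copies (the "rep movsl" routine in the original). */
static void *copy_small(void *dst, const void *src, size_t len)
{
    return memcpy(dst, src, len);
}

/* Stand-in for the optimized routine: 64-byte blocks, then the tail. */
static void *copy_large(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(uses_large_path(len));
    while (len >= 64) {
        memcpy(d, s, 64);
        d += 64;
        s += 64;
        len -= 64;
    }
    if (len > 0)
        memcpy(d, s, len);
    return dst;
}

/* Chooses the routine based on the size of the copy. */
static void *dispatch_memcpy(void *dst, const void *src, size_t len)
{
    if (uses_large_path(len))
        return copy_large(dst, src, len);
    return copy_small(dst, src, len);
}
```

Since 96.8% of the profiled calls fall below the threshold, nearly all calls
would end up in copy_small, which is why the gain on the full profile is so
much smaller than on the large-block subset.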

Final conclusion: Although optimizing memcpy is useful (and feasible) for
transfer of large blocks, the usage pattern in ReactOS consists mostly of
small blocks. The resulting absolute speed increase doesn't justify the
increased code complexity.

2005/12/03 GvG