While surfing the Internet, I stumbled upon http://www.sciencemark.org, where you can download a benchmark program that (amongst other things) can benchmark different x86 memcpy implementations. Running that benchmark on my machine revealed that the fastest implementation was roughly twice as fast as the "rep movsl" implementation (lib/string/i386/memcpy_asm.s) that ReactOS uses.

To test the alternate implementations in a ReactOS setting, I first instrumented the existing memcpy implementation to log the arguments it was called with. I then booted ReactOS, started a background compile in it (to generate some I/O) and played a game of Solitaire (to generate graphics operations). After losing the game, I shut down ReactOS. I then extracted the memcpy calls made roughly between the start of Explorer (to get rid of one-time startup effects) and shutdown. The resulting call profile is attached below.

I then used that profile to make calls to the existing memcpy and to an alternate implementation (I selected the "MMX register copy with SSE prefetching"), taking care to use different source and destination regions to remove caching effects. The profile consisted of roughly 250000 calls to memcpy. I found that I had to execute the profile 10000 times to get "reasonable" time values. To compensate for the overhead of the test program, I also ran a test where the whole memcpy routine consisted of a single instruction: "ret".

The test results, after applying a correction for the overhead:

  rep movsl        70.5 sec
  mmx registers    58.3 sec
  Speed increase:  17%

(Test machine: AMD Athlon MP 2800+ running Linux.)

Although the relative speed increase is nice (17%), we also have to look at the absolute speed increase. Remember that the 70.5 sec for the "rep movsl" case was obtained by running the whole profile 10000 times. This means that all the memcpy calls executed during the profiling run of ReactOS together took only 70.5 / 10000 = 0.00705 seconds.
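The measurement procedure described above can be sketched roughly as follows. This is not the original test program: the names copy_fn, copy_libc, copy_noop, replay, seconds_per_run and percent_saved are all made up for illustration, the C library memcpy stands in for the routine under test, and walking through one large buffer pair is only an assumption about how "different source and destination regions" might be arranged.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical common signature for every routine under test. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Stand-in for a real implementation: simply forwards to the C library. */
void *copy_libc(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Baseline that copies nothing, playing the role of the "ret"-only routine. */
void *copy_noop(void *dst, const void *src, size_t n)
{
    (void)src;
    (void)n;
    return dst;
}

/* Replay the recorded sizes `repeats` times and return elapsed CPU seconds.
 * Each call uses the next region of the buffers, wrapping at the end. */
double replay(copy_fn fn, const size_t *sizes, size_t count, unsigned repeats,
              char *dst, const char *src, size_t buf_len)
{
    clock_t start = clock();
    for (unsigned r = 0; r < repeats; r++) {
        size_t off = 0;
        for (size_t i = 0; i < count; i++) {
            assert(sizes[i] <= buf_len);
            if (off + sizes[i] > buf_len)
                off = 0;                       /* wrap to the buffer start */
            fn(dst + off, src + off, sizes[i]);
            off += sizes[i];
        }
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Time attributable to one pass over the profile. */
double seconds_per_run(double corrected_total, unsigned repeats)
{
    return corrected_total / repeats;
}

/* Relative time saved by the new routine, in percent. */
double percent_saved(double old_seconds, double new_seconds)
{
    return 100.0 * (old_seconds - new_seconds) / old_seconds;
}
```

The corrected time for a routine would then be replay(routine, ...) minus replay(copy_noop, ...) with identical arguments. Plugging in the figures above, seconds_per_run(70.5, 10000) gives the 0.00705 seconds quoted, and percent_saved(70.5, 58.3) comes out at a little over 17.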
So the conclusion has to be that we're simply not spending a significant amount of time in memcpy, and optimizing memcpy (although possible) will not result in significantly better performance of ReactOS as a whole. (BTW, our memcpy implementation is shared between kernel and user mode. Of the total of 250000 memcpy calls, about 90% were made from kernel mode and 10% from user mode.)

Just for fun, I then used only the part of the profile where the memory area was larger than 128 bytes. The MMX implementation actually only runs for sizes over 128 bytes; for smaller sizes it defers to the "rep movsl" implementation. According to the profile, the vast majority of memcpy calls (96.8%) are made with a size smaller than 128 bytes.

  rep movsl        52.9 sec
  mmx registers    27.1 sec
  Speed increase:  48%

This is more or less in line with the results I got from the membench benchmark from http://www.sciencemark.org.

Final conclusion: Although optimizing memcpy is useful (and feasible) for the transfer of large blocks, the usage pattern in ReactOS consists mostly of small blocks. The resulting absolute speed increase doesn't justify the increased code complexity.

2005/12/03 GvG
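For reference, the size-based hand-off mentioned above (large blocks go to the MMX path, everything else to "rep movsl") can be sketched in C along these lines. The function names and the LARGE_COPY_THRESHOLD macro are hypothetical, both paths are plain memcpy placeholders rather than the real assembly routines, and treating a size of exactly 128 bytes as "small" is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Cut-off in bytes, taken from the 128-byte figure in the text. */
#define LARGE_COPY_THRESHOLD 128

/* Placeholder for the small-block path ("rep movsl" in the note). */
void *copy_small(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Placeholder for the large-block path (MMX registers with prefetching). */
void *copy_large(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Choose a path by size: only blocks over the threshold take the large path. */
void *copy_dispatch(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    return (n > LARGE_COPY_THRESHOLD) ? copy_large(dst, src, n)
                                      : copy_small(dst, src, n);
}

/* Fraction (0.0 to 1.0) of recorded calls that would take the large path. */
double large_call_fraction(const size_t *sizes, size_t count)
{
    size_t large = 0;
    if (count == 0)
        return 0.0;
    for (size_t i = 0; i < count; i++)
        if (sizes[i] > LARGE_COPY_THRESHOLD)
            large++;
    return (double)large / (double)count;
}
```

Running large_call_fraction over a recorded profile shows how many calls would ever reach the large-block path. With 96.8% of calls below 128 bytes, that is at most about 3% of them.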