Document (failed) attempt to optimize memcpy()
Surfing the Internet, I stumbled upon a site where you
can download a benchmark program that (among other things) can benchmark
different x86 memcpy implementations. Running that benchmark on my machine
revealed that the fastest implementation was roughly twice as fast as the
"rep movsl" implementation (lib/string/i386/memcpy_asm.s) that ReactOS uses.
To test the alternate implementations in a ReactOS setting, I first
instrumented the existing memcpy implementation to log the arguments with
which it was being called. I then booted ReactOS, started a background compile
in it (to generate some I/O) and played a game of Solitaire (to generate
graphics operations). After losing the game, I shut down ReactOS. I then
extracted the memcpy calls made roughly between the start of Explorer (to get
rid of one-time startup effects) and shutdown. The resulting call profile is
attached below.
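
The instrumentation itself is not shown in this note. As a rough illustration
only, a user-mode logging wrapper could look like the sketch below. The names
logged_memcpy and logged_calls are hypothetical, and real kernel-mode code
could not use stdio; it would have to write to a debug log instead.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical counter: number of calls seen so far. */
static unsigned long logged_calls = 0;

/*
 * Hypothetical user-mode wrapper: records the arguments of one call, then
 * forwards to the real memcpy. Passing log == NULL disables the output.
 */
void *logged_memcpy(void *dst, const void *src, size_t n, FILE *log)
{
    logged_calls++;
    if (log != NULL) {
        /* One line per call: destination, source, size in bytes. */
        fprintf(log, "%p %p %zu\n", dst, (void *)src, n);
    }
    return memcpy(dst, src, n);
}
```

Each logged line carries everything needed to replay the call later: both
addresses (for alignment) and the size.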
I then used that profile to make calls to the existing memcpy and an alternate
implementation (I selected the "MMX register copy with SSE prefetching"),
taking care to use different source and destination regions to remove caching
effects. The profile consisted of roughly 250000 calls to memcpy; I found
that I had to execute the profile 10000 times to get "reasonable" time values.
To compensate for the overhead of the test program, I also ran a test where
the whole memcpy routine consisted of a single instruction: "ret". The test
results, after applying a correction for the overhead:

rep movsl        70.5 sec
mmx registers    58.3 sec
Speed increase:  17%

(Test machine: AMD Athlon MP 2800+ running Linux).
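
The test program is not included here. The sketch below shows one way the
replay and the overhead correction could be structured, under several
assumptions: the profile is reduced to an array of sizes, a single source and
destination buffer is reused (the real test varied the regions to defeat
caching), and an empty C function stands in for the "ret"-only routine. All
names are hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Stand-in for the routine that is only "ret": does no copying. */
static void *copy_stub(void *dst, const void *src, size_t n)
{
    (void)src;
    (void)n;
    return dst;
}

/* Replay every recorded size `passes` times; return elapsed CPU seconds. */
double replay_seconds(copy_fn fn, const size_t *sizes, size_t count,
                      unsigned passes, void *dst, const void *src)
{
    clock_t start = clock();
    for (unsigned p = 0; p < passes; p++) {
        for (size_t i = 0; i < count; i++) {
            fn(dst, src, sizes[i]);
        }
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Time of `fn` minus the time of the empty stub (the overhead correction). */
double corrected_seconds(copy_fn fn, const size_t *sizes, size_t count,
                         unsigned passes, void *dst, const void *src)
{
    double raw = replay_seconds(fn, sizes, count, passes, dst, src);
    double overhead = replay_seconds(copy_stub, sizes, count, passes, dst, src);
    return raw - overhead;
}
```

Calling through a function pointer keeps the loop overhead identical for the
real routine and the stub, so the subtraction removes it. With few passes the
corrected value is dominated by timer noise, which is why so many passes are
needed.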
Although the relative speed increase is nice (17%), we also have to look at the
absolute speed increase. Remember that the 70.5 sec for the "rep movsl" case
was obtained by running the whole profile 10000 times. This means that all the
memcpy calls executed during the profiling run of ReactOS together took only
0.00705 seconds. So the conclusion has to be that we're simply not spending
a significant amount of time in memcpy, and optimizing memcpy (although
possible) will not result in significantly better performance of ReactOS as a
whole. (BTW, our memcpy implementation is shared between kernel and user mode;
of the total of 250000 memcpy calls, about 90% were made from kernel mode and
10% from user mode.)
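
The arithmetic behind these figures can be written out as a small sketch. The
helper names are hypothetical; the inputs are the measured values quoted
above.

```c
/* Seconds spent in one pass of the profile, given the total over `runs`. */
double per_run_seconds(double total_seconds, unsigned runs)
{
    return total_seconds / runs;
}

/* Percentage of the baseline time saved by the candidate implementation. */
double percent_time_saved(double baseline_seconds, double candidate_seconds)
{
    return 100.0 * (baseline_seconds - candidate_seconds) / baseline_seconds;
}
```

per_run_seconds(70.5, 10000) gives 0.00705, and
percent_time_saved(70.5, 58.3) gives roughly 17.3. Note that the quoted
"speed increase" is computed here as a reduction in elapsed time relative to
the baseline.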
Just for fun, I then used only the part of the profile where the memory area
was larger than 128 bytes. The MMX implementation actually only runs for sizes
over 128 bytes; for smaller sizes it defers to the "rep movsl" implementation.
According to the profile, the vast majority of memcpy calls (96.8%) are made
with a size smaller than 128 bytes.

rep movsl        52.9 sec
mmx registers    27.1 sec
Speed increase:  48%

This is more or less in line with the results I got from the membench
benchmark.
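
The size cutoff described above amounts to a simple dispatch. The sketch below
only illustrates that structure: copy_small is a plain byte loop standing in
for the "rep movsl" routine, copy_large uses the C library memcpy as a
placeholder for the MMX routine, and the counters exist only to show which
path was taken. All names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define LARGE_COPY_THRESHOLD 128  /* cutoff in bytes, taken from the text */

static unsigned long small_copies = 0;
static unsigned long large_copies = 0;

/* Placeholder for the simple routine used for small blocks. */
static void *copy_small(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    small_copies++;
    while (n--) {
        *d++ = *s++;
    }
    return dst;
}

/* Placeholder for the optimized large-block routine. */
static void *copy_large(void *dst, const void *src, size_t n)
{
    large_copies++;
    return memcpy(dst, src, n);
}

/* Sizes over the threshold take the large-block path; the rest do not. */
void *dispatch_copy(void *dst, const void *src, size_t n)
{
    if (n > LARGE_COPY_THRESHOLD) {
        return copy_large(dst, src, n);
    }
    return copy_small(dst, src, n);
}
```

With 96.8% of the calls falling below the cutoff, nearly all calls would take
the copy_small path, and the extra size check is pure overhead for them.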

Final conclusion: Although optimizing memcpy is useful (and feasible) for
transfers of large blocks, the usage pattern in ReactOS consists mostly of
small blocks. The resulting absolute speed increase doesn't justify the
increased code complexity.

2005/12/03 GvG