<HTML>
<HEAD>
<!-- This HTML file has been created by texi2html 1.54
from manual.texi on 23 March 2000 -->

<TITLE>bzip2 and libbzip2 - Miscellanea</TITLE>
<link href="manual_3.html" rel=Previous>
<link href="manual_toc.html" rel=ToC>

</HEAD>
<BODY>
<p>Go to the <A HREF="manual_1.html">first</A>, <A HREF="manual_3.html">previous</A>, next, last section, <A HREF="manual_toc.html">table of contents</A>.
<P><HR><P>


<H1><A NAME="SEC43" HREF="manual_toc.html#TOC43">Miscellanea</A></H1>

<P>
These are just some random thoughts of mine. Your mileage may
vary.

</P>


<H2><A NAME="SEC44" HREF="manual_toc.html#TOC44">Limitations of the compressed file format</A></H2>
<P>
<CODE>bzip2-1.0</CODE>, <CODE>0.9.5</CODE> and <CODE>0.9.0</CODE>
use exactly the same file format as the previous
version, <CODE>bzip2-0.1</CODE>. This decision was made in the interests of
stability. Creating yet another incompatible compressed file format
would create further confusion and disruption for users.

</P>
<P>
Nevertheless, this is not a painless decision. Development
work since the release of <CODE>bzip2-0.1</CODE> in August 1997
has shown complexities in the file format which slow down
decompression and, in retrospect, are unnecessary. These are:

<UL>
<LI>The run-length encoder, which is the first of the

compression transformations, is entirely irrelevant.
The original purpose was to protect the sorting algorithm
from the very worst case input: a string of repeated
symbols. But algorithm steps Q6a and Q6b in the original
Burrows-Wheeler technical report (SRC-124) show how
repeats can be handled without difficulty in block
sorting.
<LI>The randomisation mechanism doesn't really need to be

there. Udi Manber and Gene Myers published a suffix
array construction algorithm a few years back, which
can be employed to sort any block, no matter how
repetitive, in O(N log N) time. Subsequent work by
Kunihiko Sadakane has produced a derivative O(N (log N)^2)
algorithm which usually outperforms the Manber-Myers
algorithm.

I could have changed to Sadakane's algorithm, but I find
it to be slower than <CODE>bzip2</CODE>'s existing algorithm for
most inputs, and the randomisation mechanism protects
adequately against bad cases. I didn't think it was
a good tradeoff to make. Partly this is due to the fact
that I was not flooded with email complaints about
<CODE>bzip2-0.1</CODE>'s performance on repetitive data, so
perhaps it isn't a problem for real inputs.

Probably the best long-term solution,
and the one I have incorporated into 0.9.5 and above,
is to use the existing sorting
algorithm initially, and fall back to an O(N (log N)^2)
algorithm if the standard algorithm gets into difficulties.
<LI>The compressed file format was never designed to be

handled by a library, and I have had to jump through
some hoops to produce an efficient implementation of
decompression. It's a bit hairy. Try passing
<CODE>decompress.c</CODE> through the C preprocessor
and you'll see what I mean. Much of this complexity
could have been avoided if the compressed size of
each block of data had been recorded in the data stream.
<LI>An Adler-32 checksum, rather than a CRC32 checksum,

would be faster to compute.
</UL>

<P>
It would be fair to say that the <CODE>bzip2</CODE> format was frozen
before I properly and fully understood the performance
consequences of doing so.

</P>
<P>
Improvements which I was able to incorporate into
0.9.0, despite using the same file format, are:

<UL>
<LI>Single array implementation of the inverse BWT. This

significantly speeds up decompression, presumably
because it reduces the number of cache misses.
<LI>Faster inverse MTF transform for large MTF values. The

new implementation is based on the notion of sliding blocks
of values.
<LI><CODE>bzip2-0.9.0</CODE> now reads and writes files with <CODE>fread</CODE>

and <CODE>fwrite</CODE>; version 0.1 used <CODE>putc</CODE> and <CODE>getc</CODE>.
Duh! Well, you live and learn.

</UL>

<P>
Further ahead, it would be nice
to be able to do random access into files. This will
require some careful design of compressed file formats.

</P>



<H2><A NAME="SEC45" HREF="manual_toc.html#TOC45">Portability issues</A></H2>
<P>
After some consideration, I have decided not to use
GNU <CODE>autoconf</CODE> to configure 0.9.5 or 1.0.

</P>
<P>
<CODE>autoconf</CODE>, admirable and wonderful though it is,
mainly assists with portability problems between Unix-like
platforms. But <CODE>bzip2</CODE> doesn't have much in the way
of portability problems on Unix; most of the difficulties appear
when porting to the Mac, or to Microsoft's operating systems.
<CODE>autoconf</CODE> doesn't help in those cases, and brings in a
whole load of new complexity.

</P>
<P>
Most people should be able to compile the library and program
under Unix straight out-of-the-box, so to speak, especially
if you have a version of GNU C available.

</P>
<P>
There are a couple of <CODE>__inline__</CODE> directives in the code. GNU C
(<CODE>gcc</CODE>) should be able to handle them. If you're not using
GNU C, your C compiler shouldn't see them at all.
If your compiler does, for some reason, see them and doesn't
like them, just <CODE>#define</CODE> <CODE>__inline__</CODE> to be <CODE>/* */</CODE>. One
easy way to do this is to compile with the flag <CODE>-D__inline__=</CODE>,
which should be understood by most Unix compilers.

</P>
<P>
If you still have difficulties, try compiling with the macro
<CODE>BZ_STRICT_ANSI</CODE> defined. This should enable you to build the
library in a strictly ANSI compliant environment. Building the program
itself like this is dangerous and not supported, since you remove
<CODE>bzip2</CODE>'s checks against compressing directories, symbolic links,
devices, and other not-really-a-file entities. This could cause
filesystem corruption!

</P>
<P>
One other thing: if you create a <CODE>bzip2</CODE> binary for public
distribution, please try to link it statically (<CODE>gcc -static</CODE>). This
avoids all sorts of library-version issues that others may encounter
later on.

</P>
<P>
If you build <CODE>bzip2</CODE> on Win32, you must set <CODE>BZ_UNIX</CODE> to 0 and
<CODE>BZ_LCCWIN32</CODE> to 1, in the file <CODE>bzip2.c</CODE>, before compiling.
Otherwise the resulting binary won't work correctly.

</P>



<H2><A NAME="SEC46" HREF="manual_toc.html#TOC46">Reporting bugs</A></H2>
<P>
I tried pretty hard to make sure <CODE>bzip2</CODE> is
bug free, both by design and by testing. Hopefully
you'll never need to read this section for real.

</P>
<P>
Nevertheless, if <CODE>bzip2</CODE> dies with a segmentation
fault, a bus error or an internal assertion failure, it
will ask you to email me a bug report. Experience with
version 0.1 shows that almost all these problems can
be traced to either compiler bugs or hardware problems.

<UL>
<LI>

Recompile the program with no optimisation, and see if it
works. And/or try a different compiler.
I heard all sorts of stories about various flavours
of GNU C (and other compilers) generating bad code for
<CODE>bzip2</CODE>, and I've run across two such examples myself.

2.7.X versions of GNU C are known to generate bad code from
time to time, at high optimisation levels.
If you get problems, try using the flags
<CODE>-O2</CODE> <CODE>-fomit-frame-pointer</CODE> <CODE>-fno-strength-reduce</CODE>.
You should specifically <EM>not</EM> use <CODE>-funroll-loops</CODE>.

You may notice that the Makefile runs six tests as part of
the build process. If the program passes all of these, it's
a pretty good (but not 100%) indication that the compiler has
done its job correctly.
<LI>

If <CODE>bzip2</CODE> crashes randomly, and the crashes are not
repeatable, you may have a flaky memory subsystem. <CODE>bzip2</CODE>
really hammers your memory hierarchy, and if it's a bit marginal,
you may get these problems. Ditto if your disk or I/O subsystem
is slowly failing. Yup, this really does happen.

Try using a different machine of the same type, and see if
you can repeat the problem.
<LI>This isn't really a bug, but ... If <CODE>bzip2</CODE> tells

you your file is corrupted on decompression, and you
obtained the file via FTP, there is a possibility that you
forgot to tell FTP to do a binary mode transfer. That absolutely
will cause the file to be non-decompressible. You'll have to transfer
it again.
</UL>

<P>
If you've incorporated <CODE>libbzip2</CODE> into your own program
and are getting problems, please, please, please, check that the
parameters you are passing in calls to the library are
correct, and in accordance with what the documentation says
is allowable. I have tried to make the library robust against
such problems, but I'm sure I haven't succeeded.

</P>
<P>
Finally, if the above comments don't help, you'll have to send
me a bug report. Now, it's just amazing how many people will
send me a bug report saying something like

<PRE>
bzip2 crashed with segmentation fault on my machine
</PRE>

<P>
and absolutely nothing else. Needless to say, such a report
is <EM>totally, utterly, completely and comprehensively 100% useless;
a waste of your time, my time, and net bandwidth</EM>.
With no details at all, there's no way I can possibly begin
to figure out what the problem is.

</P>
<P>
The rules of the game are: facts, facts, facts. Don't omit
them because "oh, they won't be relevant". At the bare
minimum:

<PRE>
Machine type. Operating system version.
Exact version of <CODE>bzip2</CODE> (do <CODE>bzip2 -V</CODE>).
Exact version of the compiler used.
Flags passed to the compiler.
</PRE>

<P>
However, the most important single thing that will help me is
the file that you were trying to compress or decompress at the
time the problem happened. Without that, my ability to do anything
more than speculate about the cause is limited.

</P>
<P>
Please remember that I connect to the Internet with a modem, so
you should contact me before mailing me huge files.

</P>



<H2><A NAME="SEC47" HREF="manual_toc.html#TOC47">Did you get the right package?</A></H2>

<P>
<CODE>bzip2</CODE> is a resource hog. It soaks up large amounts of CPU cycles
and memory. Also, it gives very large latencies. In the worst case, you
can feed many megabytes of uncompressed data into the library before
getting any compressed output, so this probably rules out applications
requiring interactive behaviour.

</P>
<P>
These aren't faults of my implementation, I hope, but more
an intrinsic property of the Burrows-Wheeler transform (unfortunately).
Maybe this isn't what you want.

</P>
<P>
If you want a compressor and/or library which is faster, uses less
memory but gets pretty good compression, and has minimal latency,
consider Jean-loup
Gailly's and Mark Adler's work, <CODE>zlib-1.1.2</CODE> and
<CODE>gzip-1.2.4</CODE>. Look for them at

</P>
<P>
<CODE>http://www.cdrom.com/pub/infozip/zlib</CODE> and
<CODE>http://www.gzip.org</CODE> respectively.

</P>
<P>
For something faster and lighter still, you might try Markus F X J
Oberhumer's <CODE>LZO</CODE> real-time compression/decompression library, at
<BR> <CODE>http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html</CODE>.

</P>
<P>
If you want to use the <CODE>bzip2</CODE> algorithms to compress small blocks
of data, 64k bytes or smaller, for example on an on-the-fly disk
compressor, you'd be well advised not to use this library. Instead,
I've made a special library tuned for that kind of use. It's part of
<CODE>e2compr-0.40</CODE>, an on-the-fly disk compressor for the Linux
<CODE>ext2</CODE> filesystem. Look at
<CODE>http://www.netspace.net.au/~reiter/e2compr</CODE>.

</P>



<H2><A NAME="SEC48" HREF="manual_toc.html#TOC48">Testing</A></H2>

<P>
A record of the tests I've done.

</P>
<P>
First, some data sets:

<UL>
<LI>B: a directory containing 6001 files, one for every length in the

range 0 to 6000 bytes. The files contain random lowercase
letters. 18.7 megabytes.
<LI>H: my home directory tree. Documents, source code, mail files,

compressed data. H contains B, and also a directory of
files designed as boundary cases for the sorting; mostly very
repetitive, nasty files. 565 megabytes.
<LI>A: directory tree holding various applications built from source:

<CODE>egcs</CODE>, <CODE>gcc-2.8.1</CODE>, KDE, GTK, Octave, etc.
2200 megabytes.
</UL>

<P>
The tests conducted are as follows. Each test means compressing
(a copy of) each file in the data set, decompressing it and
comparing it against the original.

</P>
<P>
First, a bunch of tests with block sizes and internal buffer
sizes set very small,
to detect any problems with the
blocking and buffering mechanisms.
This required modifying the source code so as to try to
break it.

<OL>
<LI>Data set H, with

buffer size of 1 byte, and block size of 23 bytes.
<LI>Data set B, buffer sizes 1 byte, block size 1 byte.

<LI>As (2) but small-mode decompression.

<LI>As (2) with block size 2 bytes.

<LI>As (2) with block size 3 bytes.

<LI>As (2) with block size 4 bytes.

<LI>As (2) with block size 5 bytes.

<LI>As (2) with block size 6 bytes and small-mode decompression.

<LI>H with buffer size of 1 byte, but normal block

size (up to 900000 bytes).
</OL>

<P>
Then some tests with unmodified source code.

<OL>
<LI>H, all settings normal.

<LI>As (1), with small-mode decompress.

<LI>H, compress with flag <CODE>-1</CODE>.

<LI>H, compress with flag <CODE>-s</CODE>, decompress with flag <CODE>-s</CODE>.

<LI>Forwards compatibility: H, <CODE>bzip2-0.1pl2</CODE> compressing,

<CODE>bzip2-0.9.5</CODE> decompressing, all settings normal.
<LI>Backwards compatibility: H, <CODE>bzip2-0.9.5</CODE> compressing,

<CODE>bzip2-0.1pl2</CODE> decompressing, all settings normal.
<LI>Bigger tests: A, all settings normal.

<LI>As (7), using the fallback (Sadakane-like) sorting algorithm.

<LI>As (8), compress with flag <CODE>-1</CODE>, decompress with flag

<CODE>-s</CODE>.
<LI>H, using the fallback sorting algorithm.

<LI>Forwards compatibility: A, <CODE>bzip2-0.1pl2</CODE> compressing,

<CODE>bzip2-0.9.5</CODE> decompressing, all settings normal.
<LI>Backwards compatibility: A, <CODE>bzip2-0.9.5</CODE> compressing,

<CODE>bzip2-0.1pl2</CODE> decompressing, all settings normal.
<LI>Misc test: about 400 megabytes of <CODE>.tar</CODE> files with

<CODE>bzip2</CODE> compiled with Checker (a memory access error
detector, like Purify).
<LI>Misc tests to make sure it builds and runs ok on non-Linux/x86

platforms.
</OL>

<P>
These tests were conducted on a 225 MHz IDT WinChip machine, running
Linux 2.0.36. They represent nearly a week of continuous computation.
All tests completed successfully.

</P>



<H2><A NAME="SEC49" HREF="manual_toc.html#TOC49">Further reading</A></H2>
<P>
<CODE>bzip2</CODE> is not research work, in the sense that it doesn't present
any new ideas. Rather, it's an engineering exercise based on existing
ideas.

</P>
<P>
Four documents describe essentially all the ideas behind <CODE>bzip2</CODE>:

<PRE>
Michael Burrows and D. J. Wheeler:
"A block-sorting lossless data compression algorithm"
10th May 1994.
Digital SRC Research Report 124.
ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
If you have trouble finding it, try searching at the
New Zealand Digital Library, http://www.nzdl.org.

Daniel S. Hirschberg and Debra A. LeLewer
"Efficient Decoding of Prefix Codes"
Communications of the ACM, April 1990, Vol 33, Number 4.
You might be able to get an electronic copy of this
from the ACM Digital Library.

David J. Wheeler
Program bred3.c and accompanying document bred3.ps.
This contains the idea behind the multi-table Huffman
coding scheme.
ftp://ftp.cl.cam.ac.uk/users/djw3/

Jon L. Bentley and Robert Sedgewick
"Fast Algorithms for Sorting and Searching Strings"
Available from Sedgewick's web page,
www.cs.princeton.edu/~rs
</PRE>

<P>
The following paper gives valuable additional insights into the
algorithm, but is not immediately the basis of any code
used in bzip2.

<PRE>
Peter Fenwick:
Block Sorting Text Compression
Proceedings of the 19th Australasian Computer Science Conference,
Melbourne, Australia. Jan 31 - Feb 2, 1996.
ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps
</PRE>

<P>
Kunihiko Sadakane's sorting algorithm, mentioned above,
is available from:

<PRE>
http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz
</PRE>

<P>
The Manber-Myers suffix array construction
algorithm is described in a paper
available from:

<PRE>
http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps
</PRE>

<P>
Finally, the following paper documents some recent investigations
I made into the performance of sorting algorithms:

<PRE>
Julian Seward:
On the Performance of BWT Sorting Algorithms
Proceedings of the IEEE Data Compression Conference 2000
Snowbird, Utah. 28-30 March 2000.
</PRE>

<P><HR><P>
<p>Go to the <A HREF="manual_1.html">first</A>, <A HREF="manual_3.html">previous</A>, next, last section, <A HREF="manual_toc.html">table of contents</A>.
</BODY>
</HTML>