<HTML>
<HEAD>
<!-- This HTML file has been created by texi2html 1.54
from manual.texi on 23 March 2000 -->

<TITLE>bzip2 and libbzip2 - Miscellanea</TITLE>
<link href="manual_3.html" rel=Previous>
<link href="manual_toc.html" rel=ToC>

</HEAD>
<BODY>
<p>Go to the <A HREF="manual_1.html">first</A>, <A HREF="manual_3.html">previous</A>, next, last section, <A HREF="manual_toc.html">table of contents</A>.
<P><HR><P>


<H1><A NAME="SEC43" HREF="manual_toc.html#TOC43">Miscellanea</A></H1>

<P>
These are just some random thoughts of mine. Your mileage may
vary.

</P>


<H2><A NAME="SEC44" HREF="manual_toc.html#TOC44">Limitations of the compressed file format</A></H2>
<P>
<CODE>bzip2-1.0</CODE>, <CODE>0.9.5</CODE> and <CODE>0.9.0</CODE>
use exactly the same file format as the previous
version, <CODE>bzip2-0.1</CODE>. This decision was made in the interests of
stability. Creating yet another incompatible compressed file format
would create further confusion and disruption for users.

</P>
<P>
Nevertheless, this is not a painless decision. Development
work since the release of <CODE>bzip2-0.1</CODE> in August 1997
has shown complexities in the file format which slow down
decompression and, in retrospect, are unnecessary. These are:

<UL>
<LI>The run-length encoder, which is the first of the

compression transformations, is entirely irrelevant.
The original purpose was to protect the sorting algorithm
from the very worst case input: a string of repeated
symbols. But algorithm steps Q6a and Q6b in the original
Burrows-Wheeler technical report (SRC-124) show how
repeats can be handled without difficulty in block
sorting.
<LI>The randomisation mechanism doesn't really need to be

there. Udi Manber and Gene Myers published a suffix
array construction algorithm a few years back, which
can be employed to sort any block, no matter how
repetitive, in O(N log N) time. Subsequent work by
Kunihiko Sadakane has produced a derivative O(N (log N)^2)
algorithm which usually outperforms the Manber-Myers
algorithm.

I could have changed to Sadakane's algorithm, but I find
it to be slower than <CODE>bzip2</CODE>'s existing algorithm for
most inputs, and the randomisation mechanism protects
adequately against bad cases. I didn't think it was
a good tradeoff to make. Partly this is due to the fact
that I was not flooded with email complaints about
<CODE>bzip2-0.1</CODE>'s performance on repetitive data, so
perhaps it isn't a problem for real inputs.

Probably the best long-term solution,
and the one I have incorporated into 0.9.5 and above,
is to use the existing sorting
algorithm initially, and fall back to an O(N (log N)^2)
algorithm if the standard algorithm gets into difficulties.
<LI>The compressed file format was never designed to be

handled by a library, and I have had to jump through
some hoops to produce an efficient implementation of
decompression. It's a bit hairy. Try passing
<CODE>decompress.c</CODE> through the C preprocessor
and you'll see what I mean. Much of this complexity
could have been avoided if the compressed size of
each block of data had been recorded in the data stream.
<LI>An Adler-32 checksum, rather than a CRC32 checksum,

would be faster to compute.
</UL>

<P>
It would be fair to say that the <CODE>bzip2</CODE> format was frozen
before I properly and fully understood the performance
consequences of doing so.

</P>
<P>
Improvements which I was able to incorporate into
0.9.0, despite using the same file format, are:

<UL>
<LI>Single array implementation of the inverse BWT. This

significantly speeds up decompression, presumably
because it reduces the number of cache misses.
<LI>Faster inverse MTF transform for large MTF values. The

new implementation is based on the notion of sliding blocks
of values.
<LI><CODE>bzip2-0.9.0</CODE> now reads and writes files with <CODE>fread</CODE>

and <CODE>fwrite</CODE>; version 0.1 used <CODE>putc</CODE> and <CODE>getc</CODE>.
Duh! Well, you live and learn.

</UL>

<P>
Further ahead, it would be nice
to be able to do random access into files. This will
require some careful design of compressed file formats.

</P>



<H2><A NAME="SEC45" HREF="manual_toc.html#TOC45">Portability issues</A></H2>
<P>
After some consideration, I have decided not to use
GNU <CODE>autoconf</CODE> to configure 0.9.5 or 1.0.

</P>
<P>
<CODE>autoconf</CODE>, admirable and wonderful though it is,
mainly assists with portability problems between Unix-like
platforms. But <CODE>bzip2</CODE> doesn't have much in the way
of portability problems on Unix; most of the difficulties appear
when porting to the Mac, or to Microsoft's operating systems.
<CODE>autoconf</CODE> doesn't help in those cases, and brings in a
whole load of new complexity.

</P>
<P>
Most people should be able to compile the library and program
under Unix straight out-of-the-box, so to speak, especially
if you have a version of GNU C available.

</P>
<P>
There are a couple of <CODE>__inline__</CODE> directives in the code. GNU C
(<CODE>gcc</CODE>) should be able to handle them. If you're not using
GNU C, your C compiler shouldn't see them at all.
If your compiler does, for some reason, see them and doesn't
like them, just <CODE>#define</CODE> <CODE>__inline__</CODE> to be <CODE>/* */</CODE>. One
easy way to do this is to compile with the flag <CODE>-D__inline__=</CODE>,
which should be understood by most Unix compilers.

</P>
<P>
If you still have difficulties, try compiling with the macro
<CODE>BZ_STRICT_ANSI</CODE> defined. This should enable you to build the
library in a strictly ANSI compliant environment. Building the program
itself like this is dangerous and not supported, since you remove
<CODE>bzip2</CODE>'s checks against compressing directories, symbolic links,
devices, and other not-really-a-file entities. This could cause
filesystem corruption!

</P>
<P>
One other thing: if you create a <CODE>bzip2</CODE> binary for public
distribution, please try to link it statically (<CODE>gcc -static</CODE>). This
avoids all sorts of library-version issues that others may encounter
later on.

</P>
<P>
If you build <CODE>bzip2</CODE> on Win32, you must set <CODE>BZ_UNIX</CODE> to 0 and
<CODE>BZ_LCCWIN32</CODE> to 1, in the file <CODE>bzip2.c</CODE>, before compiling.
Otherwise the resulting binary won't work correctly.

</P>



<H2><A NAME="SEC46" HREF="manual_toc.html#TOC46">Reporting bugs</A></H2>
<P>
I tried pretty hard to make sure <CODE>bzip2</CODE> is
bug free, both by design and by testing. Hopefully
you'll never need to read this section for real.

</P>
<P>
Nevertheless, if <CODE>bzip2</CODE> dies with a segmentation
fault, a bus error or an internal assertion failure, it
will ask you to email me a bug report. Experience with
version 0.1 shows that almost all these problems can
be traced to either compiler bugs or hardware problems.

<UL>
<LI>

Recompile the program with no optimisation, and see if it
works. And/or try a different compiler.
I heard all sorts of stories about various flavours
of GNU C (and other compilers) generating bad code for
<CODE>bzip2</CODE>, and I've run across two such examples myself.

2.7.X versions of GNU C are known to generate bad code from
time to time, at high optimisation levels.
If you get problems, try using the flags
<CODE>-O2</CODE> <CODE>-fomit-frame-pointer</CODE> <CODE>-fno-strength-reduce</CODE>.
You should specifically <EM>not</EM> use <CODE>-funroll-loops</CODE>.

You may notice that the Makefile runs six tests as part of
the build process. If the program passes all of these, it's
a pretty good (but not 100%) indication that the compiler has
done its job correctly.
<LI>

If <CODE>bzip2</CODE> crashes randomly, and the crashes are not
repeatable, you may have a flaky memory subsystem. <CODE>bzip2</CODE>
really hammers your memory hierarchy, and if it's a bit marginal,
you may get these problems. Ditto if your disk or I/O subsystem
is slowly failing. Yup, this really does happen.

Try using a different machine of the same type, and see if
you can repeat the problem.
<LI>This isn't really a bug, but ... If <CODE>bzip2</CODE> tells

you your file is corrupted on decompression, and you
obtained the file via FTP, there is a possibility that you
forgot to tell FTP to do a binary mode transfer. That absolutely
will cause the file to be non-decompressible. You'll have to transfer
it again.
</UL>

<P>
If you've incorporated <CODE>libbzip2</CODE> into your own program
and are getting problems, please, please, please, check that the
parameters you are passing in calls to the library are
correct, and in accordance with what the documentation says
is allowable. I have tried to make the library robust against
such problems, but I'm sure I haven't succeeded.

</P>
<P>
Finally, if the above comments don't help, you'll have to send
me a bug report. Now, it's just amazing how many people will
send me a bug report saying something like

<PRE>
bzip2 crashed with segmentation fault on my machine
</PRE>

<P>
and absolutely nothing else. Needless to say, such a report
is <EM>totally, utterly, completely and comprehensively 100% useless;
a waste of your time, my time, and net bandwidth</EM>.
With no details at all, there's no way I can possibly begin
to figure out what the problem is.

</P>
<P>
The rules of the game are: facts, facts, facts. Don't omit
them because "oh, they won't be relevant". At the bare
minimum:

<PRE>
Machine type. Operating system version.
Exact version of <CODE>bzip2</CODE> (do <CODE>bzip2 -V</CODE>).
Exact version of the compiler used.
Flags passed to the compiler.
</PRE>

<P>
However, the most important single thing that will help me is
the file that you were trying to compress or decompress at the
time the problem happened. Without that, my ability to do anything
more than speculate about the cause is limited.

</P>
<P>
Please remember that I connect to the Internet with a modem, so
you should contact me before mailing me huge files.

</P>



<H2><A NAME="SEC47" HREF="manual_toc.html#TOC47">Did you get the right package?</A></H2>

<P>
<CODE>bzip2</CODE> is a resource hog. It soaks up large amounts of CPU cycles
and memory. Also, it gives very large latencies. In the worst case, you
can feed many megabytes of uncompressed data into the library before
getting any compressed output, so this probably rules out applications
requiring interactive behaviour.

</P>
<P>
These aren't faults of my implementation, I hope, but more
an intrinsic property of the Burrows-Wheeler transform (unfortunately).
Maybe this isn't what you want.

</P>
<P>
If you want a compressor and/or library which is faster, uses less
memory but gets pretty good compression, and has minimal latency,
consider Jean-loup
Gailly's and Mark Adler's work, <CODE>zlib-1.1.2</CODE> and
<CODE>gzip-1.2.4</CODE>. Look for them at

</P>
<P>
<CODE>http://www.cdrom.com/pub/infozip/zlib</CODE> and
<CODE>http://www.gzip.org</CODE> respectively.

</P>
<P>
For something faster and lighter still, you might try Markus F X J
Oberhumer's <CODE>LZO</CODE> real-time compression/decompression library, at
<BR> <CODE>http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html</CODE>.

</P>
<P>
If you want to use the <CODE>bzip2</CODE> algorithms to compress small blocks
of data, 64k bytes or smaller, for example on an on-the-fly disk
compressor, you'd be well advised not to use this library. Instead,
I've made a special library tuned for that kind of use. It's part of
<CODE>e2compr-0.40</CODE>, an on-the-fly disk compressor for the Linux
<CODE>ext2</CODE> filesystem. Look at
<CODE>http://www.netspace.net.au/~reiter/e2compr</CODE>.

</P>



<H2><A NAME="SEC48" HREF="manual_toc.html#TOC48">Testing</A></H2>

<P>
A record of the tests I've done.

</P>
<P>
First, some data sets:

<UL>
<LI>B: a directory containing 6001 files, one for every length in the

range 0 to 6000 bytes. The files contain random lowercase
letters. 18.7 megabytes.
<LI>H: my home directory tree. Documents, source code, mail files,

compressed data. H contains B, and also a directory of
files designed as boundary cases for the sorting; mostly very
repetitive, nasty files. 565 megabytes.
<LI>A: directory tree holding various applications built from source:

<CODE>egcs</CODE>, <CODE>gcc-2.8.1</CODE>, KDE, GTK, Octave, etc.
2200 megabytes.
</UL>

<P>
The tests conducted are as follows. Each test means compressing
(a copy of) each file in the data set, decompressing it and
comparing it against the original.

</P>
<P>
First, a bunch of tests with block sizes and internal buffer
sizes set very small,
to detect any problems with the
blocking and buffering mechanisms.
This required modifying the source code so as to try to
break it.

<OL>
<LI>Data set H, with

buffer size of 1 byte, and block size of 23 bytes.
<LI>Data set B, buffer sizes 1 byte, block size 1 byte.

<LI>As (2) but small-mode decompression.

<LI>As (2) with block size 2 bytes.

<LI>As (2) with block size 3 bytes.

<LI>As (2) with block size 4 bytes.

<LI>As (2) with block size 5 bytes.

<LI>As (2) with block size 6 bytes and small-mode decompression.

<LI>H with buffer size of 1 byte, but normal block

size (up to 900000 bytes).
</OL>

<P>
Then some tests with unmodified source code.

<OL>
<LI>H, all settings normal.

<LI>As (1), with small-mode decompress.

<LI>H, compress with flag <CODE>-1</CODE>.

<LI>H, compress with flag <CODE>-s</CODE>, decompress with flag <CODE>-s</CODE>.

<LI>Forwards compatibility: H, <CODE>bzip2-0.1pl2</CODE> compressing,

<CODE>bzip2-0.9.5</CODE> decompressing, all settings normal.
<LI>Backwards compatibility: H, <CODE>bzip2-0.9.5</CODE> compressing,

<CODE>bzip2-0.1pl2</CODE> decompressing, all settings normal.
<LI>Bigger tests: A, all settings normal.

<LI>As (7), using the fallback (Sadakane-like) sorting algorithm.

<LI>As (8), compress with flag <CODE>-1</CODE>, decompress with flag

<CODE>-s</CODE>.
<LI>H, using the fallback sorting algorithm.

<LI>Forwards compatibility: A, <CODE>bzip2-0.1pl2</CODE> compressing,

<CODE>bzip2-0.9.5</CODE> decompressing, all settings normal.
<LI>Backwards compatibility: A, <CODE>bzip2-0.9.5</CODE> compressing,

<CODE>bzip2-0.1pl2</CODE> decompressing, all settings normal.
<LI>Misc test: about 400 megabytes of <CODE>.tar</CODE> files with

<CODE>bzip2</CODE> compiled with Checker (a memory access error
detector, like Purify).
<LI>Misc tests to make sure it builds and runs ok on non-Linux/x86

platforms.
</OL>

<P>
These tests were conducted on a 225 MHz IDT WinChip machine, running
Linux 2.0.36. They represent nearly a week of continuous computation.
All tests completed successfully.

</P>



<H2><A NAME="SEC49" HREF="manual_toc.html#TOC49">Further reading</A></H2>
<P>
<CODE>bzip2</CODE> is not research work, in the sense that it doesn't present
any new ideas. Rather, it's an engineering exercise based on existing
ideas.

</P>
<P>
Four documents describe essentially all the ideas behind <CODE>bzip2</CODE>:

<PRE>
Michael Burrows and D. J. Wheeler:
"A block-sorting lossless data compression algorithm"
10th May 1994.
Digital SRC Research Report 124.
ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
If you have trouble finding it, try searching at the
New Zealand Digital Library, http://www.nzdl.org.

Daniel S. Hirschberg and Debra A. LeLewer
"Efficient Decoding of Prefix Codes"
Communications of the ACM, April 1990, Vol 33, Number 4.
You might be able to get an electronic copy of this
from the ACM Digital Library.

David J. Wheeler
Program bred3.c and accompanying document bred3.ps.
This contains the idea behind the multi-table Huffman
coding scheme.
ftp://ftp.cl.cam.ac.uk/users/djw3/

Jon L. Bentley and Robert Sedgewick
"Fast Algorithms for Sorting and Searching Strings"
Available from Sedgewick's web page,
www.cs.princeton.edu/~rs
</PRE>

<P>
The following paper gives valuable additional insights into the
algorithm, but is not immediately the basis of any code
used in bzip2.

<PRE>
Peter Fenwick:
Block Sorting Text Compression
Proceedings of the 19th Australasian Computer Science Conference,
Melbourne, Australia. Jan 31 - Feb 2, 1996.
ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps
</PRE>

<P>
Kunihiko Sadakane's sorting algorithm, mentioned above,
is available from:

<PRE>
http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz
</PRE>

<P>
The Manber-Myers suffix array construction
algorithm is described in a paper
available from:

<PRE>
http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps
</PRE>

<P>
Finally, the following paper documents some recent investigations
I made into the performance of sorting algorithms:

<PRE>
Julian Seward:
On the Performance of BWT Sorting Algorithms
Proceedings of the IEEE Data Compression Conference 2000
Snowbird, Utah. 28-30 March 2000.
</PRE>

<P><HR><P>
<p>Go to the <A HREF="manual_1.html">first</A>, <A HREF="manual_3.html">previous</A>, next, last section, <A HREF="manual_toc.html">table of contents</A>.
</BODY>
</HTML>