NAME
bzip2, bunzip2 - a block-sorting file compressor, v1.0
bzcat - decompresses files to stdout
bzip2recover - recovers data from damaged bzip2 files


SYNOPSIS
bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ]
bunzip2 [ -fkvsVL ] [ filenames ... ]
bzcat [ -s ] [ filenames ... ]
bzip2recover filename


DESCRIPTION
bzip2 compresses files using the Burrows-Wheeler block
sorting text compression algorithm, and Huffman coding.
Compression is generally considerably better than that
achieved by more conventional LZ77/LZ78-based compressors,
and approaches the performance of the PPM family of
statistical compressors.

The command-line options are deliberately very similar to
those of GNU gzip, but they are not identical.

bzip2 expects a list of file names to accompany the
command-line flags. Each file is replaced by a compressed
version of itself, with the name "original_name.bz2".
Each compressed file has the same modification date,
permissions, and, when possible, ownership as the
corresponding original, so that these properties can be
correctly restored at decompression time. File name
handling is naive in the sense that there is no mechanism
for preserving original file names, permissions, ownerships
or dates in filesystems which lack these concepts, or have
serious file name length restrictions, such as MS-DOS.

bzip2 and bunzip2 will by default not overwrite existing
files. If you want this to happen, specify the -f flag.

If no file names are specified, bzip2 compresses from
standard input to standard output. In this case, bzip2
will decline to write compressed output to a terminal, as
this would be entirely incomprehensible and therefore
pointless.

bunzip2 (or bzip2 -d) decompresses all specified files.
Files which were not created by bzip2 will be detected and
ignored, and a warning issued. bzip2 attempts to guess
the filename for the decompressed file from that of the
compressed file as follows:

    filename.bz2    becomes   filename
    filename.bz     becomes   filename
    filename.tbz2   becomes   filename.tar
    filename.tbz    becomes   filename.tar
    anyothername    becomes   anyothername.out

If the file does not end in one of the recognised endings,
.bz2, .bz, .tbz2 or .tbz, bzip2 complains that it cannot
guess the name of the original file, and uses the original
name with .out appended.
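The naming rules above can be sketched in a few lines of Python. This is an illustration of the mapping only, not bzip2's own code; the names SUFFIX_RULES and guess_output_name are hypothetical.

```python
# Suffix rules copied from the mapping above: (compressed suffix, replacement).
SUFFIX_RULES = [
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
    (".bz2", ""),
    (".bz", ""),
]


def guess_output_name(name):
    """Return the name a decompressed file would get (hypothetical helper)."""
    for suffix, replacement in SUFFIX_RULES:
        if name.endswith(suffix):
            # Drop the recognised ending and append its replacement, if any.
            return name[: -len(suffix)] + replacement
    # No recognised ending: fall back to appending .out.
    return name + ".out"
```

For example, guess_output_name("backup.tbz2") gives "backup.tar", while a name with no recognised ending simply gains ".out".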

As with compression, supplying no filenames causes
decompression from standard input to standard output.

bunzip2 will correctly decompress a file which is the
concatenation of two or more compressed files. The result
is the concatenation of the corresponding uncompressed
files. Integrity testing (-t) of concatenated compressed
files is also supported.
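The same property can be observed from Python's standard bz2 module, which (in Python 3.3 and later) also accepts concatenated streams. This is a small sketch using that module rather than the bunzip2 program itself; the sample byte strings are made up.

```python
import bz2

# Compress two inputs separately, as if they were two .bz2 files.
part_one = bz2.compress(b"first file\n")
part_two = bz2.compress(b"second file\n")

# Concatenating the compressed streams is like "cat a.bz2 b.bz2".
combined = part_one + part_two

# Decompressing the concatenation yields the concatenated originals.
restored = bz2.decompress(combined)
```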

You can also compress or decompress files to the standard
output by giving the -c flag. Multiple files may be
compressed and decompressed like this. The resulting
outputs are fed sequentially to stdout. Compression of
multiple files in this manner generates a stream containing
multiple compressed file representations. Such a stream
can be decompressed correctly only by bzip2 version 0.9.0
or later. Earlier versions of bzip2 will stop after
decompressing the first file in the stream.

bzcat (or bzip2 -dc) decompresses all specified files to
the standard output.

bzip2 will read arguments from the environment variables
BZIP2 and BZIP, in that order, and will process them
before any arguments read from the command line. This
gives a convenient way to supply default arguments.

Compression is always performed, even if the compressed
file is slightly larger than the original. Files of less
than about one hundred bytes tend to get larger, since the
compression mechanism has a constant overhead in the
region of 50 bytes. Random data (including the output of
most file compressors) is coded at about 8.05 bits per
byte, giving an expansion of around 0.5%.
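A quick way to see both effects is to measure them with Python's bz2 module. This is a rough sketch: the helper name expansion is hypothetical, the inputs are arbitrary, and the exact byte counts will depend on the library version, so none are quoted here.

```python
import bz2
import random


def expansion(data):
    """Bytes gained (positive) or saved (negative) by compressing data."""
    return len(bz2.compress(data)) - len(data)


# A very small input: the fixed overhead dominates.
tiny = b"hello"

# Pseudo-random bytes from a seeded generator, standing in for
# incompressible data such as already-compressed files.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(100_000))

tiny_growth = expansion(tiny)
noise_growth = expansion(noise)
```

Both tiny_growth and noise_growth should come out positive, which is the behaviour the paragraph above describes.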

As a self-check for your protection, bzip2 uses 32-bit
CRCs to make sure that the decompressed version of a file
is identical to the original. This guards against
corruption of the compressed data, and against undetected
bugs in bzip2 (hopefully very unlikely). The chance of
data corruption going undetected is microscopic, about one
chance in four billion for each file processed. Be aware,
though, that the check occurs upon decompression, so it
can only tell you that something is wrong. It can't help
you recover the original uncompressed data. You can use
bzip2recover to try to recover data from damaged files.
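The check can be watched in action with Python's bz2 module by damaging a compressed buffer on purpose. This is a sketch under one stated assumption: that byte offset 10 falls inside the first block's stored CRC, which follows from the commonly documented stream layout (a 4-byte header, then a 48-bit block pattern, then the 32-bit CRC) and is not something this page spells out. The helper name is_intact is hypothetical.

```python
import bz2


def is_intact(blob):
    """Return True if blob decompresses cleanly, False if an error is raised."""
    try:
        bz2.decompress(bytes(blob))
    except (OSError, ValueError):
        # The bz2 module reports invalid or truncated data with these types.
        return False
    return True


original = b"some data worth protecting\n" * 100
good = bz2.compress(original)

# Flip a single bit at offset 10 (assumed to be part of the block CRC).
damaged = bytearray(good)
damaged[10] ^= 0x01

good_ok = is_intact(good)
damaged_ok = is_intact(damaged)
```

As the text says, the failure only reports that something is wrong; nothing in the error helps reconstruct the original bytes.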

Return values: 0 for a normal exit, 1 for environmental
problems (file not found, invalid flags, I/O errors, etc.),
2 to indicate a corrupt compressed file, 3 for an internal
consistency error (e.g., a bug) which caused bzip2 to panic.


OPTIONS
-c --stdout
    Compress or decompress to standard output.

-d --decompress
    Force decompression. bzip2, bunzip2 and bzcat are
    really the same program, and the decision about what
    actions to take is done on the basis of which name is
    used. This flag overrides that mechanism, and forces
    bzip2 to decompress.

-z --compress
    The complement to -d: forces compression, regardless
    of the invocation name.

-t --test
    Check integrity of the specified file(s), but don't
    decompress them. This really performs a trial
    decompression and throws away the result.

-f --force
    Force overwrite of output files. Normally, bzip2 will
    not overwrite existing output files. Also forces bzip2
    to break hard links to files, which it otherwise
    wouldn't do.

-k --keep
    Keep (don't delete) input files during compression or
    decompression.

-s --small
    Reduce memory usage, for compression, decompression
    and testing. Files are decompressed and tested using a
    modified algorithm which only requires 2.5 bytes per
    block byte. This means any file can be decompressed in
    2300k of memory, albeit at about half the normal speed.

    During compression, -s selects a block size of 200k,
    which limits memory use to around the same figure, at
    the expense of your compression ratio. In short, if
    your machine is low on memory (8 megabytes or less),
    use -s for everything. See MEMORY MANAGEMENT below.

-q --quiet
    Suppress non-essential warning messages. Messages
    pertaining to I/O errors and other critical events
    will not be suppressed.

-v --verbose
    Verbose mode -- show the compression ratio for each
    file processed. Further -v's increase the verbosity
    level, spewing out lots of information which is
    primarily of interest for diagnostic purposes.

-L --license -V --version
    Display the software version, license terms and
    conditions.

-1 to -9
    Set the block size to 100 k, 200 k .. 900 k when
    compressing. Has no effect when decompressing.
    See MEMORY MANAGEMENT below.

--
    Treats all subsequent arguments as file names, even if
    they start with a dash. This is so you can handle
    files with names beginning with a dash, for example:
    bzip2 -- -myfilename.

--repetitive-fast --repetitive-best
    These flags are redundant in versions 0.9.5 and above.
    They provided some coarse control over the behaviour
    of the sorting algorithm in earlier versions, which
    was sometimes useful. 0.9.5 and above have an improved
    algorithm which renders these flags irrelevant.


MEMORY MANAGEMENT
bzip2 compresses large files in blocks. The block size
affects both the compression ratio achieved, and the
amount of memory needed for compression and decompression.
The flags -1 through -9 specify the block size to be
100,000 bytes through 900,000 bytes (the default)
respectively. At decompression time, the block size used
for compression is read from the header of the compressed
file, and bunzip2 then allocates itself just enough memory
to decompress the file. Since block sizes are stored in
compressed files, it follows that the flags -1 to -9 are
irrelevant to and so ignored during decompression.
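The stored block size can be inspected with Python's bz2 module, whose compresslevel argument corresponds to the -1 to -9 flags. The sketch relies on an observation about the stream format that this page does not state explicitly: the first four bytes are "BZh" followed by the level digit. The sample data is arbitrary.

```python
import bz2

data = b"block size lives in the header\n" * 50

# Compress the same data at three different levels and keep the
# first four bytes of each result.
headers = {
    level: bz2.compress(data, compresslevel=level)[:4]
    for level in (1, 5, 9)
}

# Decompression takes no level argument: the value comes from the stream.
round_trip = bz2.decompress(bz2.compress(data, compresslevel=1))
```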

Compression and decompression requirements, in bytes, can
be estimated as:

    Compression:   400k + ( 8 x block size )

    Decompression: 100k + ( 4 x block size ), or
                   100k + ( 2.5 x block size )
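These estimates are easy to turn into a small calculator. The sketch below assumes "k" means 1000 bytes, in line with the "100,000 bytes" wording above; the function names are hypothetical.

```python
K = 1000  # assumption: "k" in the formulas means 1000 bytes


def block_size(flag):
    """Block size in bytes for a flag value from 1 to 9."""
    return flag * 100 * K


def compress_bytes(flag):
    """Estimated memory for compression: 400k + 8 x block size."""
    return 400 * K + 8 * block_size(flag)


def decompress_bytes(flag, small=False):
    """Estimated memory for decompression; small=True models the -s flag."""
    factor = 2.5 if small else 4
    return int(100 * K + factor * block_size(flag))
```

For the default -9 setting this gives 7,600,000 bytes to compress, 3,700,000 to decompress, and 2,350,000 to decompress with -s.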

Larger block sizes give rapidly diminishing marginal
returns. Most of the compression comes from the first two
or three hundred k of block size, a fact worth bearing in
mind when using bzip2 on small machines. It is also
important to appreciate that the decompression memory
requirement is set at compression time by the choice of
block size.

For files compressed with the default 900k block size,
bunzip2 will require about 3700 kbytes to decompress. To
support decompression of any file on a 4 megabyte machine,
bunzip2 has an option to decompress using approximately
half this amount of memory, about 2300 kbytes.
Decompression speed is also halved, so you should use this
option only where necessary. The relevant flag is -s.

In general, try to use the largest block size memory
constraints allow, since that maximises the compression
achieved. Compression and decompression speed are
virtually unaffected by block size.

Another significant point applies to files which fit in a
single block -- that means most files you'd encounter
using a large block size. The amount of real memory
touched is proportional to the size of the file, since the
file is smaller than a block. For example, compressing a
file 20,000 bytes long with the flag -9 will cause the
compressor to allocate around 7600k of memory, but only
touch 400k + 20000 * 8 = 560 kbytes of it. Similarly, the
decompressor will allocate 3700k but only touch 100k +
20000 * 4 = 180 kbytes.

Here is a table which summarises the maximum memory usage
for different block sizes. Also recorded is the total
compressed size for 14 files of the Calgary Text
Compression Corpus totalling 3,141,622 bytes. This column
gives some feel for how compression varies with block size.
These figures tend to understate the advantage of larger
block sizes for larger files, since the Corpus is
dominated by smaller files.

           Compress   Decompress   Decompress   Corpus
    Flag     usage      usage       -s usage     Size

     -1      1200k       500k         350k      914704
     -2      2000k       900k         600k      877703
     -3      2800k      1300k         850k      860338
     -4      3600k      1700k        1100k      846899
     -5      4400k      2100k        1350k      845160
     -6      5200k      2500k        1600k      838626
     -7      6100k      2900k        1850k      834096
     -8      6800k      3300k        2100k      828642
     -9      7600k      3700k        2350k      828642


RECOVERING DATA FROM DAMAGED FILES
bzip2 compresses files in blocks, usually 900kbytes long.
Each block is handled independently. If a media or
transmission error causes a multi-block .bz2 file to become
damaged, it may be possible to recover data from the
undamaged blocks in the file.

The compressed representation of each block is delimited
by a 48-bit pattern, which makes it possible to find the
block boundaries with reasonable certainty. Each block
also carries its own 32-bit CRC, so damaged blocks can be
distinguished from undamaged ones.
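The kind of search this makes possible can be sketched as a bit-by-bit scan for the pattern. This is an illustration only, not the bzip2recover source. The pattern value 0x314159265359 is taken from commonly available descriptions of the format and is an assumption here, since this page does not give it; find_block_starts is a hypothetical name.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # assumed 48-bit block-start pattern
MASK_48 = (1 << 48) - 1


def find_block_starts(data):
    """Return the bit offsets at which the 48-bit pattern begins."""
    window = 0   # the most recent 48 bits seen
    bitpos = 0   # number of bits consumed so far
    starts = []
    for byte in data:
        for shift in range(7, -1, -1):
            # Shift in the next bit, most significant bit first.
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK_48
            bitpos += 1
            if bitpos >= 48 and window == BLOCK_MAGIC:
                starts.append(bitpos - 48)
    return starts


# 250,000 bytes at the smallest block size should span several blocks.
sample = bytes((i * 7) % 251 for i in range(250_000))
stream = bz2.compress(sample, compresslevel=1)
starts = find_block_starts(stream)
```

The scan works on bits rather than bytes because only the first block is guaranteed to begin on a byte boundary (here at bit 32, right after the 4-byte header); later blocks can begin at any bit position.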

bzip2recover is a simple program whose purpose is to
search for blocks in .bz2 files, and write each block out
into its own .bz2 file. You can then use bzip2 -t to test
the integrity of the resulting files, and decompress those
which are undamaged.

bzip2recover takes a single argument, the name of the
damaged file, and writes a number of files
"rec0001file.bz2", "rec0002file.bz2", etc, containing the
extracted blocks. The output filenames are designed so
that the use of wildcards in subsequent processing -- for
example, "bzip2 -dc rec*file.bz2 > recovered_data" --
lists the files in the correct order.

bzip2recover should be of most use dealing with large .bz2
files, as these will contain many blocks. It is clearly
futile to use it on damaged single-block files, since a
damaged block cannot be recovered. If you wish to minimise
any potential data loss through media or transmission
errors, you might consider compressing with a smaller
block size.


PERFORMANCE NOTES
The sorting phase of compression gathers together similar
strings in the file. Because of this, files containing
very long runs of repeated symbols, like "aabaabaabaab
..." (repeated several hundred times) may compress more
slowly than normal. Versions 0.9.5 and above fare much
better than previous versions in this respect. The ratio
between worst-case and average-case compression time is in
the region of 10:1. For previous versions, this figure
was more like 100:1. You can use the -vvvv option to
monitor progress in great detail, if you want.

Decompression speed is unaffected by these phenomena.

bzip2 usually allocates several megabytes of memory to
operate in, and then charges all over it in a fairly
random fashion. This means that performance, both for
compressing and decompressing, is largely determined by the
speed at which your machine can service cache misses.
Because of this, small changes to the code to reduce the
miss rate have been observed to give disproportionately
large performance improvements. I imagine bzip2 will
perform best on machines with very large caches.


CAVEATS
I/O error messages are not as helpful as they could be.
bzip2 tries hard to detect I/O errors and exit cleanly,
but the details of what the problem is sometimes seem
rather misleading.

This manual page pertains to version 1.0 of bzip2.
Compressed data created by this version is entirely
forwards and backwards compatible with the previous public
releases, versions 0.1pl2, 0.9.0 and 0.9.5, but with the
following exception: 0.9.0 and above can correctly
decompress multiple concatenated compressed files. 0.1pl2
cannot do this; it will stop after decompressing just the
first file in the stream.

bzip2recover uses 32-bit integers to represent bit
positions in compressed files, so it cannot handle
compressed files more than 512 megabytes long (2^32 bits
is 512 megabytes). This could easily be fixed.


AUTHOR
Julian Seward, jseward@acm.org.

http://sourceware.cygnus.com/bzip2
http://www.muraroa.demon.co.uk

The ideas embodied in bzip2 are due to (at least) the
following people: Michael Burrows and David Wheeler (for
the block sorting transformation), David Wheeler (again,
for the Huffman coder), Peter Fenwick (for the structured
coding model in the original bzip, and many refinements),
and Alistair Moffat, Radford Neal and Ian Witten (for the
arithmetic coder in the original bzip). I am much
indebted for their help, support and advice. See the
manual in the source distribution for pointers to sources
of documentation. Christian von Roques encouraged me to
look for faster sorting algorithms, so as to speed up
compression. Bela Lubkin encouraged me to improve the
worst-case compression performance. Many people sent
patches, helped with portability problems, lent machines,
gave advice and were generally helpful.
375 were generally helpful.
376