The Premise: GZIP vs BZIP2 vs XZ (the new LZMA)
I was asking myself the other day (as often as I do), what are the relative differences between the available FOSS compression mechanisms available today? Now, apart from having a recent class in language processing about the matter, I took it upon myself to give a few of them a try, and report it here.
The Data
I needed some data which would work to the compressor's advantage. To that end, I decided that I needed a substantially large corpus with which to play around. The word list which comes on OS X is large, but I wanted something massive. Then I thought, perhaps XML would be great. But where to find a gigantic XML file?
I took a look at xmark, and set it with a scaling factor of 5 (download the source for more info). This is a well-formed XML generator. Setting it as I did produced over 500 megabytes worth of XML.
The Compressors
For the test, I decided to skip compress and zip, and focus on the newer ones. These include gzip, mgzip (a mutli-threaded version), bzip2, and xz. I used the latest versions (at the time of this posting) of each, and set off to see how well they did.
The Results
First, the run time:
time gzip -9 BIG.xml : real 0m48.609s time mgzip -t 2 -9 BIG.xml : real 0m24.941s time bzip2 BIG.xml : real 1m31.825s time xz -9 BIG.xml : real 9m51.624s
Next, for the compression:
585540595 BIG.xml 128643043 BIG.xml.bz2 191123758 BIG.xml.gz 28386648 BIG.xml.xz
Conclusion
Yes, that does say that a 559 MB file was compressed to 28 MB. If you're willing to wait for it, XZ is by far the best choice out there. On a related note, it uses a heck of a lot of memory in '-9' mode (just shy of 700 MB). But then again, the developers did warn you in the man page!