Monday, November 9, 2009

Compression Algorithms and their Performance

The Premise: GZIP vs BZIP2 vs XZ (the new LZMA)

I was asking myself the other day (as often as I do), what are the relative differences between the available FOSS compression mechanisms available today? Now, apart from having a recent class in language processing about the matter, I took it upon myself to give a few of them a try, and report it here.

The Data

I needed some data which would work to the compressor's advantage. To that end, I decided that I needed a substantially large corpus with which to play around. The word list which comes on OS X is large, but I wanted something massive. Then I thought, perhaps XML would be great. But where to find a gigantic XML file?

I took a look at xmark, and set it with a scaling factor of 5 (download the source for more info). This is a well-formed XML generator. Setting it as I did produced over 500 megabytes worth of XML.

The Compressors

For the test, I decided to skip compress and zip, and focus on the newer ones. These include gzip, mgzip (a mutli-threaded version), bzip2, and xz. I used the latest versions (at the time of this posting) of each, and set off to see how well they did.

The Results

First, the run time:

time gzip -9 BIG.xml       : real 0m48.609s
time mgzip -t 2 -9 BIG.xml : real 0m24.941s
time bzip2 BIG.xml         : real 1m31.825s
time xz -9 BIG.xml         : real 9m51.624s

Next, for the compression:

585540595 BIG.xml
128643043 BIG.xml.bz2
191123758 BIG.xml.gz
 28386648 BIG.xml.xz

Conclusion

Yes, that does say that a 559 MB file was compressed to 28 MB. If you're willing to wait for it, XZ is by far the best choice out there. On a related note, it uses a heck of a lot of memory in '-9' mode (just shy of 700 MB). But then again, the developers did warn you in the man page!

1 comments :

Sebastian Weigand said...

If you're curious as to the machine's specifics, check out this post.

Post a Comment