Monday, November 9, 2009

Compression Algorithms and their Performance

1 comments

The Premise: GZIP vs BZIP2 vs XZ (the new LZMA)

I was asking myself the other day (as often as I do), what are the relative differences between the available FOSS compression mechanisms available today? Now, apart from having a recent class in language processing about the matter, I took it upon myself to give a few of them a try, and report it here.

The Data

I needed some data which would work to the compressor's advantage. To that end, I decided that I needed a substantially large corpus with which to play around. The word list which comes on OS X is large, but I wanted something massive. Then I thought, perhaps XML would be great. But where to find a gigantic XML file?

I took a look at xmark, and set it with a scaling factor of 5 (download the source for more info). This is an well-formed XML generator. Setting it as I did produced over 500 megabytes worth of XML.

The Compressors

For the test, I decided to skip compress and zip, and focus on the newer ones. These include gzip, mgzip (a mutli-threaded version), bzip2, and xz. I used the latest versions (at the time of this posting) of each, and set off to see how well they did.

The Results

First, the run time:

time gzip -9 BIG.xml       : real 0m48.609s
time mgzip -t 2 -9 BIG.xml : real 0m24.941s
time bzip2 BIG.xml         : real 1m31.825s
time xz -9 BIG.xml         : real 9m51.624s

Next, for the compression:

585540595 BIG.xml
128643043 BIG.xml.bz2
191123758 BIG.xml.gz
 28386648 BIG.xml.xz

Conclusion

Yes, that does say that a 559 MB file was compressed to 28 MB. If you're willing to wait for it, XZ is by far the best choice out there. On a related note, it uses a heck of a lot of memory in '-9' mode (just shy of 700 MB). But then again, the developers did warn you in the man page!

New Layout

0 comments

I've been at it once more, and I've updated the layout yet again.

This time though, I think I'll keep it this way for a while. I'm much happier with it.

Thursday, November 5, 2009

My Issue with Microsoft (Part 2)

0 comments

I've been meaning to update this post, and over the ages, I've come across this particular, well-formed essay-like rant which sums up most of my complaints with the Microsoft: Why I Hate Microsoft

I think it's a pretty good read, especially if you're like me.

Wednesday, November 4, 2009

New Googly Features

0 comments

Some minimal changes have been made to the site. While I don't like having so much stuff here that it gets cumbersome, I do like playing around with all the neat things which Google has to offer (most notably Profiles and Friend Connect)