The Case for Babar: A Tool for Creating Hadoop Sequence Files

Posted: February 22nd, 2010 | Author: ryan | Filed under: Hadoop, Programming

Mark Sands and I have been tooling around lately with Sequence Files in Hadoop jobs. In our case, a Sequence File contains many files that are processed as a single unit in a streaming job.

Trouble started creeping in while creating the sequence file once the input files got quite large. As you might guess, not only are they large, but we have a lot of them. We are using Hadoop, after all. Stuart Sierra’s Tar-to-Seq utility had been working quite nicely until this new input set, since the testing set was composed of much smaller files.

On some machines, the heap size had to be increased just to pack a single-image input tar into a sequence file. Once that problem was solved, it was still glaringly obvious that we needed a better way.

And Babar was born.

With a heaping spoonful of help from Michael Armbrust, we were able to fight through the Java, and the first version was pushed to GitHub. Babar is still under active development, with a few nifty features in the works that I won’t mention here. If you want to know, ask Mark.

Babar takes a list of URLs and packs them into a sequence file. Then we can process each file in the existing streaming job just as before.

Tar-to-Seq is still useful, but Babar harnesses MapReduce to grab each file, so you get automatic retries and the like for free. It also uses SequenceFileOutputFormat natively, so you don’t have to worry about the details of writing a sequence file yourself; Hadoop handles that for you too.
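
To make the idea concrete, here is a rough sketch of packing a list of URLs into key/value records, one record at a time, so that only a single file's contents sit in memory at once. This is not Babar's code, and it is not the real SequenceFile format (which adds a header, sync markers, and optional compression). The length-prefixed layout and the names pack_records, read_records, and fetch are all made up for illustration, and it is in Python rather than Babar's Java just to keep it short.

```python
import io
import struct
from typing import BinaryIO, Callable, Iterable, Iterator, Tuple

# Hypothetical record layout (NOT the SequenceFile format):
# two big-endian 4-byte lengths, then the key bytes, then the value bytes.
HEADER = ">II"


def pack_records(urls: Iterable[str],
                 fetch: Callable[[str], bytes],
                 out: BinaryIO) -> int:
    """Write one (url, content) record per URL and return the record count.

    `fetch` is an assumed callable that returns the bytes behind a URL.
    Each file is fetched and written before the next one is touched.
    """
    count = 0
    for url in urls:
        key = url.encode("utf-8")
        value = fetch(url)
        out.write(struct.pack(HEADER, len(key), len(value)))
        out.write(key)
        out.write(value)
        count += 1
    return count


def read_records(inp: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (url, content) pairs back out of a stream written above."""
    header_size = struct.calcsize(HEADER)
    while True:
        header = inp.read(header_size)
        if not header:
            return  # clean end of stream
        if len(header) != header_size:
            raise ValueError("truncated record header")
        key_len, value_len = struct.unpack(HEADER, header)
        key = inp.read(key_len)
        value = inp.read(value_len)
        if len(key) != key_len or len(value) != value_len:
            raise ValueError("truncated record body")
        yield key.decode("utf-8"), value


if __name__ == "__main__":
    # Stand-in for real downloads: a dict lookup instead of an HTTP request.
    fake_files = {
        "http://example.com/a.img": b"first file",
        "http://example.com/b.img": b"second file",
    }
    buffer = io.BytesIO()
    pack_records(fake_files, fake_files.__getitem__, buffer)
    buffer.seek(0)
    for name, content in read_records(buffer):
        print(name, len(content))
```

In Babar itself, the fetching happens inside MapReduce tasks and the writing is left to SequenceFileOutputFormat, so none of this bookkeeping has to be done by hand.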

Grab it from GitHub, or from my fork, also on GitHub.


