The Case for Babar: A Tool for Creating Hadoop Sequence Files

Posted: February 22nd, 2010 | Author: ryan | Filed under: Hadoop, Programming | View Comments

Mark Sands and myself have been tooling around lately with using Sequence Files in Hadoop jobs. In our case, a Sequence File contains many files which are to be processed as a single unit in a streaming job.

Trouble started creeping up during the creation of the sequence file when the input files were quite large. As you might guess, not only are they large, but we have a lot of them. We are using Hadoop after all. Stuart Sierra’s Tar-to-Seq utility was working quite nicely until this new input set, as the testing set was comprised of much smaller files.

On some machines, the heap size had to be increased to even pack a single-image input tar into a sequence file. Once that problem was solved it was still glaringly obvious that we needed to find a better way.

And Babar was born.

With a heaping spoonful of help from Michael Armbrust we were able to fight through the Java and the first version was pushed to GitHub. Babar is still actively being developed, along with a few nifty features I won’t mention here. If you want to know, ask Mark.

Babar takes a list of URLs and packs them into a sequence file. Then we can process each file in the existing streaming job just as before.

Tar-to-Seq is still useful, but with Babar harnesses MapReduce to grab each file (and you get automatic retries and the like for free). It also uses SequenceFileOutputFormat natively, so you don’t have to worry about the details of writing a sequence file yourself, since Hadoop does that for free too.

Grab it from GitHub, or from my fork, also on GitHub.


Conference Talks and Poster Sessions

Posted: January 26th, 2010 | Author: ryan | Filed under: Hadoop, Python | Tags: , , , , | View Comments

I think I might have more draft posts than published ones. That’s kind of sad. I’m going to try and ease the pain, if only just slightly, right here right now.

I had started writing a post while I attended Big LAMP Camp in Nashville, TN back in early November 2009. The conference and camp were a lot of fun and well attended. I met a lot of great people. The list of speakers shows just goes to show how the traditional LAMP stack has evolved to include languages besides PHP (Ruby and RoR had a strong showing). Although I was the lonely Python developer. My talk was about how Galaxy Zoo is using Python and Hadoop to process large astronomical data sets.

More recently I traveled to Washington D.C. with other members of the Galaxy Zoo team to the 215th meeting of the American Astronomical Society. There I presented a poster on the same topic.


My Hadoop Gotchas

Posted: November 3rd, 2009 | Author: ryan | Filed under: Hadoop, Programming, Tips and Tricks | View Comments

Hadoop and I have had our ups and downs lately. I have been accumulating personal notes about certain issues I’ve had and their solutions. This is a record of the things that have bitten me.



If your streaming job has a lot of dependencies that you want to ship along in your jar you might find it useful to jar them as well and throw a -file my_deps.jar at hadoop. The gotcha is this: hadoop won’t automatically unjarify for you. To get around this simply wrap your mapper/reducer with an other script to unjar and then execute your original mapper/reducer.

#!/bin/bash

tar -xzvf my_deps.tgz .
python ./mapper.py
# You've probably already escalated the permissions to +x
# since it were required previously by Hadoop,
# but now we can be more explicit.
# Just don't forget to do the same on mapper_wrapper.sh.



I was having a lot of issue with HDFS. I couldn’t issue a bin/hadoop dfs -mkdir, -put, or -copyFromLocal without all kinds of connection issues or cryptic java errors.

After copious amounts of frustration I finally seemed to fix the issue, or at least find a work-around.

Warning: the following commands will destroy your data in HDFS.

> bin/stop-dfs.sh
> rm -rf hadoop-ryan/*
> bin/hadoop namenode -format
> bin/start-dfs.sh

Note that hadoop-ryan/* is my hadoop.tmp.dir.

<property>
<name>hadoop.tmp.dir</name>
<value>/Users/ryan/Dev/hadoop/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>

Running Cloudera’s Hadoop Distribution on OS X

Posted: November 1st, 2009 | Author: ryan | Filed under: Hadoop, Programming, Tutorial | View Comments

Lately, I have been spending a lot of time working with Hadoop (or should I say, trying to work with Hadoop). After bouncing between various versions, ranging from 0.18-0.20, and between the canonical Apache Hadoop to the patched and improved Cloudera Hadoop Distirbution I’m settling on Cloudera 0.18. Specifically, for the purposes of this article I will be using Cloudera version 0.18.3+76. You may find this, as well as additional releases at http://archive.cloudera.com/cdh/.

Let me start by pointing you toward the pages that helped get me to where I am.

You’ll notice that Apache’s Hadoop documentation doesn’t (at time of writing) cover 0.20. Hence, I starting working with 0.18. What is missing are some details about the configuration files. If you are not aware, 0.20 moved conf/hadoop-site.xml to conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml. The default configuration is stored in src/core/core-default.xmlsrc/hdfs/hdfs-default.xml and src/mapred/mapred-default.xml. Although these files should provide all the clues I needed to properly configure Hadoop 0.20, I didn’t want to pour over them to try and weed out only the values needing to be changed. I turned to Google and found that many people were having the same issues as myself. I won’t explain the problems in detail now.

Back to getting things working with 0.18.

Some tutorials out there begin with creating a dedicated hadoop users. I did not do that. Rather, I simply stashed the tarball into my working directory and changed some configuration values appropriately. I shall call the directory containing the extracted hadoop-* directory HADOOP_DIR.

Here are the steps that should get you up and running (hopefully, and without pain).

  1. cd HADOOP_DIR
  2. Get you hostname by opening Terminal.app and running hostname
  3. open conf/hadoop-site.xml in your favorite editor and paste the following, replacing Ryans-MacBook.local with you own hostname and HADOOP_DIR with the directory you untared the tarball

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 
<!-- Put site-specific property overrides in this file. -->
 
<configuration>
 
<property>
  <name>hadoop.tmp.dir</name>
  <value>HADOOP_DIR/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>
 
<property>
  <name>fs.default.name</name>
  <value>hdfs://Ryans-MacBook.local:9000</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
 
<property>
  <name>mapred.job.tracker</name>
  <value>Ryans-MacBook.local:9001</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
 
<property>
<name>mapred.tasktracker.tasks.maximum</name>
<value>8</value>
<description>The maximum number of tasks that will be run simultaneously by a
a task tracker
</description>
</property>
 
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
 
</configuration>

The important changes I have made are replacing localhost with Ryans-Macbook.local. That’s about it. Now you should be able to point your browser at http://ryans-macbook.local:50070 to get at your NameNode and http://ryans-macbook.local:50030 for you JobTracker. I am not sure what lives at http://ryans-macbook.local:50075. Be sure to change your hostname. You can probably just use localhost in the browser (but not in the configuration files!?). It is not clear to me why I cannot use localhost in the configuration files, but it is okay in the browser. A quick edit of /etc/hosts didn’t seem to make any difference. Furthermore, it is still unclear why must use those ports over 9000 and 9001 for the NameNode and JobTracker, as those are the ports I specified in my site configuration.

Apparently I am missing something. If you know, please do enlighten me.