My Hadoop Gotchas

Posted: November 3rd, 2009 | Author: ryan | Filed under: Hadoop, Programming, Tips and Tricks

Hadoop and I have had our ups and downs lately. I have been accumulating personal notes about certain issues I’ve had and their solutions. This is a record of the things that have bitten me.



If your streaming job has a lot of dependencies that you want to ship along with it, you might find it useful to bundle them into a single archive and throw a -file my_deps.tgz at hadoop. The gotcha is this: hadoop won’t automatically unpack a file shipped with -file for you. To get around this, simply wrap your mapper/reducer with another script that unpacks the archive and then executes your original mapper/reducer.

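#!/bin/bash

# Unpack the dependencies shipped with -file. No -v here: GNU tar prints
# the file list to stdout, which streaming would treat as mapper output.
tar -xzf my_deps.tgz
# mapper.py is probably already +x, since Hadoop required that when it was
# run directly, but calling python explicitly makes the intent clearer.
# Just don't forget to chmod +x mapper_wrapper.sh as well.
python ./mapper.py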
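
As an alternative to a shell wrapper, the unpacking step can live inside the Python mapper itself. This is only a minimal sketch under a few assumptions: the archive is a gzipped tarball named my_deps.tgz that you built yourself and trust, it contains importable Python modules at its top level, and the names unpack_deps, run_mapper, and the deps directory are hypothetical ones chosen for illustration.

```python
import os
import sys
import tarfile


def unpack_deps(archive="my_deps.tgz", dest="deps"):
    """Extract the shipped archive into dest (only once) and make it importable."""
    if not os.path.isdir(dest):
        # Only do this with archives you built yourself; extractall trusts
        # the member paths stored in the tarball.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest)
    if dest not in sys.path:
        sys.path.insert(0, dest)
    return dest


def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    """Placeholder identity mapper; the real map logic would go here."""
    unpack_deps()
    # Import the bundled modules here, after unpacking, not at module top level.
    for line in stdin:
        stdout.write(line)
```

With this approach mapper.py would call run_mapper() at the bottom of the file and could be passed to streaming directly, without the wrapper script. Nothing is printed during extraction, so stdout stays reserved for mapper output.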



I was having a lot of issues with HDFS. I couldn’t issue a bin/hadoop dfs -mkdir, -put, or -copyFromLocal without all kinds of connection issues or cryptic Java errors.

After copious amounts of frustration I finally seemed to fix the issue, or at least find a work-around.

Warning: the following commands will destroy your data in HDFS.

> bin/stop-dfs.sh
> rm -rf hadoop-ryan/*
> bin/hadoop namenode -format
> bin/start-dfs.sh

Note that hadoop-ryan is my hadoop.tmp.dir, as set by this property:

<property>
<name>hadoop.tmp.dir</name>
<value>/Users/ryan/Dev/hadoop/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
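
Since the rm -rf above is destructive, it can be worth double-checking which directory hadoop.tmp.dir actually resolves to before deleting anything. The following is a minimal sketch, not part of Hadoop: resolve_tmp_dir is a hypothetical helper, it only expands the ${user.name} placeholder, and it assumes the current login name from getpass matches what Hadoop would substitute.

```python
import getpass
import xml.etree.ElementTree as ET


def resolve_tmp_dir(xml_text, user=None):
    """Return hadoop.tmp.dir from a config snippet, with ${user.name} expanded.

    Accepts either a bare <property> element or a <configuration> wrapping
    several of them. Returns None if the property is not present.
    """
    user = user or getpass.getuser()
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself when its tag is "property".
    for prop in root.iter("property"):
        if prop.findtext("name") == "hadoop.tmp.dir":
            value = prop.findtext("value", default="")
            return value.replace("${user.name}", user)
    return None
```

Feeding it the property above with user="ryan" gives /Users/ryan/Dev/hadoop/hadoop-ryan, which is the hadoop-ryan directory wiped in the commands earlier.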

