My Hadoop Gotchas
Posted: November 3rd, 2009 | Author: ryan | Filed under: Hadoop, Programming, Tips and Tricks
Hadoop and I have had our ups and downs lately. I have been accumulating personal notes about certain issues I’ve had and their solutions. This is a record of the things that have bitten me.
If your streaming job has a lot of dependencies that you want to ship along in your jar, you might find it useful to jar them as well and throw a
-file my_deps.jar at Hadoop. The gotcha is this: Hadoop won’t automatically unjar the archive for you. To get around this, wrap your mapper/reducer in another script that unpacks the archive and then executes your original mapper/reducer.
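#!/bin/bash
# Unpack the dependencies shipped with -file (a jar is a zip archive).
# If you shipped a tarball instead, use: tar -xzf my_deps.tgz
jar xf my_deps.jar

# You've probably already made mapper.py executable (+x), since Hadoop
# required that when it was the mapper itself, but invoking python
# explicitly no longer relies on it.
# Just don't forget to chmod +x mapper_wrapper.sh as well.
python ./mapper.py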
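The same wrapper idea can also be written in Python, which avoids depending on a jar or tar binary being on the task nodes' PATH. This is only a minimal sketch under a few assumptions: the archive is a jar (and therefore readable as a zip file), the names my_deps.jar and mapper.py are the ones used in this post, and unpack_and_run is a made-up helper, not anything Hadoop provides.

```python
#!/usr/bin/env python
"""Hypothetical streaming wrapper: unpack shipped deps, then run the real mapper."""
import subprocess
import sys
import zipfile


def unpack_and_run(archive, command, workdir="."):
    """Extract `archive` into `workdir`, then run `command` from there.

    stdin and stdout are inherited by the child process, so streaming
    records pass straight through to the real mapper/reducer.
    Returns the child's exit code.
    """
    # A jar is a zip file, so the standard zipfile module can read it.
    with zipfile.ZipFile(archive) as deps:
        deps.extractall(workdir)
    return subprocess.call(command, cwd=workdir)
```

To use it as the actual wrapper, the script would end with something like sys.exit(unpack_and_run("my_deps.jar", [sys.executable, "mapper.py"])), so that a failing mapper still fails the task. As with the bash version, the wrapper itself needs to be executable and shipped with -file.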
I was having a lot of issues with HDFS. I couldn’t issue a
bin/hadoop dfs -mkdir, -put, or -copyFromLocal without all kinds of connection issues or cryptic java errors.
After copious amounts of frustration I finally seemed to fix the issue, or at least find a work-around.
Warning: the following commands will destroy your data in HDFS.
> bin/stop-dfs.sh
> rm -rf hadoop-ryan/*
> bin/hadoop namenode -format
> bin/start-dfs.sh
Note that hadoop-ryan/ is my hadoop.tmp.dir, configured like this:
<property>
<name>hadoop.tmp.dir</name>
<value>/Users/ryan/Dev/hadoop/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
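Since the rm -rf above is only safe when it points at the right directory, it can be worth checking what hadoop.tmp.dir actually resolves to before running it. Here is a rough sketch, assuming the property sits in a Hadoop-style XML config wrapped in a <configuration> element. The helper name resolve_tmp_dir is made up for illustration, and only the ${user.name} placeholder is expanded; Hadoop's own substitution handles more than that.

```python
import getpass
import xml.etree.ElementTree as ET


def resolve_tmp_dir(config_xml, user=None):
    """Return hadoop.tmp.dir from a Hadoop-style config string, or None if unset.

    Only the ${user.name} placeholder is expanded (an assumption of this
    sketch); any other ${...} references are left untouched.
    """
    user = user or getpass.getuser()
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "hadoop.tmp.dir":
            value = prop.findtext("value", "").strip()
            return value.replace("${user.name}", user)
    return None


# The property from this post, wrapped in the usual <configuration> root.
EXAMPLE = """
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/Users/ryan/Dev/hadoop/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>
"""

print(resolve_tmp_dir(EXAMPLE, user="ryan"))
# /Users/ryan/Dev/hadoop/hadoop-ryan
```

If the printed path is not the directory you expect to wipe, stop there and fix the config before formatting the namenode.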



