Running Cloudera’s Hadoop Distribution on OS X

Posted: November 1st, 2009 | Author: ryan | Filed under: Hadoop, Programming, Tutorial

Lately, I have been spending a lot of time working with Hadoop (or should I say, trying to work with Hadoop). After bouncing between various versions, from 0.18 to 0.20, and between the canonical Apache Hadoop and the patched and improved Cloudera Hadoop Distribution, I’m settling on Cloudera 0.18. Specifically, for the purposes of this article I will be using Cloudera version 0.18.3+76. You may find this, as well as additional releases, at http://archive.cloudera.com/cdh/.

Let me start by pointing you toward the pages that helped get me to where I am.

You’ll notice that Apache’s Hadoop documentation doesn’t (at time of writing) cover 0.20. Hence, I started working with 0.18. What is missing are some details about the configuration files. In case you are not aware, 0.20 split conf/hadoop-site.xml into conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml. The default configuration is stored in src/core/core-default.xml, src/hdfs/hdfs-default.xml and src/mapred/mapred-default.xml. Although these files should provide all the clues I needed to properly configure Hadoop 0.20, I didn’t want to pore over them to weed out only the values that needed to be changed. I turned to Google and found that many people were having the same issues I was. I won’t explain the problems in detail now.

Back to getting things working with 0.18.

Some tutorials out there begin with creating a dedicated hadoop user. I did not do that. Rather, I simply stashed the tarball into my working directory and changed some configuration values appropriately. I shall call the directory containing the extracted hadoop-* directory HADOOP_DIR.

Here are the steps that should get you up and running (hopefully, and without pain).

  1. cd HADOOP_DIR
  2. Get your hostname by opening Terminal.app and running hostname
  3. Open conf/hadoop-site.xml in your favorite editor and paste the following, replacing Ryans-MacBook.local with your own hostname and HADOOP_DIR with the directory where you untarred the tarball

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 
<!-- Put site-specific property overrides in this file. -->
 
<configuration>
 
<property>
  <name>hadoop.tmp.dir</name>
  <value>HADOOP_DIR/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>
 
<property>
  <name>fs.default.name</name>
  <value>hdfs://Ryans-MacBook.local:9000</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
 
<property>
  <name>mapred.job.tracker</name>
  <value>Ryans-MacBook.local:9001</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
 
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of tasks that will be run simultaneously by
  a task tracker.
  </description>
</property>
 
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
 
</configuration>
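
The two substitutions from step 3 can also be scripted instead of edited by hand. The following is only a minimal sketch: fill_site_config is a hypothetical helper name (not part of Hadoop), and it assumes the pasted file still contains the literal placeholders HADOOP_DIR and Ryans-MacBook.local, and that your path contains no | or & characters.

```shell
#!/bin/sh
# Sketch: fill in the two placeholders of a pasted conf/hadoop-site.xml.
# fill_site_config is a hypothetical helper, not something Hadoop ships.
fill_site_config() {
  conf_file=$1   # path to conf/hadoop-site.xml
  hadoop_dir=$2  # directory the tarball was extracted into
  host=$3        # output of `hostname`
  # Use | as the sed delimiter because hadoop_dir contains slashes.
  # Write to a temporary file first, then move it over the original.
  sed -e "s|HADOOP_DIR|$hadoop_dir|g" \
      -e "s|Ryans-MacBook\.local|$host|g" \
      "$conf_file" > "$conf_file.tmp" && mv "$conf_file.tmp" "$conf_file"
}

# Example usage, run from inside HADOOP_DIR:
#   fill_site_config conf/hadoop-site.xml "$(pwd)" "$(hostname)"
```

Going through a temporary file avoids relying on sed's in-place flag, which behaves differently between the BSD sed on OS X and GNU sed.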

The only important change I have made is replacing localhost with Ryans-MacBook.local. That’s about it. Now you should be able to point your browser at http://ryans-macbook.local:50070 to get at your NameNode and http://ryans-macbook.local:50030 for your JobTracker. I am not sure what lives at http://ryans-macbook.local:50075. Be sure to substitute your own hostname. You can probably just use localhost in the browser (but not in the configuration files!?). It is not clear to me why localhost does not work in the configuration files when it is fine in the browser. A quick edit of /etc/hosts didn’t seem to make any difference. Furthermore, it is still unclear to me why I must use those ports rather than 9000 and 9001 for the NameNode and JobTracker, as those are the ports I specified in my site configuration.

Apparently I am missing something. If you know, please do enlighten me.
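
One possible explanation, offered as an assumption rather than something verified on this setup: as far as I know, 9000 and 9001 are the RPC ports that the daemons and clients talk over (they come from fs.default.name and mapred.job.tracker), while the browser pages are served by separate embedded web servers. In the stock 0.18 defaults those web UIs should sit on 50070 (NameNode), 50030 (JobTracker) and 50075 (DataNode), which would also account for the mystery port. The sketch below probes those pages with curl; ui_port and probe_ui are hypothetical helper names, and the port numbers are the assumed defaults, so check them against your own hadoop-default.xml.

```shell
#!/bin/sh
# Assumed default web UI ports from the stock 0.18 hadoop-default.xml.
# ui_port and probe_ui are hypothetical helpers, not part of Hadoop.
ui_port() {
  case $1 in
    namenode)   echo 50070 ;;  # assumed property: dfs.http.address
    jobtracker) echo 50030 ;;  # assumed property: mapred.job.tracker.http.address
    datanode)   echo 50075 ;;  # assumed property: dfs.datanode.http.address
    *)          return 1 ;;
  esac
}

# Print the HTTP status code returned by one daemon's web UI.
probe_ui() {
  host=$1
  daemon=$2
  port=$(ui_port "$daemon") || { echo "unknown daemon: $daemon" >&2; return 1; }
  # -s silences progress output, -o /dev/null discards the page body,
  # -w prints only the status code, --max-time bounds the wait.
  curl -s -o /dev/null --max-time 5 -w '%{http_code}\n' "http://$host:$port/"
}

# Example usage once the daemons are running:
#   probe_ui "$(hostname)" namenode
#   probe_ui "$(hostname)" jobtracker
#   probe_ui "$(hostname)" datanode
```

If this reading is right, pointing a browser at port 9000 or 9001 cannot work, because those ports speak Hadoop's RPC protocol rather than HTTP.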


