Running Cloudera’s Hadoop Distribution on OS X

Posted: November 1st, 2009 | Author: ryan | Filed under: Hadoop, Programming, Tutorial

Lately, I have been spending a lot of time working with Hadoop (or should I say, trying to work with Hadoop). After bouncing between versions 0.18 through 0.20, and between the canonical Apache Hadoop and the patched and improved Cloudera Hadoop Distribution, I’m settling on Cloudera 0.18. Specifically, for the purposes of this article I will be using Cloudera version 0.18.3+76. You may find this, as well as additional releases, at http://archive.cloudera.com/cdh/.

Let me start by pointing you toward the pages that helped get me to where I am.

You’ll notice that Apache’s Hadoop documentation doesn’t (at time of writing) cover 0.20. Hence, I started working with 0.18. What is missing are some details about the configuration files. If you are not aware, 0.20 split conf/hadoop-site.xml into conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml. The default configuration is stored in src/core/core-default.xml, src/hdfs/hdfs-default.xml and src/mapred/mapred-default.xml. Although these files should provide all the clues I needed to properly configure Hadoop 0.20, I didn’t want to pore over them to weed out only the values that needed changing. I turned to Google and found that many people were having the same issues as I was. I won’t explain the problems in detail now.
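
As a rough illustration of that split, here is a minimal Python sketch that sorts the property names from a 0.18-style hadoop-site.xml into the three 0.20 file names. The prefix rule (dfs.* goes to hdfs-site.xml, mapred.* goes to mapred-site.xml, everything else goes to core-site.xml) is only a rule of thumb I am assuming here, not something taken from the Hadoop documentation, and split_site_config is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# A tiny 0.18-style hadoop-site.xml used only for this illustration.
SAMPLE = """<configuration>
  <property><name>fs.default.name</name><value>hdfs://example.local:9000</value></property>
  <property><name>mapred.job.tracker</name><value>example.local:9001</value></property>
  <property><name>dfs.replication</name><value>1</value></property>
</configuration>"""


def target_file(name):
    """Guess the 0.20 file for a property name (assumed prefix rule)."""
    if name.startswith("dfs."):
        return "hdfs-site.xml"
    if name.startswith("mapred."):
        return "mapred-site.xml"
    return "core-site.xml"


def split_site_config(xml_text):
    """Group property names from one site file by their guessed 0.20 file."""
    groups = {"core-site.xml": [], "hdfs-site.xml": [], "mapred-site.xml": []}
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        name = prop.findtext("name", default="").strip()
        groups[target_file(name)].append(name)
    return groups


if __name__ == "__main__":
    for filename, names in split_site_config(SAMPLE).items():
        print(filename, names)
```

Anything the prefix rule gets wrong would still need to be checked by hand against the *-default.xml files.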

Back to getting things working with 0.18.

Some tutorials out there begin with creating a dedicated hadoop user. I did not do that. Rather, I simply stashed the tarball in my working directory and changed some configuration values appropriately. I shall call the directory containing the extracted hadoop-* directory HADOOP_DIR.

Here are the steps that should get you up and running (hopefully, and without pain).

  1. cd HADOOP_DIR
  2. Get your hostname by opening Terminal.app and running hostname
  3. Open conf/hadoop-site.xml in your favorite editor and paste the following, replacing Ryans-MacBook.local with your own hostname and HADOOP_DIR with the directory where you untarred the tarball

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 
<!-- Put site-specific property overrides in this file. -->
 
<configuration>
 
<property>
  <name>hadoop.tmp.dir</name>
  <value>HADOOP_DIR/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>
 
<property>
  <name>fs.default.name</name>
  <value>hdfs://Ryans-MacBook.local:9000</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
 
<property>
  <name>mapred.job.tracker</name>
  <value>Ryans-MacBook.local:9001</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
 
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of tasks that will be run simultaneously by
  a task tracker.
  </description>
</property>
 
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
 
</configuration>
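
If you would rather not do the search-and-replace by hand, the substitution from step 3 can be sketched in a few lines of Python. This is only a sketch under my own assumptions: the @HOSTNAME@ and @HADOOP_DIR@ placeholders and the render_site_config helper are made up for illustration, and the template carries only three of the properties above to keep it short.

```python
import socket

# Shortened template; @HOSTNAME@ and @HADOOP_DIR@ are illustrative placeholders.
TEMPLATE = """<?xml version="1.0"?>
<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>@HADOOP_DIR@/hadoop-${user.name}</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://@HOSTNAME@:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>@HOSTNAME@:9001</value>
</property>
</configuration>
"""


def render_site_config(hadoop_dir, hostname=None):
    """Fill in the template; defaults to this machine's hostname."""
    if hostname is None:
        hostname = socket.gethostname()
    text = TEMPLATE.replace("@HADOOP_DIR@", hadoop_dir)
    return text.replace("@HOSTNAME@", hostname)


if __name__ == "__main__":
    # Print the result for review instead of overwriting conf/hadoop-site.xml.
    print(render_site_config("/path/to/HADOOP_DIR"))
```

Plain string replacement is used on purpose so that Hadoop's own ${user.name} variable passes through untouched.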

The only important change I have made is replacing localhost with Ryans-MacBook.local. That’s about it. Now you should be able to point your browser at http://ryans-macbook.local:50070 to get at your NameNode and http://ryans-macbook.local:50030 for your JobTracker. I am not sure what lives at http://ryans-macbook.local:50075. Be sure to substitute your own hostname. You can probably just use localhost in the browser (but not in the configuration files!?). It is not clear to me why I cannot use localhost in the configuration files when it works in the browser. A quick edit of /etc/hosts didn’t seem to make any difference. Furthermore, it is still unclear to me why I must use those ports rather than 9000 and 9001 for the NameNode and JobTracker, as those are the ports I specified in my site configuration.

Apparently I am missing something. If you know, please do enlighten me.
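
In the meantime, a small probe like the following can at least show which of the ports mentioned above are accepting connections on your machine. It is a minimal sketch: the port list simply repeats the numbers from this post, and is_listening is a helper name I made up for illustration.

```python
import socket

# Ports mentioned in this post; the labels repeat what the text says about them.
PORTS = {
    9000: "fs.default.name (site configuration)",
    9001: "mapred.job.tracker (site configuration)",
    50070: "NameNode page in the browser",
    50030: "JobTracker page in the browser",
    50075: "unknown to me",
}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    host = socket.gethostname()
    for port, label in sorted(PORTS.items()):
        state = "open" if is_listening(host, port) else "closed"
        print(f"{host}:{port} {state} ({label})")
```

Running it once against your hostname and once against localhost should show whether the two names really behave differently.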


Set Up Email Delivery for Google Apps using Webhostingpad and cPanel

Posted: June 26th, 2009 | Author: ryan | Filed under: Tutorial


I’ve been using the free version of Google Apps for a few domains for about two years now. One of those domains has been using Google Page Creator to host a small website, and Gmail, for the past year and a half. Google isn’t offering new sign-ups for Page Creator, in favor of Sites. As far as I can tell, this means that all new Apps accounts will be using Sites. I’ve not been able to use Sites as a simple web host like Page Creator. I’m not sure, but I don’t think it’s designed to work like that. Since I am tired of maintaining my own servers and paying the power bills, this means paying for a real hosting provider.

Recently I purchased the ryanbalfanz.net domain name and began searching for a hosting provider. I didn’t want to pay too much and wasn’t looking for anything fancy. I found my current host after many hours of shopping around. Webhostingpad offers the features I need at a steal (if pre-paid for three years). On top of an already great deal, I had a $25 promotional code. My bill for three years was under $50 for unlimited space and bandwidth! After about two weeks of service, I am pretty happy. They do offer a 30-day money-back guarantee, but I don’t think I’ll be giving them a call to claim a refund. The call, by the way, would be a local call for me (they are based in Rolling Meadows, just outside of Chicago).

Webhostingpad uses cPanel, a control panel for administering your server. It’s not difficult to use if you’ve maintained a server before (web, mail, FTP and so on) and are familiar with things like DNS and Apache. I’ve got the website up and running just fine. What I forgot to do was update my MX records. I wonder how much email I never got because it was returned to sender?

Google Apps Admin Help does give instructions for doing this with various hosts. However, it doesn’t give specific instructions for Webhostingpad. The key thing to notice in cPanel -> Mail -> MX Entries is how priorities are handled (0-10, where 0 is the highest priority). Google’s general MX record instructions say to use the following:

    Priority Mail Server
    1 ASPMX.L.GOOGLE.COM.
    5 ALT1.ASPMX.L.GOOGLE.COM.
    5 ALT2.ASPMX.L.GOOGLE.COM.
    10 ASPMX2.GOOGLEMAIL.COM.
    10 ASPMX3.GOOGLEMAIL.COM.

which doesn’t quite work because of the duplicate priorities: adding a record with a priority that is already in use just changes the mail server of the existing record. Now, I already know that for GoDaddy the priorities are 10, 20, 30, 40 and 50. Since cPanel expects an integer from 0 to 10, I used 0, 1, 2, 3 and 4 as the priorities, as in:

    Priority Mail Server
    0 ASPMX.L.GOOGLE.COM.
    1 ALT1.ASPMX.L.GOOGLE.COM.
    2 ALT2.ASPMX.L.GOOGLE.COM.
    3 ASPMX2.GOOGLEMAIL.COM.
    4 ASPMX3.GOOGLEMAIL.COM.

That’s it, pretty simple. Don’t forget to delete any existing records, especially the domain default which claims the highest priority (0).
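
The renumbering I did by hand can also be written as a short Python sketch, in case you have a different list of servers. It assumes the 0-10 range described above, and unique_priorities is just an illustrative name; it keeps the servers in their original order of preference and hands out 0, 1, 2 and so on.

```python
# Google's suggested records as (priority, mail server) pairs, copied from above.
GOOGLE_MX = [
    (1, "ASPMX.L.GOOGLE.COM."),
    (5, "ALT1.ASPMX.L.GOOGLE.COM."),
    (5, "ALT2.ASPMX.L.GOOGLE.COM."),
    (10, "ASPMX2.GOOGLEMAIL.COM."),
    (10, "ASPMX3.GOOGLEMAIL.COM."),
]


def unique_priorities(records, max_priority=10):
    """Renumber records 0, 1, 2, ... while keeping their relative order.

    max_priority=10 reflects the 0-10 range cPanel showed me; adjust it
    if your panel differs.
    """
    if len(records) > max_priority + 1:
        raise ValueError("more records than available priorities")
    # sorted() is stable, so records sharing a priority keep their listed order.
    ordered = sorted(records, key=lambda record: record[0])
    return [(index, server) for index, (_, server) in enumerate(ordered)]


if __name__ == "__main__":
    for priority, server in unique_priorities(GOOGLE_MX):
        print(priority, server)
```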