26
Jan 10

Now Writing For EdwardsvilleScene.com

Not too long ago I wrote my first article for Edwardsville Scene about one of my favorite bands.

Check it out: Japanese Bat Bomb is back, for now



26
Jan 10

Conference Talks and Poster Sessions

I think I might have more draft posts than published ones. That’s kind of sad. I’m going to try and ease the pain, if only just slightly, right here right now.

I had started writing a post while attending Big LAMP Camp in Nashville, TN back in early November 2009. The conference and camp were a lot of fun and well attended, and I met a lot of great people. The list of speakers just goes to show how the traditional LAMP stack has evolved to include languages besides PHP (Ruby and RoR had a strong showing), although I was the lonely Python developer. My talk was about how Galaxy Zoo is using Python and Hadoop to process large astronomical data sets.

More recently I traveled to Washington D.C. with other members of the Galaxy Zoo team to the 215th meeting of the American Astronomical Society. There I presented a poster on the same topic.



03
Nov 09

My Hadoop Gotchas

Hadoop and I have had our ups and downs lately. I have been accumulating personal notes about certain issues I’ve had and their solutions. This is a record of the things that have bitten me.



If your streaming job has a lot of dependencies that you want to ship along with it, you might find it useful to bundle them into a single archive as well (a jar, or a gzipped tarball as in the script below) and throw a -file my_deps.tgz at hadoop. The gotcha is this: hadoop won’t automatically unpack the archive for you. To get around this, simply wrap your mapper/reducer with another script that unpacks the archive and then executes your original mapper/reducer.

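#!/bin/bash
# mapper_wrapper.sh: unpack the shipped dependencies, then run the real mapper.

# No -v here: a file listing printed to stdout could end up in the map output.
# (For a jar instead of a tarball, `jar xf my_deps.jar` does the same job.)
tar -xzf my_deps.tgz

# You've probably already set +x on mapper.py, since Hadoop required it
# previously, but invoking python explicitly no longer depends on it.
# Just don't forget to chmod +x mapper_wrapper.sh.
python ./mapper.py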
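If you would rather keep the wrapper in Python, the same idea can be sketched roughly as follows. This is only an illustration under a few assumptions: the names unpack_deps and run_wrapped are made up for this example, the archive is one you built yourself (so extracting it blindly is acceptable), and the real mapper reads stdin and writes stdout as usual.

```python
import subprocess
import sys
import tarfile
import zipfile


def unpack_deps(archive_path, dest_dir="."):
    """Unpack a jar/zip or a (gzipped) tar archive into dest_dir."""
    if zipfile.is_zipfile(archive_path):
        # A jar is a zip file with a manifest, so zipfile can read it.
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(dest_dir)
    elif tarfile.is_tarfile(archive_path):
        # tarfile.open detects gzip compression on its own.
        with tarfile.open(archive_path) as archive:
            archive.extractall(dest_dir)
    else:
        raise ValueError("unsupported archive: %s" % archive_path)


def run_wrapped(archive_path, mapper):
    """Unpack the dependencies, then hand stdin/stdout to the real mapper."""
    unpack_deps(archive_path)
    # The child inherits this process's stdin and stdout, so records
    # flow straight through to and from the original mapper.
    return subprocess.call([sys.executable, mapper])


# In a hypothetical mapper_wrapper.py you would finish with:
#     sys.exit(run_wrapped("my_deps.tgz", "mapper.py"))
```

Nothing here prints to stdout on its own, which matters because everything the wrapper writes there is treated as mapper output.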



I was having a lot of issues with HDFS. I couldn’t run a bin/hadoop dfs -mkdir, -put, or -copyFromLocal without all kinds of connection issues or cryptic Java errors.

After copious amounts of frustration I finally seemed to fix the issue, or at least find a work-around.

Warning: the following commands will destroy your data in HDFS.

> bin/stop-dfs.sh
> rm -rf hadoop-ryan/*
> bin/hadoop namenode -format
> bin/start-dfs.sh

Note that hadoop-ryan/ is my hadoop.tmp.dir (${user.name} expands to ryan):

<property>
<name>hadoop.tmp.dir</name>
<value>/Users/ryan/Dev/hadoop/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
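Since the rm -rf above is destructive, it may be worth double-checking which directory hadoop.tmp.dir actually resolves to before wiping it. Here is a minimal sketch, assuming the property lives in a site file shaped like the fragment above; read_site_property and expand_user are names invented for this example, and only the ${user.name} placeholder is handled.

```python
import getpass
import xml.etree.ElementTree as ET


def read_site_property(xml_text, name):
    """Return the <value> of the named <property>, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def expand_user(value, user=None):
    """Substitute ${user.name}; defaults to the current login name."""
    return value.replace("${user.name}", user or getpass.getuser())


# Sample input mirroring the property shown above.
SITE_XML = """<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/Users/ryan/Dev/hadoop/hadoop-${user.name}</value>
</property>
</configuration>"""

tmp_dir = expand_user(read_site_property(SITE_XML, "hadoop.tmp.dir"), user="ryan")
print(tmp_dir)  # /Users/ryan/Dev/hadoop/hadoop-ryan
```

In practice you would read the text from conf/hadoop-site.xml instead of the inline sample and leave user unset.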



01
Nov 09

Running Cloudera’s Hadoop Distribution on OS X

Lately, I have been spending a lot of time working with Hadoop (or should I say, trying to work with Hadoop). After bouncing between various versions, ranging from 0.18 to 0.20, and between the canonical Apache Hadoop and the patched and improved Cloudera Hadoop Distribution, I’m settling on Cloudera 0.18. Specifically, for the purposes of this article I will be using Cloudera version 0.18.3+76. You may find this, as well as additional releases, at http://archive.cloudera.com/cdh/.

Let me start by pointing you toward the pages that helped get me to where I am.

You’ll notice that Apache’s Hadoop documentation doesn’t (at time of writing) cover 0.20. Hence, I started working with 0.18. What is missing are some details about the configuration files. If you are not aware, 0.20 split conf/hadoop-site.xml into conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml. The default configuration is stored in src/core/core-default.xml, src/hdfs/hdfs-default.xml and src/mapred/mapred-default.xml. Although these files should provide all the clues I needed to properly configure Hadoop 0.20, I didn’t want to pore over them to weed out only the values that needed to be changed. I turned to Google and found that many people were having the same issues as I was. I won’t explain the problems in detail now.

Back to getting things working with 0.18.

Some tutorials out there begin with creating a dedicated hadoop user. I did not do that. Rather, I simply stashed the tarball into my working directory and changed some configuration values appropriately. I shall call the directory containing the extracted hadoop-* directory HADOOP_DIR.

Here are the steps that should get you up and running (hopefully, and without pain).

  1. cd HADOOP_DIR
  2. Get your hostname by opening Terminal.app and running hostname
  3. Open conf/hadoop-site.xml in your favorite editor and paste the following, replacing Ryans-MacBook.local with your own hostname and HADOOP_DIR with the directory where you untarred the tarball

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 
<!-- Put site-specific property overrides in this file. -->
 
<configuration>
 
<property>
  <name>hadoop.tmp.dir</name>
  <value>HADOOP_DIR/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>
 
<property>
  <name>fs.default.name</name>
  <value>hdfs://Ryans-MacBook.local:9000</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
 
<property>
  <name>mapred.job.tracker</name>
  <value>Ryans-MacBook.local:9001</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
 
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of tasks that will be run simultaneously by
  a task tracker.
  </description>
</property>
 
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
 
</configuration>
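If you would rather not hand-edit the hostname and directory, steps 2 and 3 could be scripted along these lines. Treat it as a rough sketch: render_site_config is a name invented here, the template keeps only the properties that carry the hostname or HADOOP_DIR plus dfs.replication (descriptions and the task maximum are left out for brevity), and socket.gethostname() is assumed to return the same name the hostname command prints.

```python
import socket
from xml.sax.saxutils import escape

# Doubled braces survive str.format as literal braces, so the rendered
# file still contains ${user.name} for Hadoop to expand.
SITE_TEMPLATE = """<?xml version="1.0"?>
<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>{hadoop_dir}/hadoop-${{user.name}}</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://{host}:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>{host}:9001</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
</configuration>
"""


def render_site_config(hadoop_dir, host=None):
    """Fill in the template for this machine (or an explicit host)."""
    host = host or socket.gethostname()
    return SITE_TEMPLATE.format(hadoop_dir=escape(hadoop_dir), host=escape(host))


print(render_site_config("/Users/ryan/Dev/hadoop", host="Ryans-MacBook.local"))
```

Writing the result to conf/hadoop-site.xml is left to you, so the sketch never overwrites an existing configuration.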

The important change I have made is replacing localhost with Ryans-MacBook.local. That’s about it. Now you should be able to point your browser at http://ryans-macbook.local:50070 to get at your NameNode and http://ryans-macbook.local:50030 for your JobTracker. I am not sure what lives at http://ryans-macbook.local:50075. Be sure to substitute your own hostname. You can probably just use localhost in the browser (but not in the configuration files!?). It is not clear to me why I cannot use localhost in the configuration files when it is okay in the browser. A quick edit of /etc/hosts didn’t seem to make any difference. Furthermore, it is still unclear why I must use those ports rather than 9000 and 9001 for the NameNode and JobTracker, as those are the ports I specified in my site configuration.

Apparently I am missing something. If you know, please do enlighten me.
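One likely explanation, offered as an assumption rather than something verified against this setup: 9000 and 9001 are the RPC endpoints that Hadoop clients and daemons talk to, while 50070, 50030 and 50075 are the default HTTP ports for the NameNode, JobTracker and DataNode web interfaces in Hadoop releases of this era. A quick way to see which of them are actually listening is a small probe like the one below; the port table and the names is_listening and report are illustrative only, so check them against your own configuration.

```python
import socket

# Assumed defaults for Hadoop releases of this era; your config may differ.
ASSUMED_PORTS = {
    "NameNode RPC (fs.default.name)": 9000,
    "JobTracker RPC (mapred.job.tracker)": 9001,
    "NameNode web UI": 50070,
    "JobTracker web UI": 50030,
    "DataNode web UI": 50075,
}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(host):
    """Map each assumed service label to whether its port accepts connections."""
    return {label: is_listening(host, port) for label, port in ASSUMED_PORTS.items()}


# Example: print(report(socket.gethostname()))
```

If the RPC ports show up as listening but a browser pointed at them returns nothing useful, that would be consistent with them not speaking HTTP at all.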



18
Oct 09

Spam Everywhere

I’ve not been writing very much lately so I haven’t been maintaining comments either. I just marked over 50 comments as spam!

What is the best way to handle all the spam!?



07
Aug 09

Visit to UC Berkeley

I am writing this post from my temporary desk in the RAD Lab at UC Berkeley in the Computer Science Division. I’ll be spending some time here learning about Hadoop, a framework for distributed computing. Luckily for me, my long-time friend Michael Armbrust is a fourth-year CS PhD student here working on distributed applications. There are a lot of people here doing research in large-scale distributed and parallel computing, including Michael.

Since third grade, Michael and I have spent countless hours in front of our computers. We wrote little applications in QBasic then graduated to Visual Basic, all while running a small business venture in the background. Our largest project was called ‘System Assistant’. The System Assistant was a plugin-based application that lived in the system tray. Its main function was to load DLLs of individual self-contained helper applications. I can’t even remember how many plugins we wrote for it, but it was pretty awesome.

After Michael picked me up from the North Berkeley BART station we went to the CS graduate student social hour, which included wine and cheese, bread and even smoked salmon. After that we sat down for a bit to talk about setting some goals for my trip before going out on the town.

We were finally able to sit down in front of a keyboard, like old times, yesterday after dragging ourselves out of bed for a 9am conference call. Eventually coding ensued, and about 14 hours later, still sitting in the same conference room we had been in since that morning, we had managed to encapsulate my existing Python code so it could run using Hadoop. Around 11pm that night we ran a Hadoop job against 10,000 PDS files totaling ~30GB. Processing took just barely over 15 minutes. The bottleneck turned out to be loading the data into HDFS, which itself took about an hour.

I’ve got a really good foundation now and can spend next week tweaking and polishing everything.



07
Aug 09

Visiting The Adler Planetarium

I met Lucy Fortson, Vice President for Research at Adler, while we were both attending the 214th meeting of the American Astronomical Society in Pasadena, CA back in June. Some of the Galaxy Zoo team members went out for a few drinks one night and we got to talking about how wonderful Chicago is (a fact I was reminded of as soon as I took my first step outside of Union Station). Lucy and I got to talking about Galaxy Zoo and our roles in the collaboration. One thing led to another and before I knew it I was planning a trip to visit Adler and other people involved in the project.

I spent most of my day yesterday, July 20, doing two things: traveling and visiting The Adler Planetarium. The train ride from the St. Louis area is five hours long. I spent most of those five hours working on projects for my part in the Galaxy Zoo collaboration. Eventually, at 10:00 am + 20-something minutes of lateness, I arrived at Union Station in the heart of downtown Chicago and was off to the planetarium.

My original plan for the week was to meet with Nancy Dribins on Monday, Doug Roberts sometime over Thursday or Friday and meet with Lucy when time allowed. As it turns out we all met that Monday, which also happened to be the 40th anniversary of the Apollo 11 landing. Needless to say, there was a lot of excitement in the air that day, and cake.

The Apollo 11 Lunar Module Cake


I also played with the new ‘Moon Wall’ that Adler recently debuted. The Moon Wall is basically a giant display for high-res lunar images. The display simulates flying around the Moon and has a basic joystick control. It was pretty neat.

The Moon Wall at The Adler Planetarium


I also managed to find some time to walk around and see a couple shows. It had been several years since my last visit to Adler, so I had a lot of fun and couldn’t have visited on a better day. Here are a few other pictures I took.

Chicago Skyline with 'Man Enters the Cosmos' in the foreground





17
Jul 09

LRO Spots Apollo Landing Sites

NASA just released images of several Apollo mission landing sites. The images, taken by the Lunar Reconnaissance Orbiter Camera (LROC), are the first in a set of imagery expected to be released.

Credit: NASA


For additional images check out the LROC Image Browser.



17
Jul 09

A Python Generator to Generate Sublists of Increasing Length

Here’s a nifty generator to take a list, l, of length n and get each of its n sublists, l[0:i], where i ranges from 1 to n.

def gen_sub_lists(l):
    """Generate sublists of a list.

    Sublists of [0,1,2,...,n] are
    [0], [0,1], [0,1,2], . . . , [0,1,2,...,n].
    """
    # Slice ends run from 1 through len(l) inclusive.
    for end in range(1, len(l) + 1):
        yield l[:end]
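The generator above relies on len() and slicing, so it needs a real list. As a variation (not the original approach), here is a sketch that produces the same prefixes from any iterable, including another generator; iter_prefixes is a name made up for this example.

```python
def iter_prefixes(iterable):
    """Yield growing prefixes of any iterable as new lists."""
    prefix = []
    for item in iterable:
        prefix.append(item)
        # Copy so callers can keep each prefix without it changing later.
        yield list(prefix)


print(list(iter_prefixes(range(3))))  # [[0], [0, 1], [0, 1, 2]]
```

The trade-off is a fresh copy per step instead of a slice, which costs about the same but works without knowing the length up front.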



16
Jul 09

Skype + Google = Free Calls

If you didn’t know, you can make calls using Skype to 800 numbers for free.

Also, if you didn’t know, Google offers a free directory service called Goog-411. By calling 1-800-GOOG-411 and speaking the city, state and business name, you can have Google connect you for free.

Do you see where I am going with this?

Just call 1-800-GOOG-411 using Skype, tell Google what you want and you’re connected.

I find this especially useful when I am working in my office in the basement at school because cell reception is terrible.


