Python Gotcha, Nested Lambda to the Rescue

*Note: This has been sitting around as a draft for quite a while. I’m finally posting it because it came up recently; luckily, this time I was prepared.

This was brought to my attention by Jason Yan of Disqus.

Consider the following code fragment. What is its output?

num_funcs = []
for i in xrange(0, 101):
	num_funcs.append(lambda x: x + i)
print num_funcs[10](20)

If you guessed 30, look again and consider the value of i at the time of the print statement. Its value will be 100.
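A quick check in the interpreter, run right after the loop above, makes this concrete; every element of num_funcs sees the same final value of i:

print num_funcs[0](20)   # 120, not 20
print num_funcs[50](20)  # 120, not 70
print i                  # 100; the loop variable is still bound after the loop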

What is num_funcs[10]? Clearly it is a lambda function. More importantly, it is a lambda function whose body has not yet been evaluated. Now consider again the value of i. What is the output? Hopefully you see that it is not 30, but rather 120. If you’re still unsure why that is, let’s code the loop in such a way that it produces the expected output. Namely, we would like num_funcs[n](20) to evaluate to n + 20.

The trick here is to force i to be bound to the iteration value at the time of the append, rather than letting the name persist as a free variable until the print statement. To do this we simply wrap the original anonymous function in another lambda and call it immediately, establishing a deeper scope and hence forcing the current value of i to be captured at the time of the append, like so.

num_funcs = []
for i in xrange(0, 101):
	func = (lambda j: (lambda x: x + j))(i)
	num_funcs.append(func)
print num_funcs[10](20)

Note that the argument to the outer lambda is called j, and we are passing in i; the inner lambda now adds j, which was bound to the value of i at the time of the append.
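Another common idiom that accomplishes the same thing is to use a default argument; default values are evaluated once, when the lambda is defined, so each function gets its own copy of i:

num_funcs = []
for i in xrange(0, 101):
	# i=i captures the current value of i as this lambda's default
	num_funcs.append(lambda x, i=i: x + i)
print num_funcs[10](20)  # 30, as expected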

This is a little tricky, because it is so subtle. Don’t get burned.

The Case for Babar: A Tool for Creating Hadoop Sequence Files

Mark Sands and I have been tooling around lately with Sequence Files in Hadoop jobs. In our case, a Sequence File contains many files which are to be processed as a single unit in a streaming job.

Trouble started creeping up during the creation of the sequence file once the input files became quite large. As you might guess, not only are they large, but we have a lot of them. We are using Hadoop, after all. Stuart Sierra’s Tar-to-Seq utility had been working quite nicely until this new input set arrived, since our testing set was composed of much smaller files.

On some machines, the heap size had to be increased just to pack a single-image input tar into a sequence file. Once that problem was solved, it was still glaringly obvious that we needed to find a better way.

And Babar was born.

With a heaping spoonful of help from Michael Armbrust, we were able to fight through the Java, and the first version was pushed to GitHub. Babar is still being actively developed, along with a few nifty features I won’t mention here. If you want to know, ask Mark.

Babar takes a list of URLs and packs them into a sequence file. Then we can process each file in the existing streaming job just as before.

Tar-to-Seq is still useful, but Babar harnesses MapReduce to grab each file (and you get automatic retries and the like for free). It also uses SequenceFileOutputFormat natively, so you don’t have to worry about the details of writing a sequence file yourself; Hadoop does that for free too.
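To give a feel for the approach (this is only a rough sketch in streaming terms, not Babar’s actual code, and the names are made up), a mapper that reads one URL per line, fetches it, and emits its contents could look something like the following; pointing the job at SequenceFileOutputFormat then leaves the sequence-file bookkeeping to Hadoop.

#!/usr/bin/env python
# Rough sketch only -- not Babar's actual implementation.
# Reads one URL per line from stdin, fetches it, and emits
# url <TAB> base64(contents) so the bytes survive the text protocol.
import base64
import sys
import urllib2

for line in sys.stdin:
    url = line.strip()
    if not url:
        continue
    data = urllib2.urlopen(url).read()
    print "%s\t%s" % (url, base64.b64encode(data))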

Grab it from GitHub, or from my fork, also on GitHub.

Starting out with Django

I had started reading The Django Book a while back but couldn’t get into it. I had been around a lot of Rails folks recently, mostly The Collective, and succumbed to geek pressure to choose Rails over Django. Being a Python guy myself, I found it a hard sell. I never did anything especially cool with Rails until recently (more on that to come sometime).

Less than a week ago I bought The Definitive Guide to Django: Web Development Done Right. I prefer to read a hardcopy; for me hardcopy > HTML > PDF (in most cases).

I’m about halfway in and have noticed a handful of issues. Apparently, I’m not the only one. Notkeepingitreal.com has covered the issues in the first few chapters.

Conference Talks and Poster Sessions

I think I might have more draft posts than published ones. That’s kind of sad. I’m going to try and ease the pain, if only just slightly, right here right now.

I had started writing a post while I attended Big LAMP Camp in Nashville, TN back in early November 2009. The conference and camp were a lot of fun and well attended, and I met a lot of great people. The list of speakers just goes to show how the traditional LAMP stack has evolved to include languages besides PHP (Ruby and RoR had a strong showing), although I was the lonely Python developer. My talk was about how Galaxy Zoo is using Python and Hadoop to process large astronomical data sets.

More recently I traveled to Washington D.C. with other members of the Galaxy Zoo team to the 215th meeting of the American Astronomical Society. There I presented a poster on the same topic.

My Hadoop Gotchas

Hadoop and I have had our ups and downs lately. I have been accumulating personal notes about certain issues I’ve had and their solutions. This is a record of the things that have bitten me.



If your streaming job has a lot of dependencies that you want to ship along with it, you might find it useful to pack them into a single archive and throw a -file my_deps.jar (or -file my_deps.tgz) at hadoop. The gotcha is this: hadoop won’t automatically unpack it for you. To get around this, simply wrap your mapper/reducer with another script that unpacks the archive and then executes your original mapper/reducer.

#!/bin/bash
# mapper_wrapper.sh: unpack the shipped dependency archive, then run the real mapper.

tar -xzvf my_deps.tgz
python ./mapper.py
# You've probably already made mapper.py executable (chmod +x),
# since Hadoop required that previously,
# but now we can be more explicit and call python directly.
# Just don't forget to chmod +x mapper_wrapper.sh itself.



I was having a lot of issues with HDFS. I couldn’t issue a bin/hadoop dfs -mkdir, -put, or -copyFromLocal without hitting all kinds of connection problems or cryptic Java errors.

After copious amounts of frustration, I finally seemed to fix the issue, or at least found a workaround.

Warning: the following commands will destroy your data in HDFS.

> bin/stop-dfs.sh
> rm -rf hadoop-ryan/*
> bin/hadoop namenode -format
> bin/start-dfs.sh

Note that hadoop-ryan/* is my hadoop.tmp.dir.

<property>
  <name>hadoop.tmp.dir</name>
  <value>/Users/ryan/Dev/hadoop/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

Running Cloudera’s Hadoop Distribution on OS X

Lately, I have been spending a lot of time working with Hadoop (or should I say, trying to work with Hadoop). After bouncing between various versions, ranging from 0.18 to 0.20, and between the canonical Apache Hadoop and the patched and improved Cloudera Hadoop Distribution, I’m settling on Cloudera 0.18. Specifically, for the purposes of this article I will be using Cloudera version 0.18.3+76. You may find this release, as well as others, at http://archive.cloudera.com/cdh/.

Let me start by pointing you toward the pages that helped get me to where I am.

You’ll notice that Apache’s Hadoop documentation doesn’t (at the time of writing) cover 0.20; hence, I started working with 0.18. What is missing are some details about the configuration files. If you are not aware, 0.20 split conf/hadoop-site.xml into conf/core-site.xml, conf/hdfs-site.xml, and conf/mapred-site.xml. The default configuration is stored in src/core/core-default.xml, src/hdfs/hdfs-default.xml, and src/mapred/mapred-default.xml. Although these files should provide all the clues I needed to properly configure Hadoop 0.20, I didn’t want to pore over them trying to weed out only the values that needed to be changed. I turned to Google and found that many people were having the same issues as I was. I won’t explain those problems in detail now.

Back to getting things working with 0.18.

Some tutorials out there begin with creating a dedicated hadoop user. I did not do that. Rather, I simply stashed the tarball in my working directory and changed some configuration values appropriately. I shall call the directory containing the extracted hadoop-* directory HADOOP_DIR.

Here are the steps that should get you up and running (hopefully, and without pain).

  1. cd HADOOP_DIR
  2. Get your hostname by opening Terminal.app and running hostname
  3. open conf/hadoop-site.xml in your favorite editor and paste the following, replacing Ryans-MacBook.local with your own hostname and HADOOP_DIR with the directory where you untarred the tarball:

<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>HADOOP_DIR/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://Ryans-MacBook.local:9000</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>Ryans-MacBook.local:9001</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.</description>
</property>

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of tasks that will be run simultaneously by
  a task tracker.</description>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.</description>
</property>

</configuration>


The important change I have made is replacing localhost with Ryans-MacBook.local. That’s about it. Now you should be able to point your browser at http://ryans-macbook.local:50070 to get at your NameNode and http://ryans-macbook.local:50030 for your JobTracker. I am not sure what lives at http://ryans-macbook.local:50075. Be sure to substitute your own hostname. You can probably just use localhost in the browser (but not in the configuration files!?). It is not clear to me why I cannot use localhost in the configuration files, yet it is okay in the browser; a quick edit of /etc/hosts didn’t seem to make any difference. Furthermore, it is still unclear why I must use those ports rather than 9000 and 9001 for the NameNode and JobTracker, as those are the ports I specified in my site configuration.

Apparently I am missing something. If you know, please do enlighten me.

Spam Everywhere

Update: I’m now using the Disqus comment system. They provide a nifty service; check them out.

I’ve not been writing very much lately so I haven’t been maintaining comments either. I just marked over 50 comments as spam!

What is the best way to handle all the spam!?