Posted: July 15th, 2010 | Author: ryan | Filed under: Uncategorized | View Comments
Note: This has been sitting around as a draft for quite a while. I’m finally posting it because it came up recently; luckily, this time I was prepared.
This was brought to my attention by Jason Yan of Disqus.
Consider the following code fragment. What is its output?
num_funcs = []
for i in xrange(0, 101):
    num_funcs.append(lambda x: x + i)
print num_funcs[10](20)
If you guessed 30, look again and consider the value of i during the print statement. Its value will be 100.
What is num_funcs[10]? Clearly it is a lambda function. More importantly, it is a lambda function whose body has not yet been evaluated; i is only looked up when the function is called. Now consider again the value of i. What’s the output? Hopefully you see that it is not 30, but rather 120. If you’re still unsure why that is true, let’s code the loop in a way that produces the expected output. Namely, we would like num_funcs[n](20) to evaluate to n+20.
The trick here is to capture the value of i at the time of the append, rather than letting i persist as a variable until the print statement. To do this we simply wrap the original anonymous function in another and call it right away, thus establishing a deeper scope in which the current value of i is bound to a parameter, like so.
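num_funcs = []
for i in xrange(0, 101):
    # Calling the outer lambda right away binds j to the current value of i.
    func = (lambda j: (lambda x: x + j))(i)
    num_funcs.append(func)
print num_funcs[10](20)
Note that the argument to the outer lambda is called j, but we are passing i. The inner lambda refers to j, so each function keeps its own value.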
This is a little tricky, because it is so subtle. Don’t get burned.
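For comparison, here is a minimal sketch of two other common ways to pin down the loop value: a default argument and functools.partial. The names late, by_default, by_partial and add are made up for this illustration, and the sketch uses range and print() so that it also runs on Python 3.

```python
from functools import partial


def add(n, x):
    """Plain helper so partial() can freeze the first argument."""
    return x + n


late = []        # closures that look up i when they are called
by_default = []  # value of i frozen as a default argument
by_partial = []  # value of i frozen by functools.partial

for i in range(0, 101):
    late.append(lambda x: x + i)
    # Default values are evaluated once, when the lambda is created.
    by_default.append(lambda x, i=i: x + i)
    # partial() stores the current value of i alongside add.
    by_partial.append(partial(add, i))

print(late[10](20))        # 120: i is 100 once the loop has finished
print(by_default[10](20))  # 30
print(by_partial[10](20))  # 30
```

The default-argument form is the shortest, but it adds a second parameter that a caller could override by accident; partial avoids that at the cost of a named helper.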
Posted: February 26th, 2010 | Author: ryan | Filed under: Music | Tags: Edwardsville, Edwardsville Scene, Luster, Music, Stagger Inn | View Comments
For those who may be interested, head over to Edwardsville Scene to read my latest piece on Luster.

Posted: February 22nd, 2010 | Author: ryan | Filed under: Hadoop, Programming | View Comments
Mark Sands and I have been tooling around lately with using Sequence Files in Hadoop jobs. In our case, a Sequence File contains many files which are to be processed as a single unit in a streaming job.
Trouble started creeping up during the creation of the sequence file when the input files were quite large. As you might guess, not only are they large, but we have a lot of them. We are using Hadoop after all. Stuart Sierra’s Tar-to-Seq utility was working quite nicely until this new input set, as the testing set was composed of much smaller files.
On some machines, the heap size had to be increased to even pack a single-image input tar into a sequence file. Once that problem was solved it was still glaringly obvious that we needed to find a better way.
And Babar was born.
With a heaping spoonful of help from Michael Armbrust we were able to fight through the Java and the first version was pushed to GitHub. Babar is still actively being developed, along with a few nifty features I won’t mention here. If you want to know, ask Mark.
Babar takes a list of URLs and packs them into a sequence file. Then we can process each file in the existing streaming job just as before.
Tar-to-Seq is still useful, but Babar harnesses MapReduce to grab each file (and you get automatic retries and the like for free). It also uses SequenceFileOutputFormat natively, so you don’t have to worry about the details of writing a sequence file yourself, since Hadoop does that for free too.
Grab it from GitHub, or from my fork, also on GitHub.
Posted: February 13th, 2010 | Author: ryan | Filed under: Django, Python | Tags: Django, Python, Web Development, Web Framework | View Comments
I had started reading The Django Book a while back but couldn’t get into it. I had been around a lot of Rails folks recently, mostly The Collective, and succumbed to geek pressure to choose Rails over Django. Being a Python guy myself, it was a hard sell. I never did anything especially cool with Rails until recently (more on that to come sometime).
Less than a week ago I bought The Definitive Guide to Django: Web Development Done Right. I prefer to read a hardcopy; for me hardcopy > HTML > PDF (in most cases).
I’m about halfway in and have noticed a handful of issues. Apparently, I’m not the only one. Notkeepingitreal.com has covered the issues in the first few chapters.
Posted: January 26th, 2010 | Author: ryan | Filed under: Music | Tags: edwadsville scene, Music | View Comments
Not too long ago I wrote my first article for Edwardsville Scene about one of my favorite bands.
Check it out: Japanese Bat Bomb is back, for now
Posted: January 26th, 2010 | Author: ryan | Filed under: Hadoop, Python | Tags: AAS, poster, posters, talk, talks | View Comments
I think I might have more draft posts than published ones. That’s kind of sad. I’m going to try and ease the pain, if only just slightly, right here right now.
I had started writing a post while I attended Big LAMP Camp in Nashville, TN back in early November 2009. The conference and camp were a lot of fun and well attended. I met a lot of great people. The list of speakers just goes to show how the traditional LAMP stack has evolved to include languages besides PHP (Ruby and RoR had a strong showing), although I was the lone Python developer. My talk was about how Galaxy Zoo is using Python and Hadoop to process large astronomical data sets.
More recently I traveled to Washington D.C. with other members of the Galaxy Zoo team to the 215th meeting of the American Astronomical Society. There I presented a poster on the same topic.
Posted: November 3rd, 2009 | Author: ryan | Filed under: Hadoop, Programming, Tips and Tricks | View Comments
Hadoop and I have had our ups and downs lately. I have been accumulating personal notes about certain issues I’ve had and their solutions. This is a record of the things that have bitten me.
If your streaming job has a lot of dependencies that you want to ship along in your jar, you might find it useful to jar them as well and throw a -file my_deps.jar at hadoop. The gotcha is this: hadoop won’t automatically unjarify for you. To get around this, simply wrap your mapper/reducer with another script that unpacks the archive and then executes your original mapper/reducer.
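#!/bin/bash
# Unpack the shipped dependencies quietly: with -v, tar may list file names
# on stdout, where they would be mixed into the mapper's output.
tar -xzf my_deps.tgz
python ./mapper.py
# You've probably already escalated the permissions to +x
# since it was required previously by Hadoop,
# but now we can be more explicit.
# Just don't forget to do the same on mapper_wrapper.sh.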
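The same wrapper idea can be sketched in Python, which may be handy if the task nodes are not guaranteed to have tar on the path. This is only an illustration under assumptions: the function names unpack, run and main are hypothetical, and the archive and mapper names are simply the ones used above.

```python
import subprocess
import tarfile


def unpack(archive_path, dest="."):
    """Extract a gzipped tarball into dest; nothing is written to stdout."""
    # Only unpack archives you built yourself; extractall trusts member paths.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)


def run(command):
    """Run the real mapper/reducer; it inherits this process's stdin and stdout."""
    return subprocess.call(command)


def main(archive_path="my_deps.tgz", command=("python", "./mapper.py")):
    """Unpack the dependencies, then hand over to the original script."""
    unpack(archive_path)
    return run(list(command))

# As a wrapper script, end the file with: raise SystemExit(main())
```

Because the child process inherits stdin and stdout, the streaming records pass straight through the wrapper, and its exit code is what Hadoop sees.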
I was having a lot of issues with HDFS. I couldn’t issue a bin/hadoop dfs -mkdir, -put, or -copyFromLocal without all kinds of connection issues or cryptic Java errors.
After copious amounts of frustration I finally seemed to fix the issue, or at least find a work-around.
Warning: the following commands will destroy your data in HDFS.
> bin/stop-dfs.sh
> rm -rf hadoop-ryan/*
> bin/hadoop namenode -format
> bin/start-dfs.sh
Note that hadoop-ryan/* is my hadoop.tmp.dir.
<property>
<name>hadoop.tmp.dir</name>
<value>/Users/ryan/Dev/hadoop/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
Posted: November 1st, 2009 | Author: ryan | Filed under: Hadoop, Programming, Tutorial | View Comments
Lately, I have been spending a lot of time working with Hadoop (or should I say, trying to work with Hadoop). After bouncing between various versions, ranging from 0.18 to 0.20, and between the canonical Apache Hadoop and the patched and improved Cloudera Hadoop Distribution, I’m settling on Cloudera 0.18. Specifically, for the purposes of this article I will be using Cloudera version 0.18.3+76. You may find this, as well as additional releases, at http://archive.cloudera.com/cdh/.
Let me start by pointing you toward the pages that helped get me to where I am.
You’ll notice that Apache’s Hadoop documentation doesn’t (at time of writing) cover 0.20. Hence, I started working with 0.18. What is missing are some details about the configuration files. If you are not aware, 0.20 split conf/hadoop-site.xml into conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml. The default configuration is stored in src/core/core-default.xml, src/hdfs/hdfs-default.xml and src/mapred/mapred-default.xml. Although these files should provide all the clues I needed to properly configure Hadoop 0.20, I didn’t want to pore over them to try and weed out only the values needing to be changed. I turned to Google and found that many people were having the same issues as I was. I won’t explain the problems in detail now.
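As a rough illustration of that split, the sketch below sorts a few 0.18-style property names into the three 0.20 files. The prefix rule and the function name target_file are assumptions made for this example, not anything Hadoop ships; the *-default.xml files remain the authority on where a given property lives.

```python
def target_file(property_name):
    """Guess which 0.20 site file a property belongs in, judging by its prefix."""
    if property_name.startswith("dfs."):
        return "conf/hdfs-site.xml"
    if property_name.startswith("mapred."):
        return "conf/mapred-site.xml"
    # Everything else (fs.*, hadoop.*, ...) is assumed to be a core setting.
    return "conf/core-site.xml"


for name in ("hadoop.tmp.dir", "fs.default.name",
             "mapred.job.tracker", "dfs.replication"):
    print("%s -> %s" % (name, target_file(name)))
```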
Back to getting things working with 0.18.
Some tutorials out there begin with creating a dedicated hadoop user. I did not do that. Rather, I simply stashed the tarball into my working directory and changed some configuration values appropriately. I shall call the directory containing the extracted hadoop-* directory HADOOP_DIR.
Here are the steps that should get you up and running (hopefully, and without pain).
- cd HADOOP_DIR
- Get your hostname by opening Terminal.app and running hostname
- open conf/hadoop-site.xml in your favorite editor and paste the following, replacing Ryans-MacBook.local with your own hostname and HADOOP_DIR with the directory into which you untarred the tarball
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>HADOOP_DIR/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://Ryans-MacBook.local:9000</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>Ryans-MacBook.local:9001</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>mapred.tasktracker.tasks.maximum</name>
<value>8</value>
<description>The maximum number of tasks that will be run simultaneously by
a task tracker
</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
The only important change I have made is replacing localhost with Ryans-MacBook.local. That’s about it. Now you should be able to point your browser at http://ryans-macbook.local:50070 to get at your NameNode and http://ryans-macbook.local:50030 for your JobTracker. I am not sure what lives at http://ryans-macbook.local:50075. Be sure to change your hostname. You can probably just use localhost in the browser (but not in the configuration files!?). It is not clear to me why I cannot use localhost in the configuration files, but it is okay in the browser. A quick edit of /etc/hosts didn’t seem to make any difference. Furthermore, it is still unclear why I must use those ports rather than 9000 and 9001 for the NameNode and JobTracker, as those are the ports I specified in my site configuration.
Apparently I am missing something. If you know, please do enlighten me.
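If you would rather not hand-edit the hostname, the substitution from the steps above can be scripted. The following is a minimal sketch, not a tool from this setup: build_site_config is a hypothetical helper, the descriptions are left out, and /path/to/HADOOP_DIR is a placeholder you would replace. It assumes Python 3.

```python
import socket
import xml.etree.ElementTree as ET


def build_site_config(hostname, hadoop_dir):
    """Return the text of a hadoop-site.xml with the same values as above."""
    properties = [
        ("hadoop.tmp.dir", hadoop_dir + "/hadoop-${user.name}"),
        ("fs.default.name", "hdfs://%s:9000" % hostname),
        ("mapred.job.tracker", "%s:9001" % hostname),
        ("mapred.tasktracker.tasks.maximum", "8"),
        ("dfs.replication", "1"),
    ]
    root = ET.Element("configuration")
    for name, value in properties:
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # socket.gethostname() should match what the hostname command prints.
    print(build_site_config(socket.gethostname(), "/path/to/HADOOP_DIR"))
```

Redirect the output to conf/hadoop-site.xml once the values look right.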
Posted: October 18th, 2009 | Author: ryan | Filed under: Uncategorized | View Comments
Update: I’m now using the Disqus comment system. They provide a nifty service; check them out.
I’ve not been writing very much lately so I haven’t been maintaining comments either. I just marked over 50 comments as spam!
What is the best way to handle all the spam!?
Posted: August 7th, 2009 | Author: ryan | Filed under: Uncategorized | View Comments
I am writing this post from my temporary desk in the RAD Lab at UC Berkeley in the Computer Science Division. I’ll be spending some time here learning about Hadoop, a framework for distributed computing. Luckily for me, my long-time friend, Michael Armbrust, is a fourth-year CS PhD student here working on distributed applications. There are a lot of people here doing research in large-scale distributed and parallel computing, including Michael.
Since third grade, Michael and I would spend countless hours in front of our computers. We wrote little applications in QBasic then graduated to Visual Basic, all while running a small business venture in the background. Our largest project was called ‘System Assistant’. The System Assistant was a plugin-based, system-tray-bound application. Its main function was to load DLLs of individual self-contained helper applications. I can’t even remember how many plugins we wrote for it, but it was pretty awesome.
After Michael picked me up from the North Berkeley BART station we went to the CS graduate student social hour, which included wine and cheese, bread and even smoked salmon. After that we sat down for a bit to talk about setting some goals for my trip before going out on the town.
We were finally able to sit down in front of a keyboard, like old times, yesterday after dragging ourselves out of bed for a 9am conference call. Eventually coding ensued, and about 14 hours later, still sitting in the same conference room since that morning, we had managed to encapsulate my existing Python code to run it using Hadoop. Around 11pm that night we had run a Hadoop job against 10,000 PDS files totaling ~30GB. Processing took just barely over 15 minutes. The bottleneck turned out to be loading the data into HDFS, which itself took about an hour.
I’ve got a really good foundation now and can spend next week tweaking and polishing everything.