Alex Pooley's Blog

Hello there, my name is Alex Pooley and I'm a freelance web developer residing in Perth, Western Australia. My passion is in the development of web sites that solve everyday problems. Here's a gallery of some of my notable work. If you need a web site designer or developer, contact me with further details. Lastly, you can read more about me.

MapReduce: A Sole Trader’s Perspective

January 30th, 2008

Parallel Map Reduce

We geeks are notorious for ignoring the constraints of real life. It’s simply in our nature to question what may lie beyond the frontiers of human knowledge, while ignoring the inconvenient fact that we’re 30 years old, living at home, and our only friends are the people on IRC and that cross-dressing Second Life stalker who’s starting to creep us out. OK, so I’m stretching a bit, but that’s “basically” why I think many have missed the point of MapReduce.

There’s been a lot of discussion regarding the merits of the MapReduce algorithm, which was originally developed by Google to process huge amounts of data in parallel and has since been adopted by others for similar uses. Some gray-bearded old-school database fellows stirred the stew a couple of weeks ago, which resulted in a huge backlash from their own readers and the blogging community, and consequently a rebuttal from the gray beards. I have an interest in MapReduce for one reason: it makes writing parallel algorithms easier! This, I believe, is the true value of MapReduce. Google highlighted it in their original published paper, but many seem not to have properly digested it.

Consider a real-life scenario where I, a sole trader with limited resources, want to process large amounts of data in a very processor-intensive application. The algorithm in question is convergent, which means the more iterations it makes, the more accurate the result. Do I …

  1. Reach for my handy Postgres database and attempt to write and then run an algorithm. This route will entail a plethora of extra work, which means extra time.

    • Develop a schema for my data
    • Write an adapter to squeeze my data into the devised schema
    • Wait for my data import system to completely morph the data from its natural state to a table-based form
    • I’ll need to learn how to write a Postgres function
    • Or, I can use ActiveRecord to connect to the DB, but that will probably be way too slow
    • Learn how to use whatever bindings I can find for my Postgres function
    • Then I’ll need to write a function for my complex algorithm while I learn how to write a real Postgres function. All the while wondering if a function is even the best way to go about this…
    • I’ll need to work out indexes and other optimizations, otherwise, why am I stuffing all this stuff into a database again? Just to learn how to write a Postgres function with associated bindings?
    • Say I want to build a parallel database algorithm… so I can partition my data horizontally, and then cascade my calculations. But, WTF, where do I start? This sounds a little optimistic…
  2. Implement an automatically parallel MapReduce algorithm by assembling a collection of MapReduce parts. This will mean I have to …

    • Find an appropriate MapReduce platform. Hadoop is open source and used by Yahoo! so that sounds good. I’ll need Java, which I already know - check.
    • Install Hadoop and associated parts on my computers.
    • Devise an algorithm consisting of Map and Reduce implementations
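
To give a feel for that last step, here is a minimal sketch of a Map and a Reduce implementation for a convergent, processor-hungry job. It is plain Python rather than Hadoop, and everything in it is an assumption made for illustration: the problem (a Monte Carlo estimate of pi, where more samples give a more accurate answer), the names map_chunk, reduce_counts and estimate_pi, and the thread pool standing in for a cluster. A real platform would hand the map tasks out to other machines for you; the point is only that the part I have to write is two small functions.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(task):
    """Map step: sample points in the unit square and count those that
    land inside the quarter circle. Each chunk is independent of the rest."""
    seed, samples = task
    rng = random.Random(seed)  # per-chunk generator, so chunks share no state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits, samples


def reduce_counts(left, right):
    """Reduce step: merge two partial (hits, samples) results into one."""
    return left[0] + right[0], left[1] + right[1]


def estimate_pi(chunks, samples_per_chunk, workers=4):
    """Run the map step over every chunk, then fold the partial results."""
    tasks = [(seed, samples_per_chunk) for seed in range(chunks)]
    # The pool is a stand-in for the MapReduce platform. Threads will not
    # speed up CPU-bound Python code; they only show where the work fans out.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, tasks))
    hits, total = reduce(reduce_counts, partials, (0, 0))
    return 4.0 * hits / total


if __name__ == "__main__":
    # More chunks means more samples, which means a tighter estimate.
    print(estimate_pi(chunks=8, samples_per_chunk=50_000))
```

Because the map step keeps no shared state and the reduce step is a simple associative merge, it makes no difference which worker runs which chunk or in what order the partial results come back. That property is what lets a platform spread the work out without me writing any of the parallel plumbing.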

If you look at the work involved for either option, as a person with limited resources, the MapReduce route is certainly the more attractive. I get parallel computation “for free”. Sure, you could argue that once I learn all the DB stuff I’m future-proofed, and you could also argue that the DB system could well be much faster once created. Unfortunately, the fact is that I no longer live in a world of theories. That was a conscious decision I made over a year ago. While I could learn more about DBs, and consequently benefit from that experience, I could also completely screw up at any time during that process thanks to one of the many hidden pitfalls that come with entering the unknown. Let’s face it, have you ever heard of a software system developed on time and on budget? How freaking long would it take me to acquire the extra skill required to write a parallel algorithm with a database?!

I code in Ruby because I would rather have my computers do my work. I pay for that in run-time efficiency, but in return I actually get code out the door. In the same vein, I see MapReduce as a system of getting work completed and out the door right now, and I believe this is the true strength of MapReduce.

In this decision, I would take predictable over unpredictable any day. I see MapReduce as a much more straightforward route than using a database, and consequently a much more predictable one.
