Massive Data Storage and Analysis - Flightcaster
Image via CrunchBase
I recently read an article about a company called Flightcaster which is a group of engineers who are using Hadoop, Clojure, and a couple other open source software suites to get real-time flight predictions with FAA data.
This makes me think of the online ad space where a lot of companies are using Hadoop for storage and data analysis to predict ad performance.
During the last several years, an increasing number of systems within government and industry have been collecting massive amounts of raw data which often sits untapped in large data warehouses. FlightCaster strikes me as a great example of the next generation of web applications that will leverage that data: bootstrapped startups that apply machine learning and data processing at scale to solve a focused problem people actually care about.
From DataWrangling http://bit.ly/5lc0kX
One problem with Hadoop is that as you scale massive data sets it’s very time consuming to deliver a query so often times data is rolled up in higher levels into a mySQL table or something like that so it can be randomly accessed by a consumer website. The problem with that is you need to predict in advance all of the different ways a user will want to slice and dice data and pre-run those queries and then roll up the results into mySQL.
The guys at Flightcaster seem to have a way of organizing the stack to get more efficient abstracton of the data:
I’ve been a big fan of using the combination of Rails, Hadoop, and Amazon EC2 along with a high level language (in your case Clojure). Any tips for people out there thinking of using a similar technology stack? How cost effective is running Hadoop on EC2 for you?
Building layer upon layer of abstraction is a big key. On the jvm, you have to do this, it is the path around the verbosity of Java and the vast abyss of poorly done APIs. You just keep searching until you finally find the folks who have built a sane, high level API on top of the thing you want to use - then you wrap it in a high level language like Clojure. The technical term for this is “wrap the crap.”
In our case, we use Cascading as our step up in abstraction on top of Hadoop.
S3 -> EC2 -> Cloudera -> HDFS -> Hadoop -> Cascading -> Clojure. I’m not sure if those layers are exactly the right order, but you get the point. The key is go keep layering until you encapsulate the plumbing and get to the level of abstraction that lets you focus on solving your problem.
Running Hadoop on EC2 has been very cost effective. The biggest issues have come into play with the disconnect between Hadoop and S3. S3 expects open connections to keep reading, and if they don’t, S3 terminates them. S3 is very much the Arnold of the distributed file system world. So if your Hadoop jobs are compute intensive, and they are buffering in data in a lazy loading fashion, they tend to lose the connection to S3 during long processing phases. We’ve worked around this with some hackery, and we are working with Chris Wensel (of Cascading fame) on a more industrial strength solution to the problem.
from DataWrangler http://bit.ly/5lc0kX
So I guess the key to this they’re saying is this Rails type structure called Cascading.
Update: see tweet from @durana who is more of an expert on this.
@btomasette I wouldn’t call Cascading similar to Rails. It is a layer on top of Hadoop’s MR framework for creating jobs in a different way. 17 minutes ago from web in reply to btomasette Retweeted by you
duranaAdam Durana
Here is a presentation showing how Cascading works. A little over my head but I guess the idea is if you can Chain the functions and basically come up with the path of least resistance.
Cascading[1] View more documents from btomasette.

![Reblog this post [with Zemanta]](http://img.zemanta.com/reblog_b.png?x-id=503f9d28-6eae-40c0-bfeb-18c55061c6f3)
