A High Level Comparison of Hadoop and Dryad

by Josh Patterson ~ July 20th, 2009. Filed under: Parallel Computing, Research, The Evolution of Computing.

Disclaimer: This is my personal opinion and in no way reflects the opinion of my employer.

Today parallel programming at the commodity level is a booming sector, both from a big tech standpoint and from a small grassroots “web 2.0” standpoint. Most everyone who is connected to the internet has more data than they know what to do with. As I use Hadoop on a daily basis, I recently sat down and took a look at the two research papers [1,2] published by Microsoft on their competing platform, “Dryad”. Given that I’ve heard some informal discussion comparing the two platforms, I decided to write up my own comparison. I’m going to state upfront that I have only seen Microsoft compare Dryad against the Map Reduce model in general, and not specifically against Hadoop itself. However, it seems like Dryad and Hadoop are going to have at least some overlap in userbase. Therefore I think it’s a worthwhile exercise to compare them, if only to help someone figure out which one is right for them. The major areas that I wanted to compare the two platforms on were:

  • Processing Model
  • Tools
  • Storage
  • Community
  • Performance
  • Cost

Hadoop employs the venerable Map Reduce [4] programming model that was made popular at Google. Map Reduce has proven quite reliable and robust, helped by the strong community based around its open source implementation, Hadoop. With Dryad, Microsoft proposes using a Directed Acyclic Graph (DAG) to combine computational “vertices” with communication “channels” to model data flow graphs.
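As a quick refresher on what the Map Reduce model asks of the developer, here is a minimal single-process sketch in Python. It is not Hadoop's API: `map_reduce`, `word_mapper`, and `count_reducer` are hypothetical names made up for illustration, and a real cluster would spread the map, shuffle, and reduce phases across many machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Imitate the map, shuffle, and reduce phases in one process."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: collapse each key's values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(key, values):
    """Sum the partial counts for one word."""
    return sum(values)


lines = ["Hadoop runs Map Reduce", "Dryad runs a DAG", "Map Reduce is a DAG"]
counts = map_reduce(lines, word_mapper, count_reducer)
print(counts)
```

The point of the sketch is that the developer only writes the two small functions; the shape of the computation (map, then shuffle, then reduce) is fixed by the framework.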

A DAG is somewhat analogous to the Map Reduce model, yet it is more complex and can model Map Reduce itself (among other flows). The Dryad team states:

The application can discover the size and placement of data at run time, and modify the graph as the computation progresses to make efficient use of the available resources.

Dryad’s argument is that Map Reduce is too simplistic and that

In all these programming paradigms, the system dictates a communication graph, but makes it simple for the developer to supply subroutines to be executed at specified graph vertices.

Map Reduce was designed to be accessible to the widest possible class of developers, and therefore aims for simplicity at the expense of generality and performance. By fixing the boundary between the communication graph and the subroutines that inhabit its vertices, the model guides the developer towards an appropriate level of granularity.
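To make the “vertices” and “channels” idea a little more concrete, here is a toy Python sketch of a DAG executor. It is emphatically not Dryad's API (Dryad is driven from C++ and DryadLINQ): `run_dag` and the vertex functions are hypothetical names, the graph is static, and nothing here models Dryad's ability to modify the graph at run time. It only shows that a word count can be written as a graph of vertices, and that an extra stage can be attached to the same graph instead of being a second job.

```python
from collections import Counter
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(vertices, channels, sources):
    """Run each vertex once, after all of its upstream vertices.

    vertices: name -> function taking a list of upstream outputs
    channels: list of (producer, consumer) edges
    sources:  name -> input data for vertices with no upstream
    """
    upstream = {name: [] for name in vertices}
    for producer, consumer in channels:
        upstream[consumer].append(producer)
    outputs = {}
    # static_order() yields every vertex after its predecessors.
    for name in TopologicalSorter(upstream).static_order():
        if upstream[name]:
            inputs = [outputs[p] for p in upstream[name]]
        else:
            inputs = [sources[name]]
        outputs[name] = vertices[name](inputs)
    return outputs


def count_words(inputs):
    """Map-like vertex: count words in one partition of lines."""
    (lines,) = inputs
    return Counter(w for line in lines for w in line.lower().split())


def merge_counts(inputs):
    """Reduce-like vertex: merge the partial counts from all partitions."""
    total = Counter()
    for partial in inputs:
        total.update(partial)
    return total


def most_common_word(inputs):
    """Extra downstream stage chained onto the same graph."""
    (total,) = inputs
    return total.most_common(1)[0][0]


vertices = {"map_0": count_words, "map_1": count_words,
            "reduce": merge_counts, "top": most_common_word}
channels = [("map_0", "reduce"), ("map_1", "reduce"), ("reduce", "top")]
sources = {"map_0": ["Hadoop runs Map Reduce", "Dryad runs a DAG"],
           "map_1": ["a DAG can model Map Reduce", "the DAG is a general DAG"]}
outputs = run_dag(vertices, channels, sources)
print(outputs["top"])
```

Here the Map Reduce shape is just one possible wiring of `channels`; the extra flexibility (and the extra complexity) comes from being allowed to wire the vertices any other acyclic way.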

The more I read through the Dryad paper, the more I began to see their argument: Dryad allows the programmer to be more expressive by offering a more general parallel computing model. However, is Dryad’s additional complexity a net gain in practical terms? The researchers in Berkeley’s RAD Lab don’t think so. In their paper on the Nexus [3] project, they state:

However, making a programming model too general increases complexity (Dryad is arguably more complex than Map Reduce) and decreases the opportunity to optimize programs using application specific knowledge.

going on to state:

We believe that no single general programming model for the cloud will exist. Instead, multiple cluster computing frameworks will emerge, providing various programming models with different tradeoffs.

which I feel is a pretty sage observation, at least in a general sense — design is about tradeoffs.

Dryad has some interesting tools, obviously integrating nicely with the .NET framework through DryadLINQ expressions. Hadoop has an interesting array of tools including:

  • Pig - A high-level data-flow language and execution framework for parallel computation.
  • Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Cascading
  • HBase
  • Thrift

and many other projects that have a cozy relationship with the Hadoop / HDFS ecosystem. Another inherent aspect to Hadoop’s array of tools is the fact that the project is open source and you can freely download the source code to add bug fixes or new features as you see fit. With Dryad, this is sadly not an option.

As I was reading up on Dryad I was really surprised to find that Microsoft does not include any sort of distributed filesystem “out of the box” for use with the system. Internally, Microsoft uses the Cosmos file system, but this is unfortunately not even available in the academic preview. Instead, there are bindings for NTFS and SQL Server, which left me somewhat puzzled. If you want to position a product to go up against a solid engineering effort such as Hadoop, how can you leave out the distributed storage part of it? Hadoop, on the other hand, comes out of the box with HDFS and also works with KFS, HBase, Amazon S3, and others. To be fair, the Dryad team has only compared itself to the Map Reduce programming model and not the Hadoop project itself (HDFS included), so it is premature to make a comparison on this point, but it’s something that stood out to me.

The community aspect of Hadoop is a really nice feature from the perspective of a developer. The community is extremely active (and vocal) and will engage most any Hadoop related topic with very detailed discussions. A wonderful thing about open source is that a popular project generally has a strong and voracious user and developer base. Even though Microsoft is known to have its own enthusiastic supporters, at the present time Dryad has little beyond a small select academic circle using it. Until a production grade package for Dryad is released the jury will be out on what kind of community it will command.

As I don’t currently have a Dryad cluster to run tests on, I am not able to run my own performance tests on it (unlike with Hadoop, where I have access to a cluster). I would really like to see more apples-to-apples comparisons to see if Dryad’s added complexity yields performance increases that are worth the extra design overhead. I’ve learned the hard way that for anything you are going to rely on, you need to do your own benchmarking in order to know the true “real world” performance of a system (mesh networking coming to mind here). The 2007 Dryad paper [1] understandably did not have a comparison to Hadoop, but I was somewhat surprised to see the 2008 paper [2] not have a direct comparison (instead, the paper talked about Map Reduce in a general sense). To be fair, Hadoop may not have been on the Dryad team’s radar until the paper was well underway and, as I pointed out before, may not be a focal point for them in terms of market segment.

However, I did notice that the Dryad team tested against Jim Gray’s Terasort benchmark [5] and reported numbers for an array of cluster sizes, with their performance reported as 319 seconds on 240 nodes for sorting 10^12 bytes. Hadoop currently holds the Terasort record [6], sorting 10^12 bytes in 62 seconds. The primary difference, however, is that the Yahoo team used 1,460 nodes in their Hadoop cluster. As anyone knows, these sorts of tests are highly sensitive to parameter tuning and other conditions, so it is somewhat difficult to extrapolate what Dryad’s performance would be at 1,460 nodes. I did develop a quick and dirty fitted logarithmic curve for Dryad’s Terasort performance, and came up with ~800 seconds at 1,460 Dryad nodes. (Update: Dimitri Ryaboy has since pointed out a flaw in this projection: Dryad is sorting 3.87 GB per node, so the amount of data being sorted grows with the cluster size. This inflates the total number of bytes past the 1 TB mark and makes this projection suspect at best. Without an “apples-to-apples” setup, it’s difficult to get a better assessment. Originally the projection was done out of curiosity and meant to be taken with a very large grain of salt.) I did this purely out of curiosity; it’s unfair to even begin to project out performance given the myriad of variables involved in these types of setups. Regardless, I can safely say that I feel Hadoop’s implementation of Map Reduce is not holding it back performance-wise on the Terasort benchmark when compared with the DAG implementation of Terasort. I would also be willing to bet that the Dryad team will step up to the plate and make a serious run at the Terasort benchmark now that the Hadoop team at Yahoo has taken the crown.
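For anyone who wants to check the arithmetic behind the update, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted above (319 seconds on 240 nodes, 62 seconds on 1,460 nodes, 3.87 GB per node); the function names are mine, I am assuming decimal gigabytes, and the per-node throughput ignores every difference in hardware, network, and tuning between the two clusters, so it says nothing about which system is actually faster.

```python
TERASORT_BYTES = 10**12   # benchmark size quoted in the post
DRYAD_GB_PER_NODE = 3.87  # per-node data size from the update


def per_node_throughput_mb_s(total_bytes, seconds, nodes):
    """Average megabytes sorted per second per node (decimal MB)."""
    return total_bytes / seconds / nodes / 1e6


def dryad_total_data_gb(nodes):
    """Total data sorted when every node holds a fixed 3.87 GB."""
    return DRYAD_GB_PER_NODE * nodes


dryad_rate = per_node_throughput_mb_s(TERASORT_BYTES, 319, 240)
hadoop_rate = per_node_throughput_mb_s(TERASORT_BYTES, 62, 1460)
print(f"Dryad:  {dryad_rate:.2f} MB/s per node")
print(f"Hadoop: {hadoop_rate:.2f} MB/s per node")

# With a fixed amount of data per node, the job grows with the cluster,
# so a projection to 1,460 nodes is no longer a 1 TB sort.
for nodes in (240, 1460):
    print(f"{nodes} nodes -> {dryad_total_data_gb(nodes):.1f} GB")
```

The second loop is the heart of the flaw: at 1,460 nodes the Dryad setup would be sorting several terabytes, so a curve fitted to those runs is not projecting the same workload that the Hadoop record measured.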

How much our computation time costs us is generally a prime concern for most big data businesses. Both clusters require a capital outlay for hardware, so as long as you are using commodity hardware (and here we assume you are), this is a wash. However, per-node license costs are typically where vendors make their money in the OS and RDBMS scenes. With Dryad we only know that it runs on Windows HPC Server, so your per-node cost would have to reflect that (probably not cheap), and then there would more than likely be a per-node license fee for each Dryad node (unless they release an all-in-one package/license). So cost with Dryad, at least in stark comparison with Hadoop, would be an issue for any sizable cluster. There are also the added costs of support, where I’m sure Microsoft would offer a support contract option, and Hadoop has a few vendors that offer enterprise support. Other than that, I feel that the Hadoop community is a strong factor in keeping costs low, as you can either request a feature or patch it in yourself, which can save you a lot of time in contrast to waiting for that next 18 month release cycle.

Hadoop offers a robust computing platform that, while it may be considered simpler than Dryad’s DAG, has proven to be a reliable and powerful parallel computing solution. Sometimes less is more, but until we see more hard numbers published I will cut the Dryad team some slack there. When dealing with “warehouse scale computing”, I believe that the algorithm is going to dominate performance, with generalized optimizations lagging behind that. In terms of relevance to the developer community, Dryad’s more complex processing data flow may or may not provide enough incentive to take on its much higher total cost of ownership; only a fully completed Dryad product will be able to accurately answer that. Taking on Hadoop directly may never even be in the Dryad team’s plans, as they seem to be targeting a more general parallel programming case. If you are looking to do certain types of parallel processing intensive computations, you may want to give Dryad’s DAG model a look. However, without an “in the box” storage solution like HDFS and more compelling reasons to deal with their more general programming model, I feel that Dryad may have a hard time finding fans in the general Hadoop ecosystem.

References

[1] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly, Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks, http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf

[2] DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language, http://research.microsoft.com/en-us/projects/dryadlinq/dryadlinq.pdf

[3] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ion Stoica, Scott Shenker, Nexus: A Common Substrate For Cluster Computing, http://www.usenix.org/event/hotcloud09/tech/slides/hindman.pdf

[4] Jeff Dean, Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, http://labs.google.com/papers/mapreduce.html

[5] Terasort, http://sortbenchmark.org/
