MapReduce

Last week I read a 2004 paper called MapReduce: Simplified Data Processing on Large Clusters. Written by a couple of Google researchers, it describes a simple programming model and accompanying library for processing large datasets in parallel. Google uses MapReduce under the hood for all sorts of things, from indexing to machine learning to graph computation. Very handy indeed.
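
In case you haven't read the paper, the model boils down to two user-supplied functions: a map that turns each input record into intermediate key/value pairs, and a reduce that merges all the values sharing a key, while the library takes care of partitioning, scheduling, and fault tolerance across a cluster. Here's a minimal single-machine sketch of the idea using the paper's word-count example; the function names and toy driver are my own illustration, not Google's actual API.

```python
# Toy illustration of the MapReduce programming model (word count).
# The names map_fn, reduce_fn, and run_mapreduce are made up for this sketch;
# a real implementation distributes the work across many machines.

from collections import defaultdict
from typing import Iterable, Iterator

def map_fn(doc_name: str, contents: str) -> Iterator[tuple[str, int]]:
    # Emit an intermediate (key, value) pair for every word in the document.
    for word in contents.split():
        yield word, 1

def reduce_fn(word: str, counts: Iterable[int]) -> int:
    # Merge all values emitted for the same key; here, just sum the counts.
    return sum(counts)

def run_mapreduce(inputs: dict[str, str]) -> dict[str, int]:
    # "Shuffle" phase: group intermediate values by key, then reduce each group.
    groups: dict[str, list[int]] = defaultdict(list)
    for doc_name, contents in inputs.items():
        for key, value in map_fn(doc_name, contents):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

if __name__ == "__main__":
    docs = {"a.txt": "the quick brown fox", "b.txt": "the quick lazy dog"}
    print(run_mapreduce(docs))  # {'the': 2, 'quick': 2, 'brown': 1, ...}
```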

So imagine my surprise to find in last Friday’s edition of ACM TechNews that this paper has been republished in Communications of the ACM this month, albeit in a slightly shorter form. Aside from a few cosmetic changes (an updated figure and table), the content of the two papers is the same. That is, you don’t gain any knowledge from reading one of the papers that you wouldn’t gain from reading the other. There is no indication in the more recent publication that so much content has been duplicated from an earlier paper, though there is a citation to the older paper. In short, this is not new material, having first been published more than three years ago. Communications of the ACM seems to be trialling a new model, whereby the best articles from conferences are modified and republished for the ACM audience. But seriously, the modifications in the republished MapReduce article are negligible. What gives?