Extension of the MapReduce Model for Data Center Computing
MapReduce is a parallel programming model that is widely regarded as suitable for analyzing large data sets. The model processes data in two main steps, map and reduce: the map phase works on the input to create intermediate results, and the reduce phase aggregates the output of the map operations to form the final result. Between these two steps there is a shuffle phase, in which the intermediate data are sorted by key. The Hadoop MapReduce framework is a MapReduce implementation that has been widely used by researchers. At runtime the framework copies a lot of data; for example, the output of each map operation is copied over the network to the node where the shuffle phase is performed. Hadoop MapReduce targets a single cluster with a local network connection, where such data copies may not introduce much overhead. However, for a distributed MapReduce execution across several clusters, or even across different data centers, the Hadoop implementation will suffer a large performance penalty due to this data movement.
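The three phases described above can be illustrated with a minimal, framework-free sketch in Java. It is a word count written against plain collections, not against the Hadoop API; the class name `MapReduceSketch` and all method names are hypothetical and chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of map -> shuffle -> reduce (word count).
// Plain Java collections only; none of these names come from Hadoop.
final class MapReduceSketch {

    // Map step: turn one input line into intermediate (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle step: group intermediate values by key; the TreeMap keeps keys sorted.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce step: aggregate all values that belong to one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in sequence over all input lines.
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        shuffle(intermediate).forEach((key, values) -> result.put(key, reduce(values)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

In this sketch every intermediate pair passes through `shuffle` in one place, which mirrors the data movement discussed above: the volume that has to be copied grows with the number of map outputs, not with the size of the final result.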
The master's thesis aims to optimize the current Hadoop MapReduce implementation by adding a new semantic, map-reduce-reduce. The basic idea is to split the reduce phase into two steps: a local reduce and a global reduce. The local reduce first aggregates the outputs of the map operations executed within the local cluster (node or data center); the global reduce then aggregates the local results into the final one. For this purpose, the Hadoop source code has to be modified to insert the local reduce phase. In addition, the Hadoop implementation will be extended to multi-cluster environments. Existing work that combines Hadoop with a global file system can serve as the basis for this extension.
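The map-reduce-reduce idea can be sketched in the same framework-free style. The following Java example is an assumption-laden illustration, not the planned Hadoop modification: each "cluster" is simply a list of input lines, and the class name `MapReduceReduceSketch` and its methods are hypothetical. It also counts how many records would have to leave a cluster with and without the local reduce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of map -> local reduce -> global reduce (word count).
// A "cluster" is modeled as a list of input lines; no Hadoop API is used.
final class MapReduceReduceSketch {

    // Map step: emit a (word, 1) pair for every word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Local reduce: aggregate the map outputs produced inside one cluster.
    static TreeMap<String, Integer> localReduce(List<String> clusterLines) {
        TreeMap<String, Integer> partial = new TreeMap<>();
        for (String line : clusterLines) {
            for (Map.Entry<String, Integer> p : map(line)) {
                partial.merge(p.getKey(), p.getValue(), Integer::sum);
            }
        }
        return partial;
    }

    // Global reduce: merge the per-cluster partial results into the final result.
    static TreeMap<String, Integer> globalReduce(List<TreeMap<String, Integer>> partials) {
        TreeMap<String, Integer> result = new TreeMap<>();
        for (TreeMap<String, Integer> partial : partials) {
            partial.forEach((key, value) -> result.merge(key, value, Integer::sum));
        }
        return result;
    }

    // Without a local reduce, every intermediate pair leaves its cluster.
    static int recordsWithoutLocalReduce(List<List<String>> clusters) {
        int count = 0;
        for (List<String> cluster : clusters) {
            for (String line : cluster) {
                count += map(line).size();
            }
        }
        return count;
    }

    // With a local reduce, only one record per distinct key leaves each cluster.
    static int recordsWithLocalReduce(List<TreeMap<String, Integer>> partials) {
        int count = 0;
        for (TreeMap<String, Integer> partial : partials) {
            count += partial.size();
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<String>> clusters = List.of(
                List.of("a b a", "a a"),   // cluster 1
                List.of("b c", "c c"));    // cluster 2
        List<TreeMap<String, Integer>> partials = new ArrayList<>();
        for (List<String> cluster : clusters) {
            partials.add(localReduce(cluster));
        }
        System.out.println(globalReduce(partials));              // prints {a=4, b=2, c=3}
        System.out.println(recordsWithoutLocalReduce(clusters)); // prints 9
        System.out.println(recordsWithLocalReduce(partials));    // prints 4
    }
}
```

Splitting the reduce this way gives the same final result only when the aggregation can be applied in stages, as is the case for the sum used here (it is associative and commutative). Whether a given benchmark application satisfies this is something the design would have to check per application.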
- Research survey: a survey of related work, including MapReduce and the Hadoop system. This task produces one chapter of the final thesis: related work.
- Design and implementation of the new MapReduce framework
- Performance evaluation: test bed setup and speedup measurement with benchmark applications
- Thesis write-up
The work requires background knowledge of parallel and distributed computing and programming skills in Java. Experience with MapReduce is desirable but not required.
 G-Hadoop: master thesis
 Hadoop project: http://hadoop.apache.org/
Dr. Jie Tao: email@example.com