Development of a Profiling Tool for Monitoring the Data Traffic of MapReduce Executions
Map/Reduce is a programming model introduced by Google Inc. for processing and generating large distributed data sets. The model is based on a map function and a reduce function: the map function takes input data in the form of <key,value> pairs and produces intermediate <key,value> pairs, which the reduce function then processes to generate the final results. Hadoop is an open-source implementation that supports Map/Reduce execution in a cluster environment. The input and output data of applications in the Hadoop MapReduce framework are managed by the Hadoop Distributed File System (HDFS). At runtime, data must be copied from one computing node to another for computation and key sorting. Since MapReduce applications usually process large data sets, this data movement can become a performance bottleneck.
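To make the model concrete, the following is a minimal single-process sketch of a word count written in the map/shuffle/reduce style. It is an illustration only and does not use the Hadoop API; the class name MapReduceSketch and its method names are assumptions chosen for this example. In a real cluster, the shuffle step is where intermediate pairs are copied between nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Hadoop or Starfish.
class MapReduceSketch {

    // Map step: emit one <word, 1> pair for every word in a line of input.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle step: group intermediate values by key.
    // The TreeMap keeps the keys sorted, mirroring the key sorting phase.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: sum all intermediate values that share one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Run map, shuffle, and reduce in sequence over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(intermediate).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {be=2, do=1, not=1, or=1, to=3}
        System.out.println(run(List.of("to be or not to be", "to do")));
    }
}
```

In Hadoop the three steps run on different nodes, so every pair that passes through the shuffle step may cross the network; this is the traffic the profiling tool is meant to observe.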
Tasks of the Master's thesis
The Master's thesis aims to develop a profiling tool that traces data movement while the application is running. The tool may also report the execution time of map and reduce tasks as well as the sorting overhead. This information will help application developers manage data in their programs more effectively, for example by choosing an appropriate input data size for the map tasks or by using specific functionality to shorten the shuffle phase. Existing profiling tools, for example Starfish, can be used as the basis of this thesis. Concretely, the Master's thesis comprises the following tasks:
- Research survey.
- Design of the profiling tool and implementation of a prototype with visualization functionality.
- Validation with benchmark applications.
- Writing up the thesis and a scientific paper.
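As a rough idea of the measurements the profiling tool is meant to collect (data moved between nodes and time spent per phase), the following minimal sketch records both in memory. It is an assumption-laden illustration, not the design of the tool: the class PhaseProfiler and all of its methods are hypothetical, and it does not hook into Hadoop or Starfish.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration class; not part of Hadoop or Starfish.
class PhaseProfiler {
    private final Map<String, Long> elapsedNanos = new LinkedHashMap<>();
    private final Map<String, Long> bytesMoved = new LinkedHashMap<>();

    // Run one phase (e.g. "map", "sort", "reduce"), add its wall-clock
    // time to the phase total, and pass the phase's result through.
    <T> T time(String phase, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            elapsedNanos.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    // Account for one <key,value> record copied during a phase,
    // measured as the UTF-8 size of key plus value.
    void recordTransfer(String phase, String key, String value) {
        long size = key.getBytes(StandardCharsets.UTF_8).length
                  + value.getBytes(StandardCharsets.UTF_8).length;
        bytesMoved.merge(phase, size, Long::sum);
    }

    // Total bytes recorded for a phase, or 0 if none were recorded.
    long bytes(String phase) {
        return bytesMoved.getOrDefault(phase, 0L);
    }

    // Total nanoseconds recorded for a phase, or 0 if it never ran.
    long nanos(String phase) {
        return elapsedNanos.getOrDefault(phase, 0L);
    }

    // Plain-text summary with one line per timed phase.
    String report() {
        StringBuilder out = new StringBuilder();
        for (String phase : elapsedNanos.keySet()) {
            out.append(phase)
               .append(": ").append(nanos(phase)).append(" ns, ")
               .append(bytes(phase)).append(" bytes moved\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        PhaseProfiler profiler = new PhaseProfiler();
        profiler.time("shuffle", () -> {
            profiler.recordTransfer("shuffle", "to", "1");
            profiler.recordTransfer("shuffle", "be", "1");
            return null;
        });
        System.out.print(profiler.report());
    }
}
```

A real tool would have to obtain these figures from the running framework, for example from task logs or counters, and feed them into the visualization; how to do that is part of the design task.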
The work requires background knowledge of parallel and distributed computing and programming skills in Java. Experience with MapReduce is desirable but not required.
 Map/Reduce: http://labs.google.com/papers/mapreduce-osdi04.pdf
 Hadoop project: http://hadoop.apache.org/
 HDFS: http://hadoop.apache.org/docs/stable/hdfs_design.html
 Starfish: http://www.cs.duke.edu/starfish/index.html
Dr. Jie Tao: email@example.com