Development of a Profiling Tool for Monitoring the Data Traffic of MapReduce Executions

Revision as of 15:01, 16 December 2014 by Tao



MapReduce [1] is a programming model introduced by Google Inc. for processing and generating large distributed data sets. The model is based on a map function and a reduce function: the map function takes input data in the form of <key,value> pairs and produces intermediate values, which the reduce function then processes to generate the final results. Hadoop [2] is an open-source implementation that supports MapReduce execution in a cluster environment. The input and output data of applications in the Hadoop MapReduce framework are managed by the Hadoop Distributed File System (HDFS) [3]. At runtime, data has to be copied from one computing node to another for computation and key sorting. Since MapReduce applications usually process large data sets, this data movement can become a performance bottleneck.
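To make the model concrete, the following is a minimal in-memory sketch of a word count expressed as map, shuffle, and reduce steps. It only illustrates the programming model and does not use the Hadoop API; the class name MiniMapReduce and all method names are hypothetical and chosen for this example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a single-process word count in the MapReduce style.
// All names here are hypothetical; none of them belong to Hadoop.
final class MiniMapReduce {

    // Map: split one input line into words and emit <word, 1> for each word.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleImmutableEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle: group the intermediate values by key. The TreeMap keeps keys
    // sorted, which stands in for the key sorting done by a real framework.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // Reduce: sum all intermediate values that share one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: run map over every line, shuffle, then reduce each key group.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(intermediate).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }
}
```

For the two input lines "a b a" and "b c", wordCount returns the counts a=2, b=2, c=1. In a cluster, the shuffle step is where intermediate data crosses node boundaries, which is the data movement the thesis is concerned with.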

Tasks of the Master's thesis

The Master's thesis aims to develop a profiling tool that traces data movement while the application is running. The tool may also report the execution time of map and reduce tasks, as well as the sorting overhead. This information will help application developers manage the data in their programs better, for example by setting an appropriate data size for the map tasks or by using specific functionality to shorten the shuffle phase. Existing profiling tools, for example Starfish [4], can serve as the basis for this thesis. Concretely, the Master's thesis comprises the following tasks:

  • Research survey.
  • Design of the profiling tool and implementation of a prototype with visualization functionality.
  • Validation with benchmark applications.
  • Write-up of the thesis and a scientific paper.
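
The kind of measurement described above (time per phase and volume of data moved) can be sketched in plain Java as follows. This is only an assumption about what a first prototype might record; the class PhaseProfiler and its methods are hypothetical and are not part of Hadoop or Starfish. Record sizes are approximated by the UTF-8 length of key and value, which would differ from the serialized sizes in a real cluster.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch of a per-phase profiler; not a Hadoop or Starfish API.
final class PhaseProfiler {
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();
    private final Map<String, Long> bytesByPhase = new LinkedHashMap<>();

    // Run one phase (e.g. "map", "shuffle", "reduce") and accumulate
    // its wall-clock time under the given phase name.
    void time(String phase, Runnable body) {
        long start = System.nanoTime();
        try {
            body.run();
        } finally {
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    // Record one <key,value> record moved during a phase. The size is an
    // approximation: the UTF-8 byte length of key plus value.
    void recordMoved(String phase, String key, String value) {
        long size = key.getBytes(StandardCharsets.UTF_8).length
                  + value.getBytes(StandardCharsets.UTF_8).length;
        bytesByPhase.merge(phase, size, Long::sum);
    }

    long elapsedNanos(String phase) {
        return nanosByPhase.getOrDefault(phase, 0L);
    }

    long bytesMoved(String phase) {
        return bytesByPhase.getOrDefault(phase, 0L);
    }

    // Plain-text summary with one line per phase, sorted by phase name.
    String report() {
        TreeSet<String> phases = new TreeSet<>(nanosByPhase.keySet());
        phases.addAll(bytesByPhase.keySet());
        StringBuilder sb = new StringBuilder();
        for (String phase : phases) {
            sb.append(phase)
              .append(": ").append(elapsedNanos(phase)).append(" ns, ")
              .append(bytesMoved(phase)).append(" bytes moved\n");
        }
        return sb.toString();
    }
}
```

A caller would wrap each phase in time(...) and call recordMoved(...) for every record that leaves a node. An actual tool would more likely obtain such figures from the framework's own counters and logs, which is part of what the research survey has to establish.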


The work requires background knowledge of parallel and distributed computing and programming skills in Java. Experience with MapReduce is desirable but not required.


[1] MapReduce: http://labs.google.com/papers/mapreduce-osdi04.pdf
[2] Hadoop project: http://hadoop.apache.org/
[3] HDFS: http://hadoop.apache.org/docs/stable/hdfs_design.html
[4] Starfish: http://www.cs.duke.edu/starfish/index.html


Dr. Jie Tao: jie.tao@kit.edu