Development of a Profiling Tool for Monitoring the Data Traffic of MapReduce Executions

Back to the topic list

Overview

Map/Reduce [1] is a programming model introduced by Google Inc. for processing and generating large, distributed data sets. The model is based on a map and a reduce function: the former consumes input data in the form of <key,value> pairs and produces intermediate values, which the reduce operation aggregates into the final results. Hadoop [2] is an open-source implementation that supports Map/Reduce executions in a cluster environment. The input and output data of applications in the Hadoop MapReduce framework are managed by the Hadoop Distributed File System (HDFS) [3]. At runtime, data has to be copied from one computing node to another for computation and key sorting. Since MapReduce applications usually process large data sets, this data movement can become a performance bottleneck.
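To illustrate the programming model, the following sketch shows the classic word-count example written against the Hadoop Java API. It follows the standard Hadoop tutorial; class and method names assume the org.apache.hadoop.mapreduce API of Hadoop 2.x and may differ slightly between releases.

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

      // Map: emit a <word, 1> pair for every token in the input line.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
          private final static IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          public void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  context.write(word, ONE);
              }
          }
      }

      // Reduce: sum up all counts emitted for the same word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();

          @Override
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                  sum += val.get();
              }
              result.set(sum);
              context.write(key, result);
          }
      }

      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }

Between the map and the reduce phase, the intermediate <word, 1> pairs are sorted by key and shuffled across the network to the reducers; this is exactly the data movement the profiling tool is meant to make visible.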

Tasks of the master thesis

The Master thesis aims at developing a profiling tool that traces the data movement while the application is running. The tool should also be capable of reporting the execution times of map and reduce tasks as well as the overhead of sorting. This information will help application developers to manage the data in their programs better, for example to choose an appropriate input data size for the map tasks or to use specific functionality to shorten the shuffle phase. The profiling tool will be integrated into the Hadoop MapReduce framework. Concretely, the Master thesis comprises the following tasks:

  • Research survey.
  • Design of the profiling tool and implementation of a prototype in the Hadoop MapReduce framework.
  • Validation with benchmark applications.
  • Write-up of the thesis and a scientific paper.
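As a starting point for such a profiler, Hadoop already exposes coarse data-movement information through its built-in job counters. The sketch below reads two of them after a job has finished; the counter names follow the TaskCounter enum of recent Hadoop releases and serve only to illustrate where a prototype could hook in, not as the final design.

  import org.apache.hadoop.mapreduce.Counters;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.TaskCounter;

  public class TrafficReport {

      // Prints two built-in counters that already reflect data movement:
      // the bytes emitted by all map tasks and the bytes shuffled to the
      // reducers. The planned profiling tool would collect such figures
      // per task and per phase instead of only job-wide totals.
      public static void report(Job job) throws Exception {
          Counters counters = job.getCounters();
          long mapOutputBytes =
                  counters.findCounter(TaskCounter.MAP_OUTPUT_BYTES).getValue();
          long shuffledBytes =
                  counters.findCounter(TaskCounter.REDUCE_SHUFFLE_BYTES).getValue();
          System.out.printf("map output: %d bytes, shuffled to reducers: %d bytes%n",
                  mapOutputBytes, shuffledBytes);
      }
  }

Figures like these could, for instance, guide the choice of the input split size (e.g. via FileInputFormat.setMaxInputSplitSize) or the decision to add a combiner in order to reduce shuffle traffic.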

Requirements

The work requires background knowledge in parallel and distributed computing and programming skills in Java. Experience with MapReduce is desirable but not required.

References

[1] Map/Reduce: http://labs.google.com/papers/mapreduce-osdi04.pdf
[2] Hadoop project: http://hadoop.apache.org/
[3] HDFS: http://hadoop.apache.org/docs/stable/hdfs_design.html

Contact

Dr. Jie Tao: jie.tao@kit.edu