Entwicklung eines MapReduce Frameworks mittels In-memory Datenübertragung

From Lsdf
Revision as of 15:32, 16 December 2014 by Tao (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Zurück zur Themenliste


MapReduce is a parallel programming model that is regarded as suitable for analyzing large data sets. The model processes the data mainly with a map step and a reduce operation, where the map phase works with the input to create intermediate result and the reduce phase aggregates the output of the map operations to form a final result. The Hadoop MapReduce framework [1] is a MapReduce implementation that has been widely used by researchers. The input and output data of applications in the Hadoop MapReduce framework are managed by the Hadoop Distributed File System (HDFS) [2]. At the runtime the Map tasks produce intermediate data, which are managed by HDFS on disks and then read by the Reduce tasks. Hence, a lot of time is spent in working with the intermediate data. The MPI MapReduce implementation [3] optimizes this issue with parallel I/O techniques.

The master thesis also aims at optimizing the operations with intermediate data but with an in-memory technology, i.e., the intermediate data will be stored in the memory and the Reduce tasks read the data directly from teh memory. For this purpose, the local memory of a single node is not sufficient and thus a global memory space across distributed nodes is required.

Partitioned Global Address Space (PGAS) [4] is currently a widely-used approach for building a global address space on a distributed system. The PGAS model provides an abstraction of a global memory address space that is logically partitioned. Each portion of the global address space has an affinity to a certain process or thread. A number of PGAS programming systems have been implemented in the past, including Unified Parallel C (UPC) [5] and Co-Array Fortran [6]. The project DASH [7] at SCC also targets on the PGAS approach.

In addition, there are in-memory caching techniques for databases. Memcached [8] is a good example. This client/server system provides persistent in-memory key-value storage services for database caching, where the values put by one client can be used by the others. Memcached scales well with its consistent hashing across distributed nodes

The master thesis will change the Hadoop implementation with intermediate data, i.e., replacing the data storage in HDFS with either a PGAS approach or the Memcached approach. For this, the Hadoop code must be analysed and modified. The implementation will be validated on a test cluster using stardard benchmark applications delivered together with the Hadoop MapReduce software package.

Work packages

  • Research survey: Research survey on related work, including Map/Reduce and Hadoop system, PGAS and Memcached. This task generates one chapter of the final thesis: related work/background.
  • Design and Implementation of the novel Map/Reduce framework
  • Performance evaluation: test bed setup and performance measurement with benchmark applications
  • Write-up thesis


The work requires background knowledge about parallel and distributed computing and programming skill in Java. Experiences with MapReduce are desired but not necessary.


[1] Hadoop project: http://hadoop.apache.org/
[2] HDFS: http://hadoop.apache.org/docs/stable/hdfs_design.html
[3] MR-MPI: http://mapreduce.sandia.gov/
[4] PGAS: http://www.pgas.org/
[5] W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, and K. Warren. Introduction to UPC and Language Speci_cation. Technical Report CCS-TR-99-157, IDA Center for Computing Sciences, 1999.
[6] R. W. Numrich and J. Reid. Co-array Fortran for Parallel Programming. SIGPLAN Fortran Forum, 17(2):1{31, Aug 1998.
[7] DASH: http://www.dash-project.org/
[8] Memcached: http://memcached.org/


Dr. Jie Tao: jie.tao@kit.edu