Services Usage
Following these usage recommendations means fewer support requests and better services for all users. See also [[Services overview]] for information about the infrastructure and its usage.
   
 
=== Storage usage ===

* Storing small files wastes resources unnecessarily and '''dramatically slows down the transfer speed''' in any protocol. A small file is currently anything below 200 MB. Please use data store formats that produce fewer, larger files instead of keeping thousands of small files around. For example, Unix users can bundle files with tar, cpio, or zip; Windows users can use similar tools.
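The bundling advice above can be sketched with standard Unix tools (the file names and paths here are purely illustrative):

```shell
# Create a few example small files (illustrative only)
mkdir -p results
for i in 1 2 3; do
    echo "sample data $i" > "results/file$i.txt"
done

# Bundle the whole directory into one compressed archive before transfer;
# one large file moves far faster than thousands of small ones
tar -czf results.tar.gz results/

# On the receiving side, unpack the archive again
tar -xzf results.tar.gz
```

cpio or zip work the same way; the point is simply that the data crosses the network and sits on disk as a single large file.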
 
=== Hadoop cluster usage ===

* The Hadoop cluster consists of 58 nodes providing 464 physical cores in total, with 36 GB of RAM and 2 TB of disk each. However, all nodes are '''SHARED''' between the different Hadoop tasks and the OpenNebula virtual machines. Please don't assume that you will have access to all of those resources at the same time :-)
 
* The HDFS (Hadoop Filesystem) is accessible both natively (API, 'hadoop' command line tool) and via a '''Hadoop FUSE mount'''. The latter is NOT production quality and '''should NOT be used for data processing'''. It is provided only for the sake of "user friendliness"!
 
* You '''should strive to use the Hadoop Map-Reduce framework natively''', as this is where the framework excels and where the performance gains will be maximal.
 
 
* If you use the framework to call external executables (via the streaming-jar for instance), then you must make sure that the '''external executables do NOT start multiple threads'''.
 
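As a rough sketch of the native-usage points above (the HDFS paths, user name, and streaming jar file name are assumptions that vary per installation; the script only prints each command, so nothing is submitted by accident):

```shell
# Illustrative only: print each command instead of executing it,
# since actually running them needs a configured Hadoop client.
run() { echo "$@"; }

# Interact with HDFS natively through the 'hadoop' tool, not the FUSE mount:
run hadoop fs -put local-input.txt /user/alice/input/
run hadoop fs -ls /user/alice/input/

# Submit a streaming job; mapper and reducer are plain single-threaded
# Unix commands, as required on the shared cluster:
run hadoop jar hadoop-streaming.jar \
    -input /user/alice/input \
    -output /user/alice/output \
    -mapper /bin/cat \
    -reducer "/usr/bin/wc -l"
```

Drop the run wrapper to actually execute the commands. A job written natively against the MapReduce Java API avoids the streaming layer entirely and will perform best.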
=== Cloud/OpenNebula usage ===
* Keeping unused virtual machines running wastes resources and complicates maintenance tasks. Please '''turn off your instances''' once you have finished using them!
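A minimal sketch, assuming the standard OpenNebula 'onevm' command line tool (the VM ID 42 is a placeholder); the commands are printed rather than executed, since they require a live OpenNebula frontend:

```shell
# Illustrative only: requires an OpenNebula frontend to actually run.
show() { echo "$@"; }

# List your virtual machines and their states:
show onevm list

# Shut down an instance you have finished with (42 is a placeholder ID):
show onevm shutdown 42
```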

Latest revision as of 13:13, 19 March 2013
