Monitoring Scientific Computing Platforms


Description

At SCC, we operate a large number of computing resources that we offer to KIT as well as to other research facilities in Europe. One important aspect of operating such platforms is monitoring the resources continuously in order to detect broken services as soon as possible.

Tasks

In this research project, we want to set up a monitoring system for the batch systems (HTCondor [0] and Hadoop [1]). This includes developing plugins that obtain the monitoring data (see the sketch below), storing the data in a database, and finally visualizing them in a dashboard.
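
To give an idea of the data such a plugin needs to gather, the following is a minimal sketch that counts HTCondor jobs by state using the htcondor Python bindings. It is an illustration only, not a prescribed solution; it assumes the bindings are installed and a schedd is reachable from the local host (in older binding versions the projection keyword is called attr_list).

    import collections
    import htcondor

    # HTCondor stores the job state as a numeric JobStatus attribute.
    JOB_STATUS = {1: "idle", 2: "running", 3: "removed",
                  4: "completed", 5: "held"}

    def job_counts():
        schedd = htcondor.Schedd()   # connects to the local schedd by default
        counts = collections.Counter()
        for ad in schedd.query(projection=["JobStatus"]):
            state = JOB_STATUS.get(ad.get("JobStatus"), "other")
            counts[state] += 1
        return counts

    if __name__ == "__main__":
        print(dict(job_counts()))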

We use collectd [2] to collect the data on the monitored systems, send them to Logstash for pre-processing, store them in Elasticsearch [3], and visualize them in Grafana [4]. You need to develop a collectd plugin that collects the data from the batch systems (see the plugin sketch below), set up the Logstash and Elasticsearch stack, and create a Grafana dashboard to show your results.
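
Such a collector can be hooked into collectd as a Python read plugin. The following is a minimal sketch, assuming collectd is built with its python plugin and loads the file through a <Plugin python> block in collectd.conf; the plugin and type_instance names, and the htcondor_metrics helper module (the sketch above saved to a file), are illustrative assumptions, not part of the project specification.

    import collectd

    # Hypothetical helper module: the HTCondor sketch above, saved as
    # htcondor_metrics.py on collectd's python module path.
    from htcondor_metrics import job_counts

    def read(data=None):
        # Report one gauge per job state, e.g. htcondor/gauge-idle.
        for state, count in job_counts().items():
            vals = collectd.Values(type="gauge", plugin="htcondor")
            vals.type_instance = state
            vals.values = [count]
            vals.dispatch()             # hand the sample to the write plugins

    collectd.register_read(read, 60)    # poll every 60 seconds

On the transport side, collectd's network or write_http output plugins can, for example, deliver the samples to Logstash, which provides a collectd codec for its udp input; from there the events are indexed into Elasticsearch and exposed to Grafana as a data source.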

After implementing your approach, you need to evaluate it and write documentation covering both the theoretical background and your implementation.

After the project has finished, you will also give a presentation about your achievements.


Requirements

  • familiarity with Python and/or C/C++
  • a solid understanding of the Linux operating system

References

[0] http://research.cs.wisc.edu/htcondor/
[1] http://hadoop.apache.org/
[2] http://collectd.org
[3] http://elastic.co
[4] http://grafana.org


Contact

Christoph.Koenig@kit.edu