Monitoring Scientific Computing Platforms


Description

At SCC, we operate a large number of computing resources that we offer to KIT as well as to other research facilities across Europe. One important aspect of operating such platforms is continuous monitoring of the resources, so that broken services are detected as early as possible.

Tasks

In this research project, we want to set up a monitoring system for the batch systems (HTCondor [0] and Hadoop [1]). This includes the development of plugins to obtain the data, storing the data in a database, and finally visualizing them in a dashboard.
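As a starting point, a collectd plugin of this kind can be written in Python. The sketch below is only an illustration, not the required implementation: it assumes metrics are derived from the `Total` summary line of `condor_status -total`, and the metric names (`machines`, `claimed`, ...) are placeholders chosen here for the example.

```python
# Sketch of a collectd Python plugin for HTCondor pool metrics.
# Assumption: `condor_status -total` prints a summary line starting with
# "Total"; the column names used below are illustrative only.
import subprocess

def parse_condor_totals(text):
    """Parse the 'Total' line of `condor_status -total` output into a dict."""
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] == "Total":
            keys = ["machines", "owner", "claimed",
                    "unclaimed", "matched", "preempting"]
            return dict(zip(keys, (int(p) for p in parts[1:])))
    return {}

def read_callback():
    # Called periodically by the collectd daemon; shells out to condor_status.
    out = subprocess.run(["condor_status", "-total"],
                         capture_output=True, text=True).stdout
    for name, value in parse_condor_totals(out).items():
        v = collectd.Values(plugin="htcondor", type="gauge",
                            type_instance=name)
        v.dispatch(values=[value])

try:
    import collectd            # only available inside the collectd daemon
    collectd.register_read(read_callback)
except ImportError:
    pass                       # allows testing the parser standalone
```

Keeping the parsing separate from the collectd callback makes the plugin testable outside the daemon, which simplifies development.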

We make use of collectd [2] to collect the data on the monitored system, send them to Logstash for pre-processing, store them in a suitable database (e.g. Elasticsearch [3], InfluxDB [4], or Graphite [5]), and visualize them with Grafana [6]. You need to develop a collectd plugin that collects the data from the batch system, set up the database, and create a Grafana dashboard to present your results.
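The collectd-to-Logstash hop can use collectd's network plugin together with Logstash's collectd codec. The fragment below is a minimal sketch, assuming collectd sends to UDP port 25826 (collectd's default) and Elasticsearch runs locally; the index name is illustrative.

```
# Logstash pipeline sketch: receive collectd binary-protocol packets
# over UDP and forward the decoded metrics to Elasticsearch.
input {
  udp {
    port  => 25826
    codec => collectd { }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "metrics-%{+YYYY.MM.dd}"
  }
}
```

A filter section could be added between input and output for the pre-processing step, e.g. renaming fields or dropping unwanted metrics.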

After implementing your approach, you need to evaluate it and write documentation covering both the theoretical aspects and your approach.

After the project has finished, you also have to give a presentation about your achievements.


Requirements

  • familiarity with Python and/or C/C++
  • a solid understanding of the Linux operating system

References

[0] http://research.cs.wisc.edu/htcondor
[1] http://hadoop.apache.org
[2] http://collectd.org
[3] http://elastic.co
[4] https://www.influxdata.com
[5] https://graphiteapp.org/
[6] http://grafana.org


Contact

Christoph.Koenig@kit.edu