Monitoring Scientific Computing Platforms

Description

At SCC, we operate a large number of computing resources that we offer not only to KIT but also to other research facilities in Europe. One important aspect of operating such platforms is monitoring the resources permanently, so that broken services are detected as soon as possible.

Tasks

In this research project, we want to set up a monitoring system for the batch systems (HTCondor [0] & Hadoop [1]), including the development of plugins to obtain the data, store it in a database, and finally visualize it in a dashboard.
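
As a starting point, the queue statistics could, for instance, be obtained through HTCondor's Python bindings. The following is only a minimal sketch, not the project's code: the exact query signature varies between binding versions, and the status codes follow HTCondor's ClassAd convention.

    # Count jobs per state on the local HTCondor schedd using the
    # official "htcondor" Python bindings (illustrative sketch).
    import htcondor

    schedd = htcondor.Schedd()                       # local schedd by default
    jobs = schedd.query(projection=["JobStatus"])    # fetch only the status attribute

    # ClassAd JobStatus codes: 1 = Idle, 2 = Running, 5 = Held
    counts = {"idle": 0, "running": 0, "held": 0}
    for ad in jobs:
        status = ad.get("JobStatus")
        if status == 1:
            counts["idle"] += 1
        elif status == 2:
            counts["running"] += 1
        elif status == 5:
            counts["held"] += 1

    print(counts)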

We make use of collectd [2] to collect the data on the monitored system, send it to Logstash for pre-processing, store it in an appropriate database (e.g. Elasticsearch [3], InfluxDB [4], or Graphite [5]), and visualize it in Grafana [6]. You need to develop a collectd plugin to collect the data from the batch system, set up the database, and create a Grafana dashboard to show your results.
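
Wrapped into collectd's python plugin interface, such a query could be dispatched roughly as follows. Again a sketch only: the plugin and metric names ("htcondor", "jobs-idle", ...) are illustrative choices, and the module still has to be loaded via collectd's python plugin in collectd.conf.

    # Sketch of a collectd read plugin that publishes job counts as gauges.
    import collectd

    def read_callback():
        counts = {"idle": 0, "running": 0}   # placeholder; use a real query here
        for state, value in counts.items():
            vl = collectd.Values(type="gauge")   # "gauge" is a built-in collectd type
            vl.plugin = "htcondor"               # illustrative plugin name
            vl.type_instance = "jobs-" + state
            vl.values = [value]
            vl.dispatch()                        # hand the value list to collectd

    collectd.register_read(read_callback)

From there, collectd's network plugin can forward the values to Logstash, which provides a collectd codec for its UDP input; Graphite could alternatively be fed directly through collectd's write_graphite plugin.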

After implementing your approach, you need to evaluate it and write documentation about it, covering both the theoretical aspects and your approach.

Once the project has finished, you will also give a presentation about your achievements.


Requirements

  • familiarity with Python and/or C/C++
  • deeper understanding of the Linux operating system

References

[0] http://research.cs.wisc.edu/htcondor
[1] http://hadoop.apache.org
[2] http://collectd.org
[3] http://elastic.co
[4] https://www.influxdata.com
[5] https://graphiteapp.org
[6] http://grafana.org


Contact

Christoph.Koenig@kit.edu