Elastic Search, Logstash and Kibana


Overview

Authors: Samuel Ambroj Pérez (SCC, KIT), Kajorn Pathomkeerati (IAI, KIT)

Introduction to Elasticsearch and Logstash

File:GKS15-ESandLOGS.pdf

Introduction to Kibana

File:Elk-slides-pdf.pdf

Installation of ELK in one single machine (Debian 8)

Installation of Elasticsearch

Connect to the first machine (on the left side) that has been provided to you:

ssh gks@141.52.X.X

This machine is going to be used for a while, so do not connect yet to the second VM.

The gks user has sudo rights because it is included in the sudoers file, so from your gks user, execute:

sudo -i bash

Update and upgrade all the packages:

# apt-get update
# apt-get upgrade

Change the timezone (optional):

# dpkg-reconfigure tzdata

Install aptitude, curl, OpenJDK 7 (open-source Java), chkconfig and libc6-dev:

# apt install -y aptitude curl openjdk-7-jdk chkconfig libc6-dev

Download and install the Public Signing Key:

# wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -

Save the repository definition:

# echo "deb http://packages.elastic.co/elasticsearch/1.7/debian stable main" | sudo tee -a /etc/apt/sources.list.d/elasticsearch-1.7.list

Update the package lists so the new repository is ready to use, and install Elasticsearch:

# aptitude update && aptitude install elasticsearch

Start Elasticsearch:

# /etc/init.d/elasticsearch start

Check the status (two options):

# systemctl status elasticsearch
# /etc/init.d/elasticsearch status
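
As an additional sanity check (assuming the default HTTP port 9200), you can query the REST API directly; it should answer with a small JSON document containing the cluster name and version information:

# curl localhost:9200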

Enable the service with chkconfig so that ES starts at boot:

# chkconfig elasticsearch on

Elasticsearch is now installed. Congratulations! Let's continue with the installation of Logstash and finally Kibana.

Installation of Logstash

Download the .deb file:

# wget -q https://download.elastic.co/logstash/logstash/packages/debian/logstash_1.5.4-1_all.deb 

Install it:

# dpkg -i logstash_1.5.4-1_all.deb

This would be enough, but while preparing the tutorial we saw the following warning:

WARN -- Concurrent: [DEPRECATED] Java 7 is deprecated, please use Java 8.
Java 7 support is only best effort, it may not work. It will be removed in next release (1.0).

So we install Oracle Java, version 8.

Installation of Oracle Java 8

Download the tarball:

# wget --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u60-b27/jdk-8u60-linux-x64.tar.gz

Create the folder /opt/jdk and move the tarball there:

# mkdir /opt/jdk
# mv jdk-8u60-linux-x64.tar.gz /opt/jdk/

Extract the tarball there:

# cd /opt/jdk
# tar -xzvf jdk-8u60-linux-x64.tar.gz

Update alternatives:

# update-alternatives --install /usr/bin/java java /opt/jdk/jdk1.8.0_60/bin/java 100
# update-alternatives --install /usr/bin/javac javac /opt/jdk/jdk1.8.0_60/bin/javac 100

Display the priorities and the version of Java:

# update-alternatives --display java
# java -version

java version "1.7.0_79" OpenJDK Runtime Environment (IcedTea 2.5.6) (7u79-2.5.6-1~deb8u1) OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)

The java alternative is still not pointing to the Oracle version, so we increase the priority from 100 to 10000:

# update-alternatives --install /usr/bin/java java /opt/jdk/jdk1.8.0_60/bin/java 10000
# update-alternatives --install /usr/bin/javac javac /opt/jdk/jdk1.8.0_60/bin/javac 10000
# java -version

java version "1.8.0_60" Java(TM) SE Runtime Environment (build 1.8.0_60-b27) Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)


Installation of Kibana 4

Download Kibana4:

# wget https://download.elastic.co/kibana/kibana/kibana-4.1.1-linux-x64.tar.gz

Extract the file:

# tar xvf kibana-4.1.1-linux-x64.tar.gz

Move it to /opt and rename it to something shorter:

# mv kibana-4.1.1-linux-x64/ /opt/
# mv /opt/kibana-4.1.1-linux-x64/ /opt/kibana4

Launch Kibana:

# cd /opt/kibana4/
# ./bin/kibana > /dev/null &

Check that Kibana is working in a browser:

http://141.52.X.X:5601
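
If you prefer the command line, a quick check from the VM itself (a sketch; Kibana listens on port 5601 by default) should print an HTTP status code:

# curl -s -o /dev/null -w "%{http_code}\n" http://localhost:5601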

Access is not secured. To make it more secure, we are going to install an nginx reverse proxy in the next section.

Installation and configuration of an nginx reverse proxy (Debian 8)

Elasticsearch offers a commercial plugin called Shield which allows you to easily protect your data with a username and password, while simplifying your architecture. Advanced security features like encryption, role-based access control, IP filtering, and auditing are also available when you need them.

A Gold or Platinum subscription is necessary to use it. We do not have one for the tutorial, so we secure our Kibana server using an nginx reverse proxy instead.

The steps are the following.

Install nginx and apache2-utils (see man aptitude in case you do not know the -y option)

# aptitude install -y nginx apache2-utils

Create the folder where the certificate and private key for your server will be saved:

# mkdir /etc/nginx/ssl

Generate a self-signed certificate and private key:

# openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout /etc/nginx/ssl/nginx.key -out /etc/nginx/ssl/nginx.crt
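
You can inspect the generated self-signed certificate afterwards (optional):

# openssl x509 -in /etc/nginx/ssl/nginx.crt -noout -subject -dates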

Create the user admin (you can choose any other name) which will be allowed to access your Kibana server (choose a password):

# htpasswd -c /etc/nginx/htpasswd.users admin     
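
If you later want to add a second user to the same file, run htpasswd again without -c (the -c flag creates or overwrites the file); the user name here is just an example:

# htpasswd /etc/nginx/htpasswd.users anotheruser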

Change the file /etc/nginx/sites-enabled/default to the following one:

# cat /etc/nginx/sites-enabled/default 
server {
    listen 80;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl;

    server_name <your-server>;

    ssl on;
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # SSLv2/SSLv3 removed because of known security holes!
    ssl_ciphers HIGH:!aNULL:!MD5:!EXPORT:!RC4;
    ssl_prefer_server_ciphers on;

    ssl_certificate /etc/nginx/ssl/nginx.crt;
    ssl_certificate_key /etc/nginx/ssl/nginx.key;

    auth_basic "Restricted Access";
    auth_basic_user_file /etc/nginx/htpasswd.users;

    location / {
        proxy_pass http://localhost:5601;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;
    }
}
  
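
Before restarting, check the configuration syntax:

# nginx -t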

Restart the nginx service:

# /etc/init.d/nginx restart

In a browser:

https://141.52.X.X

Accept the certificate and access using your username and password.

Exercise 1: Daily value of Yahoo stock market (using csv plugin)

Since we cannot use real data containing sensitive information, the first example uses publicly available data from the Yahoo Finance web page. We have selected the daily values of the Yahoo stock (YHOO) from its beginning (April 1st, 1996) to today, but you can choose another symbol.

For the Yahoo index, you can get the data from here:

# wget -O yahoo-stock.csv 'http://ichart.finance.yahoo.com/table.csv?s=YHOO&c=1962'

Take a look at this file:

# vi yahoo-stock.csv
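
Instead of opening it in an editor, you can also just print the first lines; the header row should contain the columns that we will use in the Logstash filter below:

# head -n 5 yahoo-stock.csv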

We would like to use Logstash in order to parse the data and send it to Elasticsearch. There is a filter plugin called csv, so take a look at it:

https://www.elastic.co/guide/en/logstash/current/plugins-filters-csv.html

Please try to come up with a plausible Logstash configuration file yourself before checking the following solution (save the file as /etc/logstash/conf.d/<filename>.conf):

input {  
  file {
    path => "/root/csvkit-tutorial/yahoo-stock.csv"
    sincedb_path => "/dev/null"
    start_position => "beginning"    
    type => "stock-market"
  }
}
filter {  
  csv {
      separator => ","
      columns => ["Date","Open","High","Low","Close","Volume","Adj Close"]
  }
  mutate {convert => ["High", "float"]}
  mutate {convert => ["Open", "float"]}
  mutate {convert => ["Low", "float"]}
  mutate {convert => ["Close", "float"]}
  mutate {convert => ["Volume", "float"]}
  mutate {convert => ["Adj Close", "float"]}
}
output {  
    elasticsearch {
        action => "index"
        host => "localhost"
        index => "stock"
        workers => 1
    }
}
 

Execute logstash:

# /opt/logstash/bin/logstash agent -f /etc/logstash/conf.d/stock-market.conf  > /var/log/logstash/execution-log 2>&1 &

Check if the data has been imported to Elasticsearch:

# curl 'localhost:9200/_cat/indices?v'
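
You can also count the documents in the stock index (the exact number depends on the day you downloaded the CSV file):

# curl 'localhost:9200/stock/_count?pretty'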

You will see a yellow status for the stock index because we are working with a single host and the number of replicas is not zero.

To solve it, set the property index.number_of_replicas equal to zero in the file /etc/elasticsearch/elasticsearch.yml:

index.number_of_replicas: 0
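
As an alternative, the number of replicas of an already existing index can be changed on the fly through the settings API (a sketch; for the exercise we nevertheless follow the delete-and-reimport path described below):

# curl -XPUT 'localhost:9200/stock/_settings' -d '{"index": {"number_of_replicas": 0}}'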

For the new setting to be read, it is necessary to restart Elasticsearch, to delete the index in Elasticsearch and to kill the current logstash process (ps aux | grep logstash).

In order to delete the indices with curl:

# curl -XDELETE 'localhost:9200/stock?pretty'

Run logstash again and check if the data has been imported again.

Now, it is time to play with Kibana (in a browser):

141.52.X.X

We will show you how to do it. When you are ready, call us.

Exercise 2: Daily value of Yahoo stock market (JSON import)

You have previously downloaded the .csv file and imported it into Elasticsearch using the csv filter plugin. It is also possible to ingest JSON data directly into Elasticsearch. That is why we are going to convert the csv file to JSON format and see how the direct JSON import into Elasticsearch works.

csvkit

The web page can be found here:

http://csvkit.readthedocs.org/en/0.9.1/

Install the following packages and csvkit using pip (the PyPA-recommended tool for installing Python packages):

# aptitude install python-dev python-pip python-setuptools build-essential
# pip install csvkit

Convert the CSV file to JSON with the --stream option (one JSON object per line):

# csvjson --stream yahoo-stock.csv > yahoo-stream.json

Delete the stock index:

# curl -XDELETE 'localhost:9200/stock?pretty'

Check that it has been deleted:

# curl 'localhost:9200/_cat/indices?v'

Try to import the obtained JSON file to Elasticsearch with the following command:

# curl -XPOST 'localhost:9200/stock/stock-market/_bulk?pretty' --data-binary @yahoo-stream.json

There will be an error because the Elasticsearch bulk API expects an action/metadata line (here a line like {"index":{"_id":...}}) in front of every JSON document.

We can solve this problem manually with a bash script (you can also try perl, python, awk ...)

# COUNT=1; while read line; do echo {\"index\":{\"_id\":\"$COUNT\"}}; echo $line; COUNT=$(( $COUNT + 1 )); done < yahoo-stream.json > yahoo-stream-ELK.json
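
The same can be done with a one-line awk sketch (any POSIX awk should work; the output is equivalent to the bash version above):

# awk '{printf "{\"index\":{\"_id\":\"%d\"}}\n%s\n", NR, $0}' yahoo-stream.json > yahoo-stream-ELK.json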

Import the resulting bulk JSON file into Elasticsearch:

# curl -XPOST 'localhost:9200/stock/stock-market/_bulk?pretty' --data-binary @yahoo-stream-ELK.json

Navigate to your Kibana server and check the fields for the stock index. You will see that the fields Date, Open, High, Low, Close, Volume and Adj Close are strings.

You can convert a string to a number in a JSON file by simply removing the double quotes around the numeric fields.

Solution: perl command for removing quotes around numbers:

# perl -pe 's{"((\d+)?(?:\.\d+)?)"}{$1}g' yahoo-stream-ELK.json > out2.json

Delete the stock index again, import the new JSON file and check your Kibana server.

Comment: there are also two codec plugins called json and json_lines. Do you think that they could be used?

Exercise 3: Tape log using grok filter

Look for the file /root/tape/lsmess2014-10-22 on your first VM. It contains 9168 log messages from tape cartridges and drives for a single day.

For specific and rare log formats, there are no ready-made filters available to simplify our work. In that case the grok filter, which supports Oniguruma regular expressions, is really useful.

Also take a look at the lsmess-logs file (in the same /root/tape/ folder). It contains regular expressions for many log entries and was tested with the tape logs of a whole year.

With these two aforementioned files, we can also analyse the tape logs of our system.

The logstash configuration file is also present in the very same folder with the name lsmess.conf.

Take a look at the patterns and the configuration file and make them work, as sketched below.
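
A sketch of how to run it, reusing the invocation from Exercise 1 (adjust the paths if your copy of lsmess.conf or the patterns live elsewhere; the log file name is our own choice):

# /opt/logstash/bin/logstash agent -f /root/tape/lsmess.conf > /var/log/logstash/execution-log-tape 2>&1 &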

Sense Chrome plugin (alternative to curl)

Sense is a Google Chrome extension for working with Elasticsearch.

To use this extension, Google Chrome must be installed on your machine.

Exercise 4: ELK working with two different hosts

Now, it is time to use the second machine. Connect to it:

ssh gks@141.52.X.X

Install Elasticsearch and Kibana following the corresponding previous sections (including Oracle Java 8). If you have time, also install the nginx reverse proxy.

In this exercise the first machine is going to run Logstash for the Yahoo daily stock example and send the parsed events to the second machine, where Elasticsearch and Kibana will be running. ES and Kibana will no longer run on our first host.

When you are familiar with the general idea, delete the data stored in ES on the first machine, stop ES there and kill the Kibana process (ps aux | grep kibana), for example as shown below.
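
A possible sequence on the first machine (a sketch; the Kibana PID will differ on your VM):

# curl -XDELETE 'localhost:9200/stock?pretty'
# ps aux | grep kibana
# kill <kibana_pid>
# /etc/init.d/elasticsearch stop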

The following configuration file on your first machine will do the rest, and you will see the parsed data in your second instance:

input {  
  file {
    path => "/root/csvkit-tutorial/yahoo-stock.csv"
    sincedb_path => "/dev/null"
    start_position => "beginning"    
    type => "stock-market"
  }
}
filter {  
  csv {
      separator => ","
      columns => ["Date","Open","High","Low","Close","Volume","Adj Close"]
  }
  mutate {convert => ["High", "float"]}
  mutate {convert => ["Open", "float"]}
  mutate {convert => ["Low", "float"]}
  mutate {convert => ["Close", "float"]}
  mutate {convert => ["Volume", "float"]}
  mutate {convert => ["Adj Close", "float"]}
}
output {  
    elasticsearch {
        action => "index"
        host => "<second_machine_address>"
        index => "stock"
        workers => 1
    }
}
 
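
Run Logstash on the first machine with the same command as in Exercise 1, then verify (for example from the first machine) that the stock index shows up on the second one:

# curl '<second_machine_address>:9200/_cat/indices?v'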

Insert MySQL DB data into ES

In addition to ELK, two components are needed for this scenario:

  • MySQL Server, a DBMS for SQL Databases
  • River, an ES plugin for data import


MySQL Server Installation

Set up components:

# aptitude install mysql-server mysql-client

During the installation you will be asked to set a password for the MySQL root user. For this tutorial we use:

user name : root
password  : root


Create a database:

# mysql -u root -p

> create database db_name;
> exit


Import data from a dump file to the database:

# mysql -u root -p db_name < dump_file.sql


Check if the data have been properly imported:

# mysql -u root -p db_name

> show tables;
> show columns from some_table;


River Installation

Install Plugin:

# cd /usr/share/elasticsearch
# ./bin/plugin --install river-jdbc --url 'http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-river-jdbc/1.5.0.5/elasticsearch-river-jdbc-1.5.0.5-plugin.zip'
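
You can check that the plugin has been registered (the exact output differs between versions):

# ./bin/plugin --list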


Install a JDBC connector for MySQL:

# cd /usr/share/elasticsearch/plugins/
# wget http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.36.tar.gz
# tar -zxvf mysql-connector-java-5.1.36.tar.gz --wildcards '*.jar'
# mv mysql-connector-java-5.1.36/mysql-connector-java-5.1.36-bin.jar  ./river-jdbc/
# rm -rf mysql-*
 

Restart Elasticsearch Service:

# /etc/init.d/elasticsearch stop
# /etc/init.d/elasticsearch start

Using River:

A river _meta document must be defined correctly. From this document, River learns the data source and all necessary parameters.

For example, a _meta document can be defined like this:

# curl -XPUT 'localhost:9200/_river/type_name/_meta' -d '
{
  "type" : "jdbc",
  "jdbc" : {
    "url" : "jdbc:mysql://localhost",
    "user" : "db_user",
    "password" : "db_user_password",
    "sql" : "SELECT * FROM table_name",
    "index": "es_index",
    "type" : "es_type",
    "type_mapping": { … }
  }
}'
 
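
You can read the _meta document back to confirm that it has been stored (names as in the example above):

# curl -XGET 'localhost:9200/_river/type_name/_meta?pretty'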


Check if the data are imported:

# curl -XPOST 'localhost:9200/es_index/es_type/_search' -d '
{
    "size": 20
}'
 

Exercise 5: Dumping a twitter mysql DB into ES

Given: a MySQL dump file with Twitter data. Download it with wget:

# wget https://wiki.scc.kit.edu/gridkaschool/upload/e/e0/Tweet_sql_dump.sql


We create a new database for the dump file and import the data from it:

# mysql -u root -p

> create database db_name;
> exit
# mysql -u root -p db_name < dump_file.sql


In the next step, we flatten all tweet data to facilitate visualization. This can be seen as combining the tables with JOINs:

SELECT tid as _id, tweet, hashtag.hashtag, lang, created_at 
FROM tweet 
LEFT JOIN hashtag_tweet ON tweet.id = hashtag_tweet.tweet_id 
LEFT JOIN hashtag ON hashtag.id = hashtag_tweet.hashtag_id
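
You can try the query directly in the MySQL client first (a sketch; db_name is the database you created above and LIMIT 5 is only added to keep the output short):

# mysql -u root -p db_name -e "SELECT tid AS _id, tweet, hashtag.hashtag, lang, created_at FROM tweet LEFT JOIN hashtag_tweet ON tweet.id = hashtag_tweet.tweet_id LEFT JOIN hashtag ON hashtag.id = hashtag_tweet.hashtag_id LIMIT 5"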


Because Elasticsearch cannot automatically detect the date/time format used in tweets, we have to define the mapping manually:

 "tweet": {
  "dynamic": true,
  "properties": {
      "created_at":{
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z yyyy"
      }
  }
 }
 

Finally, the complete _meta document used by the river looks like this:

# curl -XPUT 'localhost:9200/_river/type_name/_meta' -d '
{
  "type" : "jdbc",
  "jdbc" : {
    "url" : "jdbc:mysql://localhost:3306/db_name",
    "user" : "db_user",
    "password" : "db_user_passwd",
    "sql" : "select tid as _id, tweet, hashtag.hashtag, lang, created_at from tweet left join hashtag_tweet on tweet.id = hashtag_tweet.tweet_id left join hashtag on hashtag.id = hashtag_tweet.hashtag_id",
    "index": "tweet_stream",
    "type" : "tweet",
    "type_mapping": {
      "tweet": {
        "dynamic": true,
        "properties": {
          "created_at": {
            "type": "date",
            "format": "EEE MMM dd HH:mm:ss Z yyyy"
          }
        }
      }
    }
  }
}'
 
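
Once the river has run, a quick count shows how many tweets arrived (the number depends on the dump file):

# curl 'localhost:9200/tweet_stream/_count?pretty'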

After the previous step, the data should be available in Elasticsearch and ready to be used by Kibana. Now you can open Kibana and try to visualize something meaningful.

For example:

  • Number of tweets in total
  • Number of tweets by language
  • Top hashtags
  • Top tweet languages
  • etc.

After the visualizations are saved, you can combine them into a dashboard.

Extra Exercise: Energy Consumption Data by EIA

If you are interested in the energy domain, this dump file contains annual consumption data (from 2000 to 2015) by usage sector and energy resource. The data are provided by the Energy Information Administration (EIA).

For manual mapping, please use:

"type_name": {
  "dynamic": true,
  "properties": {
      "Created": {
             "type": "date",
             "format": "YYYYMMDD"
       },
       "Description": {
             "type": "string", 
             "index": "not_analyzed"
       }
  }
}
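
A sketch of the corresponding _meta document, following the same pattern as in Exercise 5; the river name eia_river, the database name eia_db, the index name energy and the SQL statement are placeholders that you have to adapt to the tables in the actual dump:

# curl -XPUT 'localhost:9200/_river/eia_river/_meta' -d '
{
  "type" : "jdbc",
  "jdbc" : {
    "url" : "jdbc:mysql://localhost:3306/eia_db",
    "user" : "db_user",
    "password" : "db_user_passwd",
    "sql" : "SELECT * FROM consumption",
    "index": "energy",
    "type" : "type_name",
    "type_mapping": {
      "type_name": {
        "dynamic": true,
        "properties": {
          "Created": {
            "type": "date",
            "format": "yyyyMMdd"
          },
          "Description": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}'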