Difference between revisions of "Hadoop Workshop"

Revision as of 18:56, 18 August 2014

Content

Session 1 - Setup
- The Hadoop Ecosystem
  - Prepare the DEMO VM (we use a prepared virtual cluster in pseudo-distributed mode)
  - Working with HUE, the Hadoop Web-UI

Session 2 - The Kite-SDK, a Convenient Framework
- The KITE-SDK
  - Accessing datasets using the KITE-API
  - Metadata management
  - Kite-Modules

Session 3 - Real Time Indexing
- How to index large datasets with Cloudera Search?
  - Importing data with Flume
  - Real Time Indexing with Morphlines

Session 4 - Introduction to Apache Crunch
- Data processing pipelines for Spark, MapReduce, or simply a Workstation
  - The Crunch data model
  - Crunch data pipelines

Material

Slides:

 will be available after the workshop

Hand-Out:

 will be provided as a hardcopy and as PDF after the workshop

Important Information

You need to bring your own notebook (laptop) to this workshop.
You will use VirtualBox to run the Workshop-VM.
Please download the VM here ... (TODO: ask Pavel for the download location)
The VM uses 3 GB RAM and 1 CPU. A better setup would be 4 GB RAM and 2 CPUs.
The VM can grow to 70 GB of space on your hard disk, but its initial size is around 4 GB.

Workshop Exercise

Efficient Data Management with Apache Hadoop
We will walk through the dataset life cycle. Starting with data ingestion and real-time indexing, we use several tools to conserve important datasets and to extract information using high-level processing and query frameworks.

Tools

HUE
Apache Flume
Apache Solr
Apache Hive & Cloudera Impala
Apache Crunch

@@ Line 36: / Line 36: @@
 * You will use [https://www.virtualbox.org/wiki/Downloads VirtualBox] to run the Workshop-VM.
 * Please download the VM here ... ('''TODO''': ''ask Pavel for the download location'')
-* The VM uses 3 GB RAM and 1 CPU. A better setup would be: 4 GB and 2 CPU.
+* The VM uses '''3 GB RAM''' and '''1 CPU'''. ''A better setup would be: 4 GB and 2 CPU''.
-* The VM uses up to 70 GB space on your HDD but the initial size is around 4 GB.
+* The VM uses up to 70 GB space on your HDD but the initial size is around '''4 GB'''.
-== Abstract ==
-In the last couple of years cloud  computing has achieved an important status in the IT scene.<br />
-The renting of computing power, storage and applications according to requirements  is regarded as future business.<br />
-This tutorial  course gives an introduction of the basic concepts of the Infrastructure-as-a-Service (IaaS) model<br />
-based on the cloud offerings provided by Amazon, one of the present leading commercial cloud  computing providers.
 == Workshop Exercise ==
-''Efficient Data Management with Apache Hadoop''
+''Efficient Data Management with Apache Hadoop''<br />
 We will walk through the dataset life cycle. Starting with data ingestion and real time indexing we use several tools to conserve
 important datasets and to extract information using high level processing and query frameworks.
