Hadoop Workshop
From Gridkaschool
Latest revision as of 14:17, 22 August 2014
Content
- Session 1 - Setup
  - The Hadoop Ecosystem
    - Prepare the DEMO VM (we use a prepared virtual cluster in pseudo-distributed mode)
    - Working with HUE, the Hadoop Web-UI
- Session 2 - The Kite SDK, a Convenient Framework for Hadoop Developers
  - The Kite SDK
    - Accessing datasets using the Kite API
    - Metadata management
    - Kite modules
- Session 3 - Real-Time Indexing
  - How to index large datasets with Cloudera Search
    - Importing data with Flume
    - Real-time indexing with Morphlines
- Session 4 - Introduction to Apache Crunch
  - Data processing pipelines for Spark, MapReduce, or simply a workstation
    - The Crunch data model
    - Crunch data pipelines
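The "Importing data with Flume" step in Session 3 boils down to an agent configuration. A minimal sketch, assuming an agent named `agent1`, a local spool directory, and an HDFS target path (all illustrative, not the workshop's actual setup):

```properties
# Hypothetical Flume agent: watch a spool directory, buffer in memory, write to HDFS.
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: pick up files dropped into a local directory (path is an assumption)
agent1.sources.src1.type     = spooldir
agent1.sources.src1.spoolDir = /tmp/incoming
agent1.sources.src1.channels = ch1

# Channel: buffer events in memory
agent1.channels.ch1.type     = memory
agent1.channels.ch1.capacity = 10000

# Sink: write events to HDFS (target path is an assumption)
agent1.sinks.sink1.type          = hdfs
agent1.sinks.sink1.hdfs.path     = hdfs://localhost:8020/user/demo/events
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel       = ch1
```

Such an agent would be started with `flume-ng agent --conf-file agent1.conf --name agent1`; a Morphlines sink could replace the HDFS sink to feed Cloudera Search instead.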
Material
- Slides: will be available after the workshop
- Hand-out: will be provided as a hardcopy and as a PDF after the workshop
Workshop Exercise
Efficient Data Management with Apache Hadoop
We will walk through the dataset life cycle: starting with data ingestion and real-time indexing, we use several tools to preserve important datasets and to extract information using high-level processing and query frameworks.
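The high-level processing step of this life cycle can be sketched with Apache Crunch, the topic of Session 4. The following word-count pipeline is a minimal sketch assuming the Crunch Java API; the input and output paths are illustrative:

```java
// Minimal Apache Crunch word-count sketch (paths are assumptions).
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    // MRPipeline plans and runs the job on MapReduce; MemPipeline would run it locally.
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile("/user/demo/input");

    // Split each line into words, then count occurrences per word.
    PTable<String, Long> counts = lines
        .parallelDo(new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            for (String word : line.split("\\s+")) {
              emitter.emit(word);
            }
          }
        }, Writables.strings())
        .count();

    pipeline.writeTextFile(counts, "/user/demo/output");
    pipeline.done();
  }
}
```

The same pipeline code can be planned for MapReduce, Spark, or an in-memory run on a workstation by swapping the `Pipeline` implementation.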
Tools
- HUE
- Apache Flume
- Apache Solr
- Apache Hive & Cloudera Impala
- Apache Crunch
Everything you need is prepared in our Workshop-VM.
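For Session 2, the Kite SDK also ships a command-line tool covering the dataset life cycle from the shell. A sketch, assuming the `kite-dataset` binary is on the path and an illustrative `ratings.csv` file:

```shell
# Hypothetical input file: infer an Avro schema from a CSV sample
kite-dataset csv-schema ratings.csv --class Rating -o rating.avsc

# Create a Hive-backed dataset from that schema
kite-dataset create dataset:hive:ratings --schema rating.avsc

# Load the CSV into the dataset
kite-dataset csv-import ratings.csv dataset:hive:ratings
```

Once created this way, the dataset and its metadata are visible to Hive and Impala without further wiring.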
Important Information
- A personal notebook (laptop) is required for this workshop.
- You will use VirtualBox to run the Workshop-VM.
- Please download the VM here: https://drive.google.com/folderview?id=0B2I4C6eKUshDd3BCRGhpbUVzMWs&usp=sharing
- The VM uses 3 GB RAM and 1 CPU; a better setup is 4 GB and 2 CPUs.
- The VM can use up to 70 GB of disk space on your HDD, but its initial size is around 4 GB.
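If your notebook has the resources, the recommended 4 GB / 2 CPU setup can be applied before the first start. A sketch; the VM name "Workshop-VM" is an assumption, so check the actual name first:

```shell
# List registered VMs to find the exact name
VBoxManage list vms

# Hypothetical VM name; raise memory to 4 GB and assign 2 CPUs
VBoxManage modifyvm "Workshop-VM" --memory 4096 --cpus 2
```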