Hadoop Workshop

From Gridkaschool

Content

  • Session 1 - Setup
    • The Hadoop Ecosystem
      • Prepare the DEMO VM (we use a prepared virtual cluster in pseudo-distributed mode)
      • Working with HUE, the Hadoop Web-UI
  • Session 2 - The Kite-SDK, a Convenient Framework for Hadoop Developers
    • The KITE-SDK
      • Accessing datasets using the KITE-API
      • Metadata management
      • Kite-Modules
  • Session 3 - Real-Time Indexing
    • How to index large datasets with Cloudera Search?
      • Importing data with Flume
      • Real-Time Indexing with Morphlines
  • Session 4 - Introduction to Apache Crunch
    • Data processing pipelines for Spark, MapReduce, or simply a Workstation
      • The Crunch data model
      • Crunch data pipelines

Material

Slides:

 will be available after the workshop

Hand-Out:

 will be provided as a hardcopy and as PDF after the workshop


Workshop Exercise

Efficient Data Management with Apache Hadoop
We will walk through the dataset life cycle. Starting with data ingestion and real-time indexing, we use several tools to preserve important datasets and to extract information from them with high-level processing and query frameworks.

Tools

  • HUE
  • Apache Flume
  • Apache Solr
  • Apache Hive & Cloudera Impala
  • Apache Crunch
 Everything you need is prepared in our Workshop-VM.

Important Information
