Hadoop Hands-on
From Gridkaschool
Latest revision as of 21:05, 26 August 2012
Tuesday, 28.8.2012, 13:00 - 18:30
Objectives
This session focuses on the Hadoop ecosystem and the interplay of its many specialized tools for data analysis.
We also look into the Java API, but in less detail than a pure developer class would. We will try to give a big-picture view
of Hadoop in the context of scientific computing. You will learn what Hadoop can be used for and what it is not intended
for. To that end, we will discuss the underlying principles as well as the programming model and the installation and configuration
procedures. You will test some of the commands on a real cluster, and live demos will give you an idea of the many features provided
by the web-based user interface.
Prerequisites
- Basic understanding of Unix/Linux OS management is needed to do the exercises.
- No prior knowledge of Hadoop is required, as we go through the basic concepts.
- A personal notebook is recommended for this workshop.
- If you use Windows, please install "PuTTY" and the VMware Player.
Recommended Material
Books
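- Hadoop: The Definitive Guide (http://www.amazon.de/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/ref=sr_1_fkmr1_1?ie=UTF8&qid=1345918087&sr=8-1-fkmr1)
- Hadoop in Action (http://www.amazon.de/Hadoop-Action-Chuck-Lam/dp/1935182196/ref=sr_1_1?s=books-intl-de&ie=UTF8&qid=1345918219&sr=1-1)
- Data-Intensive Text Processing with MapReduce (http://www.amazon.de/Data-Intensive-Processing-Mapreduce-Author-Paperback/dp/B006V38ZCK/ref=sr_1_2?ie=UTF8&qid=1345918261&sr=8-2)
Scripts from last year
- Introduction (http://gridka-school.scc.kit.edu/2011/downloads/Hadoop_tutorial-1-Introduction.pdf)
- MapReduce (http://gridka-school.scc.kit.edu/2011/downloads/Hadoop_tutorial-2_4-MapReduce.pdf)
- Pig (http://gridka-school.scc.kit.edu/2011/downloads/Hadoop_tutorial-5-Pig.pdf)
- Hand-out (http://gridka-school.scc.kit.edu/2011/downloads/Hadoop_tutorial-Hand_outs.pdf)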
Content
Session A
- The Hadoop ecosystem: HDFS, MapReduce (MR), HUE, Sqoop, Hive, Pig, HBase, Flume, Oozie
- What are CDH and the Cloudera Manager?
- Installing, starting, and configuring a small cluster
Session B
- HDFS intro (NameNode, DataNode, Secondary NameNode)
- How is data stored in HDFS?
- Properties and configuration settings relevant for working efficiently with HDFS
- HDFS commands
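As a rough illustration of the "How is data stored in HDFS?" topic, the following sketch computes how a single file would be split into blocks and how much raw disk space its replicas occupy. The block size of 64 MB and the replication factor of 3 are assumptions for this example only; both are cluster settings and may differ on the workshop cluster. The function name `hdfs_block_layout` is hypothetical and not part of any Hadoop API.

```python
def hdfs_block_layout(file_size_bytes, block_size_bytes=64 * 1024 * 1024, replication=3):
    """Return (number of blocks, size of the last block, raw bytes stored).

    Assumed defaults: 64 MB blocks, replication factor 3.
    The last block is not padded: it only holds the remaining bytes.
    """
    if file_size_bytes <= 0:
        return 0, 0, 0
    # Ceiling division without floating point
    n_blocks = -(-file_size_bytes // block_size_bytes)
    last_block = file_size_bytes - (n_blocks - 1) * block_size_bytes
    # Every block is stored `replication` times across the DataNodes
    raw_bytes = file_size_bytes * replication
    return n_blocks, last_block, raw_bytes


if __name__ == "__main__":
    mb = 1024 * 1024
    blocks, last, raw = hdfs_block_layout(200 * mb)
    print(f"blocks={blocks}, last block={last // mb} MB, raw storage={raw // mb} MB")
```

Under these assumptions, many small files produce many small blocks, each of which the NameNode has to track, which is one reason block size matters for working efficiently with HDFS.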
Session C
- Working with the web-based GUI
- Running and tracking jobs
- Java-API and samples
- Streaming API sample
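To give an idea of what a Streaming API sample can look like, here is a minimal word-count sketch in the streaming style: the mapper emits tab-separated key/value lines, the framework sorts them by key, and the reducer sums the counts per key. This is not the sample used in the session; the function names `mapper`, `reducer`, and `run_local` are hypothetical, and `run_local` only imitates the sort step locally instead of running on a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Imitate map -> sort -> reduce in a single process (no Hadoop involved)."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for result in run_local(["Hadoop stores data", "hadoop processes data"]):
        print(result)
```

In a real streaming job, the mapper and the reducer would be two separate scripts that read from standard input and write to standard output; the sort between them is done by the framework.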
Session D
- MapReduce details, Java API and Streaming
- HDFS details, using the web-based GUI for deeper insights
- Breaking a cluster and healing it
Session E
- Intro to Hive and Sqoop
- Data import via Sqoop
- Hive scripts
Session F (optional)
- Serialisation and deserialisation (SerDe) and user-defined functions (UDF) with Hive
- Workflows with Oozie