A Crash Course in Apache Hadoop - Blancoblanco.io/assets/hadoop-workshop/crash-course-1.pdf · A...

Page 1

A Crash Course in Apache Hadoop

Page 2

Event Outline

1. What is Hadoop

2. Current data challenges

3. Hadoop Solutions

4. Architecture

5. Workshop

Page 3

Who & When
● Originated from Google's papers on the Google File System and MapReduce
● Originally developed at Yahoo!
  ○ Doug Cutting, Michael Cafarella
● Project officially began around 2005.
● Named after a toy elephant

Page 4

Why

● More ways to collect data
● Too much data
  ○ CERN Laboratory
  ○ Google/Yahoo/Facebook

● Forget about processing when you can’t even store it

Page 5

Analogy
● Imagine you needed to transport 2,000,000 kg of raw material
● How would you do it? (Let’s assume that horsepower is proportional to the mass that each vehicle can carry)
  ○ Ferrari 458 - $243,000, ~560 horsepower
  ○ Bugatti Veyron - $2,310,688, ~1,200 horsepower
  ○ Brand new Ford F-150 - $30,000, ~325 horsepower
  ○ Dodge Caravan - $20,000, ~280 horsepower
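The analogy is easier to judge with the arithmetic written out. The sketch below is not part of the original slides. It uses the prices and horsepower figures listed above and the slide's own assumption that carrying capacity is proportional to horsepower. The function and variable names are illustrative.

```python
# Prices (USD) and approximate horsepower, as listed on the slide.
vehicles = {
    "Ferrari 458": (243_000, 560),
    "Bugatti Veyron": (2_310_688, 1_200),
    "Ford F-150": (30_000, 325),
    "Dodge Caravan": (20_000, 280),
}

def cost_per_hp(price, horsepower):
    """Dollars paid per unit of horsepower (a stand-in for carrying capacity)."""
    return price / horsepower

def fleet_for_budget(budget, price, horsepower):
    """How many vehicles a budget buys, and their combined horsepower."""
    count = budget // price
    return count, count * horsepower

# Rank vehicles from cheapest to most expensive per horsepower.
ranked = sorted(vehicles, key=lambda name: cost_per_hp(*vehicles[name]))
for name in ranked:
    price, hp = vehicles[name]
    print(f"{name}: ${cost_per_hp(price, hp):,.2f} per horsepower")

# Spend the price of one Bugatti on Dodge Caravans instead.
bugatti_price, bugatti_hp = vehicles["Bugatti Veyron"]
count, total_hp = fleet_for_budget(bugatti_price, *vehicles["Dodge Caravan"])
print(f"{count} Caravans: {total_hp:,} hp vs. {bugatti_hp:,} hp for one Bugatti")
```

Under the stated assumption, a fleet of inexpensive vehicles moves far more material per dollar than a single high-end one, which appears to be the point the analogy makes about scaling out across many ordinary machines.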

Page 6

What

● Provide a way to reliably access and process large volumes of data

● Designed to scale across many, many machines.

Page 7

The ASF
● Apache Open Source
  ○ OpenOffice
  ○ HTTP Server
  ○ Subversion
  ○ Tomcat Webserver
  ○ Commons
  ○ Maven
  ○ Hadoop
● Anyone can view the source code!
  ○ Build/edit/modify on your own machine

Page 8

Why data?

Opportunities and analytic insights for businesses

Page 9

Why Hadoop?
● Benefits of the Hadoop Architecture
  ○ Consolidates Data
  ○ Integrates with many existing platforms
  ○ Scalable and Affordable
  ○ Real-Time Insights

Page 10

The Hadoop Ecosystem

Page 11

Hadoop Architecture

Page 12

Notable Hadoop Projects
● Apache Kafka → Data Streaming
● Apache HBase → Big Data Management
● Apache Hive → Read and Query from HDFS
● Apache ZooKeeper → HA management
● Apache Spark → Processing Engine
● Apache Ambari → Cluster management

Page 13

At the Core of Hadoop

● Hadoop Distributed File System (HDFS)

● Hadoop MapReduce (Processing Engine)

● Hadoop Common (Core Hadoop Libraries)

● Hadoop YARN (Yet Another Resource Negotiator)

○ CPU/Storage/Memory management (parallel jobs)

Page 14

Cluster Architecture

[Diagram labels: NameNode (HDFS); ResourceManager (YARN); Worker Node { NodeManager, DataNode }; Ambari, Hive, Zeppelin, Knox, etc.]

Page 15

HDFS Architecture
● Fault-tolerant distributed storage
  ○ Split file into logical blocks
  ○ Store multiple copies of each block

[Figure: a File is split into File Blocks 1 to 4; the Cluster stores three copies of each block.]
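The two steps on this slide, splitting a file into blocks and storing several copies of each, can be sketched in a few lines. This is a toy model in plain Python, not HDFS code: the function names are made up for illustration, and the round-robin placement is an assumed simplification (real HDFS placement is rack-aware). The replication factor of 3 matches both the figure and the HDFS default. The 25-byte block size is chosen only to keep the example small.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: round-robin stands in for HDFS's real, rack-aware policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(1, num_blocks + 1):
        start = block_id - 1  # shift the starting node for every block
        placement[block_id] = [
            nodes[(start + k) % len(nodes)] for k in range(replication)
        ]
    return placement

file_bytes = bytes(range(100))                 # a 100-byte stand-in for a file
blocks = split_into_blocks(file_bytes, 25)     # 4 blocks of 25 bytes
nodes = ["DataNode 1", "DataNode 2", "DataNode 3", "DataNode 4"]
placement = place_replicas(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(f"block {block_id}: {', '.join(holders)}")
```

Because every block lives on three different nodes, losing any single node still leaves at least two copies of each block available.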

Page 16

HDFS - NameNode and Heartbeats
● NameNode communicates with DataNodes through heartbeats
  ○ Keeps track of all the DataNodes
  ○ Knows which data is stored and where

[Figure: DataNode 1 to DataNode 4 report to the NameNode ("Hey NameNode, I’m Here!"). Two copies of block 123 are shown, and the NameNode asks: "DataNode 2, can you replicate that 123 block to DataNode 3?"]
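The exchange in the figure can be modelled as a small bookkeeping class. The sketch below is a toy in plain Python and does not use any Hadoop API. The class name, the 30-second timeout, the target of two replicas, and the scenario in which DataNode 1 goes silent are all assumptions made for illustration. It also folds the block list into the heartbeat call, whereas real HDFS sends block reports as separate messages.

```python
class NameNodeSketch:
    """Toy model: tracks DataNode liveness and block locations from heartbeats."""

    def __init__(self, timeout_seconds: float, replication: int):
        self.timeout = timeout_seconds
        self.replication = replication
        self.last_seen: dict[str, float] = {}   # node -> time of last heartbeat
        self.blocks: dict[str, set[str]] = {}   # node -> block ids it reported

    def heartbeat(self, node: str, block_ids: set[str], now: float) -> None:
        """Record "I'm here" from a DataNode, plus the blocks it holds."""
        self.last_seen[node] = now
        self.blocks[node] = set(block_ids)

    def live_nodes(self, now: float) -> list[str]:
        """Nodes whose last heartbeat is recent enough."""
        return sorted(n for n, t in self.last_seen.items() if now - t <= self.timeout)

    def replication_orders(self, now: float) -> list[tuple[str, str, str]]:
        """Return (source, block, target) orders for under-replicated blocks."""
        live = self.live_nodes(now)
        holders: dict[str, list[str]] = {}
        for node in live:
            for block in self.blocks[node]:
                holders.setdefault(block, []).append(node)
        orders = []
        for block, nodes in sorted(holders.items()):
            missing = max(self.replication - len(nodes), 0)
            candidates = [n for n in live if n not in nodes]
            for target in candidates[:missing]:
                orders.append((nodes[0], block, target))
        return orders

nn = NameNodeSketch(timeout_seconds=30, replication=2)
nn.heartbeat("DataNode 1", {"123"}, now=0)
nn.heartbeat("DataNode 2", {"123"}, now=0)
nn.heartbeat("DataNode 3", set(), now=0)
nn.heartbeat("DataNode 4", set(), now=0)

# Assumed scenario: at t=60 every node except DataNode 1 checks in again.
for node in ("DataNode 2", "DataNode 3", "DataNode 4"):
    nn.heartbeat(node, nn.blocks[node], now=60)

print(nn.replication_orders(now=60))
# → [('DataNode 2', '123', 'DataNode 3')]
```

The single order produced corresponds to the request in the figure: DataNode 2 still holds block 123, so it is asked to copy the block to DataNode 3.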

Page 17

MapReduce in Hadoop

● Shuffle and Sort
  ○ Break a problem into sub-problems
● Batch Processing
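A word count is the conventional way to show how a problem breaks into map, shuffle-and-sort, and reduce steps. The example below is not taken from the slides and does not use Hadoop's MapReduce API. It is a single-process sketch in plain Python with made-up function names, meant only to show the shape of the data at each stage.

```python
from collections import defaultdict

def map_phase(line: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key, then order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence over a batch of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]

lines = ["big data big clusters", "data nodes store data"]
print(word_count(lines))
# → [('big', 2), ('clusters', 1), ('data', 3), ('nodes', 1), ('store', 1)]
```

Each call to `map_phase` depends on one line only, and each call to `reduce_phase` depends on one key only. That independence is what lets a cluster run many of these sub-problems in parallel as a batch job.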

Page 18

What do iOS 4 and Windows 3.1 have in common?

Page 19

Multi-Use vs. Batch

Page 20

Workshop