A Crash Course in Apache Hadoop - Blancoblanco.io/assets/hadoop-workshop/crash-course-1.pdf · A...

A Crash Course in Apache Hadoop

Event Outline

1. What is Hadoop

2. Current data challenges

3. Hadoop Solutions

4. Architecture

5. Workshop

Who & When

● Originated from Google papers
● Originally developed at Yahoo!
  ○ Doug Cutting, Michael Cafarella
● Project officially began around 2005
● Named after a toy elephant

● More ways to collect data
● Too much data
  ○ CERN Laboratory
  ○ Google/Yahoo/Facebook

● Forget about processing when you can’t even store it

Analogy

● Imagine you needed to transport 2,000,000 kg of raw material
● How would you do it? (Let’s assume that horsepower is proportional to the mass that each vehicle can carry)
  ○ Ferrari 458: $243,000, ~560 horsepower
  ○ Bugatti Veyron: $2,310,688, ~1200 horsepower
  ○ Brand new Ford F-150: $30,000, ~325 horsepower
  ○ Dodge Caravan: $20,000, ~280 horsepower
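The arithmetic behind the analogy can be sketched in a few lines of Python. The prices and horsepower figures are the ones listed above; the function names and the idea of comparing "dollars per horsepower" are illustrative assumptions, not something the slides spell out.

```python
# Figures from the slide: name -> (price in USD, horsepower).
VEHICLES = {
    "Ferrari 458": (243_000, 560),
    "Bugatti Veyron": (2_310_688, 1200),
    "Ford F-150": (30_000, 325),
    "Dodge Caravan": (20_000, 280),
}


def cost_per_horsepower(price, horsepower):
    """Dollars paid for each unit of horsepower (our stand-in for capacity)."""
    return price / horsepower


def ranked_by_value(vehicles):
    """Vehicle names ordered from cheapest to most expensive per horsepower."""
    return sorted(vehicles, key=lambda name: cost_per_horsepower(*vehicles[name]))


def fleet_horsepower(budget, price, horsepower):
    """Total horsepower of as many identical vehicles as the budget can buy."""
    return (budget // price) * horsepower


for name in ranked_by_value(VEHICLES):
    price, hp = VEHICLES[name]
    print(f"{name}: ${cost_per_horsepower(price, hp):,.2f} per horsepower")

# What the price of one Bugatti buys when spent on Dodge Caravans instead.
bugatti_price = VEHICLES["Bugatti Veyron"][0]
caravan_price, caravan_hp = VEHICLES["Dodge Caravan"]
print(fleet_horsepower(bugatti_price, caravan_price, caravan_hp), "horsepower")
```

Under the slide's assumption that horsepower tracks carrying capacity, many cheap vehicles move far more material per dollar than one expensive one, which is the same reasoning behind scaling out across many ordinary machines.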

● Provide a way to reliably access and process large volumes of data

● Designed to scale across many, many machines.

The ASF

● Apache Open Source
  ○ OpenOffice
  ○ HTTP Server
  ○ Subversion
  ○ Tomcat Webserver
  ○ Commons
  ○ Maven
  ○ Hadoop
● Anyone can view the source code!
  ○ Build/edit/modify on your own machine

Why data?

Opportunities and analytic insights for businesses

Why Hadoop?

● Benefits of the Hadoop Architecture
  ○ Consolidates Data
  ○ Integrates with many existing platforms
  ○ Scalable and Affordable
  ○ Real-Time Insights

The Hadoop Ecosystem

Hadoop Architecture

Notable Hadoop Projects

● Apache Kafka → Data Streaming
● Apache HBase → Big Data Management
● Apache Hive → Read and Query from HDFS
● Apache ZooKeeper → HA management
● Apache Spark → Processing Engine
● Apache Ambari → Cluster management

At the Core of Hadoop

● Hadoop Distributed File System (HDFS)

● Hadoop MapReduce (Processing Engine)

● Hadoop Common (Core Hadoop Libraries)

● Hadoop YARN (Yet Another Resource Negotiator)
  ○ CPU/Storage/Memory management (parallel jobs)

Cluster Architecture

[Diagram: master services NameNode (HDFS) and ResourceManager (YARN); each Worker Node runs a NodeManager and a DataNode; additional services include Ambari, Hive, Zeppelin, Knox, etc.]

HDFS Architecture

● Fault-tolerant distributed storage
  ○ Split file into logical blocks
  ○ Store multiple copies of each block

[Figure: a file's bit stream is split into File Blocks, which are spread across the Cluster]
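The two ideas on this slide, splitting a file into blocks and storing several copies of each block, can be sketched as follows. This is a toy model only: the function names, the tiny block size, and the round-robin placement are assumptions for illustration. Real HDFS uses much larger blocks and a rack-aware placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is a simplification chosen for this sketch;
    it is not the policy HDFS actually uses.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


blocks = split_into_blocks(b"0123456789", block_size=4)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=2)
for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

Because every block lives on more than one node, losing a single machine does not lose any part of the file.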

HDFS - NameNode and Heartbeats

● NameNode communicates through Heartbeats
  ○ Keeps track of all the data nodes
  ○ Which data is stored and where

[Diagram: a NameNode and DataNodes 1 to 4. A DataNode reports "Hey NameNode, I’m Here!"; block 123 is held by two DataNodes; the NameNode asks "DataNode 2, can you replicate that 123 block to DataNode 3?"]
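The heartbeat bookkeeping described above can be sketched as a small simulation. Everything here is hypothetical: the `NameNode` class, its method names, and the timeout value are invented for illustration and are not Hadoop's API. The sketch also folds the list of stored blocks into the heartbeat itself, which is a simplification.

```python
class NameNode:
    """Toy NameNode: records heartbeats and which node holds which block."""

    def __init__(self, timeout_seconds, replication):
        self.timeout_seconds = timeout_seconds
        self.replication = replication
        self.last_heartbeat = {}   # node name -> time of last heartbeat
        self.block_locations = {}  # block id -> set of node names

    def heartbeat(self, node, blocks, now):
        """Record 'I'm here' from a node, plus the blocks it reports holding."""
        self.last_heartbeat[node] = now
        for block in blocks:
            self.block_locations.setdefault(block, set()).add(node)

    def live_nodes(self, now):
        """Nodes whose last heartbeat is recent enough to count as alive."""
        return {
            node for node, seen in self.last_heartbeat.items()
            if now - seen <= self.timeout_seconds
        }

    def under_replicated(self, now):
        """Blocks with fewer live copies than wanted, mapped to live holders."""
        live = self.live_nodes(now)
        result = {}
        for block, holders in self.block_locations.items():
            live_holders = holders & live
            if len(live_holders) < self.replication:
                result[block] = live_holders
        return result


namenode = NameNode(timeout_seconds=10, replication=2)
namenode.heartbeat("DataNode 1", ["123"], now=0)
namenode.heartbeat("DataNode 2", ["123"], now=0)
namenode.heartbeat("DataNode 3", [], now=0)

# DataNode 1 goes silent; the others keep reporting.
namenode.heartbeat("DataNode 2", ["123"], now=12)
namenode.heartbeat("DataNode 3", [], now=12)
print(namenode.under_replicated(now=15))
```

Once DataNode 1 stops sending heartbeats, block 123 has only one live copy, which is the situation in which the NameNode would ask DataNode 2 to replicate the block to another node.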

MapReduce in Hadoop

● Shuffle and Sort
  ○ Break a problem into sub-problems
● Batch Processing
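A word count is the customary way to show the map, shuffle-and-sort, and reduce steps. The sketch below runs in a single Python process and does not use Hadoop's API; the function names are assumptions chosen for readability.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key, then order the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a batch of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(pairs))


print(word_count(["hadoop pig", "apache pig", "hadoop hadoop"]))
```

Each line can be mapped independently and each key can be reduced independently, which is what lets the work be broken into sub-problems and spread across many machines as a batch job.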

What do iOS 4 and Windows 3.1 have in common?

Multi-Use vs. Batch

Workshop
