Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified
What is it? How does Microsoft fit in?
and… of course… some demos!
Presentation for ATL .NET User Group (July, 2014)
Lester Martin
Page 1
Agenda
• Hadoop 101
–Fundamentally, What is Hadoop?
–How is it Different?
–History of Hadoop
• Components of the Hadoop Ecosystem
• MapReduce, Pig, and Hive Demos
–Word Count
–Open Georgia Dataset Analysis
Page 2
Connection before Content
• Lester Martin
• Hortonworks – Professional Services
• [email protected]
• http://about.me/lestermartin (links to blog, github, twitter, LI, FB, etc.)
Page 3
© Hortonworks Inc. 2012
What is Core Apache Hadoop?
Scalable, Fault Tolerant, Open Source Data Storage and Processing
• Scale-Out Storage – HDFS
• Scale-Out Processing – MapReduce
• Scale-Out Resource Mgt – YARN
• Flexibility to Store and Mine Any Type of Data
– Ask questions that were previously impossible to ask or solve
– Not bound by a single, fixed schema
• Excels at Processing Complex Data
– Scale-out architecture divides workloads across multiple nodes
– Eliminates ETL bottlenecks
• Scales Economically
– Deployed on “commodity” hardware
– Open source platform guards against vendor lock-in
Page 4
The Need for Hadoop
• Store and use all types of data
• Process ALL the data; not just a sample
• Scalability to 1000s of nodes
• Commodity hardware
Page 5
Relational Database vs. Hadoop
Relational                                Hadoop
Schema required on write                  Schema required on read
Reads are fast                            Writes are fast
Standards and structure (governance)      Loosely structured
Limited to no data processing             Processing coupled with data
Structured data types                     Multi- and unstructured data
Best fit: interactive OLAP analytics,     Best fit: data discovery, processing
complex ACID transactions,                unstructured data, massive
operational data store                    storage/processing
Page 6
Fundamentally, a Simple Algorithm
1. Review stack of quarters
2. Count each year that ends in an even number
Page 7
Processing at Scale
Page 8
Distributed Algorithm – Map:Reduce
Page 9
Map (total number of quarters)
Reduce (sum each person’s total)
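The quarter-counting analogy can be sketched in plain Java (no Hadoop involved; names here are illustrative). Each person "maps" over their own stack of quarters, and one person "reduces" the per-stack totals:

```java
import java.util.List;

// Plain-Java sketch of the quarter-counting analogy.
// Map: each person counts the even-year quarters in their own stack.
// Reduce: one person sums everybody's per-stack totals.
public class QuarterCount {

    // "Map" step: count quarters whose mint year is even.
    static int mapStack(List<Integer> mintYears) {
        int count = 0;
        for (int year : mintYears) {
            if (year % 2 == 0) count++;
        }
        return count;
    }

    // "Reduce" step: sum the per-person totals.
    static int reduceTotals(List<Integer> perPersonTotals) {
        int total = 0;
        for (int t : perPersonTotals) total += t;
        return total;
    }

    public static void main(String[] args) {
        // Three people each take a share of the overall stack.
        List<List<Integer>> stacks = List.of(
                List.of(1998, 2001, 2004),   // 2 even years
                List.of(1993, 1995),         // 0 even years
                List.of(2000, 2010, 2011));  // 2 even years
        List<Integer> totals = stacks.stream().map(QuarterCount::mapStack).toList();
        System.out.println(reduceTotals(totals)); // prints 4
    }
}
```

The point of the analogy: the map step is embarrassingly parallel (each person works alone), and only the small per-person totals travel to the reduce step.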
A Brief History of Apache Hadoop
Page 10
• 2005: Hadoop created at Yahoo! – focus on INNOVATION
• 2006: Apache project established
• 2008: Yahoo! team extends focus to operations to support multiple projects & growing clusters; Yahoo! begins to operate at scale – focus on OPERATIONS
• 2011: Hortonworks created to focus on “Enterprise Hadoop”, starting with 24 key Hadoop engineers from Yahoo! – focus on STABILITY
• 2012: Hortonworks Data Platform delivers Enterprise Hadoop
HDP / Hadoop Components
Page 11
HDP: Enterprise Hadoop Platform
Page 12
Hortonworks Data Platform (HDP)
• The ONLY 100% open source and complete platform
• Integrates full range of enterprise-ready services
• Certified and tested at scale
• Engineered for deep ecosystem interoperability
HORTONWORKS DATA PLATFORM (HDP)
• Deployment options: OS/VM, Cloud, Appliance
• PLATFORM SERVICES – Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
• OPERATIONAL SERVICES – Oozie, Ambari, Falcon*
• DATA SERVICES
– Load & Extract: Sqoop, Flume, NFS, WebHDFS, Knox*
– Hive & HCatalog, Pig, HBase
• HADOOP CORE – HDFS, YARN, MapReduce, Tez
Typical Hadoop Cluster
Page 13
HDFS - Writing Files
• The Hadoop Client requests a write from the Name Node and gets back a list of Data Nodes (DNs) to write to
• The client writes blocks directly to the Data Nodes (each “DN | NM” node runs a DataNode and a NodeManager), which are spread across racks (Rack1 … RackN)
• The Data Nodes report their blocks back to the Name Node (block report)
• A Backup NN (one per NN) stays in sync with the file system metadata and performs checkpoints
Hive
• Data warehousing package built on top of Hadoop
• Bringing structure to unstructured data
• Query petabytes of data with HiveQL
• Schema on read
Page 16
Hive: SQL-Like Interface to Hadoop
• Provides basic SQL functionality using MapReduce to execute queries
• Supports standard SQL clauses: INSERT INTO, SELECT, FROM … JOIN … ON, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT
• Supports basic DDL: CREATE/ALTER/DROP TABLE, DATABASE
Page 17
Hortonworks Investment in Apache Hive
Batch AND Interactive SQL-IN-Hadoop
Stinger InitiativeA broad, community-based effort to drive the next generation of HIVE
Page 18
Goals:
• Speed – improve Hive query performance by 100x to allow for interactive query times (seconds)
• Scale – the only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL – support the broadest range of SQL semantics for analytic applications running against Hadoop

Stinger Phase 1 (Hive 0.11, HDP 1.3 – delivered May 2013):
• Base optimizations
• SQL types
• SQL analytic functions
• ORCFile modern file format

Stinger Phase 2 (Hive 0.12, HDP 2.0 – delivered September 2013):
• SQL types
• SQL analytic functions
• Advanced optimizations
• Performance boosts via YARN

Stinger Phase 3 (Hive 0.13 – delivered):
• Hive on Apache Tez
• Query service (always on)
• Buffer cache
• Cost-based optimizer (Optiq)

…70% complete in 6 months… all IN Hadoop
Stinger: Enhancing SQL Semantics
Page 19
Hive SQL Datatypes:
INT, TINYINT/SMALLINT/BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP, BINARY, DECIMAL, ARRAY, MAP, STRUCT, UNION, CHAR, VARCHAR, DATE

Hive SQL Semantics:
SELECT, LOAD, INSERT from query; expressions in WHERE and HAVING; GROUP BY, ORDER BY, SORT BY; sub-queries in the FROM clause; CLUSTER BY, DISTRIBUTE BY; ROLLUP and CUBE; UNION; LEFT, RIGHT and FULL INNER/OUTER JOIN; CROSS JOIN, LEFT SEMI JOIN; windowing functions (OVER, RANK, etc.); INTERSECT, EXCEPT, UNION DISTINCT; sub-queries in HAVING; sub-queries in WHERE (IN/NOT IN, EXISTS/NOT EXISTS)

(The original slide marks which features arrived in each of Hive 0.10, 0.11, 0.12, and 0.13 – a complete subset of SQL semantics.)
Pig
• Pig was created at Yahoo! to analyze data in HDFS without writing Map/Reduce code.
• Two components:
– A SQL-like processing language called “Pig Latin”
– The Pig execution engine, which produces MapReduce code
• Popular uses:
– ETL at scale (offloading)
– Text parsing and processing into Hive or HBase
– Aggregating data from multiple sources
Pig
Sample Code to find dropped call data:
fdr_4g = LOAD '/archive/FDR_4G.txt' USING TextLoader();
customer_master = LOAD 'masterdb.customer_data' USING HCatLoader();
fdr_4g_full = JOIN fdr_4g BY customerID, customer_master BY customerID;
dropped_calls = FILTER fdr_4g_full BY State == 'call_dropped';
Typical Data Analysis Workflow
Powering the Modern Data Architecture
HADOOP 1.0
• HDFS 1 (redundant, reliable storage)
• MapReduce (distributed data processing & cluster resource management)
• Data processing frameworks run on top (Hive, Pig, Cascading, …)
• Single-use system: batch apps

HADOOP 2.0
• HDFS 2 (redundant, reliable storage)
• YARN (cluster resource management)
• Multi-use data platform: batch, interactive, online, streaming, …
– Batch: MapReduce
– Interactive: Tez
– Standard SQL processing: Hive
– Online data processing: HBase, Accumulo
– Real-time stream processing: Storm
– others…
• Interact with all data in multiple ways simultaneously

Page 23
Word Counting Time!!
Hadoop’s “Hello Whirled” Example
A quick refresher of core elements of Hadoop and then code walk-thrus with Java MapReduce and Pig
Page 25
Core Hadoop Concepts
• Applications are written in high-level code
– Developers need not worry about network programming, temporal dependencies, or low-level infrastructure
• Nodes talk to each other as little as possible
– Developers should not write code which communicates between nodes
– “Shared nothing” architecture
• Data is spread among machines in advance
– Computation happens where the data is stored, wherever possible
– Data is replicated multiple times on the system for increased availability and reliability
Page 26
Hadoop: Very High-Level Overview
• When data is loaded into the system, it is split into “blocks”
– Typically 64MB or 128MB
• Map tasks (the first part of MapReduce) work on relatively small portions of data
– Typically a single block
• A master program allocates work to nodes such that a Map task will work on a block of data stored locally on that node whenever possible
– Many nodes work in parallel, each on their own part of the overall dataset
Page 27
Fault Tolerance
• If a node fails, the master will detect that failure and re-assign the work to a different node on the system
• Restarting a task does not require communication with nodes working on other portions of the data
• If a failed node restarts, it is automatically added back to the system and assigned new tasks
• If a node appears to be running slowly, the master can redundantly execute another instance of the same task
– Results from the first to finish will be used
– Known as “speculative execution”
Page 28
Hadoop Components
• Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce
• Many other projects are based around core Hadoop (the “Ecosystem”)
– Pig, Hive, HBase, Flume, Oozie, Sqoop, Datameer, etc.
• A set of machines running HDFS and MapReduce is known as a Hadoop Cluster
– Individual machines are known as nodes
– A cluster can have as few as one node or as many as several thousand
– More nodes = better performance!
Page 29
Hadoop Components: HDFS
• HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster
• Data is split into blocks and distributed across multiple nodes in the cluster
– Each block is typically 64MB (the default) or 128MB in size
• Each block is replicated multiple times
– Default is to replicate each block three times
– Replicas are stored on different nodes
– This ensures both reliability and availability
Page 30
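To make the numbers concrete, here is a back-of-the-envelope sketch in plain Java, assuming the defaults mentioned above (128MB blocks, replication factor 3); the file size is just an example:

```java
// Back-of-the-envelope HDFS math: a 1 GiB file with 128 MiB blocks and the
// default replication factor of 3.
public class HdfsBlockMath {

    // Number of blocks a file occupies (the last block may be partial).
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        long blockSize = 128L * 1024 * 1024;
        int replication = 3;

        long blocks = blockCount(oneGiB, blockSize); // 8 blocks
        long replicas = blocks * replication;        // 24 block replicas
        long rawBytes = oneGiB * replication;        // 3 GiB of raw storage

        System.out.println(blocks + " blocks, " + replicas + " replicas, "
                + rawBytes / (1024 * 1024) + " MB raw storage");
    }
}
```

In other words, triple replication trades 3x raw disk for the reliability and availability described above.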
HDFS Replicated Blocks Visualized
Page 31
HDFS *is* a File System
• Screenshot for “Name Node UI”
Page 32
Accessing HDFS
• Applications can read and write HDFS files directly via a Java API
• Typically, files are created on a local filesystem and must be moved into HDFS
• Likewise, files stored in HDFS may need to be moved to a machine’s local filesystem
• Access to HDFS from the command line is achieved with the hdfs dfs command
– Provides various shell-like commands like those found on Linux
– Replaces the hadoop fs command
• Graphical tools available like the Sandbox’s Hue File Browser and Red Gate’s HDFS Explorer
Page 33
hdfs dfs Examples
• Copy file fooLocal.txt from local disk to the user’s home directory in HDFS

hdfs dfs -put fooLocal.txt fooHDFS.txt

–This will copy the file to /user/username/fooHDFS.txt

• Get a directory listing of the user’s home directory in HDFS

hdfs dfs -ls

• Get a directory listing of the HDFS root directory

hdfs dfs -ls /

Page 34
hdfs dfs Examples (continued)
• Display the contents of a specific HDFS file

hdfs dfs -cat /user/fred/fooHDFS.txt

• Move that file back to the local disk

hdfs dfs -get /user/fred/fooHDFS.txt barLocal.txt

• Create a directory called input under the user’s home directory

hdfs dfs -mkdir input

• Delete the HDFS directory input and all its contents

hdfs dfs -rm -r input

Page 35
Hadoop Components: MapReduce
• MapReduce is the system used to process data in the Hadoop cluster
• Consists of two phases: Map, and then Reduce
– Between the two is a stage known as the shuffle and sort
• Each Map task operates on a discrete portion of the overall dataset
– Typically one HDFS block of data
• After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase
– Source code examples and live demo coming!
Page 36
Features of MapReduce
• Hadoop attempts to run tasks on nodes which hold their portion of the data locally, to avoid network traffic
• Automatic parallelization, distribution, and fault-tolerance
• Status and monitoring tools
• A clean abstraction for programmers
– MapReduce programs are usually written in Java
– Can be written in any language using Hadoop Streaming
– All of Hadoop is written in Java
– With “housekeeping” taken care of by the framework, developers can concentrate simply on writing Map and Reduce functions
Page 37
MapReduce Visualized
Page 38
Detailed Administrative Console
• Screenshot from “Job Tracker UI”
Page 39
MapReduce: The Mapper
• The Mapper reads data in the form of key/value pairs (KVPs)
• It outputs zero or more KVPs
• The Mapper may use or completely ignore the input key
– For example, a standard pattern is to read a line of a file at a time
– The key is the byte offset into the file at which the line starts
– The value is the contents of the line itself
– Typically the key is considered irrelevant with this pattern
• If the Mapper writes anything out, it must be in the form of KVPs
– This “intermediate data” is NOT stored in HDFS (local storage only, without replication)
Page 40
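The byte-offset pattern above can be sketched as a plain-Java stand-in for a word-count Mapper; this is an illustration of the pattern, not Hadoop’s actual Mapper API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for a word-count Mapper: takes a (byte offset, line)
// pair, ignores the offset key, and emits one (word, 1) KVP per word.
public class WordCountMapper {

    static List<Map.Entry<String, Integer>> map(long byteOffset, String line) {
        // The byte-offset key is deliberately ignored, as in the slide.
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1)); // intermediate KVP
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(8675, "I will not eat green eggs and ham"));
    }
}
```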
MapReduce: The Reducer
• After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list
• This list is given to a Reducer
– There may be a single Reducer, or multiple Reducers
– All values associated with a particular intermediate key are guaranteed to go to the same Reducer
– The intermediate keys, and their value lists, are passed in sorted order
• The Reducer outputs zero or more KVPs
– These are written to HDFS
– In practice, the Reducer often emits a single KVP for each input key
Page 41
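The matching plain-Java stand-in for a word-count Reducer is even smaller; again, a sketch of the pattern rather than Hadoop’s actual Reducer API:

```java
import java.util.List;

// Plain-Java stand-in for a word-count Reducer: receives one intermediate key
// plus the full list of its values, and emits a single (key, sum) pair.
public class WordCountReducer {

    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v; // add up all the 1s emitted by the Mappers
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(reduce("eat", List.of(1, 1))); // prints 2
    }
}
```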
MapReduce Example: Word Count
• Count the number of occurrences of each word in a large amount of input data
Page 42
map(String input_key, String input_value):
  foreach word in input_value:
    emit(word, 1)

reduce(String output_key, Iter<int> intermediate_vals):
  set count = 0
  foreach val in intermediate_vals:
    count += val
  emit(output_key, count)
MapReduce Example: Map Phase
Page 43
• Input to the Mapper
• Ignoring the key
– It is just an offset
• Output from the Mapper
• No attempt is made to optimize within a record in this example
– This is a great use case for a “Combiner”
(8675, ‘I will not eat green eggs and ham’)
(8709, ‘I will not eat them Sam I am’)
(‘I’, 1), (‘will’, 1), (‘not’, 1), (‘eat’, 1), (‘green’, 1), (‘eggs’, 1), (‘and’, 1), (‘ham’, 1), (‘I’, 1), (‘will’, 1), (‘not’, 1), (‘eat’, 1), (‘them’, 1), (‘Sam’, 1),(‘I’, 1), (‘am’, 1)
MapReduce Example: Reduce Phase
Page 44
• Input to the Reducer
• Notice keys are sorted and associated values for same key are in a single list– Shuffle & Sort did this for us
• Output from the Reducer
• All done!
(‘I’, [1, 1, 1])(‘Sam’, [1])(‘am’, [1])(‘and’, [1])(‘eat’, [1, 1])(‘eggs’, [1])(‘green’, [1])(‘ham’, [1])(‘not’, [1, 1])(‘them’, [1])(‘will’, [1, 1])
(‘I’, 3)(‘Sam’, 1)(‘am’, 1)(‘and’, 1)(‘eat’, 2)(‘eggs’, 1)(‘green’, 1)(‘ham’, 1)(‘not’, 2)(‘them’, 1)(‘will’, 2)
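The whole pipeline above (map, shuffle & sort, reduce) can be simulated in a few lines of plain Java; this is a sketch, not the demo code, with a TreeMap standing in for the shuffle & sort, which is why the keys come out in sorted order just as on the slide:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// End-to-end simulation of the word-count example: map each line, shuffle &
// sort the intermediate pairs into a sorted key -> value-list table, reduce.
public class WordCountPipeline {

    static Map<String, Integer> run(List<String> lines) {
        // Shuffle & sort: TreeMap keeps keys sorted; lists gather the values.
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {                        // Map phase
            for (String word : line.trim().split("\\s+")) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        Map<String, Integer> counts = new TreeMap<>();     // Reduce phase
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(
                "I will not eat green eggs and ham",
                "I will not eat them Sam I am")));
    }
}
```

Note the sorted output puts ‘I’ and ‘Sam’ before ‘am’: capital letters sort before lowercase in the natural String ordering, matching the slide.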
Code Walkthru & Demo Time!!
• Word Count Example
– Java MapReduce
– Pig
Page 45
Additional Demonstrations
A Real-World Analysis Example
Compare/contrast solving the same problem with Java MapReduce, Pig, and Hive
Page 46
Dataset: Open Georgia
• Salaries & Travel Reimbursements by Organization
– Local Boards of Education: several Atlanta-area districts; multiple years
– State Agencies, Boards, Authorities and Commissions: Dept of Public Safety; 2010
Page 47
Format & Sample Data
Page 48
Columns: NAME (String), TITLE (String), SALARY (float), ORG TYPE (String), ORG (String), YEAR (int)

NAME                   TITLE                               SALARY     ORG TYPE  ORG                                YEAR
ABBOTT,DEEDEE W        GRADES 9-12 TEACHER                 52,122.10  LBOE      ATLANTA INDEPENDENT SCHOOL SYSTEM  2010
ALLEN,ANNETTE D        SPEECH-LANGUAGE PATHOLOGIST         92,937.28  LBOE      ATLANTA INDEPENDENT SCHOOL SYSTEM  2010
BAHR,SHERREEN T        GRADE 5 TEACHER                     52,752.71  LBOE      COBB COUNTY SCHOOL DISTRICT        2010
BAILEY,ANTOINETTE R    SCHOOL SECRETARY/CLERK              19,905.90  LBOE      COBB COUNTY SCHOOL DISTRICT        2010
BAILEY,ASHLEY N        EARLY INTERVENTION PRIMARY TEACHER  43,992.82  LBOE      COBB COUNTY SCHOOL DISTRICT        2010
CALVERT,RONALD MARTIN  STATE PATROL (SP)                   51,370.40  SABAC     PUBLIC SAFETY, DEPARTMENT OF       2010
CAMERON,MICHAEL D      PUBLIC SAFETY TRN (AL)              34,748.60  SABAC     PUBLIC SAFETY, DEPARTMENT OF       2010
DAAS,TARWYN TARA       GRADES 9-12 TEACHER                 41,614.50  LBOE      FULTON COUNTY BOARD OF EDUCATION   2011
DABBS,SANDRA L         GRADES 9-12 TEACHER                 79,801.59  LBOE      FULTON COUNTY BOARD OF EDUCATION   2011
E'LOM,SOPHIA L         IS PERSONNEL - GENERAL ADMIN        75,509.00  LBOE      FULTON COUNTY BOARD OF EDUCATION   2012
EADDY,FENNER R         SUBSTITUTE                          13,469.00  LBOE      FULTON COUNTY BOARD OF EDUCATION   2012
EADY,ARNETTA A         ASSISTANT PRINCIPAL                 71,879.00  LBOE      FULTON COUNTY BOARD OF EDUCATION   2012
Simple Use Case
• For all loaded State of Georgia salary information, produce statistics for each specific job title
– Number of employees
– Salary breakdown (minimum, maximum, average)
• Limit the data to investigate
– Fiscal year 2010
– School district employees
Page 49
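Before walking through the MapReduce, Pig, and Hive versions, the use case itself can be sketched in plain Java. This is an illustrative sketch, not the demo code from the repo; the inline rows are drawn from the sample data slide:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the Open Georgia use case: filter to fiscal year 2010
// and school-district (LBOE) rows, then gather count/min/max/avg per title.
public class SalaryStats {

    record Row(String name, String title, double salary, String orgType, int year) {}

    record Stats(int count, double min, double max, double avg) {}

    static Map<String, Stats> analyze(List<Row> rows) {
        Map<String, double[]> acc = new TreeMap<>(); // {count, min, max, sum}
        for (Row r : rows) {
            if (r.year() != 2010 || !r.orgType().equals("LBOE")) continue;
            double[] a = acc.computeIfAbsent(r.title(), k -> new double[] {
                    0, Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, 0});
            a[0]++;
            a[1] = Math.min(a[1], r.salary());
            a[2] = Math.max(a[2], r.salary());
            a[3] += r.salary();
        }
        Map<String, Stats> out = new TreeMap<>();
        acc.forEach((title, a) ->
                out.put(title, new Stats((int) a[0], a[1], a[2], a[3] / a[0])));
        return out;
    }

    public static void main(String[] args) {
        // A few rows taken from the sample data slide.
        List<Row> rows = List.of(
                new Row("ABBOTT,DEEDEE W", "GRADES 9-12 TEACHER", 52122.10, "LBOE", 2010),
                new Row("BAHR,SHERREEN T", "GRADE 5 TEACHER", 52752.71, "LBOE", 2010),
                new Row("DAAS,TARWYN TARA", "GRADES 9-12 TEACHER", 41614.50, "LBOE", 2011),
                new Row("CALVERT,RONALD MARTIN", "STATE PATROL (SP)", 51370.40, "SABAC", 2010));
        System.out.println(analyze(rows)); // only the two 2010 LBOE rows survive
    }
}
```

The grouping key (title) and the per-group aggregates map directly onto the Reducer, Pig GROUP BY, and Hive GROUP BY versions shown in the demos.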
Code Walkthru & Demo; Part Deux!
• Word Count Example
– Java MapReduce
– Pig
– Hive
Page 50
Demo Wrap-Up
• All code, test data, wiki pages, and blog posts can be found, or linked to, from
– https://github.com/lestermartin/hadoop-exploration
• This deck can be found on SlideShare–http://www.slideshare.net/lestermartin
• Questions?
Page 51
Thank You!!
• Lester Martin
• Hortonworks – Professional Services
• [email protected]
• http://about.me/lestermartin (links to blog, github, twitter, LI, FB, etc.)
Page 52