Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified
What is it? How does Microsoft fit in?
and… of course… some demos!
Presentation for ATL .NET User Group (July, 2014)
Lester Martin
Page 1
Agenda
• Hadoop 101
–Fundamentally, What is Hadoop?
–How is it Different?
–History of Hadoop
• Components of the Hadoop Ecosystem
• MapReduce, Pig, and Hive Demos
–Word Count
–Open Georgia Dataset Analysis
Page 2
Connection before Content
• Lester Martin
• Hortonworks – Professional Services
• [email protected]
• http://about.me/lestermartin (links to blog, github, twitter, LI, FB, etc.)
Page 3
© Hortonworks Inc. 2012
What is Core Apache Hadoop?
Scalable, Fault Tolerant, Open Source Data Storage and Processing
• Scale-Out Storage – HDFS
• Scale-Out Processing – MapReduce
• Scale-Out Resource Mgt – YARN
• Flexibility to Store and Mine Any Type of Data
– Ask questions that were previously impossible to ask or solve
– Not bound by a single, fixed schema
• Excels at Processing Complex Data
– Scale-out architecture divides workloads across multiple nodes
– Eliminates ETL bottlenecks
• Scales Economically
– Deployed on “commodity” hardware
– Open source platform guards against vendor lock-in
Page 4
The Need for Hadoop
• Store and use all types of data
• Process ALL the data; not just a sample
• Scalability to 1000s of nodes
• Commodity hardware
Page 5
Relational Database vs. Hadoop
Relational                                Hadoop
Schema required on write                  Schema required on read
Reads are fast                            Writes are fast
Standards and structure (governance)      Loosely structured
Limited to no data processing             Processing coupled with data
Structured data types                     Multi- and unstructured data
Best fit: interactive OLAP analytics,     Best fit: data discovery, processing
complex ACID transactions,                unstructured data, massive
operational data store                    storage/processing
Page 6
Fundamentally, a Simple Algorithm
1. Review stack of quarters
2. Count each year that ends in an even number
Page 7
Processing at Scale
Page 8
Distributed Algorithm – Map:Reduce
Page 9
Map (total number of quarters)
Reduce (sum each person’s total)
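The quarter-counting analogy can be sketched in plain Java (no Hadoop involved; names here are illustrative). Each person "maps" over their own stack of quarters, and one person "reduces" the per-stack totals:

```java
import java.util.List;

// Plain-Java sketch of the quarter-counting analogy.
// Map: each person counts the even-year quarters in their own stack.
// Reduce: one person sums everybody's per-stack totals.
public class QuarterCount {

    // "Map" step: count quarters whose mint year is even.
    static int mapStack(List<Integer> mintYears) {
        int count = 0;
        for (int year : mintYears) {
            if (year % 2 == 0) count++;
        }
        return count;
    }

    // "Reduce" step: sum the per-person totals.
    static int reduceTotals(List<Integer> perPersonTotals) {
        int total = 0;
        for (int t : perPersonTotals) total += t;
        return total;
    }

    public static void main(String[] args) {
        // Three people each take a share of the overall stack.
        List<List<Integer>> stacks = List.of(
                List.of(1998, 2001, 2004),   // 2 even years
                List.of(1993, 1995),         // 0 even years
                List.of(2000, 2010, 2011));  // 2 even years
        List<Integer> totals = stacks.stream().map(QuarterCount::mapStack).toList();
        System.out.println(reduceTotals(totals)); // prints 4
    }
}
```

The point of the analogy: the map step is embarrassingly parallel (each person works alone), and only the small per-person totals travel to the reduce step.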
A Brief History of Apache Hadoop
Page 10
• 2005: Hadoop created at Yahoo! – focus on INNOVATION
• 2006: Apache project established
• 2008: Yahoo! team extends focus to operations to support multiple projects & growing clusters; Yahoo! begins to operate at scale – focus on OPERATIONS
• 2011: Hortonworks created to focus on “Enterprise Hadoop”, starting with 24 key Hadoop engineers from Yahoo! – focus on STABILITY
• 2012: Hortonworks Data Platform delivers Enterprise Hadoop
HDP / Hadoop Components
Page 11
HDP: Enterprise Hadoop Platform
Page 12
Hortonworks Data Platform (HDP)
• The ONLY 100% open source and complete platform
• Integrates full range of enterprise-ready services
• Certified and tested at scale
• Engineered for deep ecosystem interoperability
HORTONWORKS DATA PLATFORM (HDP)
• Deployment options: OS/VM, Cloud, Appliance
• PLATFORM SERVICES – Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
• OPERATIONAL SERVICES – Oozie, Ambari, Falcon*
• DATA SERVICES
– Load & Extract: Sqoop, Flume, NFS, WebHDFS, Knox*
– Hive & HCatalog, Pig, HBase
• HADOOP CORE – HDFS, YARN, MapReduce, Tez
Typical Hadoop Cluster
Page 13
HDFS - Writing Files
• The Hadoop Client requests a write from the Name Node and gets back a list of Data Nodes (DNs) to write to
• The client writes blocks directly to the Data Nodes (each “DN | NM” node runs a DataNode and a NodeManager), which are spread across racks (Rack1 … RackN)
• The Data Nodes report their blocks back to the Name Node (block report)
• A Backup NN (one per NN) stays in sync with the file system metadata and performs checkpoints
Hive
• Data warehousing package built on top of Hadoop
• Bringing structure to unstructured data
• Query petabytes of data with HiveQL
• Schema on read
Page 16
Hive: SQL-Like Interface to Hadoop
• Provides basic SQL functionality using MapReduce to execute queries
• Supports standard SQL clauses: INSERT INTO, SELECT, FROM … JOIN … ON, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT
• Supports basic DDL: CREATE/ALTER/DROP TABLE, DATABASE
Page 17
Hortonworks Investment in Apache Hive
Batch AND Interactive SQL-IN-Hadoop
Stinger InitiativeA broad, community-based effort to drive the next generation of HIVE
Page 18
Goals:
• Speed – improve Hive query performance by 100x to allow for interactive query times (seconds)
• Scale – the only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL – support the broadest range of SQL semantics for analytic applications running against Hadoop

Stinger Phase 1 (Hive 0.11, HDP 1.3 – delivered May 2013):
• Base optimizations
• SQL types
• SQL analytic functions
• ORCFile modern file format

Stinger Phase 2 (Hive 0.12, HDP 2.0 – delivered September 2013):
• SQL types
• SQL analytic functions
• Advanced optimizations
• Performance boosts via YARN

Stinger Phase 3 (Hive 0.13 – delivered):
• Hive on Apache Tez
• Query service (always on)
• Buffer cache
• Cost-based optimizer (Optiq)

…70% complete in 6 months… all IN Hadoop
Stinger: Enhancing SQL Semantics
Page 19
Hive SQL Datatypes:
INT, TINYINT/SMALLINT/BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP, BINARY, DECIMAL, ARRAY, MAP, STRUCT, UNION, CHAR, VARCHAR, DATE

Hive SQL Semantics:
SELECT, LOAD, INSERT from query; expressions in WHERE and HAVING; GROUP BY, ORDER BY, SORT BY; sub-queries in the FROM clause; CLUSTER BY, DISTRIBUTE BY; ROLLUP and CUBE; UNION; LEFT, RIGHT and FULL INNER/OUTER JOIN; CROSS JOIN, LEFT SEMI JOIN; windowing functions (OVER, RANK, etc.); INTERSECT, EXCEPT, UNION DISTINCT; sub-queries in HAVING; sub-queries in WHERE (IN/NOT IN, EXISTS/NOT EXISTS)

(The original slide marks which features arrived in each of Hive 0.10, 0.11, 0.12, and 0.13 – a complete subset of SQL semantics.)
Pig
• Pig was created at Yahoo! to analyze data in HDFS without writing Map/Reduce code.
• Two components:
– A SQL-like processing language called “Pig Latin”
– The Pig execution engine, which produces MapReduce code
• Popular uses:
– ETL at scale (offloading)
– Text parsing and processing into Hive or HBase
– Aggregating data from multiple sources
Pig
Sample Code to find dropped call data:
fdr_4g = LOAD '/archive/FDR_4G.txt' USING TextLoader();
customer_master = LOAD 'masterdb.customer_data' USING HCatLoader();
fdr_4g_full = JOIN fdr_4g BY customerID, customer_master BY customerID;
dropped_calls = FILTER fdr_4g_full BY State == 'call_dropped';
Typical Data Analysis Workflow
Powering the Modern Data Architecture
HADOOP 1.0
• HDFS 1 (redundant, reliable storage)
• MapReduce (distributed data processing & cluster resource management)
• Data processing frameworks run on top (Hive, Pig, Cascading, …)
• Single-use system: batch apps

HADOOP 2.0
• HDFS 2 (redundant, reliable storage)
• YARN (cluster resource management)
• Multi-use data platform: batch, interactive, online, streaming, …
– Batch: MapReduce
– Interactive: Tez
– Standard SQL processing: Hive
– Online data processing: HBase, Accumulo
– Real-time stream processing: Storm
– others…
• Interact with all data in multiple ways simultaneously

Page 23
Word Counting Time!!
Hadoop’s “Hello Whirled” Example
A quick refresher of core elements of Hadoop and then code walk-thrus with Java MapReduce and Pig
Page 25
Core Hadoop Concepts
• Applications are written in high-level code
– Developers need not worry about network programming, temporal dependencies, or low-level infrastructure
• Nodes talk to each other as little as possible
– Developers should not write code which communicates between nodes
– “Shared nothing” architecture
• Data is spread among machines in advance
– Computation happens where the data is stored, wherever possible
– Data is replicated multiple times on the system for increased availability and reliability
Page 26
Hadoop: Very High-Level Overview
• When data is loaded into the system, it is split into “blocks”
– Typically 64MB or 128MB
• Map tasks (the first part of MapReduce) work on relatively small portions of data
– Typically a single block
• A master program allocates work to nodes such that a Map task will work on a block of data stored locally on that node whenever possible
– Many nodes work in parallel, each on their own part of the overall dataset
Page 27
Fault Tolerance
• If a node fails, the master will detect that failure and re-assign the work to a different node on the system
• Restarting a task does not require communication with nodes working on other portions of the data
• If a failed node restarts, it is automatically added back to the system and assigned new tasks
• If a node appears to be running slowly, the master can redundantly execute another instance of the same task
– Results from the first to finish will be used
– Known as “speculative execution”
Page 28
Hadoop Components
• Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce
• Many other projects are based around core Hadoop (the “Ecosystem”)
– Pig, Hive, HBase, Flume, Oozie, Sqoop, Datameer, etc.
• A set of machines running HDFS and MapReduce is known as a Hadoop Cluster
– Individual machines are known as nodes
– A cluster can have as few as one node or as many as several thousand
– More nodes = better performance!
Page 29
Hadoop Components: HDFS
• HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster
• Data is split into blocks and distributed across multiple nodes in the cluster
– Each block is typically 64MB (the default) or 128MB in size
• Each block is replicated multiple times
– Default is to replicate each block three times
– Replicas are stored on different nodes
– This ensures both reliability and availability
Page 30
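To make the numbers concrete, here is a back-of-the-envelope sketch in plain Java, assuming the defaults mentioned above (128MB blocks, replication factor 3); the file size is just an example:

```java
// Back-of-the-envelope HDFS math: a 1 GiB file with 128 MiB blocks and the
// default replication factor of 3.
public class HdfsBlockMath {

    // Number of blocks a file occupies (the last block may be partial).
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        long blockSize = 128L * 1024 * 1024;
        int replication = 3;

        long blocks = blockCount(oneGiB, blockSize); // 8 blocks
        long replicas = blocks * replication;        // 24 block replicas
        long rawBytes = oneGiB * replication;        // 3 GiB of raw storage

        System.out.println(blocks + " blocks, " + replicas + " replicas, "
                + rawBytes / (1024 * 1024) + " MB raw storage");
    }
}
```

In other words, triple replication trades 3x raw disk for the reliability and availability described above.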
HDFS Replicated Blocks Visualized
Page 31
HDFS *is* a File System
• Screenshot for “Name Node UI”
Page 32
Accessing HDFS
• Applications can read and write HDFS files directly via a Java API
• Typically, files are created on a local filesystem and must be moved into HDFS
• Likewise, files stored in HDFS may need to be moved to a machine’s local filesystem
• Access to HDFS from the command line is achieved with the hdfs dfs command
– Provides various shell-like commands like those found on Linux
– Replaces the hadoop fs command
• Graphical tools available like the Sandbox’s Hue File Browser and Red Gate’s HDFS Explorer
Page 33
hdfs dfs Examples
• Copy file fooLocal.txt from local disk to the user’s home directory in HDFS

hdfs dfs -put fooLocal.txt fooHDFS.txt

–This will copy the file to /user/username/fooHDFS.txt

• Get a directory listing of the user’s home directory in HDFS

hdfs dfs -ls

• Get a directory listing of the HDFS root directory

hdfs dfs -ls /

Page 34
hdfs dfs Examples (continued)
• Display the contents of a specific HDFS file

hdfs dfs -cat /user/fred/fooHDFS.txt

• Move that file back to the local disk

hdfs dfs -get /user/fred/fooHDFS.txt barLocal.txt

• Create a directory called input under the user’s home directory

hdfs dfs -mkdir input

• Delete the HDFS directory input and all its contents

hdfs dfs -rm -r input

Page 35
Hadoop Components: MapReduce
• MapReduce is the system used to process data in the Hadoop cluster
• Consists of two phases: Map, and then Reduce
– Between the two is a stage known as the shuffle and sort
• Each Map task operates on a discrete portion of the overall dataset
– Typically one HDFS block of data
• After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase
– Source code examples and live demo coming!
Page 36
Features of MapReduce
• Hadoop attempts to run tasks on nodes which hold their portion of the data locally, to avoid network traffic
• Automatic parallelization, distribution, and fault-tolerance
• Status and monitoring tools
• A clean abstraction for programmers
– MapReduce programs are usually written in Java
– Can be written in any language using Hadoop Streaming
– All of Hadoop is written in Java
– With “housekeeping” taken care of by the framework, developers can concentrate simply on writing Map and Reduce functions
Page 37
MapReduce Visualized
Page 38
Detailed Administrative Console
• Screenshot from “Job Tracker UI”
Page 39
MapReduce: The Mapper
• The Mapper reads data in the form of key/value pairs (KVPs)
• It outputs zero or more KVPs
• The Mapper may use or completely ignore the input key
– For example, a standard pattern is to read a line of a file at a time
– The key is the byte offset into the file at which the line starts
– The value is the contents of the line itself
– Typically the key is considered irrelevant with this pattern
• If the Mapper writes anything out, it must be in the form of KVPs
– This “intermediate data” is NOT stored in HDFS (local storage only, without replication)
Page 40
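The byte-offset pattern above can be sketched as a plain-Java stand-in for a word-count Mapper; this is an illustration of the pattern, not Hadoop’s actual Mapper API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for a word-count Mapper: takes a (byte offset, line)
// pair, ignores the offset key, and emits one (word, 1) KVP per word.
public class WordCountMapper {

    static List<Map.Entry<String, Integer>> map(long byteOffset, String line) {
        // The byte-offset key is deliberately ignored, as in the slide.
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1)); // intermediate KVP
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(8675, "I will not eat green eggs and ham"));
    }
}
```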
MapReduce: The Reducer
• After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list
• This list is given to a Reducer
– There may be a single Reducer, or multiple Reducers
– All values associated with a particular intermediate key are guaranteed to go to the same Reducer
– The intermediate keys, and their value lists, are passed in sorted order
• The Reducer outputs zero or more KVPs
– These are written to HDFS
– In practice, the Reducer often emits a single KVP for each input key
Page 41
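The matching plain-Java stand-in for a word-count Reducer is even smaller; again, a sketch of the pattern rather than Hadoop’s actual Reducer API:

```java
import java.util.List;

// Plain-Java stand-in for a word-count Reducer: receives one intermediate key
// plus the full list of its values, and emits a single (key, sum) pair.
public class WordCountReducer {

    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v; // add up all the 1s emitted by the Mappers
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(reduce("eat", List.of(1, 1))); // prints 2
    }
}
```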
MapReduce Example: Word Count
• Count the number of occurrences of each word in a large amount of input data
Page 42
map(String input_key, String input_value):
  foreach word in input_value:
    emit(word, 1)

reduce(String output_key, Iter<int> intermediate_vals):
  set count = 0
  foreach val in intermediate_vals:
    count += val
  emit(output_key, count)
MapReduce Example: Map Phase
Page 43
• Input to the Mapper
• Ignoring the key
– It is just an offset
• Output from the Mapper
• No attempt is made to optimize within a record in this example
– This is a great use case for a “Combiner”
(8675, ‘I will not eat green eggs and ham’)
(8709, ‘I will not eat them Sam I am’)
(‘I’, 1), (‘will’, 1), (‘not’, 1), (‘eat’, 1), (‘green’, 1), (‘eggs’, 1), (‘and’, 1), (‘ham’, 1), (‘I’, 1), (‘will’, 1), (‘not’, 1), (‘eat’, 1), (‘them’, 1), (‘Sam’, 1),(‘I’, 1), (‘am’, 1)
MapReduce Example: Reduce Phase
Page 44
• Input to the Reducer
• Notice keys are sorted and associated values for same key are in a single list– Shuffle & Sort did this for us
• Output from the Reducer
• All done!
(‘I’, [1, 1, 1])(‘Sam’, [1])(‘am’, [1])(‘and’, [1])(‘eat’, [1, 1])(‘eggs’, [1])(‘green’, [1])(‘ham’, [1])(‘not’, [1, 1])(‘them’, [1])(‘will’, [1, 1])
(‘I’, 3)(‘Sam’, 1)(‘am’, 1)(‘and’, 1)(‘eat’, 2)(‘eggs’, 1)(‘green’, 1)(‘ham’, 1)(‘not’, 2)(‘them’, 1)(‘will’, 2)
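The whole pipeline above (map, shuffle & sort, reduce) can be simulated in a few lines of plain Java; this is a sketch, not the demo code, with a TreeMap standing in for the shuffle & sort, which is why the keys come out in sorted order just as on the slide:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// End-to-end simulation of the word-count example: map each line, shuffle &
// sort the intermediate pairs into a sorted key -> value-list table, reduce.
public class WordCountPipeline {

    static Map<String, Integer> run(List<String> lines) {
        // Shuffle & sort: TreeMap keeps keys sorted; lists gather the values.
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {                        // Map phase
            for (String word : line.trim().split("\\s+")) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        Map<String, Integer> counts = new TreeMap<>();     // Reduce phase
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(
                "I will not eat green eggs and ham",
                "I will not eat them Sam I am")));
    }
}
```

Note the sorted output puts ‘I’ and ‘Sam’ before ‘am’: capital letters sort before lowercase in the natural String ordering, matching the slide.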
Code Walkthru & Demo Time!!
• Word Count Example
– Java MapReduce
– Pig
Page 45
Additional Demonstrations
A Real-World Analysis Example
Compare/contrast solving the same problem with Java MapReduce, Pig, and Hive
Page 46
Dataset: Open Georgia
• Salaries & Travel Reimbursements by Organization
– Local Boards of Education: several Atlanta-area districts; multiple years
– State Agencies, Boards, Authorities and Commissions: Dept of Public Safety; 2010
Page 47
Format & Sample Data
Page 48
Columns: NAME (String), TITLE (String), SALARY (float), ORG TYPE (String), ORG (String), YEAR (int)

NAME                   TITLE                               SALARY     ORG TYPE  ORG                                YEAR
ABBOTT,DEEDEE W        GRADES 9-12 TEACHER                 52,122.10  LBOE      ATLANTA INDEPENDENT SCHOOL SYSTEM  2010
ALLEN,ANNETTE D        SPEECH-LANGUAGE PATHOLOGIST         92,937.28  LBOE      ATLANTA INDEPENDENT SCHOOL SYSTEM  2010
BAHR,SHERREEN T        GRADE 5 TEACHER                     52,752.71  LBOE      COBB COUNTY SCHOOL DISTRICT        2010
BAILEY,ANTOINETTE R    SCHOOL SECRETARY/CLERK              19,905.90  LBOE      COBB COUNTY SCHOOL DISTRICT        2010
BAILEY,ASHLEY N        EARLY INTERVENTION PRIMARY TEACHER  43,992.82  LBOE      COBB COUNTY SCHOOL DISTRICT        2010
CALVERT,RONALD MARTIN  STATE PATROL (SP)                   51,370.40  SABAC     PUBLIC SAFETY, DEPARTMENT OF       2010
CAMERON,MICHAEL D      PUBLIC SAFETY TRN (AL)              34,748.60  SABAC     PUBLIC SAFETY, DEPARTMENT OF       2010
DAAS,TARWYN TARA       GRADES 9-12 TEACHER                 41,614.50  LBOE      FULTON COUNTY BOARD OF EDUCATION   2011
DABBS,SANDRA L         GRADES 9-12 TEACHER                 79,801.59  LBOE      FULTON COUNTY BOARD OF EDUCATION   2011
E'LOM,SOPHIA L         IS PERSONNEL - GENERAL ADMIN        75,509.00  LBOE      FULTON COUNTY BOARD OF EDUCATION   2012
EADDY,FENNER R         SUBSTITUTE                          13,469.00  LBOE      FULTON COUNTY BOARD OF EDUCATION   2012
EADY,ARNETTA A         ASSISTANT PRINCIPAL                 71,879.00  LBOE      FULTON COUNTY BOARD OF EDUCATION   2012
Simple Use Case
• For all loaded State of Georgia salary information, produce statistics for each specific job title
– Number of employees
– Salary breakdown (minimum, maximum, average)
• Limit the data to investigate
– Fiscal year 2010
– School district employees
Page 49
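Before walking through the MapReduce, Pig, and Hive versions, the use case itself can be sketched in plain Java. This is an illustrative sketch, not the demo code from the repo; the inline rows are drawn from the sample data slide:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the Open Georgia use case: filter to fiscal year 2010
// and school-district (LBOE) rows, then gather count/min/max/avg per title.
public class SalaryStats {

    record Row(String name, String title, double salary, String orgType, int year) {}

    record Stats(int count, double min, double max, double avg) {}

    static Map<String, Stats> analyze(List<Row> rows) {
        Map<String, double[]> acc = new TreeMap<>(); // {count, min, max, sum}
        for (Row r : rows) {
            if (r.year() != 2010 || !r.orgType().equals("LBOE")) continue;
            double[] a = acc.computeIfAbsent(r.title(), k -> new double[] {
                    0, Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, 0});
            a[0]++;
            a[1] = Math.min(a[1], r.salary());
            a[2] = Math.max(a[2], r.salary());
            a[3] += r.salary();
        }
        Map<String, Stats> out = new TreeMap<>();
        acc.forEach((title, a) ->
                out.put(title, new Stats((int) a[0], a[1], a[2], a[3] / a[0])));
        return out;
    }

    public static void main(String[] args) {
        // A few rows taken from the sample data slide.
        List<Row> rows = List.of(
                new Row("ABBOTT,DEEDEE W", "GRADES 9-12 TEACHER", 52122.10, "LBOE", 2010),
                new Row("BAHR,SHERREEN T", "GRADE 5 TEACHER", 52752.71, "LBOE", 2010),
                new Row("DAAS,TARWYN TARA", "GRADES 9-12 TEACHER", 41614.50, "LBOE", 2011),
                new Row("CALVERT,RONALD MARTIN", "STATE PATROL (SP)", 51370.40, "SABAC", 2010));
        System.out.println(analyze(rows)); // only the two 2010 LBOE rows survive
    }
}
```

The grouping key (title) and the per-group aggregates map directly onto the Reducer, Pig GROUP BY, and Hive GROUP BY versions shown in the demos.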
Code Walkthru & Demo; Part Deux!
• Word Count Example
– Java MapReduce
– Pig
– Hive
Page 50
Demo Wrap-Up
• All code, test data, wiki pages, and blog posts can be found, or linked to, from
– https://github.com/lestermartin/hadoop-exploration
• This deck can be found on SlideShare–http://www.slideshare.net/lestermartin
• Questions?
Page 51
Thank You!!
• Lester Martin
• Hortonworks – Professional Services
• [email protected]
• http://about.me/lestermartin (links to blog, github, twitter, LI, FB, etc.)
Page 52