Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop •...
Transcript of Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop •...
© Copyright 2014, Neudesic. All rights reserved.
Understanding Apache Hadoop
Presented By: Orion GebremedhinBI/Big Data Solution Partner , Neudesic LLC.
Data Platform VTSP, Microsoft Corp.
@OrionGM
© Copyright 2014, Neudesic. All rights reserved.
Topics Covered
• Fundamentals of Hadoop
• Major Hadoop Distributions
© Copyright 2014, Neudesic. All rights reserved.
3
Big Data = Hadoop?
* A Modern Data Architecture with Apache Hadoop, Hortonworks Inc. 2014
© Copyright 2014, Neudesic. All rights reserved.
4
The Fundamentals of Hadoop
• Hadoop evolved directly from
commodity scientific supercomputing
clusters developed in the 1990s
• Hadoop consists of:
• MapReduce
• Hadoop Distributed File System
(HDFS)
© Copyright 2014, Neudesic. All rights reserved.
5
What’s New?
© Copyright 2014, Neudesic. All rights reserved.
6
Characteristics of HDFS
• Very high fault tolerance
• Cannot be updated, but
corrections can be appended
• File blocks are replicated
multiple times
Three Types of Nodes:
1. Name Node (Directory)
2. Backup Node (Checkpoint)
3. Data Node (Actual Data)
© Copyright 2014, Neudesic. All rights reserved.
7
Characteristics of MapReduce
• Map Function - Take a task and break it down into small tasks
• Reduce Function - Combine the partial answers and find the combined list
A programing framework for library and runtime (just like .NET)
• When submitting a query, this manages the Task Trackers which trigger the actual Map or Reduce task
Master (Job Tracker)
• The “doers.” Each node in the cluster has a data node and a task tracker
Workers (Task Trackers)
© Copyright 2014, Neudesic. All rights reserved.
8
HDFS and MapReduce
The Main Node: runs the Job tracker and the name node controls the files.
Each node runs two processes: Task Tracker and Data Node
© Copyright 2014, Neudesic. All rights reserved.
9
Basics of MapReduce
1 bill/ sec
= 400 Seconds
400 bills
© Copyright 2014, Neudesic. All rights reserved.
10
Basics of MapReduce
1 bill/ sec
1 bill/ sec
= 200 Seconds
= 200 Seconds
200 Bills
200 BillsTotal =200 seconds
© Copyright 2014, Neudesic. All rights reserved.
11
Basics of MapReduce
= 100 Seconds100 bills
100 bills
100 bills
100 bills
= 100 Seconds
= 100 Seconds
= 100 Seconds
Total =100 seconds
© Copyright 2014, Neudesic. All rights reserved.
12
Basics of MapReduce
Query
Result
Name Node/Job TrackerQuery
Data Nodes/Task Trackers
© Copyright 2014, Neudesic. All rights reserved.
13
MapReduce
• Java
• Write many lines of code
Pig
• Mostly used by Yahoo
• Most used for data processing
• Shares some constructs w/ SQL
• Is more Verbose
• Needs a lot of training for users with limited procedural programming background
• Offers control over the flow of data
Hive
• Mostly used by Facebook for analytic purposes
• Used for analytics
• Relatively easier for developers w/ SQL experience.
• Less control over optimization of data flows compared to Pig
MapReduce, Pig and Hive
• Not as efficient as MapReduce
• Higher productivity for data scientists and developers
© Copyright 2014, Neudesic. All rights reserved.
14
Major Versions of Apache Hadoop
Apache Foundation
© Copyright 2014, Neudesic. All rights reserved.
15
Hortonworks
• 2011: $23 million funding from Yahoo! and Benchmark Capital- positioned as an independent company
• Horton the Elephant - Horton Hears a Who!
• Employs contributors to project Apache Hadoop
• October 2011 partnered with Microsoft : Azure and Windows Server
• Cloudera founded in October 2008…started the effort to be Microsoft Azure Certified in October 2014.
© Copyright 2014, Neudesic. All rights reserved.
16
HDP User Interface
© Copyright 2014, Neudesic. All rights reserved.
17
HUE Architecture
© Copyright 2014, Neudesic. All rights reserved.
Questions?Tom Marek
Twitter: @twmarek
Orion Gebremedhin
Twitter: @OrionGM