An Introduction of Apache Hadoop
-
Upload
kms-technology -
Category
Technology
-
view
118 -
download
1
description
Transcript of An Introduction of Apache Hadoop
![Page 2: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/2.jpg)
AN INTRODUCTION OF
APACHE HADOOP
![Page 3: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/3.jpg)
WHO AM I?
Minh Tran
KMS Technology
Current: Software Architect at KMS Technology
Past: Technical at Yahoo!
Senior Engineer at MobiVi, Sciant, ELCA
Admin at JavaVietnam
![Page 4: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/4.jpg)
OBJECTIVES
• Understand what Apache Hadoop is
• Understand problems Hadoop aims to solve
• Explore Hadoop architecture and its
ecosystem
![Page 5: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/5.jpg)
AGENDA
• Hadoop Overview
• Haddop Architecture at a glance
• Hadoop Ecosystem
• A demo of using Hadoop
![Page 6: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/6.jpg)
AGENDA – HADOOP OVERVIEW
• Big Data & Challenges
• What is Hadoop?
• Hadoop Benefits
• Which problem can Hadoop solve?
• Hadoop Installation
![Page 7: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/7.jpg)
WHY DO WE HAVE SO MUCH
DATA?
• Every single day
– Twitter processes 340 million messages
– Facebook stores 2.7 billion comments and
“Likes”
– Google processes about 24 petabytes of data
• And every single minute
– More than 200 million e-mails are sent
– Foursquare processes more than 2,000
check-ins
![Page 8: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/8.jpg)
WHERE DOES DATA COME FROM?
• Science: medical imaging, sensor data,
genome sequencing, weather data,
satellite feeds, etc.
• Legacy: Sales data, customer behavior,
product databases, accounting data, etc.
• System Data: Log files, network messages,
Web Analytics, intrusion detection, spam
filters • (Not all of this maps cleanly to the relational model)
![Page 9: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/9.jpg)
DATA ANALYSIS CHALLENGE
• Huge volumes of data
• Mixed sources result in many different formats
– XML
– CSV
– EDI
– Log files
– Objects
– SQL
– Text
– JSON
– Binary
– etc.
![Page 10: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/10.jpg)
WHAT IS HADOOP?
• Scalable data storage and processing
– Open source Apache project
– Harnesses the power of commodity servers
– Distributed and fault-tolerant
• “Core” Hadoop consists of two main parts
– HDFS (storage)
– MapReduce (processing)
![Page 11: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/11.jpg)
WHO USES HADOOP?
![Page 12: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/12.jpg)
BENEFITS OF ANALYZING WITH
HADOOP
• Previously impossible/impractical
to do this analysis
• Analysis conducted at lower cost
• Analysis conducted in less time
• Greater flexibility
• Linear scalability
![Page 13: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/13.jpg)
WHICH PROBLEM CAN
HADOOP SOLVE?
• Nature of the data
– Complex & multiple data sources
– Lots of it
• Nature of the analysis
– Batch processing
– Parallel execution
– Spread data over a cluster of servers and take the computation
to the data
• Common Hadoop Problems:
– Customer churn analysis
– Recommendation engine
– PoS transaction analysis
– Threat analysis
– Search quality
– Data “sandbox”
![Page 14: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/14.jpg)
HADOOP INSTALLATION
1. Install a Linux machine, for e.g.: Ubuntu
2. Install latest JDK
3. Install Hadoop package, download at
http://hadoop.apache.org/
![Page 15: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/15.jpg)
AGENDA
• Hadoop Overview
• Haddop Architecture at a glance
• Hadoop Ecosystem
• A demo of using Hadoop
![Page 16: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/16.jpg)
AGENDA - HADDOP ARCHITECTURE
AT A GLANCE
• Hadoop Distributed File System
• How MapReduce works
![Page 17: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/17.jpg)
COLLOCATED STORAGE
AND PROCESSING
• Because 10,000 hard disks are better than one
• Solution: store and process data on the same nodes
– Data locality: “Bring the computation to the data”
– Reduces I/O and boosts performance
![Page 18: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/18.jpg)
HARD DISK LATENCY
• Disk seeks are expensive
• Solution: Read lots of data at once to amortize the cost
![Page 19: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/19.jpg)
HDFS BLOCKS
• When a file is added to HDFS, it’s split into blocks
• This is a similar concept to native file systems
– HDFS uses a much larger block size (64 MB), for
performance
![Page 20: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/20.jpg)
Client application
Hadoop file system client
DataNode 1
C
D
B
DataNode 2
A
C
D
DataNode 3
B
A
C
NameNode
/tmp/file1.txt
Block A
Block B
DataNode 3
DataNode 2
DataNode 1
DataNode 3
Block C DataNode 1
DataNode 2
DataNode 3
HDFS High Level Architecture
![Page 21: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/21.jpg)
HOW MAPREDUCE WORKS?
![Page 22: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/22.jpg)
ANOTHER EXAMPLE ABOUT
BUILDING INVERTED INDEX
• Input: a number of text files
• Output: a list of tuples, where each tuple is a word and a list of files
that contain the word
doc1.txt
cat sat mat
doc2.txt
cat sat dog
Input filenames and contents
Mappers Intermediate
output Reducers
cat, doc1.txt
sat, doc1.txt
mat, doc1.txt
cat, doc2.txt
sat, doc2.txt
dog, doc2.txt
part-r-00000
cat: doc1.txt, doc2.txt
part-r-00001
sat: doc1.txt, doc2.txt dog: doc2.txt
part-r-00002
mat: doc1.txt
Output filenames and contents
![Page 23: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/23.jpg)
AGENDA
• Hadoop Overview
• Haddop Architecture at a glance
• Hadoop Ecosystem
• A demo of using Hadoop
![Page 24: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/24.jpg)
HADOOP ECOSYSTEM
![Page 25: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/25.jpg)
AGENDA
• Hadoop Overview
• Haddop Architecture at a glance
• Hadoop Ecosystem
• A demo of using Hadoop
![Page 26: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/26.jpg)
REFERENCES
• Hadoop In Practice – Alex Homes
• Hadoop Real World Solutions Cookbook – Jonathan R. Owens, Jon
Lentz, Brian Femiano
• Hadoop In Action – Chuck Lam
• Hadoop The Definitive Guide – Tom White
• MapReduce Design Patterns – Donald Miner, Adam Shook
• An Introduction to Hadoop – Mark Fei
• http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-
linux-single-node-cluster/
• http://www.crobak.org/2011/12/getting-started-with-apache-hadoop-
0-23-0/
![Page 27: An Introduction of Apache Hadoop](https://reader031.fdocuments.in/reader031/viewer/2022020122/54c668d14a79594b538b47b5/html5/thumbnails/27.jpg)
© 2013 KMS Technology
THANK YOU