MATTHIAS BRÄGER, CERN GS-ASE
INTRODUCTION TO APACHE HADOOP
AGENDA
• Introduction to Big Data
• Introduction to Hadoop
• HDFS file system
• Map/Reduce framework
• Hadoop utilities
• Summary
BIG DATA FACTS
• In what timeframe do we now create the same amount of information that we created from the dawn of civilization until 2003? → 2 days
• 90% of the world’s data was created in the last how many years? → 2 years
• What is 1024 petabytes also known as? → 1 exabyte
DATA IS GETTING BIGGER
• Rapid growth of global data from 2009 to 2020 (1): from 1 to 35 zettabytes
• 70% of the data is generated by individuals (1)
• Global mobile data traffic will surpass 10 exabytes in 2016 (2)
• The number of mobile-connected devices (7 billion) exceeded the world's population in 2012 (2)
• Every minute on the Internet (3): 100,000 tweets, 240,000 pieces of shared Facebook content
(1) CSC report "Big data growth infographic", (2) Cisco Visual Networking Index 2011-2016, (3) Intel
DATA EXPLOSION COMPOUNDS CHALLENGES
80% of the effort involved in dealing with data is cleaning it up in the first place (1)
(1) O'Reilly Media
BIG DATA INCLUDES ALL TYPES OF DATA
Structured
• Pre-defined schema
• Example: relational database systems

Semi-structured
• Inconsistent structure
• Cannot be stored in rows and tables in a typical database
• Examples: logs, tweets, sensor feeds

Unstructured
• Lacks structure, or parts of it lack structure
• Examples: free-form text, reports, customer feedback forms
EXAMPLE: PREDICTING TRENDS AND PREPARING FOR FUTURE DEMANDS
Big data analytics combines enterprise data with other relevant information (web browsing patterns, movie releases, social media sentiment, gaming industry advertising buys) to create predictive models of trends.
A NEW SOLUTION
• Hadoop = HDFS + Map/Reduce
• HDFS provides storage • MapReduce provides analysis
THE HADOOP APPROACH
Distribute large amounts of data across thousands of commodity hardware nodes
• Process data in parallel
• Replicate data across the cluster for reliability

Analysis moved to the data
• Avoids copying data

Scanning of data
• Avoids random disk seeks
• Easiest way to process large files
A NEW PARADIGM
• Process data locally
• Reduce dependence on network bandwidth
• Expect failure
• Handle failover elegantly
• Duplicate finite blocks of data to small groups of nodes (rather than the entire database)
• Reduce elapsed seek time
• Place no conditions on the structure of the data
HADOOP OVERVIEW
[Diagram] The Hadoop stack: Pig, Hive, Chukwa, HBase, Sqoop, and ZooKeeper run on top of MapReduce and HDFS, which spans many physical disks.
HADOOP COMPONENTS (1/2)
Essentials:
• HDFS - a scalable, high-performance distributed file system.
• MapReduce - a Java-based framework for job tracking, node management, and running mapper and reducer tasks.

Frameworks:
• Chukwa - a data collection system for monitoring, displaying, and analyzing logs from large distributed systems.
• Hive - structured data warehousing infrastructure that provides mechanisms for storage, data extraction, transformation, and loading (ETL), and a SQL-like language for querying and analysis.
• HBase - a column-oriented (NoSQL) database designed for real-time storage, retrieval, and search of very large tables (billions of rows, millions of columns) running atop HDFS.
HADOOP COMPONENTS (2/2)
Utilities:
• Pig - a set of tools for programmatic flat-file data analysis that provides a programming language, data transformation, and parallelized processing.
• Sqoop - a tool for transferring data between relational databases and Hadoop or Hive, in both directions, using MapReduce and standard JDBC drivers.
• ZooKeeper - a distributed application management tool used for managing the nodes in a Hadoop computational network.
HDFS: HADOOP DISTRIBUTED FILE SYSTEM
WHAT IS HDFS?
• A scalable, high-performance distributed file system
• Primary storage system for Hadoop
• Fast and reliable
• Designed for consistency
• Presents a single view of multiple physical disks or file systems
• Deployed only on Linux
HDFS CHARACTERISTICS
• Persistent
• Replicated
• Linearly scalable
• Applications sequentially stream reads, often from very large files
• Optimized for read performance; avoids random disk seeks
• Write once, read many times
• Data stored in blocks, distributed over many nodes
• Block sizes often range from 128 MB to 1 GB
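To illustrate the block model described above, here is a small Python sketch (not HDFS source code) of how a large file is split into fixed-size blocks and each block replicated across DataNodes. The block size, replication factor, and DataNode names are illustrative assumptions; real HDFS placement is rack-aware and far more sophisticated.

```python
def place_blocks(file_size, block_size=128 * 1024**2, replication=3,
                 datanodes=("dn1", "dn2", "dn3", "dn4")):
    """Return a mapping of block index -> DataNodes holding a replica."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(n_blocks):
        # Round-robin placement; real HDFS considers racks and node load.
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

# A 300 MB file with 128 MB blocks needs 3 blocks (128 + 128 + 44 MB).
layout = place_blocks(300 * 1024**2)
print(len(layout))   # 3
print(layout[0])     # ['dn1', 'dn2', 'dn3']
```

This mirrors the "data stored in blocks, distributed over many nodes" idea: losing one DataNode still leaves two replicas of every block it held.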
HDFS ARCHITECTURE
[Diagram] A Secondary NameNode backs up the NameNode, which holds the metadata and the block map; blocks (BL1, BL2, …) are replicated across several DataNodes.
HDFS COMPONENTS
NameNode
• Manages DataNodes
• Keeps metadata for all nodes & blocks

DataNodes
• Manage block reads/writes for HDFS
• Manage block replication
• Live on racks (rack-aware data organization)

Client
• Talks directly to the NameNode, then to DataNodes
HDFS VS HBASE

HDFS
• A distributed file system that is well suited for the storage of large files
• It is NOT a general-purpose file system!
• HDFS does not work well with fewer than 5 DataNodes

HBase
• Built on top of HDFS
• Suitable for hundreds of millions or billions of rows
• Should not be used for tables with only a few thousand or million rows
• More a "data store" than a "database"
• RDBMS applications cannot be "ported" to HBase by simply changing a JDBC driver!
MAP/REDUCE: HOW DOES IT WORK?
WHAT IS MAP/REDUCE? (1/2)
• A framework written in Java
• Big Data analytics and processing
• Node-local computation
• Parallel processing
• Handles node fail-over
• It all started when Google needed a way to:
• Determine which web sites to return for searches
• Do page ranking
WHAT IS MAP/REDUCE? (2/2)
• “Map” applies to all the members of the dataset and returns a list of results
• “Reduce” collates and resolves the results from one or more mapping operations executed in parallel
• Very large datasets are split into large subsets called splits
• Separates business logic from multi-processing logic
• MapReduce framework developers focus on process dispatching, locking, and logic flow
• App developers focus on implementing the business logic without worrying about infrastructure or scalability issues
HOW MAP/REDUCE WORKS
[Diagram] Big Data → Map → Reduce → Result: each occurrence of "John" in the input (e.g. “John was ..”, “Hi, John!”) is mapped to the pair ("John", 1); the reduce step sums these pairs into ("John", 3).
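The flow above is the classic word-count pattern. Here is a minimal Python sketch of it (illustrative only, not Hadoop's Java API): the map phase emits ("word", 1) for every occurrence, and the reduce phase groups by key and sums the counts.

```python
from itertools import groupby

def map_phase(records):
    # Emit (word, 1) for every word occurrence, as in the diagram.
    for line in records:
        for word in line.replace(",", "").replace("!", "").split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort: group intermediate pairs by key, then aggregate.
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (key, sum(count for _, count in group))

data = ["John was here", "Hi, John!", "John left"]
result = dict(reduce_phase(map_phase(data)))
print(result["John"])  # 3
```

The sort-then-group step stands in for Hadoop's shuffle phase, which routes all pairs with the same key to the same reducer.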
MAP/REDUCE EXAMPLE (1/2)
Find the maximum temperature for each city out of 5 files.

Input file 1:
Toronto, 20
Dubna, 25
Geneva, 22
Rome, 32
Toronto, 4
Rome, 38
Geneva, 18

Mapper task result for file 1 (local maximum per city):
(Toronto, 20) (Dubna, 25) (Geneva, 22) (Rome, 38)

Let’s assume the other four mapper tasks (working on the other four files not shown here) produced the following intermediate results:
(Toronto, 18) (Geneva, 32) (Rome, 37) (Dubna, 20)
(Geneva, 20) (Rome, 33) (Toronto, 22) (Dubna, 19)
(Rome, 31) (Toronto, 31) (Dubna, 22) (Geneva, 19) (Rome, 30)
(Dubna, 27) (Rome, 38) (Toronto, 32) (Geneva, 33)
MAP/REDUCE EXAMPLE (2/2)
• All five of these output streams are fed into the reduce tasks, which combine the input results and output a single value for each city
Final Result:
(Toronto, 32) (Dubna, 27) (Geneva, 33) (Rome, 38)
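The example can be sketched in a few lines of Python (illustrative, not the Hadoop API): each mapper finds the per-city maximum within its own split, and the reducer takes the maximum across all intermediate result sets. The data below is taken from the slide; the third intermediate set keeps only the larger of its two Rome values, which does not change the result.

```python
def mapper(records):
    # Local maximum per city within one file (split).
    best = {}
    for city, temp in records:
        best[city] = max(best.get(city, temp), temp)
    return best

def reducer(intermediate_results):
    # Global maximum per city across all mapper outputs.
    final = {}
    for partial in intermediate_results:
        for city, temp in partial.items():
            final[city] = max(final.get(city, temp), temp)
    return final

file1 = [("Toronto", 20), ("Dubna", 25), ("Geneva", 22), ("Rome", 32),
         ("Toronto", 4), ("Rome", 38), ("Geneva", 18)]
# Intermediate results of the other four mappers, as on the slide:
others = [{"Toronto": 18, "Geneva": 32, "Rome": 37, "Dubna": 20},
          {"Geneva": 20, "Rome": 33, "Toronto": 22, "Dubna": 19},
          {"Rome": 31, "Toronto": 31, "Dubna": 22, "Geneva": 19},
          {"Dubna": 27, "Rome": 38, "Toronto": 32, "Geneva": 33}]

print(reducer([mapper(file1)] + others))
# {'Toronto': 32, 'Dubna': 27, 'Geneva': 33, 'Rome': 38}
```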
PIG: A HADOOP SCRIPTING LANGUAGE
WHAT IS PIG?
• A high-level data-flow language (Pig Latin) and execution framework for parallel computation
• Pig is made of two main components:
• A SQL-like data processing language called Pig Latin
• A compiler that compiles and runs Pig Latin scripts
• Pig Latin provides:
• Ease of programming: trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks
• Optimization opportunities: permits the system to optimize execution of tasks automatically, allowing the user to focus on semantics rather than efficiency
• Extensibility: users can create their own functions
THE ORIGINS
• Pig was created by Yahoo! to make it easier to analyze the data in HDFS without the complexities of writing a traditional MapReduce program. • With Pig, it is possible to develop MapReduce jobs
with a few lines of Pig Latin
PIG IN THE ECOSYSTEM
• Pig runs on Hadoop, utilizing both HDFS and MapReduce
• By default, Pig reads and writes files from HDFS
• Pig stores intermediate data between MapReduce jobs
[Diagram] Pig sits on top of MapReduce, which runs over HDFS and HBase.
RUNNING PIG
• A Pig Latin script executes in three modes:
1. MapReduce: the code executes as a MapReduce application on a Hadoop cluster (default mode)
2. Local: the code executes locally in a single JVM using a local text file (for development purposes)
3. Interactive: Pig commands are entered manually at a command prompt known as the Grunt shell
PIG EXAMPLES: UNION
grunt> a = LOAD 'A' USING PigStorage(',') AS (a1:int, a2:int, a3:int);
grunt> b = LOAD 'B' USING PigStorage(',') AS (b1:int, b2:int, b3:int);
grunt> DUMP a;
(0,1,2)
(1,3,4)
grunt> DUMP b;
(0,5,2)
(1,7,8)
grunt> c = UNION a, b;
grunt> DUMP c;
(0,1,2)
(0,5,2)
(1,3,4)
(1,7,8)

Note: Pig's UNION operator has no AS clause; since a and b use different field names, refer to the fields of c positionally ($0, $1, $2).
PIG EXAMPLES: SPLIT
grunt> SPLIT c INTO d IF $0 == 0, e IF $0 == 1;
grunt> DUMP d;
(0,1,2)
(0,5,2)
grunt> DUMP e;
(1,3,4)
(1,7,8)
PIG EXAMPLES: FOREACH
grunt> DUMP c;
(0,1,2)
(0,5,2)
(1,3,4)
(1,7,8)
grunt> mult = FOREACH c GENERATE $1, $1 * $2;
grunt> DUMP mult;
(1,2)
(5,10)
(3,12)
(7,56)
EXAMPLE OF A PIG SCRIPT
• Find the top 10 URLS for users between 18 and 25
Users = LOAD 'users' AS (name, age);
FilteredUsers = FILTER Users BY age >= 18 AND age <= 25;
Pages = LOAD 'pages' AS (user, url);
JoinResult = JOIN FilteredUsers BY name, Pages BY user;
Grouped = GROUP JoinResult BY url;
Summed = FOREACH Grouped GENERATE group, COUNT(JoinResult) AS clicks;
Sorted = ORDER Summed BY clicks DESC;
Top10 = LIMIT Sorted 10;
STORE Top10 INTO 'top10sites';
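The same pipeline can be sketched in plain Python to show what each Pig step computes: filter users by age, join with page views, group by URL, count clicks, sort descending, and take the top 10. The user and page data below are made-up illustrative rows.

```python
from collections import Counter

users = [("alice", 22), ("bob", 30), ("carol", 19)]            # (name, age)
pages = [("alice", "/a"), ("alice", "/b"), ("carol", "/a"),
         ("bob", "/a"), ("carol", "/b"), ("alice", "/a")]      # (user, url)

# FILTER Users BY age >= 18 AND age <= 25
filtered = {name for name, age in users if 18 <= age <= 25}

# JOIN + GROUP BY url + COUNT, all in one pass
clicks = Counter(url for user, url in pages if user in filtered)

# ORDER BY clicks DESC + LIMIT 10
top10 = clicks.most_common(10)
print(top10)  # [('/a', 3), ('/b', 2)]
```

In Pig, each of these steps becomes one or more MapReduce jobs generated by the compiler; the script author never writes mapper or reducer code directly.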
HIVE: A DATA WAREHOUSE SYSTEM FOR HADOOP
WHAT IS HIVE?
• Hive is a data warehouse system for Hadoop
• It facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets
• Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL
HIVEQL EXAMPLE
• The underlying table www_access consists of three fields: ip, url, and time.

Number of records:
SELECT COUNT(1) FROM www_access;

Number of unique IPs that accessed the top page:
SELECT COUNT(DISTINCT ip) FROM www_access WHERE url = '/';
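To make the two queries concrete, here is what they compute, sketched over an in-memory table with the same three fields (the rows are illustrative, not real log data):

```python
www_access = [
    {"ip": "10.0.0.1", "url": "/",     "time": "12:00"},
    {"ip": "10.0.0.2", "url": "/",     "time": "12:01"},
    {"ip": "10.0.0.1", "url": "/",     "time": "12:02"},
    {"ip": "10.0.0.3", "url": "/docs", "time": "12:03"},
]

# SELECT COUNT(1) FROM www_access;
total = len(www_access)

# SELECT COUNT(DISTINCT ip) FROM www_access WHERE url = '/';
unique_ips = len({row["ip"] for row in www_access if row["url"] == "/"})

print(total, unique_ips)  # 4 2
```

On a real cluster, Hive compiles each HiveQL statement into MapReduce jobs that scan the table's files in HDFS.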
SUMMARY: WHAT DID WE LEARN?
TO TAKE AWAY
• Data is getting bigger and more complex to handle • Hadoop = HDFS + Map/Reduce • Will Hadoop replace relational databases? No!
QUESTIONS? THANK YOU FOR YOUR ATTENTION!