Hadoop: Introduction
Wojciech Langiewicz, Wrocław Java User Group, 2014
2/39
About me
● Working with Hadoop and Hadoop-related technologies for the last 4 years
● Deployed 2 large clusters; the bigger one had almost 0.5 PB of total storage
● Currently working as a consultant / freelancer in Java and Hadoop
● On-site Hadoop trainings from time to time
● In the meantime, working on Android apps
3/39
Agenda
● Big Data
● Hadoop
● MapReduce basics
● Hadoop processing framework – Map Reduce on YARN
● Hadoop Storage system – HDFS
● Using SQL on Hadoop with Hive
● Connecting Hadoop with RDBMS using Sqoop
● Examples of real Hadoop architectures
4/39
Big Data from technological perspective
● Huge amount of data
● Data collection
● Data processing
● Hardware limitations
● System reliability:
– Partial failures
– Data recoverability
– Consistency
– Scalability
5/39
Approaches to Big Data problem
● Vertical scaling
● Horizontal scaling
● Moving data to processing
● Moving processing close to data
6/39
Hadoop - motivations
● Data won't fit on one machine
● More machines → higher chance of failure
● Sequential disk scans are faster than random seeks
● Batch vs real time processing
● Data processing won't fit on one machine
● Move computation close to data
7/39
Hadoop properties
● Linear scalability
● Distributed
● Shared (almost) nothing architecture
● Whole ecosystem of tools and techniques
● Unstructured data
● Raw data analysis
● Transparent data compression
● Replication at its core
● Self-managing (replication, master election, etc)
● Easy to use
● Massive parallel processing
8/39
Hadoop Architecture
● “Lower” layer: HDFS – data storage and retrieval system
● “Higher” layer: MapReduce – execution engine that relies on HDFS
● Please note that there are other systems that rely on HDFS for data storage, but won't be covered in this presentation
9/39
Map Reduce basics
● Batch processing system
● Handles many distributed systems problems
● Automatic parallelization and distribution
● Fault tolerance
● Job status and monitoring
● Borrows from functional programming
● Based on Google's work: MapReduce: Simplified Data Processing on Large Clusters
10/39
Word Count pseudo code
def map(String key, String value):
    foreach word in value:
        emit(word, 1)

def reduce(String key, int[] values):
    int result = 0
    foreach val in values:
        result += val
    emit(key, result)
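The pseudocode above can be tried out locally. The sketch below is not the Hadoop API, just the same map / shuffle / reduce logic in plain Python, with `run_job` standing in for the framework:

```python
from collections import defaultdict

def map_phase(key, value):
    """Emit (word, 1) for every word in one input line."""
    for word in value.split():
        yield (word, 1)

def reduce_phase(key, values):
    """Sum all the counts emitted for a single word."""
    return (key, sum(values))

def run_job(lines):
    # Shuffle: group all mapper outputs by key, as Hadoop
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for word, count in map_phase(offset, line):
            groups[word].append(count)
    return dict(reduce_phase(w, counts) for w, counts in groups.items())

counts = run_job(["big data", "big hadoop data", "data"])
# counts == {"big": 2, "data": 3, "hadoop": 1}
```

On a real cluster each mapper would see a block of the input file and the shuffle would move data over the network, but the key/value contract is exactly this.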
11/39
Word Count Example
Source: http://xiaochongzhang.me/blog/?p=338
12/39
Hadoop Map Reduce Architecture
Diagram: a Client submits jobs to the Job Tracker, which distributes Map and Reduce tasks across many Task Trackers (one per slave node).
13/39
What can be expressed as MapReduce?
● grep
● sort
● SQL operators, for example:
– GROUP BY
– DISTINCT
– JOIN
● Recommending friends
● Reverting web indexes
● And many more
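To make the SQL examples concrete, here is a hedged sketch (plain Python, not Hive or the Hadoop API) of how DISTINCT and GROUP BY fall out of the same map / shuffle / reduce pattern; the `mapreduce` helper is illustrative:

```python
from collections import defaultdict

def mapreduce(records, mapper, reducer):
    # The shuffle groups all mapped values by key; reducers see one group each.
    groups = defaultdict(list)
    for rec in records:
        for k, v in mapper(rec):
            groups[k].append(v)
    return [reducer(k, vs) for k, vs in sorted(groups.items())]

# SELECT DISTINCT user_id: map emits the key, reduce emits it once per group.
distinct = mapreduce([3, 1, 3, 2, 1],
                     mapper=lambda uid: [(uid, None)],
                     reducer=lambda uid, _: uid)
# distinct == [1, 2, 3]

# SELECT user_id, COUNT(*) ... GROUP BY user_id: reduce aggregates each group.
counts = mapreduce([3, 1, 3, 2, 1],
                   mapper=lambda uid: [(uid, 1)],
                   reducer=lambda uid, ones: (uid, sum(ones)))
# counts == [(1, 2), (2, 1), (3, 2)]
```

JOIN works the same way: both tables are mapped to (join_key, row) pairs and the reducer combines the rows that meet in one group.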
14/39
HDFS – Hadoop Distributed File System
● Optimized for streaming access (prefers throughput over latency, no caching)
● Built-in replication
● One master server storing all metadata (Name Node)
● Multiple slaves that store data and report to master (Data Nodes)
● JBOD optimized
● Works better on a moderate number of large files than on many small files
● Based on Google's work: The Google File System
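The "large files, not small files" point is mostly Name Node memory arithmetic. A commonly cited rule of thumb (an approximation, not an exact constant) is roughly 150 bytes of Name Node RAM per file or block object; the 128 MB block size below is the Hadoop 2 default (older versions used 64 MB):

```python
BYTES_PER_OBJECT = 150        # rough rule of thumb, not an exact figure
BLOCK_SIZE = 128 * 1024**2    # default HDFS block size (Hadoop 2)

def namenode_bytes(num_files, avg_file_size):
    """Rough Name Node memory footprint for a set of similar files."""
    blocks_per_file = max(1, -(-avg_file_size // BLOCK_SIZE))  # ceiling division
    objects = num_files * (1 + blocks_per_file)                # file + its blocks
    return objects * BYTES_PER_OBJECT

# The same 1 GB of data, stored two ways:
one_big = namenode_bytes(1, 1024**3)        # 1 file + 8 blocks  -> ~1.3 KB
many_small = namenode_bytes(1024, 1024**2)  # 1024 files + 1024 blocks -> ~300 KB
```

Two hundred times more metadata for the same bytes of data, all of it held in one machine's RAM, which is why HDFS punishes many small files.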
15/39
HDFS design
16/39
HDFS limitations
● No file updates
● Name Node as SPOF in basic configurations
● Limited security
● Inefficient at handling lots of small files
● No way to provide global synchronization or shared mutable state (this can be an advantage)
17/39
HDFS + MapReduce: Simplified Architecture
Diagram: the Master Node runs the Name Node and the Job Tracker; each Slave Node runs a Data Node and a Task Tracker.
* A real setup will include a few more boxes, but they are omitted here for simplicity
18/39
Hive
● “Data warehousing for Hadoop”
● SQL interface to HDFS files (language is called HiveQL)
● SQL is translated into multiple MR jobs that are executed in order
● Doesn't support UPDATE
● Powerful and easy to use UDF mechanism:
add jar /home/hive/my-udfs.jar;
create temporary function my_lower as 'com.example.Lower';
select my_lower(username) from users;
19/39
Hive components
● Shell – similar to MySQL shell
● Driver – responsible for executing jobs
● Compiler – translates SQL into MR job
● Execution engine – manages jobs and job stages (one SQL usually is translated into multiple MR jobs)
● Metastore – schema, location in HDFS, data format
● JDBC interface – allows for any JDBC compatible client to connect
20/39
Hive examples 1/2
● CREATE TABLE page_view(
    view_time INT, user_id BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING);
● CREATE TABLE users(user_id BIGINT, age INT);
● SELECT * FROM page_view LIMIT 10;
● SELECT
    user_id,
    COUNT(*) AS c
  FROM page_view
  WHERE view_time > 10
  GROUP BY user_id;
21/39
Hive examples 2/2
● CREATE TABLE page_views_age AS
  SELECT
    pv.page_url,
    u.age,
    COUNT(*) AS count
  FROM page_view pv
  JOIN users u ON (u.user_id = pv.user_id)
  GROUP BY pv.page_url, u.age;
22/39
Hive best practices 1/2
● Use partitions, especially on date columns
● Compress where possible
● JOIN optimization: hive.auto.convert.join=true
● Improve parallelism: hive.exec.parallel=true
23/39
Hive best practices 2/2
● SELECT COUNT(DISTINCT user_id) FROM logs;
  – forces every row through a single reducer
● SELECT COUNT(*) FROM (SELECT DISTINCT user_id FROM logs) t;
  – the inner DISTINCT is spread across many reducers first
image source: http://www.slideshare.net/oom65/optimize-hivequeriespptx
24/39
Sqoop
● SQL to Hadoop import/export tool
● Performs a MapReduce query that interacts with target database via JDBC
● Can work with almost all JDBC databases
● Can “natively” import and export Hive tables
● Import supports:
– Full databases
– Full tables
– Query results
● Export can update/append data to SQL tables
25/39
Sqoop examples
● sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
● sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --hive-import
● sqoop export --connect jdbc:mysql://db.example.com/foo --table bar --export-dir /user/hive/warehouse/exportingtable
26/39
Hadoop problems
● Relatively hard to setup – Linux knowledge required
● Hard to find logs – multiple directories on each server
● Name Node can be a SPOF if configured incorrectly
● Not real time – jobs take some setup / warm-up time (other projects try to address that)
● Performance benefits not visible until you exceed 3-5 servers
● Hard to convince people to use it from the start in some projects (Hive via JDBC can help here)
● Relatively complicated configuration management
27/39
Hadoop ecosystem
● HBase – Big Table database
● Spark – Real time query engine
● Flume – log collection
● Impala – similar to Spark
● HUE – web console for Hive (like MySQL Workbench / phpMyAdmin) + user permissions
● Oozie – Job scheduling, orchestration, dependency, etc
28/39
Use case examples
● Generic production snapshot updates
– Using asynchronous mechanisms
– Using a more synchronous approach
● Friends/product recommendations
29/39
Hadoop use case example: snapshots
● Log collection, aggregation
● Periodic batch jobs (hourly, daily)
● Jobs integrate collected logs and production data
● Results from batch jobs feed production system
● Hadoop jobs generate reports for business users
30/39
Hadoop pipeline – feedback loop
Diagram: production systems X and Y generate logs, which go through a RabbitMQ integration step; multiple Rabbit consumers write the logs to HDFS. Daily jobs (Hadoop HDFS + MR) process them, and the results – updated “snapshots” – are stored in an RDBMS (stores models) that feeds the production systems, replacing the current “snapshots” on the production servers.
31/39
Feedback loop using sqoop
Diagram: a Hadoop MR job pulls data in via sqoop import, daily jobs process it on Hadoop (HDFS + MR), and sqoop export pushes the results back to an RDBMS that stores data for the production system.
33/39
How to recommend friends – PYMK 1/4
● Database of users
– CREATE TABLE users (id INT);
● Each user has a list of friends (assume integers)
– CREATE TABLE friends (user1 INT, user2 INT);
● For simplicity: relationship is always bidirectional
● Possible to do in SQL (run on RDBMS or on Hive):
● SELECT users.id, new_friend, COUNT(*) AS common_friends
  FROM users
  JOIN friends f1 …
  JOIN friends f2 …
  …
34/39
PYMK: 2/4 Example
0: 1,2,3
1: 3
2: 1,4,5
3: 0,1
4: 5
5: 2,4

We expect to see the following recommendations:
(1,3), (0,4), (0,5)
35/39
PYMK 3/4
● For each user emit pairs for all his friends
– Example: user X has friends: 1,5,6, we emit: (1,5), (1,6), (5,6)
● Sort all pairs by first user
● Eliminate direct friendships: if e.g. 5 & 6 are already friends, remove that pair
● Sort all pairs by frequency
● Group by each user in pair
36/39
PYMK 4/5 mapper
// user: integer, friends: integer list
function map(user, friends):
    for i = 0 to friends.length - 1:
        emit(user, (1, friends[i]))          // direct friends
        for j = i + 1 to friends.length - 1:
            // indirect friends
            emit(friends[i], (2, friends[j]))
            emit(friends[j], (2, friends[i]))
37/39
PYMK 5/5 reducer
// user: integer, rlist: list of pairs (path_length, rfriend)
function reduce(user, rlist):
    direct = new Set()
    recommended = new Map()
    for (path_length, rfriend) in rlist:
        if path_length == 1:                 // direct friend, never recommend
            direct.add(rfriend)
        if path_length == 2:                 // friend of a friend, count it
            recommended.incrementOrAdd(rfriend)
    for friend in direct:                    // drop existing friends, regardless of
        recommended.remove(friend)           // the order the pairs arrived in
    recommend_list = recommended.toList()
    recommend_list.sortBy(_._2)              // sort by common-friend count
    emit(user, recommend_list.toString())
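The whole PYMK job can be simulated locally. This is a sketch in plain Python rather than the Hadoop API: `run` plays the role of the framework's shuffle, and the tiny test graph at the bottom is my own, not the one from the earlier slide:

```python
from collections import defaultdict

def pymk_map(user, friends):
    """Emit (1, f) markers for direct friends and (2, c) friend-of-friend pairs."""
    for i, f in enumerate(friends):
        yield (user, (1, f))                 # direct friend marker
        for g in friends[i + 1:]:            # f and g share `user` as a friend
            yield (f, (2, g))
            yield (g, (2, f))

def pymk_reduce(user, rlist):
    direct, counts = set(), defaultdict(int)
    for path_length, rfriend in rlist:
        if path_length == 1:
            direct.add(rfriend)              # never recommend an existing friend
        else:
            counts[rfriend] += 1             # one count per shared friend
    recs = [(f, c) for f, c in counts.items() if f not in direct and f != user]
    return user, sorted(recs, key=lambda fc: -fc[1])

def run(adjacency):
    # Shuffle: group mapper output by key, then reduce each group.
    groups = defaultdict(list)
    for user, friends in adjacency.items():
        for k, v in pymk_map(user, friends):
            groups[k].append(v)
    return dict(pymk_reduce(u, vs) for u, vs in groups.items())

# 0-1, 0-2, 1-2, 2-3: user 0 should be recommended user 3 (common friend: 2).
recs = run({0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]})
# recs[0] == [(3, 1)]
```

Each recommendation carries its common-friend count, so the sort surfaces the strongest candidates first, just as the reducer slide describes.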
38/39
Additional sources
● Data-Intensive Text Processing with MapReduce: http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
● Programming Hive: http://shop.oreilly.com/product/0636920023555.do
● Cloudera Quick Start VM: http://www.cloudera.com/content/support/en/downloads/quickstart_vms/cdh-5-1-x1.html
● Hadoop: The Definitive Guide: http://shop.oreilly.com/product/0636920021773.do
39/39
Thanks! Time for questions