The Hadoop Ecosystem for Developers


Transcript of The Hadoop Ecosystem for Developers

Page 1: The Hadoop Ecosystem for Developers

The Hadoop Ecosystem

Zohar Elkayam & Ronen Fidel

Brillix

Page 2: The Hadoop Ecosystem for Developers

Agenda

• Big Data – The Challenge

• Introduction to Hadoop

– Deep dive into HDFS

– MapReduce and YARN

• Improving Hadoop: tools and extensions

• NoSQL and RDBMS

2

Page 3: The Hadoop Ecosystem for Developers

About Brillix

• Brillix is a leading company specializing in Data Management

• We provide professional services and consulting for Databases, Security and Big Data solutions

3

Page 4: The Hadoop Ecosystem for Developers

Who am I?

• Zohar Elkayam, CTO at Brillix

• DBA, team leader, instructor and a senior consultant for over 17 years

• Oracle ACE Associate

• Involved with Big Data projects since 2011

• Blogger – www.realdbamagic.com

4

Page 5: The Hadoop Ecosystem for Developers

Big Data

Page 6: The Hadoop Ecosystem for Developers

"Big Data"??

Different definitions

“Big data exceeds the reach of commonly used hardware environments

and software tools to capture, manage, and process it within a tolerable

elapsed time for its user population.” - Teradata Magazine article, 2011

“Big data refers to data sets whose size is beyond the ability of typical

database software tools to capture, store, manage and analyze.”

- The McKinsey Global Institute, 2012

“Big data is a collection of data sets so large and complex that it

becomes difficult to process using on-hand database management

tools.” - Wikipedia, 2014

6

Page 7: The Hadoop Ecosystem for Developers
Page 8: The Hadoop Ecosystem for Developers

A Success Story

8

Page 9: The Hadoop Ecosystem for Developers

More success stories

9

Page 10: The Hadoop Ecosystem for Developers

MORE stories..

• Crime Prevention in Los Angeles

• Diagnosis and treatment of genetic diseases

• Investments in the financial sector

• Generation of personalized advertising

• Astronomical discoveries

10

Page 11: The Hadoop Ecosystem for Developers

Examples of Big Data Use Cases Today

MEDIA/ENTERTAINMENT – Viewers / advertising effectiveness

COMMUNICATIONS – Location-based advertising

EDUCATION & RESEARCH – Experiment sensor analysis

CONSUMER PACKAGED GOODS – Sentiment analysis of what’s hot, problems

HEALTH CARE – Patient sensors, monitoring, EHRs; Quality of care

LIFE SCIENCES – Clinical trials; Genomics

HIGH TECHNOLOGY / INDUSTRIAL MFG. – Mfg quality; Warranty analysis

OIL & GAS – Drilling exploration sensor analysis

FINANCIAL SERVICES – Risk & portfolio analysis; New products

AUTOMOTIVE – Auto sensors reporting location, problems

RETAIL – Consumer sentiment; Optimized marketing

LAW ENFORCEMENT & DEFENSE – Threat analysis – social media monitoring, photo analysis

TRAVEL & TRANSPORTATION – Sensor analysis for optimal traffic flows; Customer sentiment

UTILITIES – Smart Meter analysis for network capacity

ON-LINE SERVICES / SOCIAL MEDIA – People & career matching; Web-site optimization

11

Page 12: The Hadoop Ecosystem for Developers

Most Requested Uses of Big Data

• Log Analytics & Storage

• Smart Grid / Smarter Utilities

• RFID Tracking & Analytics

• Fraud / Risk Management & Modeling

• 360° View of the Customer

• Warehouse Extension

• Email / Call Center Transcript Analysis

• Call Detail Record Analysis

12

Page 13: The Hadoop Ecosystem for Developers

The Challenge

Page 14: The Hadoop Ecosystem for Developers

Big Data Big Problems

• Unstructured
• Unprocessed
• Un-aggregated
• Un-filtered
• Repetitive
• Low quality
• And generally messy

Oh, and there is a lot of it

14

Page 15: The Hadoop Ecosystem for Developers

The Big Data Challenge

15

Page 16: The Hadoop Ecosystem for Developers

Big Data: Challenge to Value

Challenges (Today): High Volume, High Variety, High Velocity

Business Value (Tomorrow): Deep Analytics, High Agility, Massive Scalability, Real Time

16

Page 17: The Hadoop Ecosystem for Developers

Volume

• Big data comes in one size: Big.

• Size is measured in Terabytes (10^12), Petabytes (10^15), Exabytes (10^18), Zettabytes (10^21)

• The storing and handling of the data becomes an issue

• Producing value out of the data in a reasonable time is an issue

17

Page 18: The Hadoop Ecosystem for Developers

Some Numbers

• How much data in the world?
  – 800 Terabytes, 2000
  – 160 Exabytes, 2006 (1 EB = 10^18 B)
  – 4.5 Zettabytes, 2012 (1 ZB = 10^21 B)
  – 44 Zettabytes by 2020

• How much is a zettabyte?
  – 1,000,000,000,000,000,000,000 bytes
  – A stack of 1TB hard disks that is 25,400 km high

18

Page 19: The Hadoop Ecosystem for Developers

Data grows fast!

19

Page 20: The Hadoop Ecosystem for Developers

Growth Rate

How much data generated in a day?

– 7 TB, Twitter

– 10 TB, Facebook

20

Page 21: The Hadoop Ecosystem for Developers

Variety

• Big Data extends beyond structured data to include semi-structured and unstructured information: logs, text, audio and video

• Wide variety of rapidly evolving data types requires highly flexible stores and handling

21

Page 22: The Hadoop Ecosystem for Developers

Structured & Un-Structured

Un-Structured            Structured
Objects                  Tables
Flexible                 Columns and Rows
Structure Unknown        Predefined Structure
Textual and Binary       Mostly Textual

22

Page 23: The Hadoop Ecosystem for Developers

Big Data is ANY data:

Unstructured, Semi-Structured and Structured

• Some has a fixed structure

• Some is “bring your own structure”

• We want to find value in all of it

23

Page 24: The Hadoop Ecosystem for Developers

Data Types by Industry

24

Page 25: The Hadoop Ecosystem for Developers

Velocity

• The speed in which the data is being generated and collected

• Streaming data and large volume data movement

• High velocity of data capture – requires rapid ingestion

• Might cause the backlog problem

25

Page 26: The Hadoop Ecosystem for Developers

Global Internet Device Forecast

26

Page 27: The Hadoop Ecosystem for Developers

Internet of Things

27

Page 28: The Hadoop Ecosystem for Developers

Veracity

• Quality of the data can vary greatly

• Data sources might be messy or corrupted

28

Page 29: The Hadoop Ecosystem for Developers

So, What Defines Big Data?

• When we think that we can produce value from that data and want to handle it

• When the data is too big or moves too fast to handle in a sensible amount of time

• When the data doesn’t fit conventional database structure

• When the solution becomes part of the problem

29

Page 30: The Hadoop Ecosystem for Developers

Handling Big Data

Page 31: The Hadoop Ecosystem for Developers
Page 32: The Hadoop Ecosystem for Developers

Big Data in Practice

• Big data is big: technological infrastructure solutions needed

• Big data is messy: data sources must be cleaned before use

• Big data is complicated: need developers and system admins to manage intake of data

32

Page 33: The Hadoop Ecosystem for Developers

Big Data in Practice (cont.)

• Data must be broken out of silos in order to be mined, analyzed and transformed into value

• The organization must learn how to communicate and interpret the results of analysis

33

Page 34: The Hadoop Ecosystem for Developers

Infrastructure Challenges

• Infrastructure that is built for:

– Large-scale

– Distributed

– Data-intensive jobs that spread the problem across clusters of server nodes

34

Page 35: The Hadoop Ecosystem for Developers

Infrastructure Challenges (cont.)

• Storage:

  – Efficient and cost-effective enough to capture and store terabytes, if not petabytes, of data

  – With intelligent capabilities to reduce your data footprint, such as:

• Data compression

• Automatic data tiering

• Data deduplication

35

Page 36: The Hadoop Ecosystem for Developers

Infrastructure Challenges (cont.)

• Network infrastructure that can quickly import large data sets and then replicate it to various nodes for processing

• Security capabilities that protect highly-distributed infrastructure and data

36

Page 37: The Hadoop Ecosystem for Developers

Introduction To Hadoop

Page 38: The Hadoop Ecosystem for Developers

Apache Hadoop

• Open source project run by Apache (2006)

• Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure

• It has been the driving force behind the growth of the big data industry

• Get the public release from: http://hadoop.apache.org/core/

38

Page 39: The Hadoop Ecosystem for Developers

Hadoop Creation History

39

Page 40: The Hadoop Ecosystem for Developers

Key points

• An open-source framework that uses a simple programming model to enable distributed processing of large data sets on clusters of computers.

• The complete technology stack includes

– common utilities

– a distributed file system

– analytics and data storage platforms

– an application layer that manages distributed processing, parallel computation, workflow, and configuration management

• More cost-effective than conventional approaches for handling large unstructured data sets, and it offers massive scalability and speed

40

Page 41: The Hadoop Ecosystem for Developers

Why use Hadoop?

• Scalability – near-linear performance up to 1000s of nodes

• Cost – leverages commodity HW & open source SW

• Flexibility – versatility with data, analytics & operation

41

Page 42: The Hadoop Ecosystem for Developers

No, really, why use Hadoop?

• Need to process multi-petabyte datasets
• Expensive to build reliability into each application
• Nodes fail every day
  – Failure is expected, rather than exceptional
  – The number of nodes in a cluster is not constant

• Need a common infrastructure
  – Efficient, reliable, Open Source Apache License

• The above goals are the same as Condor, but
  – Workloads are IO bound and not CPU bound

42

Page 43: The Hadoop Ecosystem for Developers

Hadoop Benefits

• Reliable solution based on unreliable hardware
• Designed for large files
• Load data first, structure later
• Designed to maximize throughput of large scans
• Designed to leverage parallelism
• Designed to scale
• Flexible development platform
• Solution Ecosystem

43

Page 44: The Hadoop Ecosystem for Developers

Hadoop Limitations

• Hadoop is scalable but it’s not fast

• Some assembly required

• Batteries not included

• Instrumentation not included either

• DIY mindset

44

Page 45: The Hadoop Ecosystem for Developers

Hadoop Components

Page 46: The Hadoop Ecosystem for Developers

Hadoop Main Components

• HDFS: Hadoop Distributed File System – a distributed file system that runs in a clustered environment.

• MapReduce – a programming paradigm for running processes over clustered environments.

47

Page 47: The Hadoop Ecosystem for Developers

HDFS is...

• A distributed file system

• Redundant storage

• Designed to reliably store data using commodity hardware

• Designed to expect hardware failures

• Intended for large files

• Designed for batch inserts

• The Hadoop Distributed File System

48

Page 48: The Hadoop Ecosystem for Developers

HDFS Node Types

HDFS has three types of Nodes

• Namenode (MasterNode)
  – Distributes files in the cluster
  – Responsible for replication between the datanodes and for file block locations

• Datanodes
  – Responsible for the actual file storage
  – Serve file data to clients

• BackupNode (version 0.23 and up)
  – A backup of the NameNode

49

Page 49: The Hadoop Ecosystem for Developers

Typical implementation

• Nodes are commodity PCs

• 30-40 nodes per rack

• Uplink from racks is 3-4 gigabit

• Rack-internal is 1 gigabit

50

Page 50: The Hadoop Ecosystem for Developers

MapReduce is...

• A programming model for expressing distributed computations at a massive scale

• An execution framework for organizing and performing such computations

• An open-source implementation called Hadoop

51

Page 51: The Hadoop Ecosystem for Developers

MapReduce paradigm

• Implement two functions:

  – MAP: takes a large problem, divides it into sub-problems, and performs the same function on each sub-problem
    Map(k1, v1) -> list(k2, v2)

  – REDUCE: combines the output from all sub-problems
    Reduce(k2, list(v2)) -> list(v3)

• Framework handles everything else (almost)

• Values with the same key must go to the same reducer

52

Page 52: The Hadoop Ecosystem for Developers

Typical large-data problem

• Iterate over a large number of records

• Extract something of interest from each (Map)

• Shuffle and sort intermediate results

• Aggregate intermediate results (Reduce)

• Generate final output

53

Page 53: The Hadoop Ecosystem for Developers

Divide and Conquer

54

Page 54: The Hadoop Ecosystem for Developers

MapReduce - word count example

function map(String name, String document):
    for each word w in document:
        emit(w, 1)

function reduce(String word, Iterator partialCounts):
    totalCount = 0
    for each count in partialCounts:
        totalCount += count
    emit(word, totalCount)

55

Page 55: The Hadoop Ecosystem for Developers

MapReduce Word Count Process

56

Page 56: The Hadoop Ecosystem for Developers

MapReduce Advantages

Example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc

• Runs programs (jobs) across many computers

• Protects against single server failure by re-running failed steps

• MR jobs can be written in Java, C, Python, Ruby and others

• Users only write Map and Reduce functions

57

Page 57: The Hadoop Ecosystem for Developers

MapReduce is good for...

• Embarrassingly parallel algorithms

• Summing, grouping, filtering, joining

• Off-line batch jobs on massive data sets

• Analyzing an entire large dataset

58

Page 58: The Hadoop Ecosystem for Developers

MapReduce is OK for...

• Iterative jobs (i.e., graph algorithms)

• Each iteration must read/write data to disk

• IO and latency cost of an iteration is high

59

Page 59: The Hadoop Ecosystem for Developers

MapReduce is NOT good for...

• Jobs that need shared state/coordination

• Tasks are shared-nothing

• Shared-state requires scalable state store

• Low-latency jobs

• Jobs on small datasets

• Finding individual records

60

Page 60: The Hadoop Ecosystem for Developers

Deep Dive into HDFS

Page 61: The Hadoop Ecosystem for Developers

HDFS

• Appears as a single disk

• Runs on top of a native filesystem
  – Ext3, Ext4, XFS

• Fault tolerant
  – Can handle disk crashes, machine crashes, etc...

• Based on Google's Filesystem (GFS or GoogleFS)
  – gfs-sosp2003.pdf: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/gfs-sosp2003.pdf
  – http://en.wikipedia.org/wiki/Google_File_System

62

Page 62: The Hadoop Ecosystem for Developers

HDFS is Good for...

• Storing large files
  – Terabytes, Petabytes, etc...
  – Millions rather than billions of files
  – 100MB or more per file

• Streaming data
  – Write-once and read-many-times patterns
  – Optimized for streaming reads rather than random reads
  – Append operation added in Hadoop 0.21

• “Cheap” Commodity Hardware
  – No need for super-computers, use less reliable commodity hardware

63

Page 63: The Hadoop Ecosystem for Developers

HDFS is not so good for...

• Low-latency reads
  – High throughput rather than low latency for small chunks of data
  – HBase addresses this issue

• Large amounts of small files
  – Better for millions of large files instead of billions of small files
    • For example, each file can be 100MB or more

• Multiple writers
  – Single writer per file
  – Writes only at the end of the file, no support for arbitrary offsets

64

Page 64: The Hadoop Ecosystem for Developers

HDFS: Hadoop Distributed File System

• A given file is broken down into blocks (default=64MB), then blocks are replicated across cluster (default=3)

• Optimized for:
  – Throughput
  – Put/Get/Delete
  – Appends

• Block replication for:
  – Durability
  – Availability
  – Throughput

• Block Replicas are distributed across servers and racks

65

Page 65: The Hadoop Ecosystem for Developers

HDFS Architecture

• Name Node: Maps a file to a file-id and a list of Data Nodes

• Data Node : Maps a block-id to a physical location on disk

• Secondary Name Node: Periodic merge of Transaction log

66

Page 66: The Hadoop Ecosystem for Developers

HDFS Daemons

• The filesystem cluster is managed by three types of processes

  – Namenode
    • Manages the file system's namespace / meta-data / file blocks
    • Runs on 1 machine to several machines

  – Datanode
    • Stores and retrieves data blocks
    • Reports to the Namenode
    • Runs on many machines

  – Secondary Namenode
    • Performs housekeeping work so the Namenode doesn't have to
    • Requires similar hardware to the Namenode machine
    • Not used for high-availability – not a backup for the Namenode

67

Page 67: The Hadoop Ecosystem for Developers

Files and Blocks

• Files are split into blocks (single unit of storage)

– Managed by Namenode, stored by Datanode

– Transparent to user

• Replicated across machines at load time

– Same block is stored on multiple machines

– Good for fault-tolerance and access

– Default replication is 3

68

Page 68: The Hadoop Ecosystem for Developers

HDFS Blocks

• Blocks are traditionally either 64MB or 128MB
  – Default is 128MB

• The motivation is to minimize the cost of seeks compared to the transfer rate
  – 'Time to transfer' > 'Time to seek'

• For example, let's say
  – Seek time = 10ms
  – Transfer rate = 100 MB/s

• To keep the seek time at 1% of the transfer time, the block size will need to be ≈ 100MB (worked out below)

69

Page 69: The Hadoop Ecosystem for Developers

Block Replication

• Namenode determines replica placement

• Replica placement is rack aware
  – Balance between reliability and performance
    • Attempts to reduce bandwidth
    • Attempts to improve reliability by putting replicas on multiple racks
  – Default replication is 3
    • 1st replica on the local rack
    • 2nd replica on the local rack but a different machine
    • 3rd replica on a different rack
  – This policy may change/improve in the future

70

Page 70: The Hadoop Ecosystem for Developers

Data Correctness

• Uses checksums to validate data
  – Uses CRC32

• File creation
  – Client computes a checksum per 512 bytes
  – Data Node stores the checksums

• File access
  – Client retrieves the data and checksum from the Data Node
  – If validation fails, the client tries other replicas

71

Page 71: The Hadoop Ecosystem for Developers

Data Pipelining

• Client retrieves a list of Data Nodes on which to place replicas of a block

• Client writes block to the first Data Node

• The first Data Node forwards the data to the next Data Node in the Pipeline

• When all replicas are written, the Client moves on to write the next block in file

72

Page 72: The Hadoop Ecosystem for Developers

Client, Namenode, and Datanodes

• Namenode does NOT directly write or read data

– One of the reasons for HDFS’s Scalability

• Client interacts with Namenode to update Namenode’s HDFS namespace and retrieve block locations for writing and reading

• Client interacts directly with Datanode to read/write data

73

Page 73: The Hadoop Ecosystem for Developers

Name Node Metadata

• Meta-data in memory
  – The entire metadata is kept in main memory
  – No demand paging of meta-data

• Types of metadata
  – List of files
  – List of blocks for each file
  – List of Data Nodes for each block
  – File attributes, e.g. creation time, replication factor

• A Transaction Log
  – Records file creations, file deletions, etc.

74

Page 74: The Hadoop Ecosystem for Developers

Namenode Memory Concerns

• For fast access the Namenode keeps all block metadata in-memory
  – The bigger the cluster, the more RAM is required

• Best for millions of large files (100MB or more) rather than billions

• Works well for clusters of 100s of machines

• Hadoop 2+
  – Namenode Federation
    • Each namenode hosts part of the blocks
    • Horizontally scales the Namenode
  – Support for 1000+ machine clusters

75

Page 75: The Hadoop Ecosystem for Developers

Using HDFS

Page 76: The Hadoop Ecosystem for Developers

Reading Data from HDFS

1. Create FileSystem

2. Open InputStream to a Path

3. Copy bytes using IOUtils

4. Close Stream

77

Page 77: The Hadoop Ecosystem for Developers

1: Create FileSystem

• FileSystem fs = FileSystem.get(new Configuration());

– If you run with the yarn command, a DistributedFileSystem (HDFS) instance will be created

• Utilizes the fs.default.name property from the configuration

• Recall that the Hadoop framework loads core-site.xml, which sets the property to HDFS (hdfs://localhost:8020) – a short sketch follows
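A minimal sketch of what happens (assuming core-site.xml on the classpath sets fs.default.name to an HDFS URI; error handling omitted):

    Configuration conf = new Configuration();        // loads core-site.xml (and friends) from the classpath
    System.out.println(conf.get("fs.default.name")); // e.g. hdfs://localhost:8020
    FileSystem fs = FileSystem.get(conf);            // an hdfs:// default FS yields a DistributedFileSystem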

78

Page 78: The Hadoop Ecosystem for Developers

2: Open Input Stream to a Path

...
InputStream input = null;
try {
    input = fs.open(fileToRead);
...

• fs.open returns org.apache.hadoop.fs.FSDataInputStream
  – Other FileSystem implementations return their own custom implementation of InputStream

• Opens the stream with a default buffer of 4k

• If you want to provide your own buffer size use
  – fs.open(Path f, int bufferSize)

79

Page 79: The Hadoop Ecosystem for Developers

3: Copy bytes using IOUtils

IOUtils.copyBytes(inputStream, outputStream, buffer);

• Copy bytes from InputStream to OutputStream

• Hadoop’s IOUtils makes the task simple

– buffer parameter specifies number of bytes to buffer at a time

80

Page 80: The Hadoop Ecosystem for Developers

4: Close Stream

...

} finally {

IOUtils.closeStream(input);

...

• Utilize IOUtils to avoid boiler plate code that catches IOException

81

Page 81: The Hadoop Ecosystem for Developers

ReadFile.java Example

public class ReadFile {
    public static void main(String[] args) throws IOException {
        Path fileToRead = new Path("/user/sample/sonnets.txt");
        FileSystem fs = FileSystem.get(new Configuration()); // 1: Open FileSystem
        InputStream input = null;
        try {
            input = fs.open(fileToRead);                     // 2: Open InputStream
            IOUtils.copyBytes(input, System.out, 4096);      // 3: Copy from Input to Output
        } finally {
            IOUtils.closeStream(input);                      // 4: Close stream
        }
    }
}

$ yarn jar my-hadoop-examples.jar hdfs.ReadFile

82

Page 82: The Hadoop Ecosystem for Developers

Reading Data - Seek

• FileSystem.open returns FSDataInputStream

– Extension of java.io.DataInputStream

– Supports random access and reading via interfaces:

• PositionedReadable : read chunks of the stream

• Seekable : seek to a particular position in the stream

83

Page 83: The Hadoop Ecosystem for Developers

Seeking to a Position

• FSDataInputStream implements the Seekable interface

  – void seek(long pos) throws IOException
    • Seek to a particular position in the file
    • The next read will begin at that position
    • If you attempt to seek past the file boundary an IOException is thrown
    • Somewhat expensive operation – strive for streaming and not seeking

  – long getPos() throws IOException
    • Returns the current position/offset from the beginning of the stream/file

84

Page 84: The Hadoop Ecosystem for Developers

SeekReadFile.java Example

public class SeekReadFile {
    public static void main(String[] args) throws IOException {
        Path fileToRead = new Path("/user/sample/readMe.txt");
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream input = null;
        try {
            input = fs.open(fileToRead);
            System.out.print("start position=" + input.getPos() + ": ");
            IOUtils.copyBytes(input, System.out, 4096, false);
            input.seek(11);
            System.out.print("start position=" + input.getPos() + ": ");
            IOUtils.copyBytes(input, System.out, 4096, false);
            input.seek(0);
            System.out.print("start position=" + input.getPos() + ": ");
            IOUtils.copyBytes(input, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(input);
        }
    }
}

85

Page 85: The Hadoop Ecosystem for Developers

Run SeekReadFile Example

$ yarn jar my-hadoop-examples.jar hdfs.SeekReadFile

start position=0: Hello from readme.txt

start position=11: readme.txt

start position=0: Hello from readme.txt

86

Page 86: The Hadoop Ecosystem for Developers

Write Data

1. Create FileSystem instance

2. Open OutputStream

– FSDataOutputStream in this case

– Open a stream directly to a Path from FileSystem

– Creates all needed directories on the provided path

3. Copy data using IOUtils

87

Page 87: The Hadoop Ecosystem for Developers

WriteToFile.java Example

public class WriteToFile {
    public static void main(String[] args) throws IOException {
        String textToWrite = "Hello HDFS! Elephants are awesome!\n";
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(textToWrite.getBytes()));
        Path toHdfs = new Path("/user/sample/writeMe.txt");
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);          // 1: Create FileSystem instance
        FSDataOutputStream out = fs.create(toHdfs);    // 2: Open OutputStream
        IOUtils.copyBytes(in, out, conf);              // 3: Copy Data
    }
}

88

Page 88: The Hadoop Ecosystem for Developers

Run WriteToFile

$ yarn jar my-hadoop-examples.jar hdfs.WriteToFile

$ hdfs dfs -cat /user/sample/writeMe.txt

Hello HDFS! Elephants are awesome!

90

Page 89: The Hadoop Ecosystem for Developers

MapReduce and YARN

Page 90: The Hadoop Ecosystem for Developers

Hadoop MapReduce

• Model for processing large amounts of data in parallel
  – On commodity hardware
  – Lots of nodes

• Derived from functional programming
  – Map and reduce functions

• Can be implemented in multiple languages
  – Java, C++, Ruby, Python (etc...)

92

Page 91: The Hadoop Ecosystem for Developers

The MapReduce Model

• Imposes key-value input/output

• Defines map and reduce functions

map: (K1,V1) → list(K2,V2)
reduce: (K2,list(V2)) → list(K3,V3)

1. Map function is applied to every input key-value pair
2. Map function generates intermediate key-value pairs
3. Intermediate key-values are sorted and grouped by key
4. Reduce is applied to sorted and grouped intermediate key-values
5. Reduce emits result key-values

93

Page 92: The Hadoop Ecosystem for Developers

MapReduce Programming Model

94

Page 93: The Hadoop Ecosystem for Developers

MapReduce in Hadoop (1)

95

Page 94: The Hadoop Ecosystem for Developers

MapReduce in Hadoop (2)

96

Page 95: The Hadoop Ecosystem for Developers

MapReduce Framework

• Takes care of distributed processing and coordination

• Scheduling
  – Jobs are broken down into smaller chunks called tasks
  – These tasks are scheduled

• Task localization with data
  – The framework strives to place tasks on the nodes that host the segment of data to be processed by that specific task
  – Code is moved to where the data is

97

Page 96: The Hadoop Ecosystem for Developers

MapReduce Framework

• Error Handling

– Failures are an expected behavior so tasks are automatically re-tried on other machines

• Data Synchronization

– Shuffle and Sort barrier re-arranges and moves data between machines

– Input and output are coordinated by the framework

98

Page 97: The Hadoop Ecosystem for Developers

Map Reduce 2.0 on YARN

• Yet Another Resource Negotiator (YARN)

• Various applications can run on YARN
  – MapReduce is just one choice (the main choice at this point)
  – http://wiki.apache.org/hadoop/PoweredByYarn

• YARN was designed to address issues with MapReduce1
  – Scalability issues (max ~4,000 machines)
  – Inflexible resource management
    • MapReduce1 had a slot-based model

99

Page 98: The Hadoop Ecosystem for Developers

MapReduce1 vs. YARN

• MapReduce1 runs on top of JobTracker and TaskTracker daemons
  – JobTracker schedules tasks, matches tasks with TaskTrackers
  – JobTracker manages MapReduce jobs, monitors progress
  – JobTracker recovers from errors, restarts failed and slow tasks

• MapReduce1 has an inflexible slot-based memory management model
  – Each TaskTracker is configured at start-up to have N slots
  – A task is executed in a single slot
  – Slots are configured with maximum memory on cluster start-up
  – The model is likely to cause over- and under-utilization issues

100

Page 99: The Hadoop Ecosystem for Developers

MapReduce1 vs. YARN (cont.)

• YARN addresses shortcomings of MapReduce1

  – JobTracker is split into 2 daemons
    • ResourceManager – administers resources on the cluster
    • ApplicationMaster – manages applications such as MapReduce

  – Fine-grained memory management model
    • ApplicationMaster requests resources by asking for “containers” with a certain memory limit (e.g. 2GB)
    • YARN administers these containers and enforces memory usage
    • Each Application/Job has control of how much memory to request

101

Page 100: The Hadoop Ecosystem for Developers

Daemons

• YARN Daemons

  – Node Manager
    • Manages resources of a single node
    • There is one instance per node in the cluster

  – Resource Manager
    • Manages resources for a cluster
    • Instructs the Node Manager to allocate resources
    • Applications negotiate for resources with the Resource Manager
    • There is only one instance of the Resource Manager

• MapReduce-specific Daemon

  – MapReduce History Server
    • Archives jobs' metrics and meta-data

102

Page 101: The Hadoop Ecosystem for Developers

Old vs. New Java API

• There are two flavors of the MapReduce API, which became known as Old and New

• Old API classes reside under
  – org.apache.hadoop.mapred

• New API classes can be found under
  – org.apache.hadoop.mapreduce
  – org.apache.hadoop.mapreduce.lib

• We will use the new API exclusively (the imports are listed below)
• The new API was re-designed for easier evolution
• Early Hadoop versions deprecated the old API but the deprecation was removed
• Do not mix the new and old APIs
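For reference, these are the new-API packages the examples in this deck draw on (a minimal sketch of the typical import list, not taken from the original slides):

    // New API (org.apache.hadoop.mapreduce) classes used throughout the examples
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    // Old-API equivalents live under org.apache.hadoop.mapred.* -- do not mix the two in one job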

103

Page 102: The Hadoop Ecosystem for Developers

Developing First MapReduce Job

Page 103: The Hadoop Ecosystem for Developers

MapReduce

• Divided into two phases
  – Map phase

– Reduce phase

• Both phases use key-value pairs as input and output

• The implementer provides map and reduce functions

• The MapReduce framework orchestrates the splitting and distributing of the Map and Reduce phases
  – Most of the pieces can be easily overridden

105

Page 104: The Hadoop Ecosystem for Developers

MapReduce

• Job – execution of map and reduce functions to accomplish a task

– Equal to Java’s main

• Task – single Mapper or Reducer

– Performs work on a fragment of data

106

Page 105: The Hadoop Ecosystem for Developers

Map Reduce Flow of Data

107

Page 106: The Hadoop Ecosystem for Developers

First Map Reduce Job

• StartsWithCount Job

– Input is a body of text from HDFS

• In this case hamlet.txt

– Split text into tokens

– For each first letter sum up all occurrences

– Output to HDFS

108

Page 107: The Hadoop Ecosystem for Developers

Word Count Job

109

Page 108: The Hadoop Ecosystem for Developers

Starts With Count Job

1. Configure the Job
   – Specify Input, Output, Mapper, Reducer and Combiner

2. Implement Mapper
   – Input is text – a line from hamlet.txt
   – Tokenize the text and emit the first character with a count of 1 – <token, 1>

3. Implement Reducer
   – Sum up counts for each letter
   – Write out the result to HDFS

4. Run the job

110

Page 109: The Hadoop Ecosystem for Developers

1: Configure Job

• Job class
  – Encapsulates information about a job
  – Controls execution of the job

Job job = Job.getInstance(getConf(), "StartsWithCount");

• A job is packaged within a jar file
  – The Hadoop framework distributes the jar on your behalf
  – Needs to know which jar file to distribute
  – The easiest way to specify the jar that your job resides in is by calling job.setJarByClass

job.setJarByClass(getClass());

  – Hadoop will locate the jar file that contains the provided class

111

Page 110: The Hadoop Ecosystem for Developers

1: Configure Job - Specify Input

TextInputFormat.addInputPath(job, new Path(args[0]));

job.setInputFormatClass(TextInputFormat.class);

• Can be a file, directory or a file pattern
  – A directory is converted to a list of files as an input

• Input is specified by an implementation of InputFormat – in this case TextInputFormat
  – Responsible for creating splits and a record reader
  – Controls the input types of key-value pairs, in this case LongWritable and Text
  – The file is broken into lines; the mapper will receive 1 line at a time

112

Page 111: The Hadoop Ecosystem for Developers

Side Note – Hadoop IO Classes

• Hadoop uses its own serialization mechanism for writing data in and out of the network, databases or files
  – Optimized for network serialization
  – A set of basic types is provided
  – Easy to implement your own

• org.apache.hadoop.io package (see the short example below)
  – LongWritable for Long
  – IntWritable for Integer
  – Text for String
  – Etc...
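For illustration (a minimal sketch, not from the slides), the wrappers are created and unwrapped like so:

    IntWritable one = new IntWritable(1);          // wraps an int
    LongWritable offset = new LongWritable(42L);   // wraps a long
    Text word = new Text("hadoop");                // Hadoop's Text instead of String
    int n = one.get();                             // back to a Java primitive
    String s = word.toString();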

113

Page 112: The Hadoop Ecosystem for Developers

1: Configure Job – Specify Output

TextOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputFormatClass(TextOutputFormat.class);

• OutputFormat defines the specification for outputting data from a Map/Reduce job

• The Count job utilizes an implementation of OutputFormat – TextOutputFormat
  – Defines the output path where the reducer should place its output
    • If the path already exists then the job will fail
  – Each reducer task writes to its own file
    • By default a job is configured to run with a single reducer
  – Writes key-value pairs as plain text

114

Page 113: The Hadoop Ecosystem for Developers

1: Configure Job – Specify Output

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

• Specify the output key and value types for both the mapper and reducer functions
  – Many times they are the same type
  – If the types differ then use (see the sketch below)
    • setMapOutputKeyClass()
    • setMapOutputValueClass()
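A minimal sketch of a hypothetical job whose mapper emits <Text, LongWritable> while the reducer emits <Text, Text> (the type choices here are illustrative, not from the slides):

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LongWritable.class);
    job.setOutputKeyClass(Text.class);      // reducer output key
    job.setOutputValueClass(Text.class);    // reducer output value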

115

Page 114: The Hadoop Ecosystem for Developers

1: Configure Job

• Specify Mapper, Reducer and Combiner
  – At a minimum you will need to implement these classes
  – Mappers and Reducers usually have the same output key type

job.setMapperClass(StartsWithCountMapper.class);

job.setReducerClass(StartsWithCountReducer.class);

job.setCombinerClass(StartsWithCountReducer.class);

116

Page 115: The Hadoop Ecosystem for Developers

1: Configure Job

• job.waitForCompletion(true)

  – Submits the job and waits for completion

  – The boolean parameter specifies whether job progress should be printed to the console

  – If the job completes successfully ‘true’ is returned, otherwise ‘false’ is returned

117

Page 116: The Hadoop Ecosystem for Developers

Our Count Job is configured to

• Chop up text files into lines

• Send records to mappers as key-value pairs
  – Line offset and the actual line value

• Mapper class is StartsWithCountMapper
  – Receives key-value of <LongWritable, Text>
  – Outputs key-value of <Text, IntWritable>

• Reducer class is StartsWithCountReducer
  – Receives key-value of <Text, IntWritable>
  – Outputs key-values of <Text, IntWritable> as text

• Combiner class is StartsWithCountReducer

118

Page 117: The Hadoop Ecosystem for Developers

1: Configure Count Job

public class StartsWithCountJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "StartsWithCount");
        job.setJarByClass(getClass());

        // configure output and input source
        TextInputFormat.addInputPath(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);

        // configure mapper and reducer
        job.setMapperClass(StartsWithCountMapper.class);
        job.setCombinerClass(StartsWithCountReducer.class);
        job.setReducerClass(StartsWithCountReducer.class);

119

Page 118: The Hadoop Ecosystem for Developers

StartsWithCountJob.java (cont.)

        // configure output
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new StartsWithCountJob(), args);
        System.exit(exitCode);
    }
}

120

Page 119: The Hadoop Ecosystem for Developers

2: Implement Mapper class

• The class has 4 Java Generics parameters
  – (1) input key (2) input value (3) output key (4) output value
  – Input and output utilize Hadoop's IO framework
    • org.apache.hadoop.io

• Your job is to implement the map() method
  – Input key and value
  – Output key and value
  – Logic is up to you

• The map() method injects a Context object; use it to:
  – Write output
  – Create your own counters

121

Page 120: The Hadoop Ecosystem for Developers

2: Implement Mapper

public class StartsWithCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable countOne = new IntWritable(1);
    private final Text reusableText = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            reusableText.set(tokenizer.nextToken().substring(0, 1));
            context.write(reusableText, countOne);
        }
    }
}

122

Page 121: The Hadoop Ecosystem for Developers

3: Implement Reducer

• Analogous to Mapper – generic class with four types
  – (1) input key (2) input value (3) output key (4) output value
  – The output types of the map function must match the input types of the reduce function
    • In this case Text and IntWritable
  – The Map/Reduce framework groups key-value pairs produced by the mapper by key
    • For each key there is a set of one or more values
    • Input into a reducer is sorted by key
    • Known as Shuffle and Sort
  – The reduce function accepts key → set-of-values and outputs key-value pairs
    • Also utilizes a Context object (similar to Mapper)

123

Page 122: The Hadoop Ecosystem for Developers

3: Implement Reducer

public class StartsWithCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text token, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(token, new IntWritable(sum));
    }
}

124

Page 123: The Hadoop Ecosystem for Developers

3: Reducer as a Combiner

• Combines data per Mapper task to reduce the amount of data transferred to the reduce phase

• A Reducer can very often serve as a combiner
  – Only works if the reducer's output key-value pair types are the same as the mapper's output types

• Combiners are not guaranteed to run
  – Optimization only
  – Not for critical logic

• More about combiners later

125

Page 124: The Hadoop Ecosystem for Developers

4: Run Count Job

$ yarn jar my-hadoop-examples.jar \
    mr.wordcount.StartsWithCountJob \
    /user/sample/readme.txt \
    /user/sample/wordcount

126

Page 125: The Hadoop Ecosystem for Developers

Output of Count Job

• Output is written to the configured output directory

– /user/sample/wordCount/

• One output file per Reducer

– part-r-xxxxx format

• Output is driven by TextOutputFormat class
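As a minimal sketch (assuming the default single reducer, so the result lands in one part-r-00000 file under the output directory above), the output could be read back with the same FileSystem pattern used in ReadFile.java earlier:

    FileSystem fs = FileSystem.get(new Configuration());
    InputStream input = null;
    try {
        input = fs.open(new Path("/user/sample/wordcount/part-r-00000")); // assumed single-reducer output file
        IOUtils.copyBytes(input, System.out, 4096);
    } finally {
        IOUtils.closeStream(input);
    }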

127

Page 126: The Hadoop Ecosystem for Developers

$yarn command

• The yarn script with a class argument launches a JVM and executes the provided Job

$ yarn jar HadoopSamples.jar \
    mr.wordcount.StartsWithCountJob \
    /user/sample/hamlet.txt \
    /user/sample/wordcount/

• You could use straight java but the yarn script is more convenient
  – Adds Hadoop's libraries to the CLASSPATH
  – Adds Hadoop's configurations to the Configuration object
    • Ex: core-site.xml, mapred-site.xml, *.xml
  – You can also utilize the $HADOOP_CLASSPATH environment variable

128

Page 127: The Hadoop Ecosystem for Developers

Input and Output

Page 128: The Hadoop Ecosystem for Developers

MapReduce Theory

• Map and Reduce functions produce input and output
  – Input and output can range from Text to complex data structures
  – Specified via the Job's configuration
  – Relatively easy to implement your own

• Generally we can treat the flow as
  map: (K1,V1) → list(K2,V2)
  reduce: (K2,list(V2)) → list(K3,V3)
  – Reduce input types are the same as map output types

130

Page 129: The Hadoop Ecosystem for Developers

Map Reduce Flow of Data

map: (K1,V1) → list (K2,V2)

reduce: (K2,list(V2)) → list (K3,V3)

131

Page 130: The Hadoop Ecosystem for Developers

Key and Value Types

• Utilizes Hadoop's serialization mechanism for writing data in and out of the network, databases or files
  – Optimized for network serialization
  – A set of basic types is provided
  – Easy to implement your own

• Extends the Writable interface
  – The framework's serialization mechanism
  – Defines how to read and write fields
  – org.apache.hadoop.io package

132

Page 131: The Hadoop Ecosystem for Developers

Key and Value Types

• Keys must implement the WritableComparable interface
  – Extends Writable and java.lang.Comparable<T>
  – Required because keys are sorted prior to the reduce phase

• Hadoop ships with many default implementations of WritableComparable<T>
  – Wrappers for primitives (String, Integer, etc...)
  – Or you can implement your own

133

Page 132: The Hadoop Ecosystem for Developers

WritableComparable<T> Implementations

Hadoop's Class     Explanation
BooleanWritable    Boolean implementation
BytesWritable      Bytes implementation
DoubleWritable     Double implementation
FloatWritable      Float implementation
IntWritable        Int implementation
LongWritable       Long implementation
NullWritable       Writable with no data

134

Page 133: The Hadoop Ecosystem for Developers

Implement Custom WritableComparable<T>

• Implement 3 methods

  – write(DataOutput)
    • Serialize your attributes

  – readFields(DataInput)
    • De-serialize your attributes

  – compareTo(T)
    • Identify how to order your objects
    • If your custom object is used as the key it will be sorted prior to the reduce phase

135

Page 134: The Hadoop Ecosystem for Developers

BlogWritable – Implementation of WritableComparable<T>

public class BlogWritable implements WritableComparable<BlogWritable> {

    private String author;
    private String content;

    public BlogWritable() {}

    public BlogWritable(String author, String content) {
        this.author = author;
        this.content = content;
    }

    public String getAuthor() {
        return author;
    }

    public String getContent() {
        return content;
    }
    ...

136

Page 135: The Hadoop Ecosystem for Developers

BlogWritable – Implementation of WritableComparable<T> (cont.)

    ...
    @Override
    public void readFields(DataInput input) throws IOException {
        author = input.readUTF();
        content = input.readUTF();
    }

    @Override
    public void write(DataOutput output) throws IOException {
        output.writeUTF(author);
        output.writeUTF(content);
    }

    @Override
    public int compareTo(BlogWritable other) {
        return author.compareTo(other.author);
    }
}

137

Page 136: The Hadoop Ecosystem for Developers

Mapper

• Extend the Mapper class
  – Mapper<KeyIn, ValueIn, KeyOut, ValueOut>

• Simple life-cycle (see the sketch below)
  1. The framework first calls setup(Context)
  2. For each key/value pair in the split:
     • map(Key, Value, Context)
  3. Finally cleanup(Context) is called
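A minimal sketch of a Mapper that overrides all three hooks (the class name LifecycleMapper and its logic are illustrative, not from the slides):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LifecycleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final IntWritable one = new IntWritable(1);

        @Override
        protected void setup(Context context) {
            // called once per task, before any map() call -- open resources, read configuration
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // called once for every record in the split
            String line = value.toString().trim();
            if (!line.isEmpty()) {
                context.write(new Text(line.substring(0, 1)), one);
            }
        }

        @Override
        protected void cleanup(Context context) {
            // called once per task, after the last map() call -- release resources
        }
    }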

138

Page 137: The Hadoop Ecosystem for Developers

InputSplit

• Splits are a set of logically arranged records
  – A set of lines in a file
  – A set of rows in a database table

• Each instance of a mapper will process a single split
  – A map instance processes one record at a time
    • map(k,v) is called for each record

• Splits are implemented by extending the InputSplit class

139

Page 138: The Hadoop Ecosystem for Developers

InputSplit

• Framework provides many options for InputSplit implementations

– Hadoop’s FileSplit

– HBase’s TableSplit

• Don’t usually need to deal with splits directly

– InputFormat’s responsibility

140

Page 139: The Hadoop Ecosystem for Developers

Combiner

• Runs on the output of the map function

• Produces output with the same key-value types as the map output

map:     (K1,V1) → list(K2,V2)
combine: (K2,list(V2)) → list(K2,V2)
reduce:  (K2,list(V2)) → list(K3,V3)

• Optimization to reduce bandwidth
  – NO guarantees on being called
  – May only be applied to a sub-set of map outputs

• Often is the same class as the Reducer

• Each combine processes output from a single split

141

Page 140: The Hadoop Ecosystem for Developers

Combiner Data Flow

142

Page 141: The Hadoop Ecosystem for Developers

Sample StartsWithCountJob

Run without Combiner

143

Page 142: The Hadoop Ecosystem for Developers

Sample StartsWithCountJob

Run with Combiner

144

Page 143: The Hadoop Ecosystem for Developers

Specify Combiner Function

• To implement Combiner extend Reducer class

• Set combiner on Job class

job.setCombinerClass(StartsWithCountReducer.class);

145

Page 144: The Hadoop Ecosystem for Developers

Reducer

• Extend the Reducer class
  – Reducer<KeyIn, ValueIn, KeyOut, ValueOut>
  – KeyIn and ValueIn types must match the output types of the mapper

• Receives input from the mappers' output
  – Sorted on key
  – Grouped on key of the key-values produced by the mappers
  – Input is directed by the Partitioner implementation

• Simple life-cycle – similar to Mapper
  – The framework first calls setup(Context)
  – For each key → list(value) calls
    • reduce(Key, Values, Context)
  – Finally cleanup(Context) is called

146

Page 145: The Hadoop Ecosystem for Developers

Reducer

• Can configure more than 1 reducer
  – job.setNumReduceTasks(10);
  – mapreduce.job.reduces property
    • job.getConfiguration().setInt("mapreduce.job.reduces", 10)

• A Partitioner implementation directs key-value pairs to the proper reducer task
  – A partition is processed by a reduce task
    • # of partitions = # of reduce tasks
  – The default strategy is to hash the key to determine the partition, implemented by HashPartitioner<K, V>

147

Page 146: The Hadoop Ecosystem for Developers

Partitioner Data Flow

148

Page 147: The Hadoop Ecosystem for Developers

HashPartitioner

public class HashPartitioner<K, V> extends Partitioner<K, V> {
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

• Calculate the index of the partition:
  – Convert the key's hash into a non-negative number
    • Logical AND with the maximum integer value
  – Modulo by the number of reduce tasks

• In case of more than 1 reducer
  – Records are distributed evenly across the available reduce tasks
    • Assuming a good hashCode() function
  – Records with the same key will make it into the same reduce task
  – The code is independent of the number of partitions/reducers specified

149

Page 148: The Hadoop Ecosystem for Developers

Custom Partitioner

public class CustomPartitioner extends Partitioner<Text, BlogWritable> {

    @Override
    public int getPartition(Text key, BlogWritable blog, int numReduceTasks) {
        // Use the author's hash only; AND with max integer to get a positive value
        int positiveHash = blog.getAuthor().hashCode() & Integer.MAX_VALUE;
        return positiveHash % numReduceTasks;
    }
}

• All blogs with the same author will end up in the same reduce task
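To take effect, such a partitioner would be registered on the Job together with more than one reducer (a short sketch; setPartitionerClass is the standard Job method for this, and the class name is the one from the example above):

    job.setPartitionerClass(CustomPartitioner.class);
    job.setNumReduceTasks(10);   // with a single reducer, partitioning has no visible effect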

150

Page 149: The Hadoop Ecosystem for Developers

Component Overview

151

Page 150: The Hadoop Ecosystem for Developers

Improving Hadoop

Page 151: The Hadoop Ecosystem for Developers

Improving Hadoop

• Core Hadoop is complicated so some tools were added to make things easier

• Hadoop Distributions collect these tools and release them as a whole package

153

Page 152: The Hadoop Ecosystem for Developers

Noticeable Distributions

• Cloudera

• MapR

• HortonWorks

• Amazon EMR

154

Page 153: The Hadoop Ecosystem for Developers

HADOOP Technology Eco System

155

Page 154: The Hadoop Ecosystem for Developers

Improving Programmability

• Pig: Programming language that simplifies Hadoop actions: loading, transforming and sorting data

• Hive: enables Hadoop to operate as a data warehouse using SQL-like syntax.

156

Page 155: The Hadoop Ecosystem for Developers

Pig

• “is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. “

• Top Level Apache Project
  – http://pig.apache.org

• Pig is an abstraction on top of Hadoop
  – Provides a high-level programming language designed for data processing
  – Converted into MapReduce and executed on Hadoop Clusters

• Pig is widely accepted and used
  – Yahoo!, Twitter, Netflix, etc...

157

Page 156: The Hadoop Ecosystem for Developers

Pig and MapReduce

• MapReduce requires programmers
  – Must think in terms of map and reduce functions
  – More than likely will require Java programmers

• Pig provides a high-level language that can be used by
  – Analysts
  – Data Scientists
  – Statisticians
  – Etc...

• Originally implemented at Yahoo! to allow analysts to access data

158

Page 157: The Hadoop Ecosystem for Developers

Pig’s Features

• Join Datasets

• Sort Datasets

• Filter

• Data Types

• Group By

• User Defined Functions

159

Page 158: The Hadoop Ecosystem for Developers

Pig’s Use Cases

• Extract Transform Load (ETL)
  – Ex: Processing large amounts of log data
    • Clean bad entries, join with other data-sets

• Research of “raw” information
  – Ex: User Audit Logs
  – Schema may be unknown or inconsistent
  – Data Scientists and Analysts may like Pig's data transformation paradigm

160

Page 159: The Hadoop Ecosystem for Developers

Pig Components

• Pig Latin
  – Command based language
  – Designed specifically for data transformation and flow expression

• Execution Environment
  – The environment in which Pig Latin commands are executed
  – Currently there is support for Local and Hadoop modes

• Pig compiler converts Pig Latin to MapReduce
  – The compiler strives to optimize execution
  – You automatically get optimization improvements with Pig updates

161

Page 160: The Hadoop Ecosystem for Developers

Pig Code Example

162

Page 161: The Hadoop Ecosystem for Developers

Hive

• Data Warehousing Solution built on top of Hadoop

• Provides a SQL-like query language named HiveQL
  – Minimal learning curve for people with SQL expertise
  – Data analysts are the target audience

• Early Hive development work started at Facebook in 2007

• Today Hive is an Apache project under Hadoop
  – http://hive.apache.org

163

Page 162: The Hadoop Ecosystem for Developers

Hive Provides

• Ability to bring structure to various data formats

• Simple interface for ad hoc querying, analyzing and summarizing large amounts of data

• Access to files on various data stores such as HDFS and HBase

164

Page 163: The Hadoop Ecosystem for Developers

When not to use Hive

• Hive does NOT provide low latency or real time queries

• Even querying small amounts of data may take minutes

• Designed for scalability and ease-of-use rather than low latency responses

165

Page 164: The Hadoop Ecosystem for Developers

Hive

• Translates HiveQL statements into a set of MapReduce Jobs which are then executed on a Hadoop Cluster

166

Page 165: The Hadoop Ecosystem for Developers

Hive Metastore

• To support features like schemas and data partitioning, Hive keeps its metadata in a Relational Database

  – Packaged with Derby, a lightweight embedded SQL DB
    • The default Derby-based metastore is good for evaluation and testing
    • The schema is not shared between users, as each user has their own instance of embedded Derby
    • Stored in the metastore_db directory, which resides in the directory that hive was started from

  – Can easily switch to another SQL installation such as MySQL

167

Page 166: The Hadoop Ecosystem for Developers

Hive Architecture

168

Page 167: The Hadoop Ecosystem for Developers

1: Create a Table

• Let’s create a table to store data from $PLAY_AREA/data/user-posts.txt

169

Page 168: The Hadoop Ecosystem for Developers

1: Create a Table

170

Page 169: The Hadoop Ecosystem for Developers

2: Load Data Into a Table

171

Page 170: The Hadoop Ecosystem for Developers

3: Query Data

172

Page 171: The Hadoop Ecosystem for Developers

3: Query Data

173

Page 172: The Hadoop Ecosystem for Developers

Databases and DB Connectivity

• HBase: column-oriented database that runs on HDFS.

• Sqoop: a tool designed to import data from relational databases into Hadoop (HDFS or Hive).

174

Page 173: The Hadoop Ecosystem for Developers

HBase

• Distributed column-oriented database built on top of HDFS, providing Big Table-like capabilities for Hadoop

175

Page 174: The Hadoop Ecosystem for Developers

When do we use HBase?

• Huge volumes of randomly accessed data.

• HBase is at its best when it’s accessed in a distributed fashion by many clients.

• Consider HBase when you’re loading data by key, searching data by key (or range), serving data by key, querying data by key or when storing data by row that doesn’t conform well to a schema.

176

Page 175: The Hadoop Ecosystem for Developers

When not to use Hbase

• HBase doesn't use SQL, doesn't have an optimizer, and doesn't support transactions or joins

• If you need those things, you probably can't use HBase

177

Page 176: The Hadoop Ecosystem for Developers

HBase Example

Example:

create 'blogposts', 'post', 'image'                          --- create table
put 'blogposts', 'id1', 'post:title', 'Hello World'          --- insert value
put 'blogposts', 'id1', 'post:body', 'This is a blog post'   --- insert value
put 'blogposts', 'id1', 'image:header', 'image1.jpg'         --- insert value
get 'blogposts', 'id1'                                       --- select record
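The same operations can also be issued from Java. A minimal sketch using the HBase 1.x client API (assumes the table was created as above; connection settings and error handling omitted; not part of the original slides):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("blogposts"))) {

        Put put = new Put(Bytes.toBytes("id1"));
        put.addColumn(Bytes.toBytes("post"), Bytes.toBytes("title"), Bytes.toBytes("Hello World"));
        table.put(put);                                            // insert value

        Result result = table.get(new Get(Bytes.toBytes("id1")));  // select record
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("post"), Bytes.toBytes("title"))));
    }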

178

Page 177: The Hadoop Ecosystem for Developers

Sqoop

• Sqoop is a command line tool for moving data from an RDBMS to Hadoop

• Uses MapReduce jobs or Hive to load the data

• Can also export data from HDFS back to an RDBMS

• Comes with connectors to MySQL, PostgreSQL, Oracle, SQL Server and DB2.

Example:

$bin/sqoop import --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' \

--table lineitem --hive-import

$bin/sqoop export --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table lineitem --export-dir /data/lineitemData

179

Page 178: The Hadoop Ecosystem for Developers

Improving Hadoop – More useful tools

• For improving coordination: Zookeeper

• For Improving log collection: Flume

• For improving scheduling/orchestration: Oozie

• For Monitoring: Chukwa

• For Improving UI: Hue

180

Page 179: The Hadoop Ecosystem for Developers

ZooKeeper

• ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services

• It allows distributed processes to coordinate with each other through a shared hierarchal namespace which is organized similarly to a standard file system

• ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions

181

Page 180: The Hadoop Ecosystem for Developers

Flume

• Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS

• Flume maintains a central list of ongoing data flows, stored redundantly in Zookeeper.

182

Page 181: The Hadoop Ecosystem for Developers

Oozie

• Oozie is a workflow scheduler system to manage Hadoop jobs

• Oozie workflow is a collection of actions arranged in a control dependency DAG specifying a sequence of actions execution

• The Oozie Coordinator system allows the user to define workflow executions based on time intervals or on demand

183

Page 182: The Hadoop Ecosystem for Developers

Spark

Fast and general MapReduce-like engine for large-scale data processing

• Fast

– In-memory data storage for very fast interactive queries; up to 100 times faster than Hadoop

• General

– Unified platform that can combine: SQL, Machine Learning, Streaming, Graph & Complex analytics

• Ease of use

– Can be developed in Java, Scala or Python

• Integrated with Hadoop

– Can read from HDFS, HBase, Cassandra, and any Hadoop data source.

184

Page 183: The Hadoop Ecosystem for Developers

Spark is the Most Active Open Source

Project in Big Data

185

Page 184: The Hadoop Ecosystem for Developers

The Spark Community

186

Page 185: The Hadoop Ecosystem for Developers

Key Concepts

Resilient Distributed Datasets

• Collections of objects spread across a cluster, stored in RAM or on Disk

• Built through parallel transformations

• Automatically rebuilt on failure

Operations

• Transformations (e.g. map, filter, groupBy)

• Actions (e.g. count, collect, save)

Write programs in terms of transformations on

distributed datasets
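As an illustration (not from the slides), the StartsWithCount job from earlier could be expressed as RDD transformations using the Spark 2.x Java API (Java 8 lambdas; the input path is the sample one used earlier, the output path is hypothetical):

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    SparkConf conf = new SparkConf().setAppName("StartsWithCount");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile("/user/sample/hamlet.txt");          // Hadoop data source (HDFS)
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())       // transformation
        .filter(word -> !word.isEmpty())                                     // transformation
        .mapToPair(word -> new Tuple2<>(word.substring(0, 1), 1))            // transformation
        .reduceByKey((a, b) -> a + b);                                       // transformation
    counts.saveAsTextFile("/user/sample/startsWithCountSpark");              // action

    sc.close();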

187

Page 186: The Hadoop Ecosystem for Developers

Unified Platform

• Continued innovation bringing new functionality, e.g.:• Java 8 (Closures, LambaExpressions)• Spark SQL (SQL on Spark, not just Hive)• BlinkDB(Approximate Queries)• SparkR(R wrapper for Spark)

188

Page 187: The Hadoop Ecosystem for Developers

Language Support

189

Page 188: The Hadoop Ecosystem for Developers

Data Sources

• Local Files
  – file:///opt/httpd/logs/access_log

• S3

• Hadoop Distributed Filesystem
  – Regular files, sequence files, any other Hadoop InputFormat

• HBase

• Can also read from any other Hadoop data source.

190

Page 189: The Hadoop Ecosystem for Developers

Resilient Distributed Datasets (RDD)

• Spark revolves around RDDs

• Fault-tolerant collection of elements that can be operated on in parallel

– Parallelized Collection: Scala collection which is run in parallel

– Hadoop Dataset: records of files supported by Hadoop

191

Page 190: The Hadoop Ecosystem for Developers

Hadoop Tools

192

Page 191: The Hadoop Ecosystem for Developers

Hadoop cluster

Cluster of machine running Hadoop at Yahoo! (credit: Yahoo!)

193

Page 192: The Hadoop Ecosystem for Developers

Big Data and NoSQL

Page 193: The Hadoop Ecosystem for Developers

The Challenge

• We want scalable, durable, high volume, high velocity, distributed data storage that can handle non-structured data and that will fit our specific need

• RDBMS is too generic and doesn't cut it any more – it can do the job but it is not cost effective for our usages

195

Page 194: The Hadoop Ecosystem for Developers

The Solution: NoSQL

• Let's take some parts of the standard RDBMS out and design the solution for our specific uses

• NoSQL databases have been around for ages under different names/solutions

196

Page 195: The Hadoop Ecosystem for Developers

Example Comparison: RDBMS vs. Hadoop

Typical Traditional RDBMS vs. Hadoop

• Data Size: Gigabytes vs. Petabytes

• Access: Interactive and batch vs. batch only (NOT interactive)

• Updates: Read/write many times vs. write once, read many times

• Structure: Static schema vs. dynamic schema

• Scaling: Nonlinear vs. linear

• Query Response Time: Can be near immediate vs. has latency (due to batch processing)

197

Page 196: The Hadoop Ecosystem for Developers

Hadoop: Best Used For

• Structured or not (flexibility)

• Scalability of storage/compute

• Complex data processing

• Cheaper compared to RDBMS

Relational Database: Best Used For

• Interactive OLAP analytics (<1 sec)

• Multistep transactions

• 100% SQL compliance

Hadoop and Relational Database: best when used together

198

Page 197: The Hadoop Ecosystem for Developers

The NOSQL Movement

• NOSQL is not a technology – it’s a concept.

• We need high performance, scale-out ability, or an agile structure.

• We are now willing to sacrifice our sacred cows: consistency and transactions.

• Over 200 different brands and solutions (http://nosql-database.org/).

199

Page 198: The Hadoop Ecosystem for Developers

NoSQL, NOSQL or NewSQL

• NoSQL is not No to SQL

• NoSQL is not Never SQL

• NOSQL = Not Only SQL

200

Page 199: The Hadoop Ecosystem for Developers

Why NoSQL?

• Some applications need very few database features, but need high scale.

• Desire to avoid data/schema pre-design altogether for simple applications.

• Need for a low-latency, low-overhead API to access data.

• Simplicity: no need for fancy indexing, just fast lookup by primary key.

201

Page 200: The Hadoop Ecosystem for Developers

Why NoSQL? (cont.)

• Developer friendly, DBAs not needed (?).

• Schema-less.

• Agile: non-structured (or semi-structured).

• In Memory.

• No (or loose) Transactions.

• No joins.

202

Page 201: The Hadoop Ecosystem for Developers

203

Page 202: The Hadoop Ecosystem for Developers

Is NoSQL an RDBMS Replacement?

No. Well... sometimes it does…

204

Page 203: The Hadoop Ecosystem for Developers

RDBMS vs. NoSQL

Rationale for choosing a persistent store:

Relational Architecture vs. NoSQL Architecture

• High value, high density, complex data vs. low value, low density, simple data

• Complex data relationships vs. very simple relationships

• Schema-centric vs. schema-free, unstructured or semi-structured data

• Designed to scale up & out vs. distributed storage and processing

• Lots of general-purpose features/functionality vs. a stripped-down, special-purpose data store

• High overhead ($ per operation) vs. low overhead ($ per operation)

205

Page 204: The Hadoop Ecosystem for Developers

Scalability and Consistency

Page 205: The Hadoop Ecosystem for Developers

Scalability

• NoSQL is sometimes very easy to scale out

• Most have dynamic data partitioning and easy data distribution

• But distributed systems always come with a price: the CAP Theorem and its impact on ACID transactions

207

Page 206: The Hadoop Ecosystem for Developers

ACID Transactions

Most DBMSs are built with ACID transactions in mind:

• Atomicity: All or nothing; write operations are performed as a single transaction (see the sketch below)

• Consistency: Any transaction takes the DB from one consistent state to another with no broken constraints; ensures replicas are identical on different nodes

• Isolation: Other operations cannot access data that has been modified during a transaction that has not yet completed

• Durability: Ability to recover the committed transaction updates after any kind of system failure (transaction log)
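To make atomicity concrete, here is a small sketch using Python's built-in sqlite3 module; the accounts table, values, and the simulated failure are purely illustrative and not tied to any system above.

# Atomicity sketch with sqlite3 (table and data are hypothetical)
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("simulated failure before the matching credit")
except RuntimeError:
    pass

# Neither side of the transfer is visible: the transaction rolled back as a unit
print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())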

208

Page 207: The Hadoop Ecosystem for Developers

ACID Transactions (cont.)

• ACID is usually implemented by a locking mechanism/manager

• In distributed systems, central locking can become a bottleneck

• Most NoSQL databases do not use (or limit) ACID transactions and replace them with something else…

209

Page 208: The Hadoop Ecosystem for Developers

CAP Theorem

• The CAP theorem states that in a distributed/partitioned application, you can only pick two of the following three characteristics:

– Consistency

– Availability

– Partition Tolerance

210

Page 209: The Hadoop Ecosystem for Developers

CAP in Practice

211

Page 210: The Hadoop Ecosystem for Developers

NoSQL BASE

• NoSQL stores usually provide BASE characteristics instead of ACID. BASE stands for:

– Basically Available

– Soft State

– Eventual Consistency

• It means that when an update is made in one place, the other partitions will see it over time; there might be an inconsistency window

• Read and write operations complete more quickly, lowering latency

212

Page 211: The Hadoop Ecosystem for Developers

Eventual Consistency

213

Page 212: The Hadoop Ecosystem for Developers

Types of NoSQL

Page 213: The Hadoop Ecosystem for Developers

NoSQL Taxonomy

Type and Examples:

• Key-Value Store: Amazon DynamoDB, Berkeley DB, Redis, Riak, Cassandra

• Document Store: MongoDB, CouchDB, Couchbase

• Column Store: Google BigTable, HBase, Amazon SimpleDB, Cassandra

• Graph Store: Neo4j, InfiniteGraph, RDF

215

Page 214: The Hadoop Ecosystem for Developers

NoSQL Map

[Figure: data size vs. database performance, showing the typical RDBMS "SQL comfort zone" and where key-value, column store, document, and graph databases sit as data size and complexity grow]

216

Page 215: The Hadoop Ecosystem for Developers

Key Value Store

• Distributed hash tables

• Very fast to get a single value

• Examples (see the sketch below):

– Amazon DynamoDB

– Berkeley DB

– Redis

– Riak

– Cassandra
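For example, a minimal key-value sketch with the redis-py client against a local Redis server; the host, port, and keys are assumptions.

# Key-value sketch with redis-py (host, port, and keys are assumptions)
import redis

r = redis.Redis(host="localhost", port=6379)

r.set("user:42:name", "alice")                 # write one key
print(r.get("user:42:name"))                   # b'alice' -- a single fast lookup

r.hset("user:42", mapping={"name": "alice", "age": "30"})   # hash of fields under one key
print(r.hgetall("user:42"))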

217

Page 216: The Hadoop Ecosystem for Developers

Document Store

• Similar to Key/Value, but value is a document

• JSON or something similar, flexible schema

• Agile technology

• Examples (see the sketch below):

– MongoDB

– CouchDB

– Couchbase
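A minimal document-store sketch using pymongo against a local MongoDB; the database, collection, and documents are hypothetical.

# Document-store sketch with pymongo (database/collection/documents are hypothetical)
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client.demo_db.users                   # database and collection are created lazily

users.insert_one({"name": "alice", "age": 30, "tags": ["nosql", "docs"]})
users.insert_one({"name": "bob", "email": "bob@example.com"})   # different fields are fine

for doc in users.find({"age": {"$gte": 18}}):
    print(doc["name"])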

218

Page 217: The Hadoop Ecosystem for Developers

Column Store

• One key, multiple attributes

• Hybrid row/column

• Examples (see the sketch below):

– Google BigTable

– HBase

– Amazon’s SimpleDB

– Cassandra
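A minimal column-store sketch using the happybase client, which talks to HBase through its Thrift gateway; the table, column family, and row key are assumptions, and the table is assumed to already exist.

# Column-store sketch via HBase Thrift (table/family/row key are assumptions)
import happybase

connection = happybase.Connection("localhost")   # HBase Thrift gateway, default port 9090
table = connection.table("users")                # assumed to exist with column family 'info'

table.put(b"user42", {b"info:name": b"alice", b"info:age": b"30"})   # one key, many attributes

row = table.row(b"user42")
print(row[b"info:name"])                         # b'alice'

connection.close()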

219

Page 218: The Hadoop Ecosystem for Developers

How Records are Organized?

• This is a logical table in RDBMS systems

• Its physical organization matches the logical one: the columns of each row are stored together, row after row

[Figure: a logical table with rows Row 1–Row 4 and columns Col 1–Col 4]

220

Page 219: The Hadoop Ecosystem for Developers

Query Data

• When we query data, records are read in the order they are stored in the physical structure

• Even when we query a single column, we still need to read the entire table and extract the column

[Figure: the same Row 1–Row 4 / Col 1–Col 4 table; both "Select Col2 From MyTable" and "Select * From MyTable" scan the rows in storage order]

221

Page 220: The Hadoop Ecosystem for Developers

How Does a Column Store Keep Data?

[Figure: the same data organized in a row store vs. a column store; "Select Col2 From MyTable" reads only the Col 2 block in the column layout – see the sketch below]
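A toy sketch in plain Python contrasting the two layouts (purely illustrative, not a real storage engine):

rows = [                                    # row store: whole records stored one after another
    {"col1": 1, "col2": "a", "col3": 10.0},
    {"col1": 2, "col2": "b", "col3": 20.0},
]

columns = {                                 # column store: each column stored together
    "col1": [1, 2],
    "col2": ["a", "b"],
    "col3": [10.0, 20.0],
}

# "Select Col2 From MyTable"
print([r["col2"] for r in rows])   # row layout: every full record is touched
print(columns["col2"])             # column layout: only the col2 block is read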

222

Page 221: The Hadoop Ecosystem for Developers

Graph Store

• Inspired by Graph Theory

• Data model: nodes, relationships, and properties on both

• Relational databases have a very hard time representing a graph

• Examples (see the sketch below):

– Neo4j

– InfiniteGraph

– RDF
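A minimal graph-store sketch with the neo4j Python driver, borrowing the yosi/ami friendship from the property-graph slide further below; the connection URI and credentials are assumptions.

# Graph-store sketch with the neo4j driver (URI and credentials are assumptions)
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # nodes and relationships can both carry properties
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIEND {since: $since}]->(b)",
        a="yosi", b="ami", since=2015,
    )
    for record in session.run("MATCH (p:Person)-[:FRIEND]->(q:Person) RETURN p.name, q.name"):
        print(record["p.name"], "->", record["q.name"])

driver.close()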

223

Page 222: The Hadoop Ecosystem for Developers

What is a Graph?

• An abstract representation of a set of objects where some pairs are connected by links.

• Object (Vertex, Node) – can have attributes like name and value

• Link (Edge, Arc, Relationship) – can have attributes like type and name or date

[Figure: two nodes connected by an edge]

224

Page 223: The Hadoop Ecosystem for Developers

Graph Types

Undirected Graph

Directed Graph

Pseudo Graph

Multi Graph

[Figure: node-and-edge diagrams illustrating each graph type]

225

Page 224: The Hadoop Ecosystem for Developers

More Graph Types

Weighted Graph

Labeled Graph

Property Graph

[Figure: a weighted edge (10), a labeled edge ("Like"), and a property graph where nodes {Name: yosi, Age: 40} and {Name: ami, Age: 30} are connected by a "friend, date 2015" relationship]

226

Page 225: The Hadoop Ecosystem for Developers

Relationships

[Figure: person nodes (ID:1 TYPE:F NAME:alice, ID:2 TYPE:M NAME:bob, ID:1 TYPE:F NAME:dafna) connected to a group node (ID:1 TYPE:G NAME:NoSQL) by "member" relationships, e.g. with a Since: 2012 property]

227

Page 226: The Hadoop Ecosystem for Developers

228

Page 227: The Hadoop Ecosystem for Developers

Q&A

Page 228: The Hadoop Ecosystem for Developers

Conclusion

• The Challenge of Big Data

• Hadoop Basics: HDFS, MapReduce and YARN

• Improving Hadoop and Tools

• NoSQL and RDBMS

230

Page 229: The Hadoop Ecosystem for Developers