Tackling Big Data with the Elephant in the Room

Transcript of Tackling Big Data with the Elephant in the Room

Page 1: Tackling Big Data with the Elephant in the Room

TACKLING BIG DATA WITH THE ELEPHANT IN THE ROOM

Page 2: Tackling Big Data with the Elephant in the Room

WHAT’S THE PROBLEM WITH BIG DATA?

–  Volume
–  Variety
–  Velocity

Page 3: Tackling Big Data with the Elephant in the Room

WHAT’S THE SOLUTION TO BIG DATA?

“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.” – Grace Hopper

Page 4: Tackling Big Data with the Elephant in the Room

HADOOP’S SOLUTION

Hadoop Core Components: Hadoop Distributed File System, MapReduce

Hadoop Ecosystem: Sqoop, Pig, Hive, HBase, Mahout, Flume, Oozie, …

Page 5: Tackling Big Data with the Elephant in the Room

WHAT IS HDFS?

Page 6: Tackling Big Data with the Elephant in the Room

HOW DOES HDFS WORK?

[Diagram: Large Data File split into Block #1 and Block #2]

Page 7: Tackling Big Data with the Elephant in the Room

HOW DOES HDFS WORK?

[Diagram: Large Data File split into Block #1 and Block #2; three copies of Block #1]

Page 8: Tackling Big Data with the Elephant in the Room

HOW DOES HDFS WORK?

[Diagram: Large Data File split into Block #1 and Block #2; three copies of Block #1 and three copies of Block #2]

Page 9: Tackling Big Data with the Elephant in the Room

HOW DOES HDFS WORK?

[Diagram: Large Data File split into Block #1 and Block #2; three copies of Block #1 and three copies of Block #2]
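
The block-and-copy picture in the HDFS slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny block size, the node names, and the round-robin placement are hypothetical and are not HDFS's actual placement policy; the replication factor of 3 matches the three copies per block shown in the diagrams.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the file contents into fixed-size blocks; the last block may be shorter."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is an assumption made for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


large_file = b"0123456789"                             # stand-in for a large data file
blocks = split_into_blocks(large_file, block_size=6)   # yields Block #1 and Block #2
nodes = ["node1", "node2", "node3", "node4"]           # hypothetical node names
placement = place_replicas(len(blocks), nodes)

for block_id, holders in placement.items():
    print(f"Block #{block_id + 1} -> {holders}")
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves two copies of each block available.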

Page 10: Tackling Big Data with the Elephant in the Room

WHAT IS MAP-REDUCE?

Core Ideas
–  Data Locality
–  Parallelism
–  Block Independence

Three Stages
1.  Map
2.  Swap & Sort
3.  Reduce

Page 11: Tackling Big Data with the Elephant in the Room

WORD COUNT MAP

Input: the cat sat on the mat The aardvark sat on the … The mahout drove the …

Node 1: the cat sat on the mat the aardvark sat on the …

Node 2: the mahout drove the …

Page 12: Tackling Big Data with the Elephant in the Room

WORD COUNT MAP

Node 1 (Mapper, map()): the cat sat on the mat the aardvark sat on the …

Node 2 (Mapper, map()): the mahout drove the …

Page 13: Tackling Big Data with the Elephant in the Room

WORD COUNT MAP

Node 1 (Mapper, map()): the cat sat on the mat the aardvark sat on the …

Node 2 (Mapper, map()): the mahout drove the …

Node 1 map() output: the 1, cat 1, sat 1, on 1, the 1, mat 1

Node 2 map() output: the 1, mahout 1, drove 1, the 1

Page 14: Tackling Big Data with the Elephant in the Room

WORD COUNT MAP

Node 1 (Mapper, map()): the cat sat on the mat the aardvark sat on the …

Node 2 (Mapper, map()): the mahout drove the …

Node 1 map() output: the 1, cat 1, sat 1, on 1, the 1, mat 1

Node 2 map() output: the 1, mahout 1, drove 1, the 1

Node 1 map() output: the 1, aardvark 1, sat 1, on 1, the 1
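
The map step on these slides can be sketched as a plain Python function. This is a minimal sketch, not Hadoop's Java Mapper API: the function name `map_words` is hypothetical, and splitting on whitespace with lowercasing is an assumed tokenization.

```python
def map_words(line: str) -> list[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


pairs = map_words("the cat sat on the mat")
for word, count in pairs:
    print(word, count)
```

Each call looks only at its own line, which is what lets mappers on different nodes run in parallel without coordinating.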

Page 15: Tackling Big Data with the Elephant in the Room

WORD COUNT SWAP & SORT

Mapper output: the 1, cat 1, sat 1, on 1, the 1, mat 1

Mapper output: the 1, mahout 1, drove 1, the 1

Mapper output: the 1, aardvark 1, sat 1, on 1, the 1

Page 16: Tackling Big Data with the Elephant in the Room

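WORD COUNT SWAP & SORT

Mapper output: the 1, cat 1, sat 1, on 1, the 1, mat 1

Mapper output: the 1, mahout 1, drove 1, the 1

Mapper output: the 1, aardvark 1, sat 1, on 1, the 1

Sorted and grouped on Node 1: aardvark 1; cat 1; mat 1; on 1,1; sat 1,1; the 1,1,1,1

Sorted and grouped on Node 2: drove 1; mahout 1; the 1,1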

Page 17: Tackling Big Data with the Elephant in the Room

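WORD COUNT SWAP & SORT

Mapper output: the 1, cat 1, sat 1, on 1, the 1, mat 1

Mapper output: the 1, mahout 1, drove 1, the 1

Mapper output: the 1, aardvark 1, sat 1, on 1, the 1

Sorted and grouped on Node 1: aardvark 1; cat 1; mat 1; on 1,1; sat 1,1; the 1,1,1,1

Sorted and grouped on Node 2: drove 1; mahout 1; the 1,1

Merged and distributed across Node 3, Node 4, Node 5: aardvark 1; cat 1; mat 1; mahout 1; sat 1,1; drove 1; on 1,1; the 1,1,1,1,1,1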
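
The grouping shown in the Swap & Sort slides can be sketched as follows. This is a single-process illustration under stated assumptions: the function name `swap_and_sort` is hypothetical, the input pairs are copied from the mapper outputs above, and the step of distributing keys across reducer nodes is left out.

```python
from collections import defaultdict


def swap_and_sort(mapper_outputs: list[list[tuple[str, int]]]) -> dict[str, list[int]]:
    """Collect the values for each key from all mappers, then order the keys."""
    grouped = defaultdict(list)
    for pairs in mapper_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return dict(sorted(grouped.items()))


mapper_outputs = [
    [("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1)],
    [("the", 1), ("mahout", 1), ("drove", 1), ("the", 1)],
    [("the", 1), ("aardvark", 1), ("sat", 1), ("on", 1), ("the", 1)],
]
grouped = swap_and_sort(mapper_outputs)
print(grouped["the"])  # [1, 1, 1, 1, 1, 1]
```

After this step every value for a given word sits in one list, so a reducer never has to look at another key's data.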

Page 18: Tackling Big Data with the Elephant in the Room

WORD COUNT REDUCER

Reducer input (Node 3, Node 4, Node 5): aardvark 1; cat 1; mat 1; mahout 1; sat 1,1; drove 1; on 1,1; the 1,1,1,1,1,1

Reducers: Reducer 0, Reducer 1, Reducer 2

Reducer output: aardvark 1; cat 1; mat 1; mahout 1; sat 2; drove 1; on 2; the 6
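
The reduce step amounts to summing each word's list of ones. A minimal sketch, assuming the grouped input is already available as a dictionary; the name `reduce_counts` is hypothetical, and only three of the words from the slide are used to keep it short.

```python
def reduce_counts(word: str, counts: list[int]) -> tuple[str, int]:
    """Collapse all values collected for one word into a single total."""
    return word, sum(counts)


grouped = {
    "aardvark": [1],
    "on": [1, 1],
    "the": [1, 1, 1, 1, 1, 1],
}
totals = dict(reduce_counts(word, counts) for word, counts in grouped.items())
print(totals)  # {'aardvark': 1, 'on': 2, 'the': 6}
```

Since each call touches exactly one key, different keys can be handed to different reducers, as with Reducer 0, Reducer 1, and Reducer 2 on the slide.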

Page 19: Tackling Big Data with the Elephant in the Room

TAKE-AWAYS

Hadoop Core Components: Hadoop Distributed File System, MapReduce

Hadoop Ecosystem: Sqoop, Pig, Hive, HBase, Mahout, Flume, Oozie, …

Page 20: Tackling Big Data with the Elephant in the Room

QUESTIONS?