Diving into Hadoop and Big Data - Michael Carducci (@MichaelCarducci)

Transcript of Diving into Hadoop and Big Data

Page 1

Diving into Hadoop and Big Data
Michael Carducci (@MichaelCarducci)

Page 2

Agenda

• What is Big Data

• Introducing Hadoop

• Hadoop Ecosystem & Architecture

• HDFS Overview

• Programming Hadoop

• MapReduce

• Hive (SQL* OLAP)

• Drill (Interactive Queries)

• Pig (Fast Scripting)

• Spark

• Spark SQL

• Machine Learning

Page 3

What is Big Data

Page 4

–IBM Marketing Cloud, 2017

"90% of the data in the world today has been created in the last two years alone, at 2.5 quintillion bytes of data a day!"

What is Big Data

Page 5

What is Big Data

• Large quantity of data

• Many sources/formats

• Unstructured

• Large Processing Load (needs to be distributed)

• Streaming/real-time processing

• Big Deployment footprint

Page 6

Key Trends

• Device Explosion - 23.14 billion connected devices
• Ubiquitous Connection - 3.3 ZB by 2021 (278 EB/month)
• Social Networks - 2.77 billion social media users by 2019
• Sensor Networks - millions of new sensors go online every hour
• Cheap Storage - average cost of 1 GB in 2016: $0.019 (15.5 million times cheaper)
• Inexpensive Computing - cost per GFLOPS in 2018: $0.03

Page 7

Big Data Big Opportunities

Page 8

A New Way of Thinking: What, not Why

Page 9

Big Data - Challenges

Volume, Variety, Velocity

Page 10

Apache Hadoop

Page 11

–Hortonworks

"an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware"

Page 12

Why Hadoop?

• Low Cost

• Linear Scaling

• Fault Tolerant

• Flexibility


Page 13

Hadoop History

• 2002 - Nutch created by Doug Cutting and Mike Cafarella
• 2003-04 - GFS and MapReduce papers published by Google
• 2004 - Nutch's use of MapReduce and distributed filesystems takes shape
• 2006 - Doug Cutting joins Yahoo and brings Nutch with him
• 2008 - Hadoop is born from Nutch
• 2008 - Yahoo releases Hadoop as an open source project

Page 14

We Learn by DOING

• Download VirtualBox: https://www.virtualbox.org/

• Download the Hortonworks sandbox: https://hortonworks.com/products/sandbox/

• Download the example data: https://grouplens.org/datasets/movielens/

• Explore the examples offline!

Page 15

Hadoop Ecosystem


Page 19

Zookeeper

• Tracks state across the cluster

• Which node is the master

• What tasks are assigned to which workers

• Which workers are currently available

• Failure recovery

• HA witness

• Integral to many Hadoop technologies

Page 20

Failure Modes

• Master Crashes

• Worker crashes

• Network Issues

Page 21

Zookeeper Under the Hood

• Looks like a small, distributed filesystem: a tree of znodes, each holding a little data

• Supports basic general commands

• Create, Delete, Exists, SetData, GetData, GetChildren

/master             "master1.xxxx:2223"
/workers/worker1    "worker1.xxxx:2225"
/workers/worker2    "worker2.xxxx:2225"

Page 22

Push Notifications

• Clients can register for notifications on a znode

• Avoids continuous polling

Page 23

Persistent and Ephemeral Znodes

• Persistent znodes persist until explicitly deleted

• Ephemeral znodes go away when the client that created them disconnects (see the sketch below)
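A minimal sketch of these znode operations and watches, using the third-party kazoo Python client (the library choice, host/port, and paths are assumptions; they mirror the tree shown above):

from kazoo.client import KazooClient

zk = KazooClient(hosts="localhost:2181")
zk.start()

# Persistent znode holding the current master's address
zk.ensure_path("/master")
zk.set("/master", b"master1.xxxx:2223")

# Ephemeral znode: removed automatically if this worker's session ends
zk.create("/workers/worker1", b"worker1.xxxx:2225", ephemeral=True, makepath=True)

# Basic reads: GetData and GetChildren
data, stat = zk.get("/master")
print(data.decode(), stat.version)
print(zk.get_children("/workers"))

# Push notification: called whenever the set of workers changes, instead of polling
@zk.ChildrenWatch("/workers")
def on_workers_change(children):
    print("workers are now:", children)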

Page 24

Hadoop Ecosystem


Page 26

Concepts

HDFS is designed for:

• Very large files

• Write-once, read-many access

• Commodity hardware

It is not designed for:

• Low-latency data access

• Lots of small files

• Arbitrary file modifications

Page 27

Architecture

Name Node

Data Nodes

Page 28

Architecture

Name Node (Master)

Data Nodes (Workers)


Page 30

Reading a file

The client asks the Name Node which Data Nodes hold the file's blocks, then reads the blocks directly from those Data Nodes.

Page 31

Writing a file

The client asks the Name Node where to write, streams the blocks to the Data Nodes (which replicate them), and the Name Node records where each block lives.

Page 32

Name Node Resiliency

The Name Node is a single point of failure (SPOF). Mitigations:

• Backup metadata

• Secondary Namenode*

• HDFS Federation

• HDFS High Availability

Page 33

Using HDFS

• CLI

• HTTP/HDFS Proxies

• Java Client

• NFS Gateway

• UI (Ambari)

Page 34

Let’s throw some data onto HDFS!
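For instance, via the CLI listed above. The sketch below stays in Python to match the rest of the examples and simply shells out to the standard hdfs dfs commands; the local and HDFS paths are assumptions based on the MovieLens download:

import subprocess

def hdfs(*args):
    # Run one `hdfs dfs` subcommand and raise if it fails
    subprocess.run(["hdfs", "dfs"] + list(args), check=True)

hdfs("-mkdir", "-p", "/user/maria_dev/ml-100k")                    # create a target directory
hdfs("-put", "-f", "ml-100k/u.data", "/user/maria_dev/ml-100k/")   # copy the local file up
hdfs("-ls", "/user/maria_dev/ml-100k")                             # confirm it landed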

Page 35

Hadoop Ecosystem

Page 36

Apache YARN

• Introduced in Hadoop 2

• Manages Resources (Yet Another Resource Negotiator)

• Originally part of MapReduce

• Paved the way for more powerful/performant MapReduce alternatives

Page 37

How YARN Works

• The application coordinates with YARN to distribute work across your cluster

• YARN attempts to process data on nodes that contain the relevant data blocks

• Scheduling options can be configured

• FIFO, Capacity, Fair Schedulers

Page 38

Hadoop Ecosystem

Page 39

MapReduce

• Distributes the processing of data on your cluster

• Resilient to failure

• Divides data into partitions

• Mappers transform data in parallel

• Reducers aggregate data together

Page 40

How MapReduce Works: Mapping

• The Mapper transforms source rows into key/value pairs

Input Data → Mapper → K1:V, K2:V, K3:V, K1:V, K3:V

Page 41

Who rated the most jokes?

Page 42

Example: Jester Data (jester_ratings.dat)

UserId | JokeId | Rating

63978 | 147 | 8.281
63978 | 113 | 7.781
63978 | 112 | 9.438
63978 | 119 | 8.906
63978 | 121 | 8.438
63978 | 130 | 8.781

Page 43

Example: Jester Data (jester_ratings.dat)

UserId | JokeId | Rating

63978 | 147 | 8.281
63978 | 113 | 7.781
63978 | 112 | 9.438
63979 | 119 | 8.906
63979 | 121 | 8.438
63979 | 130 | 8.781

Mapper

63978:147, 63978:113, 63978:112, 63979:119, 63979:121, 63979:130

Page 44

Extract and organize the data we care about

63978:147, 63978:113, 63978:112, 63979:119, 63979:121, 63979:130

Page 45

MapReduce “Shuffle and Sort”

63978:147, 63978:113, 63978:112, 63979:119, 63979:121, 63979:130

63978:147,113,112 63979:119,121,130

Page 46

The REDUCER Processes Each Key’s Values

63978:147,113,112 63979:119,121,130

Len(jokes)

63978:3 63979:3

Page 47

The Big Picture

UserId | JokeId | Rating
63978 | 147 | 8.281
63978 | 113 | 7.781
63978 | 112 | 9.438
63979 | 119 | 8.906
63979 | 121 | 8.438
63979 | 130 | 8.781

Mapper: 63978:147, 63978:113, 63978:112, 63979:119, 63979:121, 63979:130

Shuffle and Sort: 63978:147,113,112   63979:119,121,130

Reducer: 63978:3   63979:3
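The whole pipeline is easy to mimic in plain Python, which helps make the mapper, shuffle/sort, and reducer roles concrete before introducing mrjob (this is an illustrative sketch, not Hadoop code):

from collections import defaultdict

# Input rows in the jester "user | joke | rating" layout
rows = [
    "63978|147|8.281", "63978|113|7.781", "63978|112|9.438",
    "63979|119|8.906", "63979|121|8.438", "63979|130|8.781",
]

# Mapper: emit (userId, jokeId) pairs
mapped = []
for line in rows:
    user_id, joke_id, rating = line.split("|")
    mapped.append((user_id, joke_id))

# Shuffle and sort: group all values by key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reducer: count how many jokes each user rated
reduced = {user: len(jokes) for user, jokes in grouped.items()}
print(reduced)   # {'63978': 3, '63979': 3}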

Page 48

How MapReduce Scales

Page 49

The Big Picture

UserId | JokeId | Rating
978 | 147 | 8.281
976 | 113 | 7.781
978 | 112 | 9.438
979 | 119 | 8.906
974 | 121 | 8.438
978 | 130 | 8.781

Mapper: 978:147, 976:113, 978:112, 979:119, 974:121, 978:130

Shuffle and Sort splits the keys across two reducers:
Reducer 1 gets 974:121 and 976:113, yielding 974:1 and 976:1
Reducer 2 gets 978:147,112,130 and 979:119, yielding 978:3 and 979:1

Page 50

Coordinating Distributed Map/Reduce Tasks

The client node submits the job to YARN; YARN launches a MapReduce Application Master, which runs MapTasks and ReduceTasks in containers on the NodeManager nodes, reading from and writing to HDFS.

Page 51

How Mappers & Reducers are Written

• MapReduce is Java Native

• STREAMING allows interfacing to other languages

MapTasks and ReduceTasks exchange key/value pairs with your external script over stdin and stdout.
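A Streaming job equivalent to the mrjob example coming up could be built from two small Python scripts. This is a hedged sketch (the file names and the jester '|' field layout are assumptions), following Hadoop Streaming's tab-separated stdin/stdout contract; the pair would be submitted with the hadoop-streaming JAR via its -mapper and -reducer options.

# mapper.py - reads raw "user | joke | rating" lines on stdin and
# emits "jokeId<TAB>1" on stdout for the framework to shuffle and sort.
import sys

for line in sys.stdin:
    fields = line.strip().split('|')
    if len(fields) == 3:
        print(fields[1].strip() + "\t1")

# reducer.py - receives the mapper output already sorted by key and
# emits one "jokeId<TAB>count" line per joke.
import sys

current_key = None
count = 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key and current_key is not None:
        print(current_key + "\t" + str(count))
        count = 0
    current_key = key
    count += int(value)
if current_key is not None:
    print(current_key + "\t" + str(count))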

Page 52

Handling Failure

• The Application Master monitors worker tasks
  • Restarts them as needed
  • Reassigns them to a different node

• YARN monitors the Application Master
  • Restarts it as needed

• YARN monitors nodes
  • Restarts them as needed

• In an HA configuration, Zookeeper monitors YARN and switches to a hot standby if needed


Page 53

MapReduce in Action: Which is the most rated joke?

Page 54

from mrjob.job import MRJob
from mrjob.step import MRStep

Page 55

class JokeRatings(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

Page 56

    def mapper_get_ratings(self, _, line):
        (userID, JokeID, rating) = line.split('|')
        yield JokeID, 1

    def reducer_count_ratings(self, key, values):
        yield sum(values), key

Page 57

from mrjob.job import MRJob
from mrjob.step import MRStep

class JokeRatings(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    def mapper_get_ratings(self, _, line):
        (userID, JokeID, rating) = line.split('|')
        yield JokeID, 1

    def reducer_count_ratings(self, key, values):
        yield sum(values), key

if __name__ == '__main__':
    JokeRatings.run()
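To try this out, mrjob scripts are typically run locally (for example python JokeRatings.py jester_ratings.dat, assuming those file names) or submitted to the cluster by adding mrjob's -r hadoop runner option.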

Page 58

Hadoop Ecosystem

Page 59

Introducing Hive

• Abstracts away from Map/Reduce

• Use SQL Syntax

• Interactive

• Scalable (query across cluster)

• Extensible

• Highly Optimized

• Support JDBC/ODBC

• Leverage DB/OLAP Skillset


Page 61

Write SQL*, Get MapReduce

Page 62

SQL* == HiveQL

• Essentially MySQL syntax

• Specify data structure and partitions

• Chain views together and use as tables

Page 63

Using HIVE

• CLI

• Query Files

• Oozie

• Web/UI (Ambari)

• JDBC/ODBC Server

• Thrift Service

Page 64

Most Rated Joke (in Hive)

Page 65

from mrjob.job import MRJob
from mrjob.step import MRStep

class JokeRatings(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    def mapper_get_ratings(self, _, line):
        (userID, JokeID, rating) = line.split('|')
        yield JokeID, 1

    def reducer_count_ratings(self, key, values):
        yield sum(values), key

if __name__ == '__main__':
    JokeRatings.run()

Page 66

SELECT joke_id, COUNT(*) AS ratingCount
FROM jester_ratings
GROUP BY joke_id;

Page 67

Schema on Read vs Schema on Write

Page 68

CREATE TABLE JokeRatings (
    UserID INT,
    JokeID INT,
    Rating FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '${env:HOME}/jester/jester_ratings.dat'
OVERWRITE INTO TABLE JokeRatings;

Schema on Read

Page 69

Storing Data

• LOAD DATA

• MOVES data from HDFS into Hive

• LOAD DATA LOCAL

• COPIES data from local filesystem into HIVE

• Managed vs External

Page 70

Managed vs External Tables

• Data MOVED into Hive is managed by Hive

• CREATE EXTERNAL TABLE leaves the data available in HDFS

Page 71

Partitioning

• Store data in partitioned subdirectories

CREATE TABLE JokeRatings (
    JokeID INT,
    Rating FLOAT
)
PARTITIONED BY (UserID INT);

Page 72

Introducing Drill

• External SQL query engine

• Unify/query across multiple data sources

• HDFS

• MongoDB

• S3

• Azure Blob Storage

• Hive

• HBase

• etc…


Page 75

Hadoop Ecosystem

Page 76

Introducing Apache Pig

Page 77

Introducing Apache Pig

• Abstracts Map/Reduce Complexity

• Uses PIG LATIN scripting language

• Highly extensible with UDFs

• Can perform faster than MapReduce

Page 78

Running Pig Scripts

• Grunt

• Script

• Browser (Ambari)

Page 79

Pig Latin Crash Course

Page 80

Schema on Read

movieRatings = LOAD '/user/maria_dev/ml-100k/u.data'
    AS (userId:int, movieId:int, rating:int, ratingTime:int);

Page 81

Working With Relations

• LOAD, STORE, DUMP

• FILTER, DISTINCT, FOREACH/GENERATE, MAPREDUCE, STREAM, SAMPLE

• JOIN, COGROUP, GROUP, CROSS, CUBE

• ORDER, RANK, LIMIT

• UNION, SPLIT


Page 84

Diagnostics

• DESCRIBE

• EXPLAIN

• ILLUSTRATE

Page 85

UDFs

• REGISTER

• DEFINE

• IMPORT

Page 86

Aggregate Functions

• AVG

• COUNT

• CONCAT

• MAX

• MIN

• SIZE

• SUM

Page 87

Loaders

• PigStorage

• TextLoader

• JsonLoader

• HBaseStorage

Page 88

Hands on Example: Mining Movie Rating Data

Page 89

LOAD ratings data

movieRatings = LOAD '/user/maria_dev/ml-100k/u.data'
    AS (userId:int, movieId:int, rating:int, ratingTime:int);

Page 90

Specify non-default delimiter: PigStorage

movieData = LOAD '/user/maria_dev/ml-100k/u.item'
    USING PigStorage('|')
    AS (movieId:int, movieTitle:chararray, releaseDate:chararray, videoRelease:chararray, imdbLink:chararray);

Page 91

Create a new relation from another relation

movieData = LOAD '/user/maria_dev/ml-100k/u.item'
    USING PigStorage('|')
    AS (movieId:int, movieTitle:chararray, releaseDate:chararray, videoRelease:chararray, imdbLink:chararray);

nameLookup = FOREACH movieData GENERATE movieId, movieTitle,
    ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;

Page 92

Group By

ratingsByMovie = GROUP movieRatings BY movieId;

Page 93

Compute avgRatings

avgRatings = FOREACH ratingsByMovie GENERATE group AS movieId,
    AVG(movieRatings.rating) AS avgRating;

Page 94

Filtering Results

fiveStarMovies = FILTER avgRatings BY avgRating > 4.0;

Page 95

JOIN

fiveStarWithNames = JOIN fiveStarMovies BY movieId, nameLookup BY movieId;

Page 96

ORDER BY

oldestFiveStarMovies = ORDER fiveStarWithNames BY nameLookup::releaseTime;

DUMP oldestFiveStarMovies;

Page 97

Let’s Run It

Page 98

Hadoop Ecosystem

Page 99

Introducing HBASE

• NoSQL database built on HDFS

• Based on Google's BigTable

• No query language - access is through an API

• Automatic sharding

Page 100

HBase Data Model

• Fast access to any given row

• A row is referenced by a unique key

• Each row has a small number of column families

• A column family may contain arbitrary columns

• You can have a very large number of columns in a column family

• Each Cell can have many versions with given timestamps

• Sparse data is ok. Missing columns in a row consume no storage

Page 101

Example: Web Table

Row key: com.cnn.www

Contents column family - timestamped versions of the page itself:
    contents: = "<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" …" (several versions)

Anchor column family - one column per referring site:
    anchor:cnnsi.com  = "CNN"
    anchor:my.look.ca = "CNN.com"

Page 102

Accessing HBase

• HBase shell

• API

• Wrappers in many languages

• Spark, Hive, Pig

• Rest Service

• Thrift Service

• Avro Service
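As a concrete example of the Thrift route, the third-party happybase Python client (an assumption - any of the paths above works) could recreate the web table row from the earlier slide. Host, port, table name, and the contents:html qualifier are illustrative:

import happybase

# Connect to the HBase Thrift service (host/port are assumptions)
connection = happybase.Connection("localhost", port=9090)

# Create the table with its two column families if it does not exist yet
if b"webtable" not in connection.tables():
    connection.create_table("webtable", {"contents": dict(), "anchor": dict()})

table = connection.table("webtable")

# One row, keyed by the reversed domain, with columns in two families
table.put(b"com.cnn.www", {
    b"contents:html": b"<!DOCTYPE HTML ...>",
    b"anchor:cnnsi.com": b"CNN",
    b"anchor:my.look.ca": b"CNN.com",
})

print(table.row(b"com.cnn.www"))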

Page 103

Hadoop Ecosystem

Page 104

Introducing Apache Spark

“A fast and general engine for large-scale data processing”

Page 105

Spark Components

• Spark Streaming

• Spark SQL

• MLLib

• GraphX

Page 106

Spark is Scalable

Driver Program (SparkContext) → Cluster Manager (Spark standalone or YARN) → Executors (cache, tasks)

Page 107

Spark is Fast

• “Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.”

• Directed Acyclic Graph Engine optimizes workflows

Page 108

Spark is Popular

• Amazon

• Yahoo

• Groupon

• TripAdvisor

• NASA JPL

• Ebay

*https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

Page 109

Easy to Code

• Code in Java, Scala, Python

• Built around one primary concept: Resilient Distributed Dataset (RDD)

Page 110

What is an RDD?

Page 111

SparkContext

• Created by your driver program

• SparkContext Makes your RDDs Resilient and Distributed

• It creates the RDDs

• The spark shell creates an “sc” object for you

Page 112

Creating RDDs

• rdd = sc.parallelize([1, 2, 3, 4])

• rdd = sc.textFile("hdfs:///c:/users/michael/bigTextFile.txt") - or file://, s3n://, etc.

• hiveCtx = HiveContext(sc)
  rows = hiveCtx.sql("SELECT name, age FROM users")

• Create from: JDBC, Cassandra, HBase, JSON, CSV, etc.

Page 113

Transforming RDDs

• map

• flatmap

• filter

• distinct

• sample

• union, intersection, subtract, cartesian

Page 114

map Example

• rdd = sc.parallelize([1,2,3,4])

• squaredRDD = rdd.map(lambda x: x*x)

Yields 1,4,9,16
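For contrast, flatMap (listed on the previous slide) flattens each element's results into a single RDD; a quick sketch in the PySpark shell (the sample strings are made up):

lines = sc.parallelize(["the quick fox", "jumped"])
words = lines.flatMap(lambda line: line.split())   # ["the", "quick", "fox", "jumped"]
nested = lines.map(lambda line: line.split())      # [["the", "quick", "fox"], ["jumped"]]
print(words.collect())
print(nested.collect())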

Page 115

RDD actions

• collect

• count

• countByValue

• take

• top

• Reduce

• …
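A few of these actions in the PySpark shell, using the sc object the shell provides (the sample numbers are made up):

rdd = sc.parallelize([3, 1, 2, 3, 3])
print(rdd.count())                      # 5
print(rdd.countByValue())               # counts per value: 3 -> 3, 1 -> 1, 2 -> 1
print(rdd.take(2))                      # first two elements: [3, 1]
print(rdd.top(2))                       # two largest: [3, 3]
print(rdd.reduce(lambda a, b: a + b))   # 12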


Page 117

Example - Find the lowest average rating.

Page 118

from pyspark import SparkConf, SparkContext

def loadMovieNames():
    movieNames = {}
    with open("ml-100k/u.item") as f:
        for line in f:
            fields = line.split('|')
            movieNames[int(fields[0])] = fields[1]
    return movieNames

Page 119

def parseInput(line):
    fields = line.split()
    return (int(fields[1]), (float(fields[2]), 1.0))

if __name__ == "__main__":
    # The main script - create our SparkContext
    conf = SparkConf().setAppName("WorstMovies")
    sc = SparkContext(conf=conf)

    # Load up our movie ID -> movie name lookup table
    movieNames = loadMovieNames()

    # Load up the raw u.data file
    lines = sc.textFile("hdfs:///user/maria_dev/ml-100k/u.data")

Page 120

    # Convert to (movieID, (rating, 1.0))
    movieRatings = lines.map(parseInput)

    # Reduce to (movieID, (sumOfRatings, totalRatings))
    ratingTotalsAndCount = movieRatings.reduceByKey(
        lambda movie1, movie2: (movie1[0] + movie2[0], movie1[1] + movie2[1]))

    # Map to (movieID, averageRating)
    averageRatings = ratingTotalsAndCount.mapValues(
        lambda totalAndCount: totalAndCount[0] / totalAndCount[1])

Page 121

    # Sort by average rating
    sortedMovies = averageRatings.sortBy(lambda x: x[1])

    # Take the top 10 results
    results = sortedMovies.take(10)

    # Print them out:
    for result in results:
        print(movieNames[result[0]], result[1])

Page 122

DataFrames & DataSets

• Extend RDDs

• Contain Row Objects

• Can run SQL Queries

• Has a Schema

• Read/Write to JSON, HIVE, Parquet

• Supports JDBC/ODBC, Tableau

• Query using SparkSQL

Page 123

Working with DataFrames without SparkSQL

• myDataFrame.show()

• myDataFrame.select("fieldName")

• myDataFrame.filter(myDataFrame("count") > 52)

• myDataFrame.groupBy(myDataFrame("quantity")).mean()

• myDataFrame.rdd().map(mapperFunction)
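For the "Query using SparkSQL" route mentioned on the previous slide, the same hypothetical myDataFrame can instead be registered as a temporary view and queried with SQL (a minimal sketch; the view and column names are assumptions):

# Register the DataFrame as a temporary SQL view, then query it with SparkSQL
myDataFrame.createOrReplaceTempView("my_table")
results = spark.sql("SELECT quantity, COUNT(*) AS cnt FROM my_table GROUP BY quantity")
results.show()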

Page 124

Find Lowest Avg Movie using Datasets

Page 125

from pyspark.sql import SparkSession, Row

def loadMovieNames():
    movieNames = {}
    with open("ml-100k/u.item") as f:
        for line in f:
            fields = line.split('|')
            movieNames[int(fields[0])] = fields[1]
    return movieNames

Page 126

def parseInput(line):
    fields = line.split()
    return Row(movieID=int(fields[1]), rating=float(fields[2]))

if __name__ == "__main__":
    # The main script - create our SparkSession
    spark = SparkSession.builder.appName("WorstMovies").getOrCreate()

    # Load up our movie ID -> movie name lookup table
    movieNames = loadMovieNames()

Page 127

    # Get the raw data
    lines = spark.sparkContext.textFile("hdfs:///user/maria_dev/ml-100k/u.data")
    # Convert it to an RDD of Row objects with (movieID, rating)
    movies = lines.map(parseInput)
    # Convert that to a DataFrame
    movieDataset = spark.createDataFrame(movies)

Page 128

    # Compute average rating for each movieID
    averageRatings = movieDataset.groupBy("movieID").avg("rating")

    # Compute count of ratings for each movieID
    counts = movieDataset.groupBy("movieID").count()

    # Join the two together (we now have movieID, avg(rating), and count columns)
    averagesAndCounts = counts.join(averageRatings, "movieID")

    # Filter out movies rated 10 or fewer times
    popularAveragesAndCounts = averagesAndCounts.filter("count > 10")

Page 129

    # Pull the top 10 results
    topTen = popularAveragesAndCounts.orderBy("avg(rating)").take(10)

    # Print them out, converting movie IDs to names as we go
    for movie in topTen:
        print(movieNames[movie[0]], movie[1], movie[2])

    # Stop the session
    spark.stop()
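A script like this is normally launched with spark-submit (for example spark-submit LowestRatedMovieDataFrame.py, assuming that file name) rather than a plain Python interpreter, so the Spark libraries and cluster configuration are picked up.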

Page 130

Hadoop Ecosystem

Page 131

What is Storm

• A framework for stream processing on your cluster

• Works on individual events

• sub-second latency

Page 132

Storm Terminology

• A stream consists of tuples that flow through it

• Spouts are sources of stream data (Kafka, etc.)

• Bolts process stream data as it's received
  • transform, aggregate, write to databases

• A topology is a group of spouts and bolts that process your stream

Page 133

Developing Storm Applications

• Typically written in java

• Storm core

• Low level API for storm

• “At least once”

• Trident

• Higher level API for storm

• “Exactly Once”

• Storm topologies run "forever" once submitted… until you stop them

Page 134

Hadoop Ecosystem

Page 135

Using MLLib in Spark

Page 136

from pyspark.sql import SparkSession, Row

def loadMovieNames():
    movieNames = {}
    with open("ml-100k/u.item") as f:
        for line in f:
            fields = line.split('|')
            movieNames[int(fields[0])] = fields[1]
    return movieNames

Page 137

def parseInput(line):
    fields = line.split()
    return Row(userID=int(fields[0]), movieID=int(fields[1]), rating=float(fields[2]))

if __name__ == "__main__":
    # The main script - create our SparkSession
    spark = SparkSession.builder.appName("WorstMovies").getOrCreate()

    # Load up our movie ID -> movie name lookup table
    movieNames = loadMovieNames()

Page 138

    # Get the raw data
    lines = spark.sparkContext.textFile("hdfs:///user/maria_dev/ml-100k/u.data")

    # Convert it to an RDD of Row objects with (userID, movieID, rating)
    ratingsRDD = lines.map(parseInput)

    # Convert to a DataFrame and cache it
    ratings = spark.createDataFrame(ratingsRDD).cache()

Page 139
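The deck jumps straight to model.transform below; the missing middle - counting ratings and training an ALS model - presumably looked something like this sketch. The imports, parameter values, and the 100-rating threshold are assumptions based on the surrounding code:

    # These imports would normally sit at the top of the script
    from pyspark.ml.recommendation import ALS
    from pyspark.sql.functions import lit

    # Count how many times each movie was rated, keeping only well-known titles
    ratingCounts = ratings.groupBy("movieID").count().filter("count > 100")

    # Train an ALS collaborative-filtering model on the ratings DataFrame
    als = ALS(maxIter=5, regParam=0.01,
              userCol="userID", itemCol="movieID", ratingCol="rating")
    model = als.fit(ratings)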

# Construct a "test" dataframe for user 0 with every movie rated more than 100 times popularMovies = ratingCounts.select("movieID").withColumn('userID', lit(0))

# Run our model on that list of popular movies for user ID 0 recommendations = model.transform(popularMovies)

# Get the top 20 movies with the highest predicted rating for this user topRecommendations = recommendations.sort(recommendations.prediction.desc()).take(20)

for recommendation in topRecommendations: print (movieNames[recommendation['movieID']], recommendation['prediction'])

spark.stop()

Page 140

Hadoop Ecosystem


Page 143

SQOOP

Page 144

Sqoop

Sqoop launches parallel Mappers that move data between an RDBMS and HDFS.

Page 145

Sqoop Example: Import data from MySQL

sqoop import --connect jdbc:mysql://localhost/movielens --driver com.mysql.jdbc.Driver --table movies

Page 146

Sqoop Example: Import data from MySQL to Hive

sqoop import --connect jdbc:mysql://localhost/movielens --driver com.mysql.jdbc.Driver --table movies --hive-import

Page 147

Incremental Imports

• You can keep Hadoop in sync with your RDBMS

• --check-column and --last-value (used together with --incremental)

Page 148

Sqoop Example: Export data to MySQL from Hive

sqoop export --connect jdbc:mysql://localhost/movielens --driver com.mysql.jdbc.Driver --table exported_movies --export-dir /apps/hive/warehouse/movies --input-fields-terminated-by '\0001'

Page 149

Thank you!
Michael Carducci (@MichaelCarducci)