NETE4631 Network Information Systems (NISs): Big Data and Scaling in the Cloud Suronapee, PhD...

NETE4631 Network Information Systems (NISs):

Big Data and Scaling in the Cloud

Suronapee, PhD


1

Huge Amount of Data

2

Today's statistics

Statistical facts on the number of devices

3

Big Data

A collection of data sets so large and complex that it is impractical to process on one computer with the usual databases and tools.

Big Data represents information assets characterized by “High Volume, Velocity, and Variety”.

Because of its size and complexity, Big Data is hard to capture, store, copy, delete (privacy), search, share, analyze, and visualize.

4

Big Data Processing

Big Data requires specific technology and analytical methods to transform it into value. Combining it with the cloud makes this processing possible.

5

Figure: Input -> Process -> Derived data

MapReduce

What is MapReduce?

Programming model from LISP

Scatter and gather principles

Many problems can be phrased this way

Large input data makes even simple computations hard

Advantages

Easy to process and generate large data sets

Hides the difficulty of writing parallel code

System takes care of scheduling, load balancing, handling machine failures, etc.

6

MapReduce Programming Model

The computation takes a set of input key/value pairs and produces a set of output key/value pairs.

The user expresses the computation as two functions: Map and Reduce.

Map: takes an input pair and produces a set of intermediate key/value pairs.

Reduce: accepts an intermediate key I and a set of values for that key, and merges these values together to form a possibly smaller set of values (typically one output value).

7

Word Count

8

Count the number of times each distinct word appears in the file.

MAP(KEY = LINE, VALUE = CONTENTS):

REDUCE(KEY, VALUES):
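The bodies of MAP and REDUCE are not preserved in this transcript. A minimal Python sketch of the usual word-count formulation follows; the names `map_fn`, `reduce_fn`, and `word_count`, and the choice to lowercase words, are assumptions for illustration and not taken from the slides.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: line number (unused), value: line contents. Emits (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """key: a word, values: every count emitted for that word. Emits (word, total)."""
    yield key, sum(values)


def word_count(lines):
    # Group intermediate values by key (the job of the MapReduce runtime).
    groups = defaultdict(list)
    for line_no, contents in enumerate(lines):
        for word, count in map_fn(line_no, contents):
            groups[word].append(count)
    # Apply the reduce function once per distinct key.
    result = {}
    for word, counts in groups.items():
        for out_key, total in reduce_fn(word, counts):
            result[out_key] = total
    return result


counts = word_count(["the quick brown fox", "the lazy dog", "The fox"])
print(counts["the"], counts["fox"], counts["dog"])  # 3 2 1
```

Everything here runs in one process; a real MapReduce system would run the map and reduce calls on different machines.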

Word Count Illustrated

9

Observation

Conceptually, the map and reduce functions supplied by the user have associated types.

The input keys and values are drawn from a different domain than the output keys and values.

Furthermore, the intermediate keys and values are from the same domain as the output keys and values.
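The type relationship described above can be written down with Python type variables. This is a sketch only: `K1`/`V1` stand for the input domain, `K2`/`V2` for the intermediate and output domain, and `run_mapreduce` plus the temperature example are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")  # input key domain
V1 = TypeVar("V1")  # input value domain
K2 = TypeVar("K2")  # intermediate and output key domain
V2 = TypeVar("V2")  # intermediate and output value domain


def run_mapreduce(
    inputs: Iterable[Tuple[K1, V1]],
    map_fn: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],
    reduce_fn: Callable[[K2, List[V2]], Iterable[V2]],
) -> Dict[K2, List[V2]]:
    # map: (k1, v1) -> list of (k2, v2)
    groups: Dict[K2, List[V2]] = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # reduce: (k2, list of v2) -> list of v2
    return {k2: list(reduce_fn(k2, values)) for k2, values in groups.items()}


def map_fn(file_name: str, line: str) -> Iterable[Tuple[int, float]]:
    # Input domain: (str, str). Intermediate domain: (int, float).
    year, temperature = line.split(",")
    yield int(year), float(temperature)


def reduce_fn(year: int, temperatures: List[float]) -> Iterable[float]:
    # Output values stay in the intermediate value domain (float).
    yield max(temperatures)


readings = [("a.csv", "2019,31.5"), ("a.csv", "2020,29.0"), ("b.csv", "2019,33.0")]
result = run_mapreduce(readings, map_fn, reduce_fn)
print(result)  # {2019: [33.0], 2020: [29.0]}
```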

10

PageRank Algorithm

Phase 1: Propagation

Phase 2: Aggregation

Input: a pool of objects, including both vertices and edges

11

PageRank: Propagation

Map: for each object

If the object is a vertex, emit key=URL, value=object

If the object is an edge, emit key=source URL, value=object

Reduce: (input is a web page and all its outgoing links)

Count the edge objects -> number of outgoing links

Read the PageRank value from the vertex object

Assign PR(edges)=PR(vertex)/num_outgoing

12

PageRank: Aggregation

Map: for each object

If the object is a vertex, emit key=URL, value=object

If the object is an edge, emit key=destination URL, value=object

Reduce: (input is a web page and all its incoming links)

Add the PR values of all incoming links

Assign PR(vertex)=ΣPR(incoming links)
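One propagation/aggregation round can be sketched in Python as below. The `Vertex` and `Edge` classes and the helper names are hypothetical. The sketch follows the two phases exactly as stated on the slides (no damping factor), and it assumes every URL that appears in an edge also has a vertex object in the pool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Vertex:
    url: str
    pr: float


@dataclass
class Edge:
    src: str
    dst: str
    pr: float = 0.0


def group_by(objects, edge_key):
    """Map step: key vertices by their URL and edges by edge_key(edge)."""
    groups = defaultdict(list)
    for obj in objects:
        key = obj.url if isinstance(obj, Vertex) else edge_key(obj)
        groups[key].append(obj)
    return groups


def propagate(objects):
    """Phase 1: each vertex splits its PR evenly over its outgoing edges."""
    out = []
    for group in group_by(objects, lambda e: e.src).values():
        vertex = next(o for o in group if isinstance(o, Vertex))
        edges = [o for o in group if isinstance(o, Edge)]
        out.append(vertex)
        out.extend(Edge(e.src, e.dst, vertex.pr / len(edges)) for e in edges)
    return out


def aggregate(objects):
    """Phase 2: each vertex's new PR is the sum over its incoming edges."""
    out = []
    for url, group in group_by(objects, lambda e: e.dst).items():
        edges = [o for o in group if isinstance(o, Edge)]
        out.append(Vertex(url, sum(e.pr for e in edges)))
        out.extend(edges)
    return out


pool = [
    Vertex("A", 1 / 3), Vertex("B", 1 / 3), Vertex("C", 1 / 3),
    Edge("A", "B"), Edge("A", "C"), Edge("B", "C"), Edge("C", "A"),
]
after_round = aggregate(propagate(pool))
ranks = {o.url: o.pr for o in after_round if isinstance(o, Vertex)}
print(ranks)
```

In this small graph, A keeps rank 1/3 (from C), B receives 1/6 (half of A), and C receives 1/2 (half of A plus all of B). Repeating the round iterates toward the final ranking.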

13

More Examples

Distributed Grep:

Map: emits a line if it matches a supplied pattern

Reduce: copies the supplied intermediate data to the output

Count of URL Access Frequency:

Map: processes logs of web page requests, outputs (URL, 1)

Reduce: adds together all values for the same URL and emits (URL, total count) pairs

Reverse Web-Link Graph:

Map: outputs a (target, source) pair for each link to a target URL found in a page named source

Reduce: concatenates all source URLs associated with a given target URL and emits (target, list(source))

Distributed Sort:

Map: extracts a key from each record and emits a (key, record) pair

Reduce: emits all pairs unchanged
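As an illustration, the first two examples (distributed grep and URL access frequency) can be sketched in Python on top of a tiny in-memory driver. The driver `run_mapreduce`, the log format (URL followed by a status code), and the search pattern are assumptions made for this sketch.

```python
import re
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy single-process driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return {k: list(reduce_fn(k, vs)) for k, vs in groups.items()}


PATTERN = re.compile(r" 404$")  # hypothetical pattern: requests that returned 404


def grep_map(line_no, line):
    # Emit the line only if it matches the supplied pattern.
    if PATTERN.search(line):
        yield line_no, line


def identity_reduce(_key, values):
    # Copy the intermediate data to the output unchanged.
    yield from values


def url_count_map(_line_no, line):
    # Assumed log format: the URL is the first whitespace-separated field.
    yield line.split()[0], 1


def sum_reduce(_url, counts):
    yield sum(counts)


log = list(enumerate([
    "/index.html 200",
    "/about.html 200",
    "/missing.html 404",
    "/index.html 200",
]))

matches = run_mapreduce(log, grep_map, identity_reduce)
hits = run_mapreduce(log, url_count_map, sum_reduce)
print(matches)  # {2: ['/missing.html 404']}
print(hits)     # {'/index.html': [2], '/about.html': [1], '/missing.html': [1]}
```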

14

Implementation

15

Overall flow of a MapReduce operation

Execution Overview

When the user calls MapReduce, the sequence of actions is:

1. The MapReduce library first splits the input files into M pieces (typically 16-64 MB each) and starts up many copies of the program on a cluster of machines.

2. The master, one of the program copies, assigns work to the workers.

3. A worker assigned a map task reads the contents of the corresponding input split, parses key/value pairs from the input data, and passes each pair to the Map function. The intermediate key/value pairs produced are buffered in memory.

4. The buffered pairs are written to local disk, partitioned into R regions by the partitioning function. Their locations are passed back to the master.

5. The master forwards these locations to the reduce workers.

6. A reduce worker reads the intermediate data and sorts it by key.

7. The reduce worker applies the Reduce function and appends the result to the output.
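The partitioning of map output into R regions can be sketched as follows. A common choice of partitioning function is a hash of the key modulo R; the sketch uses `zlib.crc32` because Python's built-in `hash` for strings differs between runs. The function names and the in-memory lists standing in for local-disk files are assumptions.

```python
import zlib


def partition(key: str, num_regions: int) -> int:
    """Assign a key to one of R regions; the same key always gets the same region."""
    return zlib.crc32(key.encode("utf-8")) % num_regions


def run_map_task(split, map_fn, num_regions):
    """Run map_fn over one input split and bucket its output into R regions.

    Each inner list stands in for one of the R files a map worker would
    write to its local disk.
    """
    regions = [[] for _ in range(num_regions)]
    for key, value in split:
        for k, v in map_fn(key, value):
            regions[partition(k, num_regions)].append((k, v))
    return regions


def word_map(_line_no, line):
    for word in line.split():
        yield word, 1


R = 3
splits = [
    [(0, "a rose is a rose")],   # input split for map task 0
    [(0, "is a rose a rose")],   # input split for map task 1
]
map_outputs = [run_map_task(split, word_map, R) for split in splits]


def regions_holding(key):
    """Region indexes in which `key` appears, across all map tasks."""
    return {
        r
        for output in map_outputs
        for r in range(R)
        for k, _ in output[r]
        if k == key
    }


for word in ("a", "rose", "is"):
    print(word, regions_holding(word))
```

Because every map task uses the same partitioning function, all pairs for a given key end up in the same region index, so a single reduce worker can collect them.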

16

Parallelism

map() functions run in parallel, creating different intermediate values from different input data sets.

reduce() functions also run in parallel, each working on a different output key.

All values are processed independently.

Bottleneck: the reduce phase can’t start until the map phase is completely finished.
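A small Python sketch of this structure, using a thread pool as a stand-in for a cluster: map tasks run concurrently, the program waits for all of them, and only then do the reduce tasks run, one per key. The task functions and input splits are hypothetical, and threads in CPython illustrate the control flow rather than real CPU parallelism.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(split):
    """One map task: turn the lines of a split into (word, 1) pairs."""
    return [(word, 1) for line in split for word in line.split()]


def reduce_task(item):
    """One reduce task: sum the values collected for a single key."""
    key, values = item
    return key, sum(values)


splits = [["to be or not to be"], ["to do is to be"], ["do be do"]]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Map phase: tasks run concurrently. list() blocks until every map task
    # has finished, which is the barrier before the reduce phase.
    map_outputs = list(pool.map(map_task, splits))

    # Group intermediate values by key.
    groups = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            groups[key].append(value)

    # Reduce phase: each key is handled independently of the others.
    totals = dict(pool.map(reduce_task, groups.items()))

print(totals)
```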

17

Combiners

Often a map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k

E.g., popular words in Word Count

Can save network time by pre-aggregating at the mapper

Combine (k1, list(v1)) -> v2

Usually the same as the reduce function

Works only if the reduce function is commutative and associative
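The effect of a combiner, and the reason for the commutative/associative requirement, can be sketched in Python. The function names and sample sentence are assumptions for illustration.

```python
from collections import Counter
from statistics import mean


def map_words(line):
    """Map output for one line: one (word, 1) pair per word occurrence."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Pre-aggregate on the mapper: one (word, partial_sum) pair per distinct word."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())


raw = map_words("the cat and the hat and the bat")
combined = combine(raw)
print(len(raw), len(combined))  # 8 pairs before combining, 5 after

# Summation is commutative and associative, so combining partial sums is safe.
# Averaging is not: the mean of partial means differs from the true mean.
values = [1, 2, 3]
true_mean = mean(values)                       # 2
combined_mean = mean([mean(values[:2]), 3])    # mean of [1.5, 3] = 2.25
print(true_mean, combined_mean)
```

For a function like the mean, a common workaround is to have the combiner emit (sum, count) pairs and divide only in the final reduce.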

18

Hadoop

19

Hadoop Execution

1. Client submits “wordcount” job, indicating code and input files

2. JobTracker breaks input file into k chunks (64 MB each). Assigns work to TaskTrackers

3. After map(), TaskTrackers exchange map output to group it by key

4. JobTracker breaks reduce() keyspace into m chunks. Assigns work

5. reduce() output may go to HDFS
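The number of chunks k in step 2 follows from the file size by ceiling division. A small sketch, assuming the 64 MB chunk size quoted above; the function name is hypothetical.

```python
MB = 1024 * 1024


def num_chunks(file_size_bytes: int, chunk_size_bytes: int = 64 * MB) -> int:
    """Number of chunks needed to cover the file (ceiling division)."""
    return -(-file_size_bytes // chunk_size_bytes)


print(num_chunks(1024 * MB))  # 16 chunks for a 1 GB file
print(num_chunks(200 * MB))   # 4 chunks: three full ones and one partial
```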

20

Map-Machine

Reads contents of assigned portion of input file

Parses and prepares data for input to map function

Passes data into map function and saves result in memory

Periodically writes completed work to local disk

Notifies Master of this partially completed work (intermediate data)

21

Reduce-Machine

Receives notification from Master of partially completed work

Retrieves intermediate data from Map-Machine via remote-read

Sorts intermediate data by key

Iterates over intermediate data

For each unique key, sends corresponding set through reduce function

Appends result of reduce function to final output file (HDFS)
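The sort, iterate, and reduce steps on the reduce machine can be sketched with `itertools.groupby`, which groups adjacent equal keys and therefore relies on the sort. The function name and the in-memory lists standing in for remote reads and the HDFS output file are assumptions.

```python
from itertools import groupby
from operator import itemgetter


def run_reduce_task(fetched, reduce_fn):
    """fetched: one list of (key, value) pairs per map machine."""
    # Merge the data retrieved from all map machines and sort it by key.
    pairs = [pair for chunk in fetched for pair in chunk]
    pairs.sort(key=itemgetter(0))

    output = []  # stands in for the final output file
    # Iterate over the sorted data, one group per unique key.
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = [value for _, value in group]
        output.append((key, reduce_fn(key, values)))
    return output


fetched = [
    [("fox", 1), ("the", 1)],   # from map machine 0
    [("the", 1), ("dog", 1)],   # from map machine 1
    [("the", 1), ("fox", 1)],   # from map machine 2
]
output = run_reduce_task(fetched, lambda key, values: sum(values))
print(output)  # [('dog', 1), ('fox', 2), ('the', 3)]
```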

22

Data Flow

Input and final output are stored on a distributed file system

Scheduler tries to schedule map tasks “close” to the physical storage location of the input data

Intermediate results are stored on the local FS of map and reduce workers

Output is often input to another MapReduce task

23

Capacity planning

24

Cloud providers have to scale on demand

Capacity planning: match demand to available resources

Scaling in cloud

25

Scale vertically (scale up): add resources to a node (or a server) to make it more powerful

Scale horizontally (scale out): add more nodes (or commodity servers)

Building blocks in cloud

26

Data center

Server: what we want to connect

Switch control: who is connected right now (enabling data to flow)

Switch

A layer 2 device that deals with local networking

Switching a connection is based on its own internal hardware

Scaling the servers

27

Add more ports to the switch

Must support hundreds of thousands of gigabits each second

Hundreds of thousands of servers in a data center

Each of which requires up to 1 Gbps

Infeasible

Add more switches

Imagine a tree-like structure

What happens as we keep going up the tree?

28

Technologically impossible to build the enormous root switch

Increasing ports is expensive

What happens if the root fails?

Switches can’t handle that much load

Max per switch = 2 Gbps

The other 2 connections are useless
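The load problem near the root can be made concrete with a little arithmetic. The sketch below assumes a plain tree in which each switch connects `fanout` children, every server sends at 1 Gbps, and in the worst case all traffic leaves the subtree. The figure of 100,000 servers is an illustrative stand-in for "hundreds of thousands" and the function name is hypothetical.

```python
def uplink_demand_gbps(fanout: int, level: int, per_server_gbps: float = 1.0) -> float:
    """Worst-case traffic a switch at `level` must pass upward.

    Level 1 switches connect servers directly; a switch at level h has
    fanout**h servers beneath it.
    """
    return (fanout ** level) * per_server_gbps


for level in range(1, 5):
    print(level, uplink_demand_gbps(2, level))  # 2.0, 4.0, 8.0, 16.0 Gbps

servers = 100_000                      # illustrative data center size
levels = (servers - 1).bit_length()    # height of a binary tree covering them
total_gbps = servers * 1.0             # load the root must be able to switch
print(levels, total_gbps / 1000)       # 17 levels, 100.0 Tbps
```

Demand doubles with every level in a binary tree, which is why a single root switch cannot keep up.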

From tree to fat tree

29

A 4x4 switch represented as 2 sets of 2x2 switches

Enforce the “criss cross” pattern

A larger fat tree: the 8x8 switch (4x(2x2))

30

A scalable tree, using only 2x2 switches (smaller switches)

The Clos Network

31

Non-blocking property

“Any unused server can connect to any other unused server at any time, no matter what the other connections are.”

Adding another set of switches in the middle

Scale out is better than scale up

32

Scale out: having a lot of smaller switches

Scale up: having a few big switches

Scaling comparison

33

Cost

Normally, scaling up costs more than scaling out.

Scale out lets you try smaller, specialized configurations.

Maintenance

Scale out increases the number of systems you must manage.

Communication

Scale out increases the amount of communication between systems.

Scale out introduces additional latency to your system.

Scale out increases the availability of the system.
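The availability gain from scaling out can be illustrated with a short calculation. It assumes nodes fail independently and that any single working node can serve requests; both are simplifying assumptions not stated in the slides, and the function name is hypothetical.

```python
def availability_of_replicas(node_availability: float, nodes: int) -> float:
    """Probability that at least one of `nodes` independent replicas is up."""
    return 1.0 - (1.0 - node_availability) ** nodes


for n in (1, 2, 3):
    print(n, availability_of_replicas(0.99, n))
```

With nodes that are each available 99% of the time, two replicas give roughly 99.99% and three roughly 99.9999%, at the price of more systems to manage and more communication between them.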

References

Campbell, R. and Farivar, R. Cloud Computing Application.

A Survey of Mobile Cloud Computing: Architecture, Applications, and Approaches.

Brinton, Christopher; Chiang, Mung (2013-06-10). Networks Illustrated: 8 Principles Without Calculus (Kindle Locations 1119-1123). Edwiser Scholastic Press. Kindle Edition.

34
