NETE4631 Network Information Systems (NISs): Big Data and Scaling in the Cloud Suronapee, PhD...

Page 1: NETE4631 Network Information Systems (NISs): Big Data and Scaling in the Cloud Suronapee, PhD suronape@mut.ac.th 1.

NETE4631 Network Information Systems (NISs):

Big Data and Scaling in the Cloud

Suronapee, PhD

suronape@mut.ac.th

Page 2:

Huge Amount of Data

[Figure: today's statistics]

Page 3:

Statistical facts on the number of devices

Page 4:

Big Data

- A collection of data sets so large and complex that it is impractical to process on one computer with the usual databases and tools
- Big Data represents information assets characterized by “High Volume, Velocity, and Variety”
- Because of its size and complexity, Big Data is hard to capture, store, copy, delete (privacy), search, share, analyze, and visualize

Page 5:

Big Data Processing

- Big Data requires specific technology and analytical methods for its transformation into value
- Combining Big Data with the cloud makes this processing possible

[Figure: Input -> process -> Derived data]

Page 6:

MapReduce

- What is MapReduce?
  - Programming model from LISP
  - Scatter and gather principles
  - Many problems can be phrased this way
  - Large input data makes otherwise simple computations impossible on a single machine
- Advantages
  - Easy to process and generate large data sets
  - Hides the difficulty of writing parallel code
  - System takes care of scheduling, load balancing, handling machine failures, etc.

Page 7:

MapReduce Programming Model

- The computation takes a set of input key/value pairs and produces a set of output key/value pairs.
- The user expresses the computation as two functions: Map and Reduce.
- Map
  - Takes an input pair and produces a set of intermediate key/value pairs.
- Reduce
  - Accepts an intermediate key I and a set of values for that key, and merges these values together to form a possibly smaller set of values (typically one output value).

Page 8:

Word Count

Count the number of times each distinct word appears in the file

MAP(KEY = LINE, VALUE = CONTENTS):

REDUCE(KEY, VALUES):
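The bodies of MAP and REDUCE are not preserved in this transcript. The following is a minimal sketch of the usual word count formulation, run in memory on one machine. The names `map_fn`, `reduce_fn`, and `run_word_count` are illustrative assumptions, not part of the slides, and lowercasing words is a choice made here for the example.

```python
from collections import defaultdict


def map_fn(line_no, contents):
    """MAP(KEY = LINE, VALUE = CONTENTS): emit (word, 1) for every word."""
    for word in contents.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """REDUCE(KEY, VALUES): sum the counts collected for one word."""
    return word, sum(counts)


def run_word_count(lines):
    # Group intermediate values by key (the "shuffle" step).
    groups = defaultdict(list)
    for line_no, contents in enumerate(lines):
        for word, one in map_fn(line_no, contents):
            groups[word].append(one)
    # Apply reduce once per distinct key.
    return dict(reduce_fn(word, ones) for word, ones in groups.items())


counts = run_word_count(["the quick brown fox", "the lazy dog", "The end"])
```

Here `counts` maps each distinct word to its number of occurrences across all three lines.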

Page 9:

Word Count Illustrated

Page 10:

Observation

- Conceptually, the map and reduce functions supplied by the user have associated types.
- The input keys and values are drawn from a different domain than the output keys and values.
- Furthermore, the intermediate keys and values are from the same domain as the output keys and values.
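One way to make these type relationships concrete is with type variables. The sketch below is an assumption-laden illustration: `map_reduce`, `parse_reading`, `hottest`, and the temperature data are all invented for this example. The input domain is (line number, text) while the intermediate and output domain is (city, temperature).

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")  # input key domain
V1 = TypeVar("V1")  # input value domain
K2 = TypeVar("K2")  # intermediate and output key domain
V2 = TypeVar("V2")  # intermediate and output value domain

# map:    (k1, v1)       -> list of (k2, v2)
# reduce: (k2, list(v2)) -> list(v2)
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
ReduceFn = Callable[[K2, List[V2]], List[V2]]


def map_reduce(
    inputs: Iterable[Tuple[K1, V1]],
    map_fn: MapFn,
    reduce_fn: ReduceFn,
) -> Dict[K2, List[V2]]:
    groups: Dict[K2, List[V2]] = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}


def parse_reading(line_no: int, line: str) -> Iterable[Tuple[str, int]]:
    # (int, str) in, (str, int) out: the domains differ.
    city, temp = line.split(",")
    yield city, int(temp)


def hottest(city: str, temps: List[int]) -> List[int]:
    # Output values stay in the intermediate value domain (int).
    return [max(temps)]


readings = [(0, "Bangkok,34"), (1, "Chiang Mai,29"), (2, "Bangkok,36")]
result = map_reduce(readings, parse_reading, hottest)
```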

Page 11:

PageRank Algorithm

- Phase 1: Propagation
- Phase 2: Aggregation
- Input: A pool of objects, including both vertices and edges

Page 12:

PageRank: Propagation

- Map: for each object
  - If object is a vertex, emit key=URL, value=object
  - If object is an edge, emit key=source URL, value=object
- Reduce: (input is a web page and all its outgoing links)
  - Find the number of edge objects -> outgoing links
  - Read the PageRank value from the vertex object
  - Assign PR(edges)=PR(vertex)/num_outgoing

Page 13:

PageRank: Aggregation

- Map: for each object
  - If object is a vertex, emit key=URL, value=object
  - If object is an edge, emit key=destination URL, value=object
- Reduce: (input is a web page and all its incoming links)
  - Add the PR values of all incoming links
  - Assign PR(vertex)=ΣPR(incoming links)
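The two phases can be sketched as a pair of in-memory passes over the object pool. This is a simplified illustration under stated assumptions: the `Vertex` and `Edge` classes, the helper `group_by`, and the four-edge example graph are invented here, every URL is assumed to have a vertex object, and there is no damping factor or dangling-page handling because the slides do not mention either.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Vertex:
    url: str
    pr: float


@dataclass
class Edge:
    src: str
    dst: str
    pr: float = 0.0


def group_by(objects, key_fn):
    groups = defaultdict(list)
    for obj in objects:
        groups[key_fn(obj)].append(obj)
    return groups


def propagate(pool):
    # Map: key vertices by URL and edges by source URL.
    groups = group_by(pool, lambda o: o.url if isinstance(o, Vertex) else o.src)
    out = []
    for objs in groups.values():
        vertex = next(o for o in objs if isinstance(o, Vertex))
        edges = [o for o in objs if isinstance(o, Edge)]
        out.append(vertex)
        # Reduce: PR(edge) = PR(vertex) / num_outgoing
        out.extend(Edge(e.src, e.dst, vertex.pr / len(edges)) for e in edges)
    return out


def aggregate(pool):
    # Map: key vertices by URL and edges by destination URL.
    groups = group_by(pool, lambda o: o.url if isinstance(o, Vertex) else o.dst)
    out = []
    for url, objs in groups.items():
        incoming = [o for o in objs if isinstance(o, Edge)]
        # Reduce: PR(vertex) = sum of PR over incoming links
        out.append(Vertex(url, sum(e.pr for e in incoming)))
        out.extend(incoming)  # keep edges in the pool for the next iteration
    return out


pool = [
    Vertex("A", 1 / 3), Vertex("B", 1 / 3), Vertex("C", 1 / 3),
    Edge("A", "B"), Edge("A", "C"), Edge("B", "C"), Edge("C", "A"),
]
pool = aggregate(propagate(pool))  # one full PageRank iteration
ranks = {o.url: o.pr for o in pool if isinstance(o, Vertex)}
```

After one iteration on this graph, page C holds half of the total rank because it receives all of B's rank and half of A's.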

Page 14:

More Examples

- Distributed Grep:
  - Map: emits a line if it matches a supplied pattern
  - Reduce: copies the supplied intermediate data to the output
- Count of URL Access Frequency:
  - Map: processes logs of web page requests, outputs (URL, 1)
  - Reduce: adds together all values for the same URL and emits (URL, total count) pairs
- Reverse Web-Link Graph:
  - Map: for each link to a target URL found in a page named source, emits a (target, source) pair
  - Reduce: concatenates the list of all source URLs associated with a given target URL and emits a (target, list(source)) pair
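Two of these examples are small enough to sketch on a single machine. The log format `"<client> <method> <url>"` and the function names below are assumptions made for illustration only.

```python
import re
from collections import defaultdict


def distributed_grep(lines, pattern):
    # Map: emit a line only if it matches the supplied pattern.
    matched = [line for line in lines if re.search(pattern, line)]
    # Reduce: identity, copy the intermediate data to the output.
    return list(matched)


def url_access_frequency(log_lines):
    groups = defaultdict(list)
    for line in log_lines:
        url = line.split()[2]      # Map: emit (URL, 1)
        groups[url].append(1)
    # Reduce: add together all values for the same URL.
    return {url: sum(ones) for url, ones in groups.items()}


logs = [
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /about.html",
    "10.0.0.1 GET /index.html",
]
hits = url_access_frequency(logs)
about_lines = distributed_grep(logs, r"about")
```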

Page 15:

Implementation

[Figure: Overall flow of a MapReduce operation]

Page 16:

Execution Overview

When the user program calls MapReduce, the following sequence of actions occurs:

1. The MapReduce library first splits the input files into M pieces (typically 16-64 MB each) and starts up many copies of the program on a cluster of machines.
2. The master, one of the program copies, assigns work to the workers.
3. A worker that is assigned a map task does the following:
   - Reads the contents of the corresponding input split
   - Parses key/value pairs from the input data and passes each to the Map function
   - Buffers the intermediate key/value pairs produced in memory
4. Buffered pairs are written to local disk, partitioned into R regions by the partitioning function (their locations are passed back to the master).
5. The master forwards these locations to the reduce workers.
6. A reduce worker reads the intermediate data and sorts it by key.
7. The reduce worker applies the Reduce function and appends the result to the output.
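The role of M and R in this sequence can be sketched in memory. The sketch assumes a partitioning function of the form hash(key) mod R, implemented here with `zlib.crc32` so that results are stable across runs. The slides do not say which partitioning function is used, and all function names are illustrative.

```python
import zlib
from itertools import groupby


def split_input(records, m):
    """Split the input into at most m pieces of roughly equal size."""
    size = max(1, -(-len(records) // m))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def partition(key, r):
    """Assumed partitioning function: hash(key) mod R."""
    return zlib.crc32(key.encode("utf-8")) % r


def run_map_task(split, map_fn, r):
    # One list per region stands in for the R files on the map worker's local disk.
    regions = [[] for _ in range(r)]
    for key, value in split:
        for k2, v2 in map_fn(key, value):
            regions[partition(k2, r)].append((k2, v2))
    return regions


def run_reduce_task(region, map_outputs, reduce_fn):
    # Read this region from every map worker, sort by key, reduce per key.
    pairs = [pair for regions in map_outputs for pair in regions[region]]
    pairs.sort(key=lambda kv: kv[0])
    return {
        key: reduce_fn(key, [v for _, v in group])
        for key, group in groupby(pairs, key=lambda kv: kv[0])
    }


M, R = 2, 2
records = list(enumerate(["a b a", "b c", "a c c"]))
splits = split_input(records, M)
map_outputs = [
    run_map_task(s, lambda k, text: [(w, 1) for w in text.split()], R)
    for s in splits
]
output = {}
for region in range(R):
    output.update(run_reduce_task(region, map_outputs, lambda k, vs: sum(vs)))
```

Because every occurrence of a key hashes to the same region, each key is reduced by exactly one reduce task.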

Page 17:

Parallelism

- map() functions run in parallel, creating different intermediate values from different input data sets
- reduce() functions also run in parallel, each working on a different output key
- All values are processed independently
- Bottleneck: the reduce phase can’t start until the map phase is completely finished
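The structure of this parallelism, including the barrier between the two phases, can be sketched with a thread pool. This only illustrates the shape of the computation: in standard CPython, threads do not speed up pure-Python work, and a real deployment runs tasks on separate machines. The function names and the sample splits are assumptions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_split(lines):
    # Each map task handles its own input split independently.
    return [(word, 1) for line in lines for word in line.split()]


def reduce_key(item):
    # Each reduce task handles one output key independently.
    key, values = item
    return key, sum(values)


splits = [["a b", "a"], ["b c"], ["c c a"]]

with ThreadPoolExecutor(max_workers=3) as pool:
    # list() waits for every map task: this is the barrier before reduce.
    map_outputs = list(pool.map(map_split, splits))

    groups = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            groups[key].append(value)

    totals = dict(pool.map(reduce_key, groups.items()))
```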

Page 18:

Combiners

- Often a map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
  - E.g., popular words in Word Count
- Can save network time by pre-aggregating at the mapper
  - Combine (k1, list(v1)) -> v2
  - Usually the same as the reduce function
- Works only if the reduce function is commutative and associative
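A small sketch of the effect of a combiner, followed by a counterexample showing why the reduce function must be commutative and associative. The function names and sample data are assumptions chosen for illustration.

```python
from collections import Counter


def map_words(line):
    return [(word, 1) for word in line.split()]


def combine(pairs):
    # Pre-aggregate on the mapper: same logic as the word count reducer.
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())


raw = map_words("to be or not to be")   # pairs sent without a combiner
combined = combine(raw)                 # pairs sent with a combiner

# Counterexample: the mean is not associative, so averaging partial
# averages gives a different answer than averaging all values at once.
values = [1, 2, 3, 4, 5]
true_mean = sum(values) / len(values)
partial_means = [sum(part) / len(part) for part in ([1, 2], [3, 4, 5])]
mean_of_means = sum(partial_means) / len(partial_means)
```

With the combiner, the mapper ships 4 pairs instead of 6 for this line, and the final counts are unchanged. For the mean, a correct combiner would have to ship (sum, count) pairs rather than partial means.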

Page 19:

Hadoop

Page 20:

Hadoop Execution

1. Client submits the “wordcount” job, indicating code and input files
2. JobTracker breaks the input file into k chunks (64 MB each) and assigns work to TaskTrackers
3. After map(), TaskTrackers exchange map output to group it by key
4. JobTracker breaks the reduce() keyspace into m chunks and assigns work
5. reduce() output may go to HDFS
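The value of k in step 2 follows directly from the file size. The sketch below assumes that "64 MB" means 64 MiB and that the last chunk may be shorter than the rest. The function name and the 200 MiB file are hypothetical.

```python
CHUNK_BYTES = 64 * 1024 * 1024  # assumed chunk size: 64 MiB


def chunk_ranges(file_bytes, chunk_bytes=CHUNK_BYTES):
    """Return (start, end) byte offsets for each chunk of the input file."""
    return [
        (start, min(start + chunk_bytes, file_bytes))
        for start in range(0, file_bytes, chunk_bytes)
    ]


MIB = 1024 * 1024
ranges = chunk_ranges(200 * MIB)
k = len(ranges)
last_chunk_mib = (ranges[-1][1] - ranges[-1][0]) // MIB
```

For a 200 MiB file this gives k = 4 map tasks, with the last chunk holding the remaining 8 MiB.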

Page 21:

Map-Machine

- Reads contents of assigned portion of input file
- Parses and prepares data for input to map function
- Passes data into map function and saves result in memory
- Periodically writes completed work to local disk
- Notifies Master of this partially completed work (intermediate data)

Page 22:

Reduce-Machine

- Receives notification from Master of partially completed work
- Retrieves intermediate data from Map-Machine via remote-read
- Sorts intermediate data by key
- Iterates over intermediate data
  - For each unique key, sends corresponding set through reduce function
- Appends result of reduce function to final output file (HDFS)

Page 23:

Data Flow

- Input and final output are stored on a distributed file system
  - Scheduler tries to schedule map tasks “close” to the physical storage location of the input data
- Intermediate results are stored on the local FS of map and reduce workers
- Output is often input to another MapReduce task

Page 24:

Capacity planning

- Cloud providers have to scale on demand
- Capacity Planning
  - Match demand to available resources

Page 25:

Scaling in cloud

- Scale vertically (scale up)
  - Add resources to a node (or a server) to make it more powerful
- Scale horizontally (scale out)
  - Add more nodes (or commodity servers)

Page 26:

Building blocks in cloud

- Data center
  - Server: what we want to connect
  - Switch control: who is connected right now (enabling data to flow)
- Switch
  - A layer 2 device that deals with local networking
  - Switching a connection is based on its own internal hardware

Page 27:

Scaling the servers

- Add more ports to the switch
  - Must support hundreds of thousands of gigabits per second
    - Hundreds of thousands of servers in a data center
    - Each of which requires up to 1 Gbps
  - Infeasible
- Add more switches
  - Imagine a tree-like structure
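The arithmetic behind "infeasible" is easy to check. The figures below are assumptions for illustration: 100,000 servers stands in for "hundreds of thousands", and a fan-out of 48 ports per switch is a hypothetical value, not one given in the slides.

```python
def aggregate_gbps(servers, gbps_per_server=1):
    """Total bandwidth a single central switch would have to carry."""
    return servers * gbps_per_server


def tree_levels(servers, fan_out):
    """Smallest number of switch levels whose fan-out reaches every server."""
    levels, reach = 0, 1
    while reach < servers:
        reach *= fan_out
        levels += 1
    return levels


total_gbps = aggregate_gbps(100_000)   # one switch with 100,000 ports
total_tbps = total_gbps / 1000
levels = tree_levels(100_000, 48)      # tree of 48-port switches instead
```

Under these assumptions a single switch would need to carry 100 Tbps, whereas a tree of 48-port switches reaches every server in three levels.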

Page 28:

What happens as we keep going up the tree?

- Technologically impossible to build the enormous root switch
  - Increase ports (expensive)
  - What happens if the root fails?
- Switches can’t handle that much load
  - Max per switch = 2 Gbps
  - The other 2 connections are useless

Page 29:

From tree to fat tree

- 4x4 switch represented as 2 sets of 2x2 switches
- Enforce the “criss cross” pattern

Page 30:

A large fat tree: the 8x8 switch (4x(2x2))

- A scalable tree, using only 2x2 switches (smaller switches)
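The figure for this construction is not preserved in the transcript. The sketch below assumes a butterfly-style arrangement, in which an NxN network is built from log2(N) stages of N/2 two-by-two switches. That matches the counts on the slides (2 sets for 4x4, 4 switches per set for 8x8), but the exact wiring is an assumption. It compares the number of crosspoints with a single NxN crossbar.

```python
def multistage_stats(n):
    """Count 2x2 switches in an assumed butterfly-style n x n network."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    stages = n.bit_length() - 1        # log2(n) sets of switches
    per_stage = n // 2                 # 2x2 switches in each set
    switches = stages * per_stage
    crosspoints = switches * 4         # each 2x2 switch has 4 crosspoints
    return stages, per_stage, switches, crosspoints


four = multistage_stats(4)
eight = multistage_stats(8)
large = multistage_stats(1024)
crossbar_1024 = 1024 * 1024            # crosspoints in one big switch
```

For 1024 ports, the multistage design needs 20,480 crosspoints against 1,048,576 for a single crossbar, which is the cost argument for using many small switches.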

Page 31:

The Clos Network

- Non-blocking property
  - “Any unused server can connect to any other unused server at any time, no matter what the other connections are.”
- Adding another set of switches in the middle
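How many middle switches are needed for this property can be checked with the classical conditions for a three-stage Clos network: with n inputs per ingress switch and m middle switches, the network is strict-sense non-blocking when m >= 2n - 1 and rearrangeably non-blocking when m >= n. These conditions come from the standard Clos analysis rather than from the slides, and the function names and the 1024-port example are illustrative assumptions.

```python
def clos_blocking_class(n, m):
    """Classify a three-stage Clos network by its middle-stage size m."""
    if m >= 2 * n - 1:
        return "strict-sense non-blocking"
    if m >= n:
        return "rearrangeably non-blocking"
    return "blocking"


def clos_crosspoints(n, m, r):
    # r ingress switches (n x m), m middle switches (r x r),
    # r egress switches (m x n).
    return 2 * r * n * m + m * r * r


n, r = 32, 32                      # 32 * 32 = 1024 ports in total
m = 2 * n - 1                      # 63 middle switches
kind = clos_blocking_class(n, m)
clos_cost = clos_crosspoints(n, m, r)
crossbar_cost = (n * r) ** 2
```

The quoted property on the slide corresponds to the strict-sense case. In this example it is met with 193,536 crosspoints, compared with 1,048,576 for a single 1024-port crossbar.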

Page 32:

Scale out is better than scale up

- Scale out
  - Having a lot of smaller switches
- Scale up
  - Having a few big switches

Page 33:

Scaling comparison

- Cost
  - Normally, scaling up costs more than scaling out.
  - Scale out enables you to try smaller, specialized configurations.
- Maintenance
  - Scale out increases the number of systems you must manage.
- Communication
  - Scale out increases the amount of communication between systems.
  - Scale out introduces additional latency to your system.
  - Scale out increases the availability of your system.

Page 34:

References

- Campbell, R. and Farivar, R. Cloud Computing Application.
- A Survey of Mobile Cloud Computing: Architecture, Applications, and Approaches.
- Brinton, Christopher; Chiang, Mung (2013-06-10). Networks Illustrated: 8 Principles Without Calculus (Kindle Locations 1119-1123). Edwiser Scholastic Press. Kindle Edition.