NETE4631 Network Information Systems (NISs):
Big Data and Scaling in the Cloud
Suronapee, PhD
Huge Amount of Data
Today’s statistics
Statistical facts on the number of devices
Big Data
A collection of data sets so large and complex that it is impossible to process on one computer with the usual databases and tools.
Big Data represents information assets characterized by “High Volume, Velocity, and Variety”.
Because of its size and complexity, Big Data is hard to capture, store, copy, delete (privacy), search, share, analyze, and visualize.
Big Data Processing
Combining Big Data with the cloud makes processing possible, but it requires specific technology and analytical methods to transform the data into value.
MapReduce
Figure labels: Input, process, Derived data
What is MapReduce?
  Programming model from LISP
  Scatter and gather principles
  Many problems can be phrased this way
  Large input data makes simple computation impossible
Advantages
  Easy to process and generate large data sets
  Hides the difficulty of writing parallel code
  System takes care of scheduling, load balancing, handling machine failures, etc.
MapReduce Programming Model
The computation takes a set of input key/value pairs and produces a set of output key/value pairs.
The user expresses the computation as two functions: Map and Reduce.
Map
  Takes an input pair and produces a set of intermediate key/value pairs.
Reduce
  Accepts an intermediate key I and a set of values for that key, and merges these values together to form a possibly smaller set of values (typically 1 output).
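The model above can be sketched in a few lines of single-process Python. This is only an illustration of the map, group-by-key, reduce sequence, not how a real MapReduce library is implemented; `run_mapreduce` and the temperature example are hypothetical names and data introduced here.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to every input pair, group by key, then apply reduce_fn."""
    # Map phase: each input (key, value) pair yields zero or more
    # intermediate (key, value) pairs, grouped here by intermediate key.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)

    # Reduce phase: one call per distinct intermediate key.
    return {ikey: reduce_fn(ikey, values) for ikey, values in intermediate.items()}


def temperature_map(file_name, readings):
    # Hypothetical input: a file name and a list of (city, temperature) readings.
    for city, temperature in readings:
        yield city, temperature


def temperature_reduce(city, temperatures):
    # Merge all values for one key into a single output value.
    return max(temperatures)


hottest = run_mapreduce(
    [("file1", [("bangkok", 35), ("chiang mai", 30)]), ("file2", [("bangkok", 38)])],
    temperature_map,
    temperature_reduce,
)
```

Here `hottest` maps each city to its highest reading. The user writes only the two small functions; the grouping step in the middle is what the system provides.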
Word Count
Count the number of times each distinct word appears in the file
MAP(KEY = LINE, VALUE = CONTENTS):
REDUCE(KEY, VALUES):
Word Count Illustrated
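The bodies of MAP and REDUCE did not survive in this transcript. The following Python sketch shows the conventional word-count formulation, assuming MAP emits (word, 1) for each word and REDUCE sums the values; `wc_map`, `wc_reduce`, and `word_count` are names chosen for this example.

```python
from collections import defaultdict


def wc_map(line_key, contents):
    # MAP(KEY = LINE, VALUE = CONTENTS): emit (word, 1) for every word in the line.
    for word in contents.split():
        yield word, 1


def wc_reduce(word, counts):
    # REDUCE(KEY, VALUES): sum the partial counts collected for one word.
    return sum(counts)


def word_count(lines):
    # Group the intermediate pairs by word, then reduce each group.
    grouped = defaultdict(list)
    for line_no, contents in enumerate(lines):
        for word, one in wc_map(line_no, contents):
            grouped[word].append(one)
    return {word: wc_reduce(word, counts) for word, counts in grouped.items()}


counts = word_count(["the quick fox", "the lazy dog the end"])
```

With these two lines as input, "the" is emitted three times by the map step and reduced to a single count.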
Observation
Conceptually, the map and reduce functions supplied by the user have associated types.
The input keys and values are drawn from a different domain than the output keys and values.
Furthermore, the intermediate keys and values are from the same domain as the output keys and values.
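One way to write these associated types down is with Python type hints. The aliases `MapFn` and `ReduceFn` and the type variables are notation assumed for this sketch; the slides do not give names.

```python
from typing import Callable, Iterable, Tuple, TypeVar

K1 = TypeVar("K1")  # input key domain
V1 = TypeVar("V1")  # input value domain
K2 = TypeVar("K2")  # intermediate and output key domain
V2 = TypeVar("V2")  # intermediate and output value domain

# map:    (k1, v1)        -> list of (k2, v2)
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# reduce: (k2, list of v2) -> list of v2
ReduceFn = Callable[[K2, Iterable[V2]], Iterable[V2]]


def count_map(offset: int, line: str) -> Iterable[Tuple[str, int]]:
    # Input domain: (int, str). Intermediate domain: (str, int).
    return [(word, 1) for word in line.split()]


def count_reduce(word: str, counts: Iterable[int]) -> Iterable[int]:
    # Output values stay in the intermediate value domain (int).
    return [sum(counts)]
```

In this instance the input pairs are (line offset, line text) while the intermediate and output pairs are (word, count), which is the domain split the observation describes.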
PageRank Algorithm
Phase 1: Propagation
Phase 2: Aggregation
Input: a pool of objects, including both vertices and edges
PageRank: Propagation
Map: for each object
  If object is a vertex, emit key=URL, value=object
  If object is an edge, emit key=source URL, value=object
Reduce: (input is a web page and all its outgoing links)
  Find the number of edge objects -> outgoing links
  Read the PageRank value from the vertex object
  Assign PR(edges)=PR(vertex)/num_outgoing
PageRank: Aggregation
Map: for each object
  If object is a vertex, emit key=URL, value=object
  If object is an edge, emit key=destination URL, value=object
Reduce: (input is a web page and all its incoming links)
  Add the PR values of all incoming links
  Assign PR(vertex)=ΣPR(incoming links)
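The two phases can be sketched in single-process Python as follows. The `Vertex` and `Edge` classes and the helper names are assumptions made for this example, and the update follows the slides literally, so there is no damping factor.

```python
from collections import defaultdict
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Vertex:
    url: str
    pr: float


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    pr: float = 0.0


def group_by(objects, key_fn):
    # Stand-in for the map step plus the shuffle: group objects by emitted key.
    groups = defaultdict(list)
    for obj in objects:
        groups[key_fn(obj)].append(obj)
    return groups


def propagate(objects):
    # Map: vertex -> key=URL, edge -> key=source URL.
    groups = group_by(objects, lambda o: o.url if isinstance(o, Vertex) else o.src)
    out = []
    for objs in groups.values():
        vertices = [o for o in objs if isinstance(o, Vertex)]
        edges = [o for o in objs if isinstance(o, Edge)]
        if vertices and edges:
            # Reduce: PR(edge) = PR(vertex) / num_outgoing.
            share = vertices[0].pr / len(edges)
            edges = [replace(e, pr=share) for e in edges]
        out.extend(vertices + edges)
    return out


def aggregate(objects):
    # Map: vertex -> key=URL, edge -> key=destination URL.
    groups = group_by(objects, lambda o: o.url if isinstance(o, Vertex) else o.dst)
    out = []
    for objs in groups.values():
        # Reduce: PR(vertex) = sum of PR over incoming edges.
        total = sum(o.pr for o in objs if isinstance(o, Edge))
        out.extend(replace(o, pr=total) if isinstance(o, Vertex) else o for o in objs)
    return out


pool = [Vertex("A", 1.0), Vertex("B", 1.0), Vertex("C", 1.0),
        Edge("A", "B"), Edge("A", "C"), Edge("B", "C"), Edge("C", "A")]
ranks = {o.url: o.pr for o in aggregate(propagate(pool)) if isinstance(o, Vertex)}
```

For this three-page graph, one iteration gives A the full rank of C, B half the rank of A, and C the other half of A plus all of B. Repeating `aggregate(propagate(...))` iterates the computation.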
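More Examples
Distributed Grep:
  Map: emits a line if it matches a supplied pattern
  Reduce: copies the supplied intermediate data to the output
Count of URL Access Frequency:
  Map: processes logs of web page requests, outputs (URL, 1)
  Reduce: adds together all values for the same URL and emits (URL, total count) pairs
Reverse Web-Link Graph:
  Map: emits a (target, source) pair for each link to a target URL found in a page named source
  Reduce: concatenates the list of all source URLs associated with a given target URL and emits (target, list(source))
Distributed Sort:
  Map: extracts a key from each record, and emits a (key, record) pair
  Reduce: emits all pairs unchanged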
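Two of these examples, Distributed Grep and Count of URL Access Frequency, are sketched below on top of a tiny single-process helper. The helper `mapreduce`, the log format ("client URL" per line), and the function names are assumptions made for illustration.

```python
import re
from collections import defaultdict


def mapreduce(records, map_fn, reduce_fn):
    # Group intermediate pairs by key, then reduce each group in key order.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [item for key in sorted(groups) for item in reduce_fn(key, groups[key])]


def grep(lines, pattern):
    regex = re.compile(pattern)

    def grep_map(numbered_line):
        # Emit the line (keyed by its position) only if it matches the pattern.
        index, line = numbered_line
        return [(index, line)] if regex.search(line) else []

    def grep_reduce(index, matching_lines):
        # Identity reduce: copy the intermediate data to the output.
        return matching_lines

    return mapreduce(enumerate(lines), grep_map, grep_reduce)


def url_access_counts(log_lines):
    def url_map(log_line):
        # Assumed log format: "<client> <url>".
        client, url = log_line.split()
        return [(url, 1)]

    def url_reduce(url, ones):
        # Add together all values for the same URL.
        return [(url, sum(ones))]

    return mapreduce(log_lines, url_map, url_reduce)


matches = grep(["error: disk", "ok", "error: net"], r"^error")
hits = url_access_counts(["10.0.0.1 /x", "10.0.0.2 /y", "10.0.0.3 /x"])
```

In both cases the map function does the problem-specific work and the reduce function is either an identity or a sum.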
Implementation
Overall flow of a MapReduce operation
Execution Overview
When the user calls MapReduce, the sequence of actions is:
  The MapReduce library first splits the input files into M pieces (16-64 MB each) and starts up many copies of the program on a cluster of machines.
  The master, one of the program copies, assigns work to the workers.
  A worker that is assigned a map task does the following:
    Reads the contents of the corresponding input split.
    Parses key/value pairs from the input data and passes each pair to the Map function.
    Buffers the intermediate key/value pairs produced in memory.
  Buffered pairs are written to local disk, partitioned into R regions by the partitioning function (their locations are passed back to the master).
  The master forwards these locations to the reduce workers.
  The reduce worker reads the intermediate data and sorts it by key.
  The reduce worker runs the reduce function and appends the result to the output.
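The splitting into M pieces and the partitioning into R regions can be sketched as below. The partitioning function used here (a CRC32 hash of the key modulo R) is an assumption for illustration; the slides only say that some partitioning function is applied. All function names are hypothetical.

```python
import zlib


def split_input(records, m):
    """Split records into m contiguous pieces of nearly equal size."""
    size, extra = divmod(len(records), m)
    pieces, start = [], 0
    for i in range(m):
        end = start + size + (1 if i < extra else 0)
        pieces.append(records[start:end])
        start = end
    return pieces


def partition(key, r):
    # Deterministic hash, so the same key always lands in the same region.
    return zlib.crc32(key.encode("utf-8")) % r


def map_task(piece, map_fn, r):
    """Run map_fn over one input piece and bucket its output into r regions."""
    regions = [[] for _ in range(r)]
    for record in piece:
        for key, value in map_fn(record):
            regions[partition(key, r)].append((key, value))
    return regions


pieces = split_input(list(range(7)), 3)
regions = map_task(["a b a"], lambda line: [(w, 1) for w in line.split()], 2)
```

Because every map task uses the same partitioning function, reduce worker number i can collect region i from every map worker and be sure it sees all values for its keys.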
Parallelism
map() functions run in parallel, creating different intermediate values from different input data sets
reduce() functions also run in parallel, each working on a different output key
All values are processed independently
Bottleneck: the reduce phase can’t start until the map phase is completely finished
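The structure described above, parallel map tasks, a barrier, then parallel reduce tasks, can be sketched with Python's standard thread pool. This only illustrates the control flow on one machine (CPython threads do not give real CPU parallelism for this kind of work); the function names are assumptions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_split(split):
    # One map task: turn the lines of one input split into (word, 1) pairs.
    return [(word, 1) for line in split for word in line.split()]


def reduce_key(item):
    # One reduce task: handle exactly one output key.
    key, values = item
    return key, sum(values)


def parallel_word_count(splits, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map tasks run concurrently on different splits. list() waits for
        # all of them, which is the barrier before the reduce phase.
        map_outputs = list(pool.map(map_split, splits))

        groups = defaultdict(list)
        for output in map_outputs:
            for key, value in output:
                groups[key].append(value)

        # Reduce tasks run concurrently, each on a different key.
        reduced = list(pool.map(reduce_key, groups.items()))
    return dict(reduced)


totals = parallel_word_count([["a b"], ["b c", "a"]])
```

The single `list(pool.map(...))` call over the splits is where the bottleneck shows up: the slowest map task decides when reducing can begin.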
Combiners
Often a map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
  E.g., popular words in Word Count
Can save network time by pre-aggregating at the mapper
  Combine (k1, list(v1)) -> v2
  Usually the same as the reduce function
Works only if the reduce function is commutative and associative
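A small sketch of a combiner for Word Count follows, assuming the reduce function is a sum. The names `map_words` and `combine` are hypothetical.

```python
from collections import Counter


def map_words(line):
    # Raw map output: one (word, 1) pair per word occurrence.
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Pre-aggregate (word, count) pairs on the mapper; same logic as the reducer."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


mapper_a = map_words("the cat the dog the")
mapper_b = map_words("the end")

# Without a combiner, 7 pairs cross the network; with one, only 5 do.
raw_pairs = len(mapper_a) + len(mapper_b)
combined_pairs = len(combine(mapper_a)) + len(combine(mapper_b))

# Because addition is commutative and associative, reducing the combined
# output gives the same result as reducing the raw output.
with_combiner = combine(combine(mapper_a) + combine(mapper_b))
without_combiner = combine(mapper_a + mapper_b)
```

A function such as an average would not qualify as is, because the mean of partial means is not in general the mean of all values.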
Hadoop
Hadoop Execution
1. Client submits “wordcount” job, indicating code and input files
2. JobTracker breaks input file into k chunks (64 MB each). Assigns work to TaskTrackers
3. After map(), TaskTrackers exchange map output so that it is grouped by key
4. JobTracker breaks reduce() keyspace into m chunks. Assigns work
5. reduce() output may go to HDFS
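Step 2 involves simple arithmetic that is easy to check. The sketch below computes k for a given file size and hands the chunks out round-robin; the round-robin policy and the tracker names are assumptions made for this example and do not describe how the JobTracker actually schedules work.

```python
CHUNK_BYTES = 64 * 1024 * 1024  # 64 MB per chunk, as on the slide


def num_chunks(file_bytes, chunk_bytes=CHUNK_BYTES):
    # Ceiling division: a partly filled last chunk still counts as a chunk.
    return -(-file_bytes // chunk_bytes)


def assign_round_robin(k, trackers):
    # Hypothetical policy: hand chunk i to tracker i mod number of trackers.
    assignment = {tracker: [] for tracker in trackers}
    for chunk in range(k):
        assignment[trackers[chunk % len(trackers)]].append(chunk)
    return assignment


k = num_chunks(1024 ** 3)  # a 1 GB input file
plan = assign_round_robin(5, ["tt1", "tt2"])
```

A 1 GB file yields 16 chunks, and one extra byte pushes it to 17.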
Map-Machine
Reads contents of assigned portion of input file
Parses and prepares data for input to map function
Passes data into map function and saves result in memory
Periodically writes completed work to local disk
Notifies Master of this partially completed work (intermediate data)
Reduce-Machine
Receives notification from Master of partially completed work
Retrieves intermediate data from Map-Machine via remote-read
Sorts intermediate data by key
Iterates over intermediate data
  For each unique key, sends corresponding set through reduce function
Appends result of reduce function to final output file (HDFS)
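The sort, iterate, and reduce steps of the Reduce-Machine map naturally onto `sorted` and `itertools.groupby` in Python. This is a single-process sketch; `reduce_task` is a name chosen for the example, and the output list stands in for the final output file.

```python
from itertools import groupby
from operator import itemgetter


def reduce_task(intermediate, reduce_fn):
    """Sort (key, value) pairs by key, then reduce each run of equal keys."""
    output = []
    # Sorting brings all pairs with the same key next to each other.
    ordered = sorted(intermediate, key=itemgetter(0))
    # groupby then yields one group per unique key.
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _, value in group]
        # Append the result for this key to the output.
        output.append((key, reduce_fn(key, values)))
    return output


result = reduce_task([("b", 1), ("a", 1), ("b", 1)], lambda key, values: sum(values))
```

The sort is what makes a single pass sufficient: `groupby` only merges adjacent equal keys, so it relies on the data being ordered first.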
Data Flow
Input and final output are stored on a distributed file system
  Scheduler tries to schedule map tasks “close” to physical storage location of input data
Intermediate results are stored on local FS of map and reduce workers
Output is often input to another MapReduce task
Capacity planning
Cloud providers have to scale on demand
Capacity Planning
  Match demand to available resources
Scaling in cloud
Scale vertically (scale up)
  Add resources to a node (or a server) to make it more powerful
Scale horizontally (scale out)
  Add more nodes (or commodity servers)
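Matching demand to resources under the two strategies comes down to two small calculations. The sketch below uses requests per second as the unit of demand; the unit, the function names, and the numbers are all assumptions made for illustration.

```python
def nodes_needed(demand_rps, per_node_rps):
    """Scale out: how many identical nodes cover the demand (ceiling division)."""
    return -(-demand_rps // per_node_rps)


def per_node_capacity_needed(demand_rps, nodes):
    """Scale up: the capacity each of a fixed number of nodes must reach."""
    return demand_rps / nodes


# Hypothetical demand of 10,000 requests/s.
scale_out_nodes = nodes_needed(10_000, 1_500)             # add commodity servers
scale_up_capacity = per_node_capacity_needed(10_000, 4)   # make 4 servers bigger
```

Scaling out answers "how many more nodes?", while scaling up answers "how much more powerful must each node be?".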
Building blocks in cloud
Data center
  Server: what we want to connect
  Switch control: who is connected right now (enabling data flow)
Switch
  A layer 2 device that deals with local networking
  Switching a connection is based on its own internal hardware
Scaling the servers
Add more ports to the switch
  Must support hundreds of thousands of gigabits per second
    Hundreds of thousands of servers in a data center
    Each of which requires up to 1 Gbps
  Infeasible
Add more switches
  Imagine a tree-like structure
What happens as we keep going up the tree?
Technology
  Impossible to build the enormous root switch
  Increase ports (expensive)
  What happens if the root fails?
Switches can’t handle that much load
  Max per switch = 2 Gbps
  Other 2 connections are useless
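The scale of the problem can be checked with a little arithmetic. The sketch below assumes 100,000 servers at 1 Gbps each and switches with 48 downlinks; the server count is one reading of "hundreds of thousands" and the port count is a hypothetical value, not a figure from the slides.

```python
def aggregate_demand_gbps(servers, per_server_gbps=1):
    # Total bandwidth if every server sends at its full rate.
    return servers * per_server_gbps


def tree_levels(servers, downlinks_per_switch):
    """Levels of switches needed for a plain tree to reach every server."""
    levels, reach = 0, 1
    while reach < servers:
        reach *= downlinks_per_switch
        levels += 1
    return levels


demand = aggregate_demand_gbps(100_000)  # in Gbps
levels = tree_levels(100_000, 48)
```

A single switch would need to move 100,000 Gbps. A three-level tree of 48-port switches can reach every server, but in the worst case, when every flow crosses the root, the root switch still has to carry that same aggregate load, which is why the plain tree does not solve the problem.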
From tree to fat tree
4x4 switch represented as 2 sets of 2x2 switches
Enforce the “criss cross” pattern
A large fat tree: the 8x8 switch (4x(2x2))
A scalable tree, using only 2x2 switches (smaller switches)
The Clos Network
Non-blocking property
  “Any unused server can connect to any other unused server at any time, no matter what the other connections are.”
Adding another set of switches in the middle
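The slides do not state how many middle switches are needed. The sketch below encodes the classical conditions for a symmetric three-stage Clos network, which go beyond the slide: with n inputs per ingress switch and m middle switches, the network is strictly non-blocking when m >= 2n - 1 and rearrangeably non-blocking when m >= n. The symbols n, m, r and the function names are assumptions introduced here.

```python
def clos_strictly_nonblocking(n, m):
    # A new connection can always be routed without touching existing ones.
    return m >= 2 * n - 1


def clos_rearrangeable(n, m):
    # A new connection can always be routed if existing ones may be re-routed.
    return m >= n


def clos_switch_count(r, m):
    """Total switches: r ingress + m middle + r egress."""
    return 2 * r + m


# Hypothetical example: ingress switches with n = 2 server-facing inputs each.
strict_with_three_middle = clos_strictly_nonblocking(2, 3)
strict_with_two_middle = clos_strictly_nonblocking(2, 2)
```

With n = 2, three middle switches are enough for the strict property, while two middle switches only give the weaker, rearrangeable form.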
Scale out is better than scale up
Scale out
  Having a lot of smaller switches
Scale up
  Having a few big switches
Scaling comparison
Cost
  Normally, scaling up costs more than scaling out.
  Scale out enables you to try smaller, specialized configurations.
Maintenance
  Scale out increases the number of systems you must manage.
Communication
  Scale out increases the amount of communication between systems.
  Scale out introduces additional latency to your system.
  Scale out increases the availability of the system.
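The availability point can be made concrete with a standard redundancy formula. The sketch assumes that nodes fail independently and that any single working node can serve requests; both are simplifying assumptions, and the 0.9 figure is hypothetical.

```python
def parallel_availability(node_availability, nodes):
    """Probability that at least one of several independent nodes is up."""
    return 1 - (1 - node_availability) ** nodes


one_node = parallel_availability(0.9, 1)
two_nodes = parallel_availability(0.9, 2)
```

Under these assumptions, going from one node to two raises availability from about 90% to about 99%, at the price of one more system to manage and extra communication between them.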
References
Campbell, R. and Farivar, R. Cloud Computing Application.
A Survey of Mobile Cloud Computing: Architecture, Applications, and Approaches.
Brinton, Christopher; Chiang, Mung (2013-06-10). Networks Illustrated: 8 Principles Without Calculus (Kindle Locations 1119-1123). Edwiser Scholastic Press. Kindle Edition.