Internet Measurement and Data Analysis (14)
Kenjiro Cho
2012-01-11
review of previous class
Class 13 Data mining
- pattern extraction
- classification
- clustering
- exercise: clustering
today’s topics
Class 14 Scalable measurement and analysis
- distributed parallel processing
- cloud technology
measurement, data analysis and scalability
measurement methods
- network bandwidth, data volume, processing power on measurement machines
data collection
- collecting data from multiple sources
- network bandwidth, data volume, processing power on collecting machines
data analysis
- analysis of huge data sets
- repetition of relatively simple jobs
- complex data processing by data mining methods
- data volume, processing power of analyzing machines
- communication power for distributed processing
computational complexity
metrics for the efficiency of an algorithm
- time complexity
- space complexity
- average-case complexity
- worst-case complexity
big O notation
- describe algorithms simply by the growth order of execution time as input size n increases
  - example: O(n), O(n^2), O(n log n)
- more precisely, "f(n) is order g(n)" means: for functions f(n) and g(n), $f(n) = O(g(n))$ ⇔ there exist constants $C$ and $n_0$ such that $|f(n)| \le C\,|g(n)|$ for all $n \ge n_0$
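As a worked instance of the definition, take the hypothetical function f(n) = 3n^2 + 5n (chosen for illustration, not from the lecture). The constants C = 4 and n_0 = 5 are one possible pair of witnesses that f(n) = O(n^2):

```latex
\begin{align*}
f(n) &= 3n^2 + 5n, \qquad g(n) = n^2 \\
5n &\le n^2 \quad \text{for all } n \ge 5 \\
|f(n)| &= 3n^2 + 5n \;\le\; 3n^2 + n^2 \;=\; 4\,|g(n)| \quad \text{for all } n \ge n_0 = 5
\end{align*}
```

Any larger C or n_0 works as well; the definition only asks that some pair exists.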
computational complexity
- logarithmic time
- polynomial time
- exponential time
[figure: log-log plot of computation time (1 to 1e+08) against input size n (1 to 10000) for O(log n), O(n), O(n log n), O(n**2), O(n**3), O(2**n)]
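The growth curves can also be tabulated directly. The following is a minimal sketch, assuming base-2 logarithms and the same upper limit of 1e8 as the plot's vertical axis; the helper names `growth_functions` and `growth_table` are made up for this illustration.

```ruby
# Step counts for common complexity classes at a few input sizes.
def growth_functions
  {
    "O(log n)"   => ->(n) { Math.log2(n) },
    "O(n)"       => ->(n) { n },
    "O(n log n)" => ->(n) { n * Math.log2(n) },
    "O(n**2)"    => ->(n) { n**2 },
    "O(n**3)"    => ->(n) { n**3 },
    "O(2**n)"    => ->(n) { 2**n }
  }
end

# Returns { n => { class name => value or nil } }; nil marks values above the cap.
def growth_table(sizes, cap = 1e8)
  sizes.map do |n|
    row = growth_functions.map do |name, f|
      v = f.call(n)
      [name, v > cap ? nil : v]
    end.to_h
    [n, row]
  end.to_h
end

growth_table([10, 100, 1000, 10000]).each do |n, row|
  cells = row.map { |name, v| "#{name}=#{v ? v.round : 'over 1e8'}" }
  puts "n=#{n}: #{cells.join(', ')}"
end
```

The exponential column leaves the chart between n = 10 and n = 100, while the logarithmic column barely moves over the whole range.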
example of computational complexity
search algorithms
- linear search: O(n)
- binary search: O(log_2 n)
sort algorithms
- selection sort: O(n^2)
- quick sort: O(n log_2 n) on average, O(n^2) for worst case
in general,
- linear algorithms (e.g., loop): O(n)
- binary trees: O(log n)
- double loops for a variable: O(n^2)
- triple loops for a variable: O(n^3)
- combination of variables (e.g., shortest path by exhaustive search): O(c^n)
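The gap between O(n) and O(log n) search can be seen by counting comparisons. This is a minimal sketch on a sorted integer array; the function names and the array size are choices made for this example.

```ruby
# Linear search: checks elements one by one. Returns [index, comparisons].
def linear_search(sorted, target)
  comparisons = 0
  sorted.each_with_index do |v, i|
    comparisons += 1
    return [i, comparisons] if v == target
  end
  [nil, comparisons]
end

# Binary search: halves the candidate range each step. Returns [index, comparisons].
def binary_search(sorted, target)
  comparisons = 0
  lo = 0
  hi = sorted.length - 1
  while lo <= hi
    mid = (lo + hi) / 2
    comparisons += 1
    case sorted[mid] <=> target
    when 0  then return [mid, comparisons]
    when -1 then lo = mid + 1
    else         hi = mid - 1
    end
  end
  [nil, comparisons]
end

data = (1..1_000_000).to_a   # already sorted
_, linear_steps = linear_search(data, 1_000_000)
_, binary_steps = binary_search(data, 1_000_000)
puts "linear: #{linear_steps} comparisons, binary: #{binary_steps} comparisons"
```

Searching for the last of a million elements costs a million comparisons linearly, but no more than about 20 with binary search.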
distributed algorithms
parallel or concurrent algorithms
- split a job and process the pieces on multiple computers
- issues of communication cost and synchronization
distributed algorithms
- assume that communications are message passing among independent computers
- failures of computers and message losses
merits
- scalability
  - improvement is only linear at best
- fault tolerance
scale-up and scale-out
- scale-up
  - strengthen or extend a single node
  - without issues of parallel processing
- scale-out
  - extend a system by increasing the number of nodes
  - cost performance, fault-tolerance (use of cheap off-the-shelf computers)
[figure: scale-up vs. scale-out]
cloud computing
cloud computing: various definitions
- broadly, computer resources behind a wide-area network
background
- market needs:
  - outsourcing IT resources, management and services
  - no initial investment, no need to predict future demands
  - cost reduction as a result
  - as well as risk management and energy saving, especially after the Japan Earthquake
- providers: economy of scale, walled garden
various clouds
- public/private/hybrid
  - public cloud: public services over the Internet
  - private cloud: internal services for a single organization
  - personal cloud, cloud federation
- service classification: SaaS/PaaS/IaaS
  - SaaS (Software as a Service)
    - provides applications (e.g., Google Apps, Microsoft Online Services)
  - PaaS (Platform as a Service)
    - provides a platform for applications (e.g., Google App Engine, Microsoft Windows Azure)
  - IaaS (Infrastructure as a Service)
    - provides (hardware) infrastructures such as virtualized servers or shared storage (e.g., Amazon EC2, Amazon S3)
  - IaaS provider - IaaS user (utility computing)
  - IaaS user = SaaS provider - SaaS user (web applications)
  - PaaS: a framework to make SaaS development open for third parties
- scale-out cloud/server cloud
key technologies
- virtualization: OS level, I/O level, network level
- utility computing
- energy saving
- data center networking
- management and monitoring technologies
- automatic scaling and load balancing
- large-scale distributed data processing
- related research fields: networking, OS, distributed systems, database, grid computing
- led by commercial services
MapReduce
MapReduce: a parallel programming model developed by Google
Dean, Jeff and Ghemawat, Sanjay. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04. San Francisco, CA. December 2004. http://labs.google.com/papers/mapreduce.html
the slides are taken from the above materials
motivation: large scale data processing
- want to use hundreds or thousands of CPUs for large data processing
- make it easy to use the system without understanding the details of the hardware infrastructures
MapReduce provides
- automatic parallelization and distribution
- fault-tolerance
- I/O scheduling
- status and monitoring
MapReduce programming model
Map/Reduce
- idea from Lisp or other functional programming languages
- generic: for a wide range of applications
- suitable for distributed processing
- able to re-execute after a failure
Map/Reduce in Lisp
(map square '(1 2 3 4)) → (1 4 9 16)
(reduce + '(1 4 9 16)) → 30
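The same two operations exist in Ruby, the language used for the class exercises. A small equivalent of the Lisp lines, with the helper names `squares` and `sum` invented for this example:

```ruby
# map: apply a function to every element independently
def squares(list)
  list.map { |x| x * x }
end

# reduce: fold a list into one value with a binary operator
def sum(list)
  list.reduce(0, :+)
end

p squares([1, 2, 3, 4])        # [1, 4, 9, 16]
p sum(squares([1, 2, 3, 4]))   # 30
```

Because each `map` call touches only its own element, the calls can run on different machines; that independence is what the MapReduce model relies on.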
Map/Reduce in MapReduce
map(in_key, in_value) → list(out_key, intermediate_value)
- key/value pairs as input, produce another set of key/value pairs
reduce(out_key, list(intermediate_value)) → list(out_value)
- using the results of map(), produce a set of merged output values for a particular key
example: count word occurrences
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
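The word-count pseudocode can be imitated in a single Ruby process to see the map, shuffle and reduce phases in order. This is only a sketch: `map_reduce` and `word_count` are hypothetical helpers written for this note, not part of Google's MapReduce or of Hadoop, and nothing here runs in parallel.

```ruby
# Single-process imitation of the map -> shuffle -> reduce flow.
def map_reduce(inputs, mapper, reducer)
  # map phase: each input record yields a list of [key, value] pairs
  intermediate = inputs.flat_map { |key, value| mapper.call(key, value) }
  # shuffle phase: group the intermediate values by key
  grouped = Hash.new { |h, k| h[k] = [] }
  intermediate.each { |key, value| grouped[key] << value }
  # reduce phase: merge the values collected for each key
  grouped.map { |key, values| [key, reducer.call(key, values)] }.to_h
end

def word_count(documents)
  mapper  = ->(_name, contents) { contents.split.map { |w| [w, 1] } }
  reducer = ->(_word, counts) { counts.sum }
  map_reduce(documents, mapper, reducer)
end

docs = { "doc1" => "the quick fox", "doc2" => "the lazy dog the end" }
p word_count(docs)
```

In the real system the mapper calls run on many workers and the shuffle moves data across the network; the user-supplied `mapper` and `reducer` stay the same.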
other applications
- distributed grep
  - map: outputs lines matching a supplied pattern
  - reduce: nothing
- count of URL access frequency
  - map: reads web access logs and outputs <URL, 1>
  - reduce: adds together all values for the same URL and emits <URL, count>
- reverse web-link graph
  - map: outputs <target, source> pairs for each link in web pages
  - reduce: concatenates the list of all source URLs associated with a given target URL and emits the pair <target, list(source)>
- inverted index
  - map: emits <word, docID> from each document
  - reduce: emits the list of <word, list(docID)>
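As one of the applications listed above, an inverted index can be sketched in a few lines of Ruby. The tokenization (lowercasing, `\w+`) and the sorted, de-duplicated document lists are assumptions of this example rather than details given in the slides.

```ruby
# Inverted index in a single process.
# map step: emit [word, doc_id] for every word of every document
# reduce step: collect the doc ids that share a word
def inverted_index(documents)
  pairs = documents.flat_map do |doc_id, contents|
    contents.downcase.scan(/\w+/).map { |word| [word, doc_id] }
  end
  pairs.group_by(&:first).transform_values do |word_pairs|
    word_pairs.map(&:last).uniq.sort
  end
end

docs = { 1 => "Cloud computing", 2 => "cloud storage", 3 => "data storage" }
index = inverted_index(docs)
p index["cloud"]     # [1, 2]
p index["storage"]   # [2, 3]
```

The URL access count and the reverse web-link graph follow the same shape; only the emitted pairs and the merge step differ.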
MapReduce Execution Overview
source: MapReduce: Simplified Data Processing on Large Clusters
MapReduce Execution
source: MapReduce: Simplified Data Processing on Large Clusters
MapReduce Parallel Execution
source: MapReduce: Simplified Data Processing on Large Clusters
Task Granularity and Pipelining
- tasks are fine-grained: the number of Map tasks >> number of machines
  - minimizes time for fault recovery
  - can pipeline shuffling with map execution
  - better dynamic load balancing
- often use 200,000 map/5,000 reduce tasks w/ 2,000 machines
source: MapReduce: Simplified Data Processing on Large Clusters
fault tolerance: handled via re-execution
on worker failure
- detect failure via periodic heartbeats
- re-execute completed and in-progress map tasks
  - need to re-execute completed tasks as results are stored on local disks
- re-execute in-progress reduce tasks
- task completion committed through master
robust: lost 1600 of 1800 machines once, but finished fine
MapReduce status
source: MapReduce: Simplified Data Processing on Large Clusters
refinement: redundant execution
slow workers significantly lengthen completion time
- other jobs consuming resources on machine
- bad disks with soft errors transfer data very slowly
- weird things: processor caches disabled (!!)
solution: near end of phase, spawn backup copies of tasks
- whichever one finishes first "wins"
effect: drastically shortens completion time
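The effect of backup copies can be illustrated with a toy model. The sketch below assumes that a phase finishes when its slowest task finishes, that backups are spawned at a fixed time for every task still running, and that a backup takes a fixed "normal" duration on a healthy worker. All numbers and the name `phase_time` are hypothetical.

```ruby
# durations: seconds each task needs on its assigned worker
# backup_at: time at which backup copies are spawned for tasks still running
# normal:    assumed duration of a backup copy on a healthy worker
def phase_time(durations, backup_at: nil, normal: nil)
  finish = durations.map do |d|
    if backup_at && d > backup_at
      [d, backup_at + normal].min   # whichever copy finishes first "wins"
    else
      d
    end
  end
  finish.max                        # the phase ends with its slowest task
end

tasks = [10, 11, 9, 10, 60]         # one straggler
puts phase_time(tasks)                              # 60
puts phase_time(tasks, backup_at: 12, normal: 10)   # 22
```

One slow worker out of five nearly triples the phase time in this model; the backup copy brings it back close to the time of a normal task.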
refinement: locality optimization
master scheduling policy
- asks GFS for locations of replicas of input file blocks
- map tasks typically split into 64MB (== GFS block size)
- map tasks scheduled so GFS input block replicas are on same machine or same rack
effect: thousands of machines read input at local disk speed
- without this, rack switches limit read rate
refinement: skipping bad records
Map/Reduce functions sometimes fail for particular inputs
- best solution is to debug and fix, but not always possible
- on Segmentation Fault
  - send UDP packet to master from signal handler
  - include sequence number of record being processed
- if master sees two failures for same record,
  - next worker is told to skip the record
effect: can work around bugs in third party libraries
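The skip rule can be mimicked in one process. In this sketch a Ruby exception stands in for the segmentation fault, a local hash stands in for the master's failure table, and the array index stands in for the record sequence number; `process_with_skipping` is a name made up for the example.

```ruby
# Apply the given block to each record, retrying after a failure.
# A record that has failed `threshold` times is skipped on the next attempt.
# Returns [results, skipped sequence numbers].
def process_with_skipping(records, threshold: 2)
  failures = Hash.new(0)   # sequence number => failures seen so far
  results = []
  skipped = []
  records.each_with_index do |record, seq|
    loop do
      if failures[seq] >= threshold
        skipped << seq     # the "master" tells the next attempt to skip it
        break
      end
      begin
        results << yield(record)
        break
      rescue StandardError
        failures[seq] += 1 # the "worker" reports the failing record
      end
    end
  end
  [results, skipped]
end

results, skipped = process_with_skipping(["1", "2", "x", "4"]) { |s| Integer(s) * 10 }
p results   # [10, 20, 40]
p skipped   # [2]
```

The job completes with one record missing from the output, which is the trade-off the refinement accepts.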
other refinement
- sorting guarantees within each reduce partition
- compression of intermediate data
- Combiner: useful for saving network bandwidth
- local execution for debugging/testing
- user-defined counters
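To see why a Combiner saves bandwidth, the sketch below counts how many intermediate pairs would leave each map task for the word-count job, with and without local pre-aggregation. The input splits and all function names are hypothetical; a combiner is valid here because addition is associative and commutative.

```ruby
# map side of word count: one [word, 1] pair per word occurrence
def map_words(contents)
  contents.split.map { |w| [w, 1] }
end

# combiner: merge the pairs of one map task locally, one pair per distinct word
def combine(pairs)
  pairs.group_by(&:first).map { |word, ps| [word, ps.sum { |_, n| n }] }
end

# total number of pairs sent to the reducers over all input splits
def shuffled_pairs(splits, use_combiner:)
  splits.sum do |contents|
    pairs = map_words(contents)
    pairs = combine(pairs) if use_combiner
    pairs.length
  end
end

splits = ["the cat and the dog and the bird", "the end of the story"]
puts shuffled_pairs(splits, use_combiner: false)   # 13
puts shuffled_pairs(splits, use_combiner: true)    # 9
```

The saving grows with the amount of repetition inside each split, which is large for skewed data such as word or URL frequencies.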
performance
test run on cluster of 1800 machines
- 4GB of memory
- Dual-processor 2GHz Xeons with Hyperthreading
- Dual 160GB IDE disks
- Gigabit Ethernet per machine
- Bisection bandwidth approximately 100Gbps
2 benchmarks:
- MR Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
- MR Sort: sort 10^10 100-byte records (modeled after TeraSort benchmark)
MR Grep
- locality optimization helps
  - 1800 machines read 1TB of data at peak of 31GB/s
  - without this, rack switches would limit to 10GB/s
- startup overhead is significant for short jobs
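The figures quoted for MR Grep can be related by simple arithmetic. The following back-of-the-envelope calculation uses only the numbers on the slides and decimal units (1 TB = 10^12 B), and treats the peak rate as if it were sustained, which gives a lower bound rather than the measured run time:

```latex
\begin{align*}
\text{input size} &= 10^{10} \times 100\,\text{B} = 10^{12}\,\text{B} = 1\,\text{TB} \\
\text{per-machine read rate at peak} &\approx \frac{31\,\text{GB/s}}{1800} \approx 17\,\text{MB/s} \\
\text{scan time at the peak rate} &\approx \frac{1\,\text{TB}}{31\,\text{GB/s}} \approx 32\,\text{s} \\
\text{scan time at the rack-switch limit} &\approx \frac{1\,\text{TB}}{10\,\text{GB/s}} = 100\,\text{s}
\end{align*}
```

With a pure scan this short, any fixed startup cost is a visible fraction of the total, which is consistent with the remark about short jobs.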
source: MapReduce: Simplified Data Processing on Large Clusters
MR Sort
- backup tasks reduce job completion time significantly
- system deals well with failures
[figure panels: normal (left), no backup tasks (middle), 200 processes killed (right)]
source: MapReduce: Simplified Data Processing on Large Clusters
MapReduce summary
- MapReduce: abstract model for distributed parallel processing
- considerably simplifies large-scale data processing
  - easy to use, fun!
  - the system takes care of details of parallel processing
  - programmers can concentrate on solving a problem
- various applications inside Google including search index creation
additional note
- Google does not publish the implementation of MapReduce
- Hadoop: open source MapReduce implementation by the Apache project
previous exercise: k-means clustering
- data: hourly traffic for Monday vs. Wednesday/Friday/Sunday
% cat km-1.txt km-2.txt km-3.txt | ruby k-means.rb | \
    sort -k3,3 -s -n > km-results.txt
[figure: scatter plot of Y against X, both axes from 0 to 6000, with three point series labeled 1, 2 and 3]
k-means code (1/2)
k = 3 # k clusters
re = /^(\d+)\s+(\d+)/
INFINITY = 0x7fffffff
# read data
nodes = Array.new # array of array for data points: [x, y, cluster_index]
centroids = Array.new # array of array for centroids: [x, y]
ARGF.each_line do |line|
  if re.match(line)
    c = rand(k) # randomly assign initial cluster
    nodes.push [$1.to_i, $2.to_i, c]
  end
end
round = 0
begin
  updated = false
  # assignment step: assign each node to the closest centroid
  if round != 0 # skip assignment for the 1st round
    nodes.each do |node|
      dist2 = INFINITY # square of distance to the closest centroid
      cluster = 0 # closest cluster index
      for i in (0 .. k - 1)
        d2 = (node[0] - centroids[i][0])**2 + (node[1] - centroids[i][1])**2
        if d2 < dist2
          dist2 = d2
          cluster = i
        end
      end
      node[2] = cluster
    end
  end
k-means code (2/2)
  # update step: compute new centroids
  sums = Array.new(k)
  clsize = Array.new(k)
  for i in (0 .. k - 1)
    sums[i] = [0, 0]
    clsize[i] = 0
  end
  nodes.each do |node|
    i = node[2]
    sums[i][0] += node[0]
    sums[i][1] += node[1]
    clsize[i] += 1
  end
  for i in (0 .. k - 1)
    # after the 1st round an empty cluster keeps its centroid: recomputing it
    # would give NaN, which never compares equal, so the loop would not end
    next if round != 0 && clsize[i] == 0
    newcenter = [Float(sums[i][0]) / clsize[i], Float(sums[i][1]) / clsize[i]]
    if round == 0 || newcenter[0] != centroids[i][0] || newcenter[1] != centroids[i][1]
      centroids[i] = newcenter
      updated = true
    end
  end
  round += 1
end while updated == true
# print the results
nodes.each do |node|
  puts "#{node[0]}\t#{node[1]}\t#{node[2]}"
end
k-means clustering results
- different results with different initial values
set key left
set xrange [0:6000]
set yrange [0:6000]
set xlabel "X"
set ylabel "Y"
plot "km-c1.txt" using 1:2 title "cluster 1" with points, \
"km-c2.txt" using 1:2 title "cluster 2" with points, \
"km-c3.txt" using 1:2 title "cluster 3" with points
[figure: scatter plot of Y against X, both axes from 0 to 6000, with point series cluster 1, cluster 2 and cluster 3]
final report
- select A or B
  - A. web access log analysis
  - B. free topic
- up to 8 pages in the PDF format
- submission via SFC-SFS by 2012-01-25 (Wed) 23:59
final report (cont’d)
A. web access log analysis
- data: apache log (combined log format) used in Class 3
  - from a JAIST server, access log for 24 hours
    http://www.iijlab.net/~kjc/classes/sfc2011f-measurement/sample_access_log.bz2
- write a script to extract the access count of each unique content, and plot the distribution in a log-log plot
  - X-axis: request count, Y-axis: CCDF for the number of URLs
- optionally, do other analysis
- the report should include (1) your script to extract the access counts, (2) a plot of the access count distribution, and (3) your analysis of the results
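One possible starting point for option A is sketched below. It is not a reference solution: it assumes that the request line is the quoted `"METHOD URL PROTOCOL"` field of the combined log format, that only GET, HEAD and POST are of interest, and that the CCDF is taken as P(X >= x) so the largest count stays plottable on a log axis. The function names and the three log lines are synthetic.

```ruby
# Count requests per URL from combined-log-format lines.
def access_counts(lines)
  counts = Hash.new(0)
  lines.each do |line|
    # request line looks like: "GET /path HTTP/1.1"
    counts[$1] += 1 if line =~ /"(?:GET|HEAD|POST) (\S+)[^"]*"/
  end
  counts
end

# CCDF over URLs: for each request count x, the fraction of URLs with count >= x.
def ccdf(counts)
  total = counts.size.to_f
  freq = counts.values.each_with_object(Hash.new(0)) { |c, h| h[c] += 1 }
  remaining = counts.size          # URLs whose count is >= current x
  freq.keys.sort.map do |x|
    point = [x, remaining / total]
    remaining -= freq[x]
    point
  end
end

sample = [
  '192.0.2.1 - - [11/Jan/2012:00:00:01 +0900] "GET /a.html HTTP/1.1" 200 512 "-" "ua"',
  '192.0.2.2 - - [11/Jan/2012:00:00:02 +0900] "GET /a.html HTTP/1.1" 200 512 "-" "ua"',
  '192.0.2.3 - - [11/Jan/2012:00:00:03 +0900] "GET /b.html HTTP/1.1" 200 128 "-" "ua"'
]
ccdf(access_counts(sample)).each { |x, p| puts "#{x}\t#{p}" }
```

For the real log, `ARGF.each_line` can replace `sample`, as in the k-means script, and the two output columns can be plotted with gnuplot after `set logscale xy`.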
B. free topic
- select a topic by yourself
- the topic is not necessarily on networking
- but the report should include some form of data analysis and discussion about data and results
summary
Class 14 Scalable measurement and analysis
- distributed parallel processing
- cloud technology
next class
Class 15 Summary (1/18)
- summary of the class
- Internet measurement and privacy issues