Revolution Confidential
Voice over IP Studying Traffic Characteristics for Quality of Service using R and Hadoop
Date: 12 October, 2010
Jin Xia, Saptarshi Guha, and W.S. Cleveland
Voice over IP (VoIP)
VoIP refers to telephony systems that deliver voice communications over IP networks.
Telecommunication companies are gradually migrating to VoIP. The Public Switched Telephone Network (PSTN) is now commonly connected to VoIP through gateways.
Examples: Skype and SIP-RTP. Quality of Service (QoS) is critical to VoIP applications.
Queuing in the Internet creates packet delay and jitter, which affect QoS.
Offered load: the original traffic before queuing interference. It is crucial to queuing studies.
VoIP Packetization and Transmission
[Figure: semi-call packetization. An analog voice signal becomes a packet trace; labels: 20 ms, 2 s, transmission interval.]
[Figure: transmission through the IP network and the PSTN via a gateway. Caller and callee are connected across the Internet (routers, IP), the gateway, and the PSTN; packet traces are shown for the PSTN-to-IP and IP-to-PSTN directions.]
Objective
Is the measured traffic close to the offered traffic?
Is the absolute jitter between successive voice packets small compared to 20 ms?
Is absolute jitter larger from more distant sending sites, whose packets are expected to travel through more hops?
Does the traffic rate at gateways have a monotone effect on absolute jitter? Here we only have access to the traffic rate at the Newark gateway.
Objective
Does jitter depend on traffic rate? If so, does the dependence differ across sending sites?
Data Sources
VoIP packet timestamps and headers for 48 hours on the Newark (New Jersey, USA) gateway of the Global Crossing (GBLX) network.
Two multiplexed packet traces: from IP to PSTN and from PSTN to IP.
332,018 calls, 1.315 billion packets, 84 GB. 27 sending sites including PSTN:
17 national sending sites, e.g. PSTN, Atlanta, etc. 10 international sending sites, e.g. London, Milan, etc.
R and Hadoop to Compute with Data
We face a problem with heavy load in data processing, modeling and visualization.
R is a programming language and software environment for statistical computing and graphics.
The open source implementation of the S statistical programming language (Chambers, Becker & Wilks)
Highly extensible through packages. Over 2000 packages exist.
R is the most popular language in the scientific community for statistical research.
The standard for rapid prototyping for data analysis
R and Hadoop to Compute With Data
Hadoop is an environment supporting computation on large data sets across clusters.
Use R and Hadoop. The R and Hadoop Integrated Processing Environment (RHIPE) allows R users to compute across data using the MapReduce programming model through the Hadoop system.
Completely within the R environment.
RHIPE - Overview
Based on Hadoop Streaming source.
User writes R code, RHIPE communicates between R and Hadoop.
A variety of R data types can be used for keys and values
Output can be read in Java / Python etc. (uses Protocol Buffers)
Demultiplex
Convert the packet database to a semi-call database.
1079089243.086862 IP UDP 200 67.17.50.213 14484 67.17.58.211 14906 0
1079089243.088272 IP UDP 200 67.17.50.213 5420 67.17.50.6 18228 0
In MapReduce, the map will partition the packets based on source and destination IP, ports and direction (in/out).
Input line: "1079089243.086862 IP UDP 200 67.17.50.213 14484 67.17.58.211 14906 0"
Key: "67.17.50.213.14484.67.17.58.211.14906"
Value: (1079089243.086862, 0)
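map <- expression({
  v <- lapply(seq_along(map.values), function(r) {
    # Split one packet line into whitespace-separated fields
    value0 <- strsplit(map.values[[r]], " +")[[1]]
    # Key: fields 5 to 8 (source IP, source port, destination IP, destination port),
    # joined into a single string
    key <- paste(value0[5:8], collapse = ".")
    # Value: timestamp (field 1) and field 9 (named rtpPT in the reduce)
    value <- c(as.numeric(value0[1]), as.integer(value0[9]))
    rhcollect(key, value)
  })
})
map.keys and map.values are vectors of keys and values.
The reduce aggregates the intermediate output and saves each semi-call as an R data frame.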
Demultiplex
Map emits 664,036 unique keys (semi-calls) and 1.4BN intermediate values (packets).
The cluster has 78 processors; each processor is assigned about 664,036/78 ≈ 8,513 keys and their associated packets to 'reduce' (combine into a data frame).
On each processor the following flow occurs:
  while there are more intermediate keys do
    reduce.key = get new intermediate key
    ... do something with reduce.key                      (PRE)
    while more intermediate values for reduce.key do      (REDUCE)
      reduce.value = get intermediate value for reduce.key
      ...
    end while
    ... all values received, post process                 (POST)
  end while
reduce <- expression(
  # Got a new key
  pre = {
    mydata <- list()
  },
  # While more intermediate values
  reduce = {
    mydata <- append(mydata, reduce.values)
  },
  # All values sent
  post = {
    mydata <- do.call("rbind", mydata)
    colnames(mydata) <- c("time", "rtpPT")
    mydata <- mydata[order(mydata[, 'time']), , drop = FALSE]
    mydata <- data.frame(time = as.numeric(mydata[, 'time']),
                         rtpPT = as.integer(mydata[, 'rtpPT']))
    rhcollect(reduce.key, mydata)
  }
)
reduce.values is a vector of the intermediate values.
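The map and reduce steps can be tried without a cluster. The following is a minimal base-R sketch, not RHIPE itself: it runs the same parsing and the same pre/reduce/post logic in memory. The first two packet lines come from the Demultiplex example; the third is made up here (same semi-call, earlier timestamp) so that the sort in the post step has something to do. The names `pairs`, `groups` and `semi_calls` are illustrative only.

```r
# Packet lines: the first two are from the slides, the third is hypothetical.
lines <- c(
  "1079089243.086862 IP UDP 200 67.17.50.213 14484 67.17.58.211 14906 0",
  "1079089243.088272 IP UDP 200 67.17.50.213 5420 67.17.50.6 18228 0",
  "1079089243.066862 IP UDP 200 67.17.50.213 14484 67.17.58.211 14906 0"
)

# "Map": turn each line into a (key, value) pair.
pairs <- lapply(lines, function(line) {
  f <- strsplit(line, " +")[[1]]
  list(key = paste(f[5:8], collapse = "."),
       value = c(as.numeric(f[1]), as.integer(f[9])))
})

# "Shuffle": collect the indices of all pairs that share a key.
keys <- vapply(pairs, function(p) p$key, character(1))
groups <- split(seq_along(keys), keys)

# "Reduce": bind the values of one key, sort by time, build a data frame.
semi_calls <- lapply(groups, function(idx) {
  m <- do.call("rbind", lapply(pairs[idx], function(p) p$value))
  colnames(m) <- c("time", "rtpPT")
  m <- m[order(m[, "time"]), , drop = FALSE]
  data.frame(time = as.numeric(m[, "time"]),
             rtpPT = as.integer(m[, "rtpPT"]))
})

print(names(semi_calls))
```

Each element of `semi_calls` plays the role of one semi-call data frame that the real reduce would write with rhcollect.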
# Create a job
> z <- rhmr(map=map,reduce=reduce,inout=c("text","map"),ifolder="/voip/iprtp.traces",ofolder="/voip/call.traces",jobname="create call trace database")
# Launch a job
> job <- rhex(z, async = TRUE)
# Monitor
> print(job)
RHIPE Job Token Information
--------------------------
URL: http://spica:50030/jobdetails.jsp?jobid=job_201007281701_0053
Name: 2010-07-28 23:33:44
ID: job_201007281701_0053
Submission Time: 2010-07-28 23:33:45
State: RUNNING
Duration(sec): 11.702
Progress
       pct numtasks pending running complete failed
map      0      156     146      10        0      0
reduce   0       78      78       0        0      0
Jobs are created, launched and monitored from the R console.
Read results
> rhread("/voip/call.traces",type="map",max=1)
RHIPE: Read 1 pair occupying 159 bytes, deserializing
[[1]]
[[1]][[1]]
[1] "67.17.50.213.5054.67.17.50.6.6640.in"

[[1]][[2]]
        time rtpPT
1 1079007238     0
2 1079007238    19
3 1079007238    19

> rhgetkey(list(c("67.17.50.213.5054.67.17.50.6.6640.in")),"/voip/call.traces/p*")
Results are a list of (key, value) lists.
Can query by key (MapFile).
Why Vectors
Scatter plot of reading in 11.2 GB of CSV data and tokenizing the lines into columns.
MAP_MAX is the number of key-value pairs given to the map expression.
The red points are the group medians and the red curve is the loess fit.
Call Summaries
Compute start time, end time, duration and number of packets.
Compute for each semi-call and pair up by call identifier.
m <- expression({
  lapply(seq_along(map.values), function(i) {
    key0 <- map.keys[[i]]      # semi-call identifier
    value0 <- map.values[[i]]  # semi-call packets
    key.elements <- strsplit(key0, "\\.")[[1]]
    key <- paste(key.elements[1:10], collapse = ".")  # call identifier
    n.pkt <- dim(value0)[1]
    start <- value0[1, 1]
    end <- value0[n.pkt, 1]
    dur <- end - start + 0.02
    # call summary, labeled by direction (element 11 of the semi-call identifier)
    value <- if (key.elements[11] == "in")
      c(in.start = start, in.end = end, in.dur = dur, in.pkt = n.pkt)
    else
      c(out.start = start, out.end = end, out.dur = dur, out.pkt = n.pkt)
    rhcollect(key, value)
  })
})
Call Summaries
In the reduce, we get two semi-call summaries for each call.
r <- expression(
  pre = {
    mydata <- list()
  },
  reduce = {
    mydata <- append(mydata, reduce.values)
  },
  post = {
    mydata <- unlist(mydata)
    in.start <- if (!is.null(mydata['in.start'])) mydata['in.start'] else NA
    in.end <- if (!is.null(mydata['in.end'])) mydata['in.end'] else NA
    … ….
    out.end <- if (!is.null(mydata['out.end'])) mydata['out.end'] else NA
    out.start <- if (!is.null(mydata['out.start'])) mydata['out.start'] else NA
    value <- c(in.start, in.end, in.dur, in.pkt, out.start, out.end, out.dur, out.pkt)
    rhcollect(reduce.key, value)
  }
)
reduce.values will be of length 2.
Create Jitter Database
Each semi-call gives rise to one or multiple jitter objects: data frames of jitter corresponding to transmission intervals.
map <- expression({
  # Template: loop over the semi-calls given to this map
  lapply(seq_along(map.values), function(i) {
    key0 <- map.keys[[i]]
    value0 <- map.values[[i]]
    n.pkt <- dim(value0)[1]
    if (n.pkt > 1) {
      # Logic: jitter between successive packets that both have rtpPT 0 or 8
      arrival <- value0$time
      flag <- value0$rtpPT == 0 | value0$rtpPT == 8
      tmp.jitter <- diff(arrival)
      tmp.arrival1 <- arrival[-n.pkt]
      tmp.flag1 <- flag[-n.pkt]
      tmp.flag2 <- flag[2:n.pkt]
      tmp.transmission <- tmp.flag1 & tmp.flag2
      # Deviation from the 20 ms packet spacing, in milliseconds
      jitter <- round(tmp.jitter[tmp.transmission] - 0.02, 6) * 1000
      arrival1 <- tmp.arrival1[tmp.transmission]
      # Start and end positions of each transmission interval
      tmp.id <- diff(c(FALSE, tmp.transmission, FALSE))
      transmission.start <- seq_along(tmp.id)[tmp.id == 1]
      transmission.end <- seq_along(tmp.id)[tmp.id == -1]
      if (length(transmission.start) > 0) {
        group <- rep(seq_along(transmission.start), transmission.end - transmission.start)
        # Write jitter object: one (arrival, jitter) pair per packet,
        # keyed by semi-call and transmission interval
        lapply(seq_along(group), function(i) {
          key <- c(key0, group[i])
          value <- c(arrival1[i], jitter[i])
          rhcollect(key, value)
        })
      }
    }
  })
})
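To see what the jitter logic produces, it can be run on a toy semi-call outside RHIPE. The sketch below is base R only and uses invented timestamps; the helper name `jitter_objects` is hypothetical, and the payload type 13 is used here simply as an example of a value other than 0 or 8. It follows the same steps as the map (20 ms nominal spacing, jitter in milliseconds, one group per run of consecutive qualifying packets), but returns the groups as a list instead of calling rhcollect.

```r
# Split one semi-call data frame (columns time, rtpPT) into jitter objects.
jitter_objects <- function(semi_call) {
  n <- nrow(semi_call)
  if (n < 2) return(list())
  arrival <- semi_call$time
  voice <- semi_call$rtpPT == 0 | semi_call$rtpPT == 8
  gaps <- diff(arrival)
  # A gap counts only when the packets on both sides have rtpPT 0 or 8.
  both <- voice[-n] & voice[-1]
  if (!any(both)) return(list())
  jitter <- round(gaps[both] - 0.02, 6) * 1000   # milliseconds
  arrival1 <- arrival[-n][both]
  # Runs of consecutive TRUE values in `both` are the transmission intervals.
  id <- diff(c(FALSE, both, FALSE))
  run_length <- which(id == -1) - which(id == 1)
  group <- rep(seq_along(run_length), run_length)
  split(data.frame(arrival1 = arrival1, jitter = jitter), group)
}

# Toy semi-call (invented values): two bursts separated by one other packet.
semi_call <- data.frame(
  time  = c(0.000, 0.020, 0.041, 0.500, 1.000, 1.019, 1.040),
  rtpPT = c(0L, 0L, 0L, 13L, 0L, 0L, 0L)
)
objs <- jitter_objects(semi_call)
print(objs)
```

The toy input yields two jitter objects, one per burst, which mirrors how a single semi-call can give rise to several jitter data frames.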
r <- expression(
  pre = {
    mydata <- list()
  },
  # Append the incoming values
  reduce = {
    mydata <- append(mydata, reduce.values)
  },
  # Combine them and save
  post = {
    mydata <- do.call("rbind", mydata)
    colnames(mydata) <- c("arrival1", "jitter")
    mydata <- mydata[order(mydata[, 'arrival1']), , drop = FALSE]
    mydata <- data.frame(arrival1 = as.numeric(mydata[, 'arrival1']),
                         jitter = as.numeric(mydata[, 'jitter']))
    rhcollect(reduce.key, mydata)
  }
)
Result: 14 million jitter objects, 6.5 min, 21 GB.
Remove Gateway Effect
The jitter is cyclical because of a gateway effect.
This can be removed using regression with a bisquare robust estimator (Andrews et al., 1972) to return residuals.
Only on jitter objects with more than 90 packets. This is a pure map: apply regression across 3.8 million data frames (13.9 GB).
map <- expression({
  X9 <- diag(rep(1, 9))
  # Bisquare weight function
  mywt.bisquare <- function(u, c = 6) {
    U <- abs(u / c)
    w <- ((1 + U) * (1 - U))^2
    w[U > 1] <- 0
    w
  }
  lapply(seq_along(map.values), function(i) {
    key0 <- map.keys[[i]]
    value0 <- map.values[[i]]
    n.jitter <- nrow(value0)
    if (n.jitter >= 90) {
      for (i in 1:3) {
        # iterative LS regression with bisquare weights
        fit.lm <- lm.wfit(xx, jitter, wt)  # lm(jitter ~ xx - 1, weights = wt)
        # … checks
      }
    }
  })
})
rhex(map=map, input,output,mapred=list(mapred.reduce.tasks=0))
An lapply for big datasets
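The regression step is elided on the slide, so the following is only a minimal sketch of iteratively reweighted least squares with bisquare weights in base R. The weight function is the one defined in the map; everything else is an assumption made for illustration: the helper name `robust_fit`, scaling residuals by their median absolute value, and the toy straight-line data with one gross outlier. The design matrix used in the talk is not shown, so none is implied here.

```r
# Bisquare weights, as in the map expression.
bisquare_weights <- function(u, c = 6) {
  U <- abs(u / c)
  w <- ((1 + U) * (1 - U))^2
  w[U > 1] <- 0
  w
}

# Iteratively reweighted least squares (assumed scheme, see lead-in).
robust_fit <- function(x, y, iterations = 3) {
  wt <- rep(1, length(y))
  for (k in seq_len(iterations)) {
    fit <- lm.wfit(x, y, wt)
    res <- as.vector(y - x %*% fit$coefficients)
    s <- median(abs(res))          # assumed residual scale
    if (s == 0) break              # perfect fit, nothing left to reweight
    wt <- bisquare_weights(res / s)
  }
  list(coefficients = fit$coefficients, residuals = res, weights = wt)
}

# Toy data (invented): a line with a small wiggle and one gross outlier.
x <- 1:20
y <- 2 + 3 * x + 0.3 * sin(x)
y[10] <- y[10] + 100
X <- cbind(1, x)

fit <- robust_fit(X, y)
print(unname(fit$coefficients))
print(fit$weights[10])
```

In this toy case the outlier ends up with weight zero, so the fitted line stays close to the bulk of the data and the returned residuals are what a pure-map job would emit per jitter object.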
Compare Rate Normalized Jitter
Compare jitter across sending sites while accounting for traffic rate.
Compute traffic rate (bits per second) over 30-second intervals (20 ms is too noisy): round the time of each packet down to the nearest 30 seconds and get counts.
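The binning step can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the code used for the talk: the function name `traffic_rate` is hypothetical, and the fixed packet size of 200 bytes is an assumption needed to turn packet counts into bits per second.

```r
# Packet counts and traffic rate per fixed-width time interval.
traffic_rate <- function(time, packet_bytes = 200, width = 30) {
  # Round each timestamp down to the start of its interval.
  bin <- floor(time / width) * width
  counts <- table(bin)
  data.frame(
    interval_start = as.numeric(names(counts)),
    packets = as.vector(counts),
    bits_per_sec = as.vector(counts) * packet_bytes * 8 / width
  )
}

# Invented timestamps (seconds): three packets in [0, 30), one in each of the next two bins.
rate <- traffic_rate(c(0, 10, 29.9, 30, 61))
print(rate)
```

On a cluster the same rounding would serve as the map key, with the counts summed in the reduce.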
Compare Rate Normalized Jitter
Create near-replicate subsets that share an almost identical traffic rate distribution for each sending site.
Approximately 30,000 observations per sending site.
Regression across all subsets and recombine results.
Figure: fifth root of absolute jitter vs. traffic rate for a subset of PSTN (untransformed jitter is too skewed).
Compare Rate Normalized Jitter
Residual plots with a loess fit show the appropriateness of the regression.
Figure: regression residuals against traffic rate for a subset of Newark.
Compare Rate Normalized Jitter
Rate normalized jitter mean and standard deviation have smaller differences within than across sending sites.
Rate normalized jitter mean and standard deviation follow Normal distributions across subsets within each sending site.
We can use the above information to recombine the results by taking averages across subsets within sending site.
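The recombination step amounts to averaging per-subset results within each sending site. A minimal base-R sketch follows; the site labels, column names and numbers are all invented for illustration and are not results from the study.

```r
# One row per subset: hypothetical rate-normalized jitter mean and SD.
subset_fits <- data.frame(
  site        = c("siteA", "siteA", "siteA", "siteB", "siteB", "siteB"),
  mean_jitter = c(1.2, 1.4, 1.3, 0.9, 1.1, 1.0),
  sd_jitter   = c(0.50, 0.60, 0.55, 0.40, 0.45, 0.35)
)

# Recombine: average each statistic across subsets within a sending site.
site_summary <- aggregate(cbind(mean_jitter, sd_jitter) ~ site,
                          data = subset_fits, FUN = mean)
print(site_summary)
```

Averaging is a reasonable recombination here because, as noted above, the per-subset statistics vary little within a site and look approximately Normal.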
Compare Rate Normalized Jitter
Figure: mean regression fit for each sending site.
Jitter shows a monotone increasing relationship with traffic rate.
Compare Rate Normalized Jitter
The mean and standard deviation of rate-corrected jitter per subset are averaged and compared between international and national sending sites.
As expected, international sites show more jitter than national ones.
Conclusion
Jitter appears to be affected by traffic rate.
Jitter depends on sending site distance, as expected, but the differences among sending sites are relatively small compared to the scale of the raw data and to QoS requirements.
We can safely say that the observed traffic is a good approximation of offered traffic.
Conclusion - R and Hadoop
Flexible parallel programming models, e.g. MapReduce, give the analyst powerful ways to distribute code execution without worrying about the intricacies.
Mixing R and Hadoop makes it easy for the analyst to think and work within the R framework. All the examples use R and RHIPE. Data for all visualizations were created using RHIPE.
Data created using RHIPE can be read using other languages (e.g. Java, Python).
The leading commercial provider of software and support for the popular open source R statistics language.
www.revolutionanalytics.com (650) 330 0553
Twitter: @RevolutionR