
Revolution Confidential

Voice over IP: Studying Traffic Characteristics for Quality of Service using R and Hadoop

Date: 12 October, 2010


Jin Xia, Saptarshi Guha, and W.S. Cleveland

Voice over IP (VoIP)

  VoIP refers to the telephony system that delivers voice communications through IP networks.

  Telecommunication companies are gradually migrating to VoIP.

  The Public Switched Telephone Network (PSTN) is now commonly connected to VoIP through gateways.

  Skype and SIP-RTP.

  Quality of Service (QoS) is critical to VoIP applications.

  Internet queuing creates packet delay and jitter, which affect QoS.

  Offered load: the original traffic before queuing interference. It is crucial to queuing studies.

VoIP Packetization and Transmission

  Semi-call packetization

  Transmission through IP network and PSTN via gateway

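[Figure: packetization of an analog voice signal into a packet trace (labels: 20 ms, 2 s, transmission interval), and transmission between caller and callee over the Internet, with PSTN-to-IP and IP-to-PSTN packet traces passing through a gateway and IP routers.]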

Objective

  Is the measured traffic close to the offered traffic?

  Is the absolute jitter between successive voice packets small compared to 20 ms?

  Is the absolute jitter larger from farther sending sites, whose packets are expected to travel through more hops?

  Does the traffic rate at gateways have a monotone effect on absolute jitter? In this case we only have access to the traffic rate at the Newark gateway.

Objective (continued)

Does jitter depend on traffic rate? If yes, is it different across sending sites?

Data Sources

  VoIP packet timestamps and headers for 48 hours on the Newark (New Jersey, USA) gateway of the Global Crossing (GBLX) network.

  Two multiplexed packet traces: from IP to PSTN and from PSTN to IP.

  332,018 calls, 1.315 billion packets, 84 GB.

  27 sending sites including PSTN:

  17 national sending sites, e.g. PSTN, Atlanta, etc.

  10 international sending sites, e.g. London, Milan, etc.

R and Hadoop to Compute with Data

  We face a heavy load in data processing, modeling and visualization.

  R is a programming language and software environment for statistical computing and graphics.

  An open source implementation of the S statistical programming language (Chambers, Becker & Wilks).

  Highly extensible through packages. Over 2000 packages exist.

  R is the most popular language in the scientific community for statistical research.

The standard for rapid prototyping for data analysis

R and Hadoop to Compute With Data

  Hadoop is an environment supporting computation on large data sets across clusters.

  Use R and Hadoop.

  The R and Hadoop Integrated Processing Environment (RHIPE) allows R users to compute across data using the MapReduce programming model through the Hadoop system.

  Completely within the R environment.

RHIPE - Overview

  Based on Hadoop Streaming source.

  User writes R code, RHIPE communicates between R and Hadoop.

  A variety of R data types can be used for keys and values

  Output can be read in Java / Python etc. (uses Protocol Buffers)


Demultiplex

Convert packet database to semi-call database

1079089243.086862 IP UDP 200 67.17.50.213 14484 67.17.58.211 14906 0
1079089243.088272 IP UDP 200 67.17.50.213 5420 67.17.50.6 18228 0

In MapReduce, the map will partition the packets based on source and destination IP, ports and direction (in/out)

input line: "1079089243.086862 IP UDP 200 67.17.50.213 14484 67.17.58.211 14906 0"
key:        "67.17.50.213.14484.67.17.58.211.14906"
value:      (1079089243.086862, 0)


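map <- expression({
  # map.values holds one raw packet record (a line of text) per element
  v <- lapply(seq_along(map.values), function(r) {
    value0 <- strsplit(map.values[[r]], " +")[[1]]   # split the record on whitespace
    # key: source IP, source port, destination IP, destination port
    # (collapse, not sep, is what joins the elements of one vector)
    key <- paste(value0[5:8], collapse = ".")
    # value: arrival timestamp and the last field (named rtpPT in the reduce)
    value <- c(as.numeric(value0[1]), as.integer(value0[9]))
    rhcollect(key, value)   # emit the key/value pair
  })
})

map.keys and map.values are vectors of keys and values.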

The reduce aggregates the intermediate output and saves each semi-call as an R data frame
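The demultiplexing step can be tried locally on the two sample records. This is a minimal sketch in plain R with no Hadoop or RHIPE involved; the helper name parse_packet and the grouping with split are assumptions made for illustration, standing in for what the map, the shuffle and the reduce do on the cluster.

```r
# Two sample packet records from the slide; fields are whitespace-separated.
lines <- c(
  "1079089243.086862 IP UDP 200 67.17.50.213 14484 67.17.58.211 14906 0",
  "1079089243.088272 IP UDP 200 67.17.50.213 5420 67.17.50.6 18228 0"
)

# "Map" step (hypothetical local helper): one key/value pair per record.
parse_packet <- function(line) {
  f <- strsplit(line, " +")[[1]]
  list(key = paste(f[5:8], collapse = "."),
       value = c(time = as.numeric(f[1]), rtpPT = as.integer(f[9])))
}
pairs <- lapply(lines, parse_packet)

# "Shuffle" step: group the pair indices by key.
keys <- vapply(pairs, function(p) p$key, character(1))
idx_by_key <- split(seq_along(pairs), keys)

# "Reduce" step: bind each group's values into a time-ordered data frame.
semi_calls <- lapply(idx_by_key, function(idx) {
  m <- do.call(rbind, lapply(pairs[idx], function(p) p$value))
  m <- m[order(m[, "time"]), , drop = FALSE]
  data.frame(time = m[, "time"], rtpPT = as.integer(m[, "rtpPT"]))
})
```

Each element of semi_calls is named by its key and holds the packets of one semi-call, which mirrors the key/value layout of the output database.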

Demultiplex

  Map emits 664,036 unique keys (semi-calls) and 1.4 billion intermediate values (packets).

  The cluster has 78 processors; each processor is assigned about 664,036/78 ≈ 8,513 keys and their associated packets to 'reduce' (combine into a data frame).

  On each processor the following flow occurs:

    while there are more intermediate keys do
      reduce.key = get new intermediate key
      ... do something with reduce.key                          (PRE)
      while more intermediate values for reduce.key do
        reduce.value = get intermediate value for reduce.key    (REDUCE)
        ...
      end while
      ... all values received, post process                     (POST)
    end while
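The flow above can be mimicked on one machine. The sketch below is a hypothetical local driver in plain R, not RHIPE code: the function run_reduce, the batch size and the toy values are all assumptions, chosen only to show where the pre, reduce and post stages sit in the loop.

```r
# Keys per processor quoted on the slide: 664,036 keys over 78 processors.
keys_per_processor <- 664036 / 78   # a little over 8513

# Hypothetical local driver for the pre / reduce / post flow of one processor.
run_reduce <- function(intermediate, batch_size = 2) {
  out <- list()
  for (reduce.key in names(intermediate)) {        # while there are more keys
    values <- intermediate[[reduce.key]]
    acc <- list()                                   # PRE: a new key has arrived
    for (s in seq(1, length(values), by = batch_size)) {
      # REDUCE: the values for this key arrive in batches
      reduce.values <- values[s:min(s + batch_size - 1, length(values))]
      acc <- append(acc, reduce.values)
    }
    out[[reduce.key]] <- do.call(rbind, acc)        # POST: all values received
  }
  out
}

# Toy intermediate output: each value is c(time, rtpPT).
intermediate <- list(
  callA = list(c(1.00, 0), c(1.02, 0), c(1.04, 0)),
  callB = list(c(5.00, 8))
)
result <- run_reduce(intermediate)
```

Because values can arrive in several batches, the accumulator is created in the pre stage and only combined in the post stage.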


reduce <- expression(
  pre = {
    # got a new key: start with an empty list
    mydata <- list()
  },
  reduce = {
    # while more intermediate values arrive: accumulate them
    mydata <- append(mydata, reduce.values)
  },
  post = {
    # all values sent: combine into a time-ordered data frame and save it
    mydata <- do.call("rbind", mydata)
    colnames(mydata) <- c("time", "rtpPT")
    mydata <- mydata[order(mydata[, "time"]), , drop = FALSE]
    mydata <- data.frame(time = as.numeric(mydata[, "time"]),
                         rtpPT = as.integer(mydata[, "rtpPT"]))
    rhcollect(reduce.key, mydata)
  }
)

reduce.values is a vector of the intermediate values.


> z <- rhmr(map=map, reduce=reduce, inout=c("text","map"), ifolder="/voip/iprtp.traces", ofolder="/voip/call.traces", jobname="create call trace database")   # create a job
> job <- rhex(z, async = TRUE)   # launch the job
> print(job)                     # monitor
RHIPE Job Token Information
--------------------------
URL: http://spica:50030/jobdetails.jsp?jobid=job_201007281701_0053
Name: 2010-07-28 23:33:44
ID: job_201007281701_0053
Submission Time: 2010-07-28 23:33:45
State: RUNNING
Duration(sec): 11.702
Progress
       pct numtasks pending running complete failed
map      0      156     146      10        0      0
reduce   0       78      78       0        0      0

Jobs are created, launched and monitored from the R console.


> rhread("/voip/call.traces", type="map", max=1)   # read results
RHIPE: Read 1 pair occupying 159 bytes, deserializing
[[1]]
[[1]][[1]]
[1] "67.17.50.213.5054.67.17.50.6.6640.in"

[[1]][[2]]
        time rtpPT
1 1079007238     0
2 1079007238    19
3 1079007238    19

> rhgetkey(list(c("67.17.50.213.5054.67.17.50.6.6640.in")), "/voip/call.traces/p*")   # can query by key (MapFile)

Results are a list of key, value lists.

Why Vectors

  Scatter plot of reading in 11.2 GB of CSV data and tokenizing the lines into columns.

  MAP_MAX is the number of key, value pairs given to the map expression.

  The red points are the group medians and the red curve is the loess fit.
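As a small illustration of tokenizing a batch of lines at once, here is a sketch in plain R without RHIPE. The CSV lines are made up, and the batch simply stands in for the values handed to one call of the map expression.

```r
# Toy batch of CSV lines standing in for the values given to one map call.
batch <- c("1.0,0,a", "2.5,8,b", "3.7,19,c")

# One vectorized strsplit call tokenizes every line in the batch.
tokens <- strsplit(batch, ",", fixed = TRUE)

# Bind the tokens into a character matrix: one row per line, one column per field.
columns <- do.call(rbind, tokens)
```

With a batch, the per-call overhead of strsplit and rbind is paid once per MAP_MAX lines rather than once per line.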

Call Summaries

  Compute start time, end time, duration and number of packets

  Compute for each semi-call and pair up by call identifier

m <- expression({
  lapply(seq_along(map.values), function(i) {
    key0 <- map.keys[[i]]       # semi-call identifier
    value0 <- map.values[[i]]   # semi-call packets
    key.elements <- strsplit(key0, "\\.")[[1]]
    key <- paste(key.elements[1:10], collapse = ".")   # call identifier
    n.pkt <- dim(value0)[1]
    start <- value0[1, 1]
    end <- value0[n.pkt, 1]
    dur <- end - start + 0.02
    # call summary, labelled by direction (element 11 of the semi-call key)
    value <- if (key.elements[11] == "in")
      c(in.start = start, in.end = end, in.dur = dur, in.pkt = n.pkt)
    else
      c(out.start = start, out.end = end, out.dur = dur, out.pkt = n.pkt)
    rhcollect(key, value)
  })
})

Call Summaries

  In the reduce, we get two summaries for each call, one per semi-call

r <- expression(
  pre = {
    mydata <- list()
  },
  reduce = {
    mydata <- append(mydata, reduce.values)
  },
  post = {
    mydata <- unlist(mydata)   # one named vector holding the in.* and out.* summaries
    # Indexing by a missing name yields NA, never NULL, so test the names instead.
    in.start <- if ("in.start" %in% names(mydata)) mydata["in.start"] else NA
    in.end <- if ("in.end" %in% names(mydata)) mydata["in.end"] else NA
    # … …. (remaining fields elided on the slide)
    out.end <- if ("out.end" %in% names(mydata)) mydata["out.end"] else NA
    out.start <- if ("out.start" %in% names(mydata)) mydata["out.start"] else NA
    value <- c(in.start, in.end, in.dur, in.pkt, out.start, out.end, out.dur, out.pkt)
    rhcollect(reduce.key, value)
  }
)

reduce.values will be of length 2
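The pairing of the two semi-call summaries can be checked locally. The following is a minimal sketch in plain R under stated assumptions: the packet times are invented, the "out" key is assumed to share the first ten key elements with the "in" key, and the helper summarise_semi_call is hypothetical.

```r
# Toy semi-calls; the packet times and the "out" key are made up for illustration.
semi_calls <- list(
  "67.17.50.213.5054.67.17.50.6.6640.in" =
    data.frame(time = c(10.00, 10.02, 10.04), rtpPT = c(0L, 0L, 0L)),
  "67.17.50.213.5054.67.17.50.6.6640.out" =
    data.frame(time = c(10.01, 10.03), rtpPT = c(0L, 0L))
)

# Map step (hypothetical helper): one summary per semi-call, keyed by call identifier.
summarise_semi_call <- function(key0, packets) {
  parts <- strsplit(key0, "\\.")[[1]]
  n.pkt <- nrow(packets)
  start <- packets$time[1]
  end <- packets$time[n.pkt]
  s <- c(start = start, end = end, dur = end - start + 0.02, pkt = n.pkt)
  names(s) <- paste(parts[11], names(s), sep = ".")   # prefix with "in" or "out"
  list(key = paste(parts[1:10], collapse = "."), value = s)
}
pairs <- Map(summarise_semi_call, names(semi_calls), semi_calls)

# Reduce step: merge the summaries that share a call identifier.
fields <- c("in.start", "in.end", "in.dur", "in.pkt",
            "out.start", "out.end", "out.dur", "out.pkt")
call_ids <- vapply(pairs, function(p) p$key, character(1))
calls <- lapply(split(seq_along(pairs), call_ids), function(idx) {
  merged <- unlist(unname(lapply(pairs[idx], function(p) p$value)))
  setNames(merged[fields], fields)   # a missing summary shows up as NA
})
```

Selecting by the fixed field list keeps the output in one order for every call and fills a missing direction with NA.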

Create Jitter Database

  Each semi-call gives rise to one or multiple jitter objects: data frames of jitter corresponding to transmission intervals.

map <- expression({
  # template: loop over the key/value pairs given to this map call
  lapply(seq_along(map.values), function(i) {
    key0 <- map.keys[[i]]
    value0 <- map.values[[i]]
    n.pkt <- dim(value0)[1]
    if (n.pkt > 1) {
      # logic: jitter between successive voice packets, grouped by transmission interval
      arrival <- value0$time
      flag <- value0$rtpPT == 0 | value0$rtpPT == 8      # voice packets
      tmp.jitter <- diff(arrival)                         # inter-arrival times
      tmp.arrival1 <- arrival[-n.pkt]
      tmp.flag1 <- flag[-n.pkt]
      tmp.flag2 <- flag[2:n.pkt]
      tmp.transmission <- tmp.flag1 & tmp.flag2           # both packets are voice
      jitter <- round(tmp.jitter[tmp.transmission] - 0.02, 6) * 1000   # in ms
      arrival1 <- tmp.arrival1[tmp.transmission]
      tmp.id <- diff(c(FALSE, tmp.transmission, FALSE))   # +1 at a start, -1 at an end
      transmission.start <- seq_along(tmp.id)[tmp.id == 1]
      transmission.end <- seq_along(tmp.id)[tmp.id == -1]
      if (length(transmission.start) > 0) {
        group <- rep(seq_along(transmission.start),
                     transmission.end - transmission.start)
        # write jitter object: one key/value pair per jitter measurement
        lapply(seq_along(group), function(j) {
          key <- c(key0, group[j])
          value <- c(arrival1[j], jitter[j])
          rhcollect(key, value)
        })
      }
    }
  })
})
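The jitter logic is easier to follow on a toy semi-call. This sketch is plain R with invented arrival times; as in the map above, only payload types 0 and 8 count as voice packets, and the value 19 is used here merely as an example of a non-voice packet.

```r
# Toy semi-call: arrival times in seconds and RTP payload types (made-up data).
arrival <- c(0.000, 0.020, 0.041, 0.300, 0.500, 0.520, 0.539)
rtpPT   <- c(0L, 0L, 0L, 19L, 0L, 0L, 0L)

n.pkt <- length(arrival)
flag <- rtpPT == 0 | rtpPT == 8                 # voice packets
gap <- diff(arrival)                            # inter-arrival times
both.voice <- flag[-n.pkt] & flag[-1]           # gaps between two voice packets

# Jitter in milliseconds: deviation of each gap from the nominal 20 ms.
jitter <- round(gap[both.voice] - 0.02, 6) * 1000
arrival1 <- arrival[-n.pkt][both.voice]

# Label runs of consecutive voice gaps: each run is one transmission interval.
edges <- diff(c(FALSE, both.voice, FALSE))
run.start <- which(edges == 1)
run.end <- which(edges == -1)
group <- rep(seq_along(run.start), run.end - run.start)

# One jitter object (data frame) per transmission interval.
jitter.objects <- split(data.frame(arrival1, jitter), group)
```

The non-voice packet in the middle breaks the semi-call into two transmission intervals, so two jitter objects result.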


r <- expression(
  pre = {
    mydata <- list()
  },
  reduce = {
    # append the incoming values
    mydata <- append(mydata, reduce.values)
  },
  post = {
    # combine them into an arrival-ordered data frame and save it
    mydata <- do.call("rbind", mydata)
    colnames(mydata) <- c("arrival1", "jitter")
    mydata <- mydata[order(mydata[, "arrival1"]), , drop = FALSE]
    mydata <- data.frame(arrival1 = as.numeric(mydata[, "arrival1"]),
                         jitter = as.numeric(mydata[, "jitter"]))
    rhcollect(reduce.key, mydata)
  }
)

14 million jitter objects, 6.5 min, 21 GB.

Remove Gateway Effect

  The jitter is cyclical because of a gateway effect.

  This can be removed using regression with the bisquare robust estimator (Andrews et al., 1972) to return residuals.

  Only on jitter objects with more than 90 packets.

  This is a pure map: apply regression across 3.8 million data frames (13.9 GB).

map <- expression({
  X9 <- diag(rep(1, 9))
  # bisquare weight: (1 - (u/c)^2)^2 for |u/c| <= 1, zero otherwise
  mywt.bisquare <- function(u, c = 6) {
    U <- abs(u / c)
    w <- ((1 + U) * (1 - U))^2
    w[U > 1] <- 0
    w
  }
  lapply(seq_along(map.values), function(i) {
    key0 <- map.keys[[i]]
    value0 <- map.values[[i]]
    n.jitter <- nrow(value0)
    if (n.jitter >= 90) {
      for (i in 1:3) {
        # iterative LS regression with bisquare weights
        fit.lm <- lm.wfit(xx, jitter, wt)   # lm(jitter ~ xx - 1, weights = wt)
        # … checks
      }
    }
  })
})

rhex(map=map, input,output,mapred=list(mapred.reduce.tasks=0))

An lapply for big datasets
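Since the slide elides the body of the regression loop, here is a self-contained sketch of iteratively reweighted least squares with bisquare weights in plain R. It is not the authors' code: the 9-packet cycle is only suggested by the 9 x 9 identity matrix in the slide, the residuals are scaled by their median absolute value as an assumption, and the data are synthetic.

```r
set.seed(1)
n <- 180
phase <- factor((seq_len(n) - 1) %% 9)        # assumed 9-packet gateway cycle
xx <- model.matrix(~ phase - 1)               # one indicator column per phase
cycle <- c(0.5, -0.3, 0.2, 0, -0.4, 0.3, 0.1, -0.2, -0.2)
jitter <- cycle[as.integer(phase)] + rnorm(n, sd = 0.05)   # synthetic jitter
jitter[c(10, 50)] <- jitter[c(10, 50)] + 5    # two gross outliers

# Bisquare weight function, as in the slide.
wt.bisquare <- function(u, c = 6) {
  U <- abs(u / c)
  w <- ((1 + U) * (1 - U))^2
  w[U > 1] <- 0
  w
}

wt <- rep(1, n)                               # start from ordinary least squares
for (iter in 1:3) {
  fit <- lm.wfit(xx, jitter, wt)
  res <- jitter - drop(xx %*% fit$coefficients)
  s <- median(abs(res))                       # robust scale (assumption)
  wt <- wt.bisquare(res / s)                  # downweight large residuals
}
residuals.clean <- res                        # jitter with the cyclic effect removed
```

After three passes the outliers carry zero weight, so the fitted cycle is driven by the bulk of the data and the residuals can be used as gateway-corrected jitter.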

Compare Rate Normalized Jitter

  Compare jitter across sending sites while accounting for traffic rate.

  Compute traffic rate (bits per second) for 30 second intervals (20 ms is too noisy).

  Round the time of each packet down to the nearest 30 seconds and get counts.
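The rate computation can be sketched in a few lines of plain R. The packet times are invented, and the 200-byte packet size is an assumption (it matches the fourth field of the sample records, whose meaning the slides do not state).

```r
# Toy packets: arrival times in seconds and sizes in bytes (made-up data).
time  <- c(0.5, 10.2, 29.9, 30.0, 45.3, 61.0)
bytes <- rep(200, length(time))

# Round each arrival time down to the start of its 30-second interval.
interval <- floor(time / 30) * 30

# Packet counts and traffic rate (bits per second) per interval.
counts <- tapply(bytes, interval, length)
rate <- tapply(bytes * 8, interval, sum) / 30
```

Here the intervals starting at 0, 30 and 60 seconds hold 3, 2 and 1 packets respectively.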



  Create near replicate subsets that share almost identical traffic rate distribution for each sending site.

  Approximately 30,000 observations per sending site.

  Regression across all subsets and recombine results.
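One simple way to build subsets with nearly identical rate distributions is to sort by rate and deal observations out in turn. The slides do not say how the subsets were built, so the sketch below is an assumption throughout: the data are synthetic, the response stands in for transformed jitter, and the recombination is a plain average of per-subset coefficients.

```r
set.seed(42)
n <- 3000
rate <- runif(n, 1e5, 1e6)                       # synthetic traffic rates
y <- 0.5 + 2e-7 * rate + rnorm(n, sd = 0.05)     # stands in for transformed jitter

# Deal observations into k subsets in order of rate, so every subset
# covers the whole rate range almost evenly.
k <- 10
subset_id <- integer(n)
subset_id[order(rate)] <- rep_len(seq_len(k), n)
d <- data.frame(rate = rate, y = y, subset_id = subset_id)

# Fit the same regression on every subset.
fits <- lapply(split(d, d$subset_id), function(part) coef(lm(y ~ rate, data = part)))

# Recombine: average the coefficients across subsets.
combined <- colMeans(do.call(rbind, fits))
```

Each subset fit is small enough to run as an independent task, which is what makes this divide and recombine pattern fit the map step.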

Fifth root of absolute jitter vs. traffic rate for a subset of PSTN (untransformed jitter is too skewed)


  Residual plots with a loess fit show the appropriateness of the regression.

Regression residuals against traffic rate for a subset of Newark



  Rate normalized jitter mean and standard deviation have smaller differences within than across sending sites.

  Rate normalized jitter mean and standard deviation follow Normal distributions across subsets within each sending site.

  We can use the above information to recombine the results by taking averages across subsets within sending site.


  Mean regression fit for each sending site

Jitter shows a monotone increasing relationship with traffic rate


  Mean and standard deviation of rate corrected jitter per subset are averaged and compared between international and national sending sites.

As expected, international sites show more jitter than local ones.

Conclusion

  Jitter appears to be affected by traffic rate.

  Jitter depends on sending site distance, as expected, but the differences among sending sites are relatively small compared to the scale of the raw data and to QoS requirements.

  We can safely say that the observed traffic is a good approximation of offered traffic.

Conclusion - R and Hadoop

  Flexible parallel programming models, e.g. MapReduce, provide the analyst with powerful ways to distribute code execution without worrying about the intricacies.

  Mixing R and Hadoop makes it easy for the analyst to think and work within the R framework.

  All the examples use R and RHIPE.

  Data for all visualizations were created using RHIPE.

  Data created using RHIPE can be read using other languages (e.g. Java, Python).

The leading commercial provider of software and support for the popular open source R statistics language.

www.revolutionanalytics.com (650) 330 0553

Twitter: @RevolutionR
