Voice over IP: Studying Traffic Characteristics for Quality of Service using R and Hadoop

Jin Xia, Saptarshi Guha, and W.S. Cleveland

Date: 12 October, 2010

Revolution Confidential

Voice over IP (VoIP)

VoIP refers to the telephony system that delivers voice communications through IP networks.

Telecommunication companies are gradually migrating to VoIP.

The Public Switched Telephone Network (PSTN) is now commonly connected to VoIP through gateways.

Skype and SIP-RTP.

Quality of Service (QoS) is critical to VoIP applications.

Internet queuing creates packet delay and jitter, which affect QoS.

Offered load: the original traffic before queuing interference. It is crucial to queuing studies.

VoIP Packetization and Transmission

Semi-call packetization

Transmission through IP network and PSTN via gateway

[Figure: diagram of a call between caller and callee. Labels: analog voice signal, packet trace, transmission interval (20 ms, 2 s), Internet, PSTN, PSTN to IP, IP to PSTN, IP, router, gateway.]

Objective

Is the measured traffic close to the offered load?

Whether absolute jitter between successive voice packets is small compared to 20 ms.

Whether absolute jitter is larger from farther sending sites, from which packets are expected to travel through more hops.

Whether traffic rate at gateways has a monotone effect on absolute jitter. In this case we only have access to the traffic rate at the Newark gateway.


Does jitter depend on traffic rate? If yes, is it different across sending sites?
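As a minimal sketch of the quantity these questions refer to, assume jitter is the gap between successive packet arrivals minus the nominal 20 ms packetization interval, which is how the map code later in the deck computes it. The arrival times below are made up for illustration.

```r
# Hypothetical arrival times (seconds) of five successive voice packets.
arrival <- c(0.0000, 0.0201, 0.0399, 0.0600, 0.0812)

# Jitter: inter-arrival gap minus the nominal 20 ms interval, in milliseconds.
jitter_ms <- round(diff(arrival) - 0.02, 6) * 1000

# Absolute jitter, the quantity compared against 20 ms.
abs_jitter_ms <- abs(jitter_ms)
print(abs_jitter_ms)

# Largest absolute jitter as a fraction of the 20 ms interval.
print(max(abs_jitter_ms) / 20)
```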

Data Sources

VoIP packet timestamps and headers for 48 hours on the Newark (New Jersey, USA) gateway of the Global Crossing (GBLX) network.

Two multiplexed packet traces: from IP to PSTN and from PSTN to IP.

332,018 calls, 1.315 billion packets, 84 GB.

27 sending sites including PSTN:

17 national sending sites, e.g. PSTN, Atlanta.

10 international sending sites, e.g. London, Milan.


R and Hadoop to Compute with Data

We face a heavy computational load in data processing, modeling and visualization.

R is a programming language and software environment for statistical computing and graphics.

An open source implementation of the S statistical programming language (Chambers, Becker & Wilks).

Highly extensible through packages. Over 2000 packages exist.

R is the most popular language in the scientific community for statistical research.

The standard for rapid prototyping for data analysis.

R and Hadoop to Compute with Data

Hadoop is an environment supporting computation on large data sets across clusters.

Use R and Hadoop.

The R and Hadoop Integrated Processing Environment (RHIPE) allows R users to compute across data using the MapReduce programming model through the Hadoop system.

Completely within the R environment.

RHIPE - Overview

Based on the Hadoop Streaming source.

The user writes R code; RHIPE communicates between R and Hadoop.

A variety of R data types can be used for keys and values.

Output can be read in Java, Python, etc. (uses Protocol Buffers).

Demultiplex

Convert the packet database to a semi-call database.

Sample input lines:

    1079089243.086862 IP UDP 200 67.17.50.213 14484 67.17.58.211 14906 0
    1079089243.088272 IP UDP 200 67.17.50.213 5420 67.17.50.6 18228 0

In MapReduce, the map will partition the packets based on source and destination IP, ports and direction (in/out). For example, the input line

    "1079089243.086862 IP UDP 200 67.17.50.213 14484 67.17.58.211 14906 0"

is turned into the pair

    key:   "67.17.50.213.14484.67.17.58.211.14906"
    value: (1079089243.086862, 0)

    map <- expression({
      # map.values holds a batch of input lines; emit one pair per packet
      v <- lapply(seq_along(map.values), function(r) {
        value0 <- strsplit(map.values[[r]], " +")[[1]]
        # source IP, source port, destination IP, destination port
        key <- paste(value0[5:8], collapse = ".")
        # timestamp and RTP payload type
        value <- c(as.numeric(value0[1]), as.integer(value0[9]))
        rhcollect(key, value)
      })
    })

map.keys and map.values are vectors of keys and values.

The reduce aggregates the intermediate output and saves each semi-call as an R data frame.

Demultiplex

The map emits 664,036 unique keys (semi-calls) and 1.4 billion intermediate values (packets).

The cluster has 78 processors. Each processor is assigned approximately 664,036/78 ≈ 8,513 keys and the associated packets to 'reduce' (combine into a data frame).

On each processor the following flow occurs:

    while there are more intermediate keys do
        reduce.key = get new intermediate key
        ... do something with reduce.key                          [PRE]
        while more intermediate values for reduce.key do
            reduce.value = get intermediate value for reduce.key
            ...                                                   [REDUCE]
        end while
        ... all values received, post process                     [POST]
    end while
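The pre/reduce/post flow can be mimicked in plain R without Hadoop. The following is a rough local sketch, not RHIPE itself: `run_reduce` is a hypothetical helper and the intermediate pairs are made up.

```r
# Hypothetical intermediate (key, value) pairs as a map step might emit them:
# key = semi-call identifier, value = c(timestamp, RTP payload type).
pairs <- list(
  list("a.1.b.2", c(10.04, 0)),
  list("c.3.d.4", c(11.00, 8)),
  list("a.1.b.2", c(10.00, 0)),
  list("a.1.b.2", c(10.02, 0))
)

# Hypothetical helper emulating, on one machine, the flow a reducer sees.
run_reduce <- function(pairs) {
  keys <- vapply(pairs, function(p) p[[1]], character(1))
  out <- list()
  for (reduce.key in unique(keys)) {            # a new intermediate key
    mydata <- list()                            # PRE
    for (p in pairs[keys == reduce.key]) {      # more intermediate values
      mydata <- append(mydata, list(p[[2]]))    # REDUCE
    }
    m <- do.call(rbind, mydata)                 # POST: all values received
    m <- m[order(m[, 1]), , drop = FALSE]       # sort packets by time
    out[[reduce.key]] <- data.frame(time = m[, 1], rtpPT = as.integer(m[, 2]))
  }
  out
}

semi_calls <- run_reduce(pairs)
print(semi_calls[["a.1.b.2"]])
```

In the real job the grouping by key is done by Hadoop's shuffle, and each reducer only sees its own share of the keys.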

    reduce <- expression(
      pre = {
        # got a new key
        mydata <- list()
      },
      reduce = {
        # while there are more intermediate values
        mydata <- append(mydata, reduce.values)
      },
      post = {
        # all values sent: bind, sort by time and save as a data frame
        mydata <- do.call("rbind", mydata)
        colnames(mydata) <- c("time", "rtpPT")
        mydata <- mydata[order(mydata[, 'time']), , drop = FALSE]
        mydata <- data.frame(time = as.numeric(mydata[, 'time']),
                             rtpPT = as.integer(mydata[, 'rtpPT']))
        rhcollect(reduce.key, mydata)
      }
    )

reduce.values is a vector of the intermediate values.

    # create a job
    > z <- rhmr(map=map, reduce=reduce, inout=c("text","map"),
    +           ifolder="/voip/iprtp.traces", ofolder="/voip/call.traces",
    +           jobname="create call trace database")
    # launch the job
    > job <- rhex(z, async = TRUE)
    # monitor
    > print(job)
    RHIPE Job Token Information
    --------------------------
    URL: http://spica:50030/jobdetails.jsp?jobid=job_201007281701_0053
    Name: 2010-07-28 23:33:44
    ID: job_201007281701_0053
    Submission Time: 2010-07-28 23:33:45
    State: RUNNING
    Duration(sec): 11.702
    Progress
           pct numtasks pending running complete failed
    map      0      156     146      10        0      0
    reduce   0       78      78       0        0      0

Jobs are created, launched and monitored from the R console.

    # read results
    > rhread("/voip/call.traces", type="map", max=1)
    RHIPE: Read 1 pair occupying 159 bytes, deserializing
    [[1]]
    [[1]][[1]]
    [1] "67.17.50.213.5054.67.17.50.6.6640.in"

    [[1]][[2]]
            time rtpPT
    1 1079007238     0
    2 1079007238    19
    3 1079007238    19

    # can query by key (MapFile)
    > rhgetkey(list(c("67.17.50.213.5054.67.17.50.6.6640.in")), "/voip/call.traces/p*")

Results are a list of key, value lists.

Why Vectors

Scatter plot of reading in 11.2 GB of CSV data and tokenizing the lines into columns.

MAP_MAX is the number of key, value pairs given to the map expression.

The red points are the group medians and the red curve is the loess fit.
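Because the map expression receives its keys and values as vectors, a whole chunk of lines can be tokenized with one vectorized call. The following is a small base-R sketch with made-up CSV lines; it illustrates the idea only and is not the benchmark behind the plot.

```r
# Hypothetical chunk of CSV lines, standing in for map.values.
lines <- c("1,2.5,a", "2,3.5,b", "3,4.5,c")

# One vectorized call tokenizes the whole chunk at once.
tokens <- strsplit(lines, ",", fixed = TRUE)

# Bind the pieces into a character matrix, one row per input line.
cols <- do.call(rbind, tokens)
print(cols)
```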

Call Summaries

Compute start time, end time, duration and number of packets.

Compute for each semi-call and pair up by call identifier.

    m <- expression({
      lapply(seq_along(map.values), function(i) {
        key0 <- map.keys[[i]]      # semi-call identifier
        value0 <- map.values[[i]]  # semi-call packets
        key.elements <- strsplit(key0, "\\.")[[1]]
        key <- paste(key.elements[1:10], collapse = ".")  # call identifier
        n.pkt <- dim(value0)[1]
        start <- value0[1, 1]
        end <- value0[n.pkt, 1]
        dur <- end - start + 0.02
        # call summary, labelled by direction (element 11 of the semi-call key)
        value <- if (key.elements[11] == "in")
          c(in.start = start, in.end = end, in.dur = dur, in.pkt = n.pkt)
        else
          c(out.start = start, out.end = end, out.dur = dur, out.pkt = n.pkt)
        rhcollect(key, value)
      })
    })

Call Summaries

In the reduce, we get two summaries for each call, one per semi-call.

    r <- expression(
      pre = {
        mydata <- list()
      },
      reduce = {
        mydata <- append(mydata, reduce.values)
      },
      post = {
        mydata <- unlist(mydata)
        in.start <- if (!is.null(mydata['in.start'])) mydata['in.start'] else NA
        in.end <- if (!is.null(mydata['in.end'])) mydata['in.end'] else NA
        ...
        out.end <- if (!is.null(mydata['out.end'])) mydata['out.end'] else NA
        out.start <- if (!is.null(mydata['out.start'])) mydata['out.start'] else NA
        value <- c(in.start, in.end, in.dur, in.pkt, out.start, out.end, out.dur, out.pkt)
        rhcollect(reduce.key, value)
      }
    )

reduce.values will be of length 2.
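The key handling and the recombination of summaries can be tried locally. This is a sketch in base R only: the key is the one shown in the rhread output earlier in the deck, while the summary values are hypothetical. One detail worth knowing is that indexing a named vector by an absent name yields NA rather than NULL, so a missing direction ends up as NA.

```r
# Split a semi-call key into call identifier and direction.
key0 <- "67.17.50.213.5054.67.17.50.6.6640.in"
key.elements <- strsplit(key0, "\\.")[[1]]
call.id   <- paste(key.elements[1:10], collapse = ".")
direction <- key.elements[11]

# Hypothetical reduce input for a call where only the "in" semi-call was seen.
reduce.values <- list(c(in.start = 100, in.end = 160, in.dur = 60.02, in.pkt = 3001))
mydata <- unlist(reduce.values)

fields <- c("in.start", "in.end", "in.dur", "in.pkt",
            "out.start", "out.end", "out.dur", "out.pkt")
value <- mydata[fields]   # absent names come back as NA, not NULL
names(value) <- fields
print(value)
```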

Create Jitter Database

Each semi-call gives rise to one or multiple jitter objects: data frames of jitter corresponding to transmission intervals.

    map <- expression({
      # template: loop over the (key, value) pairs handed to the map
      lapply(seq_along(map.values), function(i) {
        key0 <- map.keys[[i]]
        value0 <- map.values[[i]]
        n.pkt <- dim(value0)[1]
        if (n.pkt > 1) {
          # logic: jitter between successive packets within transmission intervals
          arrival <- value0$time
          flag <- value0$rtpPT == 0 | value0$rtpPT == 8
          tmp.jitter <- diff(arrival)
          tmp.arrival1 <- arrival[-n.pkt]
          tmp.flag1 <- flag[-n.pkt]
          tmp.flag2 <- flag[2:n.pkt]
          tmp.transmission <- tmp.flag1 & tmp.flag2
          jitter <- round(tmp.jitter[tmp.transmission] - 0.02, 6) * 1000
          arrival1 <- tmp.arrival1[tmp.transmission]
          tmp.id <- diff(c(FALSE, tmp.transmission, FALSE))
          transmission.start <- seq_along(tmp.id)[tmp.id == 1]
          transmission.end <- seq_along(tmp.id)[tmp.id == -1]
          if (length(transmission.start) > 0) {
            group <- rep(seq_along(transmission.start),
                         transmission.end - transmission.start)
            # write jitter object
            lapply(seq_along(group), function(i) {
              key <- c(key0, group[i])
              value <- c(arrival1[i], jitter[i])
              rhcollect(key, value)
            })
          }
        }
      })
    })
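The run-detection step in this map, `diff(c(FALSE, x, FALSE))`, is easier to follow on a tiny example. The sketch below uses a made-up semi-call of eight packets; payload type 19 stands in for a non-voice packet, as in the rhread output shown earlier.

```r
# Hypothetical semi-call: payload types 0 and 8 are voice packets.
rtpPT   <- c(0, 0, 0, 19, 0, 0, 19, 0)
arrival <- c(0, 0.02, 0.041, 1.0, 2.0, 2.021, 3.0, 4.0)
n.pkt   <- length(arrival)

flag <- rtpPT == 0 | rtpPT == 8
# TRUE where both packets of a successive pair are voice packets.
in.transmission <- flag[-n.pkt] & flag[-1]

# +1 marks the first pair of a run, -1 the position just past its last pair.
edges     <- diff(c(FALSE, in.transmission, FALSE))
run.start <- which(edges == 1)
run.end   <- which(edges == -1)

# One group label per retained jitter value; jitter is in milliseconds.
group  <- rep(seq_along(run.start), run.end - run.start)
jitter <- round(diff(arrival)[in.transmission] - 0.02, 6) * 1000
print(data.frame(group, jitter))
```

Here the first three packets form one transmission interval with two jitter values, and packets five and six form a second interval with one.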

    r <- expression(
      pre = {
        mydata <- list()
      },
      reduce = {
        # append data frames
        mydata <- append(mydata, reduce.values)
      },
      post = {
        # combine them and save
        mydata <- do.call("rbind", mydata)
        colnames(mydata) <- c("arrival1", "jitter")
        mydata <- mydata[order(mydata[, 'arrival1']), , drop = FALSE]
        mydata <- data.frame(arrival1 = as.numeric(mydata[, 'arrival1']),
                             jitter = as.numeric(mydata[, 'jitter']))
        rhcollect(reduce.key, mydata)
      }
    )

14 million jitter objects, 6.5 min, 21 GB.

Remove Gateway Effect

The jitter is cyclical because of a gateway effect.

This can be removed using regression with the bisquare robust estimator (Andrews et al., 1972) to return residuals.

Only on jitter objects with more than 90 packets.

This is a pure map: apply regression across 3.8 million data frames (13.9 GB).

    map <- expression({
      X9 <- diag(rep(1, 9))
      mywt.bisquare <- function(u, c = 6) {
        U <- abs(u / c)
        w <- ((1 + U) * (1 - U))^2
        w[U > 1] <- 0
        w
      }
      lapply(seq_along(map.values), function(i) {
        key0 <- map.keys[[i]]
        value0 <- map.values[[i]]
        n.jitter <- nrow(value0)
        if (n.jitter >= 90) {
          for (i in 1:3) {
            # iterative LS regression with bisquare weights
            fit.lm <- lm.wfit(xx, jitter, wt)  # lm(jitter ~ xx - 1, weights = wt)
            # … checks
          }
        }
      })
    })

    rhex(map=map, input, output, mapred=list(mapred.reduce.tasks=0))

An lapply for big datasets.
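The slide elides the body of the regression loop. The following is a self-contained sketch of iteratively reweighted least squares with bisquare weights, not the authors' code. The simulated jitter, the period-9 indicator design (suggested only by the 9 x 9 identity matrix `X9`), and the residual scale (median absolute residual) are all assumptions made for illustration.

```r
set.seed(1)
# Hypothetical jitter object: a cycle of period 9 plus noise and three outliers.
n      <- 180
phase  <- factor((seq_len(n) - 1) %% 9)
cycle  <- sin(2 * pi * (0:8) / 9)
jitter <- cycle[as.integer(phase)] + rnorm(n, sd = 0.1)
jitter[c(10, 50, 120)] <- jitter[c(10, 50, 120)] + 5

# Bisquare weight function, same form as mywt.bisquare on the slide.
wt.bisquare <- function(u, c = 6) {
  U <- abs(u / c)
  w <- ((1 + U) * (1 - U))^2
  w[U > 1] <- 0
  w
}

xx <- model.matrix(~ phase - 1)     # assumed design: one indicator per phase
wt <- rep(1, n)
for (iter in 1:3) {
  fit <- lm.wfit(xx, jitter, wt)                    # weighted least squares
  res <- jitter - drop(xx %*% fit$coefficients)     # residuals for all points
  s   <- median(abs(res))                           # robust residual scale
  wt  <- wt.bisquare(res / s)                       # downweight large residuals
}
residuals.clean <- res   # jitter with the fitted cycle removed
```

After three passes the gross outliers receive zero weight, so they no longer pull the fitted cycle.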

Compare Rate Normalized Jitter

Compare jitter across sending sites while accounting for traffic rate.

Compute traffic rate (bits per second) for 30 second intervals (20 ms is too noisy).

Round the time of each packet down to the nearest 30 seconds and get counts.
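The binning step can be sketched in a few lines of base R. The timestamps are made up, and a fixed packet size of 200 bytes is an assumption for illustration.

```r
# Hypothetical packet timestamps (seconds) and sizes (bytes).
time  <- c(3, 12, 29.9, 30, 45, 61, 75, 89)
bytes <- rep(200, length(time))

# Round each timestamp down to the start of its 30-second interval.
bin <- floor(time / 30) * 30

# Traffic rate in bits per second for each interval.
rate.bps <- tapply(bytes * 8, bin, sum) / 30
print(rate.bps)
```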

Compare Rate Normalized Jitter

Create near-replicate subsets that share an almost identical traffic rate distribution for each sending site.

Approximately 30,000 observations per sending site.

Regression across all subsets and recombine results.

[Figure: fifth root absolute jitter vs. traffic rate for a subset of PSTN (untransformed jitter is too skewed).]

Compare Rate Normalized Jitter

Residual plots with a loess fit show the appropriateness of the regression.

[Figure: regression residuals against traffic rate for a subset of Newark.]

Compare Rate Normalized Jitter

  Rate normalized jitter mean and standard deviation have smaller differences within than across sending sites.

  Rate normalized jitter mean and standard deviation follow Normal distributions across subsets within each sending site.

  We can use the above information to recombine the results by taking averages across subsets within sending site.
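The divide, fit and recombine steps can be sketched on simulated data. Everything below is an assumption for illustration: the data are synthetic, the subsets are formed by random assignment (the deck does not say how the near-replicate subsets were built), and a straight-line fit stands in for whatever regression the authors used.

```r
set.seed(2)
# Hypothetical data for one sending site: traffic rate and jitter (ms).
n    <- 3000
rate <- runif(n, 1, 10)
jit  <- (0.5 + 0.05 * rate + rnorm(n, sd = 0.05))^5

# Divide: assign observations to 10 subsets at random, so that each subset
# sees a similar traffic rate distribution.
subset.id <- sample(rep(1:10, length.out = n))

# Apply: regress fifth-root absolute jitter on rate within each subset.
fits <- lapply(split(data.frame(rate, jit), subset.id), function(d) {
  coef(lm(I(abs(jit)^(1/5)) ~ rate, data = d))
})

# Recombine: average the per-subset coefficients.
coef.mat   <- do.call(rbind, fits)
recombined <- colMeans(coef.mat)
print(recombined)
```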

Compare Rate Normalized Jitter

Mean regression fit for each sending site.

Jitter shows a monotone increasing relationship with traffic rate.

Compare Rate Normalized Jitter

The mean and standard deviation of rate-corrected jitter per subset are averaged and compared among international and national sending sites.

As expected, international sites show more jitter than local ones.

Conclusion

Jitter appears to be affected by traffic rate.

Jitter depends on sending site distance, as expected, but the differences among sending sites are relatively small compared to the scale of the raw data and to QoS requirements.

We can safely say that the observed traffic is a good approximation of offered traffic.

Conclusion - R and Hadoop

Flexible parallel programming models, e.g. MapReduce, give the analyst powerful ways to distribute code execution without worrying about the intricacies.

Mixing R and Hadoop makes it easy for the analyst to think and work within the R framework.

All the examples use R and RHIPE.

Data for all visualizations were created using RHIPE.

Data created using RHIPE can be read using other languages (e.g. Java, Python).


The leading commercial provider of software and support for the popular open source R statistics language.

www.revolutionanalytics.com (650) 330 0553

Twitter: @RevolutionR