Disco workshop: From zero to CDN log processing

Transcript of Disco workshop

Page 1: Disco workshop

Disco workshop: From zero to CDN log processing

Page 2: Disco workshop

1. Intro to parallel computing
  • Algorithms
  • Programming model
  • Applications

2. Intro to MapReduce
  • History
  • (In)applicability
  • Examples
  • Execution overview

3. Writing MapReduce jobs with Disco
  • Disco & DDFS
  • Python
  • Your first Disco job
  • Disco @ SpilGames

4. CDN log processing
  • Architecture
  • Availability & performance monitoring
  • Steps to get to our Disco landscape

Overview

Page 3: Disco workshop

Introduction to Parallel Computing

Page 4: Disco workshop

Traditionally (the von Neumann model), software has been written for serial computation:

• To be run on a single computer having a single CPU
• A problem is broken into a discrete series of instructions
• Instructions are executed one after another
• Only one instruction may execute at any moment in time

Serial computations

Page 5: Disco workshop

A parallel computer is of little use unless efficient parallel algorithms are available.

• The issues in designing parallel algorithms are very different from those in designing their sequential counterparts
• A significant amount of work is being done to develop efficient parallel algorithms for a variety of parallel architectures

Design of efficient algorithms

Page 6: Disco workshop

The Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, …), defined by F(n) = F(n-1) + F(n-2)

Sequential algorithm, not parallelizable
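
To make the dependency concrete, here is a minimal Python sketch: every term requires the two previous terms, so the loop cannot be split across CPUs.

def fib(n):
    # F(n) depends on F(n-1) and F(n-2), so each iteration
    # must wait for the previous one: an inherently serial chain.
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a

print(fib(8))  # -> 21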

Page 7: Disco workshop

Parallel computing is the simultaneous use of multiple computing resources to solve a computational problem:

• To be run using multiple CPUs
• A problem is broken down into discrete parts that can be solved concurrently
• Each part is further broken down into a series of instructions
• Instructions from each part execute simultaneously on different CPUs

Parallel computations

Page 8: Disco workshop

Summation of numbers
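
Summation is a problem that does parallelize. As a rough sketch (plain Python multiprocessing, nothing Disco-specific): compute partial sums concurrently, then combine them.

from multiprocessing import Pool

def parallel_sum(numbers, workers=4):
    # Split the input into one chunk per worker, sum each chunk
    # in a separate process, then combine the partial sums.
    numbers = list(numbers)
    chunk = max(1, (len(numbers) + workers - 1) // workers)
    parts = [numbers[i:i + chunk] for i in range(0, len(numbers), chunk)]
    pool = Pool(workers)
    try:
        return sum(pool.map(sum, parts))
    finally:
        pool.close()

if __name__ == "__main__":
    print(parallel_sum(range(1000000)))  # -> 499999500000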

Page 9: Disco workshop

• Description
  • The mental model the programmer has about the detailed execution of their applications
• Purpose
  • Improve programmer productivity
• Evaluation
  • Expression
  • Simplicity
  • Performance

Programming Model

Page 10: Disco workshop

• Message passing
  • Independent tasks encapsulating local data
  • Tasks interact by exchanging messages
• Shared memory
  • Tasks share a common address space
  • Tasks interact by reading and writing this space asynchronously
• Data parallelization
  • Tasks execute a sequence of independent operations
  • Data is usually evenly partitioned across tasks
  • Also referred to as "embarrassingly parallel"

Parallel Programming Models
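
As a toy illustration of the message-passing model (a sketch, not Disco code): each task owns its local data, and tasks interact only by exchanging messages.

from multiprocessing import Process, Queue

def worker(inbox, outbox):
    # The task's data arrives as a message; no memory is shared.
    local = inbox.get()
    outbox.put(sum(local))

if __name__ == "__main__":
    inbox, outbox = Queue(), Queue()
    p = Process(target=worker, args=(inbox, outbox))
    p.start()
    inbox.put([1, 2, 3, 4])
    print(outbox.get())  # -> 10
    p.join()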

Page 11: Disco workshop

• Historically used for large-scale problems in science and engineering
  • Physics – applied, nuclear, particle, fusion, photonics
  • Bioscience, biotechnology, genetics, sequencing
  • Chemistry, molecular sciences
  • Mechanical engineering – from prosthetics to spacecraft
  • Electrical engineering, circuit design, microelectronics
  • Computer science, mathematics

Applications (Scientific)

Page 12: Disco workshop

• Commercial applications also provide a driving force in parallel computing. These applications require the processing of large amounts of data:
  • Databases, data mining
  • Oil exploration
  • Web search engines, web-based business services
  • Medical imaging and diagnosis
  • Pharmaceutical design
  • Management of national and multi-national corporations
  • Financial and economic modeling
  • Advanced graphics & VR
  • Networked video and multimedia technologies

Applications (Commercial)

Page 13: Disco workshop

• Parallelize
• Distribute

• Problems?
  • Concurrency problems
  • Coordination
  • Scalability
  • Fault tolerance

What if my job is too "big"?

Page 14: Disco workshop

• The application is modeled as a Directed Acyclic Graph (DAG)
  • The DAG defines the dataflow
• Computational vertices
  • The vertices of the graph define the operations on data
• Channels
  • File
  • TCP pipe
  • Shared-memory FIFO
• Not as restrictive as MapReduce
  • Multiple inputs and outputs
• Allows developers to define the communication between vertices

Microsoft (MSN search group): Dryad

Page 15: Disco workshop

"A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs."

Google

Dean and Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Google Inc.

Page 16: Disco workshop

Introduction to MapReduce

Page 17: Disco workshop

I have a question which a data set can answer. I have lots of data and I have a cluster of nodes. MapReduce is a parallel framework which takes advantage of my cluster by distributing the work across the nodes. Specifically, MapReduce maps data into key-value pairs, which are then partitioned into buckets. The buckets can be spread easily over all the nodes in the cluster, and each node, or reducer, reduces the data to an "answer" or a list of "answers".

What is MapReduce?

Page 18: Disco workshop

• Published in 2004 by Google

MapReduce history

Page 19: Disco workshop

• Published in 2004 by Google
• Rooted in functional programming (e.g. Lisp, Erlang)
  • The map() function
    • Applies a function to each value of a sequence
  • The reduce() function (also known as fold())
    • Combines all elements of a sequence using a binary operator

MapReduce history
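
In Python terms, a minimal illustration of these functional roots (not Disco's API):

from functools import reduce  # a top-level builtin in Python 2

# map() applies a function to each value of a sequence.
squares = map(lambda x: x * x, [1, 2, 3, 4])

# reduce() (a fold) combines all elements with a binary operator.
total = reduce(lambda a, b: a + b, squares)
print(total)  # -> 30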

Page 20: Disco workshop

• Published in 2004 by Google

MapReduce history

Page 21: Disco workshop

• Restrictive semantics
• Pipelining Map/Reduce stages is possibly inefficient
• Solves problems well only within a narrow programming domain
• DB community: our parallel RDBMSs have been doing this forever…
• Data scale matters: use MapReduce if you truly have large data sets that are difficult to process using simpler solutions
• It's not always a high-performance solution. Straight Python, simple batch-scheduled Python, and a C core can all outperform MR by an order of magnitude or two on a single node for many problems, even for so-called big-data problems

Why NOT MapReduce?

Page 22: Disco workshop

• Distributed grep, sort, word frequency
• Inverted index construction
• PageRank
• Web link-graph traversal
• Large-scale PDF generation, image conversion
• Artificial intelligence, machine learning
• Geographical data, Google Maps
• Log querying
• Statistical machine translation
• Analyzing similarities of user behavior
• Processing clickstream and demographic data
• Research for ad systems
• Vertical search engine for trustworthy wine information

What is it good for?

Page 23: Disco workshop

• Google (proprietary implementation in C++)
• Hadoop (open-source implementation in Java)
• Disco (Erlang, Python)
• Skynet (Ruby)
• BashReduce (Last.fm)
• Spark (Scala, a functional OO language on the JVM)
• Plasma MapReduce (OCaml)
• Storm ("the Hadoop of realtime processing")

cat a_bunch_of_files | ./mapper.py | sort | ./reducer.py

Flavors of MapReduce
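
The shell pipeline above is the whole of BashReduce-style MapReduce. A minimal sketch of what the two hypothetical scripts could look like for word counting (names and record format are illustrative):

#!/usr/bin/env python
# mapper.py: emit a "word<TAB>1" pair for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)

#!/usr/bin/env python
# reducer.py: stdin arrives sorted by key, so identical words are
# adjacent; sum each word's run of counts.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))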

Page 24: Disco workshop

• Process data using special map() and reduce() functions
  • The map() function is called on every item in the input and emits a series of intermediate key/value pairs
  • All values associated with a given key are grouped together
  • The reduce() function is called on every unique key and its value list, and emits a value that is added to the output

The MR programming model

Page 25: Disco workshop

• More formally:
  • Map(k1, v1) -> list(k2, v2)
  • Reduce(k2, list(v2)) -> list(v2)

The MR programming model
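
A minimal in-memory sketch of this contract (plain Python, no distribution), useful for reasoning about map and reduce functions before running them on a cluster:

from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    # Map phase: mapper(k1, v1) yields (k2, v2) pairs.
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)  # the "shuffle": group values by key
    # Reduce phase: reducer(k2, list(v2)) yields result values.
    results = []
    for k2 in sorted(groups):
        results.extend(reducer(k2, groups[k2]))
    return results

# Word count against the simulator:
docs = [("doc1", "a b a"), ("doc2", "b")]
mapper = lambda name, text: [(w, 1) for w in text.split()]
reducer = lambda word, counts: [(word, sum(counts))]
print(map_reduce(docs, mapper, reducer))  # -> [('a', 2), ('b', 2)]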

Page 26: Disco workshop

• Greatly reduces parallel programming complexity
  • Reduces synchronization complexity
  • Automatically partitions data
  • Provides failure transparency
• Practical
  • Hundreds of jobs every day

MapReduce benefits

Page 27: Disco workshop

• Partitions input data
• Schedules execution across a set of machines
• Handles machine failure
• Manages IPC

The MR runtime system

Page 28: Disco workshop

• Distributed grep
  • The map function emits <word, line_number> if a word matches the search criteria
  • The reduce function is the identity function
• URL access frequency
  • The map function processes web logs and emits <url, 1>
  • The reduce function sums the values and emits <url, total>

MR Examples
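
Written against the toy map_reduce() simulator above (a sketch; the real jobs would read logs from a distributed filesystem, and the field layout is illustrative):

import re

# Distributed grep: emit matching lines; reduce is the identity.
def grep_map(line_number, line):
    if re.search(r"ERROR", line):
        yield line_number, line

def identity_reduce(key, values):
    for v in values:
        yield v

# URL access frequency: emit <url, 1>; reduce sums the values.
def url_map(_, log_line):
    url = log_line.split()[0]  # assume the URL is the first field
    yield url, 1

def url_reduce(url, counts):
    yield url, sum(counts)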

Page 29: Disco workshop

• Geospatial query processing
  • Given an intersection, find all roads connecting to it
  • Rendering the tiles in the map
  • Finding the nearest feature to a given address

MR Examples

Page 30: Disco workshop

• "Learning the right abstraction will simplify your life." – Travis Oliphant

MR Examples

Program                  Map()                      Reduce()
Distributed grep         matched lines              pass
Reverse web-link graph   <target, source>           <target, list(source)>
URL count                <url, 1>                   <url, total_count>
Term vector per host     <hostname, term-vector>    <hostname, all-term-vectors>
Inverted index           <word, doc_id>             <word, list(doc_id)>
Distributed sort         <key, value>               pass

Page 31: Disco workshop

• The user program, via the MR library, shards the input data

MR Execution 1/8

Page 32: Disco workshop

• The user program creates process copies (workers) distributed across a cluster of machines
• One copy becomes the "master" and the others become workers

MR Execution 2/8

Page 33: Disco workshop

• The master distributes M map tasks and R reduce tasks to idle workers
  • M == the number of input shards
  • R == the number of parts the key space is divided into

MR Execution 3/8

Page 34: Disco workshop

• Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs
  • Output is buffered in RAM

MR Execution 4/8

Page 35: Disco workshop

• Each worker flushes its intermediate values, partitioned into R regions, to disk and notifies the master process

MR Execution 5/8

Page 36: Disco workshop

• The master process gives the disk locations to an available reduce-task worker, which reads all the associated intermediate data

MR Execution 6/8

Page 37: Disco workshop

• Each reduce-task worker sorts its intermediate data, then calls the reduce() function, passing each unique key and its associated values. The reduce function's output is appended to the reduce task's partition output file

MR Execution 7/8

Page 38: Disco workshop

• The master process wakes up the user process when all tasks have completed
• The output is contained in R output files

MR Execution 8/8

Page 39: Disco workshop

• An input reader
• A map() function
• A partition function
• A compare function (sort)
• A reduce() function
• An output writer

Hot spots
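
Most of these are user-pluggable. As one example, a typical default partition function is just a hash of the key modulo R (a sketch, not Disco's actual implementation):

def default_partition(key, num_partitions):
    # Route every occurrence of a key to the same reduce partition.
    # (Real frameworks use a stable hash here; Python's built-in
    # hash() of a string can differ between processes.)
    return hash(key) % num_partitions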

Page 40: Disco workshop

MR Execution Overview

Page 41: Disco workshop

• Fault tolerance
  • The master process periodically pings workers
  • Map-task failure: re-execute
    • All output was stored locally
  • Reduce-task failure: only re-execute partially completed tasks
    • All output is stored in the global file system

MR Execution Overview

Page 42: Disco workshop

• Don't move data to workers… move workers to the data!
  • Store data on the local disks of nodes in the cluster
  • Start up the workers on the node that has the data locally
• Why?
  • Not enough RAM to hold all the data in memory
  • Disk access is slow, but disk throughput is good
• A distributed file system is the answer
  • GFS (Google File System)
  • HDFS (Hadoop DFS), a GFS clone
  • DDFS (Disco DFS)

Distributed File System

Page 43: Disco workshop

• Sequential -> parallel -> distributed
• Hype after Google published the paper in 2004
• Solves a very narrow set of problems
• "Big data" is a marketing buzzword

Summary for Part I

Page 44: Disco workshop

• MapReduce is a paradigm for distributed computing, developed (and patented…) by Google, for performing analysis on large amounts of data distributed across thousands of commodity computers
  • The Map phase processes the input one element at a time and returns a (key, value) pair for each element
  • An optional Partition step partitions the Map results into groups, based on a partition function applied to the key
  • The engine merges the partitions and sorts all the Map results
  • The merged results are passed to the Reduce phase. One or more reduce jobs reduce the (key, value) pairs to produce the final results

Summary for Part I (cont.)

Page 45: Disco workshop

Writing MapReduce jobs with Disco

Page 46: Disco workshop

• Writing MapReduce jobs can be VERY time consuming
  • MapReduce patterns
  • Debugging a failure is a nightmare
  • Large clusters require a dedicated team to keep them running
• Writing a Disco job becomes a software engineering task…
  • …rather than a data analysis task

Take a deep breath

Page 47: Disco workshop

Disco

Page 48: Disco workshop

• "Massive data – minimal code" – by Nokia Research Center
• http://discoproject.org
• Written in Erlang
  • Orchestrating control
  • Robust, fault-tolerant distributed applications
• Python for operating on data
  • Easy to learn
  • Complex algorithms with very little code
  • Utilize your favorite Python libraries
• The complexity is hidden, but…

About Disco

Page 49: Disco workshop

• Distributed
  • Increase storage capacity by adding nodes
  • Processing happens on the nodes, without transferring data
• Replicated
• Chunked: data stored in gzip-compressed chunks
• Tag based
• Attributes
• CLI
  • $ ddfs ls data:log
  • $ ddfs chunk data:bigtxt ./bigtxt
  • $ ddfs blobs data:bigtxt
  • $ ddfs xcat data:bigtxt

Disco Distributed "filesystem"

Page 50: Disco workshop

• Everything is preinstalled
• Disco localhost setup: https://github.com/spilgames/disco-development-workflow

Sandbox environment

Page 51: Disco workshop

• www.pythonforbeginners.com – by Magnus
• import
• Data structures: {} dict, [] list, () tuple
• Defining functions and classes
• Control flow primitives and structures: for, if, …
• Exception handling
• Regular expressions
• GeoIP, MySQLdb, …
• To understand what yield does, you must understand what generators are. And before generators come iterables.

Python – What you'll need

Page 52: Disco workshop

When you create a list, you can read its items one by one; this is called iteration:

>>> mylist = [1, 2, 3]
>>> for i in mylist:
...     print i
1
2
3

Python Lists

Page 53: Disco workshop

mylist is an iterable. When you use a list comprehension, you create a list, and so an iterable:

>>> mylist = [x*x for x in range(3)]
>>> for i in mylist:
...     print i
0
1
4

Python Iterables

Page 54: Disco workshop

Generators are iterables, but you can only read them once. That is because they do not store all the values in memory; they generate the values on the fly:

>>> mygenerator = (x*x for x in range(3))
>>> for i in mygenerator:
...     print i
0
1
4

It is just the same, except you used () instead of []. But you cannot iterate over mygenerator a second time, since generators can only be used once: they calculate 0, then forget about it and calculate 1, and end by calculating 4, one by one.

Python Generators

Page 55: Disco workshop

yield is a keyword that is used like return, except the function will return a generator.

>>> def createGenerator():
...     mylist = range(3)
...     for i in mylist:
...         yield i*i
...
>>> mygenerator = createGenerator()
>>> print mygenerator
<generator object createGenerator at 0xb7555c34>
>>> for i in mygenerator:
...     print i
0
1
4

Python Yield

Page 56: Disco workshop

• What is the total count for each unique word in the text?
• Word counting is the "Hello World!" of MapReduce
• We need to write map() and reduce() functions
  • Map(rec) -> list(k, v)
  • Reduce(k, v) -> list(res)
• Your application communicates with the Disco API
  • from disco.core import Job, result_iterator

Your first disco job

Page 57: Disco workshop

• Splitting the file (related chunks) into lines
• Map(line, params)
  • Split the line into words
  • Emit a k,v tuple: <word, 1>
• Reduce(iter, params)
  • Often, this is an algebraic expression
  • <word, [1, 1, 1]> -> <word, 3>

Word count

Page 58: Disco workshop

• Modules to import
• Setting the master host
• DDFS
• Job()
• result_iterator(Job.wait())
• Job.purge()

Word count: Your application
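
Put together, a minimal sketch of the whole application, assuming a localhost Disco master and input already chunked into DDFS under the data:bigtxt tag from the DDFS slide (the map and reduce functions are the ones shown on the next slides):

from disco.core import Job, result_iterator

def fun_map(line, params):
    for word in line.split():
        yield word, 1

def fun_reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == "__main__":
    # Run against data previously chunked into DDFS.
    job = Job().run(input=["tag://data:bigtxt"],
                    map=fun_map,
                    reduce=fun_reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print (word, count)
    job.purge()  # clean up the job's intermediate results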

Page 59: Disco workshop

def fun_map(line, params):
    for word in line.split():
        yield word, 1

Word count: Your map

Page 60: Disco workshop

def fun_reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

Built-in alternative: disco.worker.classic.func.sum_reduce()

Word count: Your reduce

Page 61: Disco workshop

job = Job().run(input=…, map=fun_map, reduce=fun_reduce)

for word, count in result_iterator(job.wait(show=True)):
    print (word, count)

job.purge()

Word count: Your results

Page 62: Disco workshop

class MyJob1(Job):
    @classmethod
    def map(self, data, params):
        …

    @classmethod
    def reduce(self, iter, params):
        …

…
MyJob2.run(input=MyJob1.wait())      # <- job chaining

Word count: More advanced

Page 63: Disco workshop

• Event tracking & advertising-related jobs
  • Heatmap: page clicks -> 2D density distributions
  • Reconstructing sessions
  • Ad research
  • Behavioral modeling
• Log crunching
  • Gameplays per country
  • Frontend performance (CDN)
  • 404s, response-code tracking
  • Intrusion detection #security

Disco @ SpilGames

Page 64: Disco workshop

• Calculate your resource-need estimates
• Deploy in a workflow
• We have:
  • Git
  • Package repository / deployment orchestration
  • Disco-tools: http://github.com/spilgames/disco-tools/
  • Job runner: http://jobrunner/
  • Data warehouse
  • Interactive, graphical report generation

Disco @ SpilGames

Page 65: Disco workshop

Page 66: Disco workshop

CDN log processing

Page 67: Disco workshop

• Question
  • The availability of each CDN provider
• Data source
  • JavaScript sampler on the client side
  • Load balancer -> HA logging endpoints -> access logs -> Disco Distributed FS

CDN Availability monitoring

Page 68: Disco workshop

CDN Availability monitoring

Page 69: Disco workshop

• Input
  • URI parsing
  • /res.ext?v=o,1|e,1|os,1|ce,1|hw,1|c,0|l,1
• Expected output
  • ProviderO    98.7537%
  • ProviderE    57.8851%
  • ProviderC    99.4584%
  • ProviderL    99.4847%

CDN Availability monitoring

Page 70: Disco workshop

# cdnData: "o,1|e,1|os,1|ce,1|hw,1|c,0|l,1"

• Parse a log entry
• Yield samples:
  • <o, 1>
  • <e, 1>
  • <os, 1>
  • <ce, 1>
  • <hw, 1>
  • <c, 0>
  • <l, 1>

CDN Availability monitoring (map)

Page 71: Disco workshop

def map_cdnAvailability(line, params):
    import urlparse
    try:
        (timestamp, data) = line.split(',', 1)
        data = dict(urlparse.parse_qsl(data, False))
        for cdnData in data['a'].split('|'):
            try:
                cdnName = cdnData.split(',')[0]
                cdnAvailable = int(cdnData.split(',')[1])
                yield cdnName, cdnAvailable
            except:
                pass
    except:
        pass

CDN Availability monitoring (map)

Page 72: Disco workshop

Availability of <hw, [1,1,1,0,1,1,1,0,1,1,0,1]>

• kvgroup(iter)
• The trick:
  • samples = […]
  • len(samples) -> number of all samples
  • sum(samples) -> number of available samples
  • A = sum() / len() * 100.0

CDN Availability monitoring (reduce)

Page 73: Disco workshop

def reduce_cdnAvailability(iter, params):
    from disco.util import kvgroup

    for cdnName, cdnAvailabilities in kvgroup(sorted(iter)):
        try:
            cdnAvailabilities = list(cdnAvailabilities)

            totalSamples = len(cdnAvailabilities)
            totalAvailable = sum(cdnAvailabilities)
            totalUnavailable = totalSamples - totalAvailable

            yield cdnName, round(float(totalAvailable) / totalSamples * 100.0, 4)
        except:
            pass

CDN Availability monitoring (reduce)

Page 74: Disco workshop

• DDFS
  • tag://logs:cdn:la010:12345678900
  • disco.ddfs.list(tag)
  • disco.ddfs.[get|set]attr(tag, attr, value)
• Job(name, master).run(input, map, reduce)
  • partitions = R
  • map_reader = disco.worker.classic.func.chain_reader
  • save = True

Advanced usage
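
A sketch assembling these options into one run call (the job name, tag, and partition count are illustrative; the map and reduce functions are the ones defined on the previous slides):

from disco.core import Job
from disco.worker.classic.func import chain_reader

job = Job(name="cdn_availability")
job.run(input=["tag://logs:cdn:la010:12345678900"],
        map=map_cdnAvailability,
        reduce=reduce_cdnAvailability,
        partitions=4,              # R: number of reduce partitions
        map_reader=chain_reader,   # read the output of a chained job
        save=True)                 # persist the results to DDFS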

Page 75: Disco workshop

CDN Performance

95th percentile, with a per-country breakdown

Page 76: Disco workshop

• Question
  • 95th percentile of response times, per CDN, per country
• Data source
  • JavaScript sampler on the client side
  • LB -> HA logging endpoints -> access logs -> DDFS
• Input
  • /res.ext?v=o,1234|l,2345|c,3456&ipaddr=127.0.0.1
• Expected output
  • ProviderN    CountryA: 3891 ms    CountryB: 1198 ms    …
  • ProviderC    CountryA: 3793 ms    CountryB: 1397 ms    …
  • ProviderE    CountryA: 3676 ms    CountryB: 1676 ms    …
  • ProviderL    CountryA: 4332 ms    CountryB: 1233 ms    …

CDN Performance

Page 77: Disco workshop

The 95th percentile

A 95th percentile says that 95% of the time the data points are below that value, and 5% of the time they are above it. 95 is a magic number used in networking because you have to plan for the most-of-the-time case.

Page 78: Disco workshop

v=o,1234|l,2345|c,3456&ipaddr=127.0.0.1

• Line parsing is about the same
• Advanced key: <cdn:country, performance>
• How to get the country from an IP?
  • Job().run(… required_modules=["GeoIP"] …)
• No global variables! Within map() – why?
  • Use Job().run(… params={} …) instead
• yield "%s:%s" % (cdnName, country), cdnPerf

CDN Performance (map)
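
A hedged sketch of what such a map function could look like, assuming the legacy GeoIP Python binding and the input format above (all names are illustrative, and in practice the GeoIP handle would come in via params rather than being rebuilt on every call):

def map_cdnPerformance(line, params):
    import urlparse
    import GeoIP
    gi = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)
    try:
        (timestamp, data) = line.split(',', 1)
        fields = dict(urlparse.parse_qsl(data, False))
        country = gi.country_code_by_addr(fields['ipaddr']) or 'unknown'
        for cdnData in fields['v'].split('|'):
            try:
                (cdnName, cdnPerf) = cdnData.split(',')
                # Compound key: provider and country together.
                yield "%s:%s" % (cdnName, country), int(cdnPerf)
            except:
                pass
    except:
        pass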

Page 79: Disco workshop

# <hw, [123, 234, 345, 456, 567, 678, 798]>

def percentile(N, percent, key=lambda x: x):
    # N must be a sorted list of values; percent is in [0.0, 1.0].
    import math
    if not N:
        return None
    k = (len(N) - 1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return key(N[int(k)])
    # Linear interpolation between the two nearest ranks.
    d0 = key(N[int(f)]) * (c - k)
    d1 = key(N[int(c)]) * (k - f)
    return d0 + d1

CDN Performance (reduce)
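
A quick sanity check of the function above (the input must already be sorted):

samples = sorted([123, 234, 345, 456, 567, 678, 798])
print(percentile(samples, 0.95))  # -> 762.0, interpolated between 678 and 798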

Page 80: Disco workshop

• Outputs
  • Print to screen
  • Write to a file
  • Write to DDFS – why not?
  • Another MR job, with chaining
  • Email it
  • Write to MySQL
  • Write to Vertica
  • Zip and upload to Spil OOSS

Other goodies

Page 81: Disco workshop

1. Question & data source
  • JavaScript code
  • Nginx endpoint
  • Logrotate
  • (De-personalize)
  • DDFS load scripts
2. MR jobs
3. Jobrunner jobs
4. Present your results

Steps to get to our Disco landscape

Page 82: Disco workshop

• Editing on live servers
• No version control
• No staging environment
• Not using a deployment mechanism
• Not using continuous integration
• Poor parsing
• No redundancy for MC applications
• Not purging your job
• Not documenting your job
• Using hard-coded configuration inside MR code

Bad habits

Page 83: Disco workshop

• No peer review
• Not getting back events from slaves
• Using job.wait()
• Job().run(partitions=1)

Bad habits cont.

Page 84: Disco workshop

• Writing Disco jobs can be easy
• Finding the right abstraction for a problem is not…
• A framework is on the way -> DRY
• You can find a lot of good patterns in SET and other jobs

You successfully took a step towards understanding how to:
• Process large amounts of data
• Solve some specific problems with MR

Summary

Page 85: Disco workshop

• Ecosystems
  • DiscoDB: lightning-fast key->value mapping
  • Discodex: Disco + DDFS + DiscoDB
• Disco vs. Hadoop
  • HDFS, the Hadoop ecosystem
  • NoSQL result stores

Bonus: Outlook

Page 86: Disco workshop

Questions?

Page 87: Disco workshop

• The presentation can be found at: http://spil.com/discoworkshop2013
• You can contact me at: [email protected]

Thank you!

Thank you!