Transcript of CmpE138-Spring2011-SpecialTopics-L2

Page 1:

CmpE 138 Spring 2011

Special Topics – L2

Shivanshu Singh [email protected]

Page 2:

Map Reduce

• Election process

Page 3:

Map Reduce

Typical single node architecture

[Figure: a single node, with the application running over CPU, memory, and storage]

Page 4:

Map Reduce

[Figure: application architecture]

Page 5:

Map Reduce

• Counting
• Sorting
  – Merge sort, Quick sort
• BIG Data
  – Data Mining
  – Trend Analysis, e.g. Twitter
  – Recommendation Systems
    • If bought = (A, B) => likely to buy C
  – Google Search

Page 6:

The Underlying Technologies

Page 7:

Distributed systems, storage, computing ….

• Web data sets can be very large
  – Tens to hundreds of terabytes …. soon petabyte(s)
• Cannot mine on a single server (why?)
• Standard architecture emerging:
  – Cluster of commodity Linux nodes
  – (very) High speed Ethernet interconnect
• How to organize computations on this architecture?
  – Storage is cheap but data management is not
  – Nodes are bound to fail
• Mask issues such as hardware failure

Page 8:

Mask issues such as hardware failure

Page 9:

Goal: Stable Storage

• For: (Stable) Computation
• In other words: if any of the nodes fails, how do we ensure data availability, persistence …. ?

Page 10:

Goal: Stable Storage

• Answer:
  – Distribute it
  – Have redundancy
  – ‘Manage’ this
• Data operations and services
  – Store, Retrieve on a single logical resource that is distributed over a number of ‘locations’.

Filesystem!

Page 11:

DFS

• Distributed File System
  – Provides global file namespace
  – Google GFS; Hadoop HDFS; etc.
  – Typical usage pattern
    • Huge files (100s of GB to TB)
    • Reads and appends are common

Page 12:

DFS

• Chunk Servers
  – File is split into contiguous chunks
  – Typically each chunk is 16-64 MB
  – Each chunk replicated (usually 2x or 3x)
  – Try to keep replicas in different racks
• Master node (GFS)
  – a.k.a. Name Node in HDFS
  – Stores metadata
  – Might be replicated
• Client library for file access (sketched below)
  – Talks to master to find chunk servers
  – Connects directly to chunk servers to access data
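This read path can be made concrete with a small sketch. Everything below is a hypothetical, simplified client: MasterClient, ChunkServerClient, and ChunkLocation are illustrative names, not the real GFS or HDFS API.

    import java.util.List;

    // Hypothetical DFS client: metadata comes from the master, file bytes
    // come directly from a chunk server holding a replica.
    class ChunkLocation {
        String handle;              // id of the chunk covering the offset
        List<String> replicas;      // chunk servers holding a copy
    }

    interface MasterClient {
        // Metadata-only lookup: which chunk is at this index, and where?
        ChunkLocation locate(String path, long chunkIndex);
    }

    interface ChunkServerClient {
        byte[] read(String server, String handle, long offsetInChunk, int length);
    }

    class DfsClient {
        static final long CHUNK_SIZE = 64L * 1024 * 1024;  // 64 MB chunks
        MasterClient master;
        ChunkServerClient chunkServers;

        // Read 'length' bytes at 'offset' (assumed to lie within one chunk).
        byte[] read(String path, long offset, int length) {
            // 1. Ask the master for the chunk handle and replica locations.
            ChunkLocation loc = master.locate(path, offset / CHUNK_SIZE);
            // 2. Fetch the bytes from one replica; file data never flows
            //    through the master.
            return chunkServers.read(loc.replicas.get(0), loc.handle,
                                     offset % CHUNK_SIZE, length);
        }
    }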

Page 13:

Chubby

• A coarse-grained lock service
  – Distributed systems can use this to synchronize access to shared resources
  – Intended for use by loosely-coupled distributed systems
  – In GFS: elect a master
  – In BigTable: master election, client discovery, table service locking

Page 14:

Interface

• Presents a simple distributed file system (a sketch follows)
  – Clients can open/close/read/write files
  – Reads and writes are whole-file
  – Also supports advisory reader/writer locks
  – Clients can register for notification of file update
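As a rough Java rendering of this interface (hypothetical names only; Chubby's actual client library is internal to Google and not part of this lecture):

    // Hypothetical Chubby-like client interface (illustrative only).
    interface LockService {
        Handle open(String path);          // open a small file / lock node
    }

    enum LockMode { READER, WRITER }

    interface Handle {
        byte[] readAll();                  // reads are whole-file
        void writeAll(byte[] contents);    // writes are whole-file
        boolean tryAcquire(LockMode mode); // advisory reader/writer lock
        void release();
        void onUpdate(Runnable callback);  // notification of file update
        void close();
    }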

Page 15:

Topology

[Figure: one Chubby cell, consisting of a master and four replicas; ALL client traffic goes to the master]

Page 16:

Master Election

• All replicas try to acquire a write lock on a designated file.
• The one that gets the lock is the master.
  – The master can then write its address to the file
  – Other replicas can read this file to discover the chosen master's name (see the sketch below)
• Chubby doubles as a name service
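Using the hypothetical LockService interface from the Interface slide, the election could be sketched as:

    // Master election over a Chubby-like lock service (hypothetical API).
    // The winner records its address in the file; everyone else reads it.
    class MasterElection {
        static String electOrDiscover(LockService chubby, String myAddress) {
            Handle h = chubby.open("/ls/cell/master");   // designated file
            if (h.tryAcquire(LockMode.WRITER)) {
                h.writeAll(myAddress.getBytes());        // name-service record
                return myAddress;
            }
            // Lost the election: the file's contents name the master.
            // (A real client would wait for the update notification in case
            // the winner has not written its address yet.)
            return new String(h.readAll());
        }
    }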

Page 17:

Consensus

• A Chubby cell is usually 5 replicas
  – 3 must be alive for the cell to be viable
• How do replicas in Chubby agree on their own master and on official lock values?
  – PAXOS algorithm

Page 18:

PAXOS

Paxos is a family of algorithms (by Leslie Lamport) designed to provide distributed consensus in a network of several processors.

Page 19:

Processor Assumptions

• Operate at arbitrary speed
• Independent, random failures
• Processors with stable storage may rejoin the protocol after failure
• Do not lie, collude, or attempt to maliciously subvert the protocol

Page 20:

Network Assumptions

• All processors can communicate with (see) one another
• Messages are sent asynchronously and may take arbitrarily long to deliver
• Order of messages is not guaranteed: they may be lost, reordered, or duplicated
• Messages, if delivered, are not corrupted in the process

Page 21:

A Fault-Tolerant Memory of Facts

• Paxos provides a memory for individual "facts" in the network.
  – A fact is a binding from a variable to a value.
  – Paxos between 2F+1 processors is reliable and can make progress if up to F of them fail (e.g., five processors tolerate two failures).

Page 22:

Roles

• Proposer – an agent that proposes a fact
• Leader – the authoritative proposer
• Acceptor – holds agreed-upon facts in its memory
• Learner – may retrieve a fact from the system

Page 23:

Safety Guarantees

• Nontriviality: Only proposed values can be learned
• Consistency: At most one value can be learned
• Liveness: If at least one value V has been proposed, eventually any learner L will get some value

Page 24:

Key Idea

• Acceptors do not act unilaterally. For a fact to be learned, a quorum of acceptors must agree upon the fact
• A quorum is any majority of acceptors (see the check below)
• Given acceptors {A, B, C, D}, Q = {{A, B, C}, {A, B, D}, {B, C, D}, {A, C, D}}
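A quorum test is just a strict-majority check; a minimal sketch:

    import java.util.Set;

    class Quorum {
        // True when 'votes' is a strict majority of the known acceptors.
        static boolean isQuorum(Set<String> votes, Set<String> acceptors) {
            return acceptors.containsAll(votes)
                    && votes.size() > acceptors.size() / 2;
        }
    }
    // With acceptors {A, B, C, D}: any 3-element subset is a quorum,
    // matching the set Q above.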

Page 25:

Basic Paxos

• Determines the authoritative value for a single variable
• Several proposers offer a value Vn to set the variable to.
• The system converges on a single agreed-upon V to be the fact.

Page 26:

Step 1: Prepare

[Figure: a proposer sends a numbered PREPARE request to the acceptors]

Credit: Spinnaker Labs Inc.

Page 27:

Step 2: Promise

PROMISE x – the acceptor will accept only proposals numbered x or higher.
Proposer 1 is ineligible because a quorum of acceptors has voted for a number higher than j.

Credit: Spinnaker Labs Inc.

Page 28:

Step 3: Accept

[Figure: the winning proposer sends an ACCEPT request, carrying its proposal number and value, to the acceptors]

Credit: Spinnaker Labs Inc.

Page 29:

Step 4: Accepted ack

[Figure: a quorum of acceptors acknowledges the accepted proposal]

Credit: Spinnaker Labs Inc.
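The four steps above can be tied together in a toy acceptor for a single variable. This is a sketch of the acceptor-side rules only (message transport, proposers, and learners are omitted), not a production Paxos:

    // Toy Paxos acceptor: PREPARE/PROMISE (steps 1-2), ACCEPT/ACCEPTED (3-4).
    class Acceptor {
        private long promised = -1;      // highest proposal number promised
        private long acceptedNum = -1;   // number of accepted proposal, if any
        private String acceptedVal;      // value of accepted proposal, if any

        // Step 2: promise to ignore proposals below n, or refuse (null).
        synchronized Promise prepare(long n) {
            if (n <= promised) return null;   // already promised a higher round
            promised = n;
            // Any previously accepted value is reported so the proposer
            // must re-propose it instead of a new one.
            return new Promise(n, acceptedNum, acceptedVal);
        }

        // Step 4: accept unless a higher-numbered promise supersedes it.
        synchronized boolean accept(long n, String v) {
            if (n < promised) return false;
            promised = n;
            acceptedNum = n;
            acceptedVal = v;
            return true;                      // send the ACCEPTED ack
        }
    }

    class Promise {
        final long n, prevNum;
        final String prevVal;
        Promise(long n, long prevNum, String prevVal) {
            this.n = n; this.prevNum = prevNum; this.prevVal = prevVal;
        }
    }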

Page 30:

Learning

If a learner interrogates the system, a quorum will respond with the fact V_k

Page 31:

Basic Paxos continued…

• Proposer 1 is free to try again with a proposal number > k; it can take over leadership and write in a new authoritative value
  – The official fact will change atomically on all acceptors, from the perspective of learners
  – If a leader dies mid-negotiation, the value just drops, and another leader tries with a higher proposal

Page 32:

Paxos in Chubby

• Replicas in a cell initially use Paxos to establish the leader
• A majority of replicas must agree
• Replicas promise not to try to elect a new master for at least a few seconds (master lease)
• Master lease is periodically renewed

Read more:
http://labs.google.com/papers/chubby.html
http://labs.google.com/papers/bigtable-osdi06.pdf

Page 33:

Big Table

• Google's needs:
  – Data reliability
  – High speed retrieval
  – Storage of huge numbers of records (several TB of data)
  – (Multiple) past versions of records should be available

Page 34:

HBase - Big Table

• Features:
  – Simplified data retrieval mechanism
    • (row, col, timestamp) → value lookup, only
  – No relational operators
  – Arbitrary number of columns per row
  – Arbitrary data type for each column
• New constraint: data validation must be performed by the application layer!

Page 35:

Logical Data Representation

• Rows & columns identified by arbitrary strings
• Multiple versions of a (row, col) cell can be accessed through timestamps (see the example below)
  – Application controls the version tracking policy
  – Columns grouped into column families
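For example, with the HBase client API of this era, several timestamped versions of one cell can be read back as below (the "pages" table and "contents:html" column are made-up illustrations, not from the lecture):

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    // Fetch up to 3 versions of the (row, column) cell, newest first.
    HTable table = new HTable("pages");
    Get get = new Get(Bytes.toBytes("com.example.www"));
    get.setMaxVersions(3);                        // default is latest only
    Result result = table.get(get);
    for (KeyValue kv : result.getColumn(Bytes.toBytes("contents"),
                                        Bytes.toBytes("html"))) {
        System.out.println(kv.getTimestamp() + " -> "
                + Bytes.toString(kv.getValue()));
    }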

Page 36:

Data Model

• Related columns stored in a fixed number of families
  – Family name is a prefix of the column name
  – e.g., fileattr:owning_group, fileattr:owning_user
• A column name has the form "<family>:<label>" where <family> and <label> can be arbitrary byte arrays.
• Lookup is hash based
• Column families stored physically close on disk
  – Items in a given column family should have roughly the same read/write characteristics and contain similar data.

Page 37:

Conceptual View

[Figure: conceptual view of a table, with related columns grouped into a column family]

Page 38:

Physical Storage View

Each column family is stored in contiguous chunks over multiple nodes as the data grows

Page 39:

Example GET

    import java.text.DecimalFormat;
    import java.util.NavigableMap;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    // Fetch row "0000004" from table "rest_data".
    DecimalFormat decimalFormat = new DecimalFormat("0000000");
    HTable hTable = new HTable("rest_data");
    String str = decimalFormat.format(4);     // zero-padded row key "0000004"
    Get g = new Get(Bytes.toBytes(str));
    Result r = hTable.get(g);
    // Map of qualifier -> value for every column in the "feature" family.
    NavigableMap<byte[], byte[]> map =
        r.getFamilyMap(Bytes.toBytes("feature"));

Page 40:

Example PUT

    import java.text.DecimalFormat;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Add cell restaurant_id:<restId> = <restId> to row "rest_ids".
    DecimalFormat restIdFormat = new DecimalFormat("0000000");
    HTable hTable = new HTable("restaurants");
    String restId = restIdFormat.format(4);    // zero-padded id "0000004"
    Put put = new Put(Bytes.toBytes("rest_ids"));   // row key
    put.add(Bytes.toBytes("restaurant_id"),         // column family
            Bytes.toBytes(restId),                  // qualifier
            Bytes.toBytes(restId));                 // value
    hTable.put(put);

Page 41:

HBase - BigTable

• Further reading with many more details:
  – http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture
  – http://labs.google.com/papers/bigtable-osdi06.pdf

Page 42:

MapReduce

• Implementations run on the backbone of a DFS such as HDFS or GFS
• Using, if needed, storage solutions like HBase and BigTable

Page 43:

Word Count

• We have a large file of words, one word to a line
• Count the number of times each distinct word appears in the file
• Sample application: analyze web server logs to find popular URLs

Page 44:

Word Count

• Input: a set of key/value pairs
• User supplies two functions:
  – map(k, v) → intermediate list(k1, v1)
  – reduce(k1, list(v1)) → v2
• (k1, v1) is an intermediate key/value pair
• Output is the set of (k1, v2) pairs

Page 45:

Word Count using MapReduce

    map(key, value):
        // key: document name; value: text of document
        for each word w in value:
            emit(w, 1)

    reduce(key, values):
        // key: a word; values: an iterator over counts
        result = 0
        for each count v in values:
            result += v
        emit(key, result)
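The same program against the Hadoop Java MapReduce API, as a runnable sketch (a standard Hadoop 0.20-era setup is assumed; input and output paths come from the command line):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // map(key, value): emit (word, 1) for each word on the line.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                for (String w : line.toString().split("\\s+")) {
                    if (w.isEmpty()) continue;
                    word.set(w);
                    ctx.write(word, ONE);
                }
            }
        }

        // reduce(key, values): sum the counts and emit (word, total).
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts,
                                  Context ctx)
                    throws IOException, InterruptedException {
                int result = 0;
                for (IntWritable v : counts) result += v.get();
                ctx.write(word, new IntWritable(result));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "wordcount");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }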

Page 46:

Overview  

Page 47:

Data Flow

• Input and final output are stored on a distributed file system (GFS, HDFS)
  – Scheduler tries to schedule map tasks “close” to the physical storage location of the input data
  – Intermediate results are stored on the local FS of map and reduce workers
• Output is often input to another MapReduce task (see the chaining sketch below)
  – E.g. data mining: apriori algorithm
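Chaining is just pointing the second job's input at the first job's output directory; a minimal driver sketch (same imports as the WordCount sketch above; the paths and job names are placeholders):

    // Two-pass pipeline: the output of pass 1 becomes the input of pass 2
    // (e.g., successive candidate-generation passes in apriori).
    Path pass1Out = new Path("/tmp/pass1");

    Job job1 = new Job(new Configuration(), "pass-1");
    // ... set jar, mapper, reducer, and key/value classes for pass 1 ...
    FileInputFormat.addInputPath(job1, new Path("/data/input"));
    FileOutputFormat.setOutputPath(job1, pass1Out);
    if (!job1.waitForCompletion(true)) System.exit(1);

    Job job2 = new Job(new Configuration(), "pass-2");
    // ... set jar, mapper, reducer, and key/value classes for pass 2 ...
    FileInputFormat.addInputPath(job2, pass1Out);
    FileOutputFormat.setOutputPath(job2, new Path("/tmp/pass2"));
    System.exit(job2.waitForCompletion(true) ? 0 : 1);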

Page 48:

Coordination

• Master data structures
  – Task status: (idle, in-progress, completed)
  – Idle tasks get scheduled as workers become available
  – When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
  – Master pushes this info to reducers
• Master pings workers periodically to detect failures

Page 49:

Failures

• Map worker failure
  – Map tasks completed or in-progress at the worker are reset to idle
  – Reduce workers are notified when a task is rescheduled on another worker
• Reduce worker failure
  – Only in-progress tasks are reset to idle
• Master failure
  – MapReduce task is aborted and the client is notified

Page 50:

Combiners

• Often a map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
  – E.g., popular words in Word Count
• Can save network time by pre-aggregating at the mapper (see below)
  – combine(k1, list(v1)) → v2
  – Usually the same as the reduce function
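In Hadoop, when the reduce function is associative and commutative (as in word count), it can be reused as the combiner with one extra line in the WordCount driver sketched earlier:

    // Pre-aggregate each mapper's output locally before it crosses the network.
    job.setCombinerClass(IntSumReducer.class);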

Page 51:

Partition Function

• Inputs to map tasks are created by contiguous splits of the input file
• For reduce, we need to ensure that records with the same intermediate key end up at the same worker
• System uses a default partition function, e.g., hash(key) mod R
• Sometimes useful to override (see the sketch below)
  – E.g., hash(hostname(URL)) mod R ensures URLs from a host end up in the same output file
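In Hadoop the override is a Partitioner subclass; a sketch of the hostname example (assuming Text URL keys and IntWritable values, which are illustrative choices):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // partition = hash(hostname(URL)) mod R, so all URLs from one host
    // land in the same reduce partition (and thus the same output file).
    public class HostPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text url, IntWritable value, int numPartitions) {
            String host;
            try {
                host = new java.net.URI(url.toString()).getHost();
            } catch (java.net.URISyntaxException e) {
                host = null;
            }
            if (host == null) host = url.toString();  // fall back to whole key
            return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }
    // Enabled in the driver with: job.setPartitionerClass(HostPartitioner.class);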

Page 52:

More Reading

http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
http://labs.google.com/papers/mapreduce-osdi04.pdf
http://wiki.apache.org/hadoop/
http://code.google.com/edu/parallel/mapreduce-tutorial.html