Hadoop pycon2011uk


Page 1: Hadoop pycon2011uk

Respect for the elephant – Hadoop

Aditya Sakhuja [email protected]

 

Page 2: Hadoop pycon2011uk

Whoami

• Software Engineer @ Yahoo Inc.
• Web Search -> Cloud Platforms -> Display Ads Serving
• http://linkedin.com/in/adityasakhuja

PyCon UK 2011 9/24/11

Page 3: Hadoop pycon2011uk

Agenda

• Motivation
• History
• Ecosystem
• Daemon processes / High Level View
• Map Reduce Data Flow
• HDFS Architecture / Replication
• Can / Cannot
• Getting started yourself
• Demo
• Companies Involved
• Q&A

Page 4: Hadoop pycon2011uk

Motivation

• ‘Traditional’ large-scale computing systems - problems
• Desired features in an improved system
• How Hadoop addresses them

Page 5: Hadoop pycon2011uk

‘Traditional’ large-scale computing systems - problems

• CPU-intensive rather than data-intensive
• MPI, PVM, RPCs – Parallel Computation Frameworks
• Programming for traditional distributed systems is complex
  – Data exchange requires synchronization
  – Temporal dependencies are complicated
  – It is difficult to deal with partial failures of the system
• Data typically stored on a SAN
• Data brought to compute nodes at runtime

Page 6: Hadoop pycon2011uk

Desired Features in a Large-Scale Data System

• Data Driven
  – A new, improved system should avoid data bottlenecks
• Scalable
• Consistent
• Recoverable (Data / Processor)
• Partial Failure Support

Page 7: Hadoop pycon2011uk

What Hadoop offers

• Provides a high-level programming model
  – No worries about locking, temporal dependencies, sockets ...
• And the features in the desired list :) (previous slide)

Page 8: Hadoop pycon2011uk

History

• Hadoop is based on work done by Google in the late 1990s/early 2000s
• Specifically, on papers describing the Google File System (GFS), published in 2003, and Map/Reduce, published in 2004
• Hadoop MapReduce NextGeneration – 2011
  – http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/

Page 9: Hadoop pycon2011uk

Apache Hadoop Ecosystem

• Hadoop Common: The common utilities that support the other Hadoop subprojects.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

Other Hadoop-related projects at Apache include:

• Cassandra™: A scalable multi-master database with no single points of failure.
• HBase™: A scalable, distributed database that supports structured data storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel computation.

Source: http://hadoop.apache.org/

Page 10: Hadoop pycon2011uk

Hadoop Key Daemon Processes

• NameNode
• Secondary NameNode
• DataNode
• JobTracker
• TaskTracker

Page 11: Hadoop pycon2011uk

High-level Hadoop cluster view

Page 12: Hadoop pycon2011uk

MapReduce Data Flow

Page 13: Hadoop pycon2011uk

HDFS Architecture

Page 14: Hadoop pycon2011uk

HDFS Replication

Page 15: Hadoop pycon2011uk

Map Reduce Program Components

• MapReduce programs generally consist of three portions
  – The Mapper
  – The Reducer
  – The driver code
• Additional components:
  – Combiner (often the same code as the Reducer)
  – Custom Partitioner
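
The components listed above can be sketched as a word count that runs entirely in one Python process. This is an illustration under stated assumptions, not Hadoop's API: the names `mapper`, `reducer`, `combiner`, `partitioner`, `group_by_key` and `run_job` are hypothetical, `crc32` merely stands in for a stable key hash, and a real job would be driven by the framework across many machines.

```python
from collections import defaultdict
from zlib import crc32


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Sum all counts seen for one word."""
    return word, sum(counts)


# Summing is associative and commutative, so the reducer code can
# double as the combiner.
combiner = reducer


def partitioner(key, num_reducers):
    """Choose the reduce task for a key (crc32 is an assumed stand-in hash)."""
    return crc32(key.encode("utf-8")) % num_reducers


def group_by_key(pairs):
    """Collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_job(splits, num_reducers=2):
    """Driver: wire the pieces together and run them in-process."""
    partitions = [[] for _ in range(num_reducers)]
    for split in splits:  # one simulated map task per input split
        mapped = [pair for line in split for pair in mapper(line)]
        # The combiner pre-aggregates each map task's own output.
        for word, counts in group_by_key(mapped).items():
            key, partial = combiner(word, counts)
            partitions[partitioner(key, num_reducers)].append((key, partial))
    result = {}
    for partition in partitions:  # one simulated reduce task per partition
        for word, counts in sorted(group_by_key(partition).items()):
            key, total = reducer(word, counts)
            result[key] = total
    return result


if __name__ == "__main__":
    print(run_job([["the elephant", "the cluster"], ["the elephant again"]]))
```

Because the partitioner sends every occurrence of a key to the same partition, the totals do not depend on the number of simulated reduce tasks.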


Page 16: Hadoop pycon2011uk

Hadoop Is / Is Not

• A high-bandwidth, high-latency system
• Not a substitute for a DBMS, at least not on its own
• HDFS is not yet a highly available file system: the NameNode is a single point of failure (SPOF)
• Is a “shared nothing” architecture
  – Mappers do not talk to each other, and neither do Reducers

Page 17: Hadoop pycon2011uk

Getting started yourself

Requirements:

• Java SE SDK (download JDK 6 or higher)
• Download and install
  – Hadoop Common: 0.20.203.X - current stable version
  – Hadoop HDFS: 0.21 – stable version
  – Hadoop MapReduce: 0.21 – stable version
• Subscribe to the mailing lists for the Hadoop subprojects, depending on your role
• Additionally/alternatively, one can set up VMs from Cloudera / Yahoo
• Details:
  – http://wiki.apache.org/hadoop/GettingStartedWithHadoop
  – http://developer.yahoo.com/hadoop/tutorial/module7.html#basic

Page 18: Hadoop pycon2011uk

Simple Demo

• Using
  – Pig
  – Map/Reduce

Page 19: Hadoop pycon2011uk

Streaming Jobs

• Any language that can read from stdin and write to stdout
• hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myMapScript.py \
    -reducer myReduceScript.py \
    -file myMapScript.py \
    -file myReduceScript.py
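
As a rough sketch of what myMapScript.py and myReduceScript.py might contain for a word count: Hadoop Streaming passes records as lines on stdin, treats the text before the first tab of each output line as the key, and hands the reducer its input sorted by key. The function names `map_lines` and `reduce_lines`, and the single-file layout with a "map" or "reduce" argument, are assumptions made for illustration; in practice each function body would live in its own script.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper body: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer body: input is sorted by key, so equal words are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Hypothetical convention: run as "script.py map" or "script.py reduce".
    step = map_lines if sys.argv[1:] == ["map"] else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

The same logic can be checked without a cluster by piping a file through the map step, `sort`, and the reduce step on a local shell.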

 


Page 20: Hadoop pycon2011uk

Companies involved

• Yahoo - 4500-node cluster (2*4 cores, 4*1 TB disks, 16 GB RAM) – (AdServer, Search)
• Hortonworks, Cloudera
• Facebook
• A9 (Amazon Product Search)
• eBay - 532-node cluster – (8 * 532 cores, 5.3 PB)
• Last.fm, Twitter …
• … many more are listed at http://wiki.apache.org/hadoop/PoweredBy

Page 21: Hadoop pycon2011uk

Useful Links

• http://wiki.apache.org/hadoop/GettingStartedWithHadoop - Getting Started
• http://hadoop.apache.org/common/docs/current/cluster_setup.html - Cluster Setup
• http://developer.yahoo.com/hadoop/tutorial/module4.html - MapReduce
• http://developer.yahoo.com/hadoop/tutorial/pigtutorial.html - Pig
• http://hadoop.apache.org/common/docs/current/api/index.html - APIs
• http://developer.yahoo.com/hadoop/tutorial/ - YDN resource on Hadoop

Page 22: Hadoop pycon2011uk

Q&C  

Contact Information:
Aditya Sakhuja [email protected]
http://twitter.com/sakhuja
http://linkedin.com/in/adityasakhuja