ROSEdu Tech Talks Prezentarea 09: Hadoop

  • 8/14/2019 ROSEdu Tech Talks Prezentarea 09: Hadoop

    1/23

    Vlad Ureche

    ROSEdu Tech Talks

    Contents

    Map Reduce

    Hadoop

    HDFS

    HBase

    Example

    MapReduce (1)

    Google paper released in 2004

    labs.google.com/papers/mapreduce-osdi04.pdf

    Context

    Google cluster: many nodes, many hardware failures

    Lots of data

    Idea:

    Separate the administrative part from the algorithms

    Create a framework for all algorithms

    Move computation instead of moving data

    MapReduce (2)

    MapReduce is a programming model and an associated implementation for processing and generating large data sets.

    Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines.

    MapReduce (3)

    [Diagram: DATA flows through three MAP tasks, then SHUFFLE and SORT, then three REDUCE tasks.]

    MapReduce (4)

    Map: <key1, value1> → List(<key2, value2>)

    Reduce: <key2, List(value2)> → List(value2)

    Key1, key2: anything that can be compared and checked for equality

    Value1, value2: anything

    Map and Reduce functions are up to you!

    Fault tolerance, scheduling, concurrency: left to the framework
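The model on this slide can be sketched in a single process. The code below is only an illustration, not how Hadoop or Google's implementation works internally: `run_mapreduce`, the city readings, and both user functions are made up for the example, and the reduce function returns one value per key rather than a list to keep it short.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, map_fn, reduce_fn):
    """Run one MapReduce pass over (key1, value1) records in one process."""
    # Map phase: each input record may emit any number of (key2, value2) pairs.
    intermediate = []
    for key1, value1 in records:
        intermediate.extend(map_fn(key1, value1))

    # Shuffle and sort: order the pairs by key2 so equal keys become adjacent.
    intermediate.sort(key=itemgetter(0))

    # Reduce phase: call reduce_fn once per distinct key2 with all its values.
    output = []
    for key2, group in groupby(intermediate, key=itemgetter(0)):
        values = [value for _, value in group]
        output.append((key2, reduce_fn(key2, values)))
    return output


# Hypothetical input: (line offset, "city temperature") records.
readings = [(0, "Cluj 21"), (1, "Iasi 19"), (2, "Cluj 25")]


def city_temperature(_offset, line):
    """Map: turn one text line into a (city, temperature) pair."""
    city, temperature = line.split()
    return [(city, int(temperature))]


def hottest(_city, temperatures):
    """Reduce: keep the highest temperature seen for a city."""
    return max(temperatures)


result = run_mapreduce(readings, city_temperature, hottest)
print(result)  # [('Cluj', 25), ('Iasi', 19)]
```

Only `city_temperature` and `hottest` are specific to the problem; the driver never changes, which is the separation between algorithm and administration that the framework is meant to provide.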

    MapReduce (5)

    Example: count occurrences of 2-grams in a book

    The quick brown fox

    The quick

    quick brown

    brown fox

    Input: The book

    Map: text → List(<2-gram, 1>)

    Reduce: sizeof(List)
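The 2-gram example could look roughly like this in Python. It is a sketch under assumptions: the book is given as a list of text chunks, the function names are invented, and the grouping that a real framework does during shuffle and sort is simulated here with a dictionary.

```python
from collections import defaultdict


def map_bigrams(text):
    """Map: emit (2-gram, 1) for every pair of adjacent words."""
    words = text.split()
    return [(f"{first} {second}", 1) for first, second in zip(words, words[1:])]


def reduce_count(bigram, ones):
    """Reduce: the count is the size of the list of emitted values."""
    return len(ones)


def count_bigrams(chunks):
    """Group the mapped pairs by 2-gram, then reduce each group."""
    grouped = defaultdict(list)
    for chunk in chunks:
        for bigram, one in map_bigrams(chunk):
            grouped[bigram].append(one)
    return {bigram: reduce_count(bigram, ones) for bigram, ones in grouped.items()}


counts = count_bigrams(["The quick brown fox", "The quick red fox"])
print(counts["The quick"])    # 2
print(counts["quick brown"])  # 1
```

Note that this version misses 2-grams that span two chunks; how the input is split matters for the result.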

    MapReduce (6)

    When should you use MR?

    Lots of data

    Jobs can be parallel

    Lots of machines

    When not to use MR?

    Intensive computation on small data

    Jobs depend on each other

    Hadoop (1)

    Hadoop is an open-source implementation of the MapReduce framework

    A top-level project of the Apache Software Foundation

    Appeared two years after the MapReduce paper

    Developed by companies:

    Yahoo, Cloudera

    And independent submitters

    Hadoop (2)

    Widely used in industry; adopters are listed at:

    http://wiki.apache.org/hadoop/PoweredBy

    Hadoop (3)

    [Diagram: one JobTracker coordinating five TaskTrackers]

    Completely automated

    Jobs are scheduled based on data locality

    Speculative execution
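Scheduling by data locality can be illustrated with a toy assignment function. This is not Hadoop's scheduler: `assign_tasks`, the node names, and the slot counts are all hypothetical, and the greedy strategy is deliberately naive. Each task prefers a node that already stores its input block and falls back to any free node otherwise.

```python
def assign_tasks(block_locations, free_slots):
    """Greedily place each task, preferring nodes that hold its input block.

    block_locations: {task: set of nodes storing a replica of its input}
    free_slots:      {node: number of tasks the node can still accept}
    Returns {task: (node, is_data_local)}.
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for task, replicas in block_locations.items():
        # Nodes with a free slot that already have the data on local disk.
        local = [node for node in sorted(replicas) if slots.get(node, 0) > 0]
        # Any node with a free slot; the data would travel over the network.
        remote = [node for node in sorted(slots) if slots[node] > 0]
        candidates = local or remote
        if not candidates:
            break  # cluster is full; remaining tasks have to wait
        node = candidates[0]
        assignment[task] = (node, node in replicas)
        slots[node] -= 1
    return assignment


plan = assign_tasks(
    {"split-0": {"node-a", "node-b"}, "split-1": {"node-a"}, "split-2": {"node-a"}},
    {"node-a": 1, "node-b": 1, "node-c": 1},
)
print(plan)
```

In this run only `split-0` is data-local. Placing `split-0` on `node-b` instead would have left `node-a` free for `split-1`, which shows why a greedy choice is not always the best one.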

    Hadoop (4)

    Code

    Is open source

    Java

    Build scripts

    Bash scripts

    Configuration files

    Hadoop (5)

    Is part of a larger ecosystem

    HDFS: distributed file system

    HBase: distributed, column-oriented database

    Mahout: machine learning algorithm library

    Nutch: web crawler

    And lots of other stuff

    Hadoop example

    Ad clicking log

    User information (Age, Location) database

    How could you use that to your advantage?

    Mahout: machine learning framework
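One possible answer to the question on this slide is to estimate how likely each group of users is to click an ad. The sketch below assumes a made-up log format of `(user_id, ad, clicked)` entries and a small in-memory user table; none of these names or values come from the talk, and no Mahout API is used.

```python
from collections import defaultdict

# Hypothetical user database: user_id -> (age, location).
users = {"u1": (16, "Romania"), "u2": (16, "Romania"), "u3": (40, "Romania")}

# Hypothetical ad log: (user_id, ad, clicked).
ad_log = [
    ("u1", "ad-1", True),
    ("u2", "ad-1", False),
    ("u3", "ad-1", False),
    ("u1", "ad-1", False),
]


def map_impression(user_id, ad, clicked):
    """Map: key each impression by (location, age, ad); value is 1 if clicked."""
    age, location = users[user_id]
    return (location, age, ad), 1 if clicked else 0


def reduce_click_rate(outcomes):
    """Reduce: fraction of impressions in the group that were clicked."""
    return sum(outcomes) / len(outcomes)


# Stand-in for shuffle and sort: collect the outcomes of each group.
grouped = defaultdict(list)
for entry in ad_log:
    key, outcome = map_impression(*entry)
    grouped[key].append(outcome)

click_rate = {key: reduce_click_rate(outcomes) for key, outcomes in grouped.items()}
print(click_rate)
```

With a real log the per-group rates could then decide which ad to show to which audience.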

    HDFS (1)

    Distributed file system

    Modelled after the GFS paper

    labs.google.com/papers/gfs-sosp2003.pdf

    Stores multiple copies of data

    Seek time >> Scan time

    Move computation vs Move data

    Small File Problem (TM)
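The "Seek time >> Scan time" point, and the reason small files hurt, can be shown with a little arithmetic. The disk figures below (10 ms per seek, 100 MB/s sequential transfer) are assumed round numbers, not measurements, and `seek_overhead` is a helper written for this example.

```python
def seek_overhead(block_size_mb, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Fraction of the total read time spent seeking instead of transferring."""
    transfer_ms = block_size_mb / transfer_mb_per_s * 1000.0
    return seek_ms / (seek_ms + transfer_ms)


small = seek_overhead(0.004)  # one 4 KB read
large = seek_overhead(64)     # one 64 MB block

print(f"4 KB read:   {small:.1%} of the time is seeking")
print(f"64 MB block: {large:.1%} of the time is seeking")
```

Under these assumptions a small read is almost entirely seek time, while a large block spends nearly all its time streaming data. That is the motivation for storing data in large blocks and for avoiding many small files.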

    HDFS (3)

    Part of Hadoop

    Open source

    Java

    Build scripts

    Bash scripts

    Configuration files

    HBase

    Distributed, column-oriented, sparse hash table

    Data is stored in HDFS

    Based on the BigTable paper by Google

    labs.google.com/papers/bigtable.html

    HBase (2)

    Table

    Key

    Columns

    Column Families

    key=location:Romania;age:16;sex=M

    ads:copiutze.ro.clickProbability = 0.0018

    ads:copiutze.ro.bestPlacement = calendarPage

    stats:clickProbability=0.0015
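The row on this slide can be pictured as nested dictionaries: row key, then column family, then column. This is only a mental model in plain Python, not the HBase client API; `get_cell` is a hypothetical helper, and real HBase additionally keeps rows sorted by key and stores timestamped versions of each cell.

```python
# Row key -> column family -> column -> value, using the values from the slide.
table = {
    "location:Romania;age:16;sex=M": {
        "ads": {
            "copiutze.ro.clickProbability": 0.0018,
            "copiutze.ro.bestPlacement": "calendarPage",
        },
        "stats": {"clickProbability": 0.0015},
    }
}


def get_cell(table, row_key, family, column, default=None):
    """Look up one cell; cells that were never written simply do not exist."""
    return table.get(row_key, {}).get(family, {}).get(column, default)


row = "location:Romania;age:16;sex=M"
print(get_cell(table, row, "ads", "copiutze.ro.clickProbability"))  # 0.0018
print(get_cell(table, row, "ads", "example.org.clickProbability"))  # None
```

The table is sparse because a row stores only the columns it actually has; a missing cell costs nothing.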

    HBase (3)

    Distribution scheme similar to that of HDFS

    Uses HDFS to store its files

    [Diagram: one Master coordinating three Region Servers]

    Conclusion

    MapReduce

    Lots of input data

    Parallel jobs

    Lots of computers

    We could also talk about

    Mahout: machine learning

    Nutch: web crawling

    Lucene/Solr: search engine

    Pig, Cascading: frameworks over Hadoop

    Questions?

    Thank you!