1CAMSAP

  • 8/13/2019 1CAMSAP

    Hadoop for Dummies: MapReduce to the Rescue

    Introduction

    Knowing why MapReduce is essential

    Understanding how MapReduce works

    Looking at the industries that use MapReduce

    Considering real-world applications

    Solution to Huge Unstructured Data

    MapReduce

    A software framework that breaks big problems into small, manageable tasks and then distributes them to multiple servers.

    These servers are called nodes, and they work together in parallel to arrive at a result.
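
    As a rough illustration of the idea on this slide, the sketch below splits one big job into small tasks and hands them to a pool of workers that run in parallel. It is a single-machine analogy only: the worker threads stand in for nodes, and the names `split_into_tasks` and `count_characters` are made up for this example and are not part of Hadoop.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(records, task_size):
    """Break one big list of records into small, manageable chunks."""
    return [records[i:i + task_size] for i in range(0, len(records), task_size)]


def count_characters(chunk):
    """The work done on one 'node': total characters in its chunk."""
    return sum(len(record) for record in chunk)


records = ["skate", "ski", "skis", "snare drums", "ski", "skate"]
tasks = split_into_tasks(records, task_size=2)

# Worker threads stand in for nodes (an assumption for illustration only).
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(count_characters, tasks))

# Combine the partial results into the final answer.
total = sum(partial_results)
print(partial_results, total)
```

    Each chunk is processed independently, so adding workers would shorten the run without changing the combined result.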

    Data requirement for MapReduce

    If your job is to coax insight from a very large disk-based information set, measured in terabytes to petabytes, then MapReduce will likely meet your needs.

    MapReduce can work with raw data that's stored in disk files, in relational databases, or both.

    The data may be structured or unstructured, and is commonly made up of text, binary, or multi-line records.

    The most common MapReduce usage pattern employs a distributed file system known as Hadoop Distributed File System (HDFS).

    Data is stored on local disk and processing is done locally on the computer with the data.

    MapReduce Architecture

    Map

    Key

    The key identifies what kind of information we're looking at.

    When compared with a relational database, a key usually equates to a column.

    Value

    The value portion of the key/value pair is an actual instance of data associated with a key.

    Examples of Key and Value

    Key                   Value          Key/value pair
    First name            Danielle       First name/Danielle
    Transaction amount    19.96          Transaction amount/19.96
    Search term           Snare drums    Search term/Snare drums
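
    To make the pairing concrete, here is a minimal sketch that holds the same three examples as Python tuples. Representing a pair as a `(key, value)` tuple is an assumption made for illustration, not something the slides prescribe.

```python
# Each record is a (key, value) tuple, mirroring the examples on this slide.
pairs = [
    ("First name", "Danielle"),
    ("Transaction amount", 19.96),
    ("Search term", "Snare drums"),
]

for key, value in pairs:
    # Print in the same key/value form used on the slide.
    print(f"{key}/{value}")
```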

    Reduce

    After the Map phase is over, all the intermediate values for a given output key are combined together into a list.

    The reduce() function then combines the intermediate values into one or more final values for the same key.
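
    The two sentences above can be sketched in a few lines of plain Python. This is a toy, in-memory version: `group_by_key` and `reduce_sum` are hypothetical helper names, the intermediate pairs are invented sample data, and summing is just one possible way to combine values.

```python
from collections import defaultdict


def group_by_key(intermediate_pairs):
    """Collect all intermediate values for each key into one list."""
    grouped = defaultdict(list)
    for key, value in intermediate_pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_sum(key, values):
    """Combine the list of values for one key into a single final value."""
    return key, sum(values)


# Hypothetical intermediate output from several map tasks.
intermediate = [("ski", 3), ("skate", 5), ("ski", 2), ("skis", 1), ("skate", 4)]

grouped = group_by_key(intermediate)
final = dict(reduce_sum(key, values) for key, values in grouped.items())
print(grouped)
print(final)
```

    Grouping happens once for all keys, while the reduce function only ever sees one key and its list of values, which is what lets different keys be reduced on different nodes.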

    Configuring MapReduce

    Components will fail at a high rate

    Data will be contained in a relatively small number of big files

    Data files are write-once

    Lots of streaming reads

    Higher sustained throughput across large amounts of data

    MapReduce in Action (Example)

    In this scenario, you're in charge of the e-commerce website for a very large retailer. You stock over 200,000 individual products, and your website receives hundreds of thousands of visitors each day. In aggregate, your customers place nearly 50,000 orders each day.

    Example (Prepare a sorted list of search terms)

    Step 1: The data should ideally be broken into numerous files of roughly 1 GB each.

    Step 2: Each file will be distributed to a different node.

    Step 3: On each node, the Map step will produce a list consisting of each word in the file along with how many times it appears. For example, one node might come up with these intermediate results from its own set of data:

    Skate: 4992120

    Ski: 303021

    Skis: 291101
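
    The steps above can be sketched on a small scale as follows. This is an assumption-laden toy, not Hadoop code: two short in-memory lists stand in for the per-node files, `map_count_terms` and `merge_and_sort` are hypothetical names, and the final merge and sort is one plausible way to arrive at the sorted list named in the slide title.

```python
from collections import Counter


def map_count_terms(lines):
    """Map step on one node: count how often each word appears."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_and_sort(per_node_counts):
    """Merge every node's counts and sort by frequency, highest first."""
    total = Counter()
    for counts in per_node_counts:
        total.update(counts)
    # Sort by descending count, then alphabetically to break ties.
    return sorted(total.items(), key=lambda item: (-item[1], item[0]))


# Two tiny in-memory "files", each standing in for one node's data.
node_a = ["skate ski", "skate skis", "ski skate"]
node_b = ["ski skis", "skate"]

per_node = [map_count_terms(node_a), map_count_terms(node_b)]
ranked = merge_and_sort(per_node)
print(ranked)
```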

    Benefit from MapReduce if you have

    Lots of Data

    Multiple servers at your disposal

    MapReduce-based software, such as Hadoop

    MapReduce Users

    Financial services

    Telco

    Retail

    Government

    Defense

    Homeland security

    Health and life services

    Utilities

    Social networks/Internet

    Internet service providers

    Other MapReduce Applications

    Risk modeling

    Recommendation engines

    Point of sale transaction analysis

    Threat analysis

    Search quality

    ETL logic for data warehouses

    Customer churn analysis

    Ad targeting

    Network traffic analysis

    Trade surveillance

    Data sandboxes

    Real-World MapReduce Examples

    Financial Services

    Fraud detection

    Asset management

    Data source and data store consolidation

    Real-World MapReduce Examples

    Auto Manufacturing

    Vehicle model and option validation

    Vehicle mass analysis

    Emission reporting

    Customer satisfaction