1.BigData-Hadoop an Introduction

  • 8/11/2019 1.BigData-Hadoop an Introduction

    http://www.excelonlineclasses.co.nr/


    Excel Online Classes offers the following services:

    • Online Training
    • Development
    • Testing
    • Job support
    • Technical Guidance
    • Job Consultancy
    • Any needs of the IT sector

    HADOOP-

    Nagarjuna K

    Why Hadoop, and what is it?

    • A tool to process big data

    What is BIG Data?

    • Social sites such as Facebook and Google+ generate lots of data.
    • Machines generate lots of data too.
    • We are having an online discussion right now; even the number of us attending this conference will be recorded as data.

    What is BIG Data? (continued)

    • Exponential growth of data posed challenges to Google, Yahoo, Microsoft, and Amazon.
    • They need to go through TBs and PBs of data to answer questions such as:
      - Which websites and books were popular?
      - What kind of ads appeal to users?
    • Existing tools became inadequate to process such large data sets.

    Why is the data so BIG?

    • Until a couple of decades ago: floppy disks
    • After that: CD/DVD drives
    • Half a decade ago: hard drives (500 GB)
    • Now: hard drives (1 TB) are available in abundance

    Why is the data so BIG?

    • So what?
    • Even the technology to read data has taken a leap.

    Why is the data so BIG?

    Year   Device            Volume    Data transfer speed   Time to process
    1990   Optical drive     1370 MB   4.4 MB/s              5 minutes
    2012   1 TB SATA drive   1 TB      100 MB/s              2.5 hours
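
The "time to process" figures follow from dividing the volume by the transfer speed. A minimal Python sketch of that arithmetic, assuming decimal units (1 TB taken as 1,000,000 MB); the function name `read_time_seconds` is illustrative and not from the slides:

```python
def read_time_seconds(volume_mb, speed_mb_per_s):
    """Time to read a whole drive sequentially, in seconds."""
    return volume_mb / speed_mb_per_s

# 1990 drive: 1370 MB at 4.4 MB/s
t_1990 = read_time_seconds(1370, 4.4)
# 2012 drive: 1 TB (assumed here to be 1,000,000 MB) at 100 MB/s
t_2012 = read_time_seconds(1_000_000, 100)

print(f"1990: {t_1990 / 60:.1f} minutes")
print(f"2012: {t_2012 / 3600:.1f} hours")
```

With these assumptions the 2012 drive comes out slightly above the rounded 2.5 hours quoted in the table. Either way the point stands: capacity grew far faster than transfer speed, so reading a full drive now takes hours instead of minutes.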

    How to handle such BIG data?

    • One BIG elephant?
    • Or numerous small chickens?

    How to handle such BIG data?

    • The concept of torrents
    • Reduce the time to read data by reading it from multiple sources simultaneously.
    • Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in less than two minutes.
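
The 100-drive scenario can be checked with a few lines of arithmetic. This sketch assumes each drive reads at 100 MB/s (the 2012 figure from the earlier table), that 1 TB is 1,000,000 MB, and that all drives read fully in parallel with no coordination overhead; `parallel_read_seconds` is an illustrative name, not part of any library:

```python
def parallel_read_seconds(total_mb, drives, speed_mb_per_s):
    """Time to read total_mb when it is split evenly across `drives`
    drives that all read at the same time (idealized: no overhead)."""
    per_drive_mb = total_mb / drives
    return per_drive_mb / speed_mb_per_s

one_drive = parallel_read_seconds(1_000_000, 1, 100)        # single 1 TB drive
hundred_drives = parallel_read_seconds(1_000_000, 100, 100)  # 100 drives, 10 GB each

print(one_drive, hundred_drives)
```

Under these idealized assumptions, each of the 100 drives only has to read 10 GB, which is where the "less than two minutes" figure comes from.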

    How to handle such BIG data? -- Issues

    • How do we handle systems going up and down?
    • How do we combine the data from all the systems?

    Problem 1: Systems going up and down

    • Commodity hardware is used for data storage and analysis.
    • Chances of failure are very high.
    • So, keep redundant copies of the same data across several machines.
      - If one machine fails, you still have the others.
    • Google came up with a file system, GFS (Google File System), which implemented all these details.
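
The redundancy idea can be sketched as follows. This is a toy illustration in plain Python, not how GFS is implemented: the machine names, the round-robin placement, and the replication factor of 3 are all assumptions made for the example.

```python
def place_replicas(block_ids, machines, replication=3):
    """Assign each block to `replication` distinct machines (round-robin)."""
    if replication > len(machines):
        raise ValueError("need at least as many machines as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [machines[(i + r) % len(machines)]
                            for r in range(replication)]
    return placement

def readable_blocks(placement, failed):
    """Blocks that still have at least one copy on a live machine."""
    return {block for block, hosts in placement.items()
            if any(host not in failed for host in hosts)}

machines = ["m1", "m2", "m3", "m4"]   # hypothetical machine names
placement = place_replicas(["b0", "b1", "b2", "b3"], machines)
# With three copies of every block, losing two machines loses no data.
survivors = readable_blocks(placement, failed={"m1", "m2"})
```

Because every block lives on three distinct machines, any two machines can fail and each block is still readable from at least one remaining copy.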

    GFS

    • Divides data into chunks and stores them in the file system
    • Can store data in the range of PBs as well
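
A minimal sketch of chunking, assuming a fixed chunk size. The 4-byte chunk size is only there to keep the example readable (real systems use far larger chunks), and `split_into_chunks` is an illustrative helper rather than a GFS API:

```python
def split_into_chunks(data, chunk_size):
    """Split a bytes object into consecutive chunks of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

chunks = split_into_chunks(b"hello hadoop", chunk_size=4)
# Joining the chunks in order gives back the original data.
restored = b"".join(chunks)
```

Each chunk can then be stored, and replicated, independently of the others.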

    Problem 2: How to combine the data?

    • We analyze data across different machines, but how do we merge the results to get a meaningful outcome?
    • All (or some) of the data has to travel across the network; only then can the data be merged.
    • Doing this is notoriously challenging.
    • Again, Google's answer: MapReduce

    MapReduce

    • Provides a programming model that abstracts the problem away from disk reads and writes, transforming it into a computation over keys and values.
    • Two phases:
      - Map
      - Reduce

    So what is Hadoop?

    • An operating system?
    • Provides:
      1. A reliable shared storage system
      2. An analysis system

    History of Hadoop

    • Google was the first to launch GFS and MapReduce.
    • They published a paper in 2004 announcing a brand-new technology to the world.
    • This technology was already well proven within Google by 2004.

    (MapReduce paper by Google)

    History of Hadoop

    • Doug Cutting saw an opportunity and led the charge to develop an open-source version of this MapReduce system, called Hadoop.
    • Soon after, Yahoo and others rallied around to support this effort.
    • Hadoop is now a core part of the infrastructure at:
      - Facebook, Yahoo, LinkedIn, Twitter

    History of Hadoop

    • GFS (Google) -> HDFS (Hadoop)
    • MapReduce (Google) -> MapReduce (Hadoop)

    HDFS -- A Brief

    Design: streaming very large files on a commodity cluster.

    1. Very large files
       - MBs to PBs
    2. Streaming
       - Write-once, read-many approach
       - Once a huge data set is in place, we tend to use the data, not modify it
       - The time to read the whole data set is more important than the latency of reading the first record
    3. Commodity cluster
       - No high-end servers
       - Yes, there is a high chance of failure (but HDFS is tolerant enough)
       - Replication is done

    MapReduce -- A Brief

    • Large-scale data processing in parallel.
    • MapReduce provides:
      - Automatic parallelization and distribution
      - Fault tolerance
      - I/O scheduling
      - Status and monitoring
    • Two phases in MapReduce:
      - Map
      - Reduce

    MapReduce -- A Brief

    • Map phase
      - map (in_key, in_value) -> list(out_key, intermediate_value)
      - Processes an input key/value pair
      - Produces a set of intermediate pairs
    • Reduce phase
      - reduce (out_key, list(intermediate_value)) -> list(out_value)
      - Combines all intermediate values for a particular key
      - Produces a set of merged output values (usually just one)
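
The two signatures can be made concrete with the classic word-count example. The sketch below is plain Python that imitates the model on a single machine; it is not the Hadoop API, and `map_fn`, `reduce_fn`, and `run_mapreduce` are names chosen for illustration. The grouping step between the two phases stands in for work the framework would normally do.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """map(in_key, in_value) -> list of (out_key, intermediate_value).
    in_key is a line number (unused here); in_value is a line of text."""
    return [(word.lower(), 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    """reduce(out_key, list(intermediate_value)) -> list(out_value)."""
    return [sum(intermediate_values)]

def run_mapreduce(records, mapper, reducer):
    # Map phase: every input pair yields zero or more intermediate pairs,
    # which are grouped by their output key.
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, intermediate in mapper(key, value):
            grouped[out_key].append(intermediate)
    # Reduce phase: one reducer call per distinct key.
    return {key: reducer(key, values) for key, values in grouped.items()}

lines = [(0, "big data needs big tools"), (1, "hadoop handles big data")]
counts = run_mapreduce(lines, map_fn, reduce_fn)
print(counts)
```

Each reduce call returns a one-element list here, which matches the "usually just one" output value noted above.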

    MapReduce -- A Brief

    Hadoop Cluster

    Hadoop Ecosystems

    Versions of Hadoop

    • We will work with either of:
      - Apache hadoop-0.20
      - Cloudera hadoop - cdh3

    Prerequisites

    • Core Java
    • Acquaintance with Linux will help.
      - For a better experience, install Linux on your machine.

    Thank you

    • Your feedback is highly important for improving our course material and teaching methodologies.
    • Please email your suggestions to [email protected]@outlook.com

    Disclaimer

    • Excel Online Classes acknowledges the proprietary rights of the trademarks and product names of other companies mentioned in any of the training material, including but not limited to the handouts, written material, videos, PowerPoint presentations, etc. All such training materials are provided to our students for learning purposes only. Students shall not use such materials for their private gain, nor can they sell any such materials to a third party. Some of the examples provided in any such training materials may not be owned by us, and as such we do not claim any proprietary rights for the same. We do not guarantee, nor are we responsible for, such products and projects. We acknowledge that any such information or product that has been lawfully received from any third-party source is free from restriction and without any breach or violation of law whatsoever.
