Hadoop: Introduction & Setup

Prepared by: Nishant M Gandhi, Certified Network Manager by Nettech.

Diploma in Cyber Security (pursuing).

C K Pithawalla College of Engineering & Technology, Surat

What is Hadoop?

"Hadoop is a framework for running applications on large clusters built of commodity hardware."

HADOOP WIKI

Open source, Java

Google's MapReduce inspired Yahoo!'s Hadoop.

Now part of the Apache group

    Hadoop Architecture on DELL C Series Server

    Hadoop Software Stack

    Hadoop Common: The common utilities

    Hadoop Distributed File System (HDFS): A distributed file system

    Hadoop Map Reduce: Distributed processing on compute clusters.

    Other Hadoop-related projects:

    Avro: A data serialization system.

    Cassandra: A scalable multi-master database

    Chukwa: A data collection system for managing large distributed systems.

    HBase: A scalable, distributed database that supports structured data storage for large tables.

    Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.

Mahout: A scalable machine learning and data mining library.

Pig: A high-level data-flow language and execution framework for parallel computation.

ZooKeeper: A coordination service for distributed applications.

    Who Uses Hadoop?

The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more than 10,000 cores and produces data that is now used in every Yahoo! Web search query.

On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application.

    Who Uses Hadoop?

The data warehouse Hadoop cluster at Facebook

Here are the cluster statistics from the HDFS cluster at Facebook:

21 PB of storage in a single HDFS cluster

2000 machines

12 TB per machine (a few machines have 24 TB each)

1200 machines with 8 cores each + 800 machines with 16 cores each

32 GB of RAM per machine

15 map-reduce tasks per machine

That is a total of more than 21 PB of configured storage capacity, larger than the previously known Yahoo! cluster of 14 PB.

    Who Uses Hadoop?

Creator of MapReduce

    Runs Hadoop for NS-Research cluster

    HDFS is inspired by GFS

    Who Uses Hadoop?

    Other Hadoop Users:

    IBM

New York Times

Twitter

    Veoh

    Amazon

    Apple

    eBay

    AOL

    Hewlett-Packard

    Joost

    Map Reduce

    Programming model developed at Google

    Sort/merge based distributed computing

Initially it was intended for their internal search/indexing application, but it is now used extensively by other organizations (e.g., Yahoo, Amazon.com, IBM, etc.).

It is a functional-style programming model (cf. LISP) that is naturally parallelizable across a large cluster of workstations or PCs.

The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing the required inter-machine communication. (This is the key to Hadoop's success.)
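
As a rough single-machine analogy (not Hadoop code), the classic word-count job can be sketched with standard Unix tools: tr plays the role of the map phase, sort the shuffle/sort phase, and uniq -c the reduce phase. The file name input.txt is just a placeholder.

    # "map": emit one word per line; "shuffle": sort so identical words are adjacent; "reduce": count each group
    tr -cs 'A-Za-z' '\n' < input.txt | sort | uniq -c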

Map Reduce (diagram)

Hadoop Distributed File System (HDFS)

At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose.

GFS is not open source.

Doug Cutting and others at Yahoo! reverse engineered GFS and called it the Hadoop Distributed File System (HDFS).

    Goals of HDFS

Very Large Distributed File System

  10K nodes, 100 million files, 10 PB

Assumes Commodity Hardware

  Files are replicated to handle hardware failure

  Detects failures and recovers from them

Optimized for Batch Processing

  Data locations are exposed so that computations can move to where the data resides

  Provides very high aggregate bandwidth

User space, runs on heterogeneous OS
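
Once a cluster is running (the single-node setup later in this deck is enough), replication and block placement can be inspected with the HDFS fsck tool; this invocation is only an illustration:

    bin/hadoop fsck / -files -blocks -locations   # report files, their blocks, replication, and block locations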

DFShell

The HDFS shell is invoked by: bin/hadoop dfs

Available subcommands include:

cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, mv, touchz, put, rm, rmr, setrep, stat, tail, test, text
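
A few illustrative invocations of these subcommands (note the leading dash when running them; the /user/demo paths and file names here are placeholders, not from the original slides):

    bin/hadoop dfs -mkdir /user/demo/input           # create a directory in HDFS
    bin/hadoop dfs -put notes.txt /user/demo/input   # copy a local file into HDFS
    bin/hadoop dfs -ls /user/demo/input              # list the directory
    bin/hadoop dfs -cat /user/demo/input/notes.txt   # print the file's contents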

    Hadoop Single Node Setup

    Step 1:

    Download hadoop from

    http://hadoop.apache.org/mapreduce/releases.html

    Step 2:

    Untar the hadoop file:

    tar xvfz hadoop-0.20.2.tar.gz
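
As an illustration, the 0.20.2 tarball untarred in Step 2 can usually be fetched from the Apache archive; the exact mirror or path may differ from this assumed one:

    wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz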

    Hadoop Single Node Setup

    Step 3:

Set the path to the Java installation by editing the JAVA_HOME parameter in hadoop/conf/hadoop-env.sh
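
For example, on a machine where the Sun JDK 6 is installed under /usr/lib/jvm/java-6-sun (an assumed, system-specific path), the edited line in hadoop/conf/hadoop-env.sh would look like:

    export JAVA_HOME=/usr/lib/jvm/java-6-sun   # point to wherever the local JDK lives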

    Hadoop Single Node Setup

    Step 4:

Create an RSA key to be used by Hadoop when sshing to localhost:

ssh-keygen -t rsa -P ""

    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
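
The empty passphrase (-P "") lets Hadoop's scripts ssh to localhost without prompting. Assuming an SSH server is running locally, passwordless login can be verified with:

    ssh localhost   # should log in without asking for a password
    exit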

    Hadoop Single Node Setup

    Step 5:

Make the following changes to the configuration files under hadoop/conf.

core-site.xml (inside the <configuration> element):

  <property>
    <name>hadoop.tmp.dir</name>
    <value>TEMPORARY-DIR-FOR-HADOOP-DATASTORE</value>
  </property>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>

    Hadoop Single Node Setup

mapred-site.xml (inside the <configuration> element):

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>

hdfs-site.xml (inside the <configuration> element):

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

    Hadoop Single Node Setup

    Step 6:

Format the Hadoop file system. From the hadoop directory, run the following:

    bin/hadoop namenode -format

    Using Hadoop

1) How to start Hadoop?

    cd hadoop/bin

    ./start-all.sh

2) How to stop Hadoop?

    cd hadoop/bin

    ./stop-all.sh

3) How to copy a file from the local machine to HDFS? (see also the combined walk-through after this list)

cd hadoop

bin/hadoop dfs -put local_machine_path hdfs_path

4) How to list files in HDFS?

    cd hadoop

    bin/hadoop dfs -ls
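
Putting these commands together, a rough end-to-end session might look like the following. It assumes the hadoop-0.20.2 release directory, a local file example.txt, and the word-count program from the examples jar bundled with that release (file names are placeholders and the jar name may differ between releases):

    cd hadoop
    bin/start-all.sh                                   # start the HDFS and MapReduce daemons
    bin/hadoop dfs -put example.txt input.txt          # copy a local file into HDFS
    bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input.txt output   # run the bundled word-count job
    bin/hadoop dfs -ls output                          # inspect the result files
    bin/stop-all.sh                                    # shut the daemons down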

    Thank You..