Hadoop training institute in hyderabad

  • Hadoop manages: processor time, memory, disk space, and network bandwidth. It does not have a security model. It can handle hardware failure.

    www.kellytechno.com

  • Issues: race conditions, synchronization, and deadlock, i.e., the same issues as in a distributed OS and a distributed file system.

  • Grid computing (what is this?), e.g. Condor. The MPI model is more complicated: it does not automatically distribute data, and it requires a separately managed SAN.

  • Hadoop: simplified programming model; data is distributed as it is loaded.

    HDFS splits large data files across machines. HDFS replicates data. A failure causes additional replication.
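The point that a failure causes additional replication can be sketched as follows. This is a minimal illustration under stated assumptions, not HDFS's actual algorithm: the function name `re_replicate`, the node names, and the in-memory block map are all hypothetical.

```python
import random

def re_replicate(block_map, live_nodes, failed_node, replication=3, rng=None):
    """Drop failed_node from every block's replica list, then copy each
    under-replicated block to other live nodes until it is back at the
    target replication factor (or no candidate nodes remain)."""
    rng = rng or random.Random()
    survivors = [n for n in live_nodes if n != failed_node]
    new_map = {}
    for block, holders in block_map.items():
        holders = [n for n in holders if n != failed_node]
        # Only nodes that do not already hold this block are candidates.
        candidates = [n for n in survivors if n not in holders]
        missing = max(0, replication - len(holders))
        holders += rng.sample(candidates, min(missing, len(candidates)))
        new_map[block] = holders
    return new_map

# Hypothetical five-node cluster with two blocks, three replicas each.
nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
blocks = {"blk_0": ["dn1", "dn2", "dn3"], "blk_1": ["dn2", "dn4", "dn5"]}
repaired = re_replicate(blocks, nodes, failed_node="dn2", rng=random.Random(0))
```

After the call, every block is again held by three distinct machines, none of which is the failed one.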

  • Core idea: records are processed in isolation. Benefit: reduced communication. Jargon: a mapper is a task that processes records; a reducer is a task that aggregates results from mappers.
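The mapper and reducer roles can be illustrated with a word count run in a single process. This is only a sketch of the programming model, not Hadoop's API: the names `mapper`, `reducer`, and `run_job` are made up for the example, and the grouping step stands in for the shuffle that the framework performs across machines.

```python
from collections import defaultdict

def mapper(record):
    """Process one record in isolation: emit (word, 1) for each word."""
    for word in record.split():
        yield word.lower(), 1

def reducer(key, values):
    """Aggregate all values that the mappers emitted for one key."""
    return key, sum(values)

def run_job(records):
    # Group mapper output by key. In Hadoop the framework does this
    # grouping and moves the data between machines.
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())

counts = run_job(["the quick fox", "the lazy dog", "The end"])
```

Because each call to `mapper` looks at a single record and nothing else, the records could be processed on different machines without any communication between the mapper tasks.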

  • How is the previous picture different from normal grid/cluster computing?

    Grid/cluster: the programmer manages communication via MPI, vs. Hadoop: communication is implicit. Hadoop manages data transfer and cluster topology issues.

  • Hadoop overhead: MPI does better for small numbers of nodes.

    Hadoop's flat scalability pays off with large data: little extra work is needed to go from few to many nodes.

    MPI requires explicit refactoring to go from a small to a larger number of nodes.

  • NFS: the Network File System. Saw this in OS class. Supports file system exporting. Supports mounting of remote file systems.

  • Mounts. Cascading mounts.

  • Mounting establishes a logical connection between server and client.

    Mount operation: takes the name of the remote directory and the name of the server. The mount request is mapped to the corresponding RPC and forwarded to the mount server running on the server machine. The export list specifies the local file systems that the server exports for mounting, along with the names of the machines that are permitted to mount them.

  • The server returns a file handle, a key for further accesses.

    The file handle consists of a file-system identifier and an inode number that identifies the mounted directory.

    The mount operation changes only the user's view and does not affect the server side.
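The mount steps described above (export list check, then a file handle as the reply) can be modeled in a few lines. This is a toy model rather than the NFS wire protocol: the class names, the client names, and the identifier values are hypothetical, and the RPC layer is replaced by a plain method call.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileHandle:
    fs_id: int   # file-system identifier
    inode: int   # inode number of the mounted directory

class MountServer:
    def __init__(self, exports, directories):
        # exports: exported directory -> set of client machines allowed to mount it
        # directories: exported directory -> (fs_id, inode), made-up values here
        self.exports = exports
        self.directories = directories

    def mount(self, client, remote_dir):
        """Check the export list, then return a handle for later accesses."""
        allowed = self.exports.get(remote_dir)
        if allowed is None:
            raise FileNotFoundError(f"{remote_dir} is not exported")
        if client not in allowed:
            raise PermissionError(f"{client} may not mount {remote_dir}")
        return FileHandle(*self.directories[remote_dir])

server = MountServer(
    exports={"/usr/shared": {"client-a"}},
    directories={"/usr/shared": (7, 1042)},
)
handle = server.mount("client-a", "/usr/shared")
```

Note that `mount` only reads server state and returns a key; nothing on the server side changes, which matches the statement that mounting affects only the user's view.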

  • NFS advantages. Transparency: clients are unaware of local vs. remote files. Standard operations: open(), close(), fread(), etc.

    NFS disadvantages. Files in an NFS volume reside on a single machine. There are no reliability guarantees if that machine goes down. All clients must go to this machine to retrieve their data.

  • HDFS advantages: designed to store terabytes or petabytes; data is spread across a large number of machines; supports much larger file sizes than NFS; stores data reliably (replication).

  • HDFS advantages: provides fast, scalable access; serves more clients by adding more machines; integrates with MapReduce for local computation.

  • HDFS disadvantages: not as general-purpose as NFS; the design restricts use to a particular class of applications; HDFS is optimized for streaming read performance, so it is not good at random access.

  • HDFS disadvantages: write-once, read-many model; updating a file after it has been closed is not supported (can't append data); the system does not provide a mechanism for local caching of data.

  • HDFS is a block-structured file system.

    A file is broken into blocks distributed among DataNodes.

    DataNodes are the machines used to store data blocks.

  • Target machines chosen randomly on a block-by-block basis

    Supports file sizes far larger than a single-machine DFS

    Each block replicated across a number of machines (3, by default)
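Block-by-block placement with three replicas can be sketched as below. This follows the slide's simplified description of random target selection; real HDFS placement also takes rack topology into account. The function name `place_blocks` and the DataNode names are hypothetical, and the 64 MB block size is the default quoted in these slides.

```python
import random

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default quoted in these slides

def place_blocks(file_size, datanodes, replication=3, rng=None):
    """Return one entry per block: the DataNodes holding that block's replicas."""
    rng = rng or random.Random()
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    # Targets are chosen independently for every block; the replicas of
    # one block always land on distinct machines.
    return [rng.sample(datanodes, replication) for _ in range(num_blocks)]

# Hypothetical ten-node cluster storing a 200 MB file.
datanodes = [f"dn{i}" for i in range(1, 11)]
layout = place_blocks(200 * 1024 * 1024, datanodes, rng=random.Random(42))
```

A 200 MB file needs four blocks (three full ones and a partial one), so `layout` has four entries of three DataNodes each.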

  • HDFS expects large files: a small number of large files, hundreds of MB to GB each. It expects sequential access. The default block size in HDFS is 64 MB. Result: this reduces the amount of metadata storage per file and supports fast streaming of data (large amounts of contiguous data).
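A quick calculation shows why large blocks reduce per-file metadata. The 64 MB figure comes from the slide; the 4 KB comparison size is an assumption chosen only to represent a typical local file system block.

```python
def block_count(file_size, block_size):
    """Number of blocks needed to hold file_size bytes (ceiling division)."""
    return -(-file_size // block_size)

MB = 1024 * 1024
GB = 1024 * MB

hdfs_blocks = block_count(1 * GB, 64 * MB)     # 16 blocks to track
small_blocks = block_count(1 * GB, 4 * 1024)   # 262144 blocks to track
```

For a 1 GB file, the block list shrinks from 262,144 entries to 16, which is the reduction in metadata that the slide refers to.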

  • HDFS expects to read a block start-to-finish. This is useful for MapReduce, not good for random access, and not a good general-purpose file system.

  • HDFS files are NOT part of the ordinary file system

    HDFS files are in separate name space

    Not possible to interact with files using ls, cp, mv, etc.

    Don't worry: HDFS provides similar utilities.

  • Metadata is handled by the NameNode. Synchronization is dealt with by allowing only one machine to handle it. The NameNode stores metadata for the entire file system. This is not much data: file names, permissions, and the locations of each block of each file.

  • What happens if the NameNode fails? It is a bigger problem than a failed DataNode. Better be using RAID ;-) The cluster is kaput until the NameNode is restored.

    Not exactly relevant, but: DataNodes are more likely to fail. Why?
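One plausible answer to the question above, offered here as an assumption rather than as the lecturer's own, is simply that a cluster has many DataNodes and only one NameNode. If each machine fails independently with the same probability (a hypothetical 1% in the sketch below), the chance that at least one of many machines fails grows quickly with the number of machines.

```python
def p_any_failure(p_single, n_machines):
    """Probability that at least one of n independent machines fails."""
    return 1 - (1 - p_single) ** n_machines

p = 0.01                       # hypothetical per-machine failure probability
one = p_any_failure(p, 1)      # the single NameNode
many = p_any_failure(p, 100)   # a cluster of 100 DataNodes
```

With these made-up numbers, the single NameNode fails with probability 0.01, while the probability that some DataNode out of 100 fails is roughly 0.63.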

  • First download and unzip a copy of Hadoop (http://hadoop.apache.org/releases.html)

    Or better yet, follow this lecture first ;-)

    Important links:

    Hadoop website: http://hadoop.apache.org/index.html

    Hadoop Users Guide: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html

    2012 Edition of Hadoop Users Guide: http://it-ebooks.info/book/635/

  • HDFS configuration is in conf/hadoop-defaults.xml. Don't change this file. Instead modify conf/hadoop-site.xml. Be sure to replicate this file across all nodes in your cluster. Format of entries in this file:

    <property>
      <name>property-name</name>
      <value>property-value</value>
    </property>

  • Necessary settings:

    fs.default.name: describes the NameNode. Format: protocol specifier, hostname, port. Example: hdfs://punchbowl.cse.sc.edu:9000

    dfs.data.dir: path on the local file system in which the DataNode instance should store its data. Format: pathname. Example: /home/sauron/hdfs/data. Can differ from DataNode to DataNode. Default is /tmp. /tmp is not a good idea in a production system ;-)
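The three parts of an fs.default.name value (protocol specifier, hostname, port) can be pulled apart with Python's standard URL parser. This is just a way to inspect the format using the example value from the slide; Hadoop itself does not use this code.

```python
from urllib.parse import urlsplit

# Example value taken from the slide above.
uri = urlsplit("hdfs://punchbowl.cse.sc.edu:9000")

protocol = uri.scheme     # protocol specifier
hostname = uri.hostname   # NameNode host
port = uri.port           # NameNode port, parsed as an integer
```

A check like this can catch a missing port or a mistyped scheme before the value is copied to every node.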

  • dfs.name.dir: path on the local file system of the NameNode where the NameNode metadata is stored. Format: pathname. Example: /home/sauron/hdfs/name. Only used by the NameNode. Default is /tmp. /tmp is not a good idea in a production system ;-)

    dfs.replication: default replication factor. Default is 3. Fewer than 3 will impact availability of data.

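  • <property>
      <name>fs.default.name</name>
      <value>hdfs://your.server.name.com:9000</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/home/username/hdfs/data</value>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>/home/username/hdfs/name</value>
    </property>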
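Since the same site file has to be replicated to every node, it can be convenient to generate it from one place. The sketch below builds the three settings from the slide as XML; the helper name `build_site_config` is hypothetical, and the `<configuration>` root element is the wrapper that Hadoop site files use around their `<property>` entries.

```python
import xml.etree.ElementTree as ET

def build_site_config(settings):
    """Return a <configuration> element with one <property> per setting."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return root

# Placeholder values copied from the slide; replace them with real ones.
config = build_site_config({
    "fs.default.name": "hdfs://your.server.name.com:9000",
    "dfs.data.dir": "/home/username/hdfs/data",
    "dfs.name.dir": "/home/username/hdfs/name",
})
xml_text = ET.tostring(config, encoding="unicode")
```

The resulting `xml_text` is unindented, which is fine for Hadoop to read; it would be written to conf/hadoop-site.xml and copied to each machine.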

  • The master node needs to know the names of the DataNode machines. Add hostnames to conf/slaves, one fully-qualified hostname per line. (The NameNode runs on the master node.)

    Create the necessary directories, matching dfs.data.dir and dfs.name.dir:

    $ mkdir -p $HOME/hdfs/data

    $ mkdir -p $HOME/hdfs/name

    Note: the owner needs read/write access to all directories. Hadoop can run under your own name in a single-machine cluster. Do not run Hadoop as root. Duh!
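The directory and conf/slaves steps can also be scripted. The following is a minimal sketch, assuming the hdfs/data and hdfs/name layout from the configuration example; the helper names and the example.com hostnames are hypothetical, and the demo runs in a throwaway directory instead of a real home directory.

```python
import os
import tempfile

def prepare_node(base_dir, subdirs=("hdfs/data", "hdfs/name")):
    """Create the storage directories and confirm that the current
    user can read and write them."""
    created = []
    for sub in subdirs:
        path = os.path.join(base_dir, sub)
        os.makedirs(path, exist_ok=True)  # same effect as `mkdir -p`
        if not os.access(path, os.R_OK | os.W_OK):
            raise PermissionError(f"need read/write access to {path}")
        created.append(path)
    return created

def write_slaves(conf_dir, hostnames):
    """Write one fully-qualified hostname per line to conf/slaves."""
    os.makedirs(conf_dir, exist_ok=True)
    path = os.path.join(conf_dir, "slaves")
    with open(path, "w") as f:
        f.write("\n".join(hostnames) + "\n")
    return path

# Demonstrated in a temporary directory rather than a real $HOME.
base = tempfile.mkdtemp()
dirs = prepare_node(base)
slaves = write_slaves(os.path.join(base, "conf"),
                      ["node1.example.com", "node2.example.com"])
```

The access check mirrors the note that the owner needs read/write access to all directories; it does not replace running Hadoop under a non-root account.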
