Transcript

Cloud Distributed Computing Environment

Content of this lecture is primarily from the book “Hadoop: The Definitive Guide” (2nd edition).

• Hadoop is an open-source software system that provides a distributed computing environment in the cloud (on data centers)

• It is a state-of-the-art environment for Big Data Computing

• It contains two key services:
  - reliable data storage using the Hadoop Distributed File System (HDFS)
  - distributed data processing using a technique called MapReduce

History of Hadoop

• Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library.

• Hadoop has its origins in Apache Nutch, an open source web search engine.

• In 2004, Google published the paper that introduced MapReduce to the world.

• Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year, all the major Nutch algorithms had been ported to run using MapReduce and NDFS (the Nutch Distributed File System).

• In February 2006, NDFS and the MapReduce implementation moved out of Nutch to form an independent subproject of Lucene called Hadoop.

• At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale.

• In February 2008, Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.

• In January 2008, Hadoop was made its own top-level project at Apache, confirming its success and its diverse, active community.

• By this time, Hadoop was being used by many other companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times.

• In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data.

• In November 2008, Google reported that its MapReduce implementation sorted one terabyte in 68 seconds.

Introduction

• HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

- Very large files: some Hadoop clusters store petabytes of data.

- Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.

- Commodity hardware: Hadoop doesn’t require expensive, highly reliable hardware to run on. It is designed to run on clusters of commodity hardware.

Basic Concepts

• Blocks
  - Files in HDFS are broken into block-sized chunks. Each chunk is stored as an independent unit.
  - By default, the size of each block is 64 MB.
  - Some benefits of splitting files into blocks:
    -- A file can be larger than any single disk in the network.
    -- Blocks fit well with replication for providing fault tolerance and availability. To insure against corrupted blocks and disk/machine failure, each block is replicated to a small number of physically separate machines (typically three).

• Namenodes and Datanodes
  - The namenode manages the filesystem namespace.
    -- It maintains the filesystem tree and the metadata for all the files and directories.
    -- It also holds the information on the locations of the blocks for a given file.
  - Datanodes store blocks of files. They report back to the namenode periodically with the lists of blocks they are storing (see the sketch after this list).
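A minimal Java sketch of the ideas above, assuming a reachable cluster (the namenode address and file path are placeholders, not part of the lecture): it reads back a file's block size and replication factor through the FileSystem API, then asks the namenode where each block is stored.

    import java.net.URI;
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
      public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:8020/user/someone/data.txt"; // placeholder
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FileStatus status = fs.getFileStatus(new Path(uri));
        System.out.println("Block size:  " + status.getBlockSize());   // 67108864 = 64 MB default
        System.out.println("Replication: " + status.getReplication()); // typically 3
        // The namenode supplies the location of every block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
          System.out.println("block at offset " + b.getOffset()
              + " stored on " + Arrays.toString(b.getHosts()));
        }
      }
    }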

The Command-Line Interface

• Copy a file from the local filesystem to HDFS.

• Copy the file from HDFS back to the local filesystem.

• Compare the two local files.
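A sketch of what these three steps look like with the hadoop fs shell (file names and HDFS paths are illustrative placeholders; md5sum is an ordinary Unix tool used here to check that the round trip preserved the file):

    # copy a local file into HDFS
    hadoop fs -copyFromLocal input/docs/sample.txt /user/someone/sample.txt

    # copy the file from HDFS back to the local filesystem
    hadoop fs -copyToLocal /user/someone/sample.txt sample.copy.txt

    # compare the original and the returned copy (identical checksums expected)
    md5sum input/docs/sample.txt sample.copy.txt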

Hadoop FileSystems

• The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop. There are several concrete implementations of this abstract class; HDFS is one of them.
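A minimal sketch of how a Java client obtains a concrete FileSystem; the URI scheme decides which implementation is returned (the namenode address is a placeholder):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ShowFileSystems {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "hdfs://..." resolves to the HDFS implementation,
        // "file:///" to the implementation for the local disk.
        FileSystem hdfs  = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);
        System.out.println(hdfs.getClass().getName());  // concrete class backing HDFS
        System.out.println(local.getClass().getName()); // concrete class for the local filesystem
      }
    }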

HDFS Java Interface

• How to read data from HDFS in Java programs

• How to write data to HDFS in Java programs

• Reading data using the FileSystem API (see the first sketch after this list)

• Writing data using the FileSystem API (see the second sketch after this list)
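Two minimal Java sketches of these operations, assuming placeholder paths and a placeholder namenode address; both follow the usual FileSystem API pattern of opening or creating a stream and copying bytes with IOUtils.

Reading a file from HDFS and printing it to standard output:

    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadFromHdfs {
      public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:8020/user/someone/input.txt"; // placeholder
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
          in = fs.open(new Path(uri)); // returns an FSDataInputStream
          IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
          IOUtils.closeStream(in);
        }
      }
    }

Writing a local file into a new HDFS file:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class WriteToHdfs {
      public static void main(String[] args) throws Exception {
        String localSrc = "sample.txt";                              // placeholder
        String dst = "hdfs://namenode:8020/user/someone/sample.txt"; // placeholder
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst)); // returns an FSDataOutputStream
        IOUtils.copyBytes(in, out, 4096, true);      // true: close both streams when done
      }
    }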

Data Flow