
HADOOP TECHNOLOGY

ABSTRACT

Hadoop is the popular open source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets. Hadoop enables you to explore complex data using custom analyses tailored to your information and questions. Hadoop is the system that allows unstructured data to be distributed across hundreds or thousands of machines forming shared-nothing clusters, and that executes Map/Reduce routines on the data in that cluster. Hadoop has its own filesystem, which replicates data to multiple nodes so that if one node holding data goes down, there are at least two other nodes from which to retrieve that piece of information. This protects data availability from node failure, something that is critical when there are many nodes in a cluster (akin to RAID at the server level).

What is Hadoop?

Imagine your data is stored in a relational database on your desktop computer, and this desktop computer has no problem handling the load. Then your company starts growing very quickly, and that data grows to 10GB, then 100GB, and you start to reach the limits of your current desktop computer. So you scale up by investing in a larger computer, and you are then OK for a few more months. Then your data grows to 10TB, and then 100TB, and you are fast approaching the limits of that computer. Moreover, you are now asked to feed your application with unstructured data coming from sources

like Facebook, Twitter, RFID readers, sensors, and so on. Your management wants to derive information from both the relational data and the unstructured data, and wants this information as soon as possible. What should you do? Hadoop may be the answer!

Hadoop is an open source project of the Apache Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce and Google File System technologies as its foundation. It is optimized to handle massive quantities of data, which could be structured, unstructured or semi-structured, using commodity hardware, that is, relatively inexpensive computers. This massive parallel processing is done with great performance. However, it is a batch operation handling massive quantities of data, so the response time is not immediate. As of Hadoop version 0.20.2, updates are not possible, but appends will be possible starting in version 0.21. Hadoop replicates its data across different computers, so that if one goes down, the data are processed on one of the replicated computers.

Hadoop is not suitable for OnLine Transaction Processing workloads, where data are randomly accessed on structured data like a relational database. Hadoop is also not suitable for OnLine Analytical Processing or Decision Support System workloads, where data are sequentially accessed on structured data like a relational database to generate reports that provide business intelligence. Hadoop is used for Big Data. It complements OnLine Transaction Processing and OnLine Analytical Processing.


It is NOT a replacement for a relational database system.

So, what is Big Data? With all the devices available today to collect data, such as RFID readers, microphones, cameras, sensors, and so on, we are seeing an explosion in data being collected worldwide. Big Data is a term used to describe large collections of data (also known as datasets) that may be unstructured, and grow so large and so quickly that they are difficult to manage with regular database or statistics tools. Other interesting statistics providing examples of this data explosion are: there are more than 2 billion internet users in the world today, and 4.6 billion mobile phones in 2011; 7TB of data are processed by Twitter every day, and 10TB of data are processed by Facebook every day. Interestingly, approximately 80% of these data are unstructured. With this massive quantity of data, businesses need fast, reliable, deeper data insight. Therefore, Big Data solutions based on Hadoop and other analytics software are becoming more and more relevant.

This is a list of other open source projects related to Hadoop: Eclipse is a popular IDE donated by IBM to the open source community. Lucene is a text search engine library written in Java. HBase is the Hadoop database. Hive provides data warehousing tools to extract, transform and load data, and to query this data stored in Hadoop files. Pig is a platform for analyzing large data sets; it is a high-level language for expressing data analysis. Jaql, pronounced "jackal", is a query language for JavaScript Object Notation (JSON). ZooKeeper is a centralized configuration service and naming registry for large distributed systems.

Avro is a data serialization system. UIMA is an architecture for the development, discovery, composition and deployment of solutions for the analysis of unstructured data.

Let's now talk about examples of Hadoop in action. Early in 2011, Watson, a supercomputer developed by IBM, competed in the popular question-and-answer show "Jeopardy!". Watson was successful in beating the two most popular players in that game. Approximately 200 million pages of text were fed into it, using Hadoop to distribute the workload of loading this information into memory. Once the information was loaded, Watson used other technologies for advanced search and analysis.

In the telecommunications industry we have China Mobile, a company that built a Hadoop cluster to perform data mining on Call Data Records. China Mobile was producing 5-8TB of these records daily. By using a Hadoop-based system, they were able to process 10 times as much data as with their old system, and at one fifth of the cost.

In the media we have the New York Times, which wanted to host on its website all public domain articles from 1851 to 1922. It converted articles from 11 million image files to 1.5TB of PDF documents. This was implemented by one employee, who ran a job in 24 hours on a 100-instance Amazon EC2 Hadoop cluster, at a very low cost.

In the technology field we again have IBM, with IBM ES2, an enterprise search technology based on Hadoop, Lucene and Jaql. ES2 is designed to address unique challenges of enterprise search, such as the use of enterprise-specific vocabulary, abbreviations and acronyms. ES2 can perform mining tasks to build acronym libraries, regular expression patterns, and geo-classification rules.


There are also many internet or social network companies using Hadoop, such as Yahoo, Facebook, Amazon, eBay, Twitter, StumbleUpon, Rackspace, Ning, AOL, and so on. Yahoo is, of course, the largest production user, with an application running on a Hadoop cluster consisting of approximately 10,000 Linux machines. Yahoo is also the largest contributor to the Hadoop open source project.

Now, Hadoop is not a magic bullet that solves all kinds of problems. Hadoop is not good for processing transactions, because they require random access. It is not good when the work cannot be parallelized. It is not good for low-latency data access, not good for processing lots of small files, and not good for intensive calculations with little data.

Big Data solutions are more than just Hadoop. They can add analytic solutions to the mix to derive valuable information that combines structured legacy data with new unstructured data. Big Data solutions may also be used to derive information from data in motion. For example, IBM has a product called InfoSphere Streams that can be used to quickly determine customer sentiment for a new product based on Facebook or Twitter comments.

Finally, let's end this presentation with one final thought: cloud computing has gained tremendous traction in the past few years, and it is a perfect fit for Big Data solutions. Using the cloud, a Hadoop cluster can be set up in minutes, on demand, and it can run for as long as is needed without having to pay for more than what is used.

AWARENESS OF THE TOPOLOGY OF THE NETWORK

Hadoop has awareness of the topology of the network. This allows it to optimize where it sends the computations to be applied to the data. Placing the work as close as possible to the data it operates on maximizes the bandwidth available for reading the data. In the diagram, the data we wish to apply processing to is block B1, the light blue rectangle on node n1 on rack 1. When deciding which TaskTracker should receive a MapTask that reads data from B1, the best option is to choose the TaskTracker that runs on the same node as the data. If we can't place the computation on the same node, our next best option is to place it on a node in the same rack as the data. The worst case that Hadoop currently supports is when the computation must be done from a node in a different rack than the data. When rack-awareness is configured for your cluster, Hadoop will always try to run the task on the TaskTracker node with the highest bandwidth access to the data.

Let us walk through an example of how a file gets written to HDFS. First, the client submits a "create" request to the NameNode. The NameNode checks that the file does not already exist and that the client has permission to write the file. If that succeeds, the NameNode determines the DataNode to write the first block to. If the client is running on a DataNode, it will try to place the block there; otherwise, it chooses a DataNode at random. By default, data is replicated to two other places in the cluster, and a pipeline is built between the three DataNodes that hold the replicas. The second DataNode is a randomly chosen node on a rack other than that of the first replica of the block; this is to increase redundancy. The final replica is placed on a random node within the same rack as the second replica. The data is piped from the second DataNode to the third. To ensure the write was successful before continuing, acknowledgment packets are sent back from the third DataNode to the second, from the second DataNode to the first, and from the first DataNode to the client.


This process occurs for each of the blocks that make up the file, in this case, the second and the third block. Notice that, for every block, there is a replica on at least two racks. When the client is done writing to the DataNode pipeline and has received acknowledgements, it tells the NameNode that it is complete. The NameNode will check that the blocks are at least minimally replicated before responding.
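As a rough illustration of the client's side of this process, here is a minimal sketch using the HDFS Java API. The NameNode address and the /user/demo/example.txt path are placeholder values, and the block pipeline, replication and acknowledgements described above all happen inside the client library rather than in user code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            // Configuration is normally loaded from core-site.xml; the NameNode
            // address used here is an assumed placeholder.
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode:9000");

            FileSystem fs = FileSystem.get(conf);

            // The "create" call is the request that goes to the NameNode, which
            // checks that the file does not already exist and that we may write it.
            Path file = new Path("/user/demo/example.txt");
            FSDataOutputStream out = fs.create(file);

            // Writing streams the bytes through the DataNode pipeline; replication
            // and acknowledgement handling are done for us by the client library.
            out.writeUTF("hello HDFS");
            out.close();   // close completes once the blocks are minimally replicated

            fs.close();
        }
    }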

MAP REDUCE

We will look at "the shuffle" that connects the output of each mapper to the input of a reducer. This will take us into the fundamental data types used by Hadoop, and we will see an example data flow. Finally, we will examine Hadoop MapReduce fault tolerance, scheduling, and task execution optimizations.

To understand MapReduce, we need to break it into its component operations, map and reduce. Both of these operations come from functional programming languages. These are languages that let you pass functions as arguments to other functions. We'll start with an example using a traditional for loop. Say we want to double every element in an array. We would write code like that shown. The variable "a" enters the for loop as [1,2,3] and comes out as [2,4,6]. Each array element is mapped to a new value that is double the old value. The body of the for loop, which does the doubling, can be written as a function. We now say a[i] is the result of applying the function fn to a[i]. We define fn as a function that returns its argument multiplied by 2. This will allow us to generalize this code. Instead of only being able to use this code to double numbers, we could use it for any kind of map operation.

We will call this function "map" and pass the function fn as an argument to map. We now have a general function named map and can pass our "multiply by 2" function as an argument. Writing the function definition in one statement is a common idiom in functional programming languages.

In summary, we can rewrite a for loop as a map operation taking a function as an argument. Other than saving two lines of code, why is it useful to rewrite our code this way? Let's say that instead of looping over an array of three elements, we want to process a dataset with billions of elements and take advantage of a thousand computers running in parallel to quickly process those billions of elements. If we decided to add this parallelism to the original program, we would need to rewrite the whole program. But if we wanted to parallelize the program written as a call to map, we wouldn't need to change our program at all. We would just use a parallel implementation of map.

Reduce is similar. Say you want to sum all the elements of an array. You could write a for loop that iterates over the array and adds each element to a single variable named sum. But can we generalize this? The body of the for loop takes the current sum and the current element of the array and adds them to produce a new sum. Let's replace this with a function that does the same thing. We can replace the body of the for loop with an assignment of the output of a function fn to s. The fn function takes the sum s and the current array element a[i] as its arguments. The implementation of fn is a function that returns the sum of its two arguments. We can now rewrite the sum function so that the function fn is passed in as an argument.


This generalizes our sum function into a reduce function. We will also let the initial value for the sum variable be passed in as an argument. We can now call the function reduce whenever we need to combine the values of an array in some way, whether it is a sum, or a concatenation, or some other type of operation we wish to apply. Again, the advantage is that, should we wish to handle large amounts of data and parallelize this code, we do not need to change our program; we simply replace the implementation of the reduce function with a more sophisticated implementation.

This is what Hadoop MapReduce is: an implementation of map and reduce that is parallel, distributed, fault-tolerant, and able to efficiently run map and reduce operations over large amounts of data.
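To make the map and reduce generalizations above concrete, here is a small, self-contained Java sketch. The names map, reduce and fn simply mirror the narration and are not part of any Hadoop API.

    import java.util.function.IntBinaryOperator;
    import java.util.function.IntUnaryOperator;

    public class MapReduceIdea {
        // map: apply fn to every element of the array, as in the doubling example
        static int[] map(int[] a, IntUnaryOperator fn) {
            int[] result = new int[a.length];
            for (int i = 0; i < a.length; i++) {
                result[i] = fn.applyAsInt(a[i]);
            }
            return result;
        }

        // reduce: combine all elements into one value, starting from an initial value
        static int reduce(int[] a, int initial, IntBinaryOperator fn) {
            int s = initial;
            for (int i = 0; i < a.length; i++) {
                s = fn.applyAsInt(s, a[i]);
            }
            return s;
        }

        public static void main(String[] args) {
            int[] a = {1, 2, 3};
            int[] doubled = map(a, x -> x * 2);            // [2, 4, 6]
            int sum = reduce(doubled, 0, (s, x) -> s + x); // 12
            System.out.println(sum);
        }
    }

The calling code would not change if these two functions were replaced by parallel, distributed implementations, which is exactly the property Hadoop MapReduce exploits.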

MAPREDUCE – SUBMITTING A JOB

The process of running a MapReduce job on Hadoop consists of eight major steps. In the first step, the MapReduce program you've written tells the JobClient to run a MapReduce job. This sends a message to the JobTracker, which produces a unique ID for the job. The JobClient copies job resources, such as a jar file containing the Java code you have written to implement the map or the reduce task, to the shared file system, usually HDFS. Once the resources are in HDFS, the JobClient can tell the JobTracker to start the job. The JobTracker does its own initialization for the job. It calculates how to split the data so that it can send each "split" to a different mapper process to maximize throughput. It retrieves these "input splits" from the distributed file system. The TaskTrackers are continually sending heartbeat messages to the JobTracker. Now that the JobTracker has work for them, it will return a map task or a reduce task as a response to the heartbeat. The TaskTrackers need to obtain the code to execute, so they get it from the shared file system. Then they can launch a Java Virtual Machine with a child process running in it, and this child process runs your map code or your reduce code.
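As a sketch of what the first of these steps looks like in code, a minimal driver using the classic org.apache.hadoop.mapred API (the one that matches the JobClient/JobTracker terminology above) might be written as follows. The job name and paths are placeholders, and IdentityMapper/IdentityReducer simply pass records through where your own map and reduce classes would normally go.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class ExampleJobDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(ExampleJobDriver.class);
            conf.setJobName("example-job");  // placeholder job name

            // Your own Mapper and Reducer classes, packaged in the job jar,
            // would replace these identity implementations.
            conf.setMapperClass(IdentityMapper.class);
            conf.setReducerClass(IdentityReducer.class);
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);

            // Input and output locations on the shared file system, usually HDFS.
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // Submits the job through the JobClient and waits for completion;
            // this drives the JobTracker/TaskTracker steps described above.
            JobClient.runJob(conf);
        }
    }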

MAPREDUCE – MERGESORT/SHUFFLE

In this example, we have a job with a single map step and a single reduce step. The first step is the map step. It takes a subset of the full data set


called an input split, and applies to each row in the input split an operation you have written, such as the "multiply the value by two" operation we used in our earlier map example. There may be multiple map operations running in parallel with each other, each one processing a different input split. The output data is buffered in memory and spills to disk. It is sorted and partitioned by key using the default partitioner, and a merge sort sorts each partition. The partitions are shuffled amongst the reducers. For example, partition 1 goes to reducer 1. The second map task also sends its partition 1 to reducer 1. Partition 2 goes to the other reducer. Each reducer does its own merge steps and executes the code of your reduce task. For example, it could do a sum like we used in the earlier reduce example. This produces sorted output at each reducer.
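The "default partitioner" mentioned above is a hash partitioner. The following small sketch shows the idea behind how it assigns each key to a reducer; it is an illustration of the technique, not Hadoop's own class.

    // Sketch of how a hash-based partitioner assigns a key to one of the reducers.
    // Hadoop's default HashPartitioner uses the same basic idea: hash the key,
    // mask off the sign bit, and take the remainder by the number of reduce tasks.
    public class SimpleHashPartitioner {
        public static int getPartition(String key, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }

        public static void main(String[] args) {
            // With two reducers, every occurrence of the same key lands on the
            // same reducer, which is what makes the shuffle grouping work.
            System.out.println(getPartition("A", 2));
            System.out.println(getPartition("B", 2));
        }
    }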

MAPREDUCE – FUNDAMENTAL DATA TYPES

The data that flows into and out of the mappers and reducers takes a specific form. Data enters Hadoop in unstructured form, but before it gets to the first mapper, Hadoop has changed it into key-value pairs, with Hadoop supplying its own key. The mapper produces a list of key-value pairs. Both the key and the value may change from the k1 and v1 that came in to a k2 and v2. There can now be duplicate keys coming out of the mappers; the shuffle step will take care of grouping them together. The output of the shuffle is the input to the reducer step. Now, we still have a list of the v2's that come out of the mapper step, but they are grouped by their keys and there is no longer more than one record with the same key.

Finally, coming out of the reducer is, potentially, an entirely new key and value, k3 and v3. For example, if your reducer summed the values associated with each k2, your k3 would be equal to k2 and your v3 would be the sum of the list of v2s.

Let us look at an example of a simple data flow. Say we want to transform the input on the left to the output on the right. On the left, we just have letters. On the right, we have counts of the number of occurrences of each letter in the input. Hadoop does the first step for us. It turns the input data into key-value pairs and supplies its own key: an increasing sequence number. The function we write for the mapper needs to take these key-value pairs and produce something that the reduce step can use to count occurrences. The simplest solution is to make each letter a key and make every value a 1. The shuffle groups records having the same key together, so we see that B now has two values, both 1, associated with it. The reduce is simple: it just sums the values it is given to produce a sum for each key.
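A mapper and reducer implementing this letter-count flow could look roughly like the sketch below, written against the classic org.apache.hadoop.mapred API to match the terminology used in this document. The class names are illustrative, and the input is assumed to be one letter per line.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Mapper: (k1 = line offset, v1 = letter) -> (k2 = letter, v2 = 1)
    public class LetterCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Emit the letter itself as the key and 1 as the value.
            output.collect(value, ONE);
        }
    }

    // Reducer: (k2 = letter, list of v2 = [1, 1, ...]) -> (k3 = letter, v3 = count)
    class LetterCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }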

MAPREDUCE – FAULT TOLERANCE

The first kind of failure is a failure of the task, which could be due to a bug in the code of your map task or reduce task. The JVM tells the TaskTracker, and Hadoop counts this as a failed attempt and can start up a new task. What if the task hangs rather than fails? That is detected too, and the JobTracker can run your task again on a different machine, in case it was a hardware problem. If it continues to fail on each new attempt, Hadoop will fail the job altogether.

The next kind of failure is a failure of the TaskTracker itself.


The JobTracker will know because it is expecting a heartbeat. If it doesn't get a heartbeat, it removes that TaskTracker from the TaskTracker pool.

Finally, what if the JobTracker fails? There is only one JobTracker, so if it fails, your job fails.
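The retry behaviour described above is configurable per job. The sketch below shows the classic (pre-YARN) JobConf settings that control how many attempts a task gets and how long a task may stay silent before it is considered hung; the exact property names and defaults can vary between Hadoop versions, so treat the values as illustrative.

    import org.apache.hadoop.mapred.JobConf;

    public class RetrySettingsExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf();

            // Maximum attempts before the task (and ultimately the job) is failed.
            conf.setMaxMapAttempts(4);
            conf.setMaxReduceAttempts(4);

            // A task that reports no progress for this long (milliseconds) is
            // considered hung and will be killed and re-scheduled.
            conf.setLong("mapred.task.timeout", 600000L);
        }
    }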

MAPREDUCE – SCHEDULING & TASK EXECUTION

So far we have looked at how Hadoop executes a single job as if it were the only job on the system. But it would be unfortunate if all of your valuable data could only be queried by one user at a time. Hadoop schedules jobs using one of three schedulers. The simplest is the default FIFO scheduler. It lets users submit jobs while other jobs are running, but queues these jobs so that only one of them is running at a time. The fair scheduler is more sophisticated. It lets multiple users compete over cluster resources and tries to give every user an equal share. It also supports guaranteed minimum capacities. The capacity scheduler takes a different approach. From each user's perspective, it appears that they have the cluster to themselves with FIFO scheduling, but the users are actually sharing the resources.

Hadoop offers some configuration options for speeding up the execution of your map and reduce tasks under certain conditions. One such option is speculative execution. When a task takes a long time to run, Hadoop detects this and launches a second copy of your task on a different node. Because the tasks are designed to be self-contained and independent, starting a second copy does not affect the final answer. Whichever copy of the task finishes first has its output go to the next phase; the other task's redundant output is discarded. Another option for improving performance is to reuse the Java Virtual Machine. The default is to put each task in its own JVM for isolation purposes, but starting up a JVM can be relatively expensive when jobs are short, so you have the option to reuse the same JVM from one task to the next.
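Both of these options can be set per job. Here is a minimal sketch using the classic JobConf API; the helper method names may differ in newer Hadoop versions.

    import org.apache.hadoop.mapred.JobConf;

    public class TuningExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf();

            // Speculative execution: allow Hadoop to launch a backup copy of a
            // slow-running task; the first copy to finish wins.
            conf.setMapSpeculativeExecution(true);
            conf.setReduceSpeculativeExecution(true);

            // JVM reuse: run several tasks of the same job in the same JVM instead
            // of starting a fresh JVM per task (-1 means no limit on reuse).
            conf.setNumTasksToExecutePerJvm(-1);
        }
    }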

SUMMARY

One thing is certain: by the time the sixth annual Hadoop Summit comes around next year, Big Data will be bigger. Business applications that are emerging now will be furthered as more enterprises incorporate big data analytics and HDP solutions into their architecture. New solutions in fields like healthcare, with disease detection and coordination of patient care, will become more mainstream. Crime detection and prevention will benefit as the industry further harnesses the new technology. Hadoop and Big Data promise not only greatly enhanced marketing and product development; they also hold the power to drive positive global social impact around improved wellness outcomes, security, and many other areas. This, when you think about it, fits perfectly with the spirit of the Summit, which calls for continued stewardship of the Hadoop platform and promotion of associated technology by open-source and commercial entities.

REFERENCES

Google MapReduce

http://labs.google.com/papers/mapreduce.html

Hadoop Distributed File System

http://hadoop.apache.org/hdfs
