Transcript of Big data and hadoop

Page 1: Big data and hadoop

A Presentation on

BIG DATA AND HADOOP

Presented by: Mohit Tare

Page 2: Big data and hadoop

UNDERSTANDING BIG DATA –

What? How? Why?

Page 3: Big data and hadoop

Source: http://www.intel.com/content/www/us/en/communications/internet-minuteinfographic.html

Page 4: Big data and hadoop

Big Data Is Everywhere

• The Large Hadron Collider (LHC), a particle accelerator that will revolutionize our understanding of the workings of the Universe, will generate 60 terabytes of data per day – 15 petabytes (15 million gigabytes) annually.[1]
• Decoding the human genome originally took 10 years to process; now it can be achieved in one week.

• 12 terabytes of Tweets are created each day.[2]
• 100 terabytes of data are uploaded daily to Facebook.[3]

• Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data.[3]
• Converting 350 billion annual meter readings to better predict power consumption.[2]

Page 5: Big data and hadoop

What Is Big Data?

It's LARGE. It's COMPLEX. It's UNSTRUCTURED.

According to David Kellog, "Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze."[4]

O’Reilly defines big data the following way: “Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures.” [5]

Page 6: Big data and hadoop

An Obvious Question – How BIG is BIG DATA?

A common misconception is that Big Data is solely about VOLUME. While volume or size is part of the equation…

What about the SPEED at which data is generated?

And what about the VARIETY of data that a variety of sources are generating?

Page 7: Big data and hadoop

You guessed it right! The 3 Vs of Big Data

[6]

Page 8: Big data and hadoop

Why The Sudden Explosion Of Big Data ?

• An increased number and variety of data sources that generate large quantities of data:
  • Sensors (location, GPS, …)
  • Scientific computing (CERN, biological research, …)
  • Web 2.0 (Twitter, wikis, …)

• Realization that data is too valuable to delete:
  • Data analytics and data warehousing
  • Business intelligence

• Dramatic decline in the cost of hardware, especially storage:
  • Decline in the price of SSDs

Page 9: Big data and hadoop

BIG DATA is fuelled by CLOUD

• The properties of the cloud help us deal with Big Data.
• The challenges of Big Data drive the future design, enhancement, and expansion of the cloud.
• The two are in a never-ending cycle.

Page 10: Big data and hadoop

The Value of Big Data – Why Is It So Important?

[6]

Page 11: Big data and hadoop

MANAGING BIG DATA

Traditional Enterprise Architecture vs. Cluster Architecture

Hadoop – Managing Big data

Page 12: Big data and hadoop

TRADITIONAL ENTERPRISE ARCHITECTURE

Consists of:

• Servers

• SAN (Storage Area Network)

• Storage arrays

• Servers – a server is a physical computer dedicated to running one or more services that serve the needs of the users of other computers on the network.

• Storage arrays – a disk array is a disk storage system that contains multiple disk drives (SATA, SSD).

• Storage Area Network – a SAN is a dedicated network that provides access to consolidated data storage. SANs are primarily used to make storage devices, such as disk arrays, accessible to servers, so that the devices appear to the operating system like locally attached devices.

Page 13: Big data and hadoop

SOME ADVANTAGES AND DISADVANTAGES OF ENTERPRISE ARCHITECTURE

ADVANTAGES

• Servers and storage/disk arrays are loosely coupled – each can be expanded, upgraded, or retired independently of the other.

• The SAN enables services on any server to access any of the storage arrays, as long as they have access permission.

• ROBUST, with a MINIMAL FAILURE rate.

• Mainly designed for compute-intensive applications that operate on a subset of the data.

DISADVANTAGES

• Increasingly costly as it expands.

• But what about BIG DATA? The architecture cannot handle data-intensive operations like sorting.

Page 14: Big data and hadoop

What we want is an architecture that will give –

Page 15: Big data and hadoop

CLUSTER ARCHITECTURE

Consists of:

• Nodes – each having its own cores, memory, and disks

• Interconnection via a high-speed network (LAN)

• A cluster consists of a set of loosely connected computers that work together so that in many respects they can be viewed as a single system.

• The computers are usually connected to each other through fast local area networks, with each node (a computer used as a server) running its own instance of an operating system.

• The activities of the computing nodes are orchestrated by "clustering middleware", a software layer that sits atop the nodes and allows users to treat the cluster as, by and large, one cohesive computing unit.

Page 16: Big data and hadoop

Benefits of Using a Cluster Architecture

• Modular and scalable – easier to expand the system without bringing down the application that runs on top of the cluster.

• Data locality – data can be processed by the cores collocated in the same node or rack, minimizing any transfer over the network.

• Parallelization – a higher degree of parallelism via the simultaneous execution of separate portions of a program on different processors.

• All this at lower cost.

Page 17: Big data and hadoop

But Every Coin has two Sides!

• Complexity – the cost of administering a cluster of N machines.
• More storage – data is replicated to protect against failures.
• Data distribution – how do you distribute data evenly across the cluster?
• Requires careful management and massively parallel processing designs.

Page 18: Big data and hadoop

Riding the Elephant - Hadoop

SOLUTION

• Open-source Apache project initiated and led by Yahoo.

•Apache Hadoop is an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware.[8][9]

• Runs on:
  • Linux, Mac OS X, Windows, and Solaris
  • Commodity hardware

• Targets clusters of commodity PCs:
  • Cost-effective bulk computing

• Invented by Doug Cutting; funded by Yahoo in 2006, it reached its "web-scale capacity" in 2008.[7]

Doug Cutting

Page 19: Big data and hadoop

Where Does it All come from ?

• The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textual and structural information they were collecting, and then present meaningful and actionable results to users.

• Based on Google's MapReduce and the Google File System.

Page 20: Big data and hadoop

What Is Hadoop?

Hadoop consists of two core components[9]:

1. Hadoop Distributed File System (HDFS)

2. Hadoop distributed processing framework – using the Map/Reduce metaphor

Page 21: Big data and hadoop

Hadoop Distributed File System(HDFS)

Based on simple design principles:
• Split
• Scatter
• Replicate
• Manage data across the cluster

• Files are broken into large file blocks, each usually a multiple of the underlying storage block size – typically 64 MB or higher (a configuration sketch follows below).
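Where block size matters for a workload, it can be tuned through Hadoop's configuration. A minimal sketch in Java, assuming the 0.20-era property name dfs.block.size; the 128 MB value is illustrative, not from the slides:

    import org.apache.hadoop.conf.Configuration;

    public class BlockSizeConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Request 128 MB blocks for files created with this configuration
            // (assumed 0.20-era property name).
            conf.setLong("dfs.block.size", 128L * 1024 * 1024);
            // Read it back, falling back to the classic 64 MB default.
            System.out.println(conf.getLong("dfs.block.size", 64L * 1024 * 1024));
        }
    }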

Page 22: Big data and hadoop

Hadoop Distributed File System(HDFS) contd..

• File blocks are replicated to several DataNodes for reliability.
  • The default is 3 replicas, but the factor is settable per file (see the sketch below).

• Blocks are placed (writes are pipelined):
  • On the same node
  • On the same rack
  • On the other rack

• Clients read from the closest replica.

• If the replication for a block drops below its target, the block is automatically re-replicated.
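The per-file replication factor can also be set programmatically. A minimal sketch using the HDFS Java API's FileSystem.setReplication; the cluster connection and the file path are assumptions for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            // Reads core-site.xml/hdfs-site.xml to locate the NameNode.
            FileSystem fs = FileSystem.get(new Configuration());
            // Ask HDFS to keep 3 replicas of this (hypothetical) file.
            fs.setReplication(new Path("/user/hadoop/BigData.txt"), (short) 3);
        }
    }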

Page 23: Big data and hadoop

Hadoop Distributed File System(HDFS) contd..

• A single namespace for the entire cluster, managed by a single NameNode.[7]

• The NameNode is a master server that manages the file system namespace and regulates access to files by clients.

• DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the NameNode.

• When a DataNode fails, the NameNode:
  • identifies the file blocks that have been affected
  • retrieves copies from other healthy nodes
  • finds new nodes to store additional copies of them
  • updates the information in its tables

Page 24: Big data and hadoop

Hadoop Distributed File System(HDFS) contd..

• The client talks to both the NameNode and the DataNodes.
  • Data is not sent through the NameNode.
  • The NameNode is contacted first; the client can then connect directly to the DataNodes (a read sketch follows below).

HDFS Architecture[10]
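That read path shows up directly in client code: open() consults the NameNode for block locations, and the returned stream then pulls bytes straight from DataNodes. A minimal sketch (the file path is hypothetical):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadFromHdfs {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // open() asks the NameNode where the blocks live; the stream
            // then reads directly from the closest DataNode replicas.
            BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/hadoop/BigData.txt"))));
            System.out.println(in.readLine());  // print the first line
            in.close();
        }
    }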

Page 25: Big data and hadoop

ADVANTAGES
• Highly fault-tolerant
• High throughput
• Suitable for applications with large data sets
• Streaming access to file system data
• Can be built out of commodity hardware

Hadoop Distributed File System(HDFS) contd..

TWO POINTS OF FAILURE
• The NameNode can become a single point of failure
• Cluster rebalancing

SOLUTIONS
• Enterprise editions maintain a backup of the NameNode.
• The architecture is compatible with data rebalancing schemes, but this is still an area of research.

Page 26: Big data and hadoop

Hadoop Map/Reduce

• Map/Reduce is a programming model for efficient distributed computing.

• The user submits a MapReduce job.
• The system:
  • partitions the job into lots of tasks
  • schedules tasks on nodes close to the data
  • monitors tasks
  • kills and restarts tasks if they fail/hang/disappear[11]

Consists of two phases:
1. Mapper Phase
2. Reduce Phase

Page 27: Big data and hadoop

Hadoop Map/Reduce contd …

1. Mapper Phase
• The data are fed into the map function as key/value pairs to produce intermediate key/value pairs.
  • Input: (key1, value1) pairs
  • Output: (key2, value2) pairs

• All nodes perform the same computation.

• Uses data locality to increase performance.

• Because all data blocks stored in HDFS are of equal size, the mapper computation can be divided equally across nodes (a mapper sketch follows below).
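As a sketch of what such a mapper looks like in Hadoop's Java API (org.apache.hadoop.mapreduce), here is a generic word-count mapper that emits (word, 1) for every word in an input line. The class name is illustrative, and it counts all words rather than only the three phrases of the upcoming example:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Input: (byte offset, line of text); output: (word, 1) pairs.
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }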

Page 28: Big data and hadoop

Hadoop Map/Reduce contd …

Reduce Phase

•Once the mapping is done, all the intermediate results from various nodes are reduced to create the final output.

• Has 3 phases:
  • shuffle,
  • sort, and
  • reduce.[12]

• Shuffle – the input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers.

• Sort – in this stage the framework groups Reducer inputs by key (since different mappers may have output the same key). The shuffle and sort phases occur simultaneously; while map outputs are being fetched, they are merged.

• Reduce – in this phase the reduce method is called for each <key, (list of values)> pair in the grouped inputs and produces the final outputs (a reducer sketch follows below).
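A matching reducer sketch in the same Java API: for each word, the framework hands it the grouped list of 1s the mappers emitted, and it writes the summed count. Again, the class name is illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {  // grouped input: <word, [1, 1, ...]>
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));  // final output: <word, total>
        }
    }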

Page 29: Big data and hadoop

Understood or not? Let's work through an example.

• Suppose you want to analyze blog entries stored in BigData.txt and count the number of times the words Hadoop, Big Data, and Green Plum appear in it.

• Suppose 3 nodes participate in the task. In the Mapper Phase, each node receives the address of a file block and a pointer to the mapper function.
• The mapper function calculates the word counts.

[13]

Page 30: Big data and hadoop

Let's understand it by an example

• The output of the mapper function will be a set of <key, value> pairs.

FINAL OUTPUT OF MAPPER PHASE

Page 31: Big data and hadoop

Let's understand it by an example

• The Reduce Phase sums and reduces the output.
• A node is selected to perform the reduce function, and the other nodes send their output to that node.

• After the shuffle step of the Reduce Phase

Page 32: Big data and hadoop

Let's understand it by an example

• After the sort step of the Reduce Phase

And FINALLY

Page 33: Big data and hadoop

A bit more on Map/Reduce

• The JobTracker keeps track of all the MapReduce jobs running on various nodes. It schedules the jobs, keeps track of all the map and reduce tasks running across the nodes, and, if any one of those tasks fails, reallocates it to another node.

• The TaskTracker performs the map and reduce tasks assigned by the JobTracker. It also constantly sends a heartbeat message to the JobTracker, which helps the JobTracker decide whether to delegate a new task to this particular node.

Page 34: Big data and hadoop

Accessibility and Implementation

• HDFS
  • HDFS provides a Java API for applications to use.
  • Python access is also used in many applications.
  • HDFS provides a command line interface called the FS shell that lets the user interact with data in HDFS.
  • The syntax of the commands is similar to bash.

Example: to create a directory
Usage: hadoop dfs -mkdir <paths>
Example: hadoop dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2

• Map/Reduce
  • Java API with prebuilt classes and interfaces (a driver sketch follows below).
  • Python and C++ can also be used.
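A driver sketch that wires mapper and reducer classes like the ones sketched earlier into a runnable job; the 0.20-era Job constructor and the input/output paths are assumptions for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/user/hadoop/BigData.txt"));
            FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/wordcount-out"));
            // Submit the job and wait; exit non-zero if it fails.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }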

Page 35: Big data and hadoop

C++ example on Word Count[14]

Page 36: Big data and hadoop

And there is more and more …

PIG

Page 37: Big data and hadoop

Who uses Hadoop?

Page 38: Big data and hadoop

References

[1] Randal E. Bryant, Randy H. Katz, and Edward D. Lazowska, "Big-Data Computing: Creating revolutionary breakthroughs in commerce, science, and society", Version 8, December 22, 2008. Available: http://www.cra.org/ccc/docs/init/Big_Data.pdf [Accessed Sept. 9, 2012]

[2] What is Big Data? [Online]. Available: http://www-01.ibm.com/software/data/bigdata/ [Accessed Sept. 9, 2012]

[3] A Comprehensive List of Big Data Statistics [Online]. Available: http://wikibon.org/blog/big-data-statistics/ [Accessed Sept. 9, 2012]

[4] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers, Big Data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute, May 2011. Available: http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation [Accessed Sept. 10, 2012]

[5] What Is Big Data?, O'Reilly Radar, January 11, 2012 [Online]. Available: http://radar.oreilly.com/2012/01/what-is-big-data.html [Accessed Sept. 10, 2012]

[6] Big Data, Wipro [Online]. Available: http://www.slideshare.net/wiprotechnologies/wipro-infographicbig-data [Accessed Sept. 11, 2012]

Page 39: Big data and hadoop

References

[7] Owen O'Malley, "Introduction to Hadoop" [Online]. Available: http://wiki.apache.org/hadoop/HadoopPresentations [Accessed Sept. 17, 2012]

[8] Hadoop at Yahoo!, Yahoo! Developer Network [Online]. Available: http://developer.yahoo.com/hadoop/ [Accessed Sept. 17, 2012]

[9] Elif Dede, Madhusudhan Govindaraju, Dan Gunter, and Lavanya Ramakrishnan, "Riding the elephant: managing ensembles with Hadoop", in MTAGS '11: Proceedings of the 2011 ACM International Workshop on Many Task Computing on Grids and Supercomputers, pages 49-58 [Online]. Available: ACM Digital Library, http://dl.acm.org/citation.cfm?id=2132876.2132888 [Accessed Sept. 17, 2012]

[10] HDFS Architecture, Hadoop 0.20 Documentation [Online]. Available: http://hadoop.apache.org/docs/r0.20.2/hdfs_design.html [Accessed Sept. 20, 2012]

Page 40: Big data and hadoop

References

[11] Doug Cutting, "Hadoop Overview" [Online]. Available: http://wiki.apache.org/hadoop/HadoopPresentations [Accessed Sept. 17, 2012]

[12] Map/Reduce Tutorial, Hadoop 0.20 Documentation [Online]. Available: http://hadoop.apache.org/docs/r0.20.2/mapred_tutorial.html#Reducer [Accessed Sept. 17, 2012]

[13] Patricia Florissi, Big Ideas: Demystifying Hadoop [Video]. Available: http://www.youtube.com/watch?v=XtLXPLb6EXs&feature=relmfu

[14] C/C++ MapReduce Code & Build, Hadoop Wiki, C++ Word Count [Online]. Available: http://wiki.apache.org/hadoop/C%2B%2BWordCount [Accessed October 1, 2012]

Page 41: Big data and hadoop

Thank You !

And …

Stay Udacious!