Introduction to Hadoop



Apache Hadoop: Large-Scale Data Processing

Sharath Bandaru & Sai Dinesh Koppuravuri
Advanced Topics Presentation
ISYE 582: Engineering Information Systems

Overview

Understanding Big Data

Structured/Unstructured Data

Limitations Of Existing Data Analytics Structure

Apache Hadoop

Hadoop Architecture

HDFS

Map Reduce

Conclusions

References

Understanding Big Data

Big Data is creating large and growing files, measured in terabytes (10^12 bytes) and petabytes (10^15 bytes), and it is largely unstructured.

Structured/Unstructured Data

Why now?

Data Growth: from 1980 to 2013, structured data shrank to roughly 20% of the total while unstructured data grew to roughly 80%. (Source: Cloudera, 2013)

Challenges Posed by Big Data

- Velocity: 400 million tweets in a day on Twitter; 1 million transactions by Wal-Mart every hour.

- Volume: 2.5 petabytes created by Wal-Mart transactions in an hour.

- Variety: videos, photos, text messages, images, audio, documents, emails, etc.

Limitations of the Existing Data Analytics Architecture

In the traditional pipeline, instrumentation and collection feed a storage-only grid holding the original raw data; an ETL compute grid loads aggregated data into an RDBMS, which serves BI reports and interactive apps. This has three problems:

- Moving data to compute doesn't scale.

- You can't explore the original high-fidelity raw data.

- Archiving = premature data death.

So What Is Apache Hadoop?

A set of tools that supports running applications on big data. Core Hadoop has two main systems:

- HDFS: self-healing, high-bandwidth clustered storage.

- Map Reduce: distributed, fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.

History

(Timeline figure of Hadoop's origins; source: Cloudera, 2013.)

The Key Benefit: Agility/Flexibility

Schema-on-Write (RDBMS):

- Schema must be created before any data can be loaded.

- An explicit load operation has to take place which transforms the data to the DB-internal structure.

- New columns must be added explicitly before new data for such columns can be loaded into the database.

- Pros: reads are fast; standards and governance.

Schema-on-Read (Hadoop):

- Data is simply copied to the file store; no transformation is needed.

- A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding).

- New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.

- Pros: loads are fast; flexibility and agility. (A sketch of the idea follows below.)
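To make schema-on-read concrete, here is a minimal plain-Java sketch (not Hive's actual SerDe interface): the raw file is stored untouched, and a hypothetical deserialize method extracts the required columns only when the data is read.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Schema-on-read sketch: raw lines are stored as-is and a
// (hypothetical) deserializer extracts columns only at read time.
public class SchemaOnReadDemo {

    // Hypothetical "SerDe": split a raw CSV line and pick the columns
    // the query needs. Changing the schema means changing this method,
    // not reloading the data.
    static String[] deserialize(String rawLine) {
        String[] fields = rawLine.split(",");
        // Late binding: we decide here that column 0 is "user" and
        // column 2 is "amount"; older files may simply lack column 2.
        String user = fields[0];
        String amount = fields.length > 2 ? fields[2] : "N/A";
        return new String[] { user, amount };
    }

    public static void main(String[] args) throws IOException {
        // The data was "loaded" by plain file copy -- no transform step.
        List<String> rawLines = Files.readAllLines(Paths.get(args[0]));
        for (String line : rawLines) {
            String[] cols = deserialize(line);
            System.out.println(cols[0] + " -> " + cols[1]);
        }
    }
}
```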

Use the Right Tool for the Right Job

Relational databases, use when:

- Interactive OLAP analytics (< 1 sec)

- Multistep ACID transactions

- 100% SQL compliance

Hadoop, use when:

- Structured or not (flexibility)

- Scalability of storage/compute

- Complex data processing

The Traditional Approach: the enterprise approach moves all the big data to one powerful computer, which quickly hits its processing limit.

Hadoop Architecture

A Hadoop cluster has one master and many slaves. The master runs the Job Tracker (the Map Reduce side) and the Name Node (the HDFS side); every slave runs a Task Tracker and a Data Node. An application submits its job to the Job Tracker, which splits it into tasks and schedules them on the Task Trackers, while the Name Node records which Data Nodes hold each block of the input.

HDFS: Hadoop Distributed File System

A given file is broken into blocks (default = 64 MB); the blocks are then replicated across the cluster (default = 3 replicas).

Optimized for:

- Throughput

- Put/Get/Delete

- Appends

Block replication for:

- Durability

- Availability

- Throughput

Block replicas are distributed across servers and racks. (A client-side sketch follows below.)
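A minimal client sketch using Hadoop's FileSystem API, writing a file and reading it back; the namenode address and paths are placeholders. The client only streams bytes; HDFS handles block splitting and replication behind the scenes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Minimal HDFS put/get sketch; the namenode URI is an assumed placeholder.
public class HdfsPutGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/sample.txt");

        // Put: HDFS splits the stream into blocks and replicates each one.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }
        fs.setReplication(file, (short) 3); // explicit; 3 is the default

        // Get: the client is routed to whichever nodes hold block replicas.
        IOUtils.copyBytes(fs.open(file), System.out, 4096, true);
    }
}
```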

Fault Tolerance for Data (HDFS)

If a Data Node fails, its blocks survive as replicas on other Data Nodes, and the Name Node directs re-replication so the cluster returns to the configured replication factor.

Fault Tolerance for Processing (Map Reduce)

If a Task Tracker fails, the Job Tracker reschedules its tasks on another slave. The Job Tracker's tables are backed up, so job state survives as well.

Map Reduce

Input data is split across many parallel Map tasks; their output is shuffled (grouped by key) and fed to Reduce tasks, which emit the final results.

Understanding the concept of Map Reduce

The Story of Sam

Sam's mother gave him an apple, believing that an apple a day keeps the doctor away.

Understanding the concept of Map Reduce

Sam thought of drinking the apple instead. He used a knife to cut the apple and a blender to make the juice.

Understanding the concept of Map Reduce

The next day, Sam applied his invention to all the fruits he could find in the fruit basket.

(map f (v1, v2, ..., vn)) gives (f(v1), f(v2), ..., f(vn))

(reduce g (v1, v2, ..., vn)) gives a single value

This is the classical notion of Map Reduce in functional programming: a list of values is mapped into another list of values, which then gets reduced into a single value (see the stream sketch below).
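The same classical notion expressed with Java streams; squaring a list and summing it is just an illustrative choice of f and g.

```java
import java.util.List;

// The functional-programming notion from the slide: map a list of
// values into another list, then reduce that list to a single value.
public class MapReduceNotion {
    public static void main(String[] args) {
        List<Integer> values = List.of(1, 2, 3, 4, 5);

        int result = values.stream()
                .map(v -> v * v)          // map: list -> list (squares)
                .reduce(0, Integer::sum); // reduce: list -> single value

        System.out.println(result); // 1 + 4 + 9 + 16 + 25 = 55
    }
}
```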

Understanding the concept of Map Reduce

18 years later, Sam got his first job at Tropicana for his expertise in making juices. Now it's not just one basket but a whole container of fruits, and they produce a list of juice types, each separately.

But Sam had just ONE knife and ONE blender: NOT ENOUGH for large data with a list of values as output. Wait!

Understanding the concept of Map Reduce

Brave Sam set out to parallelize the work on the container of fruits:

- Each input to a map is a list of <key, value> pairs (illustrated on the slide with fruits).

- Each output of a map is a list of <key, value> pairs.

Grouped by key, each input to a reduce is a <key, value-list> pair (possibly a list of these, depending on the grouping/hashing mechanism).

Reduced into a list of values

With that, Sam had implemented a parallel version of his innovation; a sketch of the group-by-key step follows below.
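A plain-Java sketch of the shuffle's group-by-key step (no Hadoop involved); the fruit pairs are illustrative.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the shuffle step: map outputs are (key, value) pairs, and
// all values sharing a key are grouped together before reduce runs.
public class ShuffleSketch {
    public static void main(String[] args) {
        // Pretend map output: (fruit, 1) pairs, as in Sam's story.
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                new SimpleEntry<>("apple", 1),
                new SimpleEntry<>("orange", 1),
                new SimpleEntry<>("apple", 1));

        // Group by key: each reduce input becomes (key, list-of-values).
        Map<String, List<Integer>> grouped = mapOutput.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue,
                                Collectors.toList())));

        // e.g. {apple=[1, 1], orange=[1]} (map ordering not guaranteed)
        System.out.println(grouped);
    }
}
```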

Understanding the concept of Map Reduce

Sam realized:

- To create his favorite mixed-fruit juice, he can use a combiner after the reducers.

- If several <key, value> pairs fall into the same group (based on the grouping/hashing algorithm), the blender (reducer) is used separately on each of them.

- The knife (mapper) and blender (reducer) should not contain residue after use: they are side-effect free.

Source: (Ekanayake, 2010).
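Putting the story back into Hadoop terms, here is a sketch in the style of the canonical WordCount example, counting fruit names instead of words (class names and paths are illustrative). One caveat: in Hadoop's API the combiner runs on map-side output before the shuffle rather than after the reducers; because counting is associative, the reducer class can double as the combiner.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// WordCount-style job counting fruits: the mapper is Sam's knife, the
// reducer his blender. Both are side-effect free.
public class FruitCount {

    public static class FruitMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text fruit = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (fruit, 1) for every fruit name in the input line.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                fruit.set(itr.nextToken());
                context.write(fruit, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            // Sum all counts grouped under this fruit.
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "fruit count");
        job.setJarByClass(FruitCount.class);
        job.setMapperClass(FruitMapper.class);
        // Counting is associative, so the reducer doubles as a combiner,
        // pre-aggregating map output before the shuffle.
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```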

Conclusions

The key benefits of Apache Hadoop:

1) Agility/Flexibility (Quickest Time to Insight)

2) Complex Data Processing (Any Language, Any Problem)

3) Scalability of Storage/Compute (Freedom to Grow)

4) Economical Storage (Keep All Your Data Alive Forever)

The key systems for Apache Hadoop are:

1) Hadoop Distributed File System: self-healing, high-bandwidth clustered storage.

2) Map Reduce: distributed, fault-tolerant resource management coupled with scalable data processing.

References

Ekanayake, S. (2010, March). Map Reduce: The Story of Sam. Retrieved April 13, 2013, from http://esaliya.blogspot.com/2010/03/mapreduce-explained-simply-as-story-of.html.

Dean, J., & Ghemawat, S. (2004, December). MapReduce: Simplified Data Processing on Large Clusters.

The Apache Software Foundation. (2013, April). Hadoop. Retrieved April 19, 2013, from http://hadoop.apache.org/.

Drost, I. (2010, February). Apache Hadoop: Large Scale Data Analysis Made Easy. Retrieved April 13, 2013, from http://www.youtube.com/watch?v=VFHqquABHB8.

Awadallah, A. (2011, November). Introducing Apache Hadoop: The Modern Data Operating System. Retrieved April 15, 2013, from http://www.youtube.com/watch?v=d2xeNpfzsYI.