Hadoop V.01


  • 7/31/2019 Hadoop V.01


    HADOOP/BIG DATA


    About Big Data

Big data is a general term used to describe the voluminous amount of unstructured and semi-structured data a company creates -- data that would take too much time and cost too much money to load into a relational database for analysis. The term is often used when speaking about petabytes and exabytes of data.

When dealing with larger datasets, organizations face difficulties in creating, manipulating, and managing Big Data. Big data is particularly a problem in business analytics because standard tools and procedures are not designed to search and analyze massive datasets.

A primary goal of looking at big data is to discover repeatable business patterns. Unstructured data, most of it located in text files, accounts for at least 80% of an organization's data. If left unmanaged, the sheer volume of unstructured data that's generated each year within an enterprise can be costly in terms of storage. Unmanaged data can also pose a liability if information cannot be located in the event of a compliance audit or lawsuit.


Big data spans three dimensions

Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.

Velocity: Often time-sensitive, big data must be used as it is streaming in to the enterprise in order to maximize its value to the business.

Variety: Big data extends beyond structured data, including unstructured data of all varieties: text, audio, video, click streams, log files, and more.


Customer challenges for securing Big Data

Awareness & understanding are lacking: customers are not actively talking about security concerns, and they need help understanding threats in a Big Data environment.

Company policies & laws add complexity: the main considerations are synchronizing retention and disposition policies across jurisdictions and moving data across countries. Customers need help navigating frameworks and changes.

Storage efficiency challenges for Big Data

De-duplication
Challenge: In most instances, data is random and inconsistent, not duplicated.
Opportunity: There is a need for more intelligent identification of data.

Compression
Challenge: Compression normally happens instead of de-duplication, yet it will compress duplicated data regardless.
Opportunity: There is a need for an automated way of first de-duplicating, then compressing.
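The "de-duplicate, then compress" opportunity described above can be sketched in a few lines. This is a hypothetical illustration using fixed-size chunk hashing, not any particular product's pipeline; the function names and the 64-byte chunk size are made up for the example.

```python
import hashlib
import zlib

def dedup_then_compress(data, chunk_size=64):
    """Split data into fixed-size chunks, store each unique chunk once
    (keyed by its SHA-256 digest), then compress the unique chunks."""
    order = []   # sequence of chunk digests, used to reconstruct the data
    store = {}   # digest -> compressed unique chunk
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = zlib.compress(chunk)  # compress only unique chunks
        order.append(digest)
    return order, store

def restore(order, store):
    """Rebuild the original bytes from the chunk sequence and store."""
    return b"".join(zlib.decompress(store[d]) for d in order)
```

With highly repetitive input, the store stays small because each repeated chunk is kept (and compressed) only once, which is the point of doing de-duplication before compression.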


About Hadoop

Hadoop is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers.

Solution for Big Data: it deals with the complexities of high volume, velocity, and variety of data. It enables applications to work with thousands of nodes and petabytes of data. It is:

Reliable: The software is fault tolerant; it expects and handles hardware and software failures.

Scalable: Designed for massive scale of processors, memory, and locally attached storage.

Distributed: Handles replication and offers a massively parallel programming model, MapReduce.
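MapReduce, mentioned above, structures a job as a map phase, a shuffle that groups intermediate values by key, and a reduce phase. A minimal single-process sketch of that model follows; real Hadoop jobs run these phases in parallel across the cluster and are typically written in Java against the MapReduce API.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input.
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase("big data big clusters")))
# -> {'big': 2, 'data': 1, 'clusters': 1}
```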


    About Apache Hadoop Software Library

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.

It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.


    Market Drivers for Apache Hadoop

    Business Drivers

    High-value projects that require use of more data

    Belief that there is great ROI in mastering big data

    Financial Drivers

Growing cost of data systems as a percentage of IT spend; cost advantage of commodity hardware + open source

    Enables departmental-level big data strategies


    Trend

The old way

Operational systems keep only current records, short history.

Analytics systems keep only conformed/cleaned/digested data.

Unstructured data is locked away in operational silos.

Archives are offline: inflexible; new questions require system redesigns.

The new trend

Keep raw data in Hadoop for a long time.

Able to produce a new analytics view on demand.

Keep a new copy of data that was previously in silos.

Can directly run new reports and experiments at low incremental cost.

New products/services can be added very quickly.

Agile outcomes justify the new infrastructure.


Hadoop is part of a larger framework of related technologies:

HDFS: Hadoop Distributed File System.

HBase: Column-oriented, non-relational, schema-less, distributed database modeled after Google's BigTable. Promises random, real-time read/write access to Big Data.

Hive: Data warehouse system that provides a SQL interface. Data structure can be projected ad hoc onto unstructured underlying data.

Pig: A platform for manipulating and analyzing large data sets; a high-level language for analysts.

ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
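Hive's ability to project structure ad hoc onto unstructured underlying data is often called schema-on-read: the raw file stays untouched, and a schema is applied only when a query reads it. A rough Python analogy of that idea (the log format and column names here are invented for illustration, not a real Hive mechanism):

```python
# Raw, unstructured file content: plain whitespace-separated log lines.
raw_log = """\
2019-07-31 10:00:01 alice login
2019-07-31 10:00:05 bob login
2019-07-31 10:02:11 alice purchase
"""

# The projected "schema" lives with the reader, never in the file itself.
columns = ("date", "time", "user", "action")

def read_with_schema(text):
    # Apply the schema at read time, one row per line.
    for line in text.splitlines():
        yield dict(zip(columns, line.split()))

# A query in the spirit of: SELECT user FROM logs WHERE action = 'login'
logins = [row["user"] for row in read_with_schema(raw_log)
          if row["action"] == "login"]
# -> ['alice', 'bob']
```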


    Organizations using Hadoop


    Hadoop Developer

Core contributor since Hadoop's infancy

    Project Lead for Hadoop Distributed File System

    Facebook (Hadoop, Hive, Scribe) Yahoo! (Hadoop in Yahoo Search)

    Veritas (San Point Direct, Veritas File System)

    IBM Transarc (Andrew File System)

    UW Computer Science Alumni (Condor Project)


Why is Hadoop needed?

Need to process multi-petabyte datasets.

Expensive to build reliability into each application.

Failure is expected, rather than exceptional: the number of nodes in a cluster is not constant, and nodes fail every day.

Need common infrastructure that is efficient, reliable, and open source (Apache License).

These goals are the same as Condor's, but the workloads are I/O bound, not CPU bound.


Hadoop is particularly useful when:

Complex information processing is needed.

Unstructured data needs to be turned into structured data.

Queries can't be reasonably expressed using SQL.

Heavily recursive algorithms are involved.

Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing.

Machine learning is involved.

Data sets are too large to fit into database RAM or discs, or require too many cores (tens of TB up to PB).

Data value does not justify the expense of constant real-time availability; archives or special-interest info can be moved to Hadoop and remain available at lower cost.

Results are not needed in real time.

Fault tolerance is critical.

Significant custom coding would otherwise be required to handle job scheduling.


Hadoop is being used as a:

Staging layer: The most common use of Hadoop in enterprise environments is as a Hadoop ETL layer: preprocessing, filtering, and transforming vast quantities of semi-structured and unstructured data for loading into a data warehouse.

Event analytics layer: Large-scale log processing of event data: call records, behavioral analysis, social network analysis, clickstream data, etc.

Content analytics layer: Next-best action, customer experience optimization, social media analytics. MapReduce provides the abstraction layer for integrating content analytics with more traditional forms of advanced analysis.
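The staging-layer use above, filtering and transforming semi-structured records before they reach a warehouse, is the kind of logic often run as a Hadoop Streaming mapper over raw log files. A hypothetical standalone version of such a transform (the event format and field names are assumptions made for this sketch):

```python
import json

def etl_record(line):
    """Parse one semi-structured JSON event, drop malformed or
    irrelevant records, and normalize the rest for warehouse loading."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None                      # filter: skip unparseable lines
    if event.get("type") != "click":
        return None                      # filter: keep only click events
    return {                             # transform: a flat, typed row
        "user_id": int(event["user"]),
        "page": event.get("page", "unknown"),
    }

lines = [
    '{"type": "click", "user": "7", "page": "/home"}',
    'not json at all',
    '{"type": "view", "user": "8"}',
]
rows = [r for r in (etl_record(l) for l in lines) if r is not None]
# -> [{'user_id': 7, 'page': '/home'}]
```

In a real Hadoop Streaming job the same function would read lines from stdin and emit rows to stdout, letting the framework parallelize it across the cluster.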


Karmasphere released the results of a survey of 102 Hadoop developers regarding adoption, use, and future plans.


    What Data Projects is Hadoop Driving?


    Are Companies Adopting Hadoop?

More than one-half (54%) of organizations surveyed are using or considering Hadoop for large-scale data processing needs.

More than twice as many Hadoop users report being able to create new products and services and enjoy cost savings beyond those using other platforms; over 82% benefit from faster analyses and better utilization of computing resources.

87% of Hadoop users are performing or planning new types of analyses with large-scale data.

94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; 82% can now retain more of their data.

Organizations use Hadoop in particular to work with unstructured data such as logs and event data (63%).

More than two-thirds of Hadoop users perform advanced analysis: data mining or algorithm development and testing.


Hadoop at LinkedIn

LinkedIn leverages Hadoop to transform raw data into rich features using knowledge aggregated from LinkedIn's 125-million-member base. LinkedIn then uses Lucene to do real-time recommendations, and also Lucene on Hadoop to bridge offline analysis with user-facing services. The streams of user-generated information, referred to as social media feeds, may contain valuable, real-time information on LinkedIn members' opinions, activities, and mood states.


Hadoop at Foursquare

Foursquare was having problems handling the huge amount of data it deals with. Its business development managers, venue specialists, and upper management needed access to the data in order to inform some important decisions.

To enable easy access to data, Foursquare engineering decided to use Apache Hadoop and Apache Hive in combination with a custom data server (built in Ruby), all running in Amazon EC2. The data server is built using Rails, MongoDB, Redis, and Resque and communicates with Hive using the Ruby Thrift client.


    Hadoop @ Orbitz

Orbitz needed an infrastructure that provides:

Long-term storage of large data sets;

Open access for developers and business analysts;

Ad-hoc querying of data;

Rapid deployment of reporting applications.

They moved to Hadoop and Hive to provide reliable and scalable storage and processing of data on inexpensive commodity hardware.


    HDFS Architecture

[Diagram: HDFS architecture. A Namenode holds the metadata (file names and replica counts, e.g. for /home/foo/data) and serves clients' metadata and block operations. Clients write blocks to and read blocks from Datanodes, which are spread across racks (Rack 1, Rack 2); the Namenode directs block replication between Datanodes.]
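The HDFS architecture above can be made concrete with a toy model of the Namenode's role: it keeps only metadata (which blocks make up a file and which datanodes hold each replica), while the datanodes hold the actual bytes. This sketch is a deliberate simplification: it places replicas round-robin on distinct datanodes, whereas real HDFS uses a rack-aware placement policy, and the class and parameter names here are invented for the example.

```python
import itertools

class ToyNamenode:
    """Toy model of HDFS metadata: file -> blocks -> replica locations."""

    def __init__(self, datanodes, block_size=4, replication=2):
        self.datanodes = datanodes
        self.block_size = block_size
        self.replication = replication
        self.metadata = {}  # path -> list of (block_bytes, [datanode, ...])
        self._rr = itertools.cycle(range(len(datanodes)))

    def write(self, path, data):
        # Split the file into fixed-size blocks and assign each replica
        # to a datanode (round-robin stands in for rack-aware placement).
        blocks = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            targets = [self.datanodes[next(self._rr)]
                       for _ in range(self.replication)]
            blocks.append((block, targets))
        self.metadata[path] = blocks

    def read(self, path):
        # A client reassembles the file by reading each block from any
        # one of its replica holders; here we just concatenate the blocks.
        return b"".join(block for block, _ in self.metadata[path])
```

In real HDFS the Namenode never touches block data at all: clients stream bytes directly to and from datanodes, and the Namenode only answers the metadata and block-placement questions modeled here.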