Hadoop V.01
-
7/31/2019 Hadoop V.01
HADOOP/BIG DATA
-
About Big Data
Big data is a general term used to describe the voluminous amount of unstructured and semi-structured data a company creates -- data that would take too much time and cost too much money to load into a relational database for analysis. The term is often used when speaking about petabytes and exabytes of data.
When dealing with larger datasets, organizations face difficulties in creating, manipulating, and managing big data. Big data is a particular problem in business analytics, because standard tools and procedures are not designed to search and analyze massive datasets.
A primary goal in looking at big data is to discover repeatable business patterns. Unstructured data, most of it located in text files, accounts for at least 80% of an organization's data. If left unmanaged, the sheer volume of unstructured data generated each year within an enterprise can be costly in terms of storage. Unmanaged data can also pose a liability if information cannot be located in the event of a compliance audit or lawsuit.
-
Big data spans three dimensions:
Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
Velocity: Often time-sensitive, big data must be used as it streams into the enterprise in order to maximize its value to the business.
Variety: Big data extends beyond structured data, including unstructured data of all varieties: text, audio, video, click streams, log files, and more.
-
Customer challenges for securing Big Data
Awareness & understanding are lacking: Customers are not actively talking about security concerns, and they need help understanding threats in a Big Data environment.
Company policies & laws add complexity: The main considerations are synchronizing retention and disposition policies across jurisdictions and moving data across countries. Customers need help navigating frameworks and changes.
Storage efficiency challenges for Big Data
De-duplication -- Challenge: in most instances, data is random and inconsistent, not duplicated. Opportunity: there is a need for more intelligent identification of data.
Compression -- Challenge: compression normally happens instead of de-duplication, yet it will compress duplicated data regardless. Opportunity: there is a need for an automated way of first de-duplicating and then compressing.
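The "de-duplicate first, then compress" opportunity can be sketched with a content-addressed chunk store. This is a minimal illustration, not any vendor's implementation; the hash choice, block size, and store layout are assumptions made for the example.

```python
import hashlib
import zlib

def dedupe_then_compress(blocks):
    """Store each unique block once (keyed by content hash), then compress it.

    Returns the chunk store {hash: compressed_bytes} and the recipe (list of
    hashes) needed to reconstruct the original stream.
    """
    store = {}
    recipe = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:                   # de-duplicate first ...
            store[digest] = zlib.compress(block)  # ... then compress
        recipe.append(digest)
    return store, recipe

def restore(store, recipe):
    """Rebuild the original byte stream from the recipe."""
    return b"".join(zlib.decompress(store[d]) for d in recipe)

# A stream with heavy duplication: 6 blocks, only 2 unique.
blocks = [b"A" * 4096, b"B" * 4096] * 3
store, recipe = dedupe_then_compress(blocks)
print(len(store))                                   # 2 unique chunks stored
print(restore(store, recipe) == b"".join(blocks))   # True
```

Doing it in the opposite order (compress first) would hide the duplicates from the hash, which is exactly the problem the slide describes.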
-
About Hadoop
Hadoop is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers.
Solution for Big Data: deals with the complexities of high volume, velocity, and variety of data. It enables applications to work with thousands of nodes and petabytes of data. It is:
Reliable: The software is fault tolerant; it expects and handles hardware and software failures.
Scalable: Designed for massive scale of processors, memory, and locally attached storage.
Distributed: Handles replication. Offers a massively parallel programming model, MapReduce.
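The MapReduce model named above can be illustrated in plain Python. This is a conceptual sketch only: real Hadoop jobs are typically written as Java Mapper/Reducer classes, and the framework performs the shuffle across the cluster rather than in-process.

```python
from collections import defaultdict
from itertools import chain

# ---- Map phase: each mapper turns one input record into (key, value) pairs.
def map_fn(line):
    for word in line.split():
        yield (word.lower(), 1)

# ---- Shuffle phase: the framework groups all values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# ---- Reduce phase: each reducer folds one key's values into a result.
def reduce_fn(key, values):
    return (key, sum(values))

records = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(map_fn(r) for r in records)
result = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
print(result["the"])  # 3
print(result["fox"])  # 2
```

Because each map call sees one record and each reduce call sees one key, both phases parallelize across thousands of nodes with no shared state, which is what makes the model fault tolerant: a failed task is simply re-run elsewhere.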
-
About Apache Hadoop Software Library
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
-
Market Drivers for Apache Hadoop
Business Drivers
High-value projects that require use of more data
Belief that there is great ROI in mastering big data
Financial Drivers
Growing cost of data systems as a percentage of IT spend
Cost advantage of commodity hardware + open source
Enables departmental-level big data strategies
-
Trend
The OLD WAY
Operational systems keep only current records, short history
Analytics systems keep only conformed/cleaned/digested data
Unstructured data locked away in operational silos
Archives offline: inflexible, new questions require system redesigns
The NEW TREND
Keep raw data in Hadoop for a long time
Able to produce a new analytics view on demand
Keep a new copy of data that was previously in silos
Can directly run new reports and experiments at low incremental cost
New products/services can be added very quickly
Agile outcome justifies new infrastructure
-
Hadoop is part of a larger framework of related technologies:
HDFS: Hadoop Distributed File System.
HBase: Column-oriented, non-relational, schema-less, distributed database modeled after Google's BigTable. Promises random, real-time read/write access to Big Data.
Hive: Data warehouse system that provides an SQL interface. A data structure can be projected ad hoc onto the unstructured underlying data.
Pig: A platform for manipulating and analyzing large data sets. A high-level language for analysts.
ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
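Hive's idea of projecting a schema ad hoc onto unstructured data ("schema on read") can be sketched in plain Python. The log format, the regular expression, and the query are invented for illustration; in Hive itself the raw files stay untouched in HDFS and the table definition supplies the parsing.

```python
import re

# Raw, unstructured log lines as they might sit in HDFS.
raw_logs = [
    "2019-07-31 10:02:11 GET /index.html 200",
    "2019-07-31 10:02:15 POST /login 401",
    "2019-07-31 10:03:02 GET /index.html 200",
]

# Project a structure onto the raw text only at query time.
ROW = re.compile(r"(\S+) (\S+) (\S+) (\S+) (\d+)")

def scan(lines):
    """Yield structured rows parsed on the fly from raw lines."""
    for line in lines:
        m = ROW.match(line)
        if m:
            date, time, method, path, status = m.groups()
            yield {"path": path, "status": int(status)}

# Rough equivalent of:
#   SELECT path, COUNT(*) FROM logs WHERE status = 200 GROUP BY path
counts = {}
for row in scan(raw_logs):
    if row["status"] == 200:
        counts[row["path"]] = counts.get(row["path"], 0) + 1
print(counts)  # {'/index.html': 2}
```

The point of schema on read is that a different "table" (a different regex and column set) can be projected onto the same raw files tomorrow, without re-loading any data.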
-
Organizations using Hadoop
-
Hadoop Developer
Core contributor since Hadoop's infancy
Project lead for the Hadoop Distributed File System
Facebook (Hadoop, Hive, Scribe)
Yahoo! (Hadoop in Yahoo! Search)
Veritas (San Point Direct, Veritas File System)
IBM Transarc (Andrew File System)
UW Computer Science alumnus (Condor Project)
-
Why is Hadoop needed?
Need to process multi-petabyte datasets
It is expensive to build reliability into each application
Failure is expected, rather than exceptional
The number of nodes in a cluster is not constant; nodes fail every day
Need common infrastructure: efficient, reliable, open source (Apache License)
The above goals are the same as Condor's, but workloads are IO-bound, not CPU-bound
-
Hadoop is particularly useful when:
Complex information processing is needed
Unstructured data needs to be turned into structured data
Queries can't be reasonably expressed using SQL
Heavily recursive algorithms are involved
Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing
Machine learning
Data sets are too large to fit into database RAM or discs, or require too many cores (10s of TB up to PB)
Data value does not justify the expense of constant real-time availability, such as archives or special-interest info, which can be moved to Hadoop and remain available at lower cost
Results are not needed in real time
Fault tolerance is critical
Significant custom coding would otherwise be required to handle job scheduling
-
Hadoop is being used as a:
Staging layer: The most common use of Hadoop in enterprise environments is Hadoop ETL: preprocessing, filtering, and transforming vast quantities of semi-structured and unstructured data for loading into a data warehouse.
Event analytics layer: Large-scale log processing of event data: call records, behavioral analysis, social network analysis, clickstream data, etc.
Content analytics layer: Next-best action, customer experience optimization, social media analytics. MapReduce provides the abstraction layer for integrating content analytics with more traditional forms of advanced analysis.
-
Karmasphere released the results of a survey of 102 Hadoop developers regarding adoption, use, and future plans.
-
What Data Projects is Hadoop Driving?
-
Are Companies Adopting Hadoop?
More than one-half (54%) of organizations surveyed are using or considering Hadoop for large-scale data processing needs.
More than twice as many Hadoop users report being able to create new products and services and enjoy cost savings compared with those using other platforms; over 82% benefit from faster analyses and better utilization of computing resources.
87% of Hadoop users are performing or planning new types of analyses with large-scale data.
94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; 82% can now retain more of their data.
Organizations use Hadoop in particular to work with unstructured data such as logs and event data (63%).
More than two-thirds of Hadoop users perform advanced analysis: data mining or algorithm development and testing.
-
-
Hadoop at LinkedIn
LinkedIn leverages Hadoop to transform raw data into rich features using knowledge aggregated from LinkedIn's 125-million-member base. LinkedIn then uses Lucene to do real-time recommendations, and also Lucene on Hadoop to bridge offline analysis with user-facing services. The streams of user-generated information, referred to as social media feeds, may contain valuable, real-time information on LinkedIn members' opinions, activities, and mood states.
-
Hadoop at Foursquare
Foursquare was finding it difficult to handle the huge amount of data it was accumulating. Its business development managers, venue specialists, and upper management needed access to the data in order to inform some important decisions.
To enable easy access to the data, Foursquare engineering decided to use Apache Hadoop and Apache Hive in combination with a custom data server (built in Ruby), all running in Amazon EC2. The data server is built using Rails, MongoDB, Redis, and Resque, and communicates with Hive using the Ruby Thrift client.
-
Hadoop @ Orbitz
Orbitz needed an infrastructure that provides:
Long-term storage of large data sets;
Open access for developers and business analysts;
Ad-hoc querying of data;
Rapid deployment of reporting applications.
They moved to Hadoop and Hive to provide reliable and scalable storage and processing of data on inexpensive commodity hardware.
-
HDFS Architecture
[Architecture diagram: a single NameNode holds the metadata (file names such as /home/foo/data, block lists, replica counts) and serves clients' metadata operations; DataNodes spread across racks (Rack 1, Rack 2) store the actual blocks and carry out block operations and replication; clients contact the NameNode for metadata, then write blocks to and read blocks from the DataNodes directly.]
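The data layout behind the diagram can be sketched in a few lines: a file is split into fixed-size blocks, and each block's replicas are spread across racks so a whole-rack failure never loses every copy. The block size, node names, and placement function below are simplified assumptions for illustration, not HDFS's actual code (real defaults are 64/128 MB blocks and a more nuanced rack-aware policy).

```python
BLOCK_SIZE = 4  # bytes; tiny for illustration only

# Two racks, two DataNodes each (hypothetical names).
racks = [("rack1", ["r1n1", "r1n2"]), ("rack2", ["r2n1", "r2n2"])]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """The client splits a file into fixed-size blocks before writing."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(block_id, racks, replication=3):
    """Sketch of rack-aware placement: one replica on a 'local' rack, the
    remaining replicas together on a different rack, so that losing an
    entire rack never loses all copies of a block."""
    local_rack = racks[block_id % len(racks)]
    remote_rack = racks[(block_id + 1) % len(racks)]
    return [local_rack[1][0]] + remote_rack[1][:replication - 1]

blocks = split_into_blocks(b"hello hadoop!")
placement = {i: place_replicas(i, racks) for i in range(len(blocks))}
print(len(blocks))   # 4 blocks (13 bytes in 4-byte blocks)
print(placement[0])  # ['r1n1', 'r2n1', 'r2n2']
```

In real HDFS the NameNode records only this placement map as metadata; the blocks themselves flow directly between the client and the DataNodes, which is why the NameNode does not become a bandwidth bottleneck.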