Case Study on Hadoop
8/3/2019 Case Study on Hadoop
http://slidepdf.com/reader/full/case-study-on-hadoop 1/6
CASE STUDY ON HADOOP
What is Hadoop?
Apache Hadoop is a new way for enterprises to store and analyze data.
Hadoop is an open-source project administered by the Apache Software
Foundation. Hadoop’s contributors work for some of the world’s biggest
technology companies. That diverse, motivated community has produced a
genuinely innovative platform for consolidating, combining and
understanding large-scale data in order to better comprehend the data
deluge.
Enterprises today collect and generate more data than ever before.
Relational and data warehouse products excel at OLAP and OLTP workloads
over structured data. Hadoop, however, was designed to solve a different
problem: the fast, reliable analysis of both structured data and complex
data. As a result, many enterprises deploy Hadoop alongside their legacy IT
systems, which allows them to combine old data and new data sets in
powerful new ways.
Technically, Hadoop consists of two key services: reliable data storage using
the Hadoop Distributed File System (HDFS) and high-performance parallel
data processing using a technique called MapReduce.
Hadoop runs on a collection of commodity, shared-nothing servers. You can
add or remove servers in a Hadoop cluster at will; the system detects and
compensates for hardware or system problems on any server. Hadoop, in
other words, is self-healing. It can deliver data — and can run large-scale,
high-performance processing jobs — in spite of system changes or failures.
Originally developed and employed by dominant Web companies like Yahoo
and Facebook, Hadoop is now widely used in finance, technology, telecom,
media and entertainment, government, research institutions and other
markets with significant data. With Hadoop, enterprises can easily explore
complex data using custom analyses tailored to their information and
questions.
Cloudera is an active contributor to the Hadoop project and provides an
enterprise-ready, commercial Distribution for Hadoop. Cloudera's
Distribution bundles the innovative work of a global open-source community,
applying critical bug fixes and important new features from the public
development repository to a stable version of the source code.
In short, Cloudera integrates the most popular projects related to Hadoop
into a single package, which is run through a suite of rigorous tests to
ensure reliability in production.
Hadoop Overview
Apache Hadoop is a scalable, fault-tolerant system for data storage and
processing. Hadoop is economical and reliable, which makes it perfect to run
data-intensive applications on commodity hardware.
Hadoop excels at doing complex analyses, including detailed, special-
purpose computation, across large collections of data. Hadoop handles
search, log processing, recommendation systems, data warehousing and
video/image analysis. Unlike traditional databases, Hadoop scales to address
the needs of data-intensive distributed applications in a reliable, cost-
effective manner.
HDFS and MapReduce
Hadoop creates clusters of machines and coordinates work among them.
Clusters can be built and scaled out with inexpensive computers.
The Hadoop software package includes the robust, reliable Hadoop
Distributed File System (HDFS), which splits user data across servers in a
cluster. It uses replication to ensure that even multiple node failures will not
cause data loss.
Fault-tolerant Hadoop Distributed File System (HDFS)
Provides reliable, scalable, low-cost storage.
HDFS breaks incoming files into blocks and stores them redundantly across
the cluster.
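The block-and-replica scheme described above can be sketched in a few lines of Python. This is a simplified model, not actual HDFS code; the block size and replication factor mirror common HDFS defaults, and the round-robin placement is an illustrative stand-in for HDFS's real rack-aware placement policy:

```python
# Simplified model of HDFS-style block splitting and replica placement.
# A sketch of the idea only, not the real implementation.

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default
REPLICATION = 3                  # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes splits into."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
print(len(blocks))                              # 3 blocks: 128 + 128 + 44 MB
print(place_replicas(len(blocks), ["n1", "n2", "n3", "n4"]))
```

Because every block lives on three distinct nodes, any single server can be lost without losing data, which is what makes the commodity-hardware model described above viable.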
In addition, Hadoop includes MapReduce, a parallel distributed processing
system that is different from most similar systems on the market. It was
designed for clusters of commodity, shared-nothing hardware. No special
programming techniques are required to run analyses in parallel using
MapReduce; most existing algorithms work without changes. MapReduce
takes advantage of the distribution and replication of data in HDFS to spread
execution of any job across many nodes in a cluster.
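The map-shuffle-reduce flow can be illustrated with a self-contained word count, the canonical MapReduce example. This single-process Python sketch simulates what Hadoop distributes across many nodes; the shuffle step stands in for the framework's group-by-key phase between mappers and reducers:

```python
from collections import defaultdict

# Single-process sketch of the MapReduce flow Hadoop distributes
# across a cluster: map, shuffle (group by key), reduce.

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts emitted for one word."""
    return key, sum(values)

lines = ["Hadoop stores data", "Hadoop processes data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)   # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Because each map call sees only one line and each reduce call only one key, the framework is free to run them on whichever nodes hold the relevant HDFS blocks, which is why most existing algorithms need no special parallel-programming techniques.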
MapReduce Software Framework
Offers clean abstraction between data analysis tasks and the underlying systems challenges involved in
ensuring reliable large-scale computation.
Processes large jobs in parallel across many nodes and combines results.
Eliminates the bottlenecks imposed by monolithic storage systems.
Results are collated and digested into a single output after each piece has
been analyzed.
If a machine fails, Hadoop continues to operate the cluster by shifting work
to the remaining machines. It automatically creates an additional copy of the
data from one of the replicas it manages. As a result, clusters are self-
healing for both storage and computation without requiring intervention by
systems administrators.
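The self-healing behavior described above can be sketched as a toy model. The node and block names below are hypothetical, and real HDFS re-replication is driven by the NameNode processing block reports; the sketch only shows the invariant being restored, namely that every block regains its target replica count from a surviving copy:

```python
# Toy model of re-replication after a node failure: any block that has
# fallen below the target replica count is copied from a surviving
# replica to a node that does not yet hold it.

REPLICATION = 3

def heal(placement, failed_node, live_nodes):
    """Drop the failed node from every block's replica list, then
    restore each under-replicated block onto a new node."""
    for block, replicas in placement.items():
        if failed_node in replicas:
            replicas.remove(failed_node)
        while len(replicas) < REPLICATION:
            target = next(n for n in live_nodes if n not in replicas)
            replicas.append(target)   # copy the block from a surviving replica
    return placement

placement = {"blk_1": ["n1", "n2", "n3"], "blk_2": ["n2", "n3", "n4"]}
heal(placement, failed_node="n2", live_nodes=["n1", "n3", "n4", "n5"])
print(placement)   # every block is back to 3 replicas, none on n2
```

The same principle covers computation: a task that was running on the failed node is simply rescheduled on another node holding a replica of its input block.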
What can Hadoop do for you?
Apache Hadoop is an ideal platform for consolidating large-scale data
from a variety of new and legacy sources. It complements existing data
management solutions with new analyses and processing tools. It
delivers immediate value to companies in a variety of vertical markets.
Examples include:
E-tailing
Recommendation engines — increase average order size by recommending
complementary products based on predictive analysis for cross-selling.
Cross-channel analytics — sales attribution, average order value, lifetime
value (e.g., how many in-store purchases resulted from a particular
recommendation, advertisement or promotion).
Event analytics — what series of steps (golden path) led to a desired
outcome (e.g., purchase, registration).
Financial Services
Compliance and regulatory reporting.
Risk analysis and management.
Fraud detection and security analytics.
CRM and customer loyalty programs.
Credit scoring and analysis.
Trade surveillance.
Government
Fraud detection and cybersecurity.
Compliance and regulatory analysis.
Energy consumption and carbon footprint management.
Health & Life Sciences
Campaign and sales program optimization.
Brand management.
Patient care quality and program analysis.
Supply-chain management.
Drug discovery and development analysis.
Retail/CPG
Merchandising and market basket analysis.
Campaign management and customer loyalty programs.
Supply-chain management and analytics.
Event- and behavior-based targeting.
Market and consumer segmentations.
Telecommunications
Revenue assurance and price optimization.
Customer churn prevention.
Campaign management and customer loyalty.
Call Detail Record (CDR) analysis.
Network performance and optimization.
Web & Digital Media Services
Large-scale clickstream analytics.
Ad targeting, analysis, forecasting and optimization.
Abuse and click-fraud prevention.
Social graph analysis and profile segmentation.
Campaign management and loyalty programs.