Case Study on Hadoop


What is Hadoop?

Apache Hadoop is a new way for enterprises to store and analyze data. Hadoop is an open-source project administered by the Apache Software Foundation. Hadoop's contributors work for some of the world's biggest technology companies. That diverse, motivated community has produced a genuinely innovative platform for consolidating, combining and understanding large-scale data in order to better comprehend the data deluge.

Enterprises today collect and generate more data than ever before. Relational and data warehouse products excel at OLAP and OLTP workloads over structured data. Hadoop, however, was designed to solve a different problem: the fast, reliable analysis of both structured data and complex data. As a result, many enterprises deploy Hadoop alongside their legacy IT systems, which allows them to combine old data and new data sets in powerful new ways.

Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce.

Hadoop runs on a collection of commodity, shared-nothing servers. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop, in other words, is self-healing. It can deliver data and run large-scale, high-performance processing jobs in spite of system changes or failures.

Originally developed and employed by dominant Web companies like Yahoo and Facebook, Hadoop is now widely used in finance, technology, telecom, media and entertainment, government, research institutions and other markets with significant data. With Hadoop, enterprises can easily explore complex data using custom analyses tailored to their information and questions.


Cloudera is an active contributor to the Hadoop project and provides an enterprise-ready, commercial Distribution for Hadoop. Cloudera's Distribution bundles the innovative work of a global open-source community, taking critical bug fixes and important new features from the public development repository and applying them to a stable version of the source code.

In short, Cloudera integrates the most popular projects related to Hadoop into a single package, which is run through a suite of rigorous tests to ensure reliability during production.

Hadoop Overview

Apache Hadoop is a scalable, fault-tolerant system for data storage and processing. Hadoop is economical and reliable, which makes it perfect for running data-intensive applications on commodity hardware.

Hadoop excels at doing complex analyses, including detailed, special-purpose computation, across large collections of data. Hadoop handles search, log processing, recommendation systems, data warehousing and video/image analysis. Unlike traditional databases, Hadoop scales to address the needs of data-intensive distributed applications in a reliable, cost-effective manner.

HDFS and MapReduce

Hadoop creates clusters of machines and coordinates work among them. Clusters can be built and scaled out with inexpensive computers.

The Hadoop software package includes the robust, reliable Hadoop Distributed File System (HDFS), which splits user data across servers in a cluster. It uses replication to ensure that even multiple node failures will not cause data loss.
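
To make the storage model concrete, the following is a minimal sketch of writing and reading a file through HDFS's public Java FileSystem API. The NameNode address and the file path are illustrative assumptions, not values taken from this case study.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in a real deployment this is
        // normally picked up from core-site.xml on the classpath.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt");  // hypothetical path

        // Write a file. HDFS transparently splits it into blocks and
        // replicates each block across servers in the cluster.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello, hadoop\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back. The client does not need to know which servers
        // hold the blocks; the NameNode resolves that on its behalf.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}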


Fault-tolerant Hadoop Distributed File System (HDFS)

- Provides reliable, scalable, low-cost storage.
- Breaks incoming files into blocks and stores them redundantly across the cluster.
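
Because each file is stored as replicated blocks, a client can ask where those replicas live. The sketch below lists the servers holding each block of a file; the file path is hypothetical and the configuration is assumed to come from the environment.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical file path.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/big.log"));

        // One BlockLocation per block; getHosts() names the servers that
        // hold a replica of that block.
        for (BlockLocation block :
                fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset %d, length %d, replicas on %s%n",
                    block.getOffset(), block.getLength(),
                    String.join(", ", block.getHosts()));
        }
    }
}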

In addition, Hadoop includes MapReduce, a parallel distributed processing system that is different from most similar systems on the market. It was designed for clusters of commodity, shared-nothing hardware. No special programming techniques are required to run analyses in parallel using MapReduce; most existing algorithms work without changes. MapReduce takes advantage of the distribution and replication of data in HDFS to spread execution of any job across many nodes in a cluster.
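
To illustrate the programming model, here is the canonical word-count example: the mapper emits a (word, 1) pair for every word in its slice of the input, and the reducer sums the counts for each word. This is the standard teaching example for MapReduce, not code taken from this case study.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in its input split.
public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the counts for each word across all mappers.
class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}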


MapReduce Software Framework

- Offers a clean abstraction between data analysis tasks and the underlying systems challenges involved in ensuring reliable large-scale computation.
- Processes large jobs in parallel across many nodes and combines results (see the job driver sketch after this list).
- Eliminates the bottlenecks imposed by monolithic storage systems.
- Results are collated and digested into a single output after each piece has been analyzed.
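
A job driver ties these pieces together: it names the mapper and reducer classes, points the framework at input and output locations, and submits the job. The framework then runs one map task per input split and collates the reducer output under the output directory. The HDFS paths below are hypothetical, and the mapper and reducer are the ones sketched above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);
        // The combiner pre-aggregates map output locally to cut shuffle traffic.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths; one map task runs per input split and
        // the collated results land under the output directory.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}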

If a machine fails, Hadoop continues to operate the cluster by shifting work to the remaining machines. It automatically creates an additional copy of the data from one of the replicas it manages. As a result, clusters are self-healing for both storage and computation, without requiring intervention by systems administrators.


What can Hadoop do for you?

Apache Hadoop is an ideal platform for consolidating large-scale data from a variety of new and legacy sources. It complements existing data management solutions with new analyses and processing tools. It delivers immediate value to companies in a variety of vertical markets. Examples include:

E-tailing

- Recommendation engines: increase average order size by recommending complementary products based on predictive analysis for cross-selling.
- Cross-channel analytics: sales attribution, average order value, lifetime value (e.g., how many in-store purchases resulted from a particular recommendation, advertisement or promotion).
- Event analytics: what series of steps (golden path) led to a desired outcome (e.g., purchase, registration).

Financial Services 

- Compliance and regulatory reporting.
- Risk analysis and management.
- Fraud detection and security analytics.
- CRM and customer loyalty programs.
- Credit scoring and analysis.
- Trade surveillance.

Government 

- Fraud detection and cybersecurity.
- Compliance and regulatory analysis.
- Energy consumption and carbon footprint management.

Health & Life Sciences 

- Campaign and sales program optimization.
- Brand management.
- Patient care quality and program analysis.
- Supply-chain management.
- Drug discovery and development analysis.

Retail/CPG 

- Merchandizing and market basket analysis.
- Campaign management and customer loyalty programs.
- Supply-chain management and analytics.
- Event- and behavior-based targeting.
- Market and consumer segmentations.

Telecommunications 

- Revenue assurance and price optimization.
- Customer churn prevention.
- Campaign management and customer loyalty.
- Call Detail Record (CDR) analysis.
- Network performance and optimization.

Web & Digital Media Services 

- Large-scale clickstream analytics.
- Ad targeting, analysis, forecasting and optimization.
- Abuse and click-fraud prevention.
- Social graph analysis and profile segmentation.
- Campaign management and loyalty programs.