Hadoop V.01


  • 7/31/2019 Hadoop V.01


    HADOOP/BIG DATA


    About Big Data

Big data is a general term used to describe the voluminous amount of unstructured and semi-structured data a company creates -- data that would take too much time and cost too much money to load into a relational database for analysis. The term is often used when speaking about petabytes and exabytes of data.

When dealing with larger datasets, organizations face difficulties in creating, manipulating, and managing Big Data. Big data is particularly a problem in business analytics because standard tools and procedures are not designed to search and analyze massive datasets.

A primary goal of looking at big data is to discover repeatable business patterns. Unstructured data, most of it located in text files, accounts for at least 80% of an organization's data. If left unmanaged, the sheer volume of unstructured data that's generated each year within an enterprise can be costly in terms of storage. Unmanaged data can also pose a liability if information cannot be located in the event of a compliance audit or lawsuit.


Big data spans three dimensions

Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.

Velocity: Often time-sensitive, big data must be used as it is streaming in to the enterprise in order to maximize its value to the business.

Variety: Big data extends beyond structured data, including unstructured data of all varieties: text, audio, video, click streams, log files, and more.


Customer challenges for securing Big Data

Awareness & understanding are lacking: customers are not actively talking about security concerns, and they need help understanding threats in a Big Data environment.

Company policies & laws add complexity: the main considerations are synchronizing retention and disposition policies across jurisdictions and moving data across countries. Customers need help navigating frameworks and changes.

Storage efficiency challenges for Big Data

De-duplication
Challenge: In most instances, data is random and inconsistent, not duplicated.
Opportunity: There is a need for more intelligent identification of data.

Compression
Challenge: Compression normally happens instead of de-duplication, yet it will compress duplicated data regardless.
Opportunity: There is a need for an automated way of first de-duplicating, then compressing.
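The "de-duplicate, then compress" opportunity described above can be sketched in a few lines. This is a hypothetical illustration using fixed-size chunk hashing, not any particular product's pipeline; the function names and the 64-byte chunk size are made up for the example.

```python
import hashlib
import zlib

def dedup_then_compress(data, chunk_size=64):
    """Split data into fixed-size chunks, store each unique chunk once
    (keyed by its SHA-256 digest), then compress the unique chunks."""
    order = []   # sequence of chunk digests, used to reconstruct the data
    store = {}   # digest -> compressed unique chunk
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = zlib.compress(chunk)  # compress only unique chunks
        order.append(digest)
    return order, store

def restore(order, store):
    """Rebuild the original bytes from the chunk sequence and store."""
    return b"".join(zlib.decompress(store[d]) for d in order)
```

With highly repetitive input, the store stays small because each repeated chunk is kept (and compressed) only once, which is the point of doing de-duplication before compression.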


About Hadoop

Hadoop is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers.

Solution for Big Data: it deals with the complexities of high volume, velocity, and variety of data. It enables applications to work with thousands of nodes and petabytes of data. It is:

Reliable: The software is fault tolerant; it expects and handles hardware and software failures.

Scalable: Designed for massive scale of processors, memory, and locally attached storage.

Distributed: Handles replication and offers a massively parallel programming model, MapReduce.
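MapReduce, mentioned above, structures a job as a map phase, a shuffle that groups intermediate values by key, and a reduce phase. A minimal single-process sketch of that model follows; real Hadoop jobs run these phases in parallel across the cluster and are typically written in Java against the MapReduce API.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input.
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase("big data big clusters")))
# -> {'big': 2, 'data': 1, 'clusters': 1}
```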


    About Apache Hadoop Software Library

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.

It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.


    Market Drivers for Apache Hadoop

    Business Drivers

    High-value projects that require use of more data

    Belief that there is great ROI in mastering big data

    Financial Drivers

Growing cost of data systems as a percentage of IT spend; cost advantage of commodity hardware + open source

    Enables departmental-level big data strategies


    Trend

The old way

Operational systems keep only current records, short history.

Analytics systems keep only conformed/cleaned/digested data.

Unstructured data is locked away in operational silos.

Archives are offline: inflexible; new questions require system redesigns.

The new trend

Keep raw data in Hadoop for a long time.

Able to produce a new analytics view on demand.

Keep a new copy of data that was previously in silos.

Can directly run new reports and experiments at low incremental cost.

New products/services can be added very quickly.

Agile outcomes justify the new infrastructure.


Hadoop is part of a larger framework of related technologies:

HDFS: Hadoop Distributed File System.

HBase: Column-oriented, non-relational, schema-less, distributed database modeled after Google's BigTable. Promises random, real-time read/write access to Big Data.

Hive: Data warehouse system that provides a SQL interface. Data structure can be projected ad hoc onto unstructured underlying data.

Pig: A platform for manipulating and analyzing large data sets; a high-level language for analysts.

ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
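Hive's ability to project structure ad hoc onto unstructured underlying data is often called schema-on-read: the raw file stays untouched, and a schema is applied only when a query reads it. A rough Python analogy of that idea (the log format and column names here are invented for illustration, not a real Hive mechanism):

```python
# Raw, unstructured file content: plain whitespace-separated log lines.
raw_log = """\
2019-07-31 10:00:01 alice login
2019-07-31 10:00:05 bob login
2019-07-31 10:02:11 alice purchase
"""

# The projected "schema" lives with the reader, never in the file itself.
columns = ("date", "time", "user", "action")

def read_with_schema(text):
    # Apply the schema at read time, one row per line.
    for line in text.splitlines():
        yield dict(zip(columns, line.split()))

# A query in the spirit of: SELECT user FROM logs WHERE action = 'login'
logins = [row["user"] for row in read_with_schema(raw_log)
          if row["action"] == "login"]
# -> ['alice', 'bob']
```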


    Organizations using Hadoop


    Hadoop Developer

Core contributor since Hadoop's infancy

    Project Lead for Hadoop Distributed File System

    Facebook (Hadoop, Hive, Scribe) Yahoo! (Hadoop in Yahoo Search)

    Veritas (San Point Direct, Veritas File System)

    IBM Transarc (Andrew File System)

    UW Computer Science Alumni (Condor Project)


Why is Hadoop needed?

Need to process multi-petabyte datasets.

Expensive to build reliability into each application.

Failure is expected, rather than exceptional: the number of nodes in a cluster is not constant, and nodes fail every day.

Need common infrastructure that is efficient, reliable, and open source (Apache License).

These goals are the same as Condor's, but the workloads are I/O bound, not CPU bound.


Hadoop is particularly useful when:

Complex information processing is needed.

Unstructured data needs to be turned into structured data.

Queries can't be reasonably expressed using SQL.

Heavily recursive algorithms are involved.

Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing.

Machine learning is involved.

Data sets are too large to fit into database RAM or discs, or require too many cores (tens of TB up to PB).

Data value does not justify the expense of constant real-time availability; archives or special-interest info can be moved to Hadoop and remain available at lower cost.

Results are not needed in real time.

Fault tolerance is critical.

Significant custom coding would otherwise be required to handle job scheduling.


Hadoop is being used as a:

Staging layer: The most common use of Hadoop in enterprise environments is as a Hadoop ETL layer: preprocessing, filtering, and transforming vast quantities of semi-structured and unstructured data for loading into a data warehouse.

Event analytics layer: Large-scale log processing of event data: call records, behavioral analysis, social network analysis, clickstream data, etc.

Content analytics layer: Next-best action, customer experience optimization, social media analytics. MapReduce provides the abstraction layer for integrating content analytics with more traditional forms of advanced analysis.
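The staging-layer use above, filtering and transforming semi-structured records before they reach a warehouse, is the kind of logic often run as a Hadoop Streaming mapper over raw log files. A hypothetical standalone version of such a transform (the event format and field names are assumptions made for this sketch):

```python
import json

def etl_record(line):
    """Parse one semi-structured JSON event, drop malformed or
    irrelevant records, and normalize the rest for warehouse loading."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None                      # filter: skip unparseable lines
    if event.get("type") != "click":
        return None                      # filter: keep only click events
    return {                             # transform: a flat, typed row
        "user_id": int(event["user"]),
        "page": event.get("page", "unknown"),
    }

lines = [
    '{"type": "click", "user": "7", "page": "/home"}',
    'not json at all',
    '{"type": "view", "user": "8"}',
]
rows = [r for r in (etl_record(l) for l in lines) if r is not None]
# -> [{'user_id': 7, 'page': '/home'}]
```

In a real Hadoop Streaming job the same function would read lines from stdin and emit rows to stdout, letting the framework parallelize it across the cluster.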


Karmasphere released the results of a survey of 102 Hadoop developers regarding adoption, use, and future plans.


    What Data Projects is Hadoop Driving?


    Are Companies Adopting Hadoop?

More than one-half (54%) of organizations surveyed are using or considering Hadoop for large-scale data processing needs.

More than twice as many Hadoop users report being able to create new products and services and enjoy cost savings beyond those using other platforms; over 82% benefit from faster analyses and better utilization of computing resources.

87% of Hadoop users are performing or planning new types of analyses with large-scale data.

94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; 82% can now retain more of their data.

Organizations use Hadoop in particular to work with unstructured data such as logs and event data (63%).

More than two-thirds of Hadoop users perform advanced analysis: data mining or algorithm development and testing.


Hadoop at LinkedIn

LinkedIn leverages Hadoop to transform raw data into rich features using knowledge aggregated from LinkedIn's 125-million-member base. LinkedIn then uses Lucene to do real-time recommendations, and also Lucene on Hadoop to bridge offline analysis with user-facing services. The streams of user-generated information, referred to as social media feeds, may contain valuable, real-time information on LinkedIn members' opinions, activities, and mood states.


Hadoop at Foursquare

Foursquare was having problems handling the huge amount of data it deals with. Its business development managers, venue specialists, and upper management needed access to the data in order to inform some important decisions.

To enable easy access to data, Foursquare engineering decided to use Apache Hadoop and Apache Hive in combination with a custom data server (built in Ruby), all running in Amazon EC2. The data server is built using Rails, MongoDB, Redis, and Resque and communicates with Hive using the Ruby Thrift client.


    Hadoop @ Orbitz

Orbitz needed an infrastructure that provides:

Long-term storage of large data sets;

Open access for developers and business analysts;

Ad-hoc querying of data;

Rapid deployment of reporting applications.

They moved to Hadoop and Hive to provide reliable and scalable storage and processing of data on inexpensive commodity hardware.


    HDFS Architecture

[Diagram: HDFS architecture. A Namenode holds the metadata (file names and replica counts, e.g. for /home/foo/data) and serves clients' metadata and block operations. Clients write blocks to and read blocks from Datanodes, which are spread across racks (Rack 1, Rack 2); the Namenode directs block replication between Datanodes.]
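The HDFS architecture above can be made concrete with a toy model of the Namenode's role: it keeps only metadata (which blocks make up a file and which datanodes hold each replica), while the datanodes hold the actual bytes. This sketch is a deliberate simplification: it places replicas round-robin on distinct datanodes, whereas real HDFS uses a rack-aware placement policy, and the class and parameter names here are invented for the example.

```python
import itertools

class ToyNamenode:
    """Toy model of HDFS metadata: file -> blocks -> replica locations."""

    def __init__(self, datanodes, block_size=4, replication=2):
        self.datanodes = datanodes
        self.block_size = block_size
        self.replication = replication
        self.metadata = {}  # path -> list of (block_bytes, [datanode, ...])
        self._rr = itertools.cycle(range(len(datanodes)))

    def write(self, path, data):
        # Split the file into fixed-size blocks and assign each replica
        # to a datanode (round-robin stands in for rack-aware placement).
        blocks = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            targets = [self.datanodes[next(self._rr)]
                       for _ in range(self.replication)]
            blocks.append((block, targets))
        self.metadata[path] = blocks

    def read(self, path):
        # A client reassembles the file by reading each block from any
        # one of its replica holders; here we just concatenate the blocks.
        return b"".join(block for block, _ in self.metadata[path])
```

In real HDFS the Namenode never touches block data at all: clients stream bytes directly to and from datanodes, and the Namenode only answers the metadata and block-placement questions modeled here.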