Big Data v2.1


Transcript of Big Data v2.1

  • 7/31/2019 Big Data v2.1

    1/46


    2 > 9/15/2010

    Agenda

    Advances in Relational Database Technology

    MapReduce and Hadoop

Applications of MapReduce

    MapReduce and RDBMS - Comparisons

    Coexistence between MapReduce and RDBMS


Brief Bio of Dilip Krishna - Financial Data and Technology Guy

    Engagement Partner, North-East Financial Services, Teradata

    > Former Director of N.A. Enterprise Risk Management practice

    > Consulted on ERM and Basel II initiatives - U.S. and Canadian banks

    Long experience in technology and business consulting in the financial industry

    Large-scale projects including Basel II implementations.

    Authored numerous articles about risk management and data architecture

    Engineering degrees from the Ohio State University and the Indian Institute of Technology

    CFA and FRM designations.


    Disclaimer

1. These ideas do not necessarily reflect positions of Teradata

    > I take full responsibility for any errors or omissions

    2. This presentation is meant as a review of current Big Data technologies, but the examples of RDBMS will, by necessity, solely use Teradata


    What is this?


Big Data has evolved = Really Big Data!

    (Image: Seagate 2TB External Desktop Hard Drive)


    What is Big Data?

Big data is a collection of digital information whose size is beyond the ability of most software tools and people to capture, manage, and process the data.

    Current examples of Big Data include weblogs, RFID readers, sensor networks, social networks, Internet text and documents, Internet search indexing, call detail records, genomics, astronomy, biological research, military surveillance, photography archives, video archives, and large scale eCommerce.

    Big Data requires scalable technologies to efficiently process large quantities of data within tolerable elapsed times. Big Data scalable technologies include MPP databases, the Apache Hadoop Framework, the Internet, and archival storage systems.

    Currently Big Data ranges from a few dozen terabytes to several petabytes and is constantly increasing at both the low and high ends of the size range.

    Source: Teradata working definition, Sep 2010


Big Data can work on lots of technologies, but...

    * Source: Curt Monash, http://www.networkworld.com/community/node/41777, downloaded on 11-Sep-2010


It's not the size of data, but what you do with it!

(Chart: dimensions along which data warehouse workloads scale)

    Number of Concurrent Users: 10s, 100s, 1,000s, 10,000s

    Frequency of Update: Monthly, Weekly, Daily, Now

    Level of Detail: Summary, Dimensional, Detailed

    Storage Scalability: Gigabytes, Terabytes, 100s of Terabytes

    Parallel Operation: None, Conditional, Total

    Query Complexity: Simple, 3-5 Way Table Joins, 6-20 Way Table Joins

    Data Loading: Manual, Automated, Integrated

    Workload Mix: Batch Reporting, Ad Hoc, Interactive Analysis, Active Data Warehousing


    Relational Databases


Not your father's Relational Database

    Concepts date back to 1970 (Codd and Date)

    Evolution of full-fledged RDBMS systems - Oracle, Sybase, DB2, etc.

    Performance Considerations

    > Massively-Parallel Processing (MPP) platforms pioneered by Teradata

    > Column-oriented databases - Sybase IQ, Vertica, etc.

    Data Warehouse Appliances

    Extremely large data-sets are routinely being processed

    > Many companies in the petabyte club


An explosion in maturity in RDBMS - An Example

    Powered by the Teradata Database:

    Data Mart Appliance - Up to 6TB - Test/Development or Smaller Data Marts

    Extreme Performance Appliance - Up to 17TB - Extreme Performance for Operational Analytics

    Data Warehouse Appliance - Up to 275TB - Entry-level EDW or Departmental Data Marts

    Extreme Data Appliance - Up to 86PB - Analytics on Extreme Data Volumes

    Active Enterprise Data Warehouse - Up to 88PB - EDW/ADW for both Strategic and Operational Intelligence


    Benefits of RDBMS and DW Appliances

Declarative Programming Style of SQL

    > Ease-of-use (no programming required)

    Separation of code and data using Schemas

    Rich set of analysis tools

    > Report writers

    > Business intelligence tools

    > Data mining tools

    > Database design tools

    Custom coding possible via UDFs (but not easy)

    Codebase Maturity

    > 30+ years of development

    Manageability

    > Out of the box Performance

    > One number to call

    > TCO Reduction

    > Reduced administration

    > Built-in high availability

    > Scalability

    > Rapid time-to-value

    Low Cost per Query


Scalable Shared Nothing Software - Virtual Processors

    Divide the data, workload, and system resources evenly among many parallel processing units (AMPs)

    No single point of control for any operation

    > I/O, Buffers, Locking, Logging, Dictionary

    > Nothing centralized

    > Nothing in the way of linear scalability

    (Diagram: each AMP owns its own Logs, Locks, Buffers, and I/O)


    Automatic Data Distribution

    Each node has rows equally distributed from every table

    Each node works independently, in parallel, on its rows

(Diagram: rows of Table-A through Table-D spread evenly across AMP1-AMP4 on Node 1 and Node 2, connected by the BYNET)
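The even-spread idea above can be sketched as a toy hash distributor in Python (a simplified illustration of hash-based row placement, not Teradata's actual hashing algorithm; `amp_for_row` is a hypothetical helper):

```python
import hashlib

def amp_for_row(primary_index_value, num_amps):
    """Pick an AMP for a row by hashing its primary index value (toy version)."""
    digest = hashlib.md5(str(primary_index_value).encode()).hexdigest()
    return int(digest, 16) % num_amps

# Spread 1,000 rows across 4 AMPs; a good hash gives each AMP roughly 250 rows,
# so every AMP can work on its share of the table in parallel.
num_amps = 4
counts = [0] * num_amps
for row_id in range(1000):
    counts[amp_for_row(row_id, num_amps)] += 1

print(counts)
```

Because placement depends only on the hashed primary index, any AMP can locate a row without consulting a central directory, which is what keeps the design shared-nothing.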


    MapReduce and Hadoop


    What is MapReduce?

A parallel programming framework

    > Originally from Google

    Generate search indexes, Web scoring algorithms

    > C++, Java, Python, etc.

    > Harness 1000s of CPUs

    MapReduce provides

    > Automatic parallelization

    > Fault tolerance

    > Monitoring & status updates

    Hadoop

    > Open source MapReduce

    (Diagram: a scheduler runs the map, shuffle, and reduce phases of a Map Function to produce results)


    What is Hadoop?

    Open source version of MapReduce

    > Top level Apache project

> Contributors: Yahoo!, IBM, Google

    Hadoop Core includes

    > Distributed File System

    > Map/Reduce

    > Job Tracker/Task Tracker

    Written in Java

    > Linux, Mac OS/X, Windows, and Solaris

    Growing list of supporting tools

Hadoop is named for the inventor's child's stuffed elephant

    http://hadoop.apache.org


    MapReduce has arrived

    Clusters of 100s/1000s of commodity Intel machines

Implementation is a C++ library linked into user programs

    15,000+ computers running Hadoop

    Research for Ad systems and web search

    Product search indexes

    Analytics from user sessions

    Log analysis for reporting and analytics and machinelearning

    Log analysis, data mining, and machine learning

    Large scale image conversion

    High energy physics, genomics, Digital Sky Survey

60+ users of Hadoop listed on http://wiki.apache.org/hadoop/PoweredBy


    Yahoo Hadoop clusters

    Has ~38,000 machines running Hadoop

Largest Hadoop clusters are currently 2,000 nodes

    Several petabytes of user data (compressed, unreplicated)

    Run hundreds of thousands of jobs every month


How does MapReduce work? An Example

    2 Input Files

    File1 => Hello World Bye World

    File2 => Hello Hadoop Goodbye Hadoop

    MapReduce Class

    public class WordCount {
      public static class Map { public void map( ... ) }
      public static class Reduce { public void reduce( ... ) }
      public static void main() {
        setMapperClass(Map.class);
        setReducerClass(Reduce.class);
      }
    }

    Intermediate Files (Output of Map)

    < Hello, 1> < World, 1> < Bye, 1> < World, 1>

    < Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>

    Output File (Output of Reduce)

    Bye 1
    Goodbye 1
    Hadoop 2
    Hello 2
    World 2
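The slide's data flow can be reproduced in a few lines of Python (a single-process toy simulation of the map, shuffle, and reduce phases, not the actual Hadoop API):

```python
from collections import defaultdict

files = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]

# Map phase: each input file yields <word, 1> pairs (the intermediate files).
intermediate = [[(word, 1) for word in text.split()] for text in files]

# Shuffle phase: group the pairs by word across all map outputs.
groups = defaultdict(list)
for pairs in intermediate:
    for word, count in pairs:
        groups[word].append(count)

# Reduce phase: sum the counts for each word (the output file).
output = {word: sum(counts) for word, counts in groups.items()}

for word in sorted(output):
    print(word, output[word])
# Bye 1 / Goodbye 1 / Hadoop 2 / Hello 2 / World 2
```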


    Simplified Hadoop Architecture

(Diagram: a Hadoop Programmer submits code to parallelize; the Name Node holds metadata and the JobTracker schedules work; Data Nodes on Rack 1 and Rack 2 each run HDFS, Maps, Reducers, and a TaskTracker; blocks are replicated across racks as fallback)


    Running Successive MapReduce Jobs

    MR is batch processing

    Run one MR job

    > Output files on reduce node

    Run next MR job

    > Input files are prior reduceroutput

> Creates new reducer files

    Keep running MapReduce batch jobs until solved

    Job 1 -- Map

    Job 1 -- Reduce

    Job 2 -- Map

    Job 2 -- Reduce
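Chaining jobs as described above just means feeding one job's reducer output into the next job's mappers. A toy in-process sketch (real Hadoop jobs would pass files through HDFS; `run_job` is a hypothetical helper):

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """One MapReduce job: map each record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [(key, reduce_fn(key, values)) for key, values in groups.items()]

# Job 1: word count over lines of text.
lines = ["big data big", "data big"]
job1 = run_job(lines,
               lambda line: [(w, 1) for w in line.split()],
               lambda w, ones: sum(ones))

# Job 2: invert job 1's output to group words by their frequency.
job2 = run_job(job1,
               lambda kv: [(kv[1], kv[0])],
               lambda count, words: sorted(words))

print(dict(job2))  # {3: ['big'], 2: ['data']}
```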


    Benefits of MapReduce / Hadoop

    Support for 100s to 1000s of server nodes

    > Extreme scalability

    > Commodity hardware = low costs

Simplicity of programming model

    > Attractive to Programmers (programming vs. SQL)

    Easy integration of developer tools

    > Java, grep, python, etc.

    Batch programming

    > Complex multi-step processing

    Absence of rigid schema

    > No ETL

    Flexibility

    Some fault-tolerance

    Easy setup


MapReduce / Hadoop Challenges

    Performance lags MPP databases

    Quality of Service

> Big jobs can hog the cluster (No workload management)

    > JobTracker memory as limited resource

    > No Fair Scheduler

    > Hard to troubleshoot performance bottlenecks

    Skills

> Ongoing code maintenance depends on open-source process

    > Skilled resources are still few-and-far-between


Addressing Hadoop drawbacks - Related Open Source Projects

    Pig (Pig Latin)

    > Process language

    > Data set handling, transformations

    Hive

    > SQL interface to Hadoop

    HBase

    > Column oriented database

    Many others

    > Zookeeper

    > Eclipse Hadoop plug-in

    > Thrift

    > Cascading

    > Scope

    > Dryad/Linq

    > etc.


    MapReduce Applications


    Common Hadoop Applications

    Reporting

    > Web logs

Web analytics

    > Click streams

    > Session analysis

    > Social media

    Data mining

    > Personalization

    > Clustering, k-means, Bayesian

    Text analysis

    > Search indexes

    > Sentiment analysis
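As one data-mining example from the list above, k-means clustering maps naturally onto MapReduce: the map step assigns each point to its nearest centroid, and the reduce step averages each centroid's points. A toy one-dimensional sketch in plain Python (illustrative only, not a Hadoop job):

```python
from collections import defaultdict

def kmeans_step(points, centroids):
    """One MapReduce-style k-means iteration: map assigns, reduce averages."""
    # Map: emit (nearest centroid index, point) pairs, grouped by centroid.
    assignments = defaultdict(list)
    for x in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        assignments[nearest].append(x)
    # Reduce: each new centroid is the mean of its assigned points.
    return [sum(assignments[i]) / len(assignments[i]) if assignments[i] else centroids[i]
            for i in range(len(centroids))]

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids = [0.0, 5.0]
for _ in range(5):  # iterate (i.e. run successive jobs) until stable
    centroids = kmeans_step(points, centroids)

print(centroids)  # [2.0, 11.0]
```

Each iteration corresponds to one MapReduce job, so the full algorithm is exactly the "successive batch jobs" pattern shown earlier.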


    Web Analytics / Web Server Log Analysis

Cross-channel, trends, forecasts, anomalies

    Web Site Analytics Data Warehouse

    Multivariate testing

    Usability testing

    Site error monitoring

    Surveys

    eCommerce merchandising

    Dynamic ad serving

    Behavior targeting

    E-Mail targeting

    Hadoop map reduce
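Session analysis of the kind shown here is a classic Hadoop job: group log records by visitor, then split each visitor's clicks into sessions wherever the gap exceeds an inactivity threshold. A toy in-process sketch (the 30-minute threshold and the `clicks` records are illustrative assumptions):

```python
from collections import defaultdict

SESSION_GAP = 30 * 60  # split sessions on 30 minutes of inactivity (assumed)

# (user_id, timestamp-in-seconds) click records, as parsed from a web log
clicks = [("u1", 0), ("u1", 100), ("u1", 5000), ("u2", 50), ("u1", 200)]

# Map/shuffle: group click timestamps by user (the key).
by_user = defaultdict(list)
for user, ts in clicks:
    by_user[user].append(ts)

# Reduce: sort each user's clicks and count session breaks.
sessions = {}
for user, stamps in by_user.items():
    stamps.sort()
    count = 1
    for prev, cur in zip(stamps, stamps[1:]):
        if cur - prev > SESSION_GAP:
            count += 1
    sessions[user] = count

print(sessions)  # u1 has two sessions (one 4800s gap), u2 has one
```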


    Search Index Page Ranking


    Large Scale Data Transformation

    Millions of files or objects

    Find key object and transform it

> Convert BMPs to JPGs

    > Convert DOCs to PDFs

    > Change NYC to New York City

    > Etc.

    ETL but data stays on node

    New York Times

    > Convert 11M articles to PDFs

    > Convert TIFFs in clouds

    > Use Amazon EC2 and S3 clouds

> 4TB of articles to 1.5TB of PDFs

    Hadoop map reduce

    EC2/S3 EC2/S3 EC2/S3


    RackSpace eMail Logs Tech Support

    http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data

    http://blog.racklabs.com/?p=66

(Diagram: 100s of eMail servers stream email logs into the Hadoop File System; Lucene text indexing over the logs supports Tech Support reports)


MapReduce and RDBMS - Comparisons


    Lots of controversy and flame-wars

    http://www.dbms2.com/2009/04/14/ebay-thinks-mpp-dbms-clobber-mapreduce/


MapReduce: A major step backwards - DeWitt and Stonebraker

    Not a database

    > No schema, indexes, optimizer

    Not high performance

Not a data warehouse

    > No integrated data, no history

    > Simple reports, single fact table

    > Severe data skew

    Not mature technology

    > Many single points of failure

    > No ISV tools

    > No performance tools

    Not for business users

    > No pivot, drill down, self service

    But does it matter??


    Architectural Comparisons

MapReduce vs. RDBMS:

    Cost of Operation: MapReduce - high labor cost/query (labor + operating costs); RDBMS - lower cost/query

    Time-to-value: MapReduce - easy to set up and use, low cost of acquisition; RDBMS - high cost of acquisition, difficult to set up

    Flexibility: MapReduce - significant flexibility; RDBMS - limited flexibility

    Schema Support: MapReduce - externally imposed; RDBMS - native

    Indexing: MapReduce - programmed; RDBMS - native

    Programming Model: MapReduce - programmatic (attractive to programmers); RDBMS - declarative

    Pipelining: MapReduce - data pull via intermediate files, improved fault-tolerance (checkpointing), reduced performance; RDBMS - optimized query plan allows data push, intermediate data not written to disk

    Compression: MapReduce - compression yields poorer performance (currently); RDBMS - compression yields performance (and space) benefits

    Repetitive Record Parsing: MapReduce - parse (at least) once per query; RDBMS - parse once during ETL phase

    * MapReduce and Parallel RDBMSs: Friends or Foes?


Performance Comparisons - Load Times better in Hadoop

    Source: Stonebraker et al, A Comparison of Approaches to Large-Scale Data Analysis, Apr 13, 2009


Performance Comparisons - grep and select times worse in Hadoop

    Source: Stonebraker et al, A Comparison of Approaches to Large-Scale Data Analysis, Apr 13, 2009


Performance Comparisons - Aggregate and Join times much worse in Hadoop

    Source: Stonebraker et al, A Comparison of Approaches to Large-Scale Data Analysis, Apr 13, 2009


    MapReduce and RDBMS

    Complementary, not Competitive!


Analytics Processes - Comparing MapReduce and RDBMS

    STRATEGIC INTELLIGENCE | OPERATIONAL INTELLIGENCE

    REPORTING - WHAT happened? (Batch)

    ANALYZING - WHY did it happen? (Ad Hoc Analytics)

    PREDICTING - WHAT WILL happen?

    OPERATIONALIZING - WHAT IS happening now? (Event driven)

    ACTIVATING - MAKE it happen! (Continuous updates, tactical queries)

    Hadoop: Larger Data Sizes

    RDBMS: Complexity, # queries


    Teradata and Hadoop

    Use cases

    > ETL on steroids

> Hadoop as a BI Tool

    > Hadoop as a BI application

    Teradata DeveloperExchange

    > UDF examples

http://developer.teradata.com/extensibility/articles/hadoop-dfs-to-teradata

    http://developer.teradata.com/extensibility/articles/sessionization-map-reduce-support-in-teradata

(Diagram: Hadoop output flows through TPT and MapReduce UDFs into the data warehouse, feeding ETL and analytic applications across Marketing, Financials, Operations, Sales, Inventory, and Customers)


    Hadoop Parallel Processing

    Analytic, utility, etc. applications

    Data sources

    Data processing

MPP Data Warehouses, Web App server farms, Web crawlers, HDFS clusters


    When to Use Which Infrastructure

MapReduce: Complex processes, Multi-step processes, 1,000+ nodes required, Can't move the data, Extensive text parsing, Householding analysis

    It depends: Data mining, Simple reports, Data cleansing, Visualization tools, Text analysis tools

    RDBMS: Iterative discovery, Drill down, OLAP, Dashboards, Business user querying, Integrated subject areas, Multidimensional views, Trends over time


    Conclusion

MapReduce vs RDBMS

    > Surface similarities, but deep differences

    > RDBMS performance far superior to Hadoop

    > Still immature, but expect deficiencies to be solved over time

    Complementary rather than Competitive technologies

    > Unstructured datasets

    > One/few-time processing (e.g. web-logs)

    Key Criterion - $/Query

    > Large # of queries on a dataset - consider RDBMS

    > Quick time-to-value for few queries - consider MapReduce

    > Expect MapReduce's high cost/query to drop over time

    Consider MapReduce for

    > ETL processing

    > Parallel programming

    Keep a close eye on MapReduce, but don't give up your RDBMS just yet!


    Acknowledgements and Research material

MapReduce: Simplified Data Processing on Large Clusters, Dean and Ghemawat, Google

    Internal paper: MapReduce and Hadoop Technical Overview, Dan Graham and Rick Burns, Teradata

    Tutorial: What is MapReduce and Hadoop, Dan Graham and Rick Burns, Teradata

    MapReduce, Hadoop, and Data Warehousing Coexistence, Dan Graham and Rick Burns, Teradata

    Pig Latin: A Not-So-Foreign Language for Data Processing, Olston et al

    Hive - A Warehousing Solution Over a MapReduce Framework, Thusoo et al

    MapReduce and Parallel DBMSs: Friends or Foes?, Stonebraker et al, Communications of the ACM, January 2010

    MapReduce: A major step backwards, DeWitt and Stonebraker

    A Comparison of Approaches to Large-Scale Data Analysis, Pavlo et al, SIGMOD '09, June 29-July 2, 2009

    Special thanks to Dan Graham, Teradata Corporation
