Big Data v2.1


Transcript of Big Data v2.1

  • 7/31/2019 Big Data v2.1

    1/46


    2 > 9/15/2010

    Agenda

    Advances in Relational Database Technology

    MapReduce and Hadoop

Applications of MapReduce

    MapReduce and RDBMS - Comparisons

    Coexistence between MapReduce and RDBMS


Brief Bio of Dilip Krishna - Financial Data and Technology Guy

    Engagement Partner, North-East Financial Services, Teradata

    > Former Director of N.A. Enterprise Risk Management practice

    > Consulted on ERM and Basel II initiatives - U.S. and Canadian banks

    Long experience in technology and business consulting in the financial industry

    Large-scale projects including Basel II implementations.

    Authored numerous articles about risk management and data architecture

    Engineering degrees from the Ohio State University and the Indian Institute of Technology

    CFA and FRM designations.


    Disclaimer

1. These ideas do not necessarily reflect positions of Teradata

    > I take full responsibility for any errors or omissions

    2. This presentation is meant as a review of current Big Data technologies, but the examples of RDBMS will, by necessity, solely use Teradata


    What is this?


Big Data has evolved = Really Big Data!

    (Image: Seagate 2TB External Desktop Hard Drive)


    What is Big Data?

Big data is a collection of digital information whose size is beyond the ability of most software tools and people to capture, manage, and process the data.

    Current examples of Big Data include weblogs, RFID readers, sensor networks, social networks, Internet text and documents, Internet search indexing, call detail records, genomics, astronomy, biological research, military surveillance, photography archives, video archives, and large scale eCommerce.

    Big Data requires scalable technologies to efficiently process large quantities of data within tolerable elapsed times. Big Data scalable technologies include MPP databases, the Apache Hadoop Framework, the Internet, and archival storage systems.

    Currently Big Data ranges from a few dozen terabytes to several petabytes and is constantly increasing at both the low and high ends of the size range.

    Source: Teradata working definition, Sep 2010


Big Data can work on lots of technologies, but...

    * Source: Curt Monash, http://www.networkworld.com/community/node/41777, downloaded on 11-Sep-2010


It's not the size of data, but what you do with it!

(Chart: dimensions along which data warehouse workloads scale)

    Number of Concurrent Users: 10s, 100s, 1,000s, 10,000s

    Frequency of Update: Monthly, Weekly, Daily, Now

    Level of Detail: Summary, Dimensional, Detailed

    Storage Scalability: Gigabytes, Terabytes, 100s of Terabytes

    Parallel Operation: None, Conditional, Total

    Query Complexity: Simple, 3-5 Way Table Joins, 6-20 Way Table Joins

    Data Loading: Manual, Automated, Integrated

    Workload Mix: Batch Reporting, Ad Hoc, Interactive Analysis, Active Data Warehousing


    Relational Databases


Not your father's Relational Database

    Concepts date back to 1970 (Codd and Date)

    Evolution of full-fledged RDBMS systems - Oracle, Sybase, DB2, etc.

    Performance Considerations

    > Massively-Parallel Processing (MPP) platforms pioneered by Teradata

    > Column-oriented databases - Sybase IQ, Vertica, etc.

    Data Warehouse Appliances

    Extremely large data-sets are routinely being processed

    > Many companies in the petabyte club


An explosion in maturity in RDBMS - An Example

    Powered by the Teradata Database:

    Data Mart Appliance - Up to 6TB - Test/Development or Smaller Data Marts

    Extreme Performance Appliance - Up to 17TB - Extreme Performance for Operational Analytics

    Data Warehouse Appliance - Up to 275TB - Entry-level EDW or Departmental Data Marts

    Extreme Data Appliance - Up to 86PB - Analytics on Extreme Data Volumes

    Active Enterprise Data Warehouse - Up to 88PB - EDW/ADW for both Strategic and Operational Intelligence


    Benefits of RDBMS and DW Appliances

Declarative Programming Style of SQL

    > Ease-of-use (no programming required)

    Separation of code and data using Schemas

    Rich set of analysis tools

    > Report writers

    > Business intelligence tools

    > Data mining tools

    > Database design tools

    Custom coding possible via UDFs (but not easy)

    Codebase Maturity

    > 30+ years of development

    Manageability

    > Out of the box Performance

    > One number to call

    > TCO Reduction

    > Reduced administration

    > Built-in high availability

    > Scalability

    > Rapid time-to-value

    Low Cost per Query


Scalable Shared Nothing Software - Virtual Processors

    Divide the data, workload, and system resources evenly among many parallel processing units (AMPs)

    No single point of control for any operation

    > I/O, Buffers, Locking, Logging, Dictionary

    > Nothing centralized

    > Nothing in the way of linear scalability

    (Diagram: each AMP owns its own Logs, Locks, Buffers, and I/O)


    Automatic Data Distribution

    Each node has rows equally distributed from every table

    Each node works independently, in parallel, on its rows

(Diagram: rows of Table-A through Table-D spread evenly across AMP1-AMP4 on Node 1 and Node 2, connected by the BYNET)
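The even-spread idea above can be sketched as a toy hash distributor in Python (a simplified illustration of hash-based row placement, not Teradata's actual hashing algorithm; `amp_for_row` is a hypothetical helper):

```python
import hashlib

def amp_for_row(primary_index_value, num_amps):
    """Pick an AMP for a row by hashing its primary index value (toy version)."""
    digest = hashlib.md5(str(primary_index_value).encode()).hexdigest()
    return int(digest, 16) % num_amps

# Spread 1,000 rows across 4 AMPs; a good hash gives each AMP roughly 250 rows,
# so every AMP can work on its share of the table in parallel.
num_amps = 4
counts = [0] * num_amps
for row_id in range(1000):
    counts[amp_for_row(row_id, num_amps)] += 1

print(counts)
```

Because placement depends only on the hashed primary index, any AMP can locate a row without consulting a central directory, which is what keeps the design shared-nothing.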


    MapReduce and Hadoop


    What is MapReduce?

A parallel programming framework

    > Originally from Google

    Generate search indexes, Web scoring algorithms

    > C++, Java, Python, etc.

    > Harness 1000s of CPUs

    MapReduce provides

    > Automatic parallelization

    > Fault tolerance

    > Monitoring & status updates

    Hadoop

    > Open source MapReduce

    (Diagram: a scheduler runs the map, shuffle, and reduce phases of a Map Function to produce results)


    What is Hadoop?

    Open source version of MapReduce

    > Top level Apache project

> Contributors: Yahoo!, IBM, Google

    Hadoop Core includes

    > Distributed File System

    > Map/Reduce

    > Job Tracker/Task Tracker

    Written in Java

    > Linux, Mac OS/X, Windows, and Solaris

    Growing list of supporting tools

Hadoop is named for the inventor's child's stuffed elephant

    http://hadoop.apache.org


    MapReduce has arrived

    Clusters of 100s/1000s of commodity Intel machines

Implementation is a C++ library linked into user programs

    15,000+ computers running Hadoop

    Research for Ad systems and web search

    Product search indexes

    Analytics from user sessions

    Log analysis for reporting and analytics and machinelearning

    Log analysis, data mining, and machine learning

    Large scale image conversion

    High energy physics, genomics, Digital Sky Survey

60+ users of Hadoop listed on http://wiki.apache.org/hadoop/PoweredBy


    Yahoo Hadoop clusters

    Has ~38,000 machines running Hadoop

Largest Hadoop clusters are currently 2,000 nodes

    Several petabytes of user data (compressed, unreplicated)

    Run hundreds of thousands of jobs every month


How does MapReduce work? An Example

    2 Input Files

    File1 => Hello World Bye World

    File2 => Hello Hadoop Goodbye Hadoop

    MapReduce Class

    public class WordCount {
      public static class Map { public void map( ... ) }
      public static class Reduce { public void reduce( ... ) }
      public static void main() {
        setMapperClass(Map.class);
        setReducerClass(Reduce.class);
      }
    }

    Intermediate Files (Output of Map)

    < Hello, 1> < World, 1> < Bye, 1> < World, 1>

    < Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>

    Output File (Output of Reduce)

    Bye 1
    Goodbye 1
    Hadoop 2
    Hello 2
    World 2
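The slide's data flow can be reproduced in a few lines of Python (a single-process toy simulation of the map, shuffle, and reduce phases, not the actual Hadoop API):

```python
from collections import defaultdict

files = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]

# Map phase: each input file yields <word, 1> pairs (the intermediate files).
intermediate = [[(word, 1) for word in text.split()] for text in files]

# Shuffle phase: group the pairs by word across all map outputs.
groups = defaultdict(list)
for pairs in intermediate:
    for word, count in pairs:
        groups[word].append(count)

# Reduce phase: sum the counts for each word (the output file).
output = {word: sum(counts) for word, counts in groups.items()}

for word in sorted(output):
    print(word, output[word])
# Bye 1 / Goodbye 1 / Hadoop 2 / Hello 2 / World 2
```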


    Simplified Hadoop Architecture

(Diagram: a Hadoop Programmer submits code to parallelize; the Name Node holds metadata and the JobTracker schedules work; Data Nodes on Rack 1 and Rack 2 each run HDFS, Maps, Reducers, and a TaskTracker; blocks are replicated across racks as fallback)


    Running Successive MapReduce Jobs

    MR is batch processing

    Run one MR job

    > Output files on reduce node

    Run next MR job

    > Input files are prior reduceroutput

> Creates new reducer files

    Keep running MapReduce batch jobs until solved

    Job 1 -- Map

    Job 1 -- Reduce

    Job 2 -- Map

    Job 2 -- Reduce
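Chaining jobs as described above just means feeding one job's reducer output into the next job's mappers. A toy in-process sketch (real Hadoop jobs would pass files through HDFS; `run_job` is a hypothetical helper):

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """One MapReduce job: map each record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [(key, reduce_fn(key, values)) for key, values in groups.items()]

# Job 1: word count over lines of text.
lines = ["big data big", "data big"]
job1 = run_job(lines,
               lambda line: [(w, 1) for w in line.split()],
               lambda w, ones: sum(ones))

# Job 2: invert job 1's output to group words by their frequency.
job2 = run_job(job1,
               lambda kv: [(kv[1], kv[0])],
               lambda count, words: sorted(words))

print(dict(job2))  # {3: ['big'], 2: ['data']}
```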


    Benefits of MapReduce / Hadoop

    Support for 100s to 1000s of server nodes

    > Extreme scalability

    > Commodity hardware = low costs

Simplicity of programming model

    > Attractive to Programmers (programming vs. SQL)

    Easy integration of developer tools

    > Java, grep, python, etc.

    Batch programming

    > Complex multi-step processing

    Absence of rigid schema

    > No ETL

    Flexibility

    Some fault-tolerance

    Easy setup


MapReduce / Hadoop Challenges

    Performance lags MPP databases

    Quality of Service

> Big jobs can hog the cluster (No workload management)

    > JobTracker memory as limited resource

    > No Fair Scheduler

    > Hard to troubleshoot performance bottlenecks

    Skills

> Ongoing code maintenance depends on open-source process

    > Skilled resources are still few-and-far-between


Addressing Hadoop drawbacks - Related Open Source Projects

    Pig (Pig Latin)

    > Process language

    > Data set handling, transformations

    Hive

    > SQL interface to Hadoop

    HBase

    > Column oriented database

    Many others

    > Zookeeper

    > Eclipse Hadoop plug-in

    > Thrift

    > Cascading

    > Scope

    > Dryad/Linq

    > etc.


    MapReduce Applications


    Common Hadoop Applications

    Reporting

    > Web logs

Web analytics

    > Click streams

    > Session analysis

    > Social media

    Data mining

    > Personalization

    > Clustering, k-means, Bayesian

    Text analysis

    > Search indexes

    > Sentiment analysis
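As one data-mining example from the list above, k-means clustering maps naturally onto MapReduce: the map step assigns each point to its nearest centroid, and the reduce step averages each centroid's points. A toy one-dimensional sketch in plain Python (illustrative only, not a Hadoop job):

```python
from collections import defaultdict

def kmeans_step(points, centroids):
    """One MapReduce-style k-means iteration: map assigns, reduce averages."""
    # Map: emit (nearest centroid index, point) pairs, grouped by centroid.
    assignments = defaultdict(list)
    for x in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        assignments[nearest].append(x)
    # Reduce: each new centroid is the mean of its assigned points.
    return [sum(assignments[i]) / len(assignments[i]) if assignments[i] else centroids[i]
            for i in range(len(centroids))]

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids = [0.0, 5.0]
for _ in range(5):  # iterate (i.e. run successive jobs) until stable
    centroids = kmeans_step(points, centroids)

print(centroids)  # [2.0, 11.0]
```

Each iteration corresponds to one MapReduce job, so the full algorithm is exactly the "successive batch jobs" pattern shown earlier.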


    Web Analytics / Web Server Log Analysis

Cross-channel, trends, forecasts, anomalies

    Web Site Analytics Data Warehouse

    Multivariate testing

    Usability testing

    Site error monitoring

    Surveys

    eCommerce merchandising

    Dynamic ad serving

    Behavior targeting

    E-Mail targeting

    Hadoop map reduce
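Session analysis of the kind shown here is a classic Hadoop job: group log records by visitor, then split each visitor's clicks into sessions wherever the gap exceeds an inactivity threshold. A toy in-process sketch (the 30-minute threshold and the `clicks` records are illustrative assumptions):

```python
from collections import defaultdict

SESSION_GAP = 30 * 60  # split sessions on 30 minutes of inactivity (assumed)

# (user_id, timestamp-in-seconds) click records, as parsed from a web log
clicks = [("u1", 0), ("u1", 100), ("u1", 5000), ("u2", 50), ("u1", 200)]

# Map/shuffle: group click timestamps by user (the key).
by_user = defaultdict(list)
for user, ts in clicks:
    by_user[user].append(ts)

# Reduce: sort each user's clicks and count session breaks.
sessions = {}
for user, stamps in by_user.items():
    stamps.sort()
    count = 1
    for prev, cur in zip(stamps, stamps[1:]):
        if cur - prev > SESSION_GAP:
            count += 1
    sessions[user] = count

print(sessions)  # u1 has two sessions (one 4800s gap), u2 has one
```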


    Search Index Page Ranking


    Large Scale Data Transformation

    Millions of files or objects

    Find key object and transform it

> Convert BMPs to JPGs

    > Convert DOCs to PDFs

    > Change NYC to New York City

    > Etc.

    ETL but data stays on node

    New York Times

    > Convert 11M articles to PDFs

    > Convert TIFFs in clouds

    > Use Amazon EC2 and S3 clouds

> 4TB of articles to 1.5TB of PDFs

    Hadoop map reduce

    EC2/S3 EC2/S3 EC2/S3


    RackSpace eMail Logs Tech Support

    http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data

    http://blog.racklabs.com/?p=66

(Diagram: 100s of eMail servers stream email logs into the Hadoop File System; Lucene text indexing over the logs supports Tech Support reports)


MapReduce and RDBMS - Comparisons


    Lots of controversy and flame-wars

    http://www.dbms2.com/2009/04/14/ebay-thinks-mpp-dbms-clobber-mapreduce/


MapReduce: A major step backwards - DeWitt and Stonebraker

    Not a database

    > No schema, indexes, optimizer

    Not high performance

Not a data warehouse

    > No integrated data, no history

    > Simple reports, single fact table

    > Severe data skew

    Not mature technology

    > Many single points of failure

    > No ISV tools

    > No performance tools

    Not for business users

    > No pivot, drill down, self service

    But does it matter??


    Architectural Comparisons

MapReduce vs. RDBMS:

    Cost of Operation: MapReduce - high labor cost/query (labor + operating costs); RDBMS - lower cost/query

    Time-to-value: MapReduce - easy to set up and use, low cost of acquisition; RDBMS - high cost of acquisition, difficult to set up

    Flexibility: MapReduce - significant flexibility; RDBMS - limited flexibility

    Schema Support: MapReduce - externally imposed; RDBMS - native

    Indexing: MapReduce - programmed; RDBMS - native

    Programming Model: MapReduce - programmatic (attractive to programmers); RDBMS - declarative

    Pipelining: MapReduce - data pull via intermediate files, improved fault-tolerance (checkpointing), reduced performance; RDBMS - optimized query plan allows data push, intermediate data not written to disk

    Compression: MapReduce - compression yields poorer performance (currently); RDBMS - compression yields performance (and space) benefits

    Repetitive Record Parsing: MapReduce - parse (at least) once per query; RDBMS - parse once during ETL phase

    * MapReduce and Parallel RDBMSs: Friends or Foes?


Performance Comparisons - Load Times better in Hadoop

    Source: Stonebraker et al, A Comparison of Approaches to Large-Scale Data Analysis, Apr 13, 2009


Performance Comparisons - grep and select times worse in Hadoop

    Source: Stonebraker et al, A Comparison of Approaches to Large-Scale Data Analysis, Apr 13, 2009


Performance Comparisons - Aggregate and Join times much worse in Hadoop

    Source: Stonebraker et al, A Comparison of Approaches to Large-Scale Data Analysis, Apr 13, 2009


    MapReduce and RDBMS

    Complementary, not Competitive!


Analytics Processes - Comparing MapReduce and RDBMS

    STRATEGIC INTELLIGENCE | OPERATIONAL INTELLIGENCE

    REPORTING - WHAT happened? (Batch)

    ANALYZING - WHY did it happen? (Ad Hoc Analytics)

    PREDICTING - WHAT WILL happen?

    OPERATIONALIZING - WHAT IS happening now? (Event driven)

    ACTIVATING - MAKE it happen! (Continuous updates, tactical queries)

    Hadoop: Larger Data Sizes

    RDBMS: Complexity, # queries


    Teradata and Hadoop

    Use cases

    > ETL on steroids

> Hadoop as a BI Tool

    > Hadoop as a BI application

    Teradata DeveloperExchange

    > UDF examples

http://developer.teradata.com/extensibility/articles/hadoop-dfs-to-teradata

    http://developer.teradata.com/extensibility/articles/sessionization-map-reduce-support-in-teradata

(Diagram: Hadoop output flows through TPT and MapReduce UDFs into the data warehouse, feeding ETL and analytic applications across Marketing, Financials, Operations, Sales, Inventory, and Customers)


    Hadoop Parallel Processing

    Analytic, utility, etc. applications

    Data sources

    Data processing

MPP Data Warehouses, Web App server farms, Web crawlers, HDFS clusters


    When to Use Which Infrastructure

MapReduce: Complex processes, Multi-step processes, 1,000+ nodes required, Can't move the data, Extensive text parsing, Householding analysis

    It depends: Data mining, Simple reports, Data cleansing, Visualization tools, Text analysis tools

    RDBMS: Iterative discovery, Drill down, OLAP, Dashboards, Business user querying, Integrated subject areas, Multidimensional views, Trends over time


    Conclusion

MapReduce vs RDBMS

    > Surface similarities, but deep differences

    > RDBMS performance far superior to Hadoop

    > Still immature, but expect deficiencies to be solved over time

    Complementary rather than Competitive technologies

    > Unstructured datasets

    > One/few-time processing (e.g. web-logs)

    Key Criterion - $/Query

    > Large # of queries on a dataset - consider RDBMS

    > Quick time-to-value for few queries - consider MapReduce

    > Expect MapReduce's high cost/query to drop over time

    Consider MapReduce for

    > ETL processing

    > Parallel programming

    Keep a close eye on MapReduce, but don't give up your RDBMS just yet!


    Acknowledgements and Research material

MapReduce: Simplified Data Processing on Large Clusters, Dean and Ghemawat, Google

    Internal paper: MapReduce and Hadoop Technical Overview, Dan Graham and Rick Burns, Teradata

    Tutorial: What is MapReduce and Hadoop, Dan Graham and Rick Burns, Teradata

    MapReduce, Hadoop, and Data Warehousing Coexistence, Dan Graham and Rick Burns, Teradata

    Pig Latin: A Not-So-Foreign Language for Data Processing, Olston et al

    Hive - A Warehousing Solution Over a MapReduce Framework, Thusoo et al

    MapReduce and Parallel DBMSs: Friends or Foes?, Stonebraker et al, Communications of the ACM, January 2010

    MapReduce: A major step backwards, DeWitt and Stonebraker

    A Comparison of Approaches to Large-Scale Data Analysis, Pavlo et al, SIGMOD '09, June 29-July 2, 2009

    Special thanks to Dan Graham, Teradata Corporation
