Big Data v2.1
-
7/31/2019 Big Data v2.1
1/46
-
Presented: 9/15/2010
Agenda
Advances in Relational Database Technology
MapReduce and Hadoop
Applications of MapReduce
MapReduce and RDBMS - Comparisons
Coexistence between MapReduce and RDBMS
-
Brief Bio of Dilip Krishna: Financial Data and Technology Guy
Engagement Partner, North-East Financial Services, Teradata
> Former Director of N.A. Enterprise Risk Management practice
> Consulted on ERM and Basel II initiatives - U.S. and Canadian banks
Long experience in technology and business consulting in the financial industry
Large-scale projects including Basel II implementations
Authored numerous articles about risk management and data architecture
Engineering degrees from the Ohio State University and the Indian Institute of Technology
CFA and FRM designations
-
Disclaimer
1. These ideas do not necessarily reflect positions of Teradata
> I take full responsibility for any errors or omissions
2. This presentation is meant as a review of current Big Data technologies, but the examples of RDBMS will solely use Teradata by necessity
-
What is this?
-
Big Data has evolved
Really Big Data!
(Image: Seagate 2TB External Desktop Hard Drive)
-
What is Big Data?
Big data is a collection of digital information whose size is beyond the ability of most software tools and people to capture, manage, and process the data.
Current examples of Big Data include weblogs, RFID readers, sensor networks, social networks, Internet text and documents, Internet search indexing, call detail records, genomics, astronomy, biological research, military surveillance, photography archives, video archives, and large-scale eCommerce.
Big Data requires scalable technologies to efficiently process large quantities of data within tolerable elapsed times. Big Data scalable technologies include MPP databases, the Apache Hadoop Framework, the Internet, and archival storage systems.
Currently Big Data ranges from a few dozen terabytes to several petabytes and is constantly increasing at both the low-end and high-end size.
Source: Teradata working definition, Sep 2010
-
Big Data can work on lots of technologies, but...
* Source: Curt Monash, http://www.networkworld.com/community/node/41777, downloaded on 11-Sep-2010
-
It's not the size of data, but what you do with it!
Dimensions of workload (from the chart):
Number of Concurrent Users: 10s, 100s, 1,000s, 10,000s
Frequency of Update: Monthly, Weekly, Daily, Now
Level of Detail: Summary, Dimensional, Detailed
Storage Scalability: Gigabytes, Terabytes, 100s of Terabytes
Parallel Operation: None, Conditional, Total
Query Complexity: Simple, 3-5 Way Table Joins, 6-20 Way Table Joins
Data Loading: Manual, Automated, Integrated
Workload Mix: Batch Reporting, Ad Hoc, Interactive Analysis, Active Data Warehousing
-
Relational Databases
-
Not your father's Relational Database
Concepts date back to 1970 (Codd and Date)
Evolution of full-fledged RDBMS systems: Oracle, Sybase, DB2, etc.
Performance Considerations
> Massively-Parallel Processing (MPP) platforms pioneered by Teradata
> Column-oriented databases: Sybase IQ, Vertica, etc.
Data Warehouse Appliances
Extremely large data sets are routinely being processed
> Many companies in the petabyte club
-
An explosion in maturity in RDBMS: An Example
Powered by the Teradata Database: a family of purpose-built platforms (scalability across the line ranges from up to 6TB, through 17TB and 275TB, to 86PB and 88PB)
Data Mart Appliance: Test/Development or Smaller Data Marts
Extreme Performance Appliance: Extreme Performance for Operational Analytics
Data Warehouse Appliance: Entry-level EDW or Departmental Data Marts
Extreme Data Appliance: Analytics on Extreme Data Volumes
Active Enterprise Data Warehouse: EDW/ADW for both Strategic and Operational Intelligence
Big Data
-
Benefits of RDBMS and DW Appliances
Declarative Programming Style of SQL
> Ease-of-use (no programming required)
Separation of code and data using Schemas
Rich set of analysis tools
> Report writers
> Business intelligence tools
> Data mining tools
> Database design tools
Custom coding possible via UDFs (but not easy)
Codebase Maturity
> 30+ years of development
Manageability
> Out-of-the-box performance
> One number to call
> TCO Reduction
> Reduced administration
> Built-in high availability
> Scalability
> Rapid time-to-value
Low Cost per Query
-
Scalable Shared-Nothing Software: Virtual Processors
Divide the data, workload, and system resources evenly among many parallel processing units (AMPs)
No single point of control for any operation
> I/O, Buffers, Locking, Logging, Dictionary
> Nothing centralized
> Nothing in the way of linear scalability
(Diagram: each AMP owns its own Logs, Locks, Buffers, and I/O)
-
Automatic Data Distribution
Each node has rows equally distributed from every table
Each node works independently, in parallel, on its rows
(Diagram: rows of Table-A through Table-D are spread across Node 1 (AMP1, AMP2) and Node 2 (AMP3, AMP4), connected by the BYNET)
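The even spread described above is typically achieved by hashing each row's distribution key. A minimal sketch in Python (the CRC32 hash, the 4-AMP count, and the synthetic customer IDs are illustrative assumptions, not Teradata's actual hashing algorithm):

```python
# Sketch of hash-based row distribution across parallel units (AMPs).
# zlib.crc32 stands in for the real row-hash function; NUM_AMPS is made up.
import zlib
from collections import Counter

NUM_AMPS = 4

def amp_for_row(primary_index_value: str) -> int:
    """Map a row's primary-index value to one of the AMPs via a stable hash."""
    return zlib.crc32(primary_index_value.encode()) % NUM_AMPS

# Distribute 10,000 synthetic customer IDs and check the spread per AMP.
placement = Counter(amp_for_row(f"customer-{i}") for i in range(10_000))
for amp, rows in sorted(placement.items()):
    print(f"AMP {amp}: {rows} rows")
```

Because every AMP receives roughly the same number of rows, each can work independently and in parallel on its share, which is the basis of the linear scalability claimed on the previous slide.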
-
MapReduce and Hadoop
-
What is MapReduce?
A parallel programming framework
> Originally from Google (to generate search indexes and web scoring algorithms)
> C++, Java, Python, etc.
> Harness 1000s of CPUs
MapReduce provides
> Automatic parallelization
> Fault tolerance
> Monitoring & status updates
Hadoop
> Open source MapReduce
(Diagram: a scheduler dispatches map functions; a shuffle phase routes their output to reduce tasks, which produce the results)
-
What is Hadoop?
Open source version of MapReduce
> Top level Apache project
> Contributors: Yahoo!, IBM, Google
Hadoop Core includes
> Distributed File System
> Map/Reduce
> Job Tracker/Task Tracker
Written in Java
> Linux, Mac OS/X, Windows, and Solaris
Growing list of supporting tools
Hadoop is named for the inventor's child's stuffed elephant
http://hadoop.apache.org
-
MapReduce has arrived
Clusters of 100s/1000s of commodity Intel machines
Implementation is a C++ library linked into user programs
15,000+ computers running Hadoop
Research for ad systems and web search
Product search indexes
Analytics from user sessions
Log analysis for reporting, analytics, and machine learning
Log analysis, data mining, and machine learning
Large-scale image conversion
High-energy physics, genomics, Digital Sky Survey
> 60 listed users of Hadoop on http://wiki.apache.org/hadoop/PoweredBy
-
Yahoo Hadoop clusters
Has ~38,000 machines running Hadoop
Largest Hadoop clusters are currently 2,000 nodes
Several petabytes of user data (compressed, unreplicated)
Run hundreds of thousands of jobs every month
-
How does MapReduce work? An Example

2 Input Files
File1 => Hello World Bye World
File2 => Hello Hadoop Goodbye Hadoop

Map Reduce Class
public class WordCount {
    public static class Map ... {
        public void map(...) { ... }
    }
    public static class Reduce ... {
        public void reduce(...) { ... }
    }
    public static void main(...) {
        setMapperClass(Map.class);
        setReducerClass(Reduce.class);
    }
}

Intermediate Files (Output of Map)
From File1: < Hello, 1> < World, 1> < Bye, 1> < World, 1>
From File2: < Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>

Output File (Output of Reduce)
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
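The same word count can be sketched outside Hadoop in a few lines of Python that mimic the map, shuffle, and reduce phases. The inputs are the slide's two files (with File2 as "Hello Hadoop Goodbye Hadoop", consistent with the slide's final output); everything else is a single-process stand-in for the framework:

```python
from collections import defaultdict

# Input "files" from the slide.
files = {
    "File1": "Hello World Bye World",
    "File2": "Hello Hadoop Goodbye Hadoop",
}

# Map phase: emit an intermediate (word, 1) pair for every token.
intermediate = []
for name, text in files.items():
    for word in text.split():
        intermediate.append((word, 1))

# Shuffle phase: group all intermediate values by key (the word).
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce phase: sum the grouped values for each word.
result = {word: sum(counts) for word, counts in groups.items()}
for word in sorted(result):
    print(word, result[word])
```

This prints the slide's output: Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2. In real Hadoop the map and reduce functions run on many machines and the shuffle moves data over the network, but the logic is the same.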
-
Simplified Hadoop Architecture
(Diagram: a Hadoop programmer submits code to parallelize; a Name Node holds HDFS metadata and runs the JobTracker; Data Nodes across Rack 1 and Rack 2 each run map and reduce tasks under a TaskTracker on HDFS, with replicated blocks as fallback)
-
Running Successive MapReduce Jobs
MR is batch processing
Run one MR job
> Output files on reduce node
Run next MR job
> Input files are prior reducer output
> Creates new reducer files
Keep running MapReduce batch jobs until solved
(Diagram: Job 1 Map -> Job 1 Reduce -> Job 2 Map -> Job 2 Reduce)
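The chaining pattern can be sketched in plain Python: a toy single-process driver runs Job 1 (word count), and Job 2 consumes Job 1's reducer output as its input, grouping words by frequency. The driver, inputs, and job choices are illustrative assumptions, not Hadoop's API:

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy single-process MapReduce driver: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map: emit (key, value) pairs
            groups[key].append(value)       # shuffle: group values by key
    return {key: reducer(key, vals) for key, vals in groups.items()}

lines = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]

# Job 1: word count. Its reducer output becomes Job 2's input.
counts = run_mapreduce(
    lines,
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda word, ones: sum(ones),
)

# Job 2: invert Job 1's output to list the words seen at each frequency.
by_frequency = run_mapreduce(
    counts.items(),
    mapper=lambda pair: [(pair[1], pair[0])],
    reducer=lambda freq, words: sorted(words),
)
print(by_frequency)
```

In Hadoop, the handoff between jobs happens through files on HDFS rather than in-memory dictionaries, which is why multi-step problems are solved as a sequence of batch jobs.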
-
Benefits of MapReduce / Hadoop
Support for 100s to 1000s of server nodes
> Extreme scalability
> Commodity hardware = low costs
Simplicity of programming model
> Attractive to programmers (programming vs. SQL)
Easy integration of developer tools
> Java, grep, Python, etc.
Batch programming
> Complex multi-step processing
Absence of rigid schema
> No ETL
> Flexibility
Some fault-tolerance
Easy setup
-
MapReduce / Hadoop Challenges
Performance lags MPP databases
Quality of Service
> Big jobs can hog the cluster (no workload management)
> JobTracker memory as limited resource
> No Fair Scheduler
> Hard to troubleshoot performance bottlenecks
Skills
> Ongoing code maintenance depends on open-source process
> Skilled resources are still few and far between
-
Addressing Hadoop drawbacks: Related Open Source Projects
Pig (Pig Latin)
> Process language
> Data set handling, transformations
Hive
> SQL interface to Hadoop
HBase
> Column-oriented database
Many others
> Zookeeper
> Eclipse Hadoop plug-in
> Thrift
> Cascading
> Scope
> Dryad/Linq
> etc.
-
MapReduce Applications
-
Common Hadoop Applications
Reporting
> Web logs
Web analytics
> Click streams
> Session analysis
> Social media
Data mining
> Personalization
> Clustering, k-means, Bayesian
Text analysis
> Search indexes
> Sentiment analysis
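The clustering bullet above can be made concrete: one iteration of k-means fits the MapReduce shape naturally, with the map step assigning each point to its nearest centroid and the reduce step averaging each cluster. The data points and starting centroids below are made up for illustration, and this is a single-process sketch, not a Hadoop job:

```python
# One k-means iteration in MapReduce style: map = assign point to nearest
# centroid (the key); reduce = average the points grouped under each key.
from collections import defaultdict

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5)]
centroids = [(0.0, 0.0), (10.0, 10.0)]   # illustrative starting centroids

def nearest(point):
    """Map step: key each point by the index of its nearest centroid."""
    px, py = point
    dists = [(px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centroids]
    return min(range(len(centroids)), key=dists.__getitem__)

# Shuffle: group points by the centroid index their mapper emitted.
clusters = defaultdict(list)
for p in points:
    clusters[nearest(p)].append(p)

# Reduce step: each new centroid is the mean of its assigned points.
new_centroids = {
    idx: (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
    for idx, pts in clusters.items()
}
print(new_centroids)
```

A full k-means run would repeat this job with the new centroids until they stop moving, which is exactly the successive-batch-jobs pattern shown earlier in the deck.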
-
Web Analytics / Web Server Log Analysis
(Diagram: Hadoop MapReduce feeds a Web Site Analytics Data Warehouse for cross-channel analysis, trends, forecasts, and anomalies; inputs and consumers include multivariate testing, usability testing, site error monitoring, surveys, eCommerce merchandising, dynamic ad serving, behavior targeting, and e-mail targeting)
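A typical click-stream job in this pipeline is sessionization: group log events by user (map + shuffle), then split each user's clicks into sessions wherever the gap exceeds a timeout (reduce). The log format, 30-minute timeout, and user names below are illustrative assumptions:

```python
from collections import defaultdict

SESSION_GAP = 30 * 60  # 30 minutes in seconds; an illustrative timeout

# Toy click stream: (user_id, unix_timestamp) pairs in arrival order.
clicks = [
    ("alice", 1000), ("bob", 1100), ("alice", 1300),
    ("alice", 5000), ("bob", 1200), ("alice", 5100),
]

# Map + shuffle: group click timestamps by user.
by_user = defaultdict(list)
for user, ts in clicks:
    by_user[user].append(ts)

# Reduce: sort each user's clicks; start a new session after a long gap.
def sessionize(timestamps):
    sessions, current = [], []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append(ts)
    sessions.append(current)
    return sessions

user_sessions = {user: sessionize(ts) for user, ts in by_user.items()}
print(user_sessions)
```

At web scale the grouping step is what makes MapReduce attractive here: raw logs arrive interleaved across hundreds of servers, and the shuffle brings each user's events together without the data ever passing through a single machine.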
-
Search Index Page Ranking
-
Large Scale Data Transformation
Millions of files or objects
Find key object and transform it
> Convert BMPs to JPGs
> Convert DOCs to PDFs
> Change NYC to New York City
> Etc.
ETL, but data stays on node
New York Times
> Convert 11M articles to PDFs
> Convert TIFFs in the cloud
> Use Amazon EC2 and S3 clouds
> 4TB of articles became 1.5TB of PDFs
(Diagram: Hadoop MapReduce running across EC2/S3 nodes)
-
Rackspace eMail Logs: Tech Support
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
http://blog.racklabs.com/?p=66
(Diagram: 100s of eMail servers stream email logs into the Hadoop File System; Lucene text indexing makes them searchable for Tech Support reports)
-
MapReduce and RDBMS: Comparisons
-
Lots of controversy and flame-wars
http://www.dbms2.com/2009/04/14/ebay-thinks-mpp-dbms-clobber-mapreduce/
-
MapReduce: A major step backwards (DeWitt and Stonebraker)
Not a database
> No schema, indexes, optimizer
Not high performance
Not a data warehouse
> No integrated data, no history
> Simple reports, single fact table
> Severe data skew
Not mature technology
> Many single points of failure
> No ISV tools
> No performance tools
Not for business users
> No pivot, drill down, self service
But does it matter??
-
Architectural Comparisons
Cost of Operation: MapReduce: high labor cost per query (labor + operating costs) | RDBMS: lower cost per query
Time-to-value: MapReduce: easy to set up and use, low cost of acquisition | RDBMS: high cost of acquisition, difficult to set up
Flexibility: MapReduce: significant flexibility | RDBMS: limited flexibility
Schema Support: MapReduce: externally imposed | RDBMS: native
Indexing: MapReduce: programmed | RDBMS: native
Programming Model: MapReduce: programmatic, attractive to programmers | RDBMS: declarative
Pipelining: MapReduce: data pull via intermediate files, improved fault tolerance (checkpointing), reduced performance | RDBMS: optimized query plan allows data push, intermediate data not written to disk
Compression: MapReduce: compression yields poorer performance (currently) | RDBMS: compression yields performance (and space) benefits
Repetitive Record Parsing: MapReduce: parse (at least) once per query | RDBMS: parse once during ETL phase
* MapReduce and Parallel RDBMSs: Friends or Foes?
-
Performance Comparisons: Load Times Better in Hadoop
Source: Stonebraker et al, A Comparison of Approaches to Large-Scale Data Analysis, Apr 13, 2009
-
Performance Comparisons: grep and Select Times Worse in Hadoop
Source: Stonebraker et al, A Comparison of Approaches to Large-Scale Data Analysis, Apr 13, 2009
-
Performance Comparisons: Aggregate and Join Comparison Much Worse in Hadoop
Source: Stonebraker et al, A Comparison of Approaches to Large-Scale Data Analysis, Apr 13, 2009
-
MapReduce and RDBMS
Complementary, not Competitive!
-
Analytics Processes: Comparing MapReduce and RDBMS
(Spectrum runs from Strategic Intelligence to Operational Intelligence)
REPORTING: WHAT happened? (Batch)
ANALYZING: WHY did it happen? (Ad Hoc)
PREDICTING: WHAT WILL happen? (Analytics)
OPERATIONALIZING: WHAT IS happening now? (Event driven)
ACTIVATING: MAKE it happen! (Continuous updates, tactical queries)
Hadoop fits larger data sizes; RDBMS fits greater complexity and numbers of queries.
-
Teradata and Hadoop
Use cases
> ETL on steroids
> Hadoop as a BI Tool
> Hadoop as a BI application
Teradata Developer Exchange
> UDF examples
http://developer.teradata.com/extensibility/articles/hadoop-dfs-to-teradata
http://developer.teradata.com/extensibility/articles/sessionization-map-reduce-support-in-teradata
(Diagram: Hadoop feeds the warehouse via TPT and MapReduce UDFs; ETL flows into subject areas such as Marketing, Financials, Operations, Sales, Inventory, and Customers for analytic applications)
-
Hadoop Parallel Processing
(Diagram: data sources such as web crawlers, web app server farms, and MPP data warehouses feed HDFS clusters for data processing by analytic, utility, and other applications)
-
When to Use Which Infrastructure

MapReduce
> Complex processes
> Multi-step processes
> 1,000+ nodes required
> Can't move the data
> Extensive text parsing

It depends
> Householding analysis
> Data mining
> Simple reports
> Data cleansing
> Iterative discovery
> Text analysis tools

RDBMS
> Drill down, OLAP
> Dashboards
> Business user querying
> Integrated subject areas
> Multidimensional views
> Trends over time
> Visualization tools
-
Conclusion
MapReduce vs RDBMS
> Surface similarities, but deep differences
> RDBMS performance far superior to Hadoop
> Still immature, but expect deficiencies to be solved over time
Complementary rather than Competitive technologies
> Unstructured datasets
> One/few-time processing (e.g. web logs)
Key Criterion: $/Query
> Large # of queries on a dataset: consider RDBMS
> Quick time-to-value for few queries: consider MapReduce
> Expect MapReduce's high cost/query to drop over time
Consider MapReduce for
> ETL processing
> Parallel programming
Keep a close eye on MapReduce, but don't give up your RDBMS just yet!
-
Acknowledgements and Research material
MapReduce: Simplified Data Processing on Large Clusters, Dean and Ghemawat, Google
Internal paper: MapReduce and Hadoop Technical Overview, Dan Graham and Rick Burns, Teradata
Tutorial: What is MapReduce and Hadoop, Dan Graham and Rick Burns, Teradata
MapReduce, Hadoop, and Data Warehousing Coexistence, Dan Graham and Rick Burns, Teradata
Pig Latin: A Not-So-Foreign Language for Data Processing, Olston et al
Hive: A Warehousing Solution Over a MapReduce Framework, Thusoo et al
MapReduce and Parallel DBMSs: Friends or Foes?, Stonebraker et al, Communications of the ACM, January 2010
MapReduce: A major step backwards, DeWitt and Stonebraker
A Comparison of Approaches to Large-Scale Data Analysis, Pavlo et al, SIGMOD '09, June 29-July 2, 2009

Special thanks to Dan Graham, Teradata Corporation