Pushing the performance envelope - Identifying Performance Bottlenecks in Big SQL Hadoop space.
© 2015 IBM Corporation
Pushing The Performance Envelope: Identifying Performance Bottlenecks in Big SQL/Hadoop space.
Roy Cecil / 10/26/2015
Please Note:
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
• Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
Agenda
Big SQL Architecture Overview
Architecture Overview – IBM Open Platform
[Figure: IBM BigInsights for Apache Hadoop. The base layer is the IBM Open Platform (IOP) with Apache Hadoop (full open source). On top of it sit the IBM BigInsights modules: Analyst (Big SQL, BigSheets), Data Scientist (Big SQL, BigSheets, Big R, Machine Learning with Big R, Text Analytics), and Enterprise Management (POSIX Distributed File System; multi-workload, multi-tenant scheduling). The offering includes 24 x 7 support.]
Architecture Overview
[Figure: Big SQL cluster architecture. Management nodes host the Big SQL master (head) node, which runs the Big SQL scheduler, a DDL FMP, and a UDF FMP, together with the database service, Hive Metastore, and Hive Server. Each compute node hosts a Big SQL worker node with a Java I/O FMP, a Native I/O FMP, a UDF FMP, and temp data, alongside an HDFS DataNode, an MR Task Tracker, other services, and HDFS data.]
*FMP = Fenced mode process
Big SQL does not own the data. Therefore, indexes cannot be built. Data is scatter partitioned: there is NO co-location of data.
A Look into DataNode
[Figure: a Big SQL worker node on a compute node. The Big SQL runtime, with the Big SQL optimizer and query re-write engine, uses SORTHEAP and a DB2 temp tablespace; the Java I/O and Native I/O reader FMPs read HDFS data.]
• The bufferpool cache is only for temporary data (within the current query).
• SORTHEAP is used for sort operations, which spill to the bufferpool and to disk if SORTHEAP is insufficient.
• The Big SQL optimizer and query re-write engine select the best access plans.
Readers/Writers
• Readers and writers are responsible for reading data from and writing data to HDFS for the Big SQL engine.
• Native I/O reader (also known as dfsReader or the C++ reader)
  • The high-speed interface for common file formats: Delimited, Parquet, RC, Avro, and SequenceFile
• Java I/O reader
  • Handles all other formats via standard Hadoop/Hive APIs
• Both perform multi-threaded direct I/O on local data
• The database engine understands storage format capabilities
  • The projection list is pushed into the I/O format whenever possible
  • Predicates are pushed as close to the data as possible (into the storage format, if possible)
  • Predicates that cannot be pushed down are evaluated within the database engine
• The database engine is only aware of which nodes need to read
  • The scheduler directs the readers to their portion of work
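The pushdown behaviour described for the readers can be illustrated with a minimal Python sketch. This is not Big SQL code: the `scan` and `engine_filter` functions, the in-memory rows, and the column names are all hypothetical stand-ins. The reader applies the projection list and any pushable predicate while reading, and the engine evaluates whatever predicate could not be pushed down.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Row = Dict[str, object]


def scan(rows: Iterable[Row],
         projection: List[str],
         pushed_predicate: Optional[Callable[[Row], bool]] = None) -> Iterator[Row]:
    """Hypothetical reader: filter and project rows while reading them."""
    for row in rows:
        # A pushed-down predicate discards rows before they reach the engine.
        if pushed_predicate is not None and not pushed_predicate(row):
            continue
        # Only the projected columns are handed back to the engine.
        yield {col: row[col] for col in projection}


def engine_filter(rows: Iterable[Row],
                  residual_predicate: Callable[[Row], bool]) -> List[Row]:
    """Hypothetical engine step: evaluate predicates that were not pushed down."""
    return [row for row in rows if residual_predicate(row)]


# Illustrative data only.
rows = [
    {"id": 1, "region": "EU", "amount": 10, "note": "a"},
    {"id": 2, "region": "US", "amount": 25, "note": "b"},
    {"id": 3, "region": "EU", "amount": 40, "note": "c"},
]

# region = 'EU' is pushed into the reader; amount > 20 stays in the engine.
read = list(scan(rows, ["id", "amount"], lambda r: r["region"] == "EU"))
result = engine_filter(read, lambda r: r["amount"] > 20)
print(read)
print(result)
```

In this sketch the engine never sees the `region` or `note` columns, nor the rows the reader filtered out, which is the saving that pushdown is meant to provide.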
Scheduler
• The Scheduler is the main RDBMS↔Hadoop service interface
• Interfaces with the Hive Metastore for table metadata
  • The compiler asks it for some "hadoop" metadata, such as partitioning columns
• Acts like the MapReduce job tracker for Big SQL
  • Big SQL provides query predicates for the scheduler to perform partition elimination
  • Determines splits for each “table” involved in the query
  • Schedules splits on available Big SQL nodes (favoring scheduling locally to the data)
  • Serves work (splits) to I/O engines
  • Coordinates “commits” after INSERTs
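The idea of scheduling splits on available nodes while favoring data locality can be sketched as follows. This is an illustrative greedy policy, not the actual Big SQL scheduler algorithm; the function name, worker names, and split layout are assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple


def schedule_splits(splits: List[Tuple[str, Set[str]]],
                    workers: List[str]) -> Dict[str, str]:
    """Assign each split to a worker, preferring workers that hold the data.

    splits: (split_id, hosts holding a replica of the split's block).
    Greedy assumption: pick the least-loaded local worker, and fall back to
    the least-loaded worker overall when no local worker is available.
    """
    load = {worker: 0 for worker in workers}
    assignment: Dict[str, str] = {}
    for split_id, hosts in splits:
        local = [worker for worker in workers if worker in hosts]
        candidates = local or workers
        # Break ties on the worker name so the result is deterministic.
        chosen = min(candidates, key=lambda w: (load[w], w))
        assignment[split_id] = chosen
        load[chosen] += 1
    return assignment


# Hypothetical cluster: w4 holds data but runs no Big SQL worker.
workers = ["w1", "w2", "w3"]
splits = [
    ("s1", {"w1", "w2"}),
    ("s2", {"w1"}),
    ("s3", {"w4"}),
    ("s4", {"w2", "w3"}),
]
assignment = schedule_splits(splits, workers)
print(assignment)
```

Split `s3` has no local worker, so it is read remotely by the least-loaded node; every other split is read where a replica lives.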
Metrics Driven Performance
Performance Management
• Monitoring
  • Big SQL Metrics
  • Hadoop Metrics
  • Operating System Metrics
• Correlation
  • Data/Event correlation
  • Form Hypothesis
• Optimizing
  • Performance Tuning
• Change Management
  • Configuration
  • Software
  • Hardware
  • Baseline
Categories of Metrics
Ambari Console – System/Hadoop Metrics
Historical view – Drill Down
Exploit the power of Hadoop Metrics
• HDFS
• MapReduce
• RPC
• Resource Manager
• Others
Add Hadoop Metrics
Big SQL Metrics - Data Server Manager
DSM Welcome Screen
Adding a connection to your Big SQL database
$ db2 get dbm cfg | grep SVCENAME
TCP/IP Service name (SVCENAME) = db2c_bigsql
$ grep db2c_bigsql /etc/services
db2c_bigsql 32051/tcp
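The two commands above resolve the service name to a TCP port, which is presumably the port to enter when adding the connection. The same lookup can be sketched in Python. The `find_service_port` helper is hypothetical, and the sample text simply reuses the `/etc/services` line shown on the slide; on a real system the text would be read from `/etc/services`.

```python
from typing import Optional, Tuple


def find_service_port(services_text: str,
                      service_name: str) -> Optional[Tuple[int, str]]:
    """Return (port, protocol) for a service in /etc/services-style text."""
    for line in services_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2 or "/" not in fields[1]:
            continue
        name, port_proto, aliases = fields[0], fields[1], fields[2:]
        if service_name == name or service_name in aliases:
            port, proto = port_proto.split("/", 1)
            return int(port), proto
    return None


# Sample line taken from the slide.
sample = "db2c_bigsql 32051/tcp\n"
print(find_service_port(sample, "db2c_bigsql"))
```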
Overview Tab
Locking Tab
Applications
Workload
Memory
I/O
Storage Tab
Alerts
Case Study
TPC-DS (Query 16)
Query 16 – Execution Overview
Query 16 – Statement View
Query 16 – Applications View
Query 16 – Statement View (Detailed)
Where are we writing?
Query 16 – Plans
Query 16 – Force Application off
Query 16 – After ANALYZE
Top 10 Performance Tips
• Spread the Big SQL data path over as many disks as possible
• Share disks between Big SQL, HDFS (dfs.data.dir) or GPFS, and MapReduce intermediate data (mapred.local.dir)
[Figure: each disk is shared by Big SQL (bigsql_db_path), the MapReduce cache (mapred.local.dir), and HDFS/GPFS (dfs.data.dir).]
• Big SQL needs to share cluster resources with other Hadoop components
• When installing Big SQL, the user specifies the percentage of cluster resources to dedicate to Big SQL
  • The default is 25%
  • The recommended range is 25% to 75%
Out-of-the-box results
PARQUET is the optimal storage format for Big SQL
For more details: http://bit.ly/1W7KOAk
Big SQL (and Hive) can partition a table based on a data value. This improves query performance by eliminating the partitions that do not contain the data value of interest.
Big SQL stores different data partitions as separate files in HDFS and scans only the partitions that a query requires, which improves runtime.
Partition on a column that is commonly referenced in range-delimiting or equality predicates. Date ranges are ideal for use as partition columns.
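Partition elimination can be illustrated with a small Python sketch. The table layout, paths, and the `prune_partitions` helper are hypothetical; the paths follow the Hive-style `column=value` directory convention. Only the partitions whose date falls inside the range predicate are kept, so their files are the only ones that need to be scanned.

```python
from datetime import date
from typing import Dict, List


def prune_partitions(partitions: Dict[date, List[str]],
                     start: date,
                     end: date) -> Dict[date, List[str]]:
    """Keep only partitions whose value satisfies start <= value <= end."""
    return {value: files
            for value, files in sorted(partitions.items())
            if start <= value <= end}


# Hypothetical table partitioned by sale_date, one directory per value.
partitions = {
    date(2015, 10, 1): ["/warehouse/sales/sale_date=2015-10-01/part-0"],
    date(2015, 10, 2): ["/warehouse/sales/sale_date=2015-10-02/part-0"],
    date(2015, 10, 3): ["/warehouse/sales/sale_date=2015-10-03/part-0"],
}

# Predicate: sale_date BETWEEN '2015-10-02' AND '2015-10-03'
selected = prune_partitions(partitions, date(2015, 10, 2), date(2015, 10, 3))
files_to_scan = [path for files in selected.values() for path in files]
print(files_to_scan)
```

A query with a predicate on a non-partitioning column would get no such reduction, which is why the partition column should match the predicates the workload actually uses.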
Big SQL’s engine internally works with data in units of 32K pages and works most efficiently when the table definition allows a row to fit within 32K of memory. To exploit this optimization, use VARCHAR(n) instead of STRING whenever possible.
Use the bigsql.string.size property (via SET HADOOP PROPERTY) to lower the default size of the VARCHAR to which STRING is mapped when creating new tables.
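A rough way to see why STRING columns hurt is to add up the declared column widths and compare the total with a 32K page. The sketch below is a simplification under stated assumptions: it ignores per-row and per-column overhead, uses illustrative fixed widths, and assumes that STRING maps to VARCHAR(32672) unless bigsql.string.size is lowered. The helper names and table definitions are hypothetical.

```python
from typing import List, Optional, Tuple

PAGE_BYTES = 32 * 1024  # 32K page, as described above

# Illustrative fixed widths; real storage adds overhead not modelled here.
FIXED_WIDTHS = {"INT": 4, "BIGINT": 8, "DOUBLE": 8}

Column = Tuple[str, str, Optional[int]]  # (name, type, declared length)


def declared_row_bytes(columns: List[Column], string_size: int = 32672) -> int:
    """Sum the maximum declared width of every column in the row."""
    total = 0
    for _name, col_type, length in columns:
        if col_type == "VARCHAR":
            total += length or 0
        elif col_type == "STRING":
            # Assumed mapping: STRING becomes VARCHAR(string_size).
            total += string_size
        else:
            total += FIXED_WIDTHS[col_type]
    return total


def fits_in_page(columns: List[Column], string_size: int = 32672) -> bool:
    return declared_row_bytes(columns, string_size) <= PAGE_BYTES


with_string = [("id", "INT", None), ("name", "STRING", None),
               ("comment", "STRING", None)]
with_varchar = [("id", "INT", None), ("name", "VARCHAR", 64),
                ("comment", "VARCHAR", 1024)]

print(fits_in_page(with_string))                    # two default STRINGs
print(fits_in_page(with_string, string_size=4096))  # lowered STRING size
print(fits_in_page(with_varchar))                   # explicit VARCHAR(n)
```

Under these assumptions, two default-sized STRING columns already exceed one page, while either lowering the STRING size or declaring VARCHAR(n) brings the row back within it.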
Big SQL uses a powerful cost-based optimizer to select an optimal plan for each query. Having up-to-date statistics is key to good query performance.
More on best practices around ANALYZE: http://ibm.co/1PDXR8r
Informational constraints are defined like referential integrity constraints, except that they are not enforced.
Informational constraints give the optimizer hints about unique values, which lets it avoid unnecessary sorting and aggregation. They also help the optimizer make better selectivity estimates.
Big SQL comes with a powerful Workload Management (WLM) tool. With the WLM tool it is easy to define different workloads and assign resources (CPU/memory) to them.
This allows better exploitation of your system without sacrificing the QoS requirements of your workloads. It also improves the overall throughput of multi-stream workloads.
The Self-Tuning Memory Manager (STMM) is a thread that observes the memory usage patterns on your cluster and adjusts the various buffers so that your queries perform optimally.
Turn on STMM and ensure that the database is activated on all nodes.
Audited Results
Disclaimer: Based on IBM internal tests comparing BigInsights Big SQL, Cloudera Impala and Hortonworks Hive (current versions available as of 9/01/2014) running on identical hardware. The test workload was based on the latest revision of the TPC-DS benchmark specification at 10TB data size. Successful executions measure the ability to execute queries a) directly from the specification without modification, b) after simple modifications, c) after extensive query rewrites. All minor modifications are either permitted by the TPC-DS benchmark specification or are of a similar nature. All queries were reviewed and attested by a TPC certified auditor. Development effort measured time required by a skilled SQL developer familiar with each system to modify queries so they will execute correctly. Performance test measured scaled query throughput per hour of 4 concurrent users executing a common subset of 46 queries across all 3 systems at 10TB data size. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera. Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries.
Big SQL runs more SQL out-of-box
Porting Effort:
• Big SQL 4.1: 1 hour
• Spark SQL 1.5.0: 3-4 weeks
Big SQL is the only engine that can execute all 99 queries with minimal porting effort
Big SQL vs. Spark SQL @ 1TB TPC-DS
• Single Stream Results:
  • Big SQL was faster than Spark SQL on 76 / 99 queries
  • When Big SQL was slower, it was slower by only 1.6X on average
  • Query vs. query, Big SQL was on average 5.5X faster
  • Removing Top 5 / Bottom 5, Big SQL was 2.5X faster
But … what happens when you scale it?
Scale: 1 TB
• Single Stream
  • Big SQL was faster on 76 / 99 queries
  • Big SQL averaged 5.5X faster
  • Removing Top / Bottom 5, Big SQL averaged 2.5X faster
• 4 Concurrent Streams
  • Spark SQL FAILED on 3 queries
  • Big SQL was 4.4X faster*
Scale: 10 TB
• Single Stream
  • Big SQL was faster on 80 / 99 queries
  • Spark SQL FAILED on 7 queries
  • Big SQL averaged 6.2X faster*
  • Removing Top / Bottom 5, Big SQL averaged 4.6X faster
• 4 Concurrent Streams
  • Big SQL elapsed time for workload was better than linear
  • Spark SQL could not complete the workload (numerous issues). Partial results possible with only 2 concurrent streams.
*Compares only queries that both Big SQL and Spark SQL could complete (benefits Spark SQL)
[Figure axes: More Users (across), More Data (down)]
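The "Removing Top / Bottom 5" figures above suggest a trimmed average of per-query speedup ratios. The slides do not state the exact formula, so the sketch below is an assumption: it takes the arithmetic mean of the ratios and optionally drops the same number of values from each end. The ratios are made-up illustrative numbers, not benchmark measurements, and the example trims 2 values per end only because the sample is small.

```python
from typing import List


def mean_speedup(ratios: List[float], trim: int = 0) -> float:
    """Arithmetic mean of speedup ratios after dropping `trim` values per end."""
    ordered = sorted(ratios)
    if trim:
        if 2 * trim >= len(ordered):
            raise ValueError("trim removes every value")
        ordered = ordered[trim:-trim]
    return sum(ordered) / len(ordered)


# Hypothetical per-query ratios (other engine's time / Big SQL's time).
ratios = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 2.0, 3.0, 4.0, 5.0, 20.0, 40.0]

overall = mean_speedup(ratios)          # dominated by the two outliers
trimmed = mean_speedup(ratios, trim=2)  # outliers at both ends removed
print(round(overall, 4), round(trimmed, 4))
```

A few extreme queries can dominate a plain average, which is why the trimmed figure is reported next to it.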
Recommendation: Use Both. Right Tool for the Right Job
Not mutually exclusive: Big SQL & Spark SQL can co-exist in the cluster
Big SQL
• Migrating existing workloads to Hadoop
• Security
• Many Concurrent Users
• Best-in-class Performance
• Avoid maintaining 2 versions of SQL queries (RDBMS vs. Hadoop)
Spark SQL
• Machine Learning
• Large Scale / Complex Transformations
• Very Good Performance
• Ideal tool for Data Engineers and Data Scientists
… invoke Big SQL from Spark for best of both…
Thank You
Sr. Performance Engineer, IBM Software Labs, Dublin
Notices and Disclaimers
Copyright © 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form
without written permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for
accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to
update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO
EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO,
LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted
according to the terms and conditions of the agreements under which they are provided.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as
illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other
results in other operating environments may vary.
References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services
available in all countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the
views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or
other guidance or advice to any individual participant or their specific situation.
It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the
identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the
customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will
ensure that the customer is in compliance with any law.
Notices and Disclaimers (cont’d)
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly
available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance,
compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights,
trademarks or other intellectual property right.
• IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document
Management System™, FASP®, FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM
SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON,
OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®,
pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ,
Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of
International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at:
www.ibm.com/legal/copytrade.shtml.