SEIZE THE DATA. 2015 - Hewlett Packard Enterpriseh41382. · 2015-08-12 · SEIZE THE DATA. 2015...
Transcript of SEIZE THE DATA. 2015 - Hewlett Packard Enterpriseh41382. · 2015-08-12 · SEIZE THE DATA. 2015...
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.1 SEIZE THE DATA. 2015
SEIZE THE DATA. 2015
SEIZE THE DATA. 2015
Vertica SQL on Hadoop
Aliaksei Sandryhaila, Bob Hansen / August 12, 2015
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.3 SEIZE THE DATA. 2015
This is a rolling (up to three year) Roadmap and is subject to change without notice.
Forward-looking statements
This document contains forward looking statements regarding future operations, product development, product capabilities and availability dates. This information is subject to substantial uncertainties and is subject to change at any time without prior notification. Statements contained in this document concerning these matters only reflect Hewlett Packard's predictions and / or expectations as of the date of this document and actual results and future plans of Hewlett-Packard may differ significantly as a result of, among other things, changes in product strategy resulting from technological, internal corporate, market and other changes. This is not a commitment to deliver any material, code or functionality and should not be relied upon in making purchasing decisions.
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.4 SEIZE THE DATA. 2015
This is a rolling (up to three year) Roadmap and is subject to change without notice.
HP confidential information
This Roadmap contains HP Confidential Information.
If you have a valid Confidential Disclosure Agreement with HP, disclosure of the Roadmap is subject to that CDA. If not, it is subject to the following terms: for a period of 3 years after the date of disclosure, you may use the Roadmap solely for the purpose of evaluating purchase decisions from HP and use a reasonable standard of care to prevent disclosures. You will not disclose the contents of the Roadmap to any third party unless it becomes publically known, rightfully received by you from a third party without duty of confidentiality, or disclosed with HP’s prior written approval.
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.5 SEIZE THE DATA. 2015
This is a rolling (up to 3 year) roadmap and is subject to change without notice
Big Data gets more valuable as more kinds of data are brought together
Hadoop
HDFS
Billing
Clickstream
Telemetry
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.6 SEIZE THE DATA. 2015
This is a rolling (up to 3 year) roadmap and is subject to change without notice
VSQLH makes Vertica the power tool for your Hadoop data
Hive Pig
MapReduceHB
ase
HDFS
Billing
Clickstream
Telemetry
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.7 SEIZE THE DATA. 2015
Flat files in HDFS can be accessed via the HDFS Connector
CREATE EXTERNAL TABLE tweet_keywords
(keyword VARCHAR(64),
created_at TIMESTAMP,
score INT)
AS COPY FROM SOURCE
Hdfs(url=‘http://n01:50070/webhdfs/v1/posts/*’)
PARSER TwitterSentiment();
HDFS
json csv
This is a rolling (up to 3 year) roadmap and is subject to change without notice
HP Vertica
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.8 SEIZE THE DATA. 2015
Structured data in HCatalog can be accessed via the Hcatalog Connector
CREATE HADOOP SCHEMA hive WITH
HOSTNAME='hcat.vertica.com‘
HCATALOG_DB=‘hive‘
HCATALOG_USER=‘hive_user’;
HC
ata
log
HDFS
json csv
HDFS
txt orc
This is a rolling (up to 3 year) roadmap and is subject to change without notice
HP Vertica
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.9 SEIZE THE DATA. 2015
Structured data in HCatalog can be accessed via the Hcatalog Connector
SELECT
keyword,
EXTRACT(month from created_at),
AVG(score)
FROM hive.tweet_keywords
WHERE created_at >=
now()- ‘3 months’::interval
GROUP BY 1,2
ORDER BY 2;
HC
ata
log
HDFS
json csv
HDFS
txt orc
This is a rolling (up to 3 year) roadmap and is subject to change without notice
HP Vertica
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.10 SEIZE THE DATA. 2015
HDFS can be used as a storage tier in Vertica Enterprise
CREATE LOCATION 'webhdfs://server/user/asmith/data'
ALL NODES
SHARED
USAGE 'data'
LABEL 'archive';
SELECT set_object_storage_policy('finances',
'archive', 1900, 2014);
HC
ata
log
HDFS
json csv
HDFS
txt orc
HDFS
ROS ROS
This is a rolling (up to 3 year) roadmap and is subject to change without notice
HP Vertica
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.11 SEIZE THE DATA. 2015
What’s new in Excavator
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.12 SEIZE THE DATA. 2015
This is a rolling (up to 3 year) roadmap and is subject to change without notice
Excavator powers VSQLH to extend across your organization
Hive Pig
MapReduceHB
ase
HDFS
Billing
Clickstream
Telemetry
Finance
Marketing
Sales
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.13 SEIZE THE DATA. 2015
This is a rolling (up to 3 year) roadmap and is subject to change without notice
Hadoop data is accessible across department boundaries
Access secured HDFS clusters
• HDFS connector (flat files on HDFS)
• HCatalog connector (Hive data on HDFS)
• HDFS storage (Vertica data on HDFS) HC
ata
log
HDFS
json csv
HDFS
txt orc
HDFS
ROS ROS
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.14 SEIZE THE DATA. 2015
This is a rolling (up to 3 year) roadmap and is subject to change without notice
Vertica can access Kerberized Hadoop clusters
Support for Kerberos
• Session- and process-wide ticket reuse
• Multiple concurrent queries
• Transparent on a Kerberized Vertica cluster
HDFS
Kerberos serverClient
ID
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.15 SEIZE THE DATA. 2015
Kerberos setup in Vertica takes a few quick steps
ALTER DATABASE mydb SET KerberosKeytabFile = '/etc/krb5.keytab';
ALTER DATABASE mydb SET KerberosServiceName = 'vertica';
ALTER DATABASE mydb SET KerberosRealm = 'example.com';
CREATE AUTHENTICATION 'gss_auth' METHOD 'gss' HOST '0.0.0.0/0';
GRANT AUTHENTICATION 'gss_auth' TO Public;
This is a rolling (up to 3 year) roadmap and is subject to change without notice
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.16 SEIZE THE DATA. 2015
This is a rolling (up to 3 year) roadmap and is subject to change without notice
Work with native Hadoop data directly
Vertica supports ORC
• ORC – Optimized Row Columnar format
• Introduced in Hive 0.11
• Supported by all major Hadoop distributions (Apache, HDP, CDH, MapR)
• Advantages− open-source (third-party APIs available)− column-major− efficient (metadata, encoding, compression)− forward- and backward-compatible
• Similar to RCFile, Parquet− support for these and other formats can be added in the future
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.17 SEIZE THE DATA. 2015
This is a rolling (up to 3 year) roadmap and is subject to change without notice
Benefit from ORC similarly to Vertica’s native format
ORC format
• Tables stored in column-major blocks of rows (stripes)
• Metadata stored in file footer, stripe footer and index
• Full data type support− primitive SQL types− complex Hive types
• Encoding (RLE, delta, dictionary)
• Compression (zlib, snappy)
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.18 SEIZE THE DATA. 2015
This is a rolling (up to 3 year) roadmap and is subject to change without notice
Read ORC conveniently and fast
Built-in ORC reader
• No UDL setup needed – ORC is natively supported by Vertica
• Create and query ORC tables− Stream as external tables
CREATE EXTERNAL TABLE t(…) AS COPY FROM “…” ORC;
− Load into Vertica tablesCREATE TABLE t(…);
COPY t FROM “…” ORC;
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.19 SEIZE THE DATA. 2015
This is a rolling (up to 3 year) roadmap and is subject to change without notice
Read ORC conveniently and fast
Access data on HDFS and local FS
• Read from HDFSCOPY t FROM “webhdfs://hostname:port/data/sales/orders.orc” ORC;
• Read from local FSCOPY t FROM “/data/sales/orders.orc” ORC;
• No need for serialization/deserialization
SerDe
orc
Before
orc
Excavator
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.20 SEIZE THE DATA. 2015
Use Vertica as a stable SQLH engine
Stability of a mature RDBMS
• ANSI SQL support
• Resource management
• Workload distribution
Vertica Impala (CDH 5.3)
Supported unmodified
TPC-DS* queries
* Transaction Processing Performance Council Decision Support benchmark (99 business analytics queries)
94
56
This is a rolling (up to 3 year) roadmap and is subject to change without notice
5
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.21 SEIZE THE DATA. 2015
Achieve competitive performance on Hadoop
Performance optimizations
• Predicate pushdown
• Column pruning
• Co-located data processing
• Sideways Information Passing
3 TB (scale factor 3K)9-node cluster (32x Intel Xeon, 256M RAM)CDH 5.3
This is a rolling (up to 3 year) roadmap and is subject to change without notice
0x
1x
2x
3x
4x
5x
6x
7x
8x
Rel
ativ
e sp
eed
-up
(lar
ger
is b
ette
r)
Queries (slowest to fastest)
TPC-DS performance vs Impala
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.22 SEIZE THE DATA. 2015
Up to 7x
faster
ComparableUp to 2x
slower
43 queries more!
0x
1x
2x
3x
4x
5x
6x
7x
8x
Rel
ativ
e sp
eed
-up
Queries
TPC-DS performance vs Impala
VSQLH is reliable and performantThis is a rolling (up to 3 year) roadmap and is subject to change without notice
• Work in progress: additional ~1.5x speed-up
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.23 SEIZE THE DATA. 2015
Contribute to VSQLH
Open-source VSQLH components
• ORC C++ API
− Developed in collaboration with Hortonworks
− Became a top-level Apache project in June 2015
− Check out orc.apache.org
• HDFS C++ API
− Developed in collaboration with Hortonworks
− Integrated into Apache Hadoop in July 2015
• APIs for other file formats and filesystems can be integrated into Vertica
This is a rolling (up to 3 year) roadmap and is subject to change without notice
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.24 SEIZE THE DATA. 2015
Where will we go from here?
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.25 SEIZE THE DATA. 2015
This is a rolling (up to 3 year) roadmap and is subject to change without notice
VSQLH will be even faster
REST
HDFS HDFSorc
orcorc
orcorc
orc
orcorc
orc
EE EE EE
Optimize
Plan
Native access to HDFS Hcatalog Optimizer
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.26 SEIZE THE DATA. 2015
This is a rolling (up to 3 year) roadmap and is subject to change without notice
Q & A
HC
ata
log
HDFS
json csv
HDFS
txt orc
HDFS
ROS ROS0x
1x
2x
3x
4x
5x
6x
7x
8x
Rel
ativ
e sp
eed
-up
Queries
TPC-DS performance vs Impala
SEIZE THE DATA. 2015