Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Phoenix + Apache HBase An Enterprise Grade Data Warehouse Ankit Singhal , Rajeshbabu , Josh Elser June, 30 2016

Transcript of Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Page 1: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Apache Phoenix + Apache HBase
An Enterprise Grade Data Warehouse
Ankit Singhal, Rajeshbabu, Josh Elser
June 30, 2016

Page 2: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

About us!!

Ankit Singhal
– Committer and member of Apache Phoenix PMC
– MTS at Hortonworks

RajeshBabu
– Committer and member of Apache Phoenix PMC
– Committer in Apache HBase
– MTS at Hortonworks

Josh Elser
– Committer in Apache Phoenix
– Committer and member of Apache Calcite PMC
– MTS at Hortonworks

Page 3: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Agenda

Phoenix & HBase as an Enterprise Data Warehouse

Use Cases

Optimizations

Phoenix Query Server

Q&A

Page 4: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Data Warehouse

EDW helps organize and aggregate analytical data from various functional domains and serves as a critical repository for organizations’ operations.

[Diagram labels: Files, IoT data, OLTP, ETL, Staging, Data Warehouse, Mart, Visualization or BI]

Page 5: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Phoenix Offerings and Interoperability

[Diagram sections: ETL, Data Warehouse, Visualization & BI]

Page 6: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

HBase & Phoenix

HBase: a distributed NoSQL store
Phoenix: provides OLTP and analytics over HBase

[Diagram labels: Application, Phoenix client, HBase client, ZooKeeper, three RegionServers hosting table regions (Table,,123; Table,a,123; Table,b,123; Table,c,123) with Phoenix coprocessors ("Phx coproc"), HDFS]

Page 7: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Open Source Data Warehouse

[Chart: hardware cost (specialized H/W vs. commodity H/W) against software cost (licensing cost vs. no cost), positioning SMP, MPP, Open Source MPP, and HBase + Phoenix]

Page 8: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Phoenix & HBase as a Data Warehouse

Architecture
– Run on commodity H/W
– True MPP
– O/S and H/W flexibility
– Support OLTP and ROLAP

Page 9: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Phoenix & HBase as a Data Warehouse

Scalability
– Linear scalability for storage
– Linear scalability for memory
– Open to third-party storage

Page 10: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Phoenix & HBase as a Data Warehouse

Reliability
– Highly available
– Replication for disaster recovery
– Fully ACID for data integrity

Page 11: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Phoenix & HBase as a Data Warehouse

Manageability
– Performance tuning
– Data modeling & schema evolution
– Data pruning
– Online expansion or upgrade
– Data backup and recovery

Page 12: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Agenda

Phoenix & HBase as an Enterprise Data Warehouse

Use cases

Page 13: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Who uses Phoenix!!

Page 14: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Analytics Use Case (Web Advertising Company)

Functional requirements
– Create a single source of truth
– Cross-dimensional queries on 50+ dimensions and 80+ metrics
– Support fast Top-N queries

Non-functional requirements
– Less than 3 seconds response time for slice and dice
– 250+ concurrent users
– 100k+ analytics queries/day
– Highly available
– Linear scalability

Page 15: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Data Warehouse Capacity

Data size (ETL input)
– 24 TB/day of raw data system wide
– 25 billion impressions

HBase input (cube)
– 6 billion rows of aggregated data (100 GB/day)

HBase cluster size
– 65 nodes of HBase
– 520 TB of disk
– 4.1 TB of memory
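The capacity figures above imply a few per-node ratios that are easier to see when worked out. The following is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 1000 GB) and an even spread of disk and memory across nodes, neither of which the slide states.

```python
# Figures quoted on the slide; decimal units (1 TB = 1000 GB) are an assumption.
NODES = 65
TOTAL_DISK_TB = 520
TOTAL_MEMORY_TB = 4.1
RAW_INPUT_TB_PER_DAY = 24
CUBE_INPUT_GB_PER_DAY = 100

# Even distribution across nodes is assumed for illustration only.
per_node_disk_tb = TOTAL_DISK_TB / NODES
per_node_memory_gb = TOTAL_MEMORY_TB * 1000 / NODES

# How much the ETL/aggregation step shrinks the daily raw input before it reaches HBase.
raw_to_cube_ratio = RAW_INPUT_TB_PER_DAY * 1000 / CUBE_INPUT_GB_PER_DAY

print(f"disk per node:         {per_node_disk_tb:.1f} TB")
print(f"memory per node:       {per_node_memory_gb:.1f} GB")
print(f"raw-to-cube reduction: {raw_to_cube_ratio:.0f}x")
```

Under these assumptions each node carries about 8 TB of disk and roughly 63 GB of memory, and aggregation reduces the daily raw volume by a factor of about 240 before it is loaded into HBase.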

Page 16: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Use Case Architecture

[Diagram. Section labels: Data Ingestion, Real-time, Batch Processing, Analytics. Component labels: AdServer, Click Tracking, Kafka Input, ETL / Filter / Aggregate, In-Memory Store, Apache Kafka, Camus, HDFS, ETL, Data Uploader, HBase, Views, Data API, Analytics UI]

Page 17: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Analytics Data Warehouse Architecture

– Cubes are stored in HBase
– Slice-and-dice queries are converted to SQL queries

[Diagram labels: ETL, Cube Generation, HDFS, Bulk Load, Backup and recovery, Data API, Analytics UI]

Page 18: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Time Series Use Case (Apache Ambari)

Ambari Metrics System (AMS)

Functional requirements
– Store all cluster metrics collected every second (10k to 100k metrics/second)
– Optimize storage/access for time series data

Non-functional requirements
– Near real-time response time
– Scalable
– Real-time ingestion

Page 19: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

AMS architecture

[Diagram labels: Hosts, Metric Monitors, Hadoop Sinks, Metric Collector, Phoenix, HBase, Ambari Server]

Page 20: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Agenda

Phoenix & HBase as an Enterprise Data Warehouse

Use Cases

Optimizations

Page 21: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Schema Design

Primary key design
– The most important factor in the overall performance of queries on the table
– The primary key should be composed of the columns most used in query predicates
– In most cases, the leading part of the primary key should help convert queries into point lookups or range scans in HBase
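Why the leading part of the key matters can be seen in a toy model. The sketch below is not Phoenix code: it stands in for HBase's sorted row keys with a sorted Python list of (metric, host) tuples, which are hypothetical column names, and counts how many rows each access pattern has to examine.

```python
from bisect import bisect_left

# Toy table: rows kept sorted by a composite primary key (metric, host),
# standing in for HBase's lexicographically sorted row keys.
rows = sorted(
    (metric, host)
    for metric in ("cpu", "disk", "mem", "net")
    for host in ("host1", "host2", "host3")
)

def range_scan_on_leading_column(metric):
    """Predicate on the leading key column: binary-search to one contiguous range."""
    lo = bisect_left(rows, (metric,))
    hi = bisect_left(rows, (metric + "\x00",))  # first key past this metric
    return rows[lo:hi], hi - lo                 # (matches, rows examined)

def full_scan_on_trailing_column(host):
    """Predicate only on a non-leading column: every row must be examined."""
    matches = [row for row in rows if row[1] == host]
    return matches, len(rows)

leading_matches, leading_examined = range_scan_on_leading_column("mem")
trailing_matches, trailing_examined = full_scan_on_trailing_column("host2")
print(leading_examined, trailing_examined)
```

In this model the predicate on the leading column touches only the rows it returns, while the predicate on the trailing column has to walk the whole table.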

Page 22: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Schema Design

Salting vs pre-split

Use salting to alleviate write hot-spotting

CREATE TABLE …(
) SALT_BUCKETS = N

– The number of buckets should be equal to the number of RegionServers

Otherwise, try to pre-split the table if you know the row key data set

CREATE TABLE …(
) SPLIT ON (…)
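The idea behind salting can be sketched in a few lines: a small hash-derived prefix is prepended to each row key so that monotonically increasing keys no longer land in the same region. This is a toy model only. The bucket count, the timestamp-style keys, and the use of CRC32 are assumptions for illustration; Phoenix computes its salt byte with its own hash function.

```python
import zlib
from collections import Counter

SALT_BUCKETS = 4  # hypothetical; the slide suggests matching the number of RegionServers

def salted_key(row_key: bytes, buckets: int = SALT_BUCKETS) -> bytes:
    """Prepend one salt byte derived from a hash of the row key.

    CRC32 is a stand-in hash chosen for this sketch, not Phoenix's own.
    """
    salt = zlib.crc32(row_key) % buckets
    return bytes([salt]) + row_key

# Sequential, timestamp-like keys: unsalted, these would all sort next to each other.
keys = [f"2016-06-30T00:00:{second:02d}".encode() for second in range(60)]

# Count how many keys fall into each salt bucket.
distribution = Counter(salted_key(key)[0] for key in keys)
for bucket in sorted(distribution):
    print(f"bucket {bucket}: {distribution[bucket]} keys")
```

The trade-off is on the read side: a range scan over the original key order now has to consult every bucket, which is one reason the slide offers pre-splitting as the alternative when the key distribution is known in advance.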

Page 23: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Schema Design

Table properties

Use block encoding and/or compression for better performance

CREATE TABLE …(
) DATA_BLOCK_ENCODING='FAST_DIFF', COMPRESSION='SNAPPY'

Use region replication for read high availability

CREATE TABLE …(
) "REGION_REPLICATION" = "2"

Page 24: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Schema Design

Table properties

Set UPDATE_CACHE_FREQUENCY to a larger value (in milliseconds) to avoid frequently contacting the server for metadata updates

CREATE TABLE …(
) UPDATE_CACHE_FREQUENCY = 300000

Page 25: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Schema Design

Divide columns into multiple column families if there are rarely accessed columns
– HBase reads only the files of the column families specified in the query, which reduces I/O

[Table layout: primary key columns pk1 and pk2; column families CF1 and CF2 spanning Col1 to Col7, with frequently accessed columns in one family and rarely accessed columns in the other]

Page 26: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Secondary Indexes

Global indexes
– Optimized for read-heavy use cases
CREATE INDEX idx ON table(…)

Local indexes
– Optimized for write-heavy and space-constrained use cases
CREATE LOCAL INDEX idx ON table(…)

Functional indexes
– Allow you to create indexes on arbitrary expressions
CREATE INDEX UPPER_NAME_INDEX ON EMP(UPPER(FIRSTNAME||' '||LASTNAME))

Page 27: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Secondary Indexes

Use covered indexes to efficiently scan over the index table instead of the primary table.

CREATE INDEX idx ON table(…) INCLUDE(…)

Pass an index hint to guide the query optimizer to select the right index for a query.

SELECT /*+ INDEX(<table> <index>) */ ..
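What "covered" buys can be illustrated with a toy model. The sketch below is not how Phoenix stores indexes: the table contents, column names, and helper functions are all hypothetical, and it only counts how often a query has to go back to the primary table when the index does or does not carry the selected columns.

```python
# Hypothetical primary table keyed by primary key.
table = {
    1: {"host": "host1", "metric": "cpu", "value": 0.7},
    2: {"host": "host2", "metric": "cpu", "value": 0.4},
    3: {"host": "host1", "metric": "mem", "value": 0.9},
}

def build_index(table, indexed, include=()):
    """Index entries sorted by (indexed value, pk), each carrying the INCLUDE-style columns."""
    entries = [
        (row[indexed], pk, {col: row[col] for col in include})
        for pk, row in table.items()
    ]
    return sorted(entries, key=lambda entry: entry[:2])

def query(index, table, value, columns):
    """Return the selected columns for rows whose indexed column equals `value`."""
    results, primary_lookups = [], 0
    for indexed_value, pk, covered in index:
        if indexed_value != value:
            continue
        if all(col in covered for col in columns):
            row = covered            # answered from the index alone
        else:
            row = table[pk]          # extra trip back to the primary table
            primary_lookups += 1
        results.append(tuple(row[col] for col in columns))
    return results, primary_lookups

plain_index = build_index(table, "host")
covered_index = build_index(table, "host", include=("value",))

plain_results, plain_lookups = query(plain_index, table, "host1", ["value"])
covered_results, covered_lookups = query(covered_index, table, "host1", ["value"])
print(plain_lookups, covered_lookups)
```

Both indexes return the same rows, but only the covered one avoids touching the primary table. In Phoenix itself the optimizer decides whether an index is used at all, which is where the index hint on the slide comes in.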

Page 28: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Row Timestamp Column

Maps the HBase native row timestamp to a Phoenix column

Leverages optimizations provided by HBase, such as setting the minimum and maximum time range for scans to entirely skip the store files that don’t fall in that time range

Perfect for time series use cases

Syntax

CREATE TABLE …(CREATED_DATE DATE NOT NULL
CONSTRAINT PK PRIMARY KEY(CREATED_DATE ROW_TIMESTAMP…
)
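The store-file skipping described above can be pictured with a small model. This is a sketch under stated assumptions rather than HBase code: the StoreFile class, its fields, and the sample timestamps are hypothetical, and the only point is that a file whose recorded time range does not overlap the scan's range is never opened.

```python
from dataclasses import dataclass

@dataclass
class StoreFile:
    """Hypothetical stand-in for a store file that records the time range of its cells."""
    name: str
    min_ts: int
    max_ts: int
    cells: list  # (timestamp, value) pairs

def scan(files, start_ts, end_ts):
    """Scan the half-open time range [start_ts, end_ts), skipping non-overlapping files."""
    opened, hits = [], []
    for store_file in files:
        if store_file.max_ts < start_ts or store_file.min_ts >= end_ts:
            continue  # the file's time range cannot contain a match, so skip it entirely
        opened.append(store_file.name)
        hits.extend(v for ts, v in store_file.cells if start_ts <= ts < end_ts)
    return opened, hits

files = [
    StoreFile("f1", 100, 199, [(100, "a"), (199, "b")]),
    StoreFile("f2", 200, 299, [(200, "c"), (250, "d")]),
    StoreFile("f3", 300, 399, [(300, "e")]),
]

opened, hits = scan(files, 200, 300)
print(opened, hits)
```

Only the middle file is opened for the range 200 to 300. For time series data, where files tend to cover disjoint time windows, this kind of pruning is what makes recent-window queries cheap.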

Page 29: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Use of Statistics

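[Diagram: two clients scanning a table with regions A, F, L, and R. In one panel the client reads one chunk per region (A, F, L, R); in the other the regions are divided into smaller chunks (A, C, F, I, L, O, R, U) that the client reads]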

Page 30: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Skip Scan

Phoenix supports skip scan to jump directly to matching keys when the query has key sets in the predicate.

SELECT * FROM METRIC_RECORD WHERE METRIC_NAME LIKE 'abc%' AND HOSTNAME IN ('host1', 'host2');

CLIENT 1-CHUNK PARALLEL 1-WAY SKIP SCAN ON 2 RANGES OVER METRIC_RECORD ['abc','host1'] - ['abd','host2']

[Diagram labels: Client, Region1, Region2, Region3, Region4, RS 1, RS 2, RS 3]

Skip scan
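The difference between a skip scan and a plain range scan with a filter can be sketched over a sorted list of keys. This is a toy model, not Phoenix's implementation: the key values are made up, and "examined" simply counts how many rows each strategy has to look at for the same predicate as the query above.

```python
from bisect import bisect_left

# Sorted (METRIC_NAME, HOSTNAME) keys with made-up values.
rows = sorted(
    (metric, host)
    for metric in ("abc1", "abc2", "abd1", "xyz")
    for host in ("host0", "host1", "host2", "host3", "host4")
)

def range_scan_then_filter(rows, prefix, hosts):
    """Read every row in the prefix range and filter on host afterwards."""
    matches, examined = [], 0
    for row in rows[bisect_left(rows, (prefix,)):]:
        if not row[0].startswith(prefix):
            break
        examined += 1
        if row[1] in hosts:
            matches.append(row)
    return matches, examined

def skip_scan(rows, prefix, hosts):
    """Seek directly to each (metric, host) combination, skipping the keys in between."""
    matches, examined = [], 0
    i = bisect_left(rows, (prefix,))
    while i < len(rows) and rows[i][0].startswith(prefix):
        metric = rows[i][0]
        for host in sorted(hosts):
            i = bisect_left(rows, (metric, host), i)      # seek forward to the next key of interest
            if i < len(rows):
                examined += 1
                if rows[i] == (metric, host):
                    matches.append(rows[i])
        i = bisect_left(rows, (metric + "\x00",), i)      # jump past the rest of this metric
    return matches, examined

filtered, filtered_examined = range_scan_then_filter(rows, "abc", {"host1", "host2"})
skipped, skipped_examined = skip_scan(rows, "abc", {"host1", "host2"})
print(filtered_examined, skipped_examined)
```

Both strategies return the same four rows, but the skip scan examines only the keys it seeks to, and the gap widens as the number of non-matching hosts per metric grows.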

Page 31: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Join optimizations

Hash join
– Outperforms other join algorithms when one of the relations is small, or when the records matching the predicate fit into memory

Sort-merge join
– Use the sort-merge join algorithm when the relations are very large

NO_STAR_JOIN hint
– For multiple inner-join queries, Phoenix applies a star-join optimization by default. Use this hint in the query if the overall size of all right-hand-side tables would exceed the memory size limit.

NO_CHILD_PARENT_OPTIMIZATION hint
– Prevents the usage of the child-parent-join optimization.
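The two join algorithms named above can be compared in a minimal, generic form. The sketch below is textbook hash join and sort-merge join over lists of dictionaries with hypothetical columns; it does not reflect how Phoenix distributes the work between client and RegionServers.

```python
def hash_join(small, large, key):
    """Build a hash table on the smaller relation, then stream the larger one past it."""
    built = {}
    for row in small:
        built.setdefault(row[key], []).append(row)
    return [{**s, **l} for l in large for s in built.get(l[key], [])]

def sort_merge_join(left, right, key):
    """Sort both relations on the join key, then merge them in a single pass."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with each equal left row.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            while i < len(left) and left[i][key] == lk:
                out.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return out

# Hypothetical relations: a small dimension table and a larger fact table.
hosts = [{"host_id": 1, "dc": "east"}, {"host_id": 2, "dc": "west"}]
metrics = [
    {"host_id": 2, "cpu": 0.4},
    {"host_id": 1, "cpu": 0.7},
    {"host_id": 1, "cpu": 0.9},
    {"host_id": 3, "cpu": 0.1},
]

def normalize(joined):
    return sorted(joined, key=lambda r: (r["host_id"], r["cpu"]))

hash_result = normalize(hash_join(hosts, metrics, "host_id"))
merge_result = normalize(sort_merge_join(hosts, metrics, "host_id"))
print(hash_result == merge_result, len(hash_result))
```

The hash join needs the whole smaller relation in memory, which is the constraint the slide points at. The sort-merge join avoids that at the cost of sorting both inputs.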

Page 32: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Optimize Writes

Upsert values
– Call it multiple times before commit to batch mutations
– Use a prepared statement when you run the query multiple times

Upsert select
– Configure phoenix.mutate.batchSize based on row size
– Set auto-commit to true to write scan results directly to HBase
– Set auto-commit to true when running upsert selects on the same table so that writes happen on the server
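The "upsert several times, then commit" pattern can be sketched against any Python DB API style connection. Everything here is an assumption made for illustration: the helper name, the batch size, the METRIC_VALUE column, and the RecordingConnection stand-in, which exists only so the sketch runs without a cluster. With a real driver the parameterized statement plays the role of the prepared statement.

```python
# Parameterized UPSERT; METRIC_VALUE is a hypothetical column.
UPSERT_SQL = (
    "UPSERT INTO METRIC_RECORD (METRIC_NAME, HOSTNAME, METRIC_VALUE) VALUES (?, ?, ?)"
)

def upsert_in_batches(conn, sql, rows, batch_size=1000):
    """Execute one parameterized UPSERT per row, committing once per batch."""
    cursor = conn.cursor()
    pending, commits = 0, 0
    for row in rows:
        cursor.execute(sql, row)
        pending += 1
        if pending == batch_size:
            conn.commit()
            commits += 1
            pending = 0
    if pending:  # flush the final, partially filled batch
        conn.commit()
        commits += 1
    return commits

class RecordingConnection:
    """Stand-in for a DB API connection; it records calls instead of talking to a server."""
    def __init__(self):
        self.executed = []
        self.commit_count = 0
    def cursor(self):
        return self
    def execute(self, sql, params):
        self.executed.append((sql, params))
    def commit(self):
        self.commit_count += 1

conn = RecordingConnection()
sample_rows = [("cpu", f"host{i}", i / 10) for i in range(5)]
commits = upsert_in_batches(conn, UPSERT_SQL, sample_rows, batch_size=2)
print(len(conn.executed), commits)
```

Five rows with a batch size of two result in three commits rather than five. The right batch size depends on row size, which is the same consideration the slide raises for phoenix.mutate.batchSize.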

Page 33: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Hints

Some important hints
– SERIAL SCAN, RANGE SCAN
– SERIAL SMALL SCAN

Page 34: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Additional References

For more optimizations, refer to these documents:
– http://phoenix.apache.org/tuning.html
– https://hbase.apache.org/book.html#performance

Page 35: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Agenda

Phoenix & HBase as an Enterprise Data Warehouse

Use Cases

Optimizations

Phoenix Query Server

Page 36: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Apache Phoenix Query Server

A standalone service that proxies user requests to HBase/Phoenix
– Optional

Reference client implementation via JDBC
– “Thick” versus “Thin”

First introduced in Apache Phoenix 4.4.0

Built on Apache Calcite’s Avatica
– “A framework for building database drivers”

Page 37: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Traditional Apache Phoenix RPC Model

[Diagram labels: Application, Phoenix client, HBase client, ZooKeeper, three RegionServers hosting table regions (Table,,123; Table,a,123; Table,b,123; Table,c,123) with Phoenix coprocessors ("Phx coproc"), HDFS]

Page 38: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Query Server Model

[Diagram labels: Application, Query Server, Phoenix client, HBase client, ZooKeeper, three RegionServers hosting table regions (Table,,123; Table,a,123; Table,b,123; Table,d,123) with Phoenix coprocessors ("Phx coproc"), HDFS]

Page 39: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Query Server Technology

HTTP server and wire API definition

Pluggable serialization
– Google Protocol Buffers

“Thin” JDBC driver (over HTTP)

Other goodies!
– Pluggable metrics system
– TCK (technology compatibility kit)
– SPNEGO for Kerberos authentication
– Horizontally scalable with load balancing

Page 40: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Query Server Clients

Client enablement

Go language database/sql/driver
– https://github.com/Boostport/avatica

.NET driver
– https://github.com/Azure/hdinsight-phoenix-sharp
– https://www.nuget.org/packages/Microsoft.Phoenix.Client/1.0.0-preview

ODBC
– Built by http://www.simba.com/, also available from Hortonworks

Python DB API v2.0 (not “battle tested”)
– https://bitbucket.org/lalinsky/python-phoenixdb

Page 41: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Agenda

Phoenix & HBase as an Enterprise Data Warehouse

Use Cases

Optimizations

Phoenix Query Server

Q&A

Page 42: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

We hope to see you all migrating to Phoenix & HBase, and we expect more questions on the user mailing lists.

Get involved in mailing lists:[email protected]@hbase.apache.org

You can reach us on:[email protected]@[email protected]

Phoenix & HBase

Page 43: Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse

Thank You