Druid at Hadoop Ecosystem


Transcript of Druid at Hadoop Ecosystem

Page 1: Druid at Hadoop Ecosystem

Druid @Hadoop Ecosystem
Slim Bouguerra, Nishant Bangarwa, Jesús Camacho Rodríguez, Ashutosh Chauhan, Gunther Hagleitner, Julian Hyde, Carter Shanklin

Druid meetup, 21/02/2017

Page 2: Druid at Hadoop Ecosystem


1. Security enhancement

2. Deployment and management

3. SQL interaction

Page 3: Druid at Hadoop Ecosystem


Security: Integration with Kerberos/SPNEGO

Page 4: Druid at Hadoop Ecosystem


Kerberos/SPNEGO integration (Druid 0.10)

Secures all endpoints, except specific ones that can be excluded if needed. UI access to the coordinator and overlord is protected as well (browser configuration needed).

[Diagram: SPNEGO authentication flow between the user/browser, the KDC server, and Druid:
1. kinit user (the user authenticates against the KDC)
2. Token (the KDC issues a Kerberos ticket)
3. Negotiate using token (the client negotiates with Druid via SPNEGO)
4. Valid cookie (Druid returns an authenticated session cookie)]

Page 5: Druid at Hadoop Ecosystem


Kerberos/SPNEGO integration (Druid 0.10)

Secures all endpoints, except specific ones that can be excluded if needed. UI access to the coordinator and overlord is protected as well (browser configuration needed).

druid.hadoop.security.spnego.keytab=keytab_dir/spnego.service.keytab
This is the SPNEGO service keytab that is used for authentication.

druid.hadoop.security.spnego.principal=HTTP/_HOST@realm
This is the SPNEGO service principal that is used for authentication.

curl --negotiate -u:anyUser -b ~/cookies.txt -c ~/cookies.txt -X POST -H'Content-Type: application/json' http://_endpoint
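(The -b/-c flags point curl at a cookie jar, so the cookie issued in step 4 of the flow above is stored and reused; subsequent requests can then skip the SPNEGO negotiation.)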

Page 6: Druid at Hadoop Ecosystem


Security next: Integrate with Apache Ranger / Apache Knox

– Leveraging SSO via Apache Knox.
– Data source level user/group-based authorization.
– Row/column level user/group-based authorization.

Page 7: Druid at Hadoop Ecosystem


Deployment and management: Apache Ambari integration

Page 8: Druid at Hadoop Ecosystem


Simple Druid Management with Ambari

UI is the source of truth (what you see is what you get!).

Page 9: Druid at Hadoop Ecosystem


Simple Druid Management with Ambari

Works with Hadoop/HDFS, ZooKeeper, Superset, etc.

Page 10: Druid at Hadoop Ecosystem


Simple Druid Management with Ambari

Version management

Page 11: Druid at Hadoop Ecosystem


Deployment and management via Ambari/HDP

– UI is the source of truth (what you see is what you get!).
– Works with Hadoop/HDFS out of the box.
– Installs and configures the Superset UI (ex-Caravel, ex-Panoramix).
– Integrates with Kerberos (Hadoop and HDFS interaction / intra-Druid security).
– Supports rolling deployments.
– Monitoring via a Grafana dashboard (backed by HBase).

Page 12: Druid at Hadoop Ecosystem


SQL interface: Hive integration

Page 13: Druid at Hadoop Ecosystem


Benefits to both Druid and Apache Hive

Efficient execution of OLAP queries in Hive to power BI tools.

Interaction with realtime data.

Create/Drop data source using SQL syntax.

Ability to execute complex SQL operations, such as joins and window functions, out of the box on Druid data combined with other sources.

[Diagram: Hive side / Druid side]

Page 14: Druid at Hadoop Ecosystem


Data source creation

Data already existing in Druid:
– All you need is to point Hive to the broker and specify the data source name.

Data outside of Druid:
– Data already existing in Hive.
– Data stored in a distributed filesystem like HDFS or S3, in a format that can be read by Hive, e.g. TSV, CSV, ORC, Parquet.
– May need to perform some pre-processing over various data sources before feeding them to Druid.

Create Table statement

Page 15: Druid at Hadoop Ecosystem


Druid data sources in Hive

Point Hive to the broker:
– SET hive.druid.broker.address.default=druid.broker.hostname:8082;

Simple CREATE EXTERNAL TABLE statement:

CREATE EXTERNAL TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");

(druid_table_1: Hive table name; DruidStorageHandler: Hive storage handler class name; wikiticker: Druid data source name)

⇢ Broker node endpoint specified as a Hive configuration parameter
⇢ Automatic Druid data schema discovery: segment metadata query

Registering Druid data sources
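For illustration (not on the slide), a minimal sketch of querying the registered table, assuming page and c_added are among the columns discovered from the wikiticker schema:

-- Sketch: once registered, the table can be queried like any Hive table.
SELECT page, SUM(c_added) AS total_added
FROM druid_table_1
GROUP BY page
ORDER BY total_added DESC
LIMIT 10;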

Page 16: Druid at Hadoop Ecosystem


Druid data sources in Hive

Point Hive to the Druid metadata storage and deep storage path:
– SET hive.druid.metadata.password=diurd;
– SET hive.druid.metadata.username=druid;
– SET hive.druid.metadata.uri=jdbc:mysql://host/druid_db;
– SET druid.storage.storageDirectory=s3a://druid-cloud-bucket/;

Use Create Table As Select (CTAS) statement:

CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "HOUR")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;

(druid_table_1: Hive table name; DruidStorageHandler: Hive storage handler class name; wikiticker: Druid data source name)

Creating Druid data sources

Page 17: Druid at Hadoop Ecosystem


Druid data sources in Hive

Use Create Table As Select (CTAS) statement

CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "HOUR")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;

⇢ Inference of Druid column types (timestamp, dimensions, metrics) depends on the Hive column types

Creating Druid data sources

Column mapping in the SELECT above: __time is the timestamp; page and user are dimensions; c_added and c_removed are metrics.

Credit [email protected]
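Since the mapping is driven by Hive column types, explicit casts can steer it. A hypothetical sketch (ts, delta, and src_raw are made-up names for illustration):

CREATE TABLE druid_table_2
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker_v2")
AS
SELECT CAST(ts AS TIMESTAMP) AS `__time`,   -- becomes the Druid timestamp column
       CAST(page AS STRING) AS page,        -- STRING columns become dimensions
       CAST(delta AS BIGINT) AS c_added     -- numeric columns become metrics
FROM src_raw;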

Page 18: Druid at Hadoop Ecosystem


Druid data sources in Hive

__time page user c_added c_removed

2011-01-01T01:05:00Z Justin Boxer 1800 25

2011-01-02T19:00:00Z Justin Reach 2912 42

2011-01-01T11:00:00Z Ke$ha Xeno 1953 17

2011-01-02T13:00:00Z Ke$ha Helz 3194 170

2011-01-02T18:00:00Z Miley Ashu 2232 34

CTAS query results

Original CTAS physical plan: Table Scan → Select → File Sink

Credit [email protected]

Page 19: Druid at Hadoop Ecosystem


Druid data sources in Hive

Creating Druid data sources

Data needs to be partitioned by the segment time granularity ("druid.segment.granularity" = "HOUR").

Original CTAS physical plan: Table Scan → Select → File Sink

__time page user c_added c_removed __time_granularity

2011-01-01T01:05:00Z Justin Boxer 1800 25 2011-01-01T00:00:00Z

2011-01-02T19:00:00Z Justin Reach 2912 42 2011-01-02T00:00:00Z

2011-01-01T11:00:00Z Ke$ha Xeno 1953 17 2011-01-01T00:00:00Z

2011-01-02T13:00:00Z Ke$ha Helz 3194 170 2011-01-02T00:00:00Z

2011-01-02T18:00:00Z Miley Ashu 2232 34 2011-01-02T00:00:00Z

CTAS query results

Rewritten CTAS physical plan: Table Scan → Select (truncate timestamp to day granularity) → Reduce → File Sink (Druid output format)
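For illustration (not on the slide), the truncation step can be sketched in HiveQL, assuming the FLOOR ... TO granularity syntax added alongside the Druid integration:

-- Sketch: derive the __time_granularity key by flooring each event
-- timestamp to the segment granularity (DAY in this example).
SELECT `__time`,
       FLOOR(`__time` TO DAY) AS `__time_granularity`
FROM src;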

Page 20: Druid at Hadoop Ecosystem


Druid data sources in Hive

Creating Druid data sources

Rewritten CTAS physical plan: the File Sink operator uses the Druid output format.
– Creates segment files and saves segment descriptor metadata to HDFS.
– After the reducers complete successfully, all descriptors are committed to the metadata storage atomically.
– Waits for handoff if a coordinator is detected.

[Diagram: Table Scan → Select → Reduce → File Sink (Druid output format). The CTAS query results are split by day: the 2011-01-01 rows (Justin/Boxer, Ke$ha/Xeno) form Segment 2011-01-01, and the 2011-01-02 rows (Justin/Reach, Ke$ha/Helz, Miley/Ashu) form Segment 2011-01-02.]

Page 21: Druid at Hadoop Ecosystem


Druid data sources in Hive

Use INSERT OVERWRITE ... SELECT statement:
– Can only append or overwrite.
– Need to keep the same schema.

Update data sources

INSERT OVERWRITE TABLE druid_table_1
SELECT __time, page, user, c_added, c_removed
FROM src;
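An append, by contrast, might look like the following sketch (hypothetical; INSERT INTO is assumed as the append path, and src_new stands in for any table with the same schema):

INSERT INTO TABLE druid_table_1
SELECT __time, page, user, c_added, c_removed
FROM src_new;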

Use DROP TABLE to delete the metadata from Hive and the data source from Druid:

DROP TABLE druid_table_1 [PURGE];

Page 22: Druid at Hadoop Ecosystem


Querying Druid data sources

Automatic rewriting when a query is expressed over a Druid table:
– Powered by Apache Calcite.
– Main challenge: identify patterns in the logical plan corresponding to different kinds of Druid queries (Timeseries, GroupBy, Select).

Translate the (sub)plan of operators into a valid Druid JSON query:
– The Druid query is encapsulated within the Hive TableScan operator.

Hive TableScan uses the Druid input format:
– Submits the query to Druid and generates records out of the query results.
– Interacts with the Druid broker node, or with historicals in parallel.

It might not be possible to push all computation to Druid:
– Our contract is that the query should always be executed.
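As an illustration (not on the slide), a query of the following shape, aggregating over a floored timestamp with no other grouping keys, is the kind of plan that can be recognized and rewritten into a Druid Timeseries query:

-- Sketch: grouping only by a time-granularity floor typically maps to
-- a Druid Timeseries query after the Calcite rewrite.
SELECT FLOOR(`__time` TO HOUR) AS `hour_bucket`,
       SUM(c_added) AS total_added
FROM druid_table_1
GROUP BY FLOOR(`__time` TO HOUR);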

Page 23: Druid at Hadoop Ecosystem


Druid input format extends InputFormat<NullWritable, DruidWritable>.
Submits the query to Druid and generates records out of the query results.

Current version:
– Timeseries, TopN, and GroupBy queries are not partitioned.
– Select queries are partitioned along the time dimension column, assuming a uniform distribution.

Ongoing work for the Select query:
– Bypass the broker: query Druid realtime and historical nodes directly.

Page 24: Druid at Hadoop Ecosystem


Next

– Push more time filter predicates and/or computation down the chain.
– Make use of long/float columns.
– Complex column types (sketches, HLL, etc.).
– Streaming version of the Select query.
– Interact with the coordinator for data source creation.
– Time semantics (time zone handling).
– Null semantics.

Hive integration

Page 25: Druid at Hadoop Ecosystem


Thank You

@ApacheHive | @ApacheCalcite | @druidio | @ApacheAmbari

http://cwiki.apache.org/confluence/display/Hive/Druid+Integration
http://calcite.apache.org/docs/druid_adapter.html
https://issues.apache.org/jira/browse/AMBARI-17981

Page 26: Druid at Hadoop Ecosystem


Demo: