Druid at Hadoop Ecosystem


Transcript of Druid at Hadoop Ecosystem

Page 1: Druid at Hadoop Ecosystem

Druid @Hadoop Ecosystem
Slim Bouguerra, Nishant Bangarwa, Jesús Camacho Rodríguez, Ashutosh Chauhan, Gunther Hagleitner, Julian Hyde, Carter Shanklin

Druid meetup, 21/02/2017

Page 2: Druid at Hadoop Ecosystem


1. Security enhancement

2. Deployment and management

3. SQL interaction

Page 3: Druid at Hadoop Ecosystem


Security: Integration with Kerberos/SPNEGO

Page 4: Druid at Hadoop Ecosystem


Kerberos/SPNEGO integration (Druid 0.10)

Secures all endpoints, except specific ones that can be excluded if needed. UI access to the coordinator and overlord is protected as well (browser configuration needed).

[Diagram: SPNEGO authentication flow between the user/browser, the KDC server, and Druid:
1. kinit user (the user authenticates against the KDC)
2. Token (the KDC issues a Kerberos ticket)
3. Negotiate using token (the client negotiates with Druid via SPNEGO)
4. Valid cookie (Druid returns an authenticated session cookie)]

Page 5: Druid at Hadoop Ecosystem


Kerberos/SPNEGO integration (Druid 0.10)

Secures all endpoints, except specific ones that can be excluded if needed. UI access to the coordinator and overlord is protected as well (browser configuration needed).

druid.hadoop.security.spnego.keytab=keytab_dir/spnego.service.keytab
This is the SPNEGO service keytab that is used for authentication.

druid.hadoop.security.spnego.principal=HTTP/_HOST@realm
This is the SPNEGO service principal that is used for authentication.

curl --negotiate -u:anyUser -b ~/cookies.txt -c ~/cookies.txt -X POST -H'Content-Type: application/json' http://_endpoint
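(The -b/-c flags point curl at a cookie jar, so the cookie issued in step 4 of the flow above is stored and reused; subsequent requests can then skip the SPNEGO negotiation.)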

Page 6: Druid at Hadoop Ecosystem


Security next: Integrate with Apache Ranger / Apache Knox

– Leveraging SSO via Apache Knox.
– Data source level user/group-based authorization.
– Row/column level user/group-based authorization.

Page 7: Druid at Hadoop Ecosystem


Deployment and management: Apache Ambari integration

Page 8: Druid at Hadoop Ecosystem


Simple Druid Management with Ambari

UI is the source of truth (what you see is what you get!).

Page 9: Druid at Hadoop Ecosystem


Simple Druid Management with Ambari

Works with Hadoop/HDFS, ZooKeeper, Superset, etc.

Page 10: Druid at Hadoop Ecosystem


Simple Druid Management with Ambari

Version management

Page 11: Druid at Hadoop Ecosystem


Deployment and management via Ambari/HDP

– UI is the source of truth (what you see is what you get!).
– Works with Hadoop/HDFS out of the box.
– Installs and configures the Superset UI (ex-Caravel, ex-Panoramix).
– Integrates with Kerberos (Hadoop and HDFS interaction / intra-Druid security).
– Supports rolling deployments.
– Monitoring via a Grafana dashboard (backed by HBase).

Page 12: Druid at Hadoop Ecosystem


SQL interface: Hive integration

Page 13: Druid at Hadoop Ecosystem


Benefits to both Druid and Apache Hive

Efficient execution of OLAP queries in Hive to power BI tools.

Interaction with realtime data.

Create/Drop data source using SQL syntax.

Ability to execute complex SQL operations, such as joins and window functions, out of the box on Druid data combined with other sources.

[Diagram: Hive side / Druid side]

Page 14: Druid at Hadoop Ecosystem


Data source creation

Data already existing in Druid:
– All you need is to point Hive to the broker and specify the data source name.

Data outside of Druid:
– Data already existing in Hive.
– Data stored in a distributed filesystem like HDFS or S3, in a format that can be read by Hive, e.g. TSV, CSV, ORC, Parquet.
– May need to perform some pre-processing over various data sources before feeding them to Druid.

Create Table statement

Page 15: Druid at Hadoop Ecosystem


Druid data sources in Hive

Point Hive to the broker:
– SET hive.druid.broker.address.default=druid.broker.hostname:8082;

Simple CREATE EXTERNAL TABLE statement:

CREATE EXTERNAL TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");

(druid_table_1: Hive table name; DruidStorageHandler: Hive storage handler class name; wikiticker: Druid data source name)

⇢ Broker node endpoint specified as a Hive configuration parameter
⇢ Automatic Druid data schema discovery: segment metadata query

Registering Druid data sources
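For illustration (not on the slide), a minimal sketch of querying the registered table, assuming page and c_added are among the columns discovered from the wikiticker schema:

-- Sketch: once registered, the table can be queried like any Hive table.
SELECT page, SUM(c_added) AS total_added
FROM druid_table_1
GROUP BY page
ORDER BY total_added DESC
LIMIT 10;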

Page 16: Druid at Hadoop Ecosystem


Druid data sources in Hive

Point Hive to the Druid metadata storage and deep storage path:
– SET hive.druid.metadata.password=diurd;
– SET hive.druid.metadata.username=druid;
– SET hive.druid.metadata.uri=jdbc:mysql://host/druid_db;
– SET druid.storage.storageDirectory=s3a://druid-cloud-bucket/;

Use Create Table As Select (CTAS) statement:

CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "HOUR")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;

(druid_table_1: Hive table name; DruidStorageHandler: Hive storage handler class name; wikiticker: Druid data source name)

Creating Druid data sources

Page 17: Druid at Hadoop Ecosystem


Druid data sources in Hive

Use Create Table As Select (CTAS) statement

CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "HOUR")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;

⇢ Inference of Druid column types (timestamp, dimensions, metrics) depends on the Hive column types

Creating Druid data sources

Column mapping in the SELECT above: __time is the timestamp; page and user are dimensions; c_added and c_removed are metrics.

Credit [email protected]
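Since the mapping is driven by Hive column types, explicit casts can steer it. A hypothetical sketch (ts, delta, and src_raw are made-up names for illustration):

CREATE TABLE druid_table_2
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker_v2")
AS
SELECT CAST(ts AS TIMESTAMP) AS `__time`,   -- becomes the Druid timestamp column
       CAST(page AS STRING) AS page,        -- STRING columns become dimensions
       CAST(delta AS BIGINT) AS c_added     -- numeric columns become metrics
FROM src_raw;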

Page 18: Druid at Hadoop Ecosystem


Druid data sources in Hive

__time page user c_added c_removed

2011-01-01T01:05:00Z Justin Boxer 1800 25

2011-01-02T19:00:00Z Justin Reach 2912 42

2011-01-01T11:00:00Z Ke$ha Xeno 1953 17

2011-01-02T13:00:00Z Ke$ha Helz 3194 170

2011-01-02T18:00:00Z Miley Ashu 2232 34

CTAS query results

Original CTAS physical plan: Table Scan → Select → File Sink

Credit [email protected]

Page 19: Druid at Hadoop Ecosystem


Druid data sources in Hive

Creating Druid data sources

Data needs to be partitioned by the segment time granularity ("druid.segment.granularity" = "HOUR").

Original CTAS physical plan: Table Scan → Select → File Sink

__time page user c_added c_removed __time_granularity

2011-01-01T01:05:00Z Justin Boxer 1800 25 2011-01-01T00:00:00Z

2011-01-02T19:00:00Z Justin Reach 2912 42 2011-01-02T00:00:00Z

2011-01-01T11:00:00Z Ke$ha Xeno 1953 17 2011-01-01T00:00:00Z

2011-01-02T13:00:00Z Ke$ha Helz 3194 170 2011-01-02T00:00:00Z

2011-01-02T18:00:00Z Miley Ashu 2232 34 2011-01-02T00:00:00Z

CTAS query results

Rewritten CTAS physical plan: Table Scan → Select (truncate timestamp to day granularity) → Reduce → File Sink (Druid output format)
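For illustration (not on the slide), the truncation step can be sketched in HiveQL, assuming the FLOOR ... TO granularity syntax added alongside the Druid integration:

-- Sketch: derive the __time_granularity key by flooring each event
-- timestamp to the segment granularity (DAY in this example).
SELECT `__time`,
       FLOOR(`__time` TO DAY) AS `__time_granularity`
FROM src;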

Page 20: Druid at Hadoop Ecosystem


Druid data sources in Hive

Creating Druid data sources

Rewritten CTAS physical plan: the File Sink operator uses the Druid output format.
– Creates segment files and saves segment descriptor metadata to HDFS.
– After the reducers complete successfully, all descriptors are committed to the metadata storage atomically.
– Waits for handoff if a coordinator is detected.

[Diagram: Table Scan → Select → Reduce → File Sink (Druid output format). The CTAS query results are split by day: the 2011-01-01 rows (Justin/Boxer, Ke$ha/Xeno) form Segment 2011-01-01, and the 2011-01-02 rows (Justin/Reach, Ke$ha/Helz, Miley/Ashu) form Segment 2011-01-02.]

Page 21: Druid at Hadoop Ecosystem


Druid data sources in Hive

Use INSERT OVERWRITE ... SELECT statement:
– Can only append or overwrite.
– Need to keep the same schema.

Update data sources

INSERT OVERWRITE TABLE druid_table_1
SELECT __time, page, user, c_added, c_removed
FROM src;
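An append, by contrast, might look like the following sketch (hypothetical; INSERT INTO is assumed as the append path, and src_new stands in for any table with the same schema):

INSERT INTO TABLE druid_table_1
SELECT __time, page, user, c_added, c_removed
FROM src_new;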

Use DROP TABLE to delete the metadata from Hive and the data source from Druid:

DROP TABLE druid_table_1 [PURGE];

Page 22: Druid at Hadoop Ecosystem


Querying Druid data sources

Automatic rewriting when a query is expressed over a Druid table:
– Powered by Apache Calcite.
– Main challenge: identify patterns in the logical plan corresponding to different kinds of Druid queries (Timeseries, GroupBy, Select).

Translate the (sub)plan of operators into a valid Druid JSON query:
– The Druid query is encapsulated within the Hive TableScan operator.

Hive TableScan uses the Druid input format:
– Submits the query to Druid and generates records out of the query results.
– Interacts with the Druid broker node, or with historicals in parallel.

It might not be possible to push all computation to Druid:
– Our contract is that the query should always be executed.
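As an illustration (not on the slide), a query of the following shape, aggregating over a floored timestamp with no other grouping keys, is the kind of plan that can be recognized and rewritten into a Druid Timeseries query:

-- Sketch: grouping only by a time-granularity floor typically maps to
-- a Druid Timeseries query after the Calcite rewrite.
SELECT FLOOR(`__time` TO HOUR) AS `hour_bucket`,
       SUM(c_added) AS total_added
FROM druid_table_1
GROUP BY FLOOR(`__time` TO HOUR);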

Page 23: Druid at Hadoop Ecosystem


Druid input format extends InputFormat<NullWritable, DruidWritable>.
Submits the query to Druid and generates records out of the query results.

Current version:
– Timeseries, TopN, and GroupBy queries are not partitioned.
– Select queries are partitioned along the time dimension column, assuming a uniform distribution.

Ongoing work for the Select query:
– Bypass the broker: query Druid realtime and historical nodes directly.

Page 24: Druid at Hadoop Ecosystem


Next

– Push more time filter predicates and/or computation down the chain.
– Make use of long/float columns.
– Complex column types (sketches, HLL, etc.).
– Streaming version of the Select query.
– Interact with the coordinator for data source creation.
– Time semantics (time zone handling).
– Null semantics.

Hive integration

Page 25: Druid at Hadoop Ecosystem


Thank You

@ApacheHive | @ApacheCalcite | @druidio | @ApacheAmbari

http://cwiki.apache.org/confluence/display/Hive/Druid+Integration
http://calcite.apache.org/docs/druid_adapter.html
https://issues.apache.org/jira/browse/AMBARI-17981

Page 26: Druid at Hadoop Ecosystem


Demo: