Druid at Hadoop Ecosystem
Slim Bouguerra, Nishant Bangarwa, Jesús Camacho Rodríguez, Ashutosh Chauhan, Gunther Hagleitner, Julian Hyde, Carter Shanklin
Druid meetup, 21/02/2017
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
1. Security enhancement
2. Deployment and management
3. SQL interaction
Security: Integration with Kerberos/SPNEGO
Kerberos/SPNEGO integration (Druid 0.10)
Secures all endpoints, except specific ones that can be excluded if needed. UI access to the coordinator and overlord is protected as well (browser configuration needed).
Authentication flow (User/Browser ↔ Druid ↔ KDC server):
1. kinit user
2. Token
3. Negotiate using token
4. Valid cookie
druid.hadoop.security.spnego.keytab = keytab_dir/spnego.service.keytab
This is the SPNEGO service keytab that is used for authentication.
druid.hadoop.security.spnego.principal = HTTP/_HOST@realm
This is the SPNEGO service principal that is used for authentication.
Example request:
curl --negotiate -u:anyUser -b ~/cookies.txt -c ~/cookies.txt -X POST -H 'Content-Type: application/json' http://_endpoint
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Security next: integration with Apache Ranger / Apache Knox
– Leveraging SSO via Apache Knox.
– Data-source-level user/group-based authorization.
– Row/column-level user/group-based authorization.
Deployment and management: Apache Ambari integration
Simple Druid management with Ambari
The UI is the source of truth (what you see is what you get!).
Simple Druid management with Ambari
Works with Hadoop/HDFS, ZooKeeper, Superset, etc.
Simple Druid management with Ambari
Version management
Deployment and management via Ambari/HDP
– The UI is the source of truth (what you see is what you get!).
– Works with Hadoop/HDFS out of the box.
– Installs and configures the Superset UI (ex-Caravel, ex-Panoramix).
– Integrates with Kerberos (Hadoop/HDFS interaction and intra-Druid security).
– Supports rolling deployments.
– Monitoring via a Grafana dashboard (backed by HBase).
SQL interface: Hive integration
Benefits both Druid and Apache Hive:
– Efficient execution of OLAP queries in Hive to power BI tools.
– Interaction with real-time data.
– Create/drop data sources using SQL syntax.
– Execution of complex SQL operations, such as joins and window functions, over Druid data and other sources out of the box.
Data source creation
Data already existing in Druid:
– All you need is to point Hive to the broker and specify the data source name.
Data outside of Druid:
– Data already existing in Hive.
– Data stored in a distributed filesystem like HDFS or S3, in a format that Hive can read, e.g. TSV, CSV, ORC, Parquet.
– Need to perform some pre-processing over various data sources before feeding them to Druid.
Use a CREATE TABLE statement.
Druid data sources in Hive: registering Druid data sources
Point Hive to the broker:
SET hive.druid.broker.address.default=druid.broker.hostname:8082;
Simple CREATE EXTERNAL TABLE statement:
CREATE EXTERNAL TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");
The statement names the Hive table, the Hive storage handler class, and the Druid data source.
⇢ Broker node endpoint specified as a Hive configuration parameter
⇢ Automatic Druid data schema discovery: segment metadata query
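The automatic schema discovery mentioned above is driven by Druid's segment metadata query. As a rough sketch, the JSON body Hive sends to the broker for a registered data source looks like the following (the `merge` and `analysisTypes` values here are illustrative, not necessarily what Hive sets):

```python
import json

def segment_metadata_query(datasource):
    """Build the JSON body of a Druid segmentMetadata query.

    Hive issues a query of this shape against the broker to discover
    the columns of an existing Druid data source. The field values
    below are illustrative.
    """
    return {
        "queryType": "segmentMetadata",
        "dataSource": datasource,
        "merge": True,  # merge per-segment results into one schema
        "analysisTypes": ["cardinality", "size"],
    }

body = json.dumps(segment_metadata_query("wikiticker"))
```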
Druid data sources in Hive: creating Druid data sources
Point Hive to the Druid metadata storage and deep storage path:
SET hive.druid.metadata.password=diurd;
SET hive.druid.metadata.username=druid;
SET hive.druid.metadata.uri=jdbc:mysql://host/druid_db;
SET druid.storage.storageDirectory=s3a://druid-cloud-bucket/;
Use a CREATE TABLE AS SELECT (CTAS) statement:
CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "HOUR")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
The statement names the Hive table, the Hive storage handler class, and the Druid data source.
Druid data sources in Hive: creating Druid data sources
Use a CREATE TABLE AS SELECT (CTAS) statement:
CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "HOUR")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
⇢ Inference of Druid column types (timestamp, dimensions, metrics) depends on the Hive column type
Credit [email protected]
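The type inference above can be sketched as a mapping from Hive column types to Druid roles: the timestamp column becomes Druid's __time, string columns become dimensions, and numeric columns become metrics. This is a deliberate simplification of the storage handler's actual rules:

```python
def infer_druid_role(hive_column, hive_type):
    """Illustrative mapping of a Hive column to a Druid role.

    Simplified sketch: the timestamp column maps to Druid's __time,
    strings become dimensions, numeric types become metrics.
    """
    t = hive_type.lower()
    if hive_column == "__time" or t == "timestamp":
        return "timestamp"
    if t in ("string", "varchar", "char"):
        return "dimension"
    if t in ("int", "bigint", "float", "double", "decimal"):
        return "metric"
    raise ValueError("unsupported Hive type: %s" % hive_type)

# Columns of the CTAS example above
schema = [("__time", "timestamp"), ("page", "string"),
          ("user", "string"), ("c_added", "bigint"), ("c_removed", "bigint")]
roles = {c: infer_druid_role(c, t) for c, t in schema}
```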
Druid data sources in Hive
CTAS query results:
__time               | page   | user  | c_added | c_removed
2011-01-01T01:05:00Z | Justin | Boxer | 1800    | 25
2011-01-02T19:00:00Z | Justin | Reach | 2912    | 42
2011-01-01T11:00:00Z | Ke$ha  | Xeno  | 1953    | 17
2011-01-02T13:00:00Z | Ke$ha  | Helz  | 3194    | 170
2011-01-02T18:00:00Z | Miley  | Ashu  | 2232    | 34
Original CTAS physical plan: Table Scan → Select → File Sink
Credit [email protected]
Druid data sources in Hive: creating Druid data sources
Rewritten CTAS physical plan
Data needs to be partitioned by time granularity ("druid.segment.granularity" = "HOUR"), so the rewritten plan adds a column with the timestamp truncated to the segment granularity:
CTAS query results:
__time               | page   | user  | c_added | c_removed | __time_granularity
2011-01-01T01:05:00Z | Justin | Boxer | 1800    | 25        | 2011-01-01T00:00:00Z
2011-01-02T19:00:00Z | Justin | Reach | 2912    | 42        | 2011-01-02T00:00:00Z
2011-01-01T11:00:00Z | Ke$ha  | Xeno  | 1953    | 17        | 2011-01-01T00:00:00Z
2011-01-02T13:00:00Z | Ke$ha  | Helz  | 3194    | 170       | 2011-01-02T00:00:00Z
2011-01-02T18:00:00Z | Miley  | Ashu  | 2232    | 34        | 2011-01-02T00:00:00Z
Rewritten plan: Table Scan → Select (truncate timestamp to day granularity) → Reduce → File Sink (Druid output format)
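The __time_granularity column above is just the row timestamp truncated to the segment granularity. A minimal sketch of that truncation step (day granularity, as in the example table):

```python
from datetime import datetime

def truncate_to_granularity(ts, granularity="DAY"):
    """Truncate an ISO-8601 timestamp to the segment granularity,
    mirroring the __time_granularity column added by the rewritten plan.
    Only DAY and HOUR are handled in this sketch."""
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
    if granularity == "DAY":
        dt = dt.replace(hour=0, minute=0, second=0)
    elif granularity == "HOUR":
        dt = dt.replace(minute=0, second=0)
    else:
        raise ValueError("unsupported granularity: %s" % granularity)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

truncate_to_granularity("2011-01-01T01:05:00Z")  # -> "2011-01-01T00:00:00Z"
```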
Druid data sources in Hive: creating Druid data sources
Rewritten CTAS physical plan: Table Scan → Select → Reduce → File Sink (Druid output format)
The File Sink operator uses the Druid output format:
– Creates segment files and saves the segment descriptor metadata to HDFS.
– After the reducers complete successfully, all the descriptors are committed to the metadata storage atomically.
– Waits for handoff if a coordinator is detected.
CTAS query results, partitioned into segments:
Segment 2011-01-01:
2011-01-01T01:05:00Z | Justin | Boxer | 1800 | 25  | 2011-01-01T00:00:00Z
2011-01-01T11:00:00Z | Ke$ha  | Xeno  | 1953 | 17  | 2011-01-01T00:00:00Z
Segment 2011-01-02:
2011-01-02T19:00:00Z | Justin | Reach | 2912 | 42  | 2011-01-02T00:00:00Z
2011-01-02T13:00:00Z | Ke$ha  | Helz  | 3194 | 170 | 2011-01-02T00:00:00Z
2011-01-02T18:00:00Z | Miley  | Ashu  | 2232 | 34  | 2011-01-02T00:00:00Z
Druid data sources in Hive: updating data sources
Use an INSERT OVERWRITE statement:
– Can only append or overwrite.
– Must keep the same schema.
INSERT OVERWRITE TABLE druid_table_1
SELECT __time, page, user, c_added, c_removed
FROM src;
Use DROP TABLE to delete the metadata from Hive and the data source from Druid:
DROP TABLE druid_table_1 [PURGE];
Querying Druid data sources
Automatic rewriting when a query is expressed over a Druid table:
– Powered by Apache Calcite.
– Main challenge: identify patterns in the logical plan corresponding to the different kinds of Druid queries (Timeseries, GroupBy, Select).
Translate the (sub)plan of operators into a valid Druid JSON query:
– The Druid query is encapsulated within the Hive TableScan operator.
The Hive TableScan uses the Druid input format:
– Submits the query to Druid and generates records out of the query results.
– Interacts with the Druid broker node, or with historical nodes in parallel.
It might not be possible to push all computation to Druid:
– Our contract is that the query should always be executed.
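As a concrete illustration of the plan-to-JSON translation, a simple time-bounded aggregation over the Druid table maps naturally onto a Druid timeseries query. The sketch below builds such a query body by hand; the field values are illustrative, not what the Calcite rewrite emits verbatim:

```python
def timeseries_query(datasource, intervals, aggregations, granularity="all"):
    """Illustrative Druid timeseries query body, of the kind the
    Calcite-based rewrite can produce for a simple aggregation."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": granularity,
        "intervals": intervals,
        "aggregations": aggregations,
    }

# e.g. SELECT SUM(c_added) FROM druid_table_1 over two days
# could become one longSum aggregator over that interval
q = timeseries_query(
    "wikiticker",
    ["2011-01-01T00:00:00Z/2011-01-03T00:00:00Z"],
    [{"type": "longSum", "name": "added_sum", "fieldName": "c_added"}],
)
```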
Druid input format
Extends InputFormat&lt;NullWritable, DruidWritable&gt;.
Submits the query to Druid and generates records out of the query results.
Current version:
– Timeseries, TopN, and GroupBy queries are not partitioned.
– Select queries are partitioned along the time dimension column, assuming a uniform distribution.
Ongoing work for the Select query:
– Bypass the broker: query Druid realtime and historical nodes directly.
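The Select-query partitioning mentioned above can be sketched as splitting the query interval into uniform time slices, one per input split. This is a simplification of the input format's actual logic, under the stated uniform-distribution assumption:

```python
from datetime import datetime

def uniform_time_splits(start, end, num_splits):
    """Split [start, end) into num_splits equal sub-intervals, mirroring
    how Select queries can be partitioned along the time dimension when
    rows are assumed uniformly distributed over time."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    s, e = datetime.strptime(start, fmt), datetime.strptime(end, fmt)
    step = (e - s) / num_splits  # equal-width slice per split
    return [
        (s + i * step).strftime(fmt) + "/" + (s + (i + 1) * step).strftime(fmt)
        for i in range(num_splits)
    ]
```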
Next: Hive integration
– Push more time filter predicates and/or computation down the chain.
– Make use of long/float columns.
– Complex column types (sketches, HLL, etc.).
– Streaming version of the Select query.
– Interact with the coordinator for data source creation.
– Time semantics (time zone handling).
– Null semantics.
Thank You
@ApacheHive | @ApacheCalcite | @druidio | @ApacheAmbari
http://cwiki.apache.org/confluence/display/Hive/Druid+Integration
http://calcite.apache.org/docs/druid_adapter.html
https://issues.apache.org/jira/browse/AMBARI-17981
Demo: