A NEW PLATFORM FOR A NEW ERA - John Funk · 2014-08-25



Capgemini / Pivotal Alliance Confidential – Do Not Distribute. July 2014 New Hire Immersion Training. © Copyright 2014 Pivotal. All rights reserved.

Pivotal HD and HAWQ Immersion v5 John Funk


Course Outline

•  PHD and HAWQ Introduction
•  HAWQ Architecture
•  HDFS Review
•  HAWQ Distribution, Partitioning and Storage options
•  Query execution in HAWQ
•  Loading and Unloading data in HAWQ
•  PXF – Pivotal Xtension Framework Best Practices
•  HAWQ, HBase and Hive Comparative Usage
•  Securing HAWQ


Pivotal HD and HAWQ Introduction and Positioning


Pivotal HD and HAWQ is the…

Enterprise platform that provides the fewest barriers, the lowest risk, and the most cost-effective and fastest way to enter into big data analytics on Hadoop


HAWQ Evolved From…

•  Greenplum database re-platformed on Hadoop/HDFS
•  Over a decade of proven Greenplum database performance
•  HAWQ provides all major features found in Greenplum database
    –  SQL Completeness: 2003 Extensions
    –  Robust Query Optimizer
    –  Row or Column-Oriented Table Storage
    –  Compression
    –  Distributions
    –  Multi-level Partitioning
    –  Parallel Loading and Unloading
    –  High speed data redistribution
    –  Views
    –  External Tables
    –  Resource Management
    –  Security
    –  Authentication
    –  Management and Monitoring
    –  ODBC/JDBC Compliant


HAWQ Benefits…

•  Out of the box SQL for Hadoop
•  SQL adoption versus learning MapReduce programming
•  GPXF External Tables providing SQL access to Hadoop
    –  HDFS, HBase, Hive or any data types
•  Broad data access, integration and portability
•  Performance and Scalability
    –  Parallel Everything
    –  Dynamic Pipelining
    –  High Speed Interconnect
    –  Optimized HDFS access with libhdfs3
    –  Co-Location
    –  Partition Elimination
    –  Higher Cluster Utilization
    –  Concurrency Control


Pivotal HD Architecture

[Architecture diagram; the legend distinguishes Apache and Pivotal components]

•  Hadoop stack: HDFS, HBase, Pig, Hive, Mahout, MapReduce, Sqoop, Flume, YARN, Zookeeper, Oozie, Vaidya
•  Resource Management & Workflow
•  Pivotal HD Enterprise: Command Center (Configure, Deploy, Monitor, Manage), Data Loader, Spring, Unified Storage Service, Hadoop Virtualization Extension
•  HAWQ – Advanced Database Services: ANSI SQL + Analytics, Xtension Framework, Catalog Services, Query Optimizer, Dynamic Pipelining
•  GemFire XD – Real-Time Database Services: ANSI SQL + In-Memory, Distributed In-memory Store, Query Transactions, Ingestion Processing, Hadoop Driver – Parallel with Compaction
•  MADlib Algorithms


Flexible Deployment Model

•  Deploy to: Public Cloud, On Premise, Private Cloud
•  Portable
•  Elastic
•  Promotable
•  HW Abstracted
•  Manageable


Pivotal HD

•  World’s first true SQL processing for enterprise-ready Hadoop
•  100% Apache Hadoop-based platform
•  Virtualization and cloud ready with VMware and Isilon
•  Scale tested in 1000 node Pivotal Analytics Workbench
•  Available as a software-only or appliance-based solution
•  Backed by EMC’s global, 24x7 support infrastructure


Introduction to Pivotal HD


Pivotal HD Architecture

[Diagram repeated: the Pivotal HD stack of Apache Hadoop components, Pivotal HD Enterprise services, HAWQ – Advanced Database Services, and GemFire XD – Real-Time Database Services]


Pig

•  Pig provides a high-level, data-flow-oriented abstraction for MapReduce
    –  Much more concise than MapReduce code
    –  Though not very intuitive
•  Compiles to MapReduce programs, which it runs for you
•  Output can be dumped to the terminal, or written as files in HDFS for access by HAWQ or other tools
•  Useful operators, extensible through “Piggybank”
•  Developed at Yahoo!


Hive

•  Hive provides a SQL-like interface to data in HDFS
•  To users who know SQL, Hive provides a much more intuitive interface than MapReduce or Pig
•  Like Pig, Hive operates by translating the user’s query into one or more MapReduce jobs, running these on potentially very large data sets, and finally printing the result
•  Drawbacks: limited SQL, job latency and frequent I/O (slow)
•  Developed at Facebook


HBase

•  HBase provides random, real-time read/write access to data stored within HDFS
    –  Sparse, wide tables
•  Flexible schema
•  Key/value store: given (‘table’, ‘rowkey’), retrieve row
    –  Does not perform well when data is not retrieved by row key
•  An update to a row adds new data with the current timestamp
    –  A previous state can be recovered using a previous timestamp
•  Using PXF external tables, HAWQ is able to incorporate HBase data into queries
    –  Pushing predicates into HBase when possible
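The timestamped-update behaviour described on this slide can be illustrated with a small in-memory model. This is only a sketch of the idea, not HBase's API: the class `VersionedTable`, its methods, and the logical clock are all hypothetical names invented for the example.

```python
import itertools
from collections import defaultdict


class VersionedTable:
    """Toy model of a sparse, versioned key/value table (not the HBase API)."""

    def __init__(self):
        # rowkey -> column -> list of (timestamp, value), oldest first
        self._rows = defaultdict(lambda: defaultdict(list))
        # Logical clock standing in for real timestamps (an assumption).
        self._clock = itertools.count(1)

    def put(self, rowkey, column, value):
        """Append a new version; earlier versions are kept."""
        ts = next(self._clock)
        self._rows[rowkey][column].append((ts, value))
        return ts

    def get(self, rowkey, as_of=None):
        """Return the row as of a timestamp (latest version if as_of is None)."""
        row = {}
        for column, versions in self._rows.get(rowkey, {}).items():
            visible = [v for ts, v in versions if as_of is None or ts <= as_of]
            if visible:
                row[column] = visible[-1]
        return row


table = VersionedTable()
t1 = table.put("bar:42", "city", "Oakland")
t2 = table.put("bar:42", "city", "San Francisco")
latest = table.get("bar:42")              # newest version of each column
earlier = table.get("bar:42", as_of=t1)   # state recovered via an older timestamp
```

Lookups go through the row key, which is why access patterns that cannot name a key end up scanning.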


Pivotal HAWQ


HAWQ: The Crown Jewels

•  SQL compliant
•  World-class query optimizer
•  Interactive query
•  Horizontal scalability
•  Robust data management
•  Common Hadoop formats
•  Deep analytics


High Performance Query Processing HAWQ

•  Interactive and true ANSI SQL support
•  Multi-petabyte horizontal scalability
•  Cost-based parallel query optimizer
•  Programmable analytics


Enterprise Class Database Services & Management HAWQ

•  Scatter-gather data loading
•  Row and column storage
•  Workload management
•  Multi-level partitioning
•  3rd-party tool & open client interfaces


Pre-Integrated Deep Analytics HAWQ

•  Performance via fully parallelized implementation
•  Consistent, user friendly SQL interfaces
•  Ease of data preparation
•  Pre-integrated MADlib support
    –  Linear Regression
    –  Logistic Regression
    –  Multinomial Logistic Regression
    –  K-Means
    –  Association Rules
    –  PLDA (useful for topic modeling)


PXF: Pivotal Xtension Framework

A fast extensible framework connecting HAWQ to a data store of choice that exposes a parallel API


PXF

•  An advanced version of GPDB external tables
•  Enables combining HAWQ data and Hadoop data in a single query
•  Supports connectors for HDFS (read and write), HBase and Hive
•  Provides extensible framework API to enable custom connector development for other data sources
    –  GemFireXD, JSON format, Cassandra, Accumulo

[Diagram: PXF Xtension Framework connecting to HDFS, HBase and Hive]


PXF Features

•  What is it?
    –  HAWQ feature to access data stored in other popular Hadoop modules (HDFS, HBase, Hive) using the full SQL interface of HAWQ
•  Why is it important?
    –  A customer may prefer to primarily manage certain data in HBase, but want to join this to other data sets stored in HAWQ for analytics purposes. Or a customer may need SQL access to data in HBase or HDFS.
•  When/who to use with?
    –  An important feature to discuss with data and application architects who are concerned about unifying data access patterns across the variety of Hadoop components
    –  Also useful to address any concerns about HAWQ using a proprietary data format not currently readable by other Hadoop processes.

[Diagram: PXF provides transparent, optimized SQL access to non-HAWQ formats (Text, HBase, Hive, Avro) on HDFS]


PXF Feature Summary

★  HBase (w/filter pushdown)
★  Hive (w/partition exclusion, various storage file types)
★  HDFS Files: read (delimited text, csv, Sequence, Avro)
★  HDFS Files: write (delimited text, csv, Sequence, various compression codecs and options)
★  GemFireXD, JSON format, Cassandra, Accumulo (currently Beta)
★  Stats collection
★  Automatic data locality optimizations
★  Extensibility!
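Filter pushdown, listed above for the HBase connector, means the filter is evaluated inside the external store so that only matching rows are shipped back. The sketch below shows the effect on a toy store; `ToyRowStore`, `query_by_key`, and the `rows_returned` counter are hypothetical names for illustration and do not reflect the PXF connector API.

```python
class ToyRowStore:
    """Stand-in for an external store that can filter by row key itself."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.rows_returned = 0  # how many rows left the store

    def scan(self, key_equals=None):
        for key, value in self._rows.items():
            if key_equals is not None and key != key_equals:
                continue  # filtered inside the store, never shipped
            self.rows_returned += 1
            yield key, value


def query_by_key(store, wanted_key, pushdown):
    """Fetch one key, either pushing the filter down or filtering afterwards."""
    if pushdown:
        return list(store.scan(key_equals=wanted_key))
    # Without pushdown every row is transferred, then filtered by the caller.
    return [(k, v) for k, v in store.scan() if k == wanted_key]


data = {"a": 1, "b": 2, "c": 3}
with_pushdown = ToyRowStore(data)
without_pushdown = ToyRowStore(data)
result_pushed = query_by_key(with_pushdown, "b", pushdown=True)
result_plain = query_by_key(without_pushdown, "b", pushdown=False)
```

Both calls return the same rows; the difference is in `rows_returned`, which stands in for the data moved between the store and the query engine.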


Pivotal HD and HAWQ Rapid Innovation A look at features released in 2014


What’s New in PHD 1.1

•  GemFire XD Beta
•  Orca
•  PXF: Writable HDFS Table Support
•  HAWQ Format Reader
•  UDF Support
•  Oozie
•  Vaidya
•  Kerberos Support (HDFS, HAWQ, USS)
•  Pgcrypto for HAWQ
•  Unified Storage Service: CDH4 as a data source


What’s New in PHD 1.1.1

•  Automatic HD configuration via ICM
    –  Manual failover of HAWQ/PXF
•  Manual NameNode HA
•  Kerberos authentication support (includes HAWQ, PXF, HBase, Hive)
•  Parameterized Hadoop environment variables
•  Backup and restore scripts for Admin node
•  Rebalance HDFS using web API
•  PiggyBank Support in Pig 0.12
•  HAWQ gp_toolkit support


What’s New in PHD 2.0

•  GemFire XD GA
•  Pivotal HD Stack
    –  Hadoop 2.2 Rebase; Built w/JDK 1.7
    –  Hive 0.12, HBase 0.96
    –  Graphlab 2.2 BETA (via Hamster/OpenMPI)
•  HAWQ
    –  Automated NameNode and HAWQ Master Failover
    –  MADlib 1.5 as separately deployable package, PL/Java, (PL/R and PL/Python from 1.1.1)
    –  Add Segments (HAWQ expand)
    –  Pluggable storage Phase 1 – Basic Parquet support
    –  Error Tables
•  PCC and ICM
    –  New ‘Read Only’ user role
    –  Log Management
    –  DCA/Isilon enhancements


HAWQ 1.2 Deep Scalable Analytics

•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  K-Means
•  Association Rules
•  Latent Dirichlet Allocation
•  Naïve Bayes
•  Elastic Net Regression
•  Decision Trees / Random Forest
•  Support Vector Machines
•  Cox Proportional Hazards Regression
•  Descriptive Statistics
•  ARIMA


PivotalR vs. PL/R

PivotalR
•  Interface is R client
•  Execution is in database
•  Parallelism handled by PivotalR
•  Supports a portion of R

PL/R
•  Interface is SQL client
•  Execution is in R
•  Parallelism via SQL function invocation
•  Supports all of R


More to Come…!

•  PostGIS
•  Enhanced Optimizer
•  Query 3rd party remote clusters
•  …and Much More


Greenplum Database and HAWQ


HAWQ Evolved From…

•  Greenplum database re-platformed on Hadoop/HDFS
•  HAWQ provides all major features found in Greenplum database
    –  SQL Completeness: 2003 Extensions
    –  JDBC Compliant
    –  Robust Query Optimizer
    –  Row or Column-Oriented Table Storage
    –  Parallel Loading and Unloading
    –  Distributions
    –  Multi-level Partitioning
    –  High speed data redistribution
    –  Views
    –  External Tables
    –  Compression
    –  Resource Management
    –  Security
    –  Authentication
    –  Management and Monitoring


HAWQ

•  GPDB on HDFS
•  Not shared-nothing: built on a distributed file system (HDFS)
    –  Nodes can access shards of data on other nodes
•  Built for large I/O, append-only, write-once, read-many
•  Segments are stateless
    –  HA is one of the main drivers towards HDFS

[Diagram: one HDFS NameNode and three HDFS DataNodes]


HAWQ Features

•  HAWQ provides all major features found in Greenplum database that can be supported in Hadoop/HDFS, including
    –  Row or Column-oriented table storage
    –  Distributions
    –  Partitioning
    –  Views
    –  External tables
•  Using some features without understanding their implications in HDFS may result in problems
    –  We will discuss this in the modules on each specific topic.


Architectural Differences from GPDB

•  Stateless Segment Hosts
    –  Segments do not know what is visible or aborted in their physical data
    –  Segments do not know what columns are in a table
•  HA model deviates from the shared-nothing environment
    –  If a segment is down, simply read from the replica in HDFS
    –  No lengthy failover process
•  HDFS design doesn’t lend itself to local transaction management
    –  Frequent, small bursts of I/O on HDFS perform poorly
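The "read from the replica" point above can be pictured as follows: because HDFS keeps several copies of each block, a reader only needs one live node holding a copy. This is a minimal sketch under that assumption; `choose_replica`, the block map, and the node names are hypothetical, and a real HDFS client performs replica selection internally.

```python
def choose_replica(block_id, block_replicas, live_nodes, local_node=None):
    """Pick a node to read a block from: local replica first, then any live one."""
    replicas = block_replicas[block_id]
    # Stable sort: the local node (if it holds a replica) comes first.
    ordered = sorted(replicas, key=lambda node: node != local_node)
    for node in ordered:
        if node in live_nodes:
            return node
    raise IOError(f"no live replica for block {block_id}")


# Hypothetical block placement: one block replicated on three DataNodes.
block_replicas = {"blk_1": ["dn1", "dn2", "dn3"]}

all_up = choose_replica("blk_1", block_replicas, {"dn1", "dn2", "dn3"}, local_node="dn2")
dn2_down = choose_replica("blk_1", block_replicas, {"dn1", "dn3"}, local_node="dn2")
```

No state has to be moved when a node disappears; the read is simply served from another copy, which is the contrast with a lengthy failover process.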


Architectural Implications of Using HDFS

•  To re-platform GPDB on HDFS, segment workers had to be simplified (or made dumber)
    –  GPDB segment workers had their own copies of metadata, transaction management and local storage
•  Heap storage in GPDB requires the database to make modifications to tuples on disk
    –  HDFS is append-only, therefore heap storage cannot work on DataNodes
    –  Catalog tables require 100% heap storage, so segment servers cannot have a local copy of the catalog


GPDB and HAWQ Differences at a Glance

Considering the architectural differences and implications of HDFS…

•  No Update and Delete
    –  Truncate is supported
•  No catalog on segment servers
•  No local transaction management at the segment level
•  No indexes
•  Local storage exists on segments but is used for temporary purposes
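The "no update and delete, truncate is supported" restriction follows from append-only storage, and a toy table makes the consequence concrete. This is a sketch only: `AppendOnlyTable` and `rewrite_without` are hypothetical names, and rewriting the kept rows into a new table is shown as one possible workaround, not as a documented HAWQ procedure.

```python
class AppendOnlyTable:
    """Toy model of a table whose storage only supports appends."""

    def __init__(self):
        self._rows = []

    def insert(self, rows):
        # New data is always appended; existing rows are never touched.
        self._rows.extend(rows)

    def scan(self, predicate=lambda row: True):
        return [row for row in self._rows if predicate(row)]

    def truncate(self):
        # Dropping all data at once needs no in-place modification.
        self._rows.clear()

    def update(self, *args, **kwargs):
        raise NotImplementedError("in-place UPDATE is not available on append-only storage")

    def delete(self, *args, **kwargs):
        raise NotImplementedError("in-place DELETE is not available on append-only storage")


def rewrite_without(table, predicate):
    """Hypothetical workaround: build a new table from the rows to keep."""
    fresh = AppendOnlyTable()
    fresh.insert(table.scan(lambda row: not predicate(row)))
    return fresh


orders = AppendOnlyTable()
orders.insert([{"id": 1}, {"id": 2}])
kept = rewrite_without(orders, lambda row: row["id"] == 1)
```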


HAWQ or Greenplum Database?

(✗ marks a supported feature or suitable use)

                                          GPDB   HAWQ
Real time random read/writes               ✗
Large I/O write once, read many                   ✗
Petabytes of data                                 ✗
Hadoop/HDFS platform                              ✗
Updates                                    ✗
Deletes                                    ✗
Indexes                                    ✗
Row or columnar oriented table storage     ✗      ✗
User Defined Data Distributions            ✗      ✗
User Defined Partitioning                  ✗      ✗
Resource Management                        ✗      ✗
User Defined Functions (UDFs)              ✗      ✗
External Tables                            ✗      ✗
GPText                                     ✗
MADlib Algorithms                          ✗      ✗


Introduction to HAWQ Architecture


Basic HAWQ Architecture

[Diagram: HAWQ Master and HAWQ Standby Master connected over the Interconnect to the Segment Hosts, all on top of HDFS]

•  HAWQ Master: Parser, Query Optimizer, Dispatch, Query Executor, Local TM, PXF, Local Storage
•  NameNode
•  Segment Host (on each DataNode): Segment, [Segment …], Query Executor, PXF, HDFS, Local Temp Storage

In production there will be other nodes, for example the Pivotal CC/ICM admin node, YARN Resource Manager node, Secondary NameNode, etc.


HAWQ Master

•  Located on a separate node from the NameNode in production
    –  For a small POC cluster the HAWQ Master may run on the NameNode
•  Does not contain any user data
•  Contains Global System Catalog
    –  System tables that contain HAWQ metadata
•  Authenticates client connections, processes SQL, distributes work between segments, coordinates results returned by segments, presents final client results

[Diagram: HAWQ Master components: Parser, Query Optimizer, Dispatch, Query Executor, Local TM, Catalog, Local Storage]


HAWQ Metadata

•  Metadata is stored only in the HAWQ Master, on the local file system
    ▪  Catalog information makes use of heap store
•  No catalog/metadata on segment nodes (DataNodes)
    ▪  Segment nodes are stateless
    ▪  No heap store

[Diagram: HAWQ Master components: Parser, Query Optimizer, Dispatch, Query Executor, Local TM, Catalog, Local Storage]


HAWQ Segments

•  A HAWQ segment within a Segment Host is an HDFS client that runs on a DataNode
•  Multiple segments per Segment Host/DataNode
•  Segment is a basic unit of parallelism
•  Multiple segments work together to form a single parallel query processing system
•  Operations (scans, joins, aggregations, sorts, etc.) execute in parallel across all segments simultaneously

[Diagram: Segment Host on a DataNode: Segment, [Segment …], Query Executor, PXF, HDFS, Local Temp Storage]


Segments Access Data Stored in HDFS

•  Segments are stateless
    –  They do not store database and table metadata
    –  HAWQ Master dispatches the query plan along with related metadata obtained from the NameNode
•  Segments communicate with the NameNode to obtain block lists where data is located
•  Segments access data stored in HDFS

[Diagram: Segment Host: Segment, [Segment …], Query Executor, PXF, HDFS, Local Temp Storage]


HAWQ Parser

[Diagram: basic HAWQ architecture, with Clients connecting to the HAWQ Master via JDBC/SQL]

•  Enforces syntax and semantics
•  Converts SQL query into a parse tree data structure describing details of the query


HAWQ Parallel Query Optimizer

[Diagram: HAWQ Master (Parser, Query Optimizer, Dispatch, Query Executor, Local TM, Local Storage, PXF), NameNode, Interconnect, and two Segment Hosts on DataNodes]

[Diagram: example parallel query plan. Operators shown: Gather Motion, Sort, HashAggregate, HashJoin, Redistribute Motion, Hash, Broadcast Motion, and Seq Scans on lineitem, orders, customer, and nation]


HAWQ Dispatch and Query Executor

[Diagram: HAWQ Master (Parser, Query Optimizer, Dispatch, Query Executor, Local TM, Local Storage, PXF), NameNode, Interconnect, and two Segment Hosts on DataNodes]

1.  Dispatch communicates the query plan to segments
2.  Query Executor executes the physical steps in the plan

[Diagram: the same plan shown on each of two segments. Operators: Scan Bars b; HashJoin b.name = s.bar; Scan Sells s; Filter b.city = 'San Francisco'; Project s.beer, s.price; Motion Gather; Motion Redist(b.name)]
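The dispatch and execution steps can be sketched as a toy model. This is not HAWQ code: the segment count, the sample rows, the assumption that Sells is already hash-distributed on s.bar, and the use of `zlib.crc32` as the redistribution hash are all illustrative choices. Every simulated segment runs the same plan fragment, and the results are gathered at the end.

```python
import zlib

NUM_SEGMENTS = 4  # assumed segment count for this toy model


def segment_for(key: str) -> int:
    """Map a distribution key to a segment (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SEGMENTS


# Hypothetical sample data: Bars(name, city) and Sells(bar, beer, price).
BARS = [("Toronado", "San Francisco"), ("Zeitgeist", "San Francisco"),
        ("Blind Tiger", "New York")]
SELLS = [("Toronado", "Pliny", 7.0), ("Zeitgeist", "Anchor Steam", 5.0),
         ("Blind Tiger", "Anchor Steam", 8.0), ("Toronado", "Anchor Steam", 6.0)]


def run_plan(bars, sells):
    # Assumption: Sells is already hash-distributed on s.bar.
    sells_by_seg = [[] for _ in range(NUM_SEGMENTS)]
    for row in sells:
        sells_by_seg[segment_for(row[0])].append(row)

    # Bars starts out spread round-robin; each segment scans its slice,
    # filters it, and redistributes the surviving rows by b.name.
    bars_by_seg = [[] for _ in range(NUM_SEGMENTS)]
    for seg in range(NUM_SEGMENTS):
        local_slice = bars[seg::NUM_SEGMENTS]                        # Scan Bars b
        kept = [b for b in local_slice if b[1] == "San Francisco"]   # Filter
        for b in kept:                                               # Motion Redist(b.name)
            bars_by_seg[segment_for(b[0])].append(b)

    gathered = []
    for seg in range(NUM_SEGMENTS):
        build = {name for name, _city in bars_by_seg[seg]}           # hash table on b.name
        for bar, beer, price in sells_by_seg[seg]:                   # Scan Sells s
            if bar in build:                                         # HashJoin b.name = s.bar
                gathered.append((beer, price))                       # Project s.beer, s.price
    return sorted(gathered)  # Motion Gather (sorted only for stable output)
```

Because both inputs end up placed by the same hash of the bar name, each join runs entirely within one segment, which is the point of the redistribute motion.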


HAWQ Transactions

•  DataNodes in HDFS do not know what is visible
   –  No idea what data they have
   –  Visibility is defined by the NameNode
•  Therefore, segment nodes do not know what is visible
   –  Visibility is defined by HAWQ Master
•  No distributed transaction management
   –  No UPDATE or DELETE
•  Truncate is implemented to support rollback of failed transactions
•  Transaction logs present only on HAWQ Master
   –  For inserts, single phase commit performed on HAWQ Master

[Diagram: HAWQ Master containing Parser, Query Optimizer, Dispatch, Query Executor, Local TM, Local Storage, and Catalog]


HAWQ Interconnect Performance and Scalability

•  Inter-process communication between segments
   –  Standard Ethernet switching fabric
•  Uses UDP (User Datagram Protocol)
   –  Improved performance and scalability
•  Performs additional packet verification and checking that UDP itself does not provide
   –  Reliability equivalent to TCP

[Diagram: the Interconnect linking two Segment Hosts, each a DataNode with a Query Executor, PXF, HDFS, Local Temp Storage, and Segments]


HAWQ Dynamic Pipelining™

[Diagram: three Segment Hosts, each with a Query Executor, PXF, DataNode, and Local Temp Storage, connected by the Dynamic Pipelining Interconnect]

•  Differentiating competitive advantage!
•  Core execution technology from GPDB
•  Parallel data flow using the high speed UDP interconnect
•  No materialization
   –  Unlike MapReduce, which materializes intermediate data


Dynamic Pipelining™

•  Framework that enables parallel data flow
   –  Combines a high speed UDP interconnect and a run time execution environment for big data workloads
   –  Data from upstream components in the dynamic pipeline is transmitted to downstream components through the UDP interconnect
•  The Dynamic Pipelining run time layer ensures that queries complete, even very demanding queries under heavy cluster utilization
   –  Provides a seamless data partitioning mechanism which groups together the parts of a data set that are often used in any given query
   –  Enables queries to run without materializing contents to disk
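The difference between pipelined and materialized execution can be illustrated with a small analogy. In this sketch, in-process Python generators stand in for operators connected by the interconnect; the operator names and the sample rows are hypothetical, and nothing here reflects HAWQ internals. The pipelined version passes each row downstream as soon as it is produced, while the materialized version stores each stage's full output before the next stage starts.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, int]


def scan(rows: Iterable[Row]) -> Iterator[Row]:
    """Upstream operator: yields rows one at a time."""
    for row in rows:
        yield row


def filter_min(rows: Iterable[Row], minimum: int) -> Iterator[Row]:
    """Middle operator: forwards only rows whose value is at least `minimum`."""
    for key, value in rows:
        if value >= minimum:
            yield key, value


def total(rows: Iterable[Row]) -> int:
    """Downstream operator: consumes rows as they arrive."""
    return sum(value for _key, value in rows)


def pipelined_total(rows: Iterable[Row], minimum: int) -> int:
    # No stage holds more than one row at a time.
    return total(filter_min(scan(rows), minimum))


def materialized_total(rows: Iterable[Row], minimum: int) -> int:
    # Each stage's output is stored in full before the next stage runs.
    scanned: List[Row] = list(scan(rows))
    kept: List[Row] = list(filter_min(scanned, minimum))
    return total(kept)
```

Both functions return the same answer; only the amount of intermediate data held between stages differs.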


Lab


HAWQ_DDL Lab: Create tables for lab exercises

•  Run the DDL script to create HAWQ database and tables
•  Review HAWQ tables


HDFS Review


What is HDFS?

•  HDFS is a file system implemented in Java
   –  Native clients use libhdfs (JNI) to access the file system
•  Scalable, distributed, fault-tolerant file system
•  Designed to run well on commodity hardware
•  Acknowledges that components frequently fail
   –  An entire node may fail or, more commonly, one or more disks within a node will fail
   –  Gracefully continues to run in the presence of failures (entire node or disks within a node)


HDFS Basic Architecture

[Diagram: a Client ingests and egresses files over Ethernet; the NameNode holds the metadata; three DataNodes each hold local data stores; 3 x replication (default)]


HDFS Model

•  Mostly POSIX “like” file system, with some caveats
   –  Write once, read many
   –  Doesn’t support updates to files (simple consistency model)
   –  Pivotal HD supports append and truncate on its HDFS layer
•  Access patterns are well-suited to SATA disk drives
   –  Fewer seeks
   –  Read large, contiguous blocks
•  Prefer fewer, large files
   –  Split files up into blocks
      ▪  128MB default
   –  Evenly distribute blocks across the cluster


HAWQ Data Storage and I/O


HAWQ Data Storage and I/O

•  Segments are HDFS clients that run on DataNodes
•  Each table’s data is sharded on HDFS
•  The DataNodes are responsible for serving read and write requests from HAWQ segments
   –  Data stored in HAWQ database tables
•  Data stored outside HAWQ but within the Hadoop cluster can be read using PXF external tables, which are extensible
   –  HDFS, Hive, HBase
•  Data stored in HAWQ can be written to HDFS for external consumption using Writable HDFS Table Support


Segment Files

•  Each table’s data is sharded on HDFS
•  For example:

/hawq_data/gpseg<ID>/<DB OID>/<schema OID>/<table OID>.1,2,3,4,…

•  Data inserted to the same segment is always appended to the segment file
•  The maximum file size in HDFS is governed by the dfs.namenode.fs-limits.max-blocks-per-file setting in the hdfs-site.xml configuration file
   –  The default is 1048576 blocks, which is 64TB at a 64MB block size (128TB at a 128MB block size)
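The arithmetic behind the file size limit can be checked with a short sketch. The function names are hypothetical; the only inputs taken from the slides are the 1048576 blocks-per-file default and the block sizes. The 64TB figure corresponds to a 64MB block size; with the 128MB block size mentioned in the HDFS Model slide, the same limit works out to 128TB.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def max_file_size_bytes(max_blocks_per_file: int, block_size_bytes: int) -> int:
    """Largest file HDFS allows: the block-count limit times the block size."""
    return max_blocks_per_file * block_size_bytes


def blocks_needed(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of blocks a file occupies (ceiling division; last block may be partial)."""
    return -(-file_size_bytes // block_size_bytes)


if __name__ == "__main__":
    for block_mib in (64, 128):
        limit = max_file_size_bytes(1048576, block_mib * MIB)
        print(f"{block_mib}MB blocks: limit is {limit // TIB}TB")
```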


Data Locality

•  For tables using a hash distribution, data with the same hash key is always handled by the same segment and is always written to the DataNode on that segment's host
•  Data locality is always maintained unless one of the following conditions occurs
   –  The DataNode on the segment host is at full file capacity
   –  The DataNode on the segment host fails
   –  The DataNode experiences more failed drives than the value specified by the dfs.datanode.failed.volumes.tolerated configuration parameter
•  Data locality is lost permanently when a DataNode is down long enough for the NameNode to mark it down


Local Read Failures

•  When a read from the local DataNode on a segment host fails, the read is performed from a remote DataNode (replicated copy)
•  Performance impact of approximately 70%
   –  This number quickly decreases with subsequent reads as a result of caching the data
   –  Decreases to 10% on subsequent reads when the cache is hit


HDFS I/O

•  HDFS is a file system implemented in Java
   –  libhdfs (JNI) is used to access HDFS
•  Reading through the HDFS indirection layer is 1.75 to 2.5 times slower than reading directly from disk
   –  This is the cost of simply reading, doing an IPC into a JVM, and Java reaching out to the file system
•  Reading through libhdfs in Java is slow (garbage collection + overhead)


libhdfs3

•  Pivotal rewrote libhdfs in C++, resulting in libhdfs3
   –  C based library
   –  Leverages protocol buffers to achieve greater performance
•  libhdfs3 is used to access HDFS from HAWQ
•  There is a GUC (server configuration parameter) to disable libhdfs3, but it is used only for internal testing and debugging by engineering
   –  It should never be turned off or disabled in the field


HAWQ Reads and HDFS

•  In HAWQ, data is physically partitioned/sharded across the cluster
•  HDFS is not designed for accessing a large number of small files in a single query
•  For every DataNode running HAWQ, for every segment, for every partition and for every column (if using column orientation, CO), a substantial amount of metadata is needed from the NameNode
•  By contrast, a MapReduce job typically reads one large, contiguous file across HDFS, then carves it up into partitions at run time and executes


HAWQ Reads

•  The HAWQ master has a centralized catalog metadata store
•  HDFS has a NameNode metadata store
•  The HAWQ master must query the NameNode for metadata and dispatch it along with the query plan to each of the HAWQ segments
•  The segments then call back to the NameNode to obtain a block location array consisting of block IDs
   –  For any given shard, the actual block IDs to read are not known in advance


HAWQ Data Storage Performance Considerations

•  Data is still split per segment, so there is one file per object, per segment
•  There can be a large number of partitions depending on the partition granularity
   –  Every partition is a file
•  Columnar orientation on very wide tables
   –  Every column is a file
•  Can result in
   –  Many very small files
   –  A huge number of calls to the NameNode
   –  Errors (particularly when loading) and slowness (when running queries)


Solution

•  You must consider #Segments x #Columns x #Partitions
•  In general, determine the optimal number of segments on DataNodes
•  Use a higher partition granularity
•  Limit columnar orientation on very wide tables
   –  If partition granularity requirement is low, use row-based table orientation
•  NEVER use partitioning and column orientation together!
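The #Segments x #Columns x #Partitions product is easy to underestimate, so a small worked calculation may help. This sketch follows the rules stated on the slides (one file per object per segment, every partition is a file, every column is a file for column-oriented tables); the function name and the example figures of 64 segments, 365 partitions, and 200 columns are hypothetical.

```python
def hdfs_file_count(segments: int, partitions: int = 1, columns: int = 1,
                    column_oriented: bool = False) -> int:
    """Estimate the number of HDFS files backing one table."""
    # Row-oriented tables store one file per partition per segment;
    # column-oriented tables store one file per column on top of that.
    files_per_partition = columns if column_oriented else 1
    return segments * partitions * files_per_partition


if __name__ == "__main__":
    print(hdfs_file_count(64))                                    # plain row table
    print(hdfs_file_count(64, partitions=365))                    # daily partitions
    print(hdfs_file_count(64, partitions=365, columns=200,
                          column_oriented=True))                  # partitioned + columnar
```

With these example figures the partitioned, column-oriented table needs several million files, each of which requires NameNode metadata, which is why the two features should not be combined.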


HAWQ Distributions, Partitioning and Storage Options


HAWQ Data Distributions

•  Same functionality and behavior as GPDB
   –  Data locality/co-located joins, redistribution, broadcasts, etc.
   –  Most important is an even distribution of data!
•  Loading randomly distributed tables is faster on larger tables since data does not get hashed
•  There is no difference in sequential scans on randomly distributed tables vs. hash distributed tables
•  Complex queries (joins, aggregates, sorts) on large randomly distributed tables take longer because data must be re-hashed for local joins and aggregates
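Why an even distribution matters, and how a poor distribution key undermines it, can be shown with a minimal sketch. The hash used here (`zlib.crc32`) is a stand-in, not HAWQ's actual hash function, and the function names, segment count, and keys are hypothetical.

```python
import random
import zlib
from collections import Counter
from typing import Iterable


def hash_segment(key: str, num_segments: int) -> int:
    """Hash distribution: the same key always maps to the same segment."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute_hash(keys: Iterable[str], num_segments: int) -> Counter:
    """Row count per segment under hash distribution."""
    return Counter(hash_segment(k, num_segments) for k in keys)


def distribute_random(keys: Iterable[str], num_segments: int, seed: int = 0) -> Counter:
    """Row count per segment under random distribution (no hashing of the key)."""
    rng = random.Random(seed)
    return Counter(rng.randrange(num_segments) for _ in keys)


def skew(counts: Counter, num_segments: int) -> float:
    """Fullest segment relative to a perfectly even share (1.0 means even)."""
    total = sum(counts.values())
    return max(counts.values()) / (total / num_segments)
```

A distribution key with a single distinct value sends every row to one segment, so with 8 segments the skew is 8.0 and that segment does all the work. Random distribution avoids this, at the cost of re-hashing rows at query time for joins and aggregates.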


Loading Varying Storage Options

•  Loading columnar tables takes approximately 5-10x longer than loading the same row-based table
•  Loading compressed row (or columnar) tables introduces only slight overhead, 20% or less, on small tables/loads
   –  On larger tables it is actually 5-10% faster, because less data (fewer blocks) is written to HDFS
•  zlib compression reduced the storage footprint by 50% on very high cardinality data


Compressed Row Based vs Non-Compressed Row Based

•  Sequential scan operations (for example, select count(x)) take 2-6x longer on compressed tables, depending on table size
   –  As the table size increased, the difference in query time shrank
•  On more complex queries with aggregates and sort operations, the difference in query time is almost unnoticeable


Querying Row vs Columnar Based Tables

•  A sequential scan selecting a few columns takes only marginally less time on a small columnar table than on the same row-based table
   –  As the table size increases there is no perceptible performance difference
•  Wide queries and joins that read all columns of a columnar table show no significant difference in query time compared with the same row-based table
•  For complex queries (sorts, aggregates, joins) that involve only a subset of a table's columns, the difference between columnar and row-based is negligible
   –  The majority of time for these queries is spent on the sort/aggregation operations and not the HDFS read


Partition Row Based vs Non-Partition Row Based

•  Load times on small partitioned tables are 5x slower than on non-partitioned tables
•  Load times on large partitioned tables were 2-3x slower than on non-partitioned tables
•  Sequential scans take 130-200% longer on partitioned tables
•  For complex queries (aggregates, joins, sorts) without a WHERE clause eliminating partitions, query time on larger tables was actually faster with partitioned tables
   –  This may be due to the increased parallelism achieved with partitioned tables


HAWQ Partitioning

•  There can be a large number of partitions depending on the partition granularity
   –  Every partition is a file in HDFS
   –  May result in many small files, which is not desired
•  In general, use partitioning on very large tables but use a higher partition granularity so there are fewer, larger files
•  Do not use partitioning if load performance is critical


HAWQ Columnar Storage

•  Do not use columnar orientation on very wide tables
   –  Every column is a file in HDFS
•  If partition granularity requirement is low, use row-based table orientation
•  Optimally you want bigger files and fewer NameNode calls for scanning the same amount of data
•  NEVER use columnar tables with partitioning!
   –  Very different from GPDB


Running Queries in HAWQ


SQL Querying

•  Uses the pipelined method of execution developed for Greenplum Database
   –  Efficient parallel execution
   –  No MapReduce used behind the scenes
   –  No intermediate materialization of data
•  The only difference in operator-level execution compared to Greenplum Database is the scan node
   –  The scan node is the operator that reads data from HDFS
      ▪  Versus reading data from the file system in Greenplum Database


SQL Querying Caveats

•  SQL query support similar to Greenplum Database
   –  Support for advanced SQL such as OLAP and analytical functions (e.g. MADlib)
•  No updates
•  No deletes
•  No support for indexes
•  No GPText


Query Example

•  SQL submitted to HAWQ master
   –  Validates SQL and parses query
   –  Query Optimizer produces the plan
   –  HAWQ master obtains metadata from the NameNode and annotates the query plan with the metadata that segments need for execution
•  HAWQ Master dispatches the plan to every segment
•  Segments call back to the NameNode to obtain a block location array consisting of block IDs
•  The libhdfs3 read operation begins, retrieving data from whichever DataNodes in the cluster it needs and returning data to upper level operators
•  Upper level operators (e.g. hash-join, hash-agg) carry on the execution using motion operators as needed


Query Using PXF External Tables

•  Data can be queried from external data sources and joined with HAWQ data using external table methodology
•  Regular external tables can be used for data residing outside of the Hadoop ecosystem
•  For data residing in the Hadoop ecosystem, PXF external tables can be used
   –  Read HDFS, HBase, Hive and other formats using standard SQL


3rd Party Application Querying

•  JDBC interface for HAWQ
   –  Used for queries but should not be used for inserts
•  JDBC DDL operations (CREATE TABLE, TRUNCATE TABLE) fall into a transaction block
   –  If you create a table in a transaction block and then roll back the transaction, you will never see that table
•  Cannot perform updates, deletes, or create indexes


Loading and Unloading Data in HAWQ


Loading Data into HAWQ

•  When the data sources are outside the Hadoop ecosystem
   –  Use regular gpfdist external tables
   –  Use COPY command for loading small data sets only
•  When the data sources are in the Hadoop ecosystem
   –  Use PXF external tables


HAWQ Data Loading Options

[Diagram: Clients (JDBC, SQL Console) connect to the HAWQ Master Host (Query Parser, Query Optimizer), shown with the HDFS NameNode; the Interconnect links the master to HAWQ Segment Hosts, each with a Query Executor on an HDFS DataNode; External Data Sources feed the load]

Loading paths shown in the diagram:
   –  insert into <hawq-target-table> select * from <regular external table>;
   –  insert into <hawq-target-table> select * from <pxf external table>;
   –  COPY command


HAWQ Writes and Performance

•  The fastest method to write data in HAWQ is gpfdist
•  In testing, the gpfdist write process capped at 1GB/sec (with 1 gpfdist server and 64 segment readers)
   –  This speed increases linearly with added gpfdist servers
•  In testing, hadoop fs -put capped at about 130MB/sec
•  PXF external table copy to a HAWQ table capped at 600MB/sec (for 64 segments)
•  In testing, gpfdist external table copy is approximately 160% faster than PXF external tables


Write Paths

•  Using gpfdist, HAWQ segments read chunks of data from the gpfdist servers in parallel, then hash on the distribution key and send the data to the correct segment server to be written to HDFS locally by the DataNode
•  Using PXF external tables, HAWQ segments request chunks of data from the PXF fragmenter; PXF reads data via a set of PXF accessors and returns the data to the segment; the segment then hashes on the distribution key and sends the data to the correct segment (likely not on the same DataNode) for write to HDFS by the DataNode
   –  This path has the highest number of NameNode RPC calls, since both the PXF fragmenter and the segments call the NameNode for block locations


Optimizing gpfdist for Performance

•  In general, maximize parallelism as the number of segments increases
•  Spread the data evenly across as many nodes as possible
•  Spread the data evenly across as many file systems as possible
–  Run two gpfdist processes per file system
•  Run gpfdist on as many interfaces (NICs) as possible
•  Keep the work even across ALL of these resources
–  In an MPP shared-nothing environment, loading is only as fast as the slowest node
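
The last point can be made concrete with a little arithmetic: if every node processes its share at the same rate, the load finishes when the most heavily loaded node finishes. The sketch below is a simplified model under that assumption; the function names and the uniform per-node rate are illustrative, not taken from the course material.

```python
def load_seconds(mb_per_node, mb_per_sec_per_node: float) -> float:
    """Total load time is bounded by the node that finishes last."""
    if not mb_per_node:
        raise ValueError("at least one node is required")
    if mb_per_sec_per_node <= 0:
        raise ValueError("rate must be positive")
    return max(mb / mb_per_sec_per_node for mb in mb_per_node)


def skew_penalty(mb_per_node, mb_per_sec_per_node: float) -> float:
    """Ratio of the actual load time to the time an even spread would take."""
    even_share = sum(mb_per_node) / len(mb_per_node)
    return load_seconds(mb_per_node, mb_per_sec_per_node) / (even_share / mb_per_sec_per_node)


if __name__ == "__main__":
    even = [250, 250, 250, 250]
    skewed = [700, 100, 100, 100]
    print(load_seconds(even, 100.0), load_seconds(skewed, 100.0))
    print(skew_penalty(skewed, 100.0))
```

The same 1000 MB takes almost three times as long in the skewed case, even though three of the four nodes sit idle for most of the load.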


gp_external_max_segs Optimization

•  Controls the maximum number of segments each gpfdist serves
•  Keep gp_external_max_segs an exact multiple of the number of gpfdist processes
–  gp_external_max_segs / # of gpfdist processes should have a remainder of 0
•  Default is 64
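
The remainder rule is easy to check before starting a load. The sketch below lists the gpfdist process counts that divide a given gp_external_max_segs value evenly; the helper names are hypothetical, and the only value taken from the slide is the default of 64.

```python
def balanced_gpfdist_counts(max_segs: int = 64):
    """Return every gpfdist process count that divides max_segs with remainder 0."""
    if max_segs < 1:
        raise ValueError("max_segs must be >= 1")
    return [n for n in range(1, max_segs + 1) if max_segs % n == 0]


def segments_per_gpfdist(max_segs: int, gpfdist_processes: int) -> int:
    """Segments attached to each gpfdist process when the split is even."""
    if gpfdist_processes < 1:
        raise ValueError("gpfdist_processes must be >= 1")
    if max_segs % gpfdist_processes != 0:
        raise ValueError(
            f"{max_segs} segments cannot be split evenly across "
            f"{gpfdist_processes} gpfdist processes"
        )
    return max_segs // gpfdist_processes


if __name__ == "__main__":
    print(balanced_gpfdist_counts(64))
    print(segments_per_gpfdist(64, 4))
```

With the default of 64, six gpfdist processes would leave an uneven split, so either change the process count to 4 or 8 or adjust gp_external_max_segs.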


Error Handling

•  Single Row Error Handling
–  Supported in external tables and the COPY command
–  Define a table to catch the ‘unloadable’ rows
–  Load continues; it does not fail
•  Reject Limit
–  Caps the number of rejects
–  Once the limit is met, the load statement fails
–  Limit can be an actual number or a percent
–  Rejects are evaluated at the segment
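
The interaction between single row error handling and the reject limit can be sketched as a small simulation of one segment's work. This is an illustrative model only: it covers the absolute row-count form of the limit (not the percent form), and the exception class, function names, and the two-field row format are hypothetical.

```python
class RejectLimitReached(Exception):
    """Raised when a segment's rejected-row count reaches the limit."""


def load_segment_rows(raw_rows, parse_row, reject_limit: int):
    """Load rows on one segment, diverting bad rows to an error list.

    Returns (loaded, errors). Raises RejectLimitReached once the number of
    rejected rows on this segment reaches reject_limit.
    """
    loaded, errors = [], []
    for line_no, raw in enumerate(raw_rows, start=1):
        try:
            loaded.append(parse_row(raw))
        except ValueError as exc:
            # Single row error handling: record the bad row and keep going.
            errors.append((line_no, raw, str(exc)))
            if len(errors) >= reject_limit:
                raise RejectLimitReached(
                    f"{len(errors)} rejected rows reached the limit of {reject_limit}"
                )
    return loaded, errors


def parse_sale(raw: str):
    """Hypothetical row format: '<id>,<total>' with integer fields."""
    sale_id, total = raw.split(",")
    return int(sale_id), int(total)


if __name__ == "__main__":
    good, bad = load_segment_rows(["1,10", "x,5", "2,20"], parse_sale, reject_limit=2)
    print(good, bad)
```

Because the count is kept per segment, a load spread over many segments can tolerate more bad rows in total than the limit suggests.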


Loading Recommendations

•  Default recommendation is to bulk load through gpfdist external tables
–  Suitable from an HDFS perspective
•  To load a smaller amount of data (example: <100,000 rows)
–  The COPY command can be used
•  Single row inserts are not recommended
–  Not suitable from an HDFS perspective


Unloading Data

•  Regular writable external tables can be used for scalable unload
–  Same as in GPDB
•  The COPY command can be used for unloading small data sets
•  Example for unloading to HDFS

DROP EXTERNAL TABLE IF EXISTS foo_dump;
-- Each segment pipes its rows to "hadoop fs -put", writing one file per segment
CREATE WRITABLE EXTERNAL WEB TABLE foo_dump ( LIKE foo )
EXECUTE 'hadoop fs -put - hdfs://pivhdsne:8020/dump/foo/${GP_SEGMENT_ID}.tsv'
FORMAT 'TEXT' (DELIMITER E'\t');
-- Run the unload
INSERT INTO foo_dump SELECT * FROM foo;


Lab


HBASE_HAWQ_LOAD Lab

Loading Data into HBase and HAWQ

•  Load dimension tables into HBase using importtsv
–  Data is in HDFS
•  Load data into HAWQ tables using COPY
–  Data is in DAS
•  Load a HAWQ AO table using SELECT from one of the PXF external tables defined


PXF External Tables


PXF is...

A fast, extensible framework connecting HAWQ to a data store of choice that exposes a parallel API


HAWQ External Tables

•  gpfdist
–  Remote delimited text (or csv) files
•  file
–  Text files on segment filesystem
•  execute
–  Script execution and produced data
•  pxf
–  Text and binary data from available pxf connectors


PXF

•  Load data into HAWQ from Hadoop
•  Query Hadoop data without materializing it into HAWQ
–  HDFS: delimited text, csv, Sequence, Avro
–  HBase (w/filter pushdown)
–  Hive (w/partition exclusion)
▪  Text, Sequence and RCFile formats
•  Write HAWQ data to HDFS
–  Delimited text, csv, Sequence
–  Various compression codecs and options
•  Extensible!
–  GemFireXD, JSON format, Cassandra, Accumulo, others


PXF Features

•  Supports filtering through predicate push down in HBase
–  <, >, <=, >=, =, != between a column and a constant
–  These comparisons can be combined with AND (but not OR)
•  Supports Hive table partitioning
•  Ability to analyze data stored on HDFS using a data processing system
–  HAWQ optimizer uses the statistics to generate optimal plans on PXF external tables
•  Extensible framework with a Java API to enable custom development for other data sources and custom formats
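
The HBase push-down rules listed on this slide (six comparison operators, column versus constant, AND but not OR) can be expressed as a small eligibility check. The sketch below is an illustration of those rules only, not PXF's actual implementation: the tuple-based filter representation, the Column class, and the all-or-nothing result are assumptions made for the example.

```python
from dataclasses import dataclass

PUSHABLE_OPERATORS = {"<", ">", "<=", ">=", "=", "!="}


@dataclass(frozen=True)
class Column:
    """Marks an operand as a column reference rather than a constant."""
    name: str


def is_pushable(expr) -> bool:
    """Return True if expr uses only column-vs-constant comparisons joined by AND.

    expr is either ("AND" | "OR", left_expr, right_expr) or
    (operator, Column, constant).
    """
    head, left, right = expr
    if head == "AND":
        return is_pushable(left) and is_pushable(right)
    if head == "OR":
        return False
    # Leaf: a supported operator between a column and a constant.
    return (
        head in PUSHABLE_OPERATORS
        and isinstance(left, Column)
        and not isinstance(right, Column)
    )


if __name__ == "__main__":
    both = ("AND", (">", Column("cf1:saleid"), 100), ("!=", Column("cf8:comments"), "n/a"))
    either = ("OR", (">", Column("cf1:saleid"), 100), ("<", Column("cf1:saleid"), 10))
    print(is_pushable(both), is_pushable(either))
```

A filter that fails this check is still answered correctly; it is simply evaluated in HAWQ after the rows have been read, rather than inside HBase.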


Key Use Cases

•  Using analytics and SQL query functionality from HAWQ on HDFS, HBase, or Hive data without materialization into HAWQ
•  Join dimension tables stored in HAWQ with HBase fact tables
•  Fast ingest/materialization of high-value processed data from HDFS, Hive, or HBase into HAWQ


PXF Differentiators

•  Utilizes HAWQ's fast parallel optimizer
•  Applies data locality optimizations to reduce resources and network traffic
•  Extensible framework
•  Customers and partners can configure support for any new data store, which then automatically supports fast, parallel data transfer
•  JSON format, Cassandra, Accumulo in beta
•  Supports ANALYZE for gathering HDFS file statistics and making them available to the query planner at run time


Feature Summary

★  HBase (w/filter pushdown)
★  Hive (w/partition exclusion, various storage file types)
★  HDFS Files: read (delimited text, csv, Sequence, Avro)
★  HDFS Files: write (delimited text, csv, Sequence, various compression codecs and options)
★  GemFireXD, JSON format, Cassandra, Accumulo (currently Beta)
★  Statistics collection
★  Automatic data locality optimizations
★  Extensibility!


PXF Components

•  Fragmenter
–  On the NameNode
–  Metadata of the data source (blocks and location) is passed back to the HAWQ Master by the Fragmenter
•  Accessor
–  Responsible for reading specific data fragments and passing them to the Resolver
•  Resolver
–  De-serializes the records and serializes them into a list of one-field objects
–  One-field objects are converted into GPDBWritable that can be read by HAWQ
•  Analyzer
–  Responsible for collecting statistics on external table data that can be used by the HAWQ optimizer
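
The division of labour between Fragmenter, Accessor, and Resolver can be pictured as a three-stage pipeline. The sketch below is a loose Python analogy over an in-memory list of delimited records. It is not the PXF Java API: every function name here is hypothetical, and the fragments are simple index ranges rather than HDFS blocks.

```python
def fragment(source, fragment_size: int):
    """Fragmenter analog: describe the source as (start, end) record ranges."""
    if fragment_size < 1:
        raise ValueError("fragment_size must be >= 1")
    return [
        (start, min(start + fragment_size, len(source)))
        for start in range(0, len(source), fragment_size)
    ]


def access(source, frag):
    """Accessor analog: read the raw records that belong to one fragment."""
    start, end = frag
    yield from source[start:end]


def resolve(raw_record: str, delimiter: str = ","):
    """Resolver analog: split one raw record into a list of single fields."""
    return raw_record.split(delimiter)


def read_all(source, fragment_size: int = 2):
    """Run the three stages in order and collect the resolved rows."""
    rows = []
    for frag in fragment(source, fragment_size):
        for raw in access(source, frag):
            rows.append(resolve(raw))
    return rows


if __name__ == "__main__":
    print(read_all(["1,a", "2,b", "3,c"]))
```

In the real framework the fragments are handed to different segments, so the Accessor and Resolver stages run in parallel rather than in a single loop.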


PXF Loading into HAWQ

•  To load data into HAWQ use a variation of
–  insert into <hawq-target-table> select * from <pxf-external-table>;
•  Data can be transformed in-flight before loading
•  Data from Hadoop can also be joined in-flight with HAWQ data while loading
•  The number of segments responsible for connecting to Pivotal HD for concurrent reading of data can be tuned
–  gp_external_max_segs GUC
–  Default 64


PXF Querying

•  PXF external tables can be queried directly without materialization into HAWQ
•  PXF data can be joined with HAWQ tables
•  The ability to analyze external tables helps the HAWQ optimizer choose optimal plans
•  HBase predicate push down
•  Hive partitioning


Profiles

•  Improved user experience
•  Informative error messages

Without a profile, each plugin is named explicitly:

LOCATION('pxf://<host:port>/sales?fragmenter=HiveFragmenter&accessor=HiveAccessor&resolver=HiveResolver')

With a profile, one name stands in for all three:

LOCATION('pxf://<host:port>/sales?profile=Hive')


profiles.xml

<profile>
    <name>HBase</name>
    <description>Used for connecting to an HBase data store engine</description>
    <plugins>
        <fragmenter>HBaseDataFragmenter</fragmenter>
        <accessor>HBaseAccessor</accessor>
        <resolver>HBaseResolver</resolver>
        <myidentifier>MyValue</myidentifier>
    </plugins>
</profile>
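
A profile is just a named bundle of plugin settings, so expanding one is a lookup followed by building a query string. The sketch below parses the profile definition shown above with Python's standard XML library and renders the equivalent explicit parameters. It illustrates the idea only: the function names are hypothetical and this is not how PXF itself loads profiles.xml.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

PROFILE_XML = """
<profile>
    <name>HBase</name>
    <description>Used for connecting to an HBase data store engine</description>
    <plugins>
        <fragmenter>HBaseDataFragmenter</fragmenter>
        <accessor>HBaseAccessor</accessor>
        <resolver>HBaseResolver</resolver>
        <myidentifier>MyValue</myidentifier>
    </plugins>
</profile>
"""


def parse_profile(xml_text: str):
    """Return (profile_name, {plugin_tag: value}) for one <profile> element."""
    root = ET.fromstring(xml_text.strip())
    name = root.findtext("name")
    plugins = {child.tag: child.text for child in root.find("plugins")}
    return name, plugins


def expand_profile(plugins) -> str:
    """Render the plugin settings as the explicit query-string form."""
    return urlencode(plugins)


if __name__ == "__main__":
    name, plugins = parse_profile(PROFILE_XML)
    print(f"profile={name} is shorthand for: {expand_profile(plugins)}")
```

Custom entries such as myidentifier travel with the profile, which is how a site-specific setting can be attached once instead of being repeated in every LOCATION clause.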


HDFS Files Example

Analyze all text files that exist inside the HDFS directory ‘sales/2012/01’

CREATE EXTERNAL TABLE jan_2012_sales (
    id int,
    total int,
    comments varchar
)
LOCATION('pxf://10.76.72.26:50070/sales/2012/01/items_*.csv?profile=HdfsTextSimple')
FORMAT 'TEXT' (delimiter ',');


HBase Table Example

Get data from an HBase table called ‘sales’. In this example we are only interested in the rowkey, the qualifier ‘saleid’ inside column family ‘cf1’, and the qualifier ‘comments’ inside column family ‘cf8’

CREATE EXTERNAL TABLE hbase_sales (
    recordkey bytea,
    "cf1:saleid" int,
    "cf8:comments" varchar
)
LOCATION('pxf://10.76.72.26:50070/sales?profile=HBase')
FORMAT 'custom' (formatter='gpxfwritable_import');

Direct mapping: each quoted column name is the HBase column family and qualifier it reads from


Writable PXF – Export to HDFS

•  gphdfs-like functionality but extensible
–  Supports text, csv, SequenceFile
–  Supports various Hadoop compression codecs

CREATE WRITABLE EXTERNAL TABLE ...
LOCATION('pxf://<host:port>/sales?profile=HdfsTextSimple&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
FORMAT 'text' (delimiter ',');

Alternatively, you can create a new profile "HdfsTextSimpleGZipped" that includes compression_codec:

LOCATION('pxf://<host:port>/sales?profile=HdfsTextSimpleGZipped')


Lab


PXF_PUSHDOWN Lab

PXF external tables predicate pushdown

•  Review the HAWQ DDL
•  Run the HBase query
–  customers_dim table
•  Using "show", check the value of the GUC pxf_enable_filter_pushdown
–  Toggle the value to "off"
•  Rerun the query


Lab


PXF_STATS Lab

PXF external tables statistics

•  SELECT relpages, reltuples FROM pg_class WHERE relname = 'table_name';
•  ANALYZE table_name;
•  SELECT relpages, reltuples FROM pg_class WHERE relname = 'table_name';


HAWQ, Hive, HBase Comparative Usage


Hive

•  Hive uses a SQL-based language called HiveQL, which is a subset of SQL and has some additional MapReduce-specific syntax
•  Hive interprets SQL into a series of native MapReduce jobs
–  Materializes data to disk
•  Hive can manage its own tables or use external tables
–  No inherent performance difference, just ease of management
•  Hive is typically used as the integration point for BI and ETL tools
•  Hive only has a rudimentary query optimizer


HBase

•  HBase is a scalable, sorted, columnar, key-value data store
–  Linear scalability
–  Keys are sorted and partitioned, so fetching by key is fast
▪  Can support range scans
–  Data is split into column families, which are stored as separate files underneath
▪  Allows for column pruning
–  Stores data as a “doubly-nested map” key-value
▪  Row key -> column family:label -> data value
▪  Allows for a very flexible schema, as the label is arbitrary
▪  Can support hierarchical data structures easily
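
The "doubly-nested map" and the sorted row keys can be modelled in a few lines. The sketch below is a toy in-memory stand-in, not an HBase client: the class name and methods are hypothetical, and it keeps everything in one process purely to show why key lookups, range scans, and column-family pruning are cheap with this layout.

```python
import bisect


class TinySortedStore:
    """In-memory stand-in for a sorted row-key -> {family:label -> value} map."""

    def __init__(self):
        self._keys = []  # row keys kept in sorted order
        self._rows = {}  # row key -> {"family:label": value}

    def put(self, row_key: str, column: str, value):
        """Store one cell; any label is accepted, so the schema stays flexible."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key: str, column: str):
        """Fetch one cell by row key and 'family:label'."""
        return self._rows[row_key][column]

    def scan(self, start_key: str, stop_key: str, family=None):
        """Yield (row_key, cells) for start_key <= row_key < stop_key.

        If family is given, only that family's cells are returned,
        which mimics column pruning.
        """
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            cells = self._rows[key]
            if family is not None:
                cells = {c: v for c, v in cells.items() if c.startswith(family + ":")}
            yield key, cells


if __name__ == "__main__":
    store = TinySortedStore()
    store.put("row2", "cf1:saleid", 2)
    store.put("row1", "cf1:saleid", 1)
    store.put("row1", "cf8:comments", "first sale")
    print(list(store.scan("row1", "row3", family="cf1")))
```

The column names reuse the cf1:saleid and cf8:comments examples from the HBase table slide earlier in the deck.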


HAWQ, HBase, and Hive Comparison

Item                    HAWQ             HBase                    Hive
Interface               ANSI SQL         Java API/Shell           HiveQL (SQL Subset)
Client Connection       JDBC/PXF         Java/REST API            JDBC (Limited)
Executes as MapReduce   Never            Yes                      Yes
SQL Completeness        Yes              No                       No
Nodes (Supported)       1,000+           1,000+                   1,000+
Restart SQL on failure  No               No                       Yes
Performance             High             Low                      Low
Rely on MapReduce       No               Yes                      Yes
Open Source             No               Yes                      Yes
DDL                     Yes              No                       Yes
ANSI Data Types         Yes              No                       Yes
Indexes                 No               Yes                      No
When to use             Adhoc Analytics  Flexible Schema/Updates  Batch


HAWQ, HBase, and Hive Comparison (continued)

Item                             HAWQ  HBase          Hive
User Defined Data Distributions  Yes   No             Yes (limited)
Advanced Partitioning            Yes   Yes (limited)  Yes (limited)
Robust SQL Optimizer             Yes   No             No
Store Data on HDFS               Yes   Yes            Yes
Has its own daemons              Yes   Yes            No
Relational Database              Yes   No             No
Manage own tables                Yes   Yes            Yes
UDFs                             Yes   No             No
MADlib                           Yes   No             No


Lab


HIVE_VS_HAWQ Lab

Query Performance Comparison

•  Run the Hive DDL to create Hive external tables against the existing HDFS data
•  Run queries against HAWQ and Hive versions of these tables


Securing PHD Clusters


Security…Has Many Faces

•  Data Protection
–  Data access control
–  Data at rest encryption
–  Masking/Tokenization for data load
–  Data-In-Motion encryption
•  User Management/Authentication/Authorization
•  GRC (Governance, Risk, Compliance) / System Security


Security Dashboard

Columns: Supports secure cluster | Supports Kerberos for Authentication | Supports LDAP for Authentication

HDFS           Yes | Yes | Linux OS supports
MapReduce/Pig  Yes | N/A
Hive           Yes (standalone mode) | N/A
Hiveserver     No | No
Hiveserver2    Yes | Yes | Yes
HBase          Yes | Yes | Yes
HAWQ*          Yes | Yes | Yes
GemfireXD      Yes | Yes | Yes


Hadoop Security

•  Pivotal HD follows the Hadoop community
–  Today limited to Kerberos
–  Intent is to have the ability to plug in mechanisms other than Kerberos, as well as open source single sign-on gateways
•  Requires a KDC to manage cluster authentication
•  Once a user is authenticated
–  Provides authorization by enforcing HDFS file permissions
–  Ensures jobs are run as the user in a Linux container
•  We support everything Cloudera does in terms of securing a cluster


Data Access Control

•  Kerberos for user authentication
•  Jobs will run in secure Linux containers
•  Allows HDFS file permissions to be enforced
–  Similar to Linux file permissions
•  Prevents service and user spoofing


Challenges

•  Important to understand security expectations and requirements
•  Many types of security are not addressed by Hadoop
–  Data at rest protection
•  Hadoop supports data-in-motion encryption, but the performance impact is high
–  On-wire encryption is not recommended, especially 3DES


3rd Party Data Protection Solutions

Encryption, Masking, Tokenization, Token Management

Company     Masking  Tokenization  Encryption
Gazzang     No       No            Yes
Protegrity  Yes      Yes           Yes
DataGuise   Yes      Yes           No


Thank you [email protected]