Tableau on Hadoop Meet Up: Advancing from Extracts to Live Connect

Topics

• Tableau Data Extract Vs. Live Connect

• Performance challenges with Live Connect

• Demo: Tableau on Hadoop - ~3B rows

• Jethro Technology Overview

• Typical Jethro Use-Cases

About Jethro


• Who am I?
– Eli Singer, CEO JethroData, NYC based
– Over 20 years experience in data tech

• What does Jethro do?
– BI on Big Data acceleration
– Reporting, dashboards, discovery, ad-hoc

• How does it work?
– Indexing and caching server
– Combines columnar SQL DB design with search-indexing technology into one product

Tableau Data Extract

• How it works
– Extract selective / aggregated data from any source
– Convert into a proprietary TDE format
  • Columnar, compressed, highly optimized for Tableau
– Loaded into Tableau desktop / server memory for interactive analysis

[Diagram labels: BI & Big Data, EDW, Extracted Data]

• Why you want to use it
– Speed: once data is loaded, interaction is very fast
– Stability: not affected by changes or activity at the datasource

• Limitations and challenges
– Size: large extracts (many rows, columns, high cardinality) are impractical
– Freshness: lag time between data's availability at the source and TDE readiness
– Complexity: managing and refreshing many TDEs creates an operational burden

Tableau Live Connect

• How it works
– Data stays at the source
– Every user interaction results in Tableau sending live queries to the datasource DB
– DB filters, aggregates and sends results back to Tableau

[Diagram labels: BI & Big Data, EDW, Live Queries]

• Why you want to use it
– Size: enable users to interact with datasets of any size, at any needed granularity
– Freshness: enable near-real-time analytics on data within minutes of its arrival
– Simplicity: no need to manage a complex system of TDE maintenance

• Limitations and challenges
– Performance: datasource DBs can be significantly slower than Tableau's in-memory engine

Big Data Platforms: Hadoop vs. EDW Appliances

• Analytics: ETL, Predictive, Reporting, BI
• SQL
• 10x-100x data
• 1/10 HW $ cost
• Open platform

SQL-on-Hadoop Performance Challenges

SQL-on-Hadoop

• ETL
• Predictive
• Reporting

Too SLOW on Hadoop

The Hadoop Trade-Off: Scale & Cost Vs. Performance


SQL-on-Hadoop: MPP/Full-Scan Architecture

A Library Analogy: Billions of books, thousands of racks

Query: List books by author "Stephen King"

Process: Every librarian pulls books from their rack one by one and checks the author

• Hive
• Impala
• Presto
• SparkSQL
• Drill
• HAWQ/HDB
• IBM/Big SQL
• Actian
• Tajo
• …

Unsuitable for BI

SQL-on-Hadoop: Index-Access Architecture

Query: List books by author "Stephen King"

Process: Access the Author index, find the entry for "Stephen King", get its list of books, fetch only these books

Result: Fast, minimal resources, scalable

Optimal for BI
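The two sides of the library analogy can be compared in a few lines of Python. This is only an illustrative sketch: the catalog, the `full_scan` and `index_access` helpers, and the "books examined" counters are all made up for the example and say nothing about how any real engine is implemented.

```python
from collections import defaultdict

# Hypothetical toy catalog: (title, author) pairs standing in for books on racks.
books = [
    ("Carrie", "Stephen King"),
    ("Emma", "Jane Austen"),
    ("It", "Stephen King"),
    ("Dune", "Frank Herbert"),
    ("Persuasion", "Jane Austen"),
]

def full_scan(author):
    """Check every book, as each librarian does in the full-scan analogy."""
    examined = 0
    titles = []
    for title, book_author in books:
        examined += 1
        if book_author == author:
            titles.append(title)
    return titles, examined

# Build the author index once: author -> positions of that author's books.
author_index = defaultdict(list)
for position, (_, book_author) in enumerate(books):
    author_index[book_author].append(position)

def index_access(author):
    """Look up the author entry, then fetch only the listed books."""
    positions = author_index.get(author, [])
    titles = [books[p][0] for p in positions]
    return titles, len(positions)

scan_titles, scan_examined = full_scan("Stephen King")
index_titles, index_examined = index_access("Stephen King")
print(scan_titles, scan_examined)    # ['Carrie', 'It'] 5
print(index_titles, index_examined)  # ['Carrie', 'It'] 2
```

Both paths return the same titles; the difference is that the scan touches every book while the index lookup touches only the matches.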

What Is Jethro for Tableau?

An indexing & caching server

• Tableau uses Live Connect
– Sends SQL queries via ODBC

• Jethro key performance features
1. Full indexing – every column is indexed
2. Result cache – every query is cached
3. Auto Cubes – every repeatable pattern

• Everything stored in Hadoop
– Cache, aggregations, index & column files, …

• Incrementally updated
– Every day / hour / min

[Diagram labels: SQL, I/O, Cloud Storage]

LIVE Demo: Tableau on Hadoop

• Point browser at: http://tableau.jethrodata.com/
– Login: demo / demo
• Choose workbook: Jethro
• Dashboard interaction: drill-down using any filter combination
• Data
– Based on TPC-DS benchmark
– 1TB raw data
– Fact table: ~2.9B rows
– Dimensions: 7

Hardware (Jethro)
– Data format: Jethro indexes
– Storage: EFS, HDFS
– Compute cluster: 2x r3.4xlarge (spot)
– Total RAM, CPU: 240GB, 32 cores
– AWS $ per hr.: $0.75

Performance Benchmark Results

[Chart: Dashboard Refresh Time, comparing Jethro (w/cache), Jethro and Impala. Scenarios: Main page; 1 filter (St=MN); 1 filter (Yr=2002); 2 filters (2002, MN); 3 filters (2002, Women, Good); 4 filters (+ store=Woodland bar); 5 filters (+ swimwear); 6 filters (+ State=Indiana). Vertical axis runs from 0 to 140.]

Jethro Avg: 6s

SQL-on-Hadoop Avg: 1m32s

Across-the-Board Consistently Fast Queries

Filter by Product Category

[Chart: SUM([Net Profit])*-1 for each Store State. The data is filtered on Item Category, which keeps Children, Men and Shoes.]

- Medium filtering, repeatability
- Benefits from auto micro-cubes
- Auto generated, small size

Filter by Product Category, Customer marital status, date, state

[Chart: SUM([Net Profit])*-1 for each Store State. The data is filtered on Item Category, Customer Marital Status and Sale Date Year. The Item Category filter keeps Electronics. The Customer Marital Status filter keeps M. The Sale Date Year filter keeps 2000. The view is filtered on Store State, which keeps 7 of 22 members (LA, MO, NY, OH, PA, SD, WV).]

- High filtering, low repeatability
- Benefits from indexes
- Direct pointer to needed rows

No Filter

[Chart: Profit by State. SUM([Net Profit])*-1 for each Store State.]

- Low filtering, high repeatability
- Benefits from query-result reuse
- Every query result is cached

Index-Access: How it Works

SELECT date, SUM(sales) FROM T1 WHERE product='Books' GROUP BY date

Jethro Server
1. Index access
2. Read data only for required rows

Performance and resources based on the size of the working-set

[Diagram: Jethro Server and five Data Nodes]
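As a rough illustration of the two steps (index access, then reading data only for the required rows), the sketch below answers the slide's query against a tiny in-memory column store. The data, the `build_index` helper and the `sales_by_date` function are hypothetical; Jethro's actual index and file formats are not described in the slides.

```python
from collections import defaultdict

# Hypothetical rows of table T1, stored column by column.
columns = {
    "date":    ["2016-01-01", "2016-01-01", "2016-01-02", "2016-01-02", "2016-01-02"],
    "product": ["Books", "Toys", "Books", "Books", "Games"],
    "sales":   [4.5, 10.0, 2.0, 3.5, 8.0],
}

def build_index(values):
    """Map each distinct value to the list of row ids that hold it."""
    index = defaultdict(list)
    for row_id, value in enumerate(values):
        index[value].append(row_id)
    return index

product_index = build_index(columns["product"])

def sales_by_date(product):
    """SELECT date, SUM(sales) FROM T1 WHERE product=? GROUP BY date."""
    # Step 1: index access yields only the matching row ids.
    row_ids = product_index.get(product, [])
    # Step 2: read the date and sales columns for those rows only.
    totals = defaultdict(float)
    for row_id in row_ids:
        totals[columns["date"][row_id]] += columns["sales"][row_id]
    return dict(totals), len(row_ids)

result, rows_read = sales_by_date("Books")
print(result)     # {'2016-01-01': 4.5, '2016-01-02': 5.5}
print(rows_read)  # 3
```

The work done depends on the three matching rows (the working set), not on the total size of the table.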

Query Result Cache: How it Works

sales transactions (1B rows)

date         cust, prod, …   $sale
2015-12-08   …               $2.00
…            …               …
2016-01-01   …               $4.50
…            …               …
2016-09-30   …               $12.50

Customer query: select sum(sales) from transactions where year=2016

Process: use index to find all rows for 2016. Sum $sale for selected rows

Response: $1,643

Jethro saves the actual query result in shared Hadoop storage:

select sum(sales) … where year=2016: $1,643

Repeated exact query served from result cache

Response: $1,643
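The idea of serving a repeated exact query from a stored result can be sketched as a dictionary keyed by the query text. Everything here is an assumption made for illustration: the three sample rows, the in-process `result_cache` (the slide describes results kept in shared Hadoop storage), and the `run_query` helper.

```python
# Hypothetical sales rows: (year, sale amount).
transactions = [(2015, 2.00), (2016, 4.50), (2016, 12.50)]

result_cache = {}  # exact query text -> stored result
executions = 0     # counts how often the underlying rows are actually read

def sum_sales_for_year(year):
    """Stand-in for really executing the query against the data."""
    global executions
    executions += 1
    return sum(amount for y, amount in transactions if y == year)

def run_query(sql, compute):
    """Serve a repeated exact query from the cache; otherwise compute and store."""
    if sql not in result_cache:
        result_cache[sql] = compute()
    return result_cache[sql]

query = "select sum(sales) from transactions where year=2016"
first = run_query(query, lambda: sum_sales_for_year(2016))
second = run_query(query, lambda: sum_sales_for_year(2016))
print(first, second, executions)  # 17.0 17.0 1
```

The second call returns the same answer without touching the rows again, which is why the execution counter stays at 1.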

Incremental update

date         cust, prod, …   $sale
2015-12-08   …               $2.00
…            …               …
2016-01-01   …               $4.50
…            …               …
2016-09-30   …               $12.50
2016-10-01   …               $7.00

Process:
1. Repeated exact query
2. Identify that new data was added
3. Run query on new data
   • Result: $7
4. Merge with stored results
   • New Result: $1,650

Response: $1,650
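The four-step process can be sketched as follows, reusing the slide's numbers ($1,643 stored, one new $7.00 row). The `rows_covered` watermark and the `refresh` function are hypothetical devices for the example, and the short row list stands in for the full table, whose elided rows are assumed to be already reflected in the stored result.

```python
# Stored state for one cached query, using the slide's example figure:
# "select sum(sales) … where year=2016" returned $1,643.
# rows_covered is a hypothetical watermark: how many rows of the toy list
# below were already included when the result was stored.
cached = {"result": 1643.00, "rows_covered": 3}

# Toy stand-in for the table as (date, sale); the last row arrived after caching.
transactions = [
    ("2015-12-08", 2.00),
    ("2016-01-01", 4.50),
    ("2016-09-30", 12.50),
    ("2016-10-01", 7.00),
]

def refresh(cached, transactions, year):
    """Run the query only on rows added since the result was stored, then merge."""
    new_rows = transactions[cached["rows_covered"]:]
    delta = sum(sale for date, sale in new_rows if date.startswith(str(year)))
    merged = {"result": cached["result"] + delta, "rows_covered": len(transactions)}
    return merged, delta

updated, delta = refresh(cached, transactions, 2016)
print(delta)              # 7.0
print(updated["result"])  # 1650.0
```

Merging by simple addition works here because SUM is additive; other aggregates would need their own merge rule.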

Auto-Micro-Cubes: How it Works

sales transactions (1B rows)

state   cust, prod, …   $sale
AL      …               $2.00
AK      …               $4.50
AZ      …               $1.00
…       …               …
WY      …               $4.25

Customer query: select sum(sales) … where state='AZ'

Process: use index to find all rows for 'AZ'. Sum $sale for selected rows

Response: $1,643

Jethro auto-generated query (move filter col into group by): select sum(sales) … group by state

sales-by-state (50 rows)

State   $sale
AL      $256
AZ      $1,643
…       …
WY      $4,654

Subsequent queries served from micro-cube:
– where state='AK'
– where state in ('CA', 'NY')
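Moving the filter column into the GROUP BY can be illustrated with a small aggregate table that then answers later filters without going back to the transactions. The sample rows and the `build_cube` and `sum_sales` helpers are invented for this sketch and are not Jethro's implementation.

```python
from collections import defaultdict

# Hypothetical sales rows: (state, sale amount).
transactions = [("AL", 2.00), ("AK", 4.50), ("AZ", 1.00), ("AZ", 3.00), ("WY", 4.25)]

def build_cube(rows):
    """select sum(sales) … group by state: one small row per state."""
    cube = defaultdict(float)
    for state, sale in rows:
        cube[state] += sale
    return dict(cube)

cube = build_cube(transactions)

def sum_sales(states):
    """Answer where state=… or where state in (…) from the cube alone."""
    return sum(cube.get(state, 0.0) for state in states)

print(sum_sales(["AZ"]))        # 4.0
print(sum_sales(["AK", "WY"]))  # 8.75
```

Once the per-state totals exist, any equality or IN filter on state reads at most one cube row per state instead of the underlying transactions.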

How are Jethro auto micro-cubes different?

• Auto generated
• Limited in size
• Incrementally updated
• Supports complex functionality: CASE, WHEN, functions
• Supports DISTINCT

Complementary to indexing

Avoid large and inefficient cubes by using indexing for high-cardinality columns and multiple filters

Built for Scale: Concurrency Features

• Jethro servers are stateless
– Can be added / dropped on the fly to support any user volume
– All data is stored centrally in Hadoop

• Automated load balancing
– ODBC / JDBC clients use round-robin mechanism to access all active servers

• Query results and Cubes are shared
– All servers and users have immediate access

• Minimal dependency on cluster performance
– All compute is done on Jethro nodes
– Cluster only accessed for selective I/O

• Concurrent query optimizations
– Shared WHERE across active queries
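Round-robin access over a changing set of stateless servers can be sketched as below. This is a generic illustration, not the logic of the actual ODBC / JDBC drivers; the `RoundRobinPool` class and the `jethro-1` style host names are hypothetical.

```python
class RoundRobinPool:
    """Hand out servers in rotation; servers can be added while in use."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._position = 0

    def pick(self):
        # Rotate through whatever servers are active at this moment.
        server = self.servers[self._position % len(self.servers)]
        self._position += 1
        return server

    def add(self, server):
        # Stateless servers need no warm-up or data copy before taking queries.
        self.servers.append(server)

pool = RoundRobinPool(["jethro-1", "jethro-2"])
first_three = [pool.pick() for _ in range(3)]
pool.add("jethro-3")
next_three = [pool.pick() for _ in range(3)]
print(first_three)  # ['jethro-1', 'jethro-2', 'jethro-1']
print(next_three)   # ['jethro-1', 'jethro-2', 'jethro-3']
```

Because no server holds state of its own, a newly added node can join the rotation immediately.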

System Diagram

[Diagram: SQL clients (BI tool, custom viz) connect via ODBC / JDBC to Jethro servers (edge node or VM). The Jethro servers perform I/O against datasets in cloud storage, which hold column data, column indexes, result cache and cubes.]

Typical Use Case

• Who: Several dozen implementations
– Financial, Retail, Telco, automotive
– Marketing, Internet, Tech

• Application Types:
– BI Dashboards (50%), reports
– Exploration, ad-hoc

• Common BI Tools:
– Tableau (50%), Qlik
– SAP/Biz Objects, In House / Customized

• Dataset sizes & complexity:
– Average: ~5B row tables
– Largest: >100B rows

• Ingestion Patterns:
– Average: daily, ~50M rows
– Largest: every 15min, >1B rows / day

• Performance & Concurrency
– Speed: under 10 sec for dashboard
– Users: ~50, simulated tests to 1,000
– Avg deployment: 4 Jethro nodes

• Hadoop Distributions Supported:
– HDP, CDH, Apache, MapR, EMR

• Use with other technologies:
– Complements: Hive, Impala, Presto, Drill
– Replaces: Netezza, Vertica, TD, Redshift

Top Reasons to Use Jethro

1. Consistently fast queries
– Speed up any type of BI query
– Combining indexing, caching and cube technologies

2. Data model flexibility
– Focus on application needs, not query tool limitations
– Avoid: de-normalization, pre-defined cubes, complex aggregations, forced sorting / partitioning
– Full star-schema support

3. Operational simplicity
– All data stored in shared Hadoop cluster
– Incrementally updated
– Self-maintained
– Built for scale
– Wide BI & Hadoop compatibility

4. Broad use-case range
– Any BI application: dashboards, exploration, ad-hoc, reporting
– Internal, external facing
– Small / large datasets, few / many users


Simple Indexing Process

[Diagram: any data source → Jethro Loader → Jethro Server → indexed BI dataset]

Any data source:
• Hadoop
• EDW
• NoSQL
• Text
• S3
• …

Jethro Loader: $ hive -e "select * from…"

1. Historical (one-time)
2. On-going (incremental)

Jethro should be used selectively: only with BI-relevant datasets

• Fast: 0.5B rows / hr
• Compressed: <40% of original size (text)
• Near real-time: load new data up to every min

SQL on Hadoop – Complementary Approaches

SQL-on-Hadoop Solutions
Full-Scan: Read all rows
• Hive / Tez
• Impala
• Presto
• SparkSQL
• Drill
• HAWQ
• IBM/Big SQL
• Actian
• Tajo
• …

JethroData
Index-Access: Read ONLY needed rows
• JethroData

Comparison:
– Full-Scan: Optimal for predictive & reporting
– Index-Access: Optimal for interactive BI
