Tableau on Hadoop Meet Up: Advancing from Extracts to Live Connect
Uploaded by: remy-rosenbaum
Category: Technology
Topics
• Tableau Data Extract Vs. Live Connect
• Performance challenges with Live Connect
• Demo: Tableau on Hadoop - ~3B rows
• Jethro Technology Overview
• Typical Jethro Use-Cases
About Jethro
• Who am I?
– Eli Singer, CEO of JethroData, NYC based
– Over 20 years of experience in data tech
• What does Jethro do?
– BI on Big Data acceleration
– Reporting, dashboards, discovery, ad-hoc
• How does it work?
– Indexing and caching server
– Combines columnar SQL DB design with search-indexing technology in one product
Tableau Data Extract
[Diagram labels: BI & Big Data, EDW, Extracted Data]
• How it works
– Extract selective / aggregated data from any source
– Convert into the proprietary TDE format: columnar, compressed, highly optimized for Tableau
– Loaded into Tableau desktop / server memory for interactive analysis
• Why you want to use it
– Speed: once data is loaded, interaction is very fast
– Stability: not affected by changes or activity at the datasource
• Limitations and challenges
– Size: large extracts (many rows, columns, high cardinality) are impractical
– Freshness: lag time between the data's availability at the source and TDE readiness
– Complexity: managing and refreshing many TDEs creates an operational burden
Tableau Live Connect
[Diagram labels: BI & Big Data, EDW, Live Queries]
• How it works
– Data stays at the source
– Every user interaction results in Tableau sending live queries to the datasource DB
– The DB filters, aggregates and sends results back to Tableau
• Why you want to use it
– Size: enables users to interact with datasets of any size, at any needed granularity
– Freshness: enables near-real-time analytics on data within minutes of its arrival
– Simplicity: no need to manage a complex system of TDE maintenance
• Limitations and challenges
– Performance: datasource DBs can be significantly slower than Tableau's in-memory engine
Big Data Platforms: Hadoop vs. EDW Appliances
• Analytics: ETL, Predictive, Reporting, BI
• 10x-100x data
• 1/10 HW cost
• Open platform
SQL-on-Hadoop Performance Challenges
The Hadoop Trade-Off: Scale & Cost vs. Performance
• SQL-on-Hadoop: ETL, Predictive, Reporting
• Too SLOW on Hadoop
SQL-on-Hadoop: MPP/Full-Scan Architecture
A library analogy: billions of books, thousands of racks
• Query: List books by author “Stephen King”
• Process: Every librarian pulls out the books on their rack one by one and checks the author
• Engines: Hive, Impala, Presto, SparkSQL, Drill, HAWQ/HDB, IBM/Big SQL, Actian, Tajo, …
• Unsuitable for BI
SQL-on-Hadoop: Index-Access Architecture
• Query: List books by author “Stephen King”
• Process: Access the author index, find the entry for “Stephen King”, get its list of books, fetch only those books
• Result: Fast, minimal resources, scalable
• Optimal for BI
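The two library strategies above can be compared in a few lines of Python. This is a minimal sketch, not Jethro's implementation: the `books` list and the functions `full_scan`, `build_author_index` and `index_access` are hypothetical names, and a plain dictionary stands in for the author index.

```python
from collections import defaultdict

# Hypothetical toy "library": (title, author) pairs standing in for billions of books.
books = [
    ("It", "Stephen King"),
    ("Emma", "Jane Austen"),
    ("Misery", "Stephen King"),
    ("Dune", "Frank Herbert"),
]

def full_scan(books, author):
    """Full-scan: inspect every book and count how many were touched."""
    touched = 0
    hits = []
    for title, book_author in books:
        touched += 1
        if book_author == author:
            hits.append(title)
    return hits, touched

def build_author_index(books):
    """Build author -> shelf positions once, ahead of query time."""
    index = defaultdict(list)
    for position, (_, author) in enumerate(books):
        index[author].append(position)
    return index

def index_access(books, index, author):
    """Index-access: look up the author entry, then fetch only those books."""
    positions = index.get(author, [])
    return [books[p][0] for p in positions], len(positions)

author_index = build_author_index(books)
scan_hits, scan_touched = full_scan(books, "Stephen King")
index_hits, index_touched = index_access(books, author_index, "Stephen King")
print(scan_hits, scan_touched)    # every book in the library is touched
print(index_hits, index_touched)  # only the matching books are touched
```

Both calls return the same titles; the difference is that the scan's work grows with the size of the library, while the index lookup's work grows only with the number of matching books.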
What Is Jethro for Tableau?
An indexing & caching server
• Tableau uses Live Connect
– Sends SQL queries via ODBC
• Jethro key performance features
1. Full indexing – every column is indexed
2. Result cache – every query is cached
3. Auto Cubes – every repeatable pattern
• Everything stored in Hadoop
– Cache, aggregations, index & column files, …
• Incrementally updated
– Every day / hour / min
[Diagram labels: SQL, I/O, Cloud Storage]
LIVE Demo: Tableau on Hadoop
• Point browser at: tableau.jethrodata.com
– Login: demo / demo
• Choose workbook: Jethro
• Dashboard interaction: drill-down using any filter combination
• Data
– Based on TPC-DS benchmark
– 1TB raw data
– Fact table: ~2.9B rows
– Dimensions: 7
• Hardware (Jethro)
– Data format: Jethro indexes
– Storage: EFS, HDFS
– Compute cluster: 2x r3.4xlarge (spot)
– Total RAM, CPU: 240GB, 32 cores
– AWS $ per hr.: $0.75
Performance Benchmark Results
[Chart: Dashboard Refresh Time, axis range 0 to 140, series: Jethro (w/cache), Jethro, Impala]
• Dashboard states measured
– Main page
– 1 filter (St=MN)
– 1 filter (Yr=2002)
– 2 filters (2002, MN)
– 3 filters (2002, Women, Good)
– 4 filters (+ store=Woodland bar)
– 5 filters (+ swimwear)
– 6 filters (+ State=Indiana)
• Jethro avg: 6s
• SQL-on-Hadoop avg: 1m32s
Across-the-Board Consistently Fast Queries
Filter by Product Category
[Chart: SUM([Net Profit])*-1 for each Store State. The data is filtered on Item Category, which keeps Children, Men and Shoes.]
• Medium filtering, repeatability
• Benefits from auto micro-cubes
• Auto generated, small size
Filter by Product Category, Customer marital status, date, state
[Chart: SUM([Net Profit])*-1 for each Store State. The data is filtered on Item Category, Customer Marital Status and Sale Date Year. The Item Category filter keeps Electronics. The Customer Marital Status filter keeps M. The Sale Date Year filter keeps 2000. The view is filtered on Store State, which keeps 7 of 22 members (LA, MO, NY, OH, PA, SD, WV).]
• High filtering, low repeatability
• Benefits from indexes
• Direct pointer to needed rows
No Filter
[Chart: Profit by State. SUM([Net Profit])*-1 for each Store State.]
• Low filtering, high repeatability
• Benefits from query-result reuse
• Every query result is cached
Index-Access: How it Works
[Diagram: a Jethro server in front of five data nodes]
• Query: SELECT date, SUM(sales) FROM T1 WHERE product='Books' GROUP BY date
• Jethro server
1. Index access
2. Read data only for required rows
• Performance and resources are based on the size of the working set
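The slide's query can be tried against an indexed column using Python's built-in `sqlite3` module. This is only a sketch of the general index-access idea: SQLite's B-tree index stands in for Jethro's indexes (an assumption, not a description of Jethro internals), and the sample rows and the index name `idx_product` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T1 (date TEXT, product TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO T1 VALUES (?, ?, ?)",
    [
        ("2016-01-01", "Books", 4.5),
        ("2016-01-01", "Toys", 9.0),
        ("2016-01-02", "Books", 2.0),
        ("2016-01-02", "Books", 3.0),
    ],
)
# Index the filter column so matching rows can be located without a full scan.
conn.execute("CREATE INDEX idx_product ON T1 (product)")

query = "SELECT date, SUM(sales) FROM T1 WHERE product = 'Books' GROUP BY date"

# EXPLAIN QUERY PLAN reports how SQLite intends to reach the rows;
# the last column of each plan row is the human-readable step.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
rows = conn.execute(query + " ORDER BY date").fetchall()
print(plan)
print(rows)
```

The plan output names the index rather than a table scan, which is the property the slide highlights: the work done depends on the rows that match the filter, not on the size of the table.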
Query Result Cache: How it Works
sales transactions (1B rows)
date | cust, prod, … | $sale
2015-12-08 | … | $2.00
… | … | …
2016-01-01 | … | $4.50
… | … | …
2016-09-30 | … | $12.50
• Customer query: select sum(sales) from transactions where year=2016
• Process: use the index to find all rows for 2016, then sum $sale for the selected rows
• Response: $1,643
• Jethro saves the actual query result in shared Hadoop storage
– select sum(sales) … where year=2016: $1,643
• A repeated exact query is served from the result cache
• Response: $1,643
Incremental update
date | cust, prod, … | $sale
2015-12-08 | … | $2.00
… | … | …
2016-01-01 | … | $4.50
… | … | …
2016-09-30 | … | $12.50
2016-10-01 | … | $7.00
• Process
1. Repeated exact query
2. Identify that new data was added
3. Run the query on the new data only (result: $7)
4. Merge with the stored result (new result: $1,650)
• Response: $1,650
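The cache-then-merge flow above can be sketched as a small Python class. Everything here is hypothetical illustration: `ResultCache` is not a Jethro API, the rows are reduced to `(year, sale)` pairs in whole dollars, and the amounts are chosen only so the totals match the slide ($1,643, then $1,650). Where Jethro would use its index, this sketch simply scans the newly appended rows.

```python
class ResultCache:
    """Toy cache for SUM queries over an append-only list of (year, sale) rows."""

    def __init__(self, rows):
        self.rows = rows   # append-only fact table
        self.cache = {}    # year -> (cached sum, number of rows already covered)

    def sum_sales(self, year):
        total, covered = self.cache.get(year, (0, 0))
        # Only rows appended since the cached result was computed are examined.
        new_rows = self.rows[covered:]
        total += sum(sale for row_year, sale in new_rows if row_year == year)
        self.cache[year] = (total, len(self.rows))
        return total, len(new_rows)

# Made-up rows: the 2016 sales add up to 1,643 as on the slide.
transactions = [(2015, 200), (2016, 450), (2016, 1193)]
cache = ResultCache(transactions)

first = cache.sum_sales(2016)    # cold: all 3 rows examined
repeat = cache.sum_sales(2016)   # exact repeat: no rows examined
transactions.append((2016, 7))   # new data arrives
updated = cache.sum_sales(2016)  # only the 1 new row is examined, then merged
print(first, repeat, updated)
```

Merging works here because SUM can be combined from partial results; the same approach would not apply unchanged to every aggregate.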
Auto-Micro-Cubes: How it Works
sales transactions (1B rows)
state | cust, prod, … | $sale
AL | … | $2.00
AK | … | $4.50
AZ | … | $1.00
… | … | …
WY | … | $4.25
• Customer query: select sum(sales) … where state='AZ'
• Process: use the index to find all rows for 'AZ', then sum $sale for the selected rows
• Response: $1,643
• Jethro auto-generated query (moves the filter column into the group by): select sum(sales) … group by state
sales-by-state (50 rows)
State | $sale
AL | $256
AZ | $1,643
… | …
WY | $4,654
• Subsequent queries served from the micro-cube
– where state='AK'
– where state in ('CA', 'NY')
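The micro-cube idea above, moving the filter column into the group by and answering later filters from the small aggregate, can be sketched in Python. The function names and the transaction rows are hypothetical; the AL, AZ and WY totals match the slide's sales-by-state table, while the AK amount is invented for the example.

```python
from collections import defaultdict

# Made-up (state, sale) transactions in whole dollars.
sales = [("AL", 256), ("AZ", 1000), ("AZ", 643), ("AK", 450), ("WY", 4654)]

def build_micro_cube(rows):
    """Equivalent of: select state, sum(sales) ... group by state."""
    cube = defaultdict(int)
    for state, sale in rows:
        cube[state] += sale
    return dict(cube)

def sum_for_states(cube, states):
    """Serve 'where state = ...' or 'where state in (...)' from the cube alone."""
    return sum(cube.get(state, 0) for state in states)

cube = build_micro_cube(sales)                    # one pass over the fact rows
az_total = sum_for_states(cube, ["AZ"])           # the original query
ak_total = sum_for_states(cube, ["AK"])           # a different state, no rescan
al_wy_total = sum_for_states(cube, ["AL", "WY"])  # an IN-list, no rescan
print(az_total, ak_total, al_wy_total)
```

The cube has at most one entry per state, so its size is bounded by the cardinality of the filter column rather than by the number of transactions.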
How are Jethro auto micro-cubes different?
• Auto generated
• Limited in size
• Incrementally updated
• Support complex functionality: CASE, WHEN, functions
• Support DISTINCT
• Complementary to indexing
– Avoid large and inefficient cubes by using indexing for high-cardinality columns and multiple filters
Built for Scale: Concurrency Features
• Jethro servers are stateless
– Can be added / dropped on the fly to support any user volume
– All data is stored centrally in Hadoop
• Automated load balancing
– ODBC / JDBC clients use a round-robin mechanism to access all active servers
• Query results and cubes are shared
– All servers and users have immediate access
• Minimal dependency on cluster performance
– All compute is done on Jethro nodes
– Cluster only accessed for selective I/O
• Concurrent query optimizations
– Shared WHERE across active queries
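Round-robin access to a changing set of stateless servers can be sketched as follows. This illustrates the general technique only; `RoundRobinPool` and the server names are hypothetical and do not describe how the Jethro ODBC / JDBC drivers are implemented.

```python
class RoundRobinPool:
    """Hand out active servers in rotation; servers can join or leave at any time."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._next = 0

    def add(self, server):
        self.servers.append(server)

    def drop(self, server):
        self.servers.remove(server)

    def pick(self):
        if not self.servers:
            raise RuntimeError("no active servers")
        # The modulo keeps the rotation valid even after the pool size changes.
        server = self.servers[self._next % len(self.servers)]
        self._next += 1
        return server

pool = RoundRobinPool(["jethro-1", "jethro-2"])
first_three = [pool.pick() for _ in range(3)]
pool.add("jethro-3")  # stateless server added on the fly
next_three = [pool.pick() for _ in range(3)]
print(first_three, next_three)
```

Because the servers hold no state of their own, any of them can serve any query, which is what makes this simple rotation sufficient.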
System Diagram
[Diagram labels: BI Tool, Custom Viz, SQL Client; ODBC / JDBC; Jethro servers (edge node, VM); I/O; Cloud Storage; per dataset: Col Data, Col Index, Result cache, Cubes]
Typical Use Case
• Who: several dozen implementations
– Financial, Retail, Telco, Automotive
– Marketing, Internet, Tech
• Application types
– BI dashboards (50%), reports
– Exploration, ad-hoc
• Common BI tools
– Tableau (50%), Qlik
– SAP/Biz Objects, in-house / customized
• Dataset sizes & complexity
– Average: ~5B row tables
– Largest: >100B rows
• Ingestion patterns
– Average: daily, ~50M rows
– Largest: every 15 min, >1B rows / day
• Performance & concurrency
– Speed: under 10 sec for a dashboard
– Users: ~50, simulated tests to 1,000
– Avg deployment: 4 Jethro nodes
• Hadoop distributions supported
– HDP, CDH, Apache, MapR, EMR
• Use with other technologies
– Complements: Hive, Impala, Presto, Drill
– Replaces: Netezza, Vertica, TD, Redshift
Top Reasons to Use Jethro
1. Consistently fast queries
– Speed up any type of BI query
– Combining indexing, caching and cube technologies
2. Data model flexibility
– Focus on application needs, not query tool limitations
– Avoid: de-normalization, pre-defined cubes, complex aggregations, forced sorting / partitioning
– Full star-schema support
3. Operational simplicity
– All data stored in shared Hadoop cluster
– Incrementally updated
– Self-maintained
– Built for scale
– Wide BI & Hadoop compatibility
4. Broad use-case range
– Any BI application: dashboards, exploration, ad-hoc, reporting
– Internal, external facing
– Small / large datasets, few / many users
Simple Indexing Process
• Any data source: Hadoop, EDW, NoSQL, Text, S3, …
• Flow: data source → Jethro Loader → Jethro Server → indexed BI dataset
– Example input: $ hive -e "select * from…"
• Load types
1. Historical (one-time)
2. On-going (incremental)
• Jethro should be used selectively: only with BI-relevant datasets
• Fast: 0.5B rows / hr
• Compressed: <40% of original size (text)
• Near real-time: load new data as often as every minute
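As a rough back-of-the-envelope check, the quoted load rate can be applied to row counts mentioned elsewhere in this deck: the ~2.9B-row demo fact table and the ~50M-row average daily ingestion. The sketch assumes the 0.5B rows / hr figure holds as a constant rate for any dataset, which the slides do not claim; the function name `load_hours` is made up.

```python
LOAD_RATE_ROWS_PER_HOUR = 0.5e9  # "0.5B rows / hr" from the slide

def load_hours(rows, rate=LOAD_RATE_ROWS_PER_HOUR):
    """Estimated load time in hours, assuming a constant load rate."""
    return rows / rate

historical_hours = load_hours(2.9e9)   # one-time load of the ~2.9B-row demo fact table
daily_minutes = load_hours(50e6) * 60  # average daily increment of ~50M rows
print(f"historical load: about {historical_hours:.1f} hours")
print(f"daily increment: about {daily_minutes:.0f} minutes")
```

Under that assumption the one-time load takes a few hours and a typical daily increment takes minutes, which fits the split between historical and incremental loading described above.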
SQL on Hadoop – Complementary Approaches
• SQL-on-Hadoop solutions: Full-Scan, read all rows
– Hive / Tez, Impala, Presto, SparkSQL, Drill
– HAWQ, IBM/Big SQL, Actian, Tajo, …
• JethroData: Index-Access, read ONLY needed rows
• Comparison
– Full-Scan: optimal for predictive & reporting
– Index-Access: optimal for interactive BI