Data warehouse evolution at InMobi
Introduction to APACHE LENS
LENS Architecture and OLAP Model
Query examples and Demo
Lens roadmap
Agenda
Challenges:
Data scale: loading of data taking ~24 hrs
Analysis only up to 3 dimensions
Heavy queries stalling other users' queries
Unable to move fast with new reporting requirements
Generation 1 : Reporting data in RDBMS
Small and summarized data in Columnar Database
- Rich Dashboards
Granular data in Hadoop
- Adhoc analysis
Generation 2: Hadoop + Columnar DB
Challenges:
Maintaining two lines of independent data warehousing systems
Data discrepancies
Schema management
Learning curve for Users
Duplicate datasets
Inefficient Utilization
Proprietary OLAP and MR on Hadoop
Generation 2: Hadoop + Columnar DB
Platform to enable multi-dimensional queries in a unified way over datasets stored in multiple warehouses
- OLAP Cube abstraction
- Data discovery by providing a single metadata layer
- Unified access to data by integrating Hive with other traditional warehouses
Generation 3: Apache Lens (earlier called Grill)
- Queries get pushed to where data resides
- Central Catalog management: All applications talk same language
- Query analytics for optimizing hot datasets
- Workload-based experimentation with newer systems: AWS Redshift, Apache Spark, Apache Tez
Generation 3: Apache Lens
Machine learning workflow in Apache Lens using Apache Spark
Generation 4 (Future) : Advanced Analytics
Problem areas and Motivation
Introduction to APACHE LENS
LENS Architecture and OLAP Model
Query examples and Demo
Lens roadmap
Agenda
Data Layout – Dimension data
…
Subset_m (a_m < a_(m-1))
…
Subset_2 (a_2 < a_1)
Subset_1 (a_1 < a_x)
All attributes (a_x)
Cost
Data Layout – Snowflake
Aggr Fact_k : measures (m_ak <= m_a(k-1)), dimensions (d_ak < d_a(k-1))
…
Aggr Fact_2 : measures (m_a2 <= m_a1), dimensions (d_a2 < d_a1)
Aggr Fact_1 : measures (m_a1 <= m_r), dimensions (d_a1 < d_r)
Raw Fact : measures (m_r), dimensions (d_r)
Dim2_1
Dim3
Dim2
Dim4_1
Dim4
Dim1
Data Model
Fact1: d1, m2, m3, m4 (HDFS)
Fact2: d1, d3, m2(sum), m3(max), m4(min) (HDFS)
Raw: d1, d2, d3, m1, m2, m3, m4 (HDFS)
Dim_table2: id, name, detail2 (HDFS)
Dim_table3: id, name, detail, d2id (DB)
Dim_table4: id, name, d2id (HDFS, DB)
Dim_table: id, name, detail, d2id (HDFS)
SAMPLE_CUBE
SAMPLE_DIM2 SAMPLE_DIM
SAMPLE_DB_DIM
d3->id
d2id->id
d2id->id
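The fact tables in this data model overlap in the columns they carry, so a query can often be answered from more than one of them. A minimal sketch of how a candidate fact could be chosen is shown here. It uses the column lists from the slide, but the function names and the "fewest columns is cheapest" rule are assumptions made for illustration, not the actual Lens resolver.

```python
# Columns of each fact table, copied from the Data Model slide.
FACTS = {
    "Raw": {"d1", "d2", "d3", "m1", "m2", "m3", "m4"},
    "Fact1": {"d1", "m2", "m3", "m4"},
    "Fact2": {"d1", "d3", "m2", "m3", "m4"},
}


def candidate_facts(queried_columns):
    """Return the facts that contain every queried column, narrowest first."""
    queried = set(queried_columns)
    matches = [name for name, cols in FACTS.items() if queried <= cols]
    # Assumption: a fact with fewer columns is more aggregated and cheaper to scan.
    return sorted(matches, key=lambda name: (len(FACTS[name]), name))


def pick_fact(queried_columns):
    """Pick the cheapest candidate, or fail if no fact can answer the query."""
    candidates = candidate_facts(queried_columns)
    if not candidates:
        raise ValueError(f"no fact table covers {sorted(queried_columns)}")
    return candidates[0]


if __name__ == "__main__":
    print(pick_fact(["d1", "m2"]))  # answered by the narrowest fact
    print(pick_fact(["d2", "m1"]))  # only the raw fact has d2 and m1
```

A query on d1 and m2 can be served by any of the three facts, so the sketch prefers the narrowest one, while a query touching d2 or m1 falls back to the raw fact.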
Apache Lens : Features
Server
Query life cycle management
Catalog service
Query statistics
Scheduling queries
Authorization
Query caching
Estimate expected query time
Client
Java Client
CLI
JDBC Client
Simple UI
Execution Engine
Hive Driver
JDBC Driver
Spark Driver
GitHub source for Apache Lens
• https://git-wip-us.apache.org/repos/asf/incubator-lens.git
• https://github.com/apache/incubator-lens
Documentation
• Soon: http://lens.incubator.apache.org
• Right now: http://inmobi.github.io/grill/
Mailing lists
References
• SELECT ( city.name ), ( city.stateid ) FROM c2_citytable city LIMIT 100
• SELECT ( city.name ), ( city.stateid ) FROM c1_citytable city WHERE (city.dt = 'latest') LIMIT 100
cube select name, stateid from city limit 100
Example query
Example query
• SELECT (city.name), sum((testcube.msr2)) FROM c2_testfact testcube INNER JOIN c1_citytable city ON ((testcube.cityid) = (city.id)) WHERE ((testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03-10-05') OR (testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR (testcube.dt='2014-03-10-08') OR (testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03-10-10') OR (testcube.dt='2014-03-10-11') OR (testcube.dt='2014-03-10-12') OR (testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14') OR (testcube.dt='2014-03-10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17') OR (testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03-10-20') OR (testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR (testcube.dt='2014-03-10-23') OR (testcube.dt='2014-03-11') OR (testcube.dt='2014-03-12-00') OR (testcube.dt='2014-03-12-01') OR (testcube.dt='2014-03-12-02')) AND (city.dt = 'latest')
GROUP BY (city.name)
cube select city.name, msr2 from testcube where timerange_in(dt, '2014-03-10-03', '2014-03-12-03')
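The rewritten query covers the requested time range with hourly partitions at the edges and a single daily partition for the full day in between. One way to reproduce that partition list is sketched here. The function name and the rule it applies (use a daily partition whenever a whole day fits, otherwise hourly, with the end of the range excluded) are assumptions inferred from this one example, not the Lens implementation.

```python
from datetime import datetime, timedelta

HOUR_FMT = "%Y-%m-%d-%H"
DAY_FMT = "%Y-%m-%d"


def expand_time_range(start, end):
    """Cover [start, end) with daily partitions for whole days, hourly otherwise."""
    current = datetime.strptime(start, HOUR_FMT)
    stop = datetime.strptime(end, HOUR_FMT)
    partitions = []
    while current < stop:
        next_day = current + timedelta(days=1)
        if current.hour == 0 and next_day <= stop:
            # A whole day lies inside the range: one daily partition is enough.
            partitions.append(current.strftime(DAY_FMT))
            current = next_day
        else:
            partitions.append(current.strftime(HOUR_FMT))
            current += timedelta(hours=1)
    return partitions


def partition_filter(alias, partitions):
    """Build an OR-ed filter over the dt partition column."""
    return " OR ".join(f"({alias}.dt='{p}')" for p in partitions)


if __name__ == "__main__":
    parts = expand_time_range("2014-03-10-03", "2014-03-12-03")
    print(len(parts))
    print(partition_filter("testcube", parts))
```

For the range in the example this yields the hours 03 to 23 of 2014-03-10, the daily partition 2014-03-11, and the hours 00 to 02 of 2014-03-12, matching the filter in the rewritten query.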
Data Model – Storage
Storage
• Name
• End point
• Properties
• Ex: ProdCluster, StagingCluster, Postgres1, HBase1, HBase2
Data Model – Fact Table
Fact Table
• Columns
• Cube to which it belongs
• Storages on which it is present and the associated update periods
Diagram labels: Cube, Fact table, Storage
Data Model – Dimension table
Dimension Table
• Columns
• Dimension to which it belongs
• Storages on which it is present and associated snapshot dump period, if any
Diagram labels: Dimension, Dimension table, Storage
Data Model – Storage tables and partitions
Storage table
• Belongs to fact/dimension
• Associated storage descriptor
• Partitioned by columns
• Naming convention – storage name followed by fact/dimension name
• Partition can override its storage descriptor
• Fact storage table
Fact table
• Dimension storage table
Dimension table
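The naming convention above explains the table names in the example queries: the dimension table citytable on storages c1 and c2 becomes c1_citytable and c2_citytable. A rough sketch of the storage and storage-table model follows. The class names, the end points, and the snapshot_period field are hypothetical, and the generated SQL is simplified compared with the queries Lens emits.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Storage:
    name: str
    end_point: str
    properties: dict = field(default_factory=dict)


@dataclass
class DimStorageTable:
    storage: Storage
    dim_table: str
    # Hypothetical field: None means the storage keeps no snapshot dumps.
    snapshot_period: Optional[str] = None

    @property
    def name(self):
        # Naming convention: storage name followed by the dimension table name.
        return f"{self.storage.name}_{self.dim_table}"


def rewrite_dim_query(columns, alias, table, limit):
    """Rewrite a dimension query against one storage table (simplified SQL)."""
    select = ", ".join(f"{alias}.{col}" for col in columns)
    sql = f"SELECT {select} FROM {table.name} {alias}"
    if table.snapshot_period is not None:
        # Snapshot-dumped tables are read through their 'latest' partition.
        sql += f" WHERE {alias}.dt = 'latest'"
    return f"{sql} LIMIT {limit}"


if __name__ == "__main__":
    # End points below are placeholders, not real clusters.
    c1 = Storage("c1", "hdfs://c1.example:8020")
    c2 = Storage("c2", "jdbc:postgresql://c2.example/db")
    for table in (DimStorageTable(c1, "citytable", "hourly"),
                  DimStorageTable(c2, "citytable")):
        print(rewrite_dim_query(["name", "stateid"], "city", table, 100))
```

Keeping the storage name in the table name lets one logical dimension table map to several physical tables without a separate lookup.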
Resolve candidate tables and storages
Automatically resolve joins, aggregations
Allows SQL over Cube QL
Queries can span multiple storages
Accepts multiple time ranges in query
All Hive QL features
Query features