Taming the Data Lake with Scalable Metrics Model Framework

www.globalbigdataconference.com Twitter : @bigdataconf

Transcript of Taming the Data Lake with Scalable Metrics Model Framework

Page 1: Taming the Data Lake with Scalable Metrics Model Framework


Page 2: Taming the Data Lake with Scalable Metrics Model Framework

“Taming the Data Lake”


Page 3: Taming the Data Lake with Scalable Metrics Model Framework

Intended for Knowledge Sharing only

Disclaimer: Participation in this summit is purely on personal basis and not representing VISA in any form or matter. The talk is based on learnings from work across industries and firms. Care has been taken to ensure no proprietary or work related info of any firm is used in any material.

RAMKUMAR RAVICHANDRAN
Director, Insights at Visa, Inc. Enable Decision Making at the Executives/Product/Marketing level via actionable insights derived from Data.

BHARATHIRAJA CHANDRASEKHARAN
Data Warehouse Architect at Visa, Inc. Architect a data-shop in Hadoop to get 360-degree view of the interaction. Technology interface for the Data Stakeholder Community.

Page 4: Taming the Data Lake with Scalable Metrics Model Framework


Quick recap of what it is


Data Lakes – the concept

Page 5: Taming the Data Lake with Scalable Metrics Model Framework

AS THEY ARE ENVISIONED TODAY…


Source: http://www.tangerine.co.th/tag/how-do-data-lake-work/


Page 6: Taming the Data Lake with Scalable Metrics Model Framework

DOES IT RING A BELL?

*Only satirical, to wake you up, and not indicative of anyone or anything; any similarity is purely coincidental!

Page 7: Taming the Data Lake with Scalable Metrics Model Framework

& DOES THIS TOO?

*Only satirical, to wake you up, and not indicative of anyone or anything; any similarity is purely coincidental!

Page 8: Taming the Data Lake with Scalable Metrics Model Framework

SO WHAT DO WE HEAR FROM OUR USERS?

We often hear these statements in the context of data lakes…

Success criteria were engineering-specific – storage/scalability cost savings, etc.

Expensive Change Management

Complex for the end users to deal with

Analytical performance issues

Data Governance, Lineage and Management complexities

“Although the cost of Storage went down, actual cost of utilizing the data has shot up”


Page 9: Taming the Data Lake with Scalable Metrics Model Framework


Taking a step back

Page 10: Taming the Data Lake with Scalable Metrics Model Framework

DATA REALLY HAS GOTTEN BIG – VOLUME, VARIETY, VELOCITY & VERACITY

Each data source is critical across either all or multiple functions…

…and is consumed as reports, analytical deep-dive insights, forward-looking projections, etc.

DATA SOURCES • TYPES OF INSIGHTS

TRANSACTION DATA • Are overall txns going up or down? Where are the txns happening? etc.

CLICK STREAM DATA (MOBILE & WEB) • How are consumers interacting with the website/app: drop-offs, clicks, time spent, etc.

SENTIMENT/SOCIAL DATA • Social media, NPS surveys and media mentions help in gauging true consumer reactions

SERVER LOGS DATA • How are consumers reacting with various functions on the front end?

LOCATION DATA • Are consumers using the product in-store or on the move?

PROMOTIONS DATA • How are consumers reacting to various marketing campaigns?

INDUSTRY DATA • Benchmarking against industry performance

Page 11: Taming the Data Lake with Scalable Metrics Model Framework

EVERYONE NEEDS DATA…

BI • How are we doing today?

ANALYTICS • Where will we be tomorrow? What if we do this? What can we do?

A/B TESTING • Did the initiative work?

USER RESEARCH • How do Customers feel about us?

STRATEGY • Where should we invest?

Page 12: Taming the Data Lake with Scalable Metrics Model Framework

…AND DISTRIBUTED DATA SYSTEMS HAD THEIR OWN ISSUES


Inconsistent (and/or conflicting) definitions of data and numbers

Varying granularities

Multiple methodologies

Different BUs = different KPIs, or the same KPIs with different priorities

Lack of visibility/understanding outside of the BUs

“Slow & inefficient, Non-scalable, Difficulties rolling up, Trust issues, Cascading mistakes”

Page 13: Taming the Data Lake with Scalable Metrics Model Framework

AND IT THEN JUST HAPPENED…

DATA SOURCES

TRANSACTION DATA
CLICK STREAM DATA (MOBILE & WEB)
SENTIMENT DATA
SERVER LOGS DATA
LOCATION DATA
CAMPAIGN DATA
INDUSTRY DATA

Source: http://www.adamadiouf.com/2013/03/22/bigdata-vs-enterprise-data-warehouse/

As if all prayers were answered, Hadoop arrived in a big way and, poof, all problems seemed to disappear…

Page 14: Taming the Data Lake with Scalable Metrics Model Framework


All problems solved? No wait...

Page 15: Taming the Data Lake with Scalable Metrics Model Framework

WE FOCUSED ON OUR SPOUSE BUT FORGOT THE IN-LAWS…

Maturity phases of Analytics Practice (axis: Value Addition)

• Inform: Reports on KPIs with high-level drilldowns
• Act: Deep dives via Business Analytics
• Predict: Identify causal relationships via Advanced Analytics
• Optimize: Experiments to verify which one works via A/B Testing
• Mine: Machine Learning

[Figure labels on the slide: 5%, 50%, 15%, 20%, 10%]

The focus was on the 20% of Data consumers (Reports), and the assumption was that the other 80% of Data consumers would either love it or at least figure it out…

Page 16: Taming the Data Lake with Scalable Metrics Model Framework

HIGH DEVELOPMENT/MODIFICATION COSTS


Rigid Structure and scale of operations make dynamism difficult…


Data Modeling/Schema

ETL; Metadata

Raw Data

Page 17: Taming the Data Lake with Scalable Metrics Model Framework

NOT ONLY IS THE AUDIENCE CHANGING…

Stakeholders: Needs

Executives: Reports; High level drilldown; Unified summary; “On the go*”

Marketing & PR: Campaign performance; Infographics; Deep dives; Testing

Sales / RM: Sales performance; Prospecting; Competitive; Infographics

Product: Product performance; Deep dive; Mining; Testing; Research

Technology / AE / Operations: Platform performance; Deep dive; Forecasting; Real time alerting

FP & A: Consolidated Initiative readouts (E2E); Deduping; Drill downs; Forecasting

Reports, Insights & Drilldowns

Datamart Documentation

Page 18: Taming the Data Lake with Scalable Metrics Model Framework

…BUT ALSO THE NEEDS ARE EVER CHANGING

“In mail”: Recommendations with supporting graphs, tables, etc.

“Story Deck”: Full deck with the pitch and supporting arguments, numbers, graphs, charts

“On-the-go”: Mobile App, On the Cloud, Subscriptions; Reports, Dashboards, Infographics

Algorithm/Model: Ready to be deployed

How to decide? Customer needs; Turnaround Speed; One time/reuse; Deployment on Front end; Strategic Doc; Quick read/research doc

Page 19: Taming the Data Lake with Scalable Metrics Model Framework


Getting to the point – what do we propose?

Page 20: Taming the Data Lake with Scalable Metrics Model Framework

WE BRING TO YOU THE SCALABLE METRICS MODEL (SMM)…

EDW

Aggregated Cubes

An attempt to bring together the best of the most used models…

ACID, Fast, Stable

Rigid, Cost, Resourcing

Scalable Metrics Model (Pre-Aggregated Metrics + Primary-Foreign Keys)

Cost, Flexibility, Scalability

Performance, Reliability

Performance, Easy to understand

Reporting only

Page 21: Taming the Data Lake with Scalable Metrics Model Framework

TACTICAL DETAILS: WHERE DO WE START?

An illustrative example from Retail domain…

• Defined Granularity & associated Info: Determined by Core Objectives, e.g., Customer level table for Customer Engagement team
• Defined Foreign Keys & Common Dimensions: As required for extensibility
• Defined Metrics: KPIs as required
• Identify Value Add Metrics for Decisioning: Recommendations, CLV, etc.

CUSTOMER
• Primary Key: Customer id
• Foreign Keys: Sign Up Partner, Promotion Id, First Txn id
• Customer Level Info: Email, Phone Number, Geo, etc.
• Metrics: Lifetime Spend, Txns; Behavioral Bucket; RFM Bucket
• Recommended Action items: Next Best Product; CLV; Target Offers; Call Center Agent Reco
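One way to picture this starting point is as a small declarative spec per table: fix the granularity (the primary key), then list foreign keys, descriptive info, metrics and action items. The sketch below is only an illustration of that idea; the `EntitySpec` class, its method names and the snake_case field names are assumptions made for this example and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpec:
    """Hypothetical description of one SMM table at a fixed granularity."""
    name: str
    primary_key: str
    foreign_keys: tuple = ()
    info_fields: tuple = ()
    metrics: tuple = ()
    action_items: tuple = ()

    def all_fields(self):
        """Every field the table carries, primary key first."""
        return (self.primary_key, *self.foreign_keys, *self.info_fields,
                *self.metrics, *self.action_items)

    def validate(self):
        """Reject specs where one field name is declared in two groups."""
        fields = self.all_fields()
        duplicates = sorted({f for f in fields if fields.count(f) > 1})
        if duplicates:
            raise ValueError(f"{self.name}: duplicate fields {duplicates}")
        return self


# The CUSTOMER example from the slide, with illustrative field names.
CUSTOMER = EntitySpec(
    name="CUSTOMER",
    primary_key="customer_id",
    foreign_keys=("signup_partner_id", "promotion_id", "first_txn_id"),
    info_fields=("email", "phone_number", "geo"),
    metrics=("lifetime_spend", "txns", "behavioral_bucket", "rfm_bucket"),
    action_items=("next_best_product", "clv", "target_offers",
                  "call_center_agent_reco"),
).validate()
```

Keeping the spec declarative means that choosing a different granularity, for example a seller-level table, is a new `EntitySpec` rather than a change to existing ones.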

Page 22: Taming the Data Lake with Scalable Metrics Model Framework

TACTICAL DETAILS: DATA MODEL

An illustrative example from Retail domain…

Columns: id | Dimensions | foreign_keys | metrics

What it contains
• id: Customer_id
• Dimensions: Name, Email, Address, etc.
• foreign_keys: signup_partner_id, promotion_id
• metrics: Lifetime Spend, Txns; Behavioral Bucket; RFM Bucket; Recommended Action items: Next Best Product, CLV, Target Offers, Call Center Agent Reco

Sample data
• id: 11234
• Dimensions: {"name":"John", "Email":"john@email.com", "Address":"123 nowhereblvd"}
• foreign_keys: {"signup_partner_id":"666YYY", "promotion":"YAH123"}
• metrics: {"Lifetime Spend":"3400", "Txns":"150", "Behavioural Bucket":"repeat user", "RFM Bucket":"", "recommended Product id":"PRD789", "CLV":"??", "Target Offer":"OFF789", "CallCenterAgentReco":"1234"}
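The four-column layout on this slide (an id plus three key-value maps) can be sketched in a few lines of Python. The example below reuses part of the sample row from the slide and stores each map as a JSON string; the helper names `get_metric` and `set_metric` and the "Avg Txn Value" metric are hypothetical, added only to show that a new metric is a new key rather than a schema change.

```python
import json

# Part of the sample row from the slide; each non-id column is a JSON map.
row = {
    "id": "11234",
    "dimensions": '{"name": "John", "Email": "john@email.com", '
                  '"Address": "123 nowhereblvd"}',
    "foreign_keys": '{"signup_partner_id": "666YYY", "promotion": "YAH123"}',
    "metrics": '{"Lifetime Spend": "3400", "Txns": "150", '
               '"Behavioural Bucket": "repeat user"}',
}


def get_metric(row, name, default=None):
    """Read one metric out of the JSON-encoded metrics column."""
    return json.loads(row["metrics"]).get(name, default)


def set_metric(row, name, value):
    """Return a copy of the row with one metric added or overwritten."""
    metrics = json.loads(row["metrics"])
    metrics[name] = value
    return {**row, "metrics": json.dumps(metrics)}


# Hypothetical derived metric: no column is added, only a key in the map.
avg_value = round(float(get_metric(row, "Lifetime Spend"))
                  / float(get_metric(row, "Txns")), 2)
updated = set_metric(row, "Avg Txn Value", avg_value)
```

The original `row` is left untouched, so existing consumers that never ask for the new key see no difference.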

Page 23: Taming the Data Lake with Scalable Metrics Model Framework

TACTICAL DETAILS: ETL FRAMEWORK

An illustrative example from Retail domain…

STEP I: QUERIES
• Write separate queries/code to get metrics on the defined granularity and in the optimal framework

STEP II: FRAMEWORK RUNS
• Framework runs each of these queries and populates the respective keys

STEP III: IMPLEMENT MODULARITY
• Adding a new metric is just adding a new query/code for that metric alone
• Changing the existing logic for a metric impacts that metric alone

STEP IV: USER INTERFACE
• Reporting/Business Analytics: Connect via Tableau/QlikView (Cached)
• Deep dive Business Analytics: Physical Impala tables for interactive querying or Views for abstraction & end-user access
• Advanced Analytics: Connect with SAS/R/Python
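Steps I to III can be sketched as a registry of one query per metric plus a small runner that fills in the matching key for each customer. This is a minimal illustration, not the framework described in the talk: it uses Python's built-in `sqlite3` in place of Hive/Impala, and the table, column and metric names (`txns`, `amount`, "Max Txn Amount") are assumptions made for the example.

```python
import sqlite3

# Hypothetical raw transaction data standing in for the data lake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (txn_id TEXT, customer_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO txns VALUES (?, ?, ?)",
    [("t1", "11234", 100.0), ("t2", "11234", 50.0), ("t3", "22345", 20.0)],
)

# Step I: one query per metric, each returning (customer_id, value).
METRIC_QUERIES = {
    "Lifetime Spend":
        "SELECT customer_id, SUM(amount) FROM txns GROUP BY customer_id",
    "Txns":
        "SELECT customer_id, COUNT(*) FROM txns GROUP BY customer_id",
}


def run_framework(conn, metric_queries):
    """Step II: run every metric query and populate its key per customer."""
    metrics_by_customer = {}
    for metric_name, query in metric_queries.items():
        for customer_id, value in conn.execute(query):
            metrics_by_customer.setdefault(customer_id, {})[metric_name] = value
    return metrics_by_customer


# Step III: a new metric is one new entry; existing queries are untouched.
METRIC_QUERIES["Max Txn Amount"] = (
    "SELECT customer_id, MAX(amount) FROM txns GROUP BY customer_id"
)

# Each call recomputes every metric from the raw data (a full refresh).
result = run_framework(conn, METRIC_QUERIES)
```

Because each query is independent, changing the logic behind "Txns" cannot alter "Lifetime Spend"; the trade-off in this sketch is that every run recomputes all metrics.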

Page 24: Taming the Data Lake with Scalable Metrics Model Framework

CUSTOMER
• Primary Key: Customer id
• Foreign Keys: Sign Up Partner, Promotion Id, First Txn id
• Customer Level Info: Email, Phone Number, Geo, etc.
• Metrics: Lifetime Spend, Txns; Behavioral Bucket; RFM Bucket
• Recommended Action items: Next Best Product; CLV; Target Offers; Call Center Agent Reco

SELLERS
• Primary Key: Seller id
• Foreign Keys: Product id, Operating Channel
• Seller Level Info: Name, Operating Region, Annual Sales
• Metrics: Lifetime Sales, Txns; Performance Bucket; Special Category Flag
• Recommended Action items: Next Best Product; Next Co-Marketing; RM action

TXNS
• Primary Key: Txn id
• Foreign Keys: Custid, Sellerid, Channel
• Txn Level Info: Amt, Type, Date
• Flags: Buyer/Seller Type; Deviation Metrics; Fraud/Good; Agent Verification; Next Best Offer

+ CLICKSTREAM + PROMOTIONS + PARTNERS + PRODUCTS + SENTIMENT + LOGS + 3rd PARTY + ETC ETC…

Common Dimensions or Foreign Keys
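The "data bus" idea is that each entity table stays independent and the foreign keys are what let an analyst connect them. As a rough sketch under assumed data, the example below rolls up transaction amounts by the customer's behavioral bucket and the seller's performance bucket, following the foreign keys on each TXNS row. All ids, bucket values and the `spend_by_buckets` function are hypothetical.

```python
from collections import defaultdict

# Three hypothetical SMM tables keyed by their primary keys.
customers = {
    "C1": {"metrics": {"behavioral_bucket": "repeat user"}},
    "C2": {"metrics": {"behavioral_bucket": "new user"}},
}
sellers = {
    "S1": {"metrics": {"performance_bucket": "top"}},
    "S2": {"metrics": {"performance_bucket": "average"}},
}
txns = {
    "T1": {"foreign_keys": {"customer_id": "C1", "seller_id": "S1"},
           "info": {"amount": 40.0}},
    "T2": {"foreign_keys": {"customer_id": "C1", "seller_id": "S2"},
           "info": {"amount": 10.0}},
    "T3": {"foreign_keys": {"customer_id": "C2", "seller_id": "S1"},
           "info": {"amount": 25.0}},
}


def spend_by_buckets(txns, customers, sellers):
    """Total txn amount per (customer bucket, seller bucket) pair."""
    totals = defaultdict(float)
    for txn in txns.values():
        fk = txn["foreign_keys"]
        # Follow the foreign keys to the pre-aggregated entity metrics.
        customer_bucket = customers[fk["customer_id"]]["metrics"]["behavioral_bucket"]
        seller_bucket = sellers[fk["seller_id"]]["metrics"]["performance_bucket"]
        totals[(customer_bucket, seller_bucket)] += txn["info"]["amount"]
    return dict(totals)


totals = spend_by_buckets(txns, customers, sellers)
```

Adding another entity, such as PROMOTIONS, would mean one more keyed table and one more foreign key on the rows that reference it, with no change to the existing tables.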

Page 25: Taming the Data Lake with Scalable Metrics Model Framework

LEAN DEVELOPMENT


Source: http://www.ga.businessgrowthservice.greatbusiness.gov.uk/?attachment_id=10890

The SMM is conceptualized for feedback-based iterative development…

…the speed with which new components can be added, existing logic can be modified and historical reloads can be done exemplifies the value of SMM

Page 26: Taming the Data Lake with Scalable Metrics Model Framework

THE SALIENT FEATURES

• Fit for a wide variety of Solution Sets & audiences: Optimal data model to support all three needs – Reporting, Analytics & Data Mining.
• Best of all worlds: Scalable Metrics Model is a hybrid approach.
  • ACID Strengths: performance, stability and reliability of RDBMS.
  • Non ACID Strengths: scalability, flexibility, versatility of Hadoop.
• Needs Optimized Model: Highest premium is given to the needs of the user – easy to incorporate changes as they come along (view like). The refresh cycle is easy and changed logic gets incorporated in the next run.
• Data Governance & Lineage: Operates with a modular approach – break down complex problems into smaller items and integrate them into a bigger scheme of things. This eases Data Governance and Lineage.
• Extensibility:
  • Caching: Easy integration with buffering technologies to optimize performance.
  • Visualization: Easier integration with visualization tools like Tableau.
  • Coding Interface: Additional drilldowns, analyses, data analysis via HIVE/SAS/R.

● MODULAR ● EXTENSIBLE ● SCALABLE

Page 27: Taming the Data Lake with Scalable Metrics Model Framework

FOUR DIMENSIONS OF SUCCESSFUL EXECUTION

PEOPLE
• Business Analysts: Details on Business needs like Timing (Immediate/near/medium/long term), Priority (Critical/Urgent/Important/Good to have), Frequency (Regular/once-in-a-while/rare), Real-time, Delivery & Users.
• Technical Architects: Understand the raw data structure, flow mechanisms & pipelines, security/legal/storage/resourcing constraints, feasibility assessments.

PROCESS
• Matching & Gap Analysis: Is the technology available to handle all business needs (possible/not enough RoI/deferred); Contingency, resourcing & budgeting.
• Project Planning: Milestone based delivery, Deep Stakeholder involvement in development & validation, Communications Management
• Execution: Schema on read efficient, Aggregates, Tight Metadata, reporting/analytics layer, Tables/Partitions/File types/Compression

TECH
• PIG: ETL
• HIVE/Impala: Schema & Table creation
• Java/Streaming:
• SAS/Python/R: Statistical Modeling

CULTURE
• Customer Needs Focused
• Need for a smart vision, sound planning and able change management
• Outcome Focused Organization (common business goal)

Page 28: Taming the Data Lake with Scalable Metrics Model Framework

WHY DO WE THINK THE TIME IS NOW?

Evolution in the value prop of Analysts: What/where/how much -> what can happen -> what should we do?

Audience has broadened (a numbers middle man -> Front line Managers)

Luxury of time has evaporated

Nature of questions has drastically changed (expectation of being able to connect the dots in the “Data Lake” world).

Overselling potential before getting “there”

KPI of Analytics has changed from Turn-Around-Time (TAT) to Time-to-Action (TTA)

Page 29: Taming the Data Lake with Scalable Metrics Model Framework


Putting it all together

Page 30: Taming the Data Lake with Scalable Metrics Model Framework

SWOT ANALYSIS OF SMM

STRENGTHS
• Need sensitive model
• Cost of development, modification & refresh reduced
• Easy for Analysts/End Users to understand and play with
• Data Governance & Lineage: Break down bigger problems into smaller, manageable ones

OPPORTUNITIES
• Integration with front end tools that can simplify UX.
• Tools that buffer the backend data to ensure speedy delivery.

WEAKNESSES / THREATS
• Good vision of future Analytical requirements is paramount.
• Full refresh every time it runs again.
• Maximum granularity needs to be pre-fixed.
• Learning Curve on Coding language/syntax.
• Non-normalized data model.
• Not for real-time insights delivery
• No Slowly Changing Dimensions

Page 31: Taming the Data Lake with Scalable Metrics Model Framework

THE FIVE COMMANDMENTS

• “Know” that it caters to the most frequent needs, not all needs.

• “Must have” as good and far-reaching a vision of Analytics needs as possible, and an Outcome Focused approach.

• “Ensure” deeper Stakeholder involvement in the development. A Test & Learn approach is a must. Be ready to modify if needed.

• “Develop” modularity in delivery.

• “Prepare” for ever-increasing dependencies from Analytics and other stakeholders.

Page 32: Taming the Data Lake with Scalable Metrics Model Framework


Appendix

Page 33: Taming the Data Lake with Scalable Metrics Model Framework

THANK YOU!


Would love to hear from you on any of the following forums…

https://twitter.com/decisions_2_0

http://www.slideshare.net/RamkumarRavichandran

https://www.youtube.com/channel/UCODSVC0WQws607clv0k8mQA/videos

http://www.odbms.org/2015/01/ramkumar-ravichandran-visa/

https://www.linkedin.com/pub/ramkumar-ravichandran/10/545/67a

https://www.linkedin.com/in/dataisbig

http://bigdatadw.blogspot.com/

BHARATHIRAJA CHANDRASEKHARAN

RAMKUMAR RAVICHANDRAN


Page 34: Taming the Data Lake with Scalable Metrics Model Framework


RESEARCH/LEARNING RESOURCES

• Alternative approach by Martin Fowler: http://martinfowler.com/bliki/DataLake.html
• Teradata/Hortonworks Data Lake Whitepaper: http://hortonworks.com/wp-content/uploads/2014/05/TeradataHortonworks_Datalake_White-Paper_20140410.pdf
• EMC Data Lake: https://www.youtube.com/watch?v=o2fs02h_LEo