Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015

© 2014 VMware Inc. All rights reserved. Virtualized Big Data Platform @ VMware Corp IT Rajit Saha Hadoop Development Lead VMware Corp IT Data Solution and Delivery An Enterprise Data Warehouse meets an Elephant

Transcript of Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015

Page 1: Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015

© 2014 VMware Inc. All rights reserved.

Virtualized Big Data Platform @ VMware Corp IT

Rajit Saha

Hadoop Development Lead, VMware Corp IT Data Solution and Delivery

An Enterprise Data Warehouse meets an Elephant

Page 2: Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015

Business Use Case for Big Data Analytics @ VMware BI Space

• Personalized Marketing & Customer Targeting
• Personalized Campaign Content Strategy
• MyVMware Log Analytics: combine user-level data (logins and other activities) with Clickstream Data and Product Data
• VMware Product List Price Optimization and Deal Analytics for the VMware Pricing Team (EDW)
  - Complex ETL, Bigger Joins
  - Flattening Star Schema Tables
  - Propensity Modeling
• GSS Service Request Logs Analytics
  - High Volume (~400TB)
  - A lot of Variety of data
  - Complex parsing
  - Deeper Learning of VMware Product Issues
  - Build a highly intelligent recommendation system to fix Customer Issues with faster turnaround time
• Clickstream Data Analytics (554 columns, 1.5B rows, 20TB of data over 2 years)
  - Path analysis: first user visit to product purchase
  - Propensity Modeling
  - Predictive Analytics: which product a user will buy
  - Customer Lifetime Value Analysis

BIG DATA: Variety, Volume, Velocity

Page 3: Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015

Components of Big Data Cluster

• This Big Data Cluster is fully virtualized, based on vSphere 6.0 and VMware Big Data Extensions 2.2
• We used EMC Isilon 7.2.0.2 with two patches for HDFS storage
• We used Pivotal Big Data Suite 3.0 for Hadoop 2.6 and HAWQ 1.3
• We used Pivotal Spring XD 1.2 for data ingestion to Hadoop
• We integrated this with Alpine Data Lab 5.4 for running
  - Deeper Analytic Functions
  - Machine Learning Algorithms
• We integrated HUE 2.6 as a GUI-based HIVE/PIG query execution client

Page 4: Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015

Page 5: Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015

On-Prem Big Data Production Datacenter

Page 6: Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015

Apache Ambari – The Hadoop Cluster Management Console

Management & Monitoring:
- HDFS
- YARN/MapReduce
- Hive
- HAWQ
- Spring XD

Page 7: Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015

Data Processing Pipeline – Click Stream Data

[Pipeline diagram labels: Clickstream; ftps.vmware.com raw data files; firewall; Daily push of Clickstream Logs; Data Ingestion to Isilon HDFS via Spring XD; Lookup Logs; Clickstream Logs; Python; Adv. Analytics Users]

• Data Cleaning
• Better Consumable Structured data
• Data Partitioning
• Schema Building
• Faster Analytic Power

- Daily 2M Clickstream Records (~10GB) are being ingested from Adobe Omniture to Isilon HDFS
- 1.5 billion records, 554 columns, and ~20TB of data
- Data cleanup and preprocessing using PIG, Hadoop Streaming and Python scripts
- Fit the data into the HIVE/HAWQ schema
- End users (data scientists) consume via HUE/pgAdmin/Alpine Data Lab
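The cleanup step mentioned on this slide (Hadoop Streaming with Python scripts) can be pictured with a minimal sketch like the one below. It is not the script used at VMware: the tab delimiter and the function names `clean_record` and `run_mapper` are assumptions made for illustration. Only the 554-column width comes from the slide.

```python
EXPECTED_COLUMNS = 554  # column count quoted on the slide
DELIMITER = "\t"        # assumption: records arrive tab-separated


def clean_record(line, expected_columns=EXPECTED_COLUMNS):
    """Return the trimmed fields of one record, or None if it is malformed."""
    fields = line.rstrip("\r\n").split(DELIMITER)
    if len(fields) != expected_columns:
        return None  # wrong width: treat as a corrupt record
    return [field.strip() for field in fields]


def run_mapper(lines, out, expected_columns=EXPECTED_COLUMNS):
    """Streaming-style mapper: read raw lines, write cleaned lines.

    Returns (kept, dropped) so the caller can log reject counts.
    """
    kept = dropped = 0
    for line in lines:
        fields = clean_record(line, expected_columns)
        if fields is None:
            dropped += 1
            continue
        out.write(DELIMITER.join(fields) + "\n")
        kept += 1
    return kept, dropped
```

In a Hadoop Streaming job, a script of this shape would typically be invoked as `run_mapper(sys.stdin, sys.stdout)`, since streaming mappers read records from standard input and emit results on standard output.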

Page 8: Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015

Data Consumption – pgAdmin3 (via HAWQ Database) …

Page 9: Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015

And visualize the results ..

[Pie chart: Top 20 Countries with unique vmware.com Visits in 2015 Q1. Legend: usa, jpn, deu, gbr, chn, ind, can, fra, aus, kor, esp, bra, ita, nld, rus, che, twn, pol, mex, swe. Slice labels range from 37% down to 1%.]

[Pie chart: Top 20 Countries with unique vmware.com Visitors in 2015 Q1. Same legend. Slice labels range from 34% down to 1%.]

Disclaimer : This is based on Synthesized Dataset for demo purpose, not Real Data
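The two charts differ only in what is counted: visits versus visitors. A minimal sketch of that distinction follows, assuming each record carries a country code, a visitor id and a per-visitor visit number. These field names, and the function `country_shares`, are hypothetical and do not come from the slides.

```python
from collections import defaultdict


def country_shares(records):
    """Compute percentage shares per country for visits and for visitors.

    records: iterable of (country, visitor_id, visit_number) tuples.
    Returns (visit_share, visitor_share), each a dict country -> percent.
    """
    visits = defaultdict(set)
    visitors = defaultdict(set)
    for country, visitor_id, visit_number in records:
        # A visit is identified by the visitor plus that visitor's visit number.
        visits[country].add((visitor_id, visit_number))
        visitors[country].add(visitor_id)

    def to_percent(groups):
        total = sum(len(ids) for ids in groups.values())
        return {c: 100.0 * len(ids) / total for c, ids in groups.items()}

    return to_percent(visits), to_percent(visitors)
```

A visitor who returns several times raises a country's share of visits but not its share of visitors, which is why the two charts can show different percentages for the same country.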

Page 10: Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015

Data Consumption – HUE

Hive Query to find out unique visits in VMware site 2015 Q1

[Chart: Unique Visits in 2014 and 2015 month wise. X axis: Month (2014-01 to 2015-07). Y axis: Visit Count (0 to 14,000,000).]

Disclaimer : This is based on Synthesized Dataset for demo purpose, not Real Data
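The Hive query itself is not reproduced in the transcript. As an illustration of the aggregation behind a month-wise unique-visit chart, the sketch below does in plain Python what a distinct count grouped by month would do. The record layout (`YYYY-MM-DD` date string plus a visit id) and the function name are assumptions.

```python
from collections import defaultdict


def unique_visits_by_month(records):
    """Count distinct visit ids per calendar month.

    records: iterable of (date_string, visit_id), date_string as 'YYYY-MM-DD'.
    Returns a dict 'YYYY-MM' -> number of distinct visit ids, in month order.
    """
    per_month = defaultdict(set)
    for date_string, visit_id in records:
        month = date_string[:7]  # keep the 'YYYY-MM' prefix
        per_month[month].add(visit_id)
    return {month: len(ids) for month, ids in sorted(per_month.items())}
```

Collecting ids in a set per month means repeated rows for the same visit are counted once.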

Page 11: Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015

Advanced Data Analytics by Alpine Data Lab

Time Series Analysis on Jan 2015 Clickstream Data
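The slide does not say which time series technique was applied in Alpine Data Lab. As a stand-in, the sketch below smooths a series of daily counts with a simple trailing moving average. The function name and the 7-day default window are illustrative assumptions, not details from the talk.

```python
def moving_average(values, window=7):
    """Trailing moving average over a list of numbers.

    Returns one average per full window, so the result has
    len(values) - window + 1 entries (or none if the list is too short).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        if i >= window - 1:
            averages.append(running / window)
    return averages
```

Applied to daily visit counts for a month, a 7-day window would smooth out the weekday/weekend cycle so that the underlying trend is easier to see.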

Page 12: Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015

Take Away …

At VMware IT, we have established that an Enterprise Big Data Analytics Platform can be successfully built and run on top of VMware Virtual Infrastructure with EMC Isilon and PHD 3.0, with great performance.

Page 13: Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015

Thank You

Q&A