Big data, data science & fast data

92
Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Big Data Analytics, Data Science & Fast Data 1 Kunal Joshi [email protected]

description

 

Transcript of Big data, data science & fast data

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Big Data Analytics, Data Science & Fast Data

1

Kunal [email protected]

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Agenda

1. Introduction to Big Data Analytics

2. Big Data Analytics - Use Cases

3. Technologies for Big Data Analytics

4. Introduction to Data Science

5. Data Science - Use Cases

6. Introduction to Fast Data

7. Fast Data - Use Cases

8. Fast Data meets Big Data

BIG DATA

DATA SCIENCE

FAST DATA

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Agenda

1. Introduction to Big Data Analytics

2. Big Data Analytics - Use Cases

3. Technologies for Big Data Analytics

4. Introduction to Data Science

5. Data Science - Use Cases

6. Introduction to Fast Data

7. Fast Data - Use Cases

8. Fast Data meets Big Data

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Big Data Pioneers

1,000,000,000 Queries A Day

250,000,000 New Photo’s / Day

290,000,000 Updates / Day

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Other Companies using Big Data

4,000,000 Claims / Day

2,800,000,000 Trades / Day

31,000,000,000 Interactions / Day

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Moore’s LawGordon Moore (Founder of Intel)

Number of transistors that can be placed in a processor DOUBLES in approximately every TWO years.

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Introduction to Big Data Analytics

What is Big Data?

What makes data, “Big” Data?

7

Your Thoughts?

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

8Copyright © 2011 EMC Corporation. All Rights Reserved.

• “Big Data” is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value. Requires new data architectures, analytic sandboxes New tools New analytical methods Integrating multiple skills into new role of data scientist

• Organizations are deriving business benefit from analyzing ever larger and more complex data sets that increasingly require real-time or near-real time capabilities

Big Data Defined

Source: McKinsey May 2011 article Big Data: The next frontier for innovation, competition, and productivity

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

9Copyright © 2011 EMC Corporation. All Rights Reserved.

1. Data Volume 44x increase from 2010 to 2020

(1.2zettabytes to 35.2zb)

2. Processing Complexity Changing data structures Use cases warranting additional transformations and

analytical techniques

3. Data Structure Greater variety of data structures to mine and analyze

Key Characteristics of Big Data

Module 1: Introduction to BDA

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Big Data Characteristics: Data StructuresData Growth is Increasingly Unstructured

Module 1: Introduction to BDA 10

Structured

Semi-Structure

d

“Quasi” Structured

Unstructured

• Data containing a defined data type, format, structure

• Example: Transaction data and OLAP

• Data that has no inherent structure and is usually stored as different types of files.

• Example: Text documents, PDFs, images and video

• Textual data with erratic data formats, can be formatted with effort, tools, and time

• Example: Web clickstream data that may contain some inconsistencies in data values and formats

• Textual data files with a discernable pattern, enabling parsing

• Example: XML data files that are self describing and defined by an xml schema

Mo

re S

tru

ctu

red

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Four Main Types of Data Structures

Module 1: Introduction to BDA 11

http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651

The Red Wheelbarrow, by William Carlos Williams

View Source

Structured Data

Semi-Structured Data

Quasi-Structured Data

Unstructured Data

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Driver Examples

Desire to optimize business operations Sales, pricing, profitability, efficiency

Desire to identify business risk Customer churn, fraud, default

Predict new business opportunities

Upsell, cross-sell, best new customer prospects

Comply with laws or regulatory requirements

Anti-Money Laundering, Fair Lending, Basel II

Business Drivers for Big Data Analytics

1

2

3

4

Current Business Problems Provide Opportunities for Organizations to Become More Analytical & Data Driven

Module 1: Introduction to BDA 12

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Challenges with a Traditional Data Warehouse

DepartmentalWarehouse

EnterpriseApplications

Reporting

Non-Prioritized Data Provisioning

Non-Agile Models

“SpreadMarts”

DataSources

SiloedAnalytics

Static schemasaccrete over time

PrioritizedOperational Processes

Errant data & marts

DepartmentalWarehouse

1

2

3

13

4

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Implications of a Traditional Data Warehouse

14

• High-value data is hard to reach and leverage• Predictive analytics & data mining activities are last

in line for data Queued after prioritized operational processes

• Data is moving in batches from EDW to local analytical tools In-memory analytics (such as R, SAS, SPSS, Excel) Sampling can skew model accuracy

• Isolated, ad hoc analytic projects, rather than centrally-managed harnessing of analytics Non-standardized initiatives Frequently, not aligned with corporate business goals

Slow “time-to-insight”

& reduced

business impact

Module 1: Introduction to BDA

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Opportunities for a New Approach to Analytics

New Applications Driving Data Volume

Module 1: Introduction to BDA 15

2000’s(CONTENT & DIGITAL ASSET

MANAGEMENT)

1990’s(RDBMS & DATA

WAREHOUSE)

2010’s(NO-SQL & KEY/VALUE)

VO

LUM

E O

F IN

FOR

MATIO

N

LARGE

SMALL

MEASURED IN

TERABYTES1TB = 1,000GB

MEASURED IN

PETABYTES1PB = 1,000TB

WILL BE MEASURED IN

EXABYTES1EB = 1,000PB

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Considerations for Big Data Analytics

1. Speed of decision making

2. Throughput

3. Analysis flexibility

Analytic SandboxData assets gathered from multiple sources

and technologies for analysis

• Enables high performance analytics using in-db processing

• Reduces costs associated with data replication into "shadow" file systems

• “Analyst-owned” rather than “DBA owned”

Criteria for Big Data Projects New Analytic Architecture

1. Speed of decision making

2. Throughput

3. Analysis flexibility

16

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

State of the Practice in Analytics: Mini-Case StudyBig Data Enabled Loan Processing at XYZ bank

Income

Verifica

tion

Un

der

writ

ing

Ris

k

Employment

History

Credit S

corin

g

And Hist

oryAppra

isal

TraditionalUnderwriting

Risk Level

TRADITIONAL DATA LEVERAGED BIG DATA LEVERAGED

Big Data Enabled UnderwritingRisk Level

17Module 1: Introduction to BDA

Your Thoughts?

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Agenda

1. Introduction to Big Data Analytics

2. Big Data Analytics - Use Cases

3. Technologies for Big Data Analytics

4. Introduction to Data Science

5. Data Science - Use Cases

6. Introduction to Fast Data

7. Fast Data - Use Cases

8. Fast Data meets Big Data

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Big Data Analytics: Industry Examples

Module 1: Introduction to BDA 19

Health Care •Reducing Cost of Care

Public Services•Preventing Pandemics

Life Sciences•Genomic Mapping

IT Infrastructure•Unstructured Data Analysis

Online Services•Social Media for Professionals

RetailPhone/TV

Government Internet

Medical

Financial

DataCol lectors

1

2

3

4

5

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Big Data Analytics: Healthcare

Use of Big Data

Key Outcomes

Situation

•Poor police response and problems with medical care, triggered by shooting of a Rutgers student

•The event drove local doctor to map crime data and examine local health care

•Dr. Jeffrey Brenner generated his own crime maps from medical billing records of 3 hospitals

•City hospitals & ER’s provided expensive care, low quality care•Reduced hospital costs by 56% by realizing that 80% of city’s medical costs came from 13% of its residents, mainly low-income or elderly

•Now offers preventative care over the phone or through home visits

1

20Module 1: Introduction to BDA

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Big Data Analytics: Public Services

Use of Big Data

Key Outcomes

Situation

•Threat of global pandemics has increased exponentially

•Pandemics spreads at faster rates, more resistant to antibiotics

•Created a network of viral listening posts •Combines data from viral discovery in the field, research in disease hotspots, and social media trends

•Using Big Data to make accurate predications on spread of new pandemics

• Identified a fifth form of human malaria, including its origin

• Identified why efforts failed to control swine flu

•Proposing more proactive approaches to preventing outbreaks

2

Module 1: Introduction to BDA 21

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Big Data Analytics: Life Sciences

Use of Big Data

Key Outcomes

Situation •Broad Institute (MIT & Harvard) mapping the Human Genome

• In 13 yrs, mapped 3 billion genetic base pairs; 8 petabytes

•Developed 30+ software packages, now shared publicly, along with the genomic data

•Using genetic mappings to identify cellular mutations causing cancer and other serious diseases

• Innovating how genomic research informs new pharmaceutical drugs

3

Module 1: Introduction to BDA 22

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Big Data Analytics: IT Infrastructure

Use of Big Data

Key Outcomes

Situation•Explosion of unstructured data required new technology to analyze quickly, and efficiently

•Doug Cutting created Hadoop to divide large processing tasks into smaller tasks across many computers

•Analyzes social media data generated by hundreds of thousands of users

•New York Times used Hadoop to transform its entire public archive, from 1851 to 1922, into 11 million PDF files in 24 hrs

•Applications range from social media, sentiment analysis, wartime chatter, natural language processing

4

Module 1: Introduction to BDA 23

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Big Data Analytics: Online Services

Use of Big Data

Key Outcomes

Situation •Opportunity to create social media space for professionals

•Collects and analyzes data from over 100 million users

•Adding 1 million new users per week

•LinkedIn Skills, InMaps, Job Recommendations, Recruiting

•Established a diverse data scientist group, as founder believes this is the start of Big Data revolution

5

Module 1: Introduction to BDA 24

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Agenda

1. Introduction to Big Data Analytics

2. Big Data Analytics - Use Cases

3. Technologies for Big Data Analytics

4. Introduction to Data Science

5. Data Science - Use Cases

6. Introduction to Fast Data

7. Fast Data - Use Cases

8. Fast Data meets Big Data

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Greenplum Unified Analytic Platform

Partner Tools & Services

GREENPLUM CHORUS – Analytic Productivity Layer

Greenplum gNet

GREENPLUM DATABASE

Data Scientist

Data Engineer

Data Analyst Bl Analyst

LOB User

Data Platform Admin

DA

TA S

CIE

NC

E T

EA

M

Cloud, x86 Infrastructure, or Appliance

GREENPLUMHD

Unify your team

Drive Collaboration

Keep Your Options Open

The Power of Data Co-Processing

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Greenplum Hadoop

STRUCTURED UNSTRUCTURED

HiveMapReduce

PigXML, JSON, … Flat files

Schema on load

Directories

No ETLJava

SequenceFile

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Greenplum Database

STRUCTURED UNSTRUCTURED

SQL

RDBMS

Tables and Schemas

GreenplumMapReduce

Indexing

PartitioningBI Tools

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

• A framework for handling big data An implementation of the MapReduce paradigm Hadoop glues the storage and analytics together and provides reliability,

scalability, and management

What do we Mean by Hadoop

Storage (Big Data) HDFS – Hadoop Distributed

File System Reliable, redundant,

distributed file system optimized for large files

MapReduce (Analytics) Programming model for

processing sets of data Mapping inputs to outputs and

reducing the output of multiple Mappers to one (or a few) answer(s)

Two Main Components

30Module 5: Advanced Analytics - Technology and Tools

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Hadoop Distributed File System

31Module 5: Advanced Analytics - Technology and Tools

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

MapReduce and HDFS

Task TrackerTask Tracker Task Tracker

Job Tracker

Hadoop Distributed File System (HDFS)

Client/Dev

Large Data Set(Log files, Sensor Data)

Map Job

Reduce Job

Map Job

Reduce Job

Map Job

Reduce Job

Map Job

Reduce Job

Map Job

Reduce Job

Map Job

Reduce Job

2

1

3

4

32Module 5: Advanced Analytics - Technology and Tools

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

• As you move from Pig to Hive to HBase, you are increasingly moving away from the mechanics of Hadoop and get an RDBMS view of the Big Data world

Components of Hadoop

HBase Queries against defined tables

Hive SQL-based language

Pig Data flow language & Execution environment

More HadoopVisible

Less HadoopVisible

DBMS View

Mechanics of Hadoop

33Module 5: Advanced Analytics - Technology and Tools

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Greenplum DatabaseExtreme Performance for Analytics

• Optimized for BI and analytics Deep integration with statistical packages

High performance parallel implementations

• Simple and automatic parallelization Just load and query like any database

Tables are automatically distributed across nodes

No need for manual partitioning or tuning

• Extremely scalable

MPP* shared-nothing architecture All nodes can scan and process in parallel

Linear scalability by adding nodes where each node adds storage, query & load performance

*MPP – Massive Parallel Processing

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Greenplum DB & HD

Massively Parallel Access and Movement

Maximize Solution Flexibility

Minimize Data Duplication

Access Hadoop Data in Real Time From Greenplum DB

Import and export in Text, Binary and Compressed Formats

Custom formats via user-written MapReduce Java program And GPDB Format classes

gNet

10Gb Ethernet

Greenplum DB Hadoop

Node 1

Node 2

Node 3

Segment 1

Segment 2

Segment 3

GP DB Master Host

MapReduce

User-Defined

Binary

TextExternal

Tables

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Analytical Software

Exploiting ParallelismIn-Database Analytics

Analytic Results

Interconnect

Storage

Independent Segment Processors

Independent Memory

Independent Direct Storage

Connection

Master Segment Processor

Interconnect Switch

Math & Statistical Functions

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Agenda

1. Introduction to Big Data Analytics

2. Big Data Analytics - Use Cases

3. Technologies for Big Data Analytics

4. Introduction to Data Science

5. Data Science - Use Cases

6. Introduction to Fast Data

7. Fast Data - Use Cases

8. Fast Data meets Big Data

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Big Data Requires Data Science

Data Science

• Predictive analysis

• What if…..?

Business Intelligence

• Standard reporting

• What happened?

High

FuturePast

TIME

BUSINESS VALUE

Business Intelligence

Data Science

Low

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Data science and business intelligence

“BIG DATA ANALYTICS”

“TRADITIONAL BI”

GBs to 10s of TBs

Operational

Structured

Repetitive

10s of TB to Pb’s

External + Operational

Mostly Semi-Structured

Experimental, Ad Hoc

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Profile of a Data Scientist

Module 1: Introduction to BDA 46

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

People and Ecosystem

Domain

Data Science as a Process

Data PrepVariable Selection

ModelBuilding

Model Execution

Communication &

Operationalization

Evaluate

• People• Scientists / Analysts• Business Analysts• Consumers of analysis• Stakeholders• EMC sales and services

• Ecosystem• Sector (Telecom, banking, security agency etc.)• Modeling software and other tools used by

analysts (MADlib, SAS, R etc.)• Database (Greenplum) & Data Sources

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

People and Ecosystem

Domain

Data Science as a Process

Data PrepVariable Selection

ModelBuilding

Model Execution

Communication &

Operationalization

Evaluate

Discovery & prioritized identification of opportunities• Customer Retention• Fraud detection• Pricing• Marketing effectiveness and

optimization• Product Recommendation• Others……

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

People and Ecosystem

Domain

Data Science as a Process

Data PrepVariable Selection

ModelBuilding

Model Execution

Communication &

Operationalization

Evaluate

• What are the data sources?• Do we have access to them?• How big are they?• How often are they updated?• How far back do they go?• Which of these data sources are being used

for analysis? Can we use a data source which is currently unused? What problems would that help us solve?

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

People and Ecosystem

Domain

Data Science as a Process

Data PrepVariable Selection

ModelBuilding

Model Execution

Communication &

Operationalization

Evaluate

• Selection of raw variables which are potentially relevant to problem being solved

• Transformations to create a set of candidate variables

• Clustering and other types of categorization which could provide insights

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

People and Ecosystem

Domain

Data Science as a Process

Data StepVariable Selection

ModelBuilding

Model Execution

Communication &

Operationalization

Evaluate

Pick suitable statistics, or suitable model form and algorithm and build model

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

People and Ecosystem

Domain

Data Science as a Process

Data PrepVariable Selection

ModelBuilding

Model Execution

Communication &

Operationalization

Evaluate

The model needs to be executable in database on big data with reasonable execution time

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

People and Ecosystem

Domain

Data Science as a Process

Data PrepVariable Selection

ModelBuilding

Model Execution

Communication &

Operationalization

Evaluate

The model results need to be communicated & operationalized to have a measurable impact on the business

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

People and Ecosystem

Domain

Data Science as a Process

Data PrepVariable Selection

ModelBuilding

Model Execution

Communication &

Operationalization

Evaluate

• Accuracy of results and forecasts• Analysis of real-world experiments• A/B testing on target samples • End-user and LOB feedback

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Agenda

1. Introduction to Big Data Analytics

2. Big Data Analytics - Use Cases

3. Technologies for Big Data Analytics

4. Introduction to Data Science

5. Data Science - Use Cases

6. Introduction to Fast Data

7. Fast Data - Use Cases

8. Fast Data meets Big Data

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Use Case 1 Trip modeling

Problem: Analyze behaviour of visitors to MakeMyTrip.com

Particularly interested in unregistered visitors– About 99% of total visitor traffic

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Applications of model• Tailor promotions for popular types of trips

Most popular types probably already well-known; potential in next tier down

• ... and for different types of customers

• Present customised promotions to visitors based on clicks

• Ad optimization: present ads based on modelled behavior

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Hypertargeting• Serving content to customers based on individual

characteristics and preferences, rather than broad generalizations

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Available data• Data available from server:

Date/time IP address Parts of site visited

• Geographic location can be obtained via geo lookup on IP

• Personal information available for registered visitors only

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Approach• Use clustering to identify trip/visitor types

Sport (IPL,F1, Football, etc) Festivals Other seasonal movements

• Decision trees to predict which type of trip a visitor is likely to make Based on successively more information as they move

through the site

• Use registered visitor info to augment models

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Use Case 2 Municipal traffic analysis

• Client domain: Municipal city government

• Available data:Cross-city loop detectors measuring traffic volumeDetailed city bus movement information from Bluetooth devicesVideo detection of traffic volume, velocity

• Goal: Exploit available data for unrealized business insights and values

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Data loading and manipulation

• Parallel data loading– Data loaded from local file system and distributed across Greenplum

servers in parallel.– Loading 9 months of traffic volume data (16 GB, 464 million rows) in 69.4

seconds.

• SQL data manipulation– Standard SQL permits city personnel to use existing skillsets.– Greenplum SQL extensions offer the control over data distribution.– Open source packages (e.g. in Python, R) can be conveniently deployed

within Greenplum for visualization and analytics purposes.

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Basic reporting on traffic volume• Easy generation of reports via straightforward user-defined functions

• Standard graphing utilities called from within Greenplum to create figures

• Detector downtimes can be clearly spotted in the figure, or via an SQL query, thus mitigating maintenance challenges

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Basic reporting on city buses• Data from Bluetooth devices has a wealth of information on city

buses that we can report on: Travel route of each bus Deviations of arrival times compared to provided timetable Occurrences of driver errors (e.g. taking a wrong turn) and possible

causes Occurrences where the same bus service arrives at the same stop

within seconds of each other Whether new bus services translates into lower traffic volume on

introduced roads

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Result visualizations (Google Earth)

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Applications for traffic network modelling

• Compute the fastest path between any two locations at a

future time point

• Identify potential bottlenecks in the traffic

• Identify phase transition points for massive traffic congestion

using simulation techniques

• Study the likely impact of new roads and traffic policies,

without having to observe real disruptive events to

determine the impact

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

• Greenplum’s parallel architecture permits traffic network analysis on a city scale

• Travel time can be predicted via model learning, involving hundreds of thousands of optimizations in parallel, across the entire traffic network

• Variables that can be considered include Distance between two locations Concurrent traffic volume Time of day Weather Construction work

• Computationally prohibitive for traditional non-parallel database environments

Parallel traffic network modelling

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Use Case 3 - Product Recommendation Analysis

• Eight banks became one Branches across the US

• Consolidation of products and customers Employees faced with new products and

customers Visibility into churn and retention was

challenged

• Analytics focus was historically reporting-centric Descriptive “hindsight”`

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Customer Segmentation

Customer segments– First, define a measurement of

customer value– Then create clusters of

customers based on customer value, and then product profiles.

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Association Rules

Product associations– Now find products that are

common in the segment, but not owned by the given household.

Product AProduct B

Product XProduct YProduct Z

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Product Recommendations

Next best offer– Now, filter down to products

associated with high-value customers in the same segment.

Product AProduct B

Product XProduct YProduct Z

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Increased customer value

Customer Comments– “The Greenplum Solution has

scaled from 6 to 11 TB of data.”– Moved from 7 hours /month of

data to 7.5 hours / 2.5 years of data

Product Recommender

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Agenda

1. Introduction to Big Data Analytics

2. Big Data Analytics - Use Cases

3. Technologies for Big Data Analytics

4. Introduction to Data Science

5. Data Science - Use Cases

6. Introduction to Fast Data

7. Fast Data - Use Cases

8. Fast Data meets Big Data

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Module #: Module Name 74

Ferrari Freight Train

0-100 KMPH 2.3 seconds 100 seconds

Top Speed 360 KMPH 140 KMPH

Stops / hr 1000 5

Horse Power 660 bhp 16,000 bhp

Throughput 220 KG in 27 mins 55000000 KG in 60 mins

VS

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Module #: Module Name 75

Fast Data Big Data

Transactions / Second

100000+ per second n.a

Concurrent hits 10000 + per sec 10 per second

Update Patterns Read / Write Appends

Data Complexity Simple Joins on a few tables

Can be highly complex

Data Volumes GB’s / TB PB to ZB

Access Tools GemFire / SQLFire GP DB, GP Hadoop

VSFast Data Big Data

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Not a fast OLTP DB!

APPLICATION(S)

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Fast Data is • More than just an OLTP DB

• Super Fast access to Data

• Server side flexibility

• Data is HA

• Supports transactions

• Setup is fault tolerant

• Can handle thousands of concurrent hits

• Distributed hence horizontally scalable

• Runs on cheap x86 hardware

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

CAP TheoremA distributed system can only achieve TWO out of the three qualities of Consistency, Availability and Partition Tolerance

C A Ponsistency vailability artition Tolerence

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Fast Data =

• Service Loose Coupling

• Data Transformation

• System Integration

+ Service Bus

• Guaranteed Delivery

• Event Propagation

• Data Distribution

+ Messaging System

• Event Driven Architectures

• Real-time Analysis

• Business Event Detection

+ Complex Event Processor

Fast Data combines select features from all of these products and combines them into a low-latency, linearly scalable, memory-based data fabric

• Storage

• Persistence

• Transactions

• Queries

Database• High Availability

• Load Balancing

• Data Replication

• L1 Caching

• Map-Reduce, Scatter-Gather

• Distributed Task Assignment

• Task Decomposition

+ Grid Controller

• Result Summarization

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

A Typical Fast Data Setup

Web Tier

Application Tier

Load BalancerAdd/remove web/application/data servers

Add/remove storage

Database Tier

Storage Tier

Disks may be direct or network attached

Optional reliable, asynchronous feedto a Big Data Store

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Memory-based Performance

PerformFast Data uses memory on a peer machine to make data updates durable, allowing the updating thread to return 10x to 100x faster than updates that must be written through to disk, without risking any data loss. Typical latencies are in the few hundreds of microseconds instead of in the tens to hundreds of milliseconds.

One can optionally write updates to disk / data warehouse / big data store asynchronously and reliably.

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

WAN Distribution

Distribute

Fast Data can keep clusters that are distributed around the world synchronized in real-time and can operate reliably in Disconnected, Intermittent and Low-Bandwidth network environments.

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Distributed Events

Targeted, guaranteed delivery, event notification and Continuous Queries

Notify

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Parallel Queries

Batch Controller or Client

Scatter-Gather (Map-Reduce) Queries

Compute

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Data-Aware Routing

Execute

Fast Data provides ‘data aware function routing’ – moving the behavior to the correct data instead of moving the data to the behavior.

Batch Controller or Client

Data Aware Function

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Accessing Fast Data

Stores Objects (Java, C++, C#, .NET) or unstructured data

Spring-GemFire

Stores Relational Data with SQL interface

Supports JDBC, ODBC, Java and .NET interfaces

Key-Value store with OQL Queries

Uses existing relational tools

Order

Order Line Item

Quantity

Discount

Product

SKU

Unit Price

L2 Cache plugin for Hibernate

HTTP Session replication module

GemFire

SQLFire

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Agenda

1. Introduction to Big Data Analytics

2. Big Data Analytics - Use Cases

3. Technologies for Big Data Analytics

4. Introduction to Data Science

5. Data Science - Use Cases

6. Introduction to Fast Data

7. Fast Data - Use Cases

8. Fast Data meets Big Data

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Use Cases

Applying the technology

A few examples of Fast Data technology

applied to real business cases

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

A mainframe-based, nightly customer account reconciliation batch run

Mainframe Migration

min

0 12060

I/O Wait9%

CPU Busy15%

Mainframe

CPU Unavailable76%

COTS ClusterBatch now runs in 60 seconds

93% Network Wait! Time could have been reduced further with higher network bandwidth

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Mainframe Migration

So What? So the batch runs faster – who cares?

1. It ran on cheaper, modern, scalable hardware

2. If something goes wrong with the batch, you only wait 60 seconds to find out

3. Now, the hardware and the data are available to do other things in the remaining 119 minutes:

• Fraud detection

• Regulatory compliance

• Re-run risk calculations with 119 different scenarios

• Up sell customers

4. You can move from batch to real-time processing!

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Online Betting

A popular online gambling site attracts new players through ads on affiliate sites

Customized Banner Ad on affiliate site

Affiliate's Web Server

1 Banner Ad Server

23

4

In a fraction of a second, the banner ad sever must:Generate a tracking id specific to the request

Apply temporal, sequential, regional, contractual and other policies in order to decide which banner to deliver

Customize the banner

Record that the banner ad was delivered

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Online Betting (Contd.)

Their initial RDBMS-based system

Limited their ability to sign up new affiliates

Limited their ability to add new products on their site

Limited the delivery performance experienced by their affiliates and their customers

Limited their ability to add additional internal applications and policies to the process

Their new Fast Data based systemResponded with sub-millisecond latency

Met their target of 2500 banner ad deliveries per second

Provides for future scalability

Improved performance to the browser by 4x

Cost less

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Asset/Position Monitoring

Centralized data storage was not possible

Multi-agency, multi-force integration

Numerous Applications needed access to multiple data sources simultaneously

Networks constantly changing, unreliable, mobile deployments

Upwards of 60,000 object updates each minute

Over 70 data feeds

Needed a real-time situational awareness system to track assets that could be used by the war fighters in theatre

Northrop Grumman (integrator) investigated the following technologies before deciding on GemFire• RDBMS – Oracle, Sybase, Postgres, TimesTen, MySQL

• ODBMS - Objectivity

• jCache – GemFire, Oracle Coherence

• JMS – SonicMQ, BEA Weblogic, IBM, jBoss

• TIBCO Rendezvous

• Web Services

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Asset/Position Monitoring

655 sites, 11 thousand users

Real-time, 3 dimensional, NASA World Wind User Interface

60,000 Position updates per minute

Real time info available on the desk of

President of the United States

US Secretary of Defense

Each of the Joint Chiefs of Staff

Every commander in the US Military

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Low-latency trade insertionPermanent Archival of every tradeKept pace with fast ticking market dataRapid, Event Based Position CalculationDistribution of Position Updates GloballyConsistent Global Views of PositionsPass the BookRegional Close-of-dayHigh AvailabilityDisaster RecoveryRegional Autonomy

The project achieved:

Global Foreign Exchange Trading System

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Global Foreign Exchange Trading System

In that same application, Fast Data replaced:

Sybase Database In Every RegionStill need 1 instance for archival purposes

TIBCO Rendezvous for Local Area MessagingIBM MQ Series for WAN DistributionVeritas N+1 Clustering for H/AIn fact, we save the physical +1 node itself

3DNS or Wide IPAdmin personnel reduced from 1.5 to 0.5

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Agenda

1. Introduction to Big Data Analytics

2. Big Data Analytics - Use Cases

3. Technologies for Big Data Analytics

4. Introduction to Data Science

5. Data Science - Use Cases

6. Introduction to Fast Data

7. Fast Data - Use Cases

8. Fast Data meets Big Data

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Application High Level Overview

APPLICATION(S)

Single DB cant handle bothOLTP and OLAP

workloads

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC2 PROVEN PROFESSIONAL

Copyright © 2011 EMC Corporation. All Rights Reserved.

Big Data Setup

AP

PLIC

ATIO

N(S

)

How to get the best of Fast & Big Data

Fast DataSetup

In case record isn't available

Concurrent hits