Big Data Analytics Platform @ Nokia - Hadoop … Data Analytics Platform @Nokia − Who we are −...

Post on 19-Mar-2018

215 views 1 download

Transcript of Big Data Analytics Platform @ Nokia - Hadoop … Data Analytics Platform @Nokia − Who we are −...

Selecting the Right Tool for the Right Workload

Yekesa Kosuru Nokia

Location & Commerce

Strata + Hadoop World NY - Oct 25, 2012

Big Data Analytics Platform @ Nokia

1 1

•Big Data Analytics Platform @Nokia −Who we are −Use case data flows −Big data platform −Big data challenges

•Selecting the Right Tool for the Right Workload −Hadoop VS SQL −Which analytical database −Why InfiniDB

Agenda

2 2

Nokia Internal Use Only

Great Mobile Products That Sense the World

WIN IN SMART DEVICES

CONNECT THE NEXT BILLION

INVEST IN FUTURE DISRUPTIONS

CREATE A LEADING “WHERE” PLATFORM

3 3

Nokia Internal Use Only

Apps

Smart Data Platform

Content

Positions Maps Traffic Places Directions Guidance

One Platform, Enabling Contextually Rich Mobile Experiences

4 4

Click to edit Master title style Big DATA ANALYTICS Platform @Nokia

5 5

6

Business Challenges • Data silos, missing semantics

• Multiple sources - overlapping, conflicting

• Timely processing of large volumes of data

• Partial, insufficient, inaccurate, inconsistent.. data

• Security, privacy and other policies unknown

Central Analytics Platform created!

7

Statistics • 10’s PB of data all across Nokia

• Multi-tenant, multi-petabyte analytics cluster

• 10-20K+ jobs per day

• 600+ internal users

• 250M+ KV queries

• Over a terabyte flowing every day

• Multiple data centers around the world

Nokia Internal Use Only

Places Data Store (POI)- Use Case

Search

Platform

Data flow

Cloud Infrastructure

Account Management

Places Manager

Suppliers Places API Transaction

al Data HDFS Analytical DB

BI

Place CRUD

2

Supplier Uploads Data

3 Updated Blend

Record

6

Places Data Analytics

5

ETL and Blend places

4

Places Extract Portal

Delivered to OnlineSystems

7

Access Control

Authentication

User Logs In

1

Data Intake

Data Processing

FTP Oozie Sched

MR Blend

Hive Pig

MR SQL

Places Content

Analytics

K-V Store

8

Nokia Internal Use Only

Reports

Analytical DB

Analytics Cluster

Big Data Analytics Platform Data Flows

Data Asset Catalog

Oracle

Dashboards

Data Discovery

InfiniDB

Interactive Queries

Batch Queries

Web Applications

Activity Logs

VShards (NoSQL)

Reference Data

Device Applications

Probes

3rd Party

Device

User Profile

POI, Map

Activity Sensor

Dat

a In

take

ETL,

Alg

orith

ms

Agg

rega

tion

HDFS

9

Nokia Internal Use Only 10

Big Data Analytics Platform Data Flows

Analytical DB

Analytics Cluster

Data Asset Catalog

Oracle

Data Discovery

InfiniDB

Interactive Queries

Batch Queries

Dat

a In

take

ETL,

Alg

orith

ms

Agg

rega

tion

HDFS

• Logical Tiers −Technology Platform −Data Platform −End User Layer (not shown)

Big Data Analytics Platform

ETL,

Alg

orith

ms

Agg

rega

tion

Data Asset Catalog

Data

Dat

a In

take

HDFS

Technology

Analytical DB

Technology Platform

12

Hadoop R VShards (KV) Scribe, FTP Hive, Pig InfiniDB,

Oracle

Export/ Import

Workflow Engine

Config./ Deploy Monitor Alerts Archiver Scheduler

Security/Kerberos & ACL

Cloud Infrastructure

Data Platform

13

Self Serve Tools

ETL, Agg Algorithms Data Quality Data Asset

Catalog

Data, Metadata, Operational Data

Workflow Orchestration

Technology Platform

14

Data Platform – Analytics Lifecycle

Self Serve Tools

ETL, Agg Algorithms Data Quality Data Asset

Catalog

Data, Metadata, Operational Data

Collect Ingest Organize Analyze Deliver

Technology Platform

Nokia Internal Use Only

Data Platform: Managing the Data Asset • Data Quality - garbage in , garbage out − Rules for validating, cleaning data, other heuristics − Trusting your insights − Process Quality − Light weight governance (semantics, integrity, privacy and

quality)

• Data Asset Catalog – describe your data − Capture essential metadata and logical domain models for

assets −physical model, logical model, policies, classifications −dependencies with other assets

− Serves as a entry-point to data browsing and asset discovery − Insulates subject matter experts from physical details of data

asset

Nokia Internal Use Only

Big Data Challenges

• At every level - capture, curate, storage, process, visualize..

• Hadoop or SQL ? − Performance of analytical database ? − Batch or Interactive analysis − Neither SQL nor MR fits all problems

• Data & Metadata Fragmentation

Click to edit Master title style Selecting the Right Tool for the Right Workload

17 17

Nokia Internal Use Only 18

Hadoop VS SQL/Analytical DB

SQL/DW • Discover the question • Interactive/Fast • No coding • Standard industry tools • Mutable (Type 1 SCD) • Schema on Write • Analyst • Time to Wisdom

SQL/Analytical DB • Standard industry

tools • Interactive/Fast

(secs) • No coding, e.g. built-

in functions • Reasonable complex • Discover the

question

Hadoop/Hive/MR • ETL on steroids,

Scale • Batch/slow • Bunch of coding,

arbitrary complex • Harvest & load

into DW • Discover the

answer

19

Why InfiniDB ?

• Works with BI tools (standard JDBC driver)

• Column oriented, MPP, clean architecture

• Horizontal and vertical partitioning, clever pruning

• Stream based MR like processing

• Efficient joins

• No indexes

• Impressive benchmarks

• Cloud deployment model

Nokia Internal Use Only

InfiniDB vs Hive Performance

0

500

1000

1500

2000

2500

A B C D E F G

infiniDB (sec)

Hive (sec)

Query InfiniDB (sec) Hive (sec) A B 76.32 2155.92 C 25.59 1181.48 D 59.72 1497.22 E 1.8 446.5 F 12.38 1307.38 G 24.32 1886.81

Analytic Queries

Exe

cutio

n Ti

me(

secs

)

Click to edit Master title style InfiniDB Under the Hood

21 21

What is InfiniDB?

22

®

Scalable

Fast

Simple

Analytics Data Platform Foundation

23

Analytics Data Platform

Columnar Performance Efficiency

MapReduce style Query Execution

Widely used MySQL Interface

®

InfiniDB Building Blocks

24

Purpose built for big data analytics. •User Module (UM)

•Performance Module (PM)

or …

Single Server

InfiniDB Building Blocks

25

Purpose built for big data analytics. •User Module (UM)

Understands SQL •Performance Module (PM)

Operates on data blocks

or …

Single Server

Nokia Internal Use Only

InfiniDB M/R Style Distribution of Work “Map-Reduce Inside”

InfiniDB DoW Hadoop M/R Scalability Linear Linear

N-squared Problem Avoided Avoided

Latency Low Medium-High

Intermediate Results Handling

Stream-based File-based

Report Language SQL Erlang M/R, Hive, Pig

Tuning Automatic Manual

Real-Time Analytics Real-time access to granular data

Access to pre-defined aggregates

Ad-Hoc Full Ad-Hoc performance None

Data Storage Structured Unstructured

26

Independent InfiniDB Benchmark

Q1 Series 2 table Joins

Q2 Series 3 table Joins

Q3 Series 4 table Joins

Q4 Series 5 table Joins

27

28

Takeaways

• Hadoop is good but….

• Pay attention to data quality

• Hadoop or SQL

• Describe your data

THANK YOU Yekesa Kosuru Distinguished Architect, Nokia yekesa.kosuru@nokia.com www.nokia.com @Nokia Jim Tommaney CTO, Calpont jtomanney@calpont.com www.calpont.com @Calpont, @InfiniDB