NATC 2013 - Big Data as a Service

NASSCOM Annual Technology Conference 2013. Speaker: Joydeep Sen Sarma, Co-Founder, Qubole

Transcript of NATC 2013 - Big Data as a Service

Page 1: NATC 2013 - Big Data as a Service

The Big Data SaaS Company|

The Big Data SaaS Company

Big Data as a Service

Joydeep Sen Sarma

Page 2: NATC 2013 - Big Data as a Service

Who’s Qubole

• Founded 10/2011:
– Ashish Thusoo & Joydeep Sen Sarma (Apache Hive, Facebook)
– Plus alumni of Oracle, Greenplum, Vertica, Aster, Karmasphere, Terracotta, Microsoft

• Rapidly growing:
– Engineering: Palo Alto (5), Bangalore (16)
– Business: Palo Alto (4)

• Series-A from LightSpeed and Charles River

Page 3: NATC 2013 - Big Data as a Service

Thesis

Managed Big Data as a Service in the Cloud

• SaaS will displace shipped software
• Cloud will displace bare-metal
• Big Data already displacing RDBMS

Page 4: NATC 2013 - Big Data as a Service

Big Data Puzzle

• Cloud Orchestration (Whirr) or Compute + Storage
• Hadoop
• Scheduler (Oozie)
• Hive/Pig
• Mahout/Weka
• Operations Dashboard
• GUI (Hue)
• Interfaces (ODBC/JDBC)
• Data Connectors (MongoAdaptor..)

Page 5: NATC 2013 - Big Data as a Service

Meet “Qubole”

Components: Cloud Orchestration (Whirr) or Compute + Storage, Hadoop, Scheduler (Oozie), Hive/Pig, Mahout/Weka, Operations Dashboard, GUI (Hue), Interfaces (ODBC/JDBC), Data Connectors (MongoAdaptor..)

• Fully Integrated Big Data Service
• Users focus on analyzing and building data-driven apps
• Qubole manages infrastructure, cloud provisioning

Page 6: NATC 2013 - Big Data as a Service

Customers

Page 7: NATC 2013 - Big Data as a Service

Use Cases

• Summarizing Logs and Reporting

• Data Integration

• Ad-Hoc analysis of Historical Data

• Preparing Data for Data Mining

• Indexing Data for Search

• Users

– Developers (of end-products) – Java/C++/Python

– ETL and Data Engineers – SQL/Java/Python

– Analysts – SQL / R

Page 8: NATC 2013 - Big Data as a Service

Qubole Data Service

Integrate – Analyze – Schedule – Visualize

[Diagram labels]
• Engines and tools: Hadoop, Hive, Pig, Oozie, Sqoop, Presto!
• Cloud: AWS S3, AWS EC2
• Data stores shown: S3://adco/logs, MySQL, Vertica

Page 9: NATC 2013 - Big Data as a Service

Now on GCE!

Page 10: NATC 2013 - Big Data as a Service

What Users Like

• Simplicity
– Great Visual User Interface
– Zero Operations
– Accessible to Analysts (i.e. non-Engineers)

• Efficiency
– Significantly faster than competition (in most cases)
– Cluster Consolidation is a game changer
– Spot Instance integration

Page 11: NATC 2013 - Big Data as a Service

What Users Like

• Managed Service Model
– Constantly Upgrading software
– Support when needed
– Dealing with AWS issues

• Nine-Course Meal

– Seamless integration of Hadoop/Hive/Pig/..

– Unified Command/Workflow model (also Simplicity)

– Fewer things to learn/manage:

• “Please help us avoid Pentaho, Tableau, …”

Page 12: NATC 2013 - Big Data as a Service

Core Technology

• Auto-Scaling Hadoop Clusters in Cloud
– Including OpenStack, Rackspace, GCE etc

• Fastest Hive SaaS
– Numerous Optimizations for Cloud Storage
– 5x faster than EMR

• Connectors
– RDBMS, MongoDB/NoSQL, GA
– Incremental Data Scrapes

• Job Scheduler
– Dependencies, Workflows, Incremental Jobs

Page 13: NATC 2013 - Big Data as a Service

Auto-Scaling

select t.county, count(1)
from (select transform(a.zip) using 'geo.py' as county from SMALL_TABLE a) t
group by t.county;

insert overwrite table dest
select a.id, a.zip, count(distinct b.uid)
from ads a join LARGE_TABLE b on (a.id=b.ad_id)
group by a.id, a.zip;

hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…

AdCo Hadoop
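The first query on this slide pipes a.zip through a script named geo.py. Hive's TRANSFORM clause streams each input row to the script's standard input as a tab-separated line and reads output rows back from its standard output. The slide does not show the script itself, so the following is only a sketch of what such a script might look like; the lookup table, the UNKNOWN fallback, and the function names are illustrative assumptions.

```python
"""Hypothetical geo.py: maps a ZIP code to a county for a Hive TRANSFORM clause."""
import sys

# Illustrative lookup only (assumption); a real script would load a full ZIP-to-county table.
ZIP_TO_COUNTY = {
    "94301": "Santa Clara",
    "94105": "San Francisco",
}


def county_for(zip_code):
    """Return the county for a ZIP code, or 'UNKNOWN' if it is not in the table."""
    return ZIP_TO_COUNTY.get(zip_code.strip(), "UNKNOWN")


def main(lines, out):
    """Read one tab-separated row per line; write one county per line."""
    for line in lines:
        # The query passes a single column (a.zip), so only the first field is used.
        zip_code = line.rstrip("\n").split("\t")[0]
        out.write(county_for(zip_code) + "\n")
```

To run it as the streaming script, the file would end with a call such as `main(sys.stdin, sys.stdout)`.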

Page 14: NATC 2013 - Big Data as a Service

Scaling Up

insert overwrite table dest
select … from ads join campaigns on … group by …;

[Diagram labels: StarCluster, AWS, Master, Slaves, Job Tracker, Map Tasks, Reduce Tasks, Progress, Demand, Supply]

Page 15: NATC 2013 - Big Data as a Service

Scaling Down

1. On hour boundary – check if node is required:
– Can’t remove nodes with map-outputs (today)
– Don’t go below minimum cluster size

2. Remove node from Map-Reduce Cluster

3. Request HDFS Decommissioning – fast!
– Delete affected cache files instead of re-replicating
– One surviving replica and we are done.

4. Delete Instance
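The hour-boundary check in the first step can be sketched roughly as below. This is not Qubole's implementation: the Node fields, the boundary_window parameter, and the reading of "required" as "currently running tasks" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    minutes_into_billing_hour: int  # minutes since the node's current paid hour began
    has_map_outputs: bool           # still holds map outputs that reducers may need
    busy: bool                      # currently running tasks (assumed meaning of "required")


def nodes_to_remove(nodes, min_cluster_size, boundary_window=5):
    """Return the nodes that pass the scale-down checks, in input order."""
    removable = []
    remaining = len(nodes)
    for node in nodes:
        # Only consider nodes close to the end of their paid hour.
        if node.minutes_into_billing_hour < 60 - boundary_window:
            continue
        # Skip nodes that are still needed or still hold map outputs.
        if node.busy or node.has_map_outputs:
            continue
        # Never shrink below the minimum cluster size.
        if remaining - 1 < min_cluster_size:
            break
        removable.append(node)
        remaining -= 1
    return removable
```

Each node returned here would then go through the remaining steps in order: removal from the Map-Reduce cluster, HDFS decommissioning, and instance deletion.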

Page 16: NATC 2013 - Big Data as a Service

Fastest Hive SaaS

• Works with Small Files!
– Faster Split Computation (8x)
– Prefetching S3 files (30%)

• Direct writes to S3
– HIVE-1620

• Multi-Tenant Hive Server
– HIVE-4226

• Stable JVM Reuse!
– Fix re-entrancy issues
– 1.2-2x speedup

• Columnar Cache
– Use HDFS as cache for S3
– Up to 5x faster for JSON data

• 5x faster than EMR in TPC-H against S3

Page 17: NATC 2013 - Big Data as a Service

Spot Instance Integration

Up to 90% off

Page 18: NATC 2013 - Big Data as a Service

Spot Instance Integration

• Can lose Spot nodes anytime
– Disastrous for HDFS
– Hybrid Mode: Use mix of On-Demand and Spot
– Hybrid Mode: Keep one replica in On-Demand nodes

• Spot Instances may not be available
– Timeout and use On-Demand nodes as fallback
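The two hybrid-mode rules and the fallback rule on this slide can be illustrated with a small sketch. The function names, the request_spot and request_on_demand callbacks, and the default timeout are hypothetical stand-ins, not a real cloud or Qubole API.

```python
def place_replicas(on_demand_nodes, spot_nodes, replication=3):
    """Choose nodes for a block's replicas, keeping one replica on an On-Demand node."""
    if not on_demand_nodes:
        raise ValueError("hybrid mode needs at least one On-Demand node")
    # The first replica always lands on an On-Demand node, so it survives Spot loss.
    placement = [on_demand_nodes[0]]
    # Remaining replicas prefer Spot nodes, then any other On-Demand nodes.
    others = list(spot_nodes) + list(on_demand_nodes[1:])
    placement.extend(others[: replication - 1])
    return placement


def acquire_nodes(count, request_spot, request_on_demand, timeout_seconds=600):
    """Ask for Spot nodes first; fill any shortfall with On-Demand nodes.

    request_spot(count, timeout_seconds) is a hypothetical callback that returns
    the Spot nodes obtained before the timeout (possibly fewer than requested).
    request_on_demand(count) is a hypothetical callback that returns On-Demand nodes.
    """
    spot = request_spot(count, timeout_seconds)
    shortfall = count - len(spot)
    on_demand = request_on_demand(shortfall) if shortfall > 0 else []
    return spot + on_demand
```

With one replica pinned to On-Demand capacity, losing every Spot node at once still leaves a surviving copy of each block.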

Page 19: NATC 2013 - Big Data as a Service

Closing Thoughts

• AWS (/Cloud) is the new BIOS

• Large multi-tenant [I/S]aaS is the new mainframe
– Feedback loop is not available to average developers
– Will be dominated by a few large companies

• Open Source is the ocean that lifts SaaS Boat
– But Boat has proprietary stuff
– SaaS requires software innovation at different pace

• SaaS has network effects
– Static software cannot keep up with rapidly evolving SaaS

Page 20: NATC 2013 - Big Data as a Service

Questions?

Me:[email protected]

Us: @Qubole

Free Trial: www.qubole.com