Post on 02-Jul-2015
The Big Data SaaS Company|
The Big Data SaaS Company
Big Data as a Service
Joydeep Sen Sarma
Who’s Qubole
• Founded 10/2011:
– Ashish Thusoo & Joydeep Sen Sarma (Apache Hive, Facebook)
– Plus alumni of Oracle, Greenplum, Vertica, Aster, Karmasphere, Terracotta, Microsoft
• Rapidly growing:
– Engineering: Palo Alto (5), Bangalore (16)
– Business: Palo Alto (4)
• Series-A from LightSpeed and Charles River
Thesis
Managed Big Data as a Service in the Cloud
• SaaS will displace shipped software
• Cloud will displace bare-metal
• Big Data already displacing RDBMS
Big Data Puzzle
• Cloud Orchestration (Whirr) or Compute + Storage
• Hadoop
• Scheduler (Oozie)
• Hive/Pig
• Mahout/Weka
• Operations Dashboard
• GUI (Hue)
• Interfaces (ODBC/JDBC)
• Data Connectors (MongoAdaptor..)
Meet “Qubole”
Cloud Orchestration (Whirr) or Compute + Storage
Hadoop, Scheduler (Oozie), Hive/Pig, Mahout/Weka
Operations Dashboard, GUI (Hue), Interfaces (ODBC/JDBC)
Data Connectors (MongoAdaptor..)
• Fully Integrated Big Data Service
• Users focus on analyzing and building data-driven apps
• Qubole manages infrastructure, cloud provisioning
Customers
Use Cases
• Summarizing Logs and Reporting
• Data Integration
• Ad-Hoc analysis of Historical Data
• Preparing Data for Data Mining
• Indexing Data for Search
• Users
– Developers (of end-products) – Java/C++/Python
– ETL and Data Engineers – SQL/Java/Python
– Analysts – SQL / R
Qubole Data Service
Integrate – Analyze – Schedule – Visualize
[Architecture diagram. Components shown: Hadoop, Sqoop, Oozie, Pig, Hive, Presto; AWS S3 and AWS EC2; data sources S3://adco/logs, MySQL, Vertica]
Now on GCE!
What Users Like
• Simplicity
– Great Visual User Interface
– Zero Operations
– Accessible to Analysts (i.e. non-Engineers)
• Efficiency
– Significantly faster than competition (in most cases)
– Cluster Consolidation is a game changer
– Spot Instance integration
What Users Like
• Managed Service Model
– Constantly Upgrading software
– Support when needed
– Dealing with AWS issues
• Nine-Course Meal
– Seamless integration of Hadoop/Hive/Pig/..
– Unified Command/Workflow model (also Simplicity)
– Fewer things to learn/manage:
• “Please help us avoid Pentaho, Tableau, …”
Core Technology
• Auto-Scaling Hadoop Clusters in Cloud
– Including OpenStack, Rackspace, GCE etc
• Fastest Hive SaaS
– Numerous Optimizations for Cloud Storage
– 5x faster than EMR
• Connectors
– RDBMS, MongoDB/NoSQL, GA
– Incremental Data Scrapes
• Job Scheduler
– Dependencies, Workflows, Incremental Jobs
Auto-Scaling
select t.county, count(1) from (select transform(a.zip) using 'geo.py' as county from SMALL_TABLE a) t group by t.county;
insert overwrite table dest
select a.id, a.zip, count(distinct b.uid)
from ads a join LARGE_TABLE b on (a.id=b.ad_id) group by a.id, a.zip;
hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…
AdCo Hadoop
insert overwrite table dest
select … from ads join campaigns on … group by …;
Scaling Up
[Diagram labels: StarCluster, AWS, Master (Job Tracker), Slaves, Map Tasks, Reduce Tasks, Progress, Demand, Supply]
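The Scaling Up slide contrasts demand (outstanding map and reduce tasks) with supply (capacity on the slaves). The slides do not give the actual decision rule, so the following is only a minimal sketch under assumed conditions: a fixed number of task slots per node and a hard cap on cluster size. The function name `nodes_to_add` and all of its parameters are hypothetical.

```python
import math


def nodes_to_add(pending_tasks, free_slots, slots_per_node,
                 current_nodes, max_nodes):
    """Return how many nodes to request so supply can cover demand.

    Assumption (not from the slides): every node contributes the same
    number of task slots, and the cluster may never exceed max_nodes.
    """
    if slots_per_node <= 0:
        raise ValueError("slots_per_node must be positive")

    # Demand that the current supply cannot absorb.
    shortfall = pending_tasks - free_slots
    if shortfall <= 0:
        return 0

    # Round up: a partially used node still has to be started.
    wanted = math.ceil(shortfall / slots_per_node)

    # Never grow past the configured maximum cluster size.
    headroom = max(max_nodes - current_nodes, 0)
    return min(wanted, headroom)
```

For example, `nodes_to_add(100, 20, 8, 5, 20)` asks for enough nodes to cover a shortfall of 80 tasks at 8 slots per node.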
Scaling Down
1. On hour boundary – check if node is required:
– Can’t remove nodes with map-outputs (today)
– Don’t go below minimum cluster size
2. Remove node from Map-Reduce Cluster
3. Request HDFS Decommissioning – fast!
– Delete affected cache files instead of re-replicating
– One surviving replica and we are done.
4. Delete Instance
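The first scale-down step (deciding which nodes may go) can be sketched as a filter over the cluster. This is an illustration only: the `Node` fields, the `boundary_window` of a few minutes before the hour boundary, and the use of a `busy` flag to stand in for "node is required" are all assumptions, not details from the slides.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    minutes_into_hour: int  # minutes since this node's current billed hour began
    has_map_outputs: bool   # still holds map outputs that reducers may need
    busy: bool              # hypothetical stand-in for "node is required"


def removable_nodes(nodes, min_cluster_size, boundary_window=5):
    """Pick nodes that could be released at their hour boundary.

    A node qualifies only if it is close to the end of its hour,
    holds no map outputs, and is not busy. The result is truncated
    so the cluster never drops below min_cluster_size.
    """
    candidates = [
        n for n in nodes
        if n.minutes_into_hour >= 60 - boundary_window
        and not n.has_map_outputs
        and not n.busy
    ]
    max_removals = max(len(nodes) - min_cluster_size, 0)
    return candidates[:max_removals]
```

Nodes returned here would then go through the remaining steps: removal from the Map-Reduce cluster, HDFS decommissioning, and instance deletion.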
Fastest Hive SaaS
• Works with Small Files!
– Faster Split Computation (8x)
– Prefetching S3 files (30%)
• Direct writes to S3
– HIVE-1620
• Multi-Tenant Hive Server
– HIVE-4226
• Stable JVM Reuse!
– Fix re-entrancy issues
– 1.2-2x speedup
• Columnar Cache
– Use HDFS as cache for S3
– Up to 5x faster for JSON data
• 5x faster than EMR in TPCH against S3
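The "use HDFS as cache for S3" idea follows a read-through pattern: serve from the fast local copy when present, otherwise fetch from the slow remote store and keep a copy. The sketch below shows only that generic pattern with an in-memory dict standing in for HDFS and a caller-supplied function standing in for S3. It does not model the columnar conversion, and `ReadThroughCache` and `fetch_remote` are hypothetical names.

```python
class ReadThroughCache:
    """Generic read-through cache; the dict stands in for HDFS."""

    def __init__(self, fetch_remote):
        # fetch_remote(key) -> bytes; stands in for a read from S3.
        self._fetch_remote = fetch_remote
        self._store = {}
        self.hits = 0
        self.misses = 0

    def read(self, key):
        """Return cached data, fetching and caching it on a miss."""
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        data = self._fetch_remote(key)
        self._store[key] = data
        return data

    def evict(self, key):
        """Drop a cached entry; the remote copy remains authoritative."""
        self._store.pop(key, None)
```

Because the remote store stays authoritative, losing a cached entry costs only a refetch, which is why cache files can simply be deleted rather than re-replicated when a node is decommissioned.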
Spot Instance Integration
Up to 90% off
Spot Instance Integration
• Can lose Spot nodes anytime
– Disastrous for HDFS
– Hybrid Mode: Use mix of On-Demand and Spot
– Hybrid Mode: Keep one replica in On-Demand nodes
• Spot Instances may not be available
– Timeout and use On-Demand nodes as fallback
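Both hybrid-mode ideas can be sketched in a few lines: pin one replica of each block to an On-Demand node, and fall back to On-Demand capacity when Spot requests are not filled in time. This is an illustrative sketch, not Qubole's implementation. The function names, the default replication factor of 3, and the `request_spot` / `request_on_demand` callbacks are assumptions.

```python
def place_replicas(on_demand_nodes, spot_nodes, replication=3):
    """Choose nodes for one block, keeping a replica on On-Demand.

    The first replica always goes to an On-Demand node so that losing
    every Spot node still leaves one surviving copy.
    """
    if not on_demand_nodes:
        raise ValueError("hybrid mode needs at least one On-Demand node")
    placement = [on_demand_nodes[0]]
    # Fill the remaining replicas from Spot nodes first, then from
    # any other On-Demand nodes.
    others = list(spot_nodes) + list(on_demand_nodes[1:])
    placement.extend(others[:replication - 1])
    return placement


def acquire_nodes(count, request_spot, request_on_demand, timeout_s):
    """Try Spot first; cover any shortfall with On-Demand nodes.

    request_spot(count, timeout_s) is a hypothetical callback that
    returns the nodes granted before the timeout (possibly none).
    request_on_demand(count) is a hypothetical callback that returns
    exactly `count` On-Demand nodes.
    """
    granted = list(request_spot(count, timeout_s))
    missing = count - len(granted)
    if missing > 0:
        granted.extend(request_on_demand(missing))
    return granted
```

A real placement policy would also spread replicas across racks or zones; that is left out here to keep the two hybrid-mode rules visible.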
Closing Thoughts
• AWS (/Cloud) is the new BIOS
• Large multi-tenant [I/S]aaS is the new mainframe
– Feedback loop is not available to average developers
– Will be dominated by a few large companies
• Open Source is the ocean that lifts the SaaS boat
– But the boat has proprietary stuff
– SaaS requires software innovation at a different pace
• SaaS has network effects
– Static software cannot keep up with rapidly evolving SaaS
Questions?
Me: joydeep@qubole.com
Us: @Qubole
Free Trial: www.qubole.com