Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305


Transforming Big Data with Spark and Shark

Michael Franklin and Matei Zaharia – UC Berkeley

UC BERKELEY

Sources Driving Big Data

It’s All Happening On-line

Every: click, ad impression, billing event, fast forward/pause, friend request, transaction, network message, fault, …

User Generated (Web, Social & Mobile)


Internet of Things / M2M Scientific Computing

Big Data: The Challenges

• Volume: terabytes → petabytes+

• Variety: structured → unstructured

• Velocity: batch → real-time

Our view: More data should mean better answers

• Must deal with vertical and horizontal growth

• Must balance Cost, Time, and Answer Quality

AMP Expedition


Resources for Making Sense at Scale

• Algorithms: Machine Learning and Analytics

• Machines: Cloud Computing

• People: Crowdsourcing & Human Computation

All applied to massive and diverse data.


The AMPLab Big Bets

• New “Big Data” stacks are limited by traditional intellectual borders

• Need Machine Learning / Systems / Database co-design

• Requires cohabitation and real collaboration

• Opportunity to rethink fundamental design points: low latency, variable consistency, cloud-based elastic resources, and the desire for new solutions in the marketplace

• Consider the role of people throughout the entire analytics lifecycle

AMPLab Facts

An integration of faculty interests (*Directors):

Alex Bayen (Mobile Sensing), Anthony Joseph (Security/Privacy), Ken Goldberg (Crowdsourcing), Randy Katz (Systems), *Michael Franklin (Databases), Dave Patterson (Systems), Armando Fox (Systems), *Ion Stoica (Systems), *Mike Jordan (Machine Learning), Scott Shenker (Networking)

+ ~50 amazing students, post-docs, staff & visitors

Organized for collaboration.

AMP Facts (continued)

• Launched February 2011; 6-year duration

• Strong industry and government support

• NSF Expedition and DARPA XData

• BDAS stack components released as BSD/Apache open source (e.g., Spark, Shark, Mesos)

App: Carat - Detection of Smartphone Energy Bugs

• 450,000+ downloads

App: Cancer Tumor Genomics

• Vision: Personalized Therapy

“…10 years from now, each cancer patient is going to want to get a genomic analysis of their cancer and will expect customized therapy based on that information.” Director, The Cancer Genome Atlas (TCGA), Time Magazine, 6/13/11

• UCSF cancer researchers + UCSC cancer genetic database + AMP Lab + Intel; Cluster@TCGA: 5 PB = 20 cancers × 1000 genomes

• Sequencing costs down ~150× → Big Data

David Patterson, “Computer Scientists May Have What It Takes to Help Cure Cancer,” New York Times, 12/5/2011

[Chart: $K per genome, 2001–2014, log scale falling from $100,000 to $0.1]

• See Dave Patterson’s Talk: Thursday 3-4, BDT205

BDAS: The Berkeley Data Analytics System

[Stack diagram — legend: AMPLab (released), AMPLab (in progress), 3rd party]

• MLBase (declarative machine learning)

• BlinkDB (approximate query processing)

• Shark (SQL) + Streaming

• Spark, alongside Hadoop MR, MPI, GraphLab, etc.

• Shared RDDs (distributed memory)

• Mesos (cluster resource manager)

• HDFS

BDAS: Where we’re going

• Time, cost, quality tradeoffs using sampling and the “Bag of Little Bootstraps”

• Refactoring the distributed memory layer for sharing

• Low-latency (real-time) processing via discretized streams

• Graph processing and asynchronous computation

• Declarative machine learning libraries that utilize these interfaces for scalability

• A “logical plan” level to serve as the narrow waist for these and future components

• Integration of the “People” component (e.g., CrowdDB)

For More Information: amplab.cs.berkeley.edu

• Papers and project pages

• News updates and blogs

• Spark User Group and Meetup

• GitHub and Apache Mesos

Deep Dive: Spark and Shark

What is Spark?

• Fast, MapReduce-like engine

• In-memory storage for very fast iterative queries

• General execution graphs

• Up to 100× faster than Hadoop (2–10× even for on-disk data)

• Compatible with Hadoop’s storage APIs: can access HDFS, HBase, S3, SequenceFiles, etc.

“Lightning-Fast Cluster Computing”

What is Shark?

• Port of Apache Hive to run on Spark

• Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)

• Can be more than 100x faster

Project History

• Spark started in 2009, open sourced 2010

• Shark released spring 2012

• In use at Yahoo!, Klout, Airbnb, Foursquare, Conviva, Quantifind & others

• 400+ member meetup, 20+ developers

Spark

• Language-integrated API in Scala, Java and soon Python

• Can be used interactively from Scala and Python shells

• Lets users manipulate distributed collections (“resilient distributed datasets”, or RDDs) with parallel operations

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")           // base RDD
errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count     // action
cachedMsgs.filter(_.contains("bar")).count

[Diagram: the driver sends tasks to workers; each worker reads its block (Block 1–3), caches the filtered messages (Cache 1–3), and returns results]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5–7 sec (vs 170 sec for on-disk data)

Fault Tolerance

RDDs track the series of transformations used to build them (their lineage) to recompute lost data

E.g.:

messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))

Lineage chain: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
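The lineage idea can be sketched in a few lines of plain Python (a toy model for illustration, not Spark’s actual implementation): each RDD stores only a closure describing how to rebuild its data from its parent, so a lost partition can be recomputed on demand instead of being replicated.

```python
# Toy model of RDD lineage (illustration only, not Spark's implementation):
# an RDD stores a closure that recomputes its data from its parent,
# so lost partitions can be rebuilt instead of replicated.

class ToyRDD:
    def __init__(self, compute):
        self._compute = compute  # lineage: how to regenerate the data

    def filter(self, f):
        return ToyRDD(lambda: [x for x in self._compute() if f(x)])

    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._compute()])

    def collect(self):
        # Re-runs the whole lineage chain, which is what Spark does
        # to recover a lost partition.
        return self._compute()

def text_file(lines):
    return ToyRDD(lambda: list(lines))

log = ["ERROR\tdisk\tfull", "INFO\tok\tfine", "ERROR\tnet\tdown"]
messages = (text_file(log)
            .filter(lambda line: "error" in line.lower())
            .map(lambda line: line.split("\t")[2]))
print(messages.collect())  # ['full', 'down']
```

Note that no data is stored anywhere in the chain: `collect()` regenerates everything from the base “file”, which is why lineage makes replication unnecessary.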

Example: Logistic Regression

Goal: find best line separating two sets of points

[Diagram: two point sets labeled + and –, a random initial line, and the target separating line]

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()   // load data in memory once

var w = Vector.random(D)                                // initial parameter vector

for (i <- 1 to ITERATIONS) {                            // repeated MapReduce steps for gradient descent
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
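The same loop in plain Python (a sketch with made-up toy data, no Spark) shows the gradient formula at work: each map step computes a per-point gradient term, the reduce step sums them, and w is updated once per iteration.

```python
# Plain-Python sketch of the gradient-descent loop above (toy data,
# no Spark). Each "map" computes a per-point gradient term; the
# "reduce" sums them; w is updated once per iteration.
import math
import random

D = 2            # number of features
ITERATIONS = 100

# Toy labeled points (x, y) with y in {-1, +1}, linearly separable.
data = [([1.0, 2.0], 1.0), ([2.0, 1.5], 1.0),
        ([-1.0, -2.0], -1.0), ([-2.0, -1.0], -1.0)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

random.seed(0)
w = [random.random() for _ in range(D)]  # initial parameter vector

for _ in range(ITERATIONS):
    gradient = [0.0] * D
    for x, y in data:  # the map step: per-point gradient
        scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
        for j in range(D):
            gradient[j] += scale * x[j]  # the reduce step: sum
    w = [wj - gj for wj, gj in zip(w, gradient)]

print("Final w:", w)
```

In Spark, the point of caching `data` is that this loop re-reads the same dataset every iteration; keeping it in memory is what turns the 110 s Hadoop iteration into a 1 s Spark iteration.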

Logistic Regression Performance

[Chart: running time (min) vs. number of iterations, Hadoop vs. Spark]

• Hadoop: 110 s / iteration

• Spark: 80 s for the first iteration, 1 s for further iterations

Spark in Java and Python

Java API (out now):

JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();

PySpark (coming soon):

lines = sc.textFile(...)
lines.filter(lambda x: 'error' in x).count()

User Applications

• In-memory analytics on Hive data (Conviva)

• Interactive queries on data streams (Quantifind)

• Business intelligence (Yahoo!)

• Traffic estimation w/ GPS data (Mobile Millennium)

• DNA sequence analysis (SNAP)

. . .

Conviva GeoReport

• Group aggregations on many keys with same filter

• 40× gain over Hive from avoiding repeated reading, deserialization and filtering
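The pattern behind that gain can be sketched in plain Python (invented toy data and field names; in Spark the filtered set would be a cached RDD): filter and deserialize the logs once, then run many group-by reports over the in-memory result.

```python
# Sketch of the shared-filter pattern (toy data, invented field names).
# Filter and deserialize the logs once, keep the result in memory,
# then run many group-by reports over it.
from collections import defaultdict

raw_logs = [
    {"country": "US", "ok": True,  "bitrate": 3500},
    {"country": "US", "ok": False, "bitrate": 1200},
    {"country": "DE", "ok": True,  "bitrate": 2800},
    {"country": "DE", "ok": True,  "bitrate": 3100},
]

# The shared filter pass -- done once (a cached RDD in Spark).
sessions = [r for r in raw_logs if r["ok"]]

def report(key, value):
    """One group-by aggregation over the cached, pre-filtered data."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in sessions:
        sums[r[key]] += r[value]
        counts[r[key]] += 1
    return {k: sums[k] / counts[k] for k in sums}

# Many reports reuse the same in-memory data instead of re-reading logs.
print(report("country", "bitrate"))  # {'US': 3500.0, 'DE': 2950.0}
```

Hive, by contrast, would re-read, re-deserialize, and re-filter the raw logs for every report; avoiding that repeated work is where the 40× comes from.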

[Chart: time (hours) — Spark: 0.5, Hive: 20]

Shark: SQL on Spark

• Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes

• Can we extend Hive to run on Spark?

Hive Architecture

[Diagram: Client (CLI, JDBC) → Driver (SQL Parser → Query Optimizer → Physical Plan → Execution) → MapReduce → HDFS, with the Metastore alongside]

Shark Architecture

[Diagram: same structure as Hive — Client (CLI, JDBC) → Driver (SQL Parser → Query Optimizer → Physical Plan → Execution) → HDFS — but execution runs on Spark, with a Cache Manager in the driver]

Column-Oriented Storage

• Caching Hive records as Java objects is inefficient

• Instead, use arrays of primitive types for columns

• Similar size to serialized form, but 5× faster to process

• Columnar compression can further reduce size by 5×

Row Storage:
  1 | john  | 4.1
  2 | mike  | 3.5
  3 | sally | 6.4

Column Storage:
  1    | 2    | 3
  john | mike | sally
  4.1  | 3.5  | 6.4
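In Python terms (a rough analogy — Shark actually uses Java primitive arrays), the two layouts look like this:

```python
# Rough analogy of the two layouts (Shark uses Java primitive arrays;
# Python lists stand in here). Row storage keeps one record object per
# row; column storage keeps one array per field.
rows = [
    (1, "john", 4.1),
    (2, "mike", 3.5),
    (3, "sally", 6.4),
]

columns = {
    "id":    [1, 2, 3],
    "name":  ["john", "mike", "sally"],
    "score": [4.1, 3.5, 6.4],
}

# Scanning one column touches a single contiguous array instead of
# walking every record object -- the source of the 5x processing gain.
avg_row = sum(r[2] for r in rows) / len(rows)
avg_col = sum(columns["score"]) / len(columns["score"])
assert avg_row == avg_col
```

Packing each column into its own array is also what makes columnar compression effective: values of one type and similar range sit next to each other.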

Other Shark Optimizations

• Dynamic join algorithm selection based on the data

• Runtime selection of # of reducers

• Partition pruning using range statistics

• Controllable table partitioning across nodes
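Partition pruning with range statistics, for example, can be sketched as follows (a hypothetical layout; Shark’s internals differ): each partition records the min/max of a column, so a predicate like pageRank > X can skip whole partitions without reading them.

```python
# Hypothetical sketch of partition pruning with range statistics.
# Each partition stores the min/max of a column; a predicate like
# "value > x" can then skip partitions whose max cannot qualify.
partitions = [
    {"min": 0,   "max": 99,  "rows": list(range(0, 100))},
    {"min": 100, "max": 199, "rows": list(range(100, 200))},
    {"min": 200, "max": 299, "rows": list(range(200, 300))},
]

def scan_greater_than(x):
    hits = []
    for p in partitions:
        if p["max"] <= x:
            continue  # pruned: no row here can satisfy value > x
        hits.extend(v for v in p["rows"] if v > x)
    return hits

print(len(scan_greater_than(250)))  # only the last partition is read
```

Here the first two partitions are never scanned; on a 2.1 TB table the skipped I/O dominates the savings.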

Using Shark

CREATE TABLE latest_logs TBLPROPERTIES ("memory"=true)
AS SELECT * FROM logs WHERE date > now()-3600;

Then, just run HiveQL against it!

Shark Results

100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al.)

SELECT pageURL, pageRank
FROM rankings
WHERE pageRank > X;

[Chart: Selection runtime in seconds (0–100 scale) — Shark: 1.1 s; Shark (disk) and Hive higher]

Shark Results: Group By

100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al.)

SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, 7);

[Chart: Group By runtime in seconds (0–600 scale) — Shark: 32 s; Shark (disk) and Hive higher]

Shark Results: Join

100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al.)

SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS totalRevenue
FROM rankings r, uservisits v
WHERE r.pageURL = v.destURL
  AND v.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
GROUP BY v.sourceIP;

[Chart: Join runtime in seconds (0–1800 scale) — Shark (copartitioned), Shark, Shark (disk), Hive]

User Queries

Yahoo!, Conviva report 40-100x speedups over Hive

[Charts: runtimes for Queries 1–3 — Shark: 0.8 s, 0.7 s, 1.0 s; Shark (disk) and Hive higher]

100 m2.4xlarge nodes, 1.7 TB Conviva dataset

Getting Started

• Spark and Shark both have scripts for launching on EC2

• Work with data in HDFS, HBase, S3, and existing Hive warehouses and metastores

• Local execution mode for testing

spark-project.org amplab.cs.berkeley.edu

We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.