Apache Spark as Cross-Over Hit for Data Science

Post on 11-Aug-2014



1

Apache Spark as Cross-over Hit for Data Science
Sean Owen / Director of Data Science / Cloudera

2

Investigative vs Operational Analytics

Data Scientist: Exploratory Analytics

Predictive Data Products: Operational Analytics

Tools of the Trade

3

Trade-offs of the Tools

4

            Investigative                        Operational
Data        Historical Subset, Sample            Production Data, Large-Scale
Context     Workstation, Ad Hoc                  Shared Cluster, Continuous
            Investigation, Offline               Operation, Online
Metrics     Accuracy                             Throughput, QPS
Library     Many, Sophisticated                  Few, Simple
Language    Scripting, High Level,               Systems Language,
            Ease of Development                  Performance

R

5

(Same trade-off chart as the previous slide, highlighting:)

Python + scikit

6

(Same trade-off chart as the previous slide, highlighting:)

MapReduce, Crunch, Mahout

7

(Same trade-off chart as the previous slide.)

Spark: Something For Everyone

8

• Now an Apache TLP; from UC Berkeley

• Scala-based: expressive, efficient, JVM-based

• Scala-like API: distributed works like local; Collection-like, as in Crunch

• REPL: interactive

• Distributed, Hadoop-friendly: integrate with where data already is

• ETL no longer separate

• MLlib

Spark
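The "distributed works like local" point is concrete: the collection-style operations used on RDDs later in this deck (flatMap, map) behave the same way on an ordinary Scala List. A minimal local sketch, with no Spark involved:

```scala
object CollectionLike {
  // The same flatMap/map pipeline shape used on RDDs below,
  // run here on a plain local collection
  val lines = List("java mysql", "scala spark")
  val pairs = lines.flatMap(_.split(" ")).map(w => (w, 1))

  def main(args: Array[String]): Unit = pairs.foreach(println)
}
```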

9

(Same trade-off chart repeated; Spark spans both the investigative and operational sides.)

Stack Overflow Tag Recommender Demo

10

• Questions have tags like java or mysql

• Recommend new tags to questions

• Available as data dump: Jan 20 2014 Posts.xml

• 24.4GB; 2.1M questions; 9.3M tags (34K unique)

11

<row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="251" ViewCount="15207" Body="&lt;p&gt;I want to use a track-bar to change a form's opacity.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;This is my code:&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;decimal trans = trackBar1.Value / 5000;&#xA;this.Opacity = trans;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&#xA;&lt;p&gt;When I try to build it, I get this error:&lt;/p&gt;&#xA;&#xA;&lt;blockquote&gt;&#xA;  &lt;p&gt;Cannot implicitly convert type 'decimal' to 'double'.&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&#xA;&lt;p&gt;I tried making &lt;strong&gt;trans&lt;/strong&gt; to &lt;strong&gt;double&lt;/strong&gt;, but then the control doesn't work. This code has worked fine for me in VB.NET in the past. &lt;/p&gt;&#xA;" OwnerUserId="8" LastEditorUserId="2648239" LastEditorDisplayName="Rich B" LastEditDate="2014-01-03T02:42:54.963" LastActivityDate="2014-01-03T02:42:54.963" Title="When setting a form's opacity should I use a decimal or double?" Tags="&lt;c#&gt;&lt;winforms&gt;&lt;forms&gt;&lt;type-conversion&gt;&lt;opacity&gt;" AnswerCount="13" CommentCount="25" FavoriteCount="23" CommunityOwnedDate="2012-10-31T16:42:47.213" />

Stack Overflow Tag Recommender Demo

12

• CDH 5.0.1; Spark 0.9.0

• Standalone mode; install libgfortran

• 1 master; 5 workers

• 24 cores, 64GB RAM each

13

14

val postsXML = sc.textFile(
  "hdfs:///user/srowen/SparkDemo/Posts.xml")

postsXML: org.apache.spark.rdd.RDD[String] =
  MappedRDD[13] at textFile at <console>:15

postsXML.count
...
res1: Long = 18066983

15

<row Id="4" ... Tags="...c#...winforms..."/>

(4,"c#")
(4,"winforms")
...

(4,3104,1.0)
(4,2148819,1.0)
...

16

val postIDTags = postsXML.flatMap { line =>
  val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  val tagRegex = "&lt;([^&]+)&gt;".r
  idTagRegex.findFirstMatchIn(line) match {
    case None => None
    case Some(m) => {
      val postID = m.group(1).toInt
      val tagsString = m.group(2)
      val tags = tagRegex.findAllMatchIn(tagsString)
        .map(_.group(1)).toList
      if (tags.size >= 4) tags.map((postID,_)) else None
    }
  }
}
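The two regexes can be exercised locally before running the full Spark job. A sketch using the same extraction logic on a single hypothetical line, abbreviated but in the XML-escaped form Posts.xml uses:

```scala
object ParseCheck {
  // Same regexes as the flatMap above
  val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  val tagRegex = "&lt;([^&]+)&gt;".r

  // Abbreviated sample line (tags are XML-escaped in the dump)
  val line =
    "<row Id=\"4\" Tags=\"&lt;c#&gt;&lt;winforms&gt;&lt;forms&gt;&lt;opacity&gt;\" />"

  // Same extraction logic as the flatMap body, applied to one line
  val result: List[(Int, String)] =
    idTagRegex.findFirstMatchIn(line).toList.flatMap { m =>
      val postID = m.group(1).toInt
      val tags = tagRegex.findAllMatchIn(m.group(2)).map(_.group(1)).toList
      if (tags.size >= 4) tags.map((postID, _)) else Nil
    }

  def main(args: Array[String]): Unit = result.foreach(println)
}
```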

17

def nnHash(tag: String) = tag.hashCode & 0x7FFFFF
var tagHashes = postIDTags.map(_._2).distinct.map(tag => (nnHash(tag),tag))

import org.apache.spark.mllib.recommendation._
val alsInput = postIDTags.map(t => Rating(t._1, nnHash(t._2), 1.0))

val model = ALS.trainImplicit(alsInput, 40, 10)
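nnHash masks hashCode down to its low 23 bits so every tag gets a non-negative Int ID, which ALS needs for its product IDs; with ~34K unique tags hashed into ~8.4M buckets, collisions are possible but rare enough for a demo. A quick local check of the masking (sample tags here chosen for illustration):

```scala
object HashCheck {
  // Same hash as the demo: keep the low 23 bits => non-negative ID
  def nnHash(tag: String): Int = tag.hashCode & 0x7FFFFF

  val tags = List("java", "mysql", "c#", "winforms", "postgresql")
  val ids = tags.map(nnHash)

  def main(args: Array[String]): Unit =
    tags.zip(ids).foreach(println)
}
```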

18

19

def recommend(questionID: Int, howMany: Int = 5): Array[(String, Double)] = {
  val predictions = model.predict(
    tagHashes.map(t => (questionID,t._1)))
  val topN = predictions.top(howMany)
    (Ordering.by[Rating,Double](_.rating))
  topN.map(r => (tagHashes.lookup(r.product)(0), r.rating))
}
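RDD.top(n)(ord) returns the n largest elements under the given Ordering, so recommend keeps the highest-scoring Ratings. The same selection can be sketched on a plain collection; the Rating scores below are made up for illustration:

```scala
object TopNCheck {
  // Local stand-in for MLlib's Rating(user, product, rating)
  case class Rating(user: Int, product: Int, rating: Double)

  val predictions = List(
    Rating(7122697, 1, 0.09), Rating(7122697, 2, 0.17),
    Rating(7122697, 3, 0.14), Rating(7122697, 4, 0.06))

  // Same effect as predictions.top(2)(Ordering.by[Rating, Double](_.rating))
  val top2 = predictions.sortBy(-_.rating).take(2)

  def main(args: Array[String]): Unit = top2.foreach(println)
}
```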

recommend(7122697).foreach(println)

20

(sql,0.1666023080230586)
(database,0.14425980384610013)
(oracle,0.09742911781766687)
(ruby-on-rails,0.06623183702418671)
(sqlite,0.05568507618047555)

"I have a large table with a text field, and want to make queries to this table, to find records that contain a given substring, using ILIKE. It works perfectly on small tables, but in my case it is a rather time-consuming operation, and I need it to work fast, because I use it in a live-search field in my website. Any ideas would be appreciated..."

Tags: postgresql query-optimization substring text-search

stackoverflow.com/questions/7122697/how-to-make-substring-matching-query-work-fast-on-a-large-table

21

blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/

goo.gl/4K5YEI
sowen@cloudera.com