Apache Spark as Cross-Over Hit for Data Science

Post on 11-Aug-2014



1

Apache Spark as Cross-over Hit for Data Science
Sean Owen / Director of Data Science / Cloudera

2

Investigative vs Operational Analytics

Data Scientist: Exploratory Analytics

Predictive Data Products: Operational Analytics

Tools of the Trade

3

Trade-offs of the Tools

4

            Investigative                        Operational
Data        Historical Subset, Sample            Production Data, Large-Scale
Context     Workstation, Ad Hoc                  Shared Cluster, Continuous
            Investigation, Offline               Operation, Online
Metrics     Accuracy                             Throughput, QPS
Library     Many, Sophisticated                  Few, Simple
Language    Scripting, High Level,               Systems Language,
            Ease of Development                  Performance

R

5

(Same trade-off chart as the previous slide, highlighting:)

Python + scikit

6

(Same trade-off chart as the previous slide, highlighting:)

MapReduce, Crunch, Mahout

7

(Same trade-off chart as the previous slide.)

Spark: Something For Everyone

8

• Now an Apache TLP; from UC Berkeley

• Scala-based: expressive, efficient, JVM-based

• Scala-like API: distributed works like local; Collection-like, as in Crunch

• REPL: interactive

• Distributed, Hadoop-friendly: integrate with where data already is

• ETL no longer separate

• MLlib

Spark
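The "distributed works like local" point is concrete: the collection-style operations used on RDDs later in this deck (flatMap, map) behave the same way on an ordinary Scala List. A minimal local sketch, with no Spark involved:

```scala
object CollectionLike {
  // The same flatMap/map pipeline shape used on RDDs below,
  // run here on a plain local collection
  val lines = List("java mysql", "scala spark")
  val pairs = lines.flatMap(_.split(" ")).map(w => (w, 1))

  def main(args: Array[String]): Unit = pairs.foreach(println)
}
```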

9

(Same trade-off chart repeated; Spark spans both the investigative and operational sides.)

Stack Overflow Tag Recommender Demo

10

• Questions have tags like java or mysql

• Recommend new tags to questions

• Available as data dump: Jan 20 2014 Posts.xml

• 24.4GB; 2.1M questions; 9.3M tags (34K unique)

11

<row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="251" ViewCount="15207" Body="&lt;p&gt;I want to use a track-bar to change a form's opacity.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;This is my code:&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;decimal trans = trackBar1.Value / 5000;&#xA;this.Opacity = trans;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&#xA;&lt;p&gt;When I try to build it, I get this error:&lt;/p&gt;&#xA;&#xA;&lt;blockquote&gt;&#xA;  &lt;p&gt;Cannot implicitly convert type 'decimal' to 'double'.&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&#xA;&lt;p&gt;I tried making &lt;strong&gt;trans&lt;/strong&gt; to &lt;strong&gt;double&lt;/strong&gt;, but then the control doesn't work. This code has worked fine for me in VB.NET in the past. &lt;/p&gt;&#xA;" OwnerUserId="8" LastEditorUserId="2648239" LastEditorDisplayName="Rich B" LastEditDate="2014-01-03T02:42:54.963" LastActivityDate="2014-01-03T02:42:54.963" Title="When setting a form's opacity should I use a decimal or double?" Tags="&lt;c#&gt;&lt;winforms&gt;&lt;forms&gt;&lt;type-conversion&gt;&lt;opacity&gt;" AnswerCount="13" CommentCount="25" FavoriteCount="23" CommunityOwnedDate="2012-10-31T16:42:47.213" />

Stack Overflow Tag Recommender Demo

12

• CDH 5.0.1; Spark 0.9.0

• Standalone mode; install libgfortran

• 1 master; 5 workers

• 24 cores, 64GB RAM each

13

14

val postsXML = sc.textFile(
  "hdfs:///user/srowen/SparkDemo/Posts.xml")

postsXML: org.apache.spark.rdd.RDD[String] =
  MappedRDD[13] at textFile at <console>:15

postsXML.count
...
res1: Long = 18066983

15

<row Id="4" ... Tags="...c#...winforms..."/>

(4,"c#")
(4,"winforms")
...

(4,3104,1.0)
(4,2148819,1.0)
...

16

val postIDTags = postsXML.flatMap { line =>
  val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  val tagRegex = "&lt;([^&]+)&gt;".r
  idTagRegex.findFirstMatchIn(line) match {
    case None => None
    case Some(m) => {
      val postID = m.group(1).toInt
      val tagsString = m.group(2)
      val tags = tagRegex.findAllMatchIn(tagsString)
        .map(_.group(1)).toList
      if (tags.size >= 4) tags.map((postID,_)) else None
    }
  }
}
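The two regexes can be exercised locally before running the full Spark job. A sketch using the same extraction logic on a single hypothetical line, abbreviated but in the XML-escaped form Posts.xml uses:

```scala
object ParseCheck {
  // Same regexes as the flatMap above
  val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  val tagRegex = "&lt;([^&]+)&gt;".r

  // Abbreviated sample line (tags are XML-escaped in the dump)
  val line =
    "<row Id=\"4\" Tags=\"&lt;c#&gt;&lt;winforms&gt;&lt;forms&gt;&lt;opacity&gt;\" />"

  // Same extraction logic as the flatMap body, applied to one line
  val result: List[(Int, String)] =
    idTagRegex.findFirstMatchIn(line).toList.flatMap { m =>
      val postID = m.group(1).toInt
      val tags = tagRegex.findAllMatchIn(m.group(2)).map(_.group(1)).toList
      if (tags.size >= 4) tags.map((postID, _)) else Nil
    }

  def main(args: Array[String]): Unit = result.foreach(println)
}
```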

17

def nnHash(tag: String) = tag.hashCode & 0x7FFFFF
var tagHashes = postIDTags.map(_._2).distinct.map(tag => (nnHash(tag),tag))

import org.apache.spark.mllib.recommendation._
val alsInput = postIDTags.map(t => Rating(t._1, nnHash(t._2), 1.0))

val model = ALS.trainImplicit(alsInput, 40, 10)
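nnHash masks hashCode down to its low 23 bits so every tag gets a non-negative Int ID, which ALS needs for its product IDs; with ~34K unique tags hashed into ~8.4M buckets, collisions are possible but rare enough for a demo. A quick local check of the masking (sample tags here chosen for illustration):

```scala
object HashCheck {
  // Same hash as the demo: keep the low 23 bits => non-negative ID
  def nnHash(tag: String): Int = tag.hashCode & 0x7FFFFF

  val tags = List("java", "mysql", "c#", "winforms", "postgresql")
  val ids = tags.map(nnHash)

  def main(args: Array[String]): Unit =
    tags.zip(ids).foreach(println)
}
```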

18

19

def recommend(questionID: Int, howMany: Int = 5): Array[(String, Double)] = {
  val predictions = model.predict(
    tagHashes.map(t => (questionID,t._1)))
  val topN = predictions.top(howMany)
    (Ordering.by[Rating,Double](_.rating))
  topN.map(r => (tagHashes.lookup(r.product)(0), r.rating))
}
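RDD.top(n)(ord) returns the n largest elements under the given Ordering, so recommend keeps the highest-scoring Ratings. The same selection can be sketched on a plain collection; the Rating scores below are made up for illustration:

```scala
object TopNCheck {
  // Local stand-in for MLlib's Rating(user, product, rating)
  case class Rating(user: Int, product: Int, rating: Double)

  val predictions = List(
    Rating(7122697, 1, 0.09), Rating(7122697, 2, 0.17),
    Rating(7122697, 3, 0.14), Rating(7122697, 4, 0.06))

  // Same effect as predictions.top(2)(Ordering.by[Rating, Double](_.rating))
  val top2 = predictions.sortBy(-_.rating).take(2)

  def main(args: Array[String]): Unit = top2.foreach(println)
}
```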

recommend(7122697).foreach(println)

20

(sql,0.1666023080230586)
(database,0.14425980384610013)
(oracle,0.09742911781766687)
(ruby-on-rails,0.06623183702418671)
(sqlite,0.05568507618047555)

"I have a large table with a text field, and want to make queries to this table, to find records that contain a given substring, using ILIKE. It works perfectly on small tables, but in my case it is a rather time-consuming operation, and I need it to work fast, because I use it in a live-search field in my website. Any ideas would be appreciated..."

Tags: postgresql query-optimization substring text-search

stackoverflow.com/questions/7122697/how-to-make-substring-matching-query-work-fast-on-a-large-table

21

blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/

goo.gl/4K5YEI
sowen@cloudera.com