Introduction to Spark: Or how I learned to love 'big data' after all.


Transcript of Introduction to Spark: Or how I learned to love 'big data' after all.

Page 1: Introduction to Spark: Or how I learned to love 'big data' after all.

Introduction to Spark

Peadar Coyle @springcoil

Luxembourg - Early 2016

Page 2: Aims of this talk

Explain what Spark is. I'm more a data scientist than an engineer...

Page 3: Who am I?

Math and data nerd
Interested in machine learning and data processing
Speaker at PyData / PyCons throughout Europe

Page 4: 'Big data' so far

Page 5: Why care?

Big data analytics in memory
Resilient Distributed Datasets (RDDs)
Flexible programming model
Complements Hadoop
Better performance than Hadoop
https://github.com/springcoil/scalable_ml

Page 6: Who uses it?

Current tech, not future tech!

Page 7: Supported Languages

Page 8: Code

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

sc is the SparkContext. Here is one RDD.
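Building on the snippet above, a minimal sketch (assuming a live SparkContext `sc`, as on the slide) of running a transformation and an action on that RDD:

```scala
// Transformations are lazy: map just describes the computation
val doubled = distData.map(_ * 2)

// An action such as reduce triggers execution on the cluster
val total = doubled.reduce(_ + _)  // 2 + 4 + 6 + 8 + 10 = 30
```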

Page 9:

https://github.com/springcoil/scalable_ml/

package scalable_ml

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.rdd.RDD
import breeze.linalg.{DenseVector => BDV}
import breeze.linalg.{DenseMatrix => BDM}

class LeastSquaresRegression {
  def fit(dataset: RDD[LabeledPoint]): DenseVector = {
    val features = dataset.map { _.features }

    // Accumulate X^T X across partitions
    val covarianceMatrix: BDM[Double] = features.map { v =>
      val x = BDM(v.toArray)
      x.t * x
    }.reduce(_ + _)

    // Accumulate X^T y
    val featuresTimesLabels: BDV[Double] = dataset.map { xy =>
      BDV(xy.features.toArray) * xy.label
    }.reduce(_ + _)

    // Solve the normal equations (X^T X) w = X^T y
    val weight = covarianceMatrix \ featuresTimesLabels

    new DenseVector(weight.data)
  }
}
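As a usage sketch (not from the talk): calling fit on a tiny hand-made dataset. The data below is invented for illustration; only LeastSquaresRegression itself comes from the repo, and a SparkContext `sc` is assumed.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Three points that the exact linear model w = (1.0, 2.0) fits perfectly
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(2.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(3.0, Vectors.dense(1.0, 1.0))
))

val weights = new LeastSquaresRegression().fit(training)
// weights solves (X^T X) w = X^T y; for this data that is (1.0, 2.0)
```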

Page 10: Introduction to Spark: Or how I learned to love 'big data' after all.

Resilient DistributedResilient DistributedDatasets (RDD)Datasets (RDD)

Process in parallelActions on RDDs = transformations and actionspersistance: Memory, Disk, Memory and Disk
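The three bullets above can be sketched as follows (assuming a SparkContext `sc`; StorageLevel is from Spark's public API):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100)           // split into partitions, processed in parallel

val evens = rdd.filter(_ % 2 == 0)           // transformation: lazy, just builds a new RDD
evens.persist(StorageLevel.MEMORY_AND_DISK)  // persistence: memory, disk, or both

val count = evens.count()                    // action: triggers the actual computation (50 here)
```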

Page 11: Introduction to Spark: Or how I learned to love 'big data' after all.

Spark EcosystemSpark Ecosystem

Spark streamingSpark SQL - Really the creation of a data frameMore stuff will come soon... IBM and others heavily investing in this.
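A hedged sketch of the Spark SQL point, using the Spark 1.x API that was current at the time of this talk (the table name and data are invented; `sc` is again an assumed SparkContext):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Turn an RDD of tuples into a DataFrame with named columns
val df = sc.parallelize(Seq(("Alice", 34), ("Bob", 28))).toDF("name", "age")

// Register it so it can be queried with plain SQL
df.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 30").show()
```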

Page 12: Any questions?