Apache Spark REX Heuritech for La Poste

Transcript of Apache Spark REX Heuritech for La Poste

Page 1: Apache Spark REX Heuritech for La Poste

APACHE SPARK REX

Page 2: Apache Spark REX Heuritech for La Poste

ABOUT ME

Didier Marin

PhD in Computer Science (UPMC): Machine Learning, Reinforcement Learning & Robotics
Co-founder of Heuritech
Likes functional programming and distributed computing

Page 3: Apache Spark REX Heuritech for La Poste

We develop tools to make sense of raw text data
Customer insight using the text of visited web pages

Page 4: Apache Spark REX Heuritech for La Poste

Data Analytics Platform

Qualify users using their web logs (50M lines/day)
Match CRM and web data

Page 5: Apache Spark REX Heuritech for La Poste
Page 6: Apache Spark REX Heuritech for La Poste

WHY SPARK?

Performance, in particular when batch size < total RAM in cluster
More general than MapReduce, high-level API
Extensions (ML, streaming) and connectors (Cassandra)
Growing community

Page 7: Apache Spark REX Heuritech for La Poste

PARSING LOGS

def parseLine(line: String): Either[ParsingError, LogData] = ???

val logs = sc.textFile("logfile").map(parseLine)

val validLogs = logs.flatMap(_.right.toOption)
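As a sketch of what parseLine could look like: the slide leaves its body as ???, and it does not define LogData or ParsingError, so the case classes and the space-separated "timestamp userId url" format below are assumptions for illustration only.

```scala
// Hypothetical types -- the slides do not define them.
case class LogData(timestamp: Long, userId: String, url: String)
case class ParsingError(line: String, reason: String)

// Assumed line format: "<timestamp> <userId> <url>", space-separated.
def parseLine(line: String): Either[ParsingError, LogData] =
  line.split(" ") match {
    case Array(ts, user, url) =>
      try Right(LogData(ts.toLong, user, url))
      catch {
        case _: NumberFormatException =>
          Left(ParsingError(line, "bad timestamp"))
      }
    case _ => Left(ParsingError(line, "wrong number of fields"))
  }
```

Malformed lines become Left values, which the flatMap(_.right.toOption) step then silently drops while keeping the valid Right values.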

Page 8: Apache Spark REX Heuritech for La Poste

LAMBDA ARCHITECTURE

Page 9: Apache Spark REX Heuritech for La Poste

IMPLEMENTATION

Page 10: Apache Spark REX Heuritech for La Poste

CLUSTER CONFIGURATION

LXC + salt
N containers: 1 master/executor + (N-1) executors
Cassandra node for each Spark executor
Using an "uber"-JAR to submit jobs
Sharing data through NFS

Page 11: Apache Spark REX Heuritech for La Poste
Page 12: Apache Spark REX Heuritech for La Poste

MANAGING SPARK'S MEMORY

Default: 40% working memory, 60% cache
20% of cache used to unroll blocks
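These defaults correspond to the legacy (pre-1.6) Spark memory settings; as a sketch, assuming that legacy memory manager, the same fractions can be set explicitly:

```scala
import org.apache.spark.SparkConf

// Legacy (pre-Spark 1.6) memory fractions matching the defaults above:
// 60% of the heap for the block cache, 20% of that cache to unroll blocks.
val conf = new SparkConf()
  .setAppName("log-analytics") // example app name
  .set("spark.storage.memoryFraction", "0.6")
  .set("spark.storage.unrollFraction", "0.2")
```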

Explicit caching for huge RDDs we reuse:

validLogs.persist(StorageLevel.MEMORY_AND_DISK)

Partition tuning may be necessary (spark.default.parallelism)
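For illustration, partitioning can be tuned either globally through that property or per RDD; the partition count below is an arbitrary example value, not a recommendation from the slides:

```scala
// Shuffle operations (e.g. reduceByKey) default to
// spark.default.parallelism partitions; it can be set on the SparkConf:
//   conf.set("spark.default.parallelism", "64")  // example value

// A specific RDD can also be repartitioned explicitly when its
// partition count does not match the cluster:
val repartitioned = validLogs.repartition(64)  // arbitrary example count
```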

Page 13: Apache Spark REX Heuritech for La Poste

AGGREGATION

val words = sc.parallelize(List("a","b","a","c"))

words.groupBy(x=>x).mapValues(_.size).collect

// Array((a,2), (b,1), (c,1))

words.map(x=>(x,1)).reduceByKey(_+_).collect

// Array((a,2), (b,1), (c,1))

Page 14: Apache Spark REX Heuritech for La Poste

AGGREGATION: groupBy

Page 15: Apache Spark REX Heuritech for La Poste

AGGREGATION: reduceByKey

see also combineByKey & foldByKey
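As a sketch of the alternatives mentioned above: combineByKey generalizes reduceByKey and, like it, combines values map-side before the shuffle (unlike groupBy). The per-key average below is a standard illustration; the data and names are ours, not from the slides.

```scala
val scores = sc.parallelize(List(("a", 1.0), ("b", 2.0), ("a", 3.0)))

val avgByKey = scores
  .combineByKey(
    (v: Double) => (v, 1),                                   // create combiner
    (acc: (Double, Int), v: Double) =>
      (acc._1 + v, acc._2 + 1),                              // merge a value in
    (x: (Double, Int), y: (Double, Int)) =>
      (x._1 + y._1, x._2 + y._2)                             // merge combiners
  )
  .mapValues { case (sum, count) => sum / count }

avgByKey.collect  // averages: a -> 2.0, b -> 2.0
```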

Page 16: Apache Spark REX Heuritech for La Poste

USEFUL LINKS

Databricks knowledge base: github.com/databricks/spark-knowledgebase

Spark users mailing list: apache-spark-user-list.1001560.n3.nabble.com

Parsing Apache logs with Spark (Scala): alvinalexander.com/scala/analyzing-apache-access-logs-files-spark-scala

Page 17: Apache Spark REX Heuritech for La Poste

THANK YOU!

[email protected]