Apache Spark REX Heuritech for La Poste

Transcript of Apache Spark REX Heuritech for La Poste

Page 1: Apache Spark REX Heuritech for La Poste

APACHE SPARK REX

Page 2: Apache Spark REX Heuritech for La Poste

ABOUT ME

Didier Marin

PhD in Computer Science (UPMC): Machine Learning, Reinforcement Learning & Robotics
Co-founder of Heuritech
Likes functional programming and distributed computing

Page 3: Apache Spark REX Heuritech for La Poste

We develop tools to make sense of raw text data
Customer insight using the text of visited web pages

Page 4: Apache Spark REX Heuritech for La Poste

Data Analytics Platform

Qualify users using their web logs (50M lines/day)
Match CRM and web data

Page 5: Apache Spark REX Heuritech for La Poste
Page 6: Apache Spark REX Heuritech for La Poste

WHY SPARK?

Performance, in particular when batch size < total RAM in cluster
More general than MapReduce, high-level API
Extensions (ML, streaming) and connectors (Cassandra)
Growing community

Page 7: Apache Spark REX Heuritech for La Poste

PARSING LOGS

def parseLine(line: String): Either[ParsingError, LogData] = ???

val logs = sc.textFile("logfile").map(parseLine)

val validLogs = logs.flatMap(_.right.toOption)
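As a sketch of what parseLine could look like: the slide leaves its body as ???, and it does not define LogData or ParsingError, so the case classes and the space-separated "timestamp userId url" format below are assumptions for illustration only.

```scala
// Hypothetical types -- the slides do not define them.
case class LogData(timestamp: Long, userId: String, url: String)
case class ParsingError(line: String, reason: String)

// Assumed line format: "<timestamp> <userId> <url>", space-separated.
def parseLine(line: String): Either[ParsingError, LogData] =
  line.split(" ") match {
    case Array(ts, user, url) =>
      try Right(LogData(ts.toLong, user, url))
      catch {
        case _: NumberFormatException =>
          Left(ParsingError(line, "bad timestamp"))
      }
    case _ => Left(ParsingError(line, "wrong number of fields"))
  }
```

Malformed lines become Left values, which the flatMap(_.right.toOption) step then silently drops while keeping the valid Right values.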

Page 8: Apache Spark REX Heuritech for La Poste

LAMBDA ARCHITECTURE

Page 9: Apache Spark REX Heuritech for La Poste

IMPLEMENTATION

Page 10: Apache Spark REX Heuritech for La Poste

CLUSTER CONFIGURATION

LXC + salt
N containers: 1 master/executor + (N-1) executors
Cassandra node for each Spark executor
Using an "uber"-JAR to submit jobs
Sharing data through NFS

Page 11: Apache Spark REX Heuritech for La Poste
Page 12: Apache Spark REX Heuritech for La Poste

MANAGING SPARK'S MEMORY

Default: 40% working memory, 60% cache
20% of cache used to unroll blocks
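These defaults correspond to the legacy (pre-1.6) Spark memory settings; as a sketch, assuming that legacy memory manager, the same fractions can be set explicitly:

```scala
import org.apache.spark.SparkConf

// Legacy (pre-Spark 1.6) memory fractions matching the defaults above:
// 60% of the heap for the block cache, 20% of that cache to unroll blocks.
val conf = new SparkConf()
  .setAppName("log-analytics") // example app name
  .set("spark.storage.memoryFraction", "0.6")
  .set("spark.storage.unrollFraction", "0.2")
```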

Explicit caching for huge RDDs we reuse:

validLogs.persist(StorageLevel.MEMORY_AND_DISK)

Partition tuning may be necessary (spark.default.parallelism)
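For illustration, partitioning can be tuned either globally through that property or per RDD; the partition count below is an arbitrary example value, not a recommendation from the slides:

```scala
// Shuffle operations (e.g. reduceByKey) default to
// spark.default.parallelism partitions; it can be set on the SparkConf:
//   conf.set("spark.default.parallelism", "64")  // example value

// A specific RDD can also be repartitioned explicitly when its
// partition count does not match the cluster:
val repartitioned = validLogs.repartition(64)  // arbitrary example count
```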

Page 13: Apache Spark REX Heuritech for La Poste

AGGREGATION

val words = sc.parallelize(List("a","b","a","c"))

words.groupBy(x=>x).mapValues(_.size).collect

// Array((a,2), (b,1), (c,1))

words.map(x=>(x,1)).reduceByKey(_+_).collect

// Array((a,2), (b,1), (c,1))

Page 14: Apache Spark REX Heuritech for La Poste

AGGREGATION: groupBy

Page 15: Apache Spark REX Heuritech for La Poste

AGGREGATION: reduceByKey

see also combineByKey & foldByKey
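As a sketch of the alternatives mentioned above: combineByKey generalizes reduceByKey and, like it, combines values map-side before the shuffle (unlike groupBy). The per-key average below is a standard illustration; the data and names are ours, not from the slides.

```scala
val scores = sc.parallelize(List(("a", 1.0), ("b", 2.0), ("a", 3.0)))

val avgByKey = scores
  .combineByKey(
    (v: Double) => (v, 1),                                   // create combiner
    (acc: (Double, Int), v: Double) =>
      (acc._1 + v, acc._2 + 1),                              // merge a value in
    (x: (Double, Int), y: (Double, Int)) =>
      (x._1 + y._1, x._2 + y._2)                             // merge combiners
  )
  .mapValues { case (sum, count) => sum / count }

avgByKey.collect  // averages: a -> 2.0, b -> 2.0
```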

Page 16: Apache Spark REX Heuritech for La Poste

USEFUL LINKS

Databricks knowledge base: github.com/databricks/spark-knowledgebase

Spark users mailing list: apache-spark-user-list.1001560.n3.nabble.com

Parsing Apache logs with Spark (Scala): alvinalexander.com/scala/analyzing-apache-access-logs-files-spark-scala

Page 17: Apache Spark REX Heuritech for La Poste

THANK YOU!

[email protected]