Spark DataFrame

Spark User Group

June 11, 2015

Julien Buret

Trainer

Twitter @JulienBuret

Spark

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hadoop MapReduce word count, shown for comparison with the Spark version
    public class WordCount {

      // Emits (word, 1) for every token of every input line
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Sums the counts emitted for each word
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
    }

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def wc(in: String, out: String) = {
        val conf = new SparkConf().setAppName("word_count")
        val spark = new SparkContext(conf)
        val textFile = spark.textFile(in)
        val counts = textFile.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.saveAsTextFile(out)
      }
    }

• 3 times less code

• 3 to 100 times faster

Spark RDD

    import org.apache.spark.{SparkConf, SparkContext}

    def groupBy_avg(in: String) = {
      val conf = new SparkConf().setAppName("groupby_avg")
      val spark = new SparkContext(conf)
      val csv = spark.textFile(in).map(_.split(","))
      csv.map(rec => (rec(0), (rec(1).toInt, 1)))            // (key, (value, 1))
        .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))   // (sum, count) per key
        .map(x => (x._1, x._2._1.toDouble / x._2._2))        // toDouble avoids integer division
        .collect()
    }

• The data is often structured (CSV, JSON, ...)

• PairRDDs require manipulating tuples or case classes

    Date,Visitor_ID,Visit,...
    2015_01_01,1123638584_657538536,1,...
    2015_01_01,1123638584_657538536,2,...
    2015_01_02,1123638584_657538536,1,...
    ...
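To illustrate the point about tuples versus case classes, here is a minimal sketch of the same per-key average written with a case class instead of nested tuples. The `SumCount` type and the `GroupByAvgCaseClass` object are hypothetical names, not part of the original slides, and the input is assumed to be a header-less CSV with a key in column 0 and an integer in column 1.

```scala
import org.apache.spark.SparkContext

// Hypothetical accumulator type (an assumption, not from the slides):
// it replaces the anonymous (sum, count) tuple with named fields.
case class SumCount(sum: Long, count: Long) {
  def +(other: SumCount): SumCount = SumCount(sum + other.sum, count + other.count)
  def avg: Double = sum.toDouble / count
}

object GroupByAvgCaseClass {
  // Assumes each line looks like "key,intValue" with no header line.
  def groupByAvg(sc: SparkContext, in: String): Array[(String, Double)] =
    sc.textFile(in)
      .map(_.split(","))
      .map(rec => (rec(0), SumCount(rec(1).trim.toLong, 1L)))
      .reduceByKey(_ + _)   // merge partial sums and counts per key
      .mapValues(_.avg)     // final average per key
      .collect()
}
```

The named fields are easier to read than `x._2._1`, but the structure of the data still has to be rebuilt by hand on every job.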

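SchemaRDD

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types._   // Spark 1.3 package layout

    def groupByAvgSSQL(in: String) = {
      val conf = new SparkConf().setAppName("groupbyAvgDF")
      val spark = new SparkContext(conf)
      val sqlCtx = new SQLContext(spark)
      val file = spark.textFile(in)
      // Remove first line if necessary
      val csv = file.map { l =>
        val fields = l.split(',')
        Row(fields(0), fields(1).trim.toInt)   // age must be an Int to match IntegerType
      }
      val csvWithSchema = sqlCtx.applySchema(csv, StructType(Seq(
        StructField("name", StringType),
        StructField("age", IntegerType))))
      csvWithSchema.groupBy("name").avg("age").collect()
    }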

• Lets you manipulate RDDs of structured data (Spark 1.0)

• Experimental

• SQL-like API

• Few readers

DataFrame

• A collection of 'Rows' organized with a 'Schema'

• Abstraction and API inspired by R and Pandas for manipulating structured data
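As a rough illustration of "rows plus a schema", the sketch below builds a small DataFrame by hand. It assumes the Spark 1.3-era `SQLContext` API used elsewhere in this deck and a local master; the object name and the two sample rows are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object RowsAndSchema {
  // The schema names and types each position of a Row
  val schema = StructType(Seq(
    StructField("name", StringType),
    StructField("age", IntegerType)))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rows_and_schema").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlCtx = new SQLContext(sc)

    // Made-up sample data, only for illustration
    val rows = sc.parallelize(Seq(Row("alice", 30), Row("bob", 40)))
    val df = sqlCtx.createDataFrame(rows, schema)

    df.printSchema()                        // prints the column names and types
    df.groupBy("name").avg("age").show()    // columns are addressed by name
    sc.stop()
  }
}
```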

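Spark DataFrame

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    def groupByAvgDF(in: String) = {
      val conf = new SparkConf().setAppName("groupbyAvgDF")
      val spark = new SparkContext(conf)
      val sqlCtx = new SQLContext(spark)
      // "header" -> "true" assumes the first line of the file holds the column names
      val csv = sqlCtx.load("com.databricks.spark.csv",
        Map("path" -> in, "header" -> "true"))
      csv.groupBy("name").avg("age").collect()
    }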

    def groupByAvg(in: String) = {
      val conf = new SparkConf().setAppName("groupbyAvg")
      val spark = new SparkContext(conf)
      val csv = spark.textFile(in).map(_.split(","))
      csv.map(rec => (rec(0), (rec(1).toInt, 1)))            // (key, (value, 1))
        .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))   // (sum, count) per key
        .map(x => (x._1, x._2._1.toDouble / x._2._2))        // toDouble avoids integer division
        .collect()
    }

Spark DataFrame

• Spark for data analysts

• Spark is now almost as easy to use as Pandas-style libraries

• Job performance is nearly identical in Java, Scala, Python, and R

• Except for UDFs
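For context on the UDF caveat: built-in DataFrame operations are planned and executed inside the JVM whatever the front-end language, whereas a Python UDF has to ship rows to Python worker processes. The sketch below shows what a UDF looks like on the Scala side, where it stays in the JVM. The `normalize` function, the `name_norm` column, and the assumption that the input has a string column called `name` are all hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

object UdfSketch {
  // Plain Scala function; hypothetical example logic
  val normalize: String => String = _.trim.toLowerCase

  // Wrapped as a Spark SQL user-defined function
  val normalizeUdf = udf(normalize)

  // Assumes df has a string column called "name"
  def withNormalizedName(df: DataFrame): DataFrame =
    df.withColumn("name_norm", normalizeUdf(df("name")))
}
```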

Spark 1.3

SchemaRDD becomes DataFrame

New reader for structured data

Spark ML adapted accordingly

Spark DataFrame from the trenches

• Near-daily use in "production" for the past 3 months

• Mostly exploration of Parquet or CSV data with PySpark in notebooks

• By users who are not developers

Spark DataFrame from the trenches

• Complexity hidden by Catalyst

• OutOfMemory errors

• The number of partitions needs tuning very frequently

• Stack traces are incomprehensible to newcomers

• Little documentation
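The partition tuning mentioned above usually comes down to two knobs: the `spark.sql.shuffle.partitions` setting, which controls how many partitions aggregations and joins produce, and an explicit `repartition` on the DataFrame. The sketch below is only an illustration; the `targetPartitions` helper and its "a few tasks per core" factor are an assumed rule of thumb, not something stated in the slides.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

object PartitionTuning {
  // Hypothetical rule of thumb: a few tasks per available core
  def targetPartitions(totalCores: Int, factor: Int = 3): Int =
    math.max(1, totalCores * factor)

  def tune(sqlCtx: SQLContext, df: DataFrame, totalCores: Int): DataFrame = {
    val n = targetPartitions(totalCores)
    // Number of partitions produced by shuffles (groupBy, join, ...)
    sqlCtx.setConf("spark.sql.shuffle.partitions", n.toString)
    println(s"partitions before: ${df.rdd.partitions.length}")
    // Explicitly redistribute the data over n partitions
    df.repartition(n)
  }
}
```

The right value depends on data volume and executor memory, so it typically has to be found by experiment.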

Other features

• Partition discovery

• Schema merging

• JSON / JDBC

• Implicit conversion from RDD to DataFrame
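The implicit RDD-to-DataFrame conversion can be sketched as follows, assuming the Spark 1.3-era `SQLContext` and a local master. The `Person` case class and the sample values are made up for the example; the schema is derived from the case class fields.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type; its fields become the DataFrame columns
case class Person(name: String, age: Int)

object ImplicitToDF {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("implicit_to_df").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlCtx = new SQLContext(sc)
    import sqlCtx.implicits._   // brings toDF() into scope for RDDs of case classes

    val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 40)))
    val df = people.toDF()      // schema inferred from the Person fields

    df.printSchema()
    df.groupBy("name").avg("age").show()
    sc.stop()
  }
}
```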

Nice to have

• Make it easier to create new expressions and UDAFs

• Determine the number of partitions automatically

• Make DataFrames easier to manipulate (drop, ...)

• An even smarter Catalyst

• Use of statistics

• Cost-based optimization (CBO)
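On the `drop` wish in the list above: on a version without a dedicated method, dropping a column can be emulated by selecting every other column. The helper below is a hypothetical sketch (the `DropColumn` object and its function names are not from the slides); Spark 1.4 adds a `drop` method on DataFrame that makes it unnecessary.

```scala
import org.apache.spark.sql.DataFrame

object DropColumn {
  // Pure helper: the column names that remain after removing `name`
  def remaining(columns: Seq[String], name: String): Seq[String] =
    columns.filterNot(_ == name)

  // Emulates a drop by selecting all columns except `name`
  def dropColumn(df: DataFrame, name: String): DataFrame =
    df.select(remaining(df.columns.toSeq, name).map(df(_)): _*)
}
```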

Some references

• RDDs are the new bytecode of Apache Spark. O. Girardot

• Introducing DataFrames in Spark for Large Scale Data Science. Databricks

• Deep Dive into Spark SQL's Catalyst Optimizer. Databricks