Quark: A Purely-Functional Scala DSL for Data Processing & Analytics

John A. De Goes

@jdegoes - http://degoes.net

Apache Spark

Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)

Spark Sucks

— Functional-ish

— Exceptions, typecasts

— SparkContext

— Serializable

— Unsafe type-safe programs

— Second-class support for databases

— Dependency hell (>100)

— Painful debugging

— Implementation-dependent performance

Why Does Spark Have to Suck?

Computation

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))  // <---- Where Spark goes wrong
                     .map(word => (word, 1))            // <---- Where Spark goes wrong
                     .reduceByKey(_ + _)                // <---- Where Spark goes wrong

WWFPD?

— Purely functional

— No exceptions, no casts, no nulls

— No global variables

— No serialization

— Safe type-safe programs

— First-class support for databases

— Few dependencies

— Better debugging

— Implementation-independent performance

Rule #1 in Functional Programming

Don't solve the problem, describe the solution.

AKA the "Do Nothing" rule

=> Don't compute, embed a compiled language into Scala

Quark

Compilation

Quark is a Scala DSL built on Quasar Analytics, a general-purpose compiler for translating data processing over semi-structured data into efficient plans that execute 100% inside the target infrastructure.

val textFile = Dataset.load("...")
val counts = textFile.flatMap(line => line.typed[Str].split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_.sum)

More Quark

Compilation

val dataset = Dataset.load("/prod/profiles")

val averageAge = dataset.groupBy(_.country[Str]).map(_.age[Int]).reduceBy(_.average)

Quark Targets

One DSL to Rule Them All

— MongoDB

— Couchbase

— MarkLogic

— Hadoop / HDFS

— Add your connector here!

Both Quark and Quasar Analytics are purely-functional, open source projects written in 100% Scala.

https://github.com/quasar-analytics/

How To DSL

Adding Integers

sealed trait Expr
final case class Integer(v: Int) extends Expr
final case class Addition(l: Expr, r: Expr) extends Expr

def int(v: Int): Expr = Integer(v)
def add(l: Expr, r: Expr): Expr = Addition(l, r)

add(add(int(1), int(2)), int(3)) : Expr

def interpret(e: Expr): Int = e match {
  case Integer(v)     => v
  case Addition(l, r) => interpret(l) + interpret(r)
}
def serialize(v: Expr): Json = ???
def deserialize(v: Json): Expr = ???
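The point of describing the computation as data is that the same description can be interpreted in more than one way. The sketch below is an illustration rather than code from the talk: it reuses the slide's `Expr` type and adds a hypothetical `render` function that turns the description into text, standing in for the elided `serialize` (a plain `String` is used instead of `Json`, which is an assumption made to keep the example self-contained).

```scala
object AddingIntegersSketch {
  sealed trait Expr
  final case class Integer(v: Int) extends Expr
  final case class Addition(l: Expr, r: Expr) extends Expr

  def int(v: Int): Expr = Integer(v)
  def add(l: Expr, r: Expr): Expr = Addition(l, r)

  // Interpreter 1: evaluate the description to a number.
  def interpret(e: Expr): Int = e match {
    case Integer(v)     => v
    case Addition(l, r) => interpret(l) + interpret(r)
  }

  // Interpreter 2 (hypothetical): render the same description as text.
  def render(e: Expr): String = e match {
    case Integer(v)     => v.toString
    case Addition(l, r) => s"(${render(l)} + ${render(r)})"
  }

  // Building the program computes nothing; it only describes the sum.
  val program: Expr = add(add(int(1), int(2)), int(3))
}
```

Here `AddingIntegersSketch.interpret(program)` evaluates the description, while `AddingIntegersSketch.render(program)` produces a textual form of it without evaluating anything.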

How To DSL

Adding Strings

sealed trait Expr
final case class Integer(v: Int) extends Expr
final case class Addition(l: Expr, r: Expr) extends Expr // Uh, oh!
final case class Str(v: String) extends Expr
final case class StringConcat(l: Expr, r: Expr) extends Expr // Uh, oh!
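One way to see what the "Uh, oh!" comments point at: with an untyped `Expr`, nothing stops a program from adding an integer to a string, so the interpreter has to detect the mismatch at runtime. The sketch below is an assumption about how such an interpreter might look; the `Value` type, the `Either` result, and the error messages are hypothetical and not from the talk.

```scala
object UntypedExprSketch {
  sealed trait Expr
  final case class Integer(v: Int) extends Expr
  final case class Addition(l: Expr, r: Expr) extends Expr
  final case class Str(v: String) extends Expr
  final case class StringConcat(l: Expr, r: Expr) extends Expr

  // Hypothetical runtime values: the result type is no longer known statically.
  sealed trait Value
  final case class IntValue(v: Int) extends Value
  final case class StrValue(v: String) extends Value

  // Every operation has to check its operands at runtime.
  def interpret(e: Expr): Either[String, Value] = e match {
    case Integer(v) => Right(IntValue(v))
    case Str(v)     => Right(StrValue(v))
    case Addition(l, r) =>
      (interpret(l), interpret(r)) match {
        case (Right(IntValue(a)), Right(IntValue(b))) => Right(IntValue(a + b))
        case (Left(err), _)                           => Left(err)
        case (_, Left(err))                           => Left(err)
        case _                                        => Left("Addition expects two integers")
      }
    case StringConcat(l, r) =>
      (interpret(l), interpret(r)) match {
        case (Right(StrValue(a)), Right(StrValue(b))) => Right(StrValue(a + b))
        case (Left(err), _)                           => Left(err)
        case (_, Left(err))                           => Left(err)
        case _                                        => Left("StringConcat expects two strings")
      }
  }

  val wellTyped: Expr = Addition(Integer(1), Integer(2))
  // This compiles even though it makes no sense.
  val illTyped: Expr = Addition(Integer(1), Str("two"))
}
```

The ill-typed program is accepted by the compiler and only fails when interpreted, which is the problem the typed encodings on the following slides are meant to remove.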

How To DSL

Phantom Type

sealed trait Expr[A]
final case class Integer(v: Int) extends Expr[Int]
final case class Addition(l: Expr[Int], r: Expr[Int]) extends Expr[Int]
final case class Str(v: String) extends Expr[String]
final case class StringConcat(l: Expr[String], r: Expr[String]) extends Expr[String]

def interpret[A](e: Expr[A]): A = e match {
  case Integer(v)         => v
  case Addition(l, r)     => interpret(l) + interpret(r)
  case Str(v)             => v
  case StringConcat(l, r) => interpret(l) ++ interpret(r)
}
def serialize[A](v: Expr[A]): Json = ???
def deserialize[Z](v: Json): Expr[A] forSome { type A } = ???

How To DSL

GADTs in Scala still have bugs

SI-8563, SI-9345, SI-6680

FRIENDS DON'T LET FRIENDS USE GADTS IN SCALA.

How To DSL

Finally Tagless

trait Expr[F[_]] {
  def int(v: Int): F[Int]
  def str(v: String): F[String]
  def add(l: F[Int], r: F[Int]): F[Int]
  def concat(l: F[String], r: F[String]): F[String]
}

trait Dsl[A] {
  def apply[F[_]](implicit F: Expr[F]): F[A]
}

def int(v: Int): Dsl[Int] = new Dsl[Int] {
  def apply[F[_]](implicit F: Expr[F]): F[Int] = F.int(v)
}

def add(l: Dsl[Int], r: Dsl[Int]): Dsl[Int] = new Dsl[Int] {
  def apply[F[_]](implicit F: Expr[F]): F[Int] = F.add(l.apply[F], r.apply[F])
}
// ...

How To DSL

Finally Tagless

type Id[A] = A

def interpret: Expr[Id] = new Expr[Id] {
  def int(v: Int): Id[Int] = v
  def str(v: String): Id[String] = v
  def add(l: Id[Int], r: Id[Int]): Id[Int] = l + r
  def concat(l: Id[String], r: Id[String]): Id[String] = l + r
}

add(int(1), int(2)).apply(interpret) // Id(3)

final case class Const[A, B](a: A)

def serialize: Expr[Const[Json, ?]] = ???
def deserialize[F[_]: Expr](json: Json): F[A] forSome { type A } = ???
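To make the two finally tagless slides concrete, here is a minimal self-contained sketch, assuming Scala 2.13 or later. It repeats the slide's `Expr` and `Dsl` definitions and runs one program through two interpreters: the identity interpreter from the slide (named `evaluate` here) and a hypothetical `Rendered` carrier that plays the role `Const[Json, ?]` plays for `serialize`, with a `String` substituted for `Json`. The names `Rendered`, `evaluate`, and `render` are illustrative assumptions.

```scala
object FinallyTaglessSketch {
  trait Expr[F[_]] {
    def int(v: Int): F[Int]
    def str(v: String): F[String]
    def add(l: F[Int], r: F[Int]): F[Int]
    def concat(l: F[String], r: F[String]): F[String]
  }

  // A program is a value that can be run against any interpreter F.
  trait Dsl[A] {
    def apply[F[_]](implicit F: Expr[F]): F[A]
  }

  def int(v: Int): Dsl[Int] = new Dsl[Int] {
    def apply[F[_]](implicit F: Expr[F]): F[Int] = F.int(v)
  }
  def add(l: Dsl[Int], r: Dsl[Int]): Dsl[Int] = new Dsl[Int] {
    def apply[F[_]](implicit F: Expr[F]): F[Int] = F.add(l.apply[F], r.apply[F])
  }

  // Interpreter 1: evaluate directly, as on the slide.
  type Id[A] = A
  val evaluate: Expr[Id] = new Expr[Id] {
    def int(v: Int): Id[Int] = v
    def str(v: String): Id[String] = v
    def add(l: Id[Int], r: Id[Int]): Id[Int] = l + r
    def concat(l: Id[String], r: Id[String]): Id[String] = l + r
  }

  // Interpreter 2 (hypothetical): ignore the result type and build text.
  final case class Rendered[A](text: String)
  val render: Expr[Rendered] = new Expr[Rendered] {
    def int(v: Int): Rendered[Int] = Rendered(v.toString)
    def str(v: String): Rendered[String] = Rendered("\"" + v + "\"")
    def add(l: Rendered[Int], r: Rendered[Int]): Rendered[Int] =
      Rendered(s"(${l.text} + ${r.text})")
    def concat(l: Rendered[String], r: Rendered[String]): Rendered[String] =
      Rendered(s"(${l.text} ++ ${r.text})")
  }

  val program: Dsl[Int] = add(int(1), int(2))

  val result: Int  = program.apply[Id](evaluate)
  val text: String = program.apply[Rendered](render).text
}
```

An ill-typed program such as `add(int(1), str("two"))` would not compile in this encoding, because `add` only accepts `Dsl[Int]` arguments, and no pattern matching on a GADT is involved.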

Quark 101

The Building Blocks

— **Type**. Represents a reified type of an element in a dataset.

— **Dataset[A]**. Represents a dataset, produced by successive application of set-level operations (SetOps). Describes a directed-acyclic graph.

— **MappingFunc[A, B]**. Represents a function from A to B that is produced by successive application of mapping-level operations (MapOps) to the input.

— **ReduceFunc[A, B]**. Represents a reduction from A to B, produced by application of reduction-level operations (ReduceOps) to the input.

Let's Build Us a Mini-Quark!

Mini-Quark

Type System

sealed trait Type
object Type {
  final case class Unknown() extends Type
  final case class Timestamp() extends Type
  final case class Date() extends Type
  final case class Time() extends Type
  final case class Interval() extends Type
  final case class Int() extends Type
  final case class Dec() extends Type
  final case class Str() extends Type
  final case class Map[A <: Type, B <: Type](key: A, value: B) extends Type
  final case class Arr[A <: Type](element: A) extends Type
  final case class Tuple2[A <: Type, B <: Type](_1: A, _2: B) extends Type
  final case class Bool() extends Type
  final case class Null() extends Type

  type UnknownMap = Map[Unknown, Unknown]
  val UnknownMap : UnknownMap = Map(Unknown(), Unknown())

  type UnknownArr = Arr[Unknown]
  val UnknownArr : UnknownArr = Arr(Unknown())

  type Record[A <: Type] = Map[Str, A]
  type UnknownRecord = Record[Unknown]
}
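As a small illustration of how these reified types compose, the sketch below copies a subset of the slide's `Type` definitions and builds the type of a record whose values are arrays of integers. The value names `scores` and `element` are hypothetical; the point is only that the nested structure is tracked in the Scala type, so no cast is needed to reach the element type.

```scala
object TypeSystemSketch {
  sealed trait Type
  object Type {
    final case class Unknown() extends Type
    final case class Int() extends Type
    final case class Str() extends Type
    final case class Map[A <: Type, B <: Type](key: A, value: B) extends Type
    final case class Arr[A <: Type](element: A) extends Type

    type UnknownMap = Map[Unknown, Unknown]
    val UnknownMap: UnknownMap = Map(Unknown(), Unknown())

    type Record[A <: Type] = Map[Str, A]
  }

  // The type of a record such as { "scores": [1, 2, 3] }.
  val scores: Type.Record[Type.Arr[Type.Int]] =
    Type.Map(Type.Str(), Type.Arr(Type.Int()))

  // The element type is known statically; this line needs no cast.
  val element: Type.Int = scores.value.element
}
```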

Mini-Quark

Set-Level Operations

sealed trait SetOps[F[_]] {
  def read(path: String): F[Unknown]
}

Mini-Quark

Dataset

sealed trait Dataset[A] {
  def apply[F[_]](implicit F: SetOps[F]): F[A]
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}
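A `Dataset` does nothing until it is handed a `SetOps` interpreter. The following sketch, assuming Scala 2.13 or later, shows one possible interpreter that turns a dataset into a textual plan. The `Plan` carrier, the `planOps` instance, and the `READ` syntax are hypothetical stand-ins for whatever a real backend would emit.

```scala
object DatasetSketch {
  sealed trait Type
  object Type { final case class Unknown() extends Type }
  import Type.Unknown

  trait SetOps[F[_]] {
    def read(path: String): F[Unknown]
  }

  trait Dataset[A] {
    def apply[F[_]](implicit F: SetOps[F]): F[A]
  }
  object Dataset {
    def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
      def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
    }
  }

  // Hypothetical target: a plan rendered as text. A is a phantom type.
  final case class Plan[A](text: String)

  implicit val planOps: SetOps[Plan] = new SetOps[Plan] {
    def read(path: String): Plan[Unknown] = Plan(s"READ $path")
  }

  // Interpreting the description produces a plan; no data is loaded.
  val plan: Plan[Unknown] = Dataset.read("/prod/profiles").apply[Plan]
}
```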

Mini-Quark

Mapping

sealed trait SetOps[F[_]] {
  def read(path: String): F[Unknown]

  def map[A, B](v: F[A], f: ???) // What goes here?
}

Mini-Quark

Mapping: Attempt #1

sealed trait SetOps[F[_]] {
  def read(path: String): F[Unknown]

  def map[A, B](v: F[A], f: F[A] => F[B]) // Doesn't really work...
}

Mini-Quark

Mapping: Attempt #2

sealed trait MappingFunc[A, B] {
  def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B]
}
trait MappingOps[F[_]] {
  def str(v: String): F[Type.Str]

  def project[K <: Type, V <: Type](v: F[Type.Map[K, V]], k: F[K]): F[V]

  def add(l: F[Type.Int], r: F[Type.Int]): F[Type.Int]

  def length[A <: Type](v: F[Type.Arr[A]]): F[Type.Int]

  ...
}
object MappingFunc {
  def id[A]: MappingFunc[A, A] = new MappingFunc[A, A] {
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[A] = v
  }
}

Mini-Quark

Mapping: Attempt #2

trait SetOps[F[_]] {
  def read(path: String): F[Unknown]

  def map[A, B](v: F[A], f: MappingFunc[A, B]): F[B] // Yay!!!
}

Mini-Quark

Dataset: Mapping

sealed trait Dataset[A] {
  def apply[F[_]](implicit F: SetOps[F]): F[A]

  def map[B](f: ???): Dataset[B] = ??? // What goes here???
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}

Mini-Quark

Dataset: Mapping Attempt #1

sealed trait Dataset[A] { self =>
  def apply[F[_]](implicit F: SetOps[F]): F[A]

  def map[B](f: MappingFunc[A, B]): Dataset[B] = new Dataset[B] {
    def apply[F[_]](implicit F: SetOps[F]): F[B] = F.map(self.apply, f)
  }
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}

// dataset.map(_.length) // Cannot ever work!
// dataset.map(v => v.profits[Dec] - v.losses[Dec]) // Cannot ever work!

Mini-Quark

Dataset: Mapping Attempt #2

sealed trait Dataset[A] { self =>
  def apply[F[_]](implicit F: SetOps[F]): F[A]

  def map[B](f: MappingFunc[A, A] => MappingFunc[A, B]): Dataset[B] = new Dataset[B] {
    def apply[F[_]](implicit F: SetOps[F]): F[B] = F.map(self.apply, f(MappingFunc.id[A]))
  }
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}

// dataset.map(_.length) // Works with right methods on MappingFunc!
// dataset.map(v => v.profits[Dec] - v.losses[Dec]) // Works with right methods on MappingFunc!
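The trick in attempt #2 is that the user's lambda receives the identity `MappingFunc` and extends it, so `_.length` becomes an ordinary method call that builds a description. The sketch below, assuming Scala 2.13 or later, wires the pieces together end to end. Several parts are assumptions made for illustration: the `typed` cast (named after the `typed[Str]` call on the Quark slide), the `ArrSyntax` extension that supplies `length`, and the `Code` carrier with its text syntax.

```scala
object MiniQuarkMapSketch {
  sealed trait Type
  object Type {
    final case class Unknown() extends Type
    final case class Int() extends Type
    final case class Arr[A <: Type](element: A) extends Type
  }

  trait MappingOps[F[_]] {
    def typed[A, B <: Type](v: F[A]): F[B] // hypothetical cast
    def length[A <: Type](v: F[Type.Arr[A]]): F[Type.Int]
  }

  trait MappingFunc[A, B] { self =>
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B]

    def typed[C <: Type]: MappingFunc[A, C] = new MappingFunc[A, C] {
      def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[C] =
        F.typed[B, C](self.apply[F](v))
    }
  }
  object MappingFunc {
    def id[A]: MappingFunc[A, A] = new MappingFunc[A, A] {
      def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[A] = v
    }
    // Hypothetical syntax: `length` exists only when the output is an array.
    implicit class ArrSyntax[A, E <: Type](mf: MappingFunc[A, Type.Arr[E]]) {
      def length: MappingFunc[A, Type.Int] = new MappingFunc[A, Type.Int] {
        def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[Type.Int] =
          F.length[E](mf.apply[F](v))
      }
    }
  }

  trait SetOps[F[_]] {
    def read(path: String): F[Type.Unknown]
    def map[A, B](v: F[A], f: MappingFunc[A, B]): F[B]
  }

  trait Dataset[A] { self =>
    def apply[F[_]](implicit F: SetOps[F]): F[A]
    def map[B](f: MappingFunc[A, A] => MappingFunc[A, B]): Dataset[B] = new Dataset[B] {
      def apply[F[_]](implicit F: SetOps[F]): F[B] =
        F.map(self.apply[F], f(MappingFunc.id[A]))
    }
  }
  object Dataset {
    def read(path: String): Dataset[Type.Unknown] = new Dataset[Type.Unknown] {
      def apply[F[_]](implicit F: SetOps[F]): F[Type.Unknown] = F.read(path)
    }
  }

  // Hypothetical interpreter: render both levels of the DSL as text.
  final case class Code[A](text: String)
  implicit val codeOps: MappingOps[Code] = new MappingOps[Code] {
    def typed[A, B <: Type](v: Code[A]): Code[B] = Code(v.text)
    def length[A <: Type](v: Code[Type.Arr[A]]): Code[Type.Int] = Code(s"length(${v.text})")
  }
  implicit val planOps: SetOps[Code] = new SetOps[Code] {
    def read(path: String): Code[Type.Unknown] = Code(s"read($path)")
    def map[A, B](v: Code[A], f: MappingFunc[A, B]): Code[B] =
      Code(s"map(${v.text}, x => ${f.apply[Code](Code[A]("x")).text})")
  }

  val lengths: Dataset[Type.Int] =
    Dataset.read("/logs").map(_.typed[Type.Arr[Type.Unknown]].length)
  val rendered: String = lengths.apply[Code].text
}
```

The lambda passed to `map` never sees a real element. It only sees a `MappingFunc`, which is why the resulting mapping can be rendered or compiled for a target instead of being executed in the JVM.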

Mini-Quark

Dataset: Mapping Binary Operators

val netProfit = dataset.map(v => v.netRevenue[Dec] - v.netCosts[Dec])

Mini-Quark

MappingFuncs Are Arrows!

trait MappingFunc[A <: Type, B <: Type] extends Dynamic { self =>
  import MappingFunc.Case

  def apply[F[_]: MappingOps](v: F[A]): F[B]

  def >>> [C <: Type](that: MappingFunc[B, C]): MappingFunc[A, C] = new MappingFunc[A, C] {
    def apply[F[_]: MappingOps](v: F[A]): F[C] = that.apply[F](self.apply[F](v))
  }

  def + (that: MappingFunc[A, B])(implicit W: NumberLike[B]): MappingFunc[A, B] = new MappingFunc[A, B] {
    def apply[F[_]: MappingOps](v: F[A]): F[B] = MappingOps[F].add(self(v), that(v))
  }

  def - (that: MappingFunc[A, B])(implicit W: NumberLike[B]): MappingFunc[A, B] = new MappingFunc[A, B] {
    def apply[F[_]: MappingOps](v: F[A]): F[B] = MappingOps[F].subtract(self(v), that(v))
  }
  ...
}
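The `>>>` operator is sequential composition: the output of one `MappingFunc` feeds the next. A reduced sketch, assuming Scala 2.13 or later, is shown below. It drops `Dynamic`, `NumberLike`, and the type bounds from the slide to stay small, and it adds a hypothetical `plus` helper, an `int` literal operation, and a text-rendering `Code` interpreter so the composed function can be inspected.

```scala
object ArrowCompositionSketch {
  sealed trait Type
  object Type { final case class Int() extends Type }

  trait MappingOps[F[_]] {
    def int(v: scala.Int): F[Type.Int] // hypothetical literal operation
    def add(l: F[Type.Int], r: F[Type.Int]): F[Type.Int]
  }

  trait MappingFunc[A, B] { self =>
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B]

    // Sequential composition: run `self`, then feed its result to `that`.
    def >>>[C](that: MappingFunc[B, C]): MappingFunc[A, C] = new MappingFunc[A, C] {
      def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[C] =
        that.apply[F](self.apply[F](v))
    }
  }

  // Hypothetical helper: a mapping function that adds a constant.
  def plus(n: scala.Int): MappingFunc[Type.Int, Type.Int] =
    new MappingFunc[Type.Int, Type.Int] {
      def apply[F[_]](v: F[Type.Int])(implicit F: MappingOps[F]): F[Type.Int] =
        F.add(v, F.int(n))
    }

  // Hypothetical interpreter: render the mapping as text.
  final case class Code[A](text: String)
  implicit val codeOps: MappingOps[Code] = new MappingOps[Code] {
    def int(v: scala.Int): Code[Type.Int] = Code(v.toString)
    def add(l: Code[Type.Int], r: Code[Type.Int]): Code[Type.Int] =
      Code(s"(${l.text} + ${r.text})")
  }

  val plusThree: MappingFunc[Type.Int, Type.Int] = plus(1) >>> plus(2)
  val rendered: String = plusThree.apply[Code](Code("x")).text
}
```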

Mini-Quark

Applicative Composition

           MappingFunc[A, B]
  A ----------------------------- B
   \                             /
    \                           /
     \                         /
      \                       /
       MappingFunc[A, B ⊕ C]
        \                   /
MappingFunc[A, C]          /
          \               /
           \             /
                  C
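One reading of the diagram is that two mapping functions sharing the same input `A` can be combined into a single function from `A`, by applying both to the same input and merging the results. The sketch below, assuming Scala 2.13 or later, illustrates that reading with two combinators: `subtract`, which mirrors the `netRevenue - netCosts` example, and `pair`, which interprets ⊕ as tupling. The `field` and `tuple2` operations, the phantom `Tuple2` tag, and the `Code` interpreter are assumptions for illustration and are not taken from Quark.

```scala
object ApplicativeCompositionSketch {
  sealed trait Type
  object Type {
    final case class Unknown() extends Type
    final case class Int() extends Type
    // Used only as a phantom tag here, so no fields are needed.
    sealed trait Tuple2[A <: Type, B <: Type] extends Type
  }

  trait MappingOps[F[_]] {
    def field(v: F[Type.Unknown], name: String): F[Type.Int] // hypothetical
    def subtract(l: F[Type.Int], r: F[Type.Int]): F[Type.Int]
    def tuple2[A <: Type, B <: Type](l: F[A], r: F[B]): F[Type.Tuple2[A, B]] // hypothetical
  }

  trait MappingFunc[A, B] {
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B]
  }

  def field(name: String): MappingFunc[Type.Unknown, Type.Int] =
    new MappingFunc[Type.Unknown, Type.Int] {
      def apply[F[_]](v: F[Type.Unknown])(implicit F: MappingOps[F]): F[Type.Int] =
        F.field(v, name)
    }

  // Both sides see the same input `v`; their results are then combined.
  def subtract[A](l: MappingFunc[A, Type.Int], r: MappingFunc[A, Type.Int]): MappingFunc[A, Type.Int] =
    new MappingFunc[A, Type.Int] {
      def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[Type.Int] =
        F.subtract(l.apply[F](v), r.apply[F](v))
    }

  def pair[A, B <: Type, C <: Type](l: MappingFunc[A, B], r: MappingFunc[A, C]): MappingFunc[A, Type.Tuple2[B, C]] =
    new MappingFunc[A, Type.Tuple2[B, C]] {
      def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[Type.Tuple2[B, C]] =
        F.tuple2[B, C](l.apply[F](v), r.apply[F](v))
    }

  // Hypothetical interpreter: render the mapping as text.
  final case class Code[A](text: String)
  implicit val codeOps: MappingOps[Code] = new MappingOps[Code] {
    def field(v: Code[Type.Unknown], name: String): Code[Type.Int] =
      Code(s"${v.text}.$name")
    def subtract(l: Code[Type.Int], r: Code[Type.Int]): Code[Type.Int] =
      Code(s"(${l.text} - ${r.text})")
    def tuple2[A <: Type, B <: Type](l: Code[A], r: Code[B]): Code[Type.Tuple2[A, B]] =
      Code(s"(${l.text}, ${r.text})")
  }

  val netProfit = subtract(field("netRevenue"), field("netCosts"))
  val both      = pair(field("netRevenue"), field("netCosts"))

  val netProfitText: String = netProfit.apply[Code](Code("v")).text
  val bothText: String      = both.apply[Code](Code("v")).text
}
```

Binary operators such as `+` and `-` on `MappingFunc` follow the same shape as `subtract` here: neither operand depends on the other's result, which is what distinguishes this from the sequential `>>>` composition.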

Learn More

— Finally Tagless: http://okmij.org/ftp/tagless-final/

— Quark: https://github.com/quasar-analytics/quark

— Quasar: https://github.com/quasar-analytics/quasar

THANK YOU

@jdegoes - http://degoes.net