Big Data Scala by the Bay: Interactive Spark in your Browser
Transcript of Big Data Scala by the Bay: Interactive Spark in your Browser
INTERACTIVE SPARK IN YOUR BROWSER
Romain Rigaux [email protected]
Erick Tryzelaar [email protected]
GOAL OF HUE
WEB INTERFACE FOR ANALYZING DATA WITH APACHE HADOOP
SIMPLIFY AND INTEGRATE
FREE AND OPEN SOURCE
→ WEB “EXCEL” FOR HADOOP
VIEW FROM 30K FEET
Hadoop ↔ Web Server ↔ You, your colleagues and even that friend that uses IE9 ;)
WHY SPARK?
SIMPLER (PYTHON, STREAMING, INTERACTIVE…)
OPENS UP DATA TO SCIENCE
SPARK → MR
Apache Spark
Spark Streaming
MLlib (machine learning)
GraphX (graph)
Spark SQL
WHY IN HUE? MARRIED WITH THE FULL HADOOP ECOSYSTEM (Hive Tables, HDFS, Job Browser…)
WHY IN HUE? Multi-user, YARN, impersonation/security. Not yet-another-app-to-install.
HISTORY V1: OOZIE
THE GOOD
• It works
THE BAD
• Submit through Oozie
• Slow

HISTORY V2: SPARK IGNITER
THE GOOD
• It works better
THE BAD
• Compile a jar
• Batch

HISTORY V3: NOTEBOOK
THE GOOD
• It works even better
• Scala / Python / R shells
• Jar / Py batches
• Notebook UI
• YARN
THE BAD
• Still new
GENERAL ARCHITECTURE
Web part: Hue notebook with snippets
Backend part: Livy, spawning Spark sessions on YARN
WEB ARCHITECTURE
Browser → Hue server: AJAX calls (create_session(), execute()…) against a common API.
The common API sits on top of backend-specific APIs:
• Spark / Scala → Livy, over REST: /sessions, /sessions/{sessionId}/statements
• Hive (and Pig…) → HS2, over Thrift: OpenSession(), ExecuteStatement()
LIVY SPARK SERVER
• REST Web server in Scala
• Interactive Spark Sessions and Batch Jobs
• Type Introspection for Visualization
• Running sessions in YARN or local mode
• Backends: Scala, Python, R
• Open Source: https://github.com/cloudera/hue/tree/master/apps/spark/java
LIVY WEB SERVER ARCHITECTURE
Livy Server: Scalatra → Session Manager → Session → Spark Client
YARN: YARN Master, a YARN node running the Spark Interpreter with its SparkContext, and YARN nodes running Spark Workers
Request flow (steps 1-7): a request reaches Scalatra (1), the Session Manager resolves the Session (2), the Session hands the code to the Spark Client (3), which goes through the YARN Master (4) to the Spark Interpreter and its SparkContext (5), the job runs on the Spark Workers (6), and the result travels back to the caller (7).
SESSION CREATION AND EXECUTION

% curl -XPOST localhost:8998/sessions \
    -d '{"kind": "spark"}'

{
  "id": 0,
  "kind": "spark",
  "log": [...],
  "state": "idle"
}

% curl -XPOST localhost:8998/sessions/0/statements \
    -d '{"code": "1+1"}'

{
  "id": 0,
  "output": {
    "data": { "text/plain": "res0: Int = 2" },
    "execution_count": 0,
    "status": "ok"
  },
  "state": "available"
}
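The same two calls can be made from code. Below is a minimal sketch in Scala using Java 11's built-in HTTP client; it assumes a Livy server on localhost:8998 as in the curl examples, and that the freshly created session got id 0 (a real client would poll the session until its state is "idle" before executing statements).

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object LivySessionSketch {
  private val client = HttpClient.newHttpClient()

  // POST a JSON body to the Livy REST API and return the raw JSON response.
  def post(path: String, json: String): String = {
    val request = HttpRequest.newBuilder()
      .uri(URI.create(s"http://localhost:8998$path"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(json))
      .build()
    client.send(request, HttpResponse.BodyHandlers.ofString()).body()
  }

  def main(args: Array[String]): Unit = {
    println(post("/sessions", """{"kind": "spark"}"""))            // create a session
    println(post("/sessions/0/statements", """{"code": "1+1"}""")) // run a statement in it
  }
}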
LIVY INTERPRETERS: Scala, Python, R…
INTERPRETERS
• Pipe stdin/stdout to a running shell (sketched below)
• Execute the code / send it to the Spark workers
• Perform magic operations
• One interpreter per language
• “Swappable” with other kernels (Python, Spark…)

Interpreter:
> println(1 + 1)
2
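To illustrate the first bullet, here is a toy sketch that pipes a statement into a child REPL over stdin and reads the result back over stdout. It assumes a scala binary on the PATH; it is not Livy's actual mechanism (Livy drives SparkIMain in-process, as the following slides show).

import java.io.PrintWriter
import scala.io.Source

object PipedShellSketch {
  def main(args: Array[String]): Unit = {
    // Spawn a child REPL and fold its stderr into stdout.
    val proc = new ProcessBuilder("scala").redirectErrorStream(true).start()
    val stdin = new PrintWriter(proc.getOutputStream, true)
    stdin.println("println(1 + 1)") // send a statement to the shell
    stdin.close()                   // EOF: the REPL evaluates and exits
    // Read back everything the shell printed, including the "2".
    Source.fromInputStream(proc.getInputStream).getLines().foreach(println)
    proc.waitFor()
  }
}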
INTERPRETER FLOW
curl / Hue → Livy Server → Livy Session → Interpreter
The code ("1+1") flows down to the interpreter; the result ("2") flows back up and is returned to the client as { “data”: { “application/json”: “2” } }.
INTERPRETER FLOW CHART
Receive lines → split lines → for each line: if it is a magic line, execute the magic; otherwise execute the line. On success, move on to the remaining lines; if the statement is incomplete, merge it with the next line and retry; on error, stop. When no lines are left, send the output to the server.
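The same flow as code, in a minimal sketch with illustrative names (the result types below are stand-ins, not Livy's exact ones):

sealed trait LineResult
case object Success    extends LineResult
case object Incomplete extends LineResult
case object Error      extends LineResult

// Walk the received lines: run magics as they appear, execute ordinary
// lines, and merge any line the compiler flags as incomplete with the
// next one. Stop on error; when no lines are left, the caller sends the
// accumulated output to the server.
def interpret(lines: List[String],
              executeLine: String => LineResult,
              executeMagic: String => LineResult): LineResult =
  lines match {
    case Nil => Success                            // no lines left
    case line :: rest if line.startsWith("%") =>   // magic line?
      executeMagic(line) match {
        case Error => Error
        case _     => interpret(rest, executeLine, executeMagic)
      }
    case line :: rest =>
      executeLine(line) match {
        case Success => interpret(rest, executeLine, executeMagic)
        case Error   => Error
        case Incomplete => rest match {            // merge with next line
          case next :: tail =>
            interpret((line + "\n" + next) :: tail, executeLine, executeMagic)
          case Nil => Error                        // nothing left to merge with
        }
      }
  }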
LIVY INTERPRETERS

trait Interpreter {
  def state: State
  def execute(code: String): Future[JValue]
  def close(): Unit
}

sealed trait State
case class NotStarted() extends State
case class Starting() extends State
case class Idle() extends State
case class Running() extends State
case class Busy() extends State
case class Error() extends State
case class ShuttingDown() extends State
case class Dead() extends State
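As a toy example of the contract, here is a hypothetical interpreter that just echoes its input back; the real backends wrap the SparkIMain, PySpark and SparkR shells.

import org.json4s.JsonAST.{JString, JValue}
import scala.concurrent.Future

// Toy backend: "evaluates" any code by echoing it back immediately.
class EchoInterpreter extends Interpreter {
  private var _state: State = NotStarted()

  def state: State = _state

  def execute(code: String): Future[JValue] = {
    _state = Busy()
    val result = Future.successful(JString(s"echo: $code"))
    _state = Idle()
    result
  }

  def close(): Unit = { _state = Dead() }
}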
SPARK INTERPRETER

class SparkInterpreter extends Interpreter {
  …
  private var _state: State = NotStarted()
  private val outputStream = new ByteArrayOutputStream()
  private var sparkIMain: SparkIMain = _

  def start() = {
    ...
    _state = Starting()
    sparkIMain = new SparkIMain(new Settings(),
      new JPrintWriter(outputStream, true))
    sparkIMain.initializeSynchronous()
    ...
SPARK INTERPRETER

private var sparkContext: SparkContext = _

def start() = {
  ...
  val sparkConf = new SparkConf(true)
  sparkContext = new SparkContext(sparkConf)
  sparkIMain.beQuietDuring {
    sparkIMain.bind("sc", "org.apache.spark.SparkContext",
      sparkContext, List("""@transient"""))
  }
  _state = Idle()
}
EXECUTING SPARK

private def executeLine(code: String): ExecuteResult = {
  code match {
    case MAGIC_REGEX(magic, rest) =>
      executeMagic(magic, rest)
    case _ =>
      scala.Console.withOut(outputStream) {
        sparkIMain.interpret(code) match {
          case Results.Success => ExecuteComplete(readStdout())
          case Results.Incomplete => ExecuteIncomplete(readStdout())
          case Results.Error => ExecuteError(readStdout())
        }
        ...
INTERPRETER MAGIC

private val MAGIC_REGEX = "^%(\\w+)\\W*(.*)".r

private def executeMagic(magic: String, rest: String): ExecuteResponse = {
  magic match {
    case "json" => executeJsonMagic(rest)
    case "table" => executeTableMagic(rest)
    case _ => ExecuteError(f"Unknown magic command $magic")
  }
}
INTERPRETER MAGIC

private def executeJsonMagic(name: String): ExecuteResponse = {
  sparkIMain.valueOfTerm(name) match {
    case Some(value: RDD[_]) =>
      ExecuteMagic(Extraction.decompose(Map(
        "application/json" -> value.asInstanceOf[RDD[_]].take(10))))
    case Some(value) =>
      ExecuteMagic(Extraction.decompose(Map(
        "application/json" -> value)))
    case None =>
      ExecuteError(f"Value $name does not exist")
  }
}
TABLE MAGIC

val lines = sc.textFile("shakespeare.txt")
val counts = lines.
  flatMap(line => line.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _).
  sortBy(-_._2).
  map { case (w, c) => Map("word" -> w, "count" -> c) }

%table counts

"application/vnd.livy.table.v1+json": {
  "headers": [
    { "name": "count", "type": "BIGINT_TYPE" },
    { "name": "name", "type": "STRING_TYPE" }
  ],
  "data": [
    [ 23407, "the" ],
    [ 19540, "I" ],
    [ 18358, "and" ],
    ...
  ]
}
JSON MAGIC

val lines = sc.textFile("shakespeare.txt")
val counts = lines.
  flatMap(line => line.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _).
  sortBy(-_._2).
  map { case (w, c) => Map("word" -> w, "count" -> c) }

%json counts

{
  "id": 0,
  "output": {
    "application/json": [
      { "count": 506610, "word": "" },
      { "count": 23407, "word": "the" },
      { "count": 19540, "word": "I" },
      ...
    ]
  ...
}
COMING SOON
• Stability and scaling
• Security
• iPython/Jupyter backends and file format
DEMO TIME
@gethue
USER GROUP: hue-user@
WEBSITE: http://gethue.com
LEARN: http://learn.gethue.com
THANKS!