Elastic Spark Programming Framework (ESPF)
A Dependency-Injection Based Programming Framework for Spark Applications
Bruce Kuo, Software Engineer, APAC Data
Outline
■ Motivation & Related Work
■ Prerequisite
■ Programming Framework
■ Integration with Components
■ Conclusion
■ Q&A
Motivation & Related Work
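Native Spark Application
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class GainsChartDataGeneration {
    public static void main(String[] args) {
        // Initialization: read positional arguments and set up the Spark environment
        String sortedPredictionResultTable = args[0];
        String gainTable = args[1];
        SparkConf conf = new SparkConf();
        JavaSparkContext sc = new JavaSparkContext(conf);
        HiveContext sqlContext = new HiveContext(sc.sc());

        // Main logic: count the positive targets among the top i rows
        DataFrame dataFrame =
            sqlContext.table(sortedPredictionResultTable)
                .select("target", "score");
        // Generate schema
        ...
        StructType schema = DataTypes.createStructType(newFields);
        long totalCount = dataFrame.count();
        List<Row> seqList = new ArrayList<>();
        for (long i = 100; i <= totalCount; ++i) {
            long curCount =
                dataFrame
                    .limit((int) i)
                    .filter("target=1")
                    .count();
            seqList.add(RowFactory.create(i, curCount));
        }

        // Output: write the gains table
        JavaRDD<Row> resultRDD = sc.parallelize(seqList);
        sqlContext
            .createDataFrame(resultRDD, schema)
            .write()
            .mode(SaveMode.Overwrite)
            .saveAsTable(gainTable);
    }
}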
Native Spark Application (Cont.)
Argument handling has to be written in every application
String sortedPredictionResultTable = args[0];
String gainTable = args[1];
Native Spark Application (Cont.)
Initialize Spark environment
SparkConf conf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(conf);
HiveContext sqlContext = new HiveContext(sc.sc());
Submit Spark Application
With the spark-submit shell command:
■ Application settings must be specified on every submission
■ The command is verbose
■ The purpose of each argument is not obvious from the command

spark-submit --master yarn-client --driver-memory 12G \
    --class com.yahoo.ecdata.generation.GainsChartDataGeneration \
    --num-executors 300 --executor-memory 12G \
    --conf spark.executor.cores=4 --conf spark.ui.view.acls=* \
    --conf spark.kryoserializer.buffer.max.mb=1024 --conf spark.akka.frameSize=1024 \
    --queue adhoc experiment.jar mining_predict_result stats_gains_table

The environment configuration settings are repeated in every submission
If an Application is a Module...
■ Application arguments are changed on the command line or in a script
● e.g., the data source and output path for ETL applications, or model arguments for machine learning applications

Experiment 1 - weight = 0.1, min_val = 1
spark-submit --master yarn-client ...... experiment.jar 0.1 1 path_1
Experiment 2 - weight = 0.3, min_val = 2
spark-submit --master yarn-client ...... experiment.jar 0.3 2 path_2

1. You need to read the documentation to know which argument is weight and which is min_val
2. Changing the values in a script is not intuitive
If an Application is a Module… (Cont.)
■ When the number of applications is large, the system contains many triggering scripts
● Changing a single configuration can require a large maintenance effort
spark-submit --master yarn-client mesos ... --class X experiment.jar mining_predict_result stats_gains_table
spark-submit --master yarn-client mesos ... --class Y experiment.jar A B C
Many scripts need to be changed one by one
Oozie Spark
<workflow-app xmlns="uri:oozie:workflow:0.2" name="spark_oozie_wf">
    <start to="spark-node"/>
    <action name="spark-node">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>...</prepare>
            <configuration>...</configuration>
            <main-class>org.apache.spark.deploy.SparkSubmit</main-class>
            <arg>--master</arg>
            <arg>yarn-client</arg>
            <arg>--class</arg>
            <arg>com.yahoo.ecdata.generation.GainsChartDataGeneration</arg>
            <arg>--properties-file</arg>
            <arg>spark-defaults.conf</arg>
            <arg>--num-executors</arg>
            <arg>300</arg>
            <arg>--executor-memory</arg>
            <arg>16g</arg>
            <arg>--driver-memory</arg>
            <arg>16g</arg>
            <arg>--queue</arg>
            <arg>adhoc</arg>
            <arg>experiment.jar</arg>
            <arg>mining_predict_result</arg>
            <arg>stats_gains_table</arg>
            <capture-output/>
        </java>
        ...
    </action>
    <kill name="fail">...</kill>
    <end name='end' />
</workflow-app>
Airflow: Bash Operator + Jinja Template
# Airflow 1.x import paths
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

templated_command = (
    "spark-submit --master yarn-client --queue adhoc --num-executors 300 "
    "--driver-memory 16g --class {{ params.main_class }} {{ params.jar_file }} "
    "{{ params.args }}"
)

def i_am_a_function(param1, param2, **kwargs):
    print(kwargs.get('execution_date'))  # Airflow macro

# default_args, param1 and param2 are assumed to be defined elsewhere
with DAG('dag_name', default_args=default_args, schedule_interval='0 5 * * *') as dag:
    (
        PythonOperator(
            task_id='task1',
            python_callable=i_am_a_function,
            op_args=[param1, param2],
            provide_context=True)
        << [BashOperator(
            task_id='task2',
            bash_command=templated_command,
            # Parameterized arguments; the keys must match the names used in the template
            params={
                'jar_file': 'experiment.jar',
                'main_class':
                    'com.yahoo.ecdata.generation.GainsChartDataGeneration',
                'args': 'mining_predict_result stats_gains_table'
            }
        )]
    )
Airflow: Bash Operator + Jinja Template (Cont.)
Template command
templated_command = (
    "spark-submit --master yarn-client --queue adhoc --num-executors 300 "
    "--driver-memory 16g --class {{ params.main_class }} {{ params.jar_file }} "
    "{{ params.args }}"
)
Airflow: Bash Operator + Jinja Template (Cont.)
Parameterized arguments
params = {
    'jar_file': 'experiment.jar',  # key matches {{ params.jar_file }} in the template
    'main_class': 'com.yahoo.ecdata.generation.GainsChartDataGeneration',
    'args': 'mining_predict_result stats_gains_table'
}
Possible Problems in Oozie & Airflow
■ Need to change default environment settings
■ Hard to know how many arguments an application takes and what each of them means
■ Is it possible to generate application configurations automatically for different purposes?
Prerequisite
Dependency Injection
■ In short, an object's dependencies are configured by an external entity instead of being created by the object itself
● e.g., Dao dao = new HiveDao(...); // HiveDao extends Dao
■ Benefits
● Reduced dependencies, e.g., any Dao implementation can be used in data-access code
● More testable code, e.g., Dao dao = new TestDao(...);
● More reusable code
● More readable code
■ Dependency injection frameworks
● Spring
● Google Guice
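The Dao example above can be sketched in plain Java without any framework. This is a minimal illustration, not code from ESPF: Dao is modelled as an interface rather than a base class, and NameReport and loadNames are hypothetical names introduced only for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Abstraction that the data-access code depends on.
interface Dao {
    List<String> loadNames();
}

// Stand-in for a production implementation; a real one would query Hive.
class HiveDao implements Dao {
    @Override
    public List<String> loadNames() {
        throw new UnsupportedOperationException("requires a Hive connection");
    }
}

// In-memory implementation used for tests.
class TestDao implements Dao {
    @Override
    public List<String> loadNames() {
        List<String> names = new ArrayList<>();
        names.add("alice");
        names.add("bob");
        return names;
    }
}

// Hypothetical consumer: it never constructs its own Dao.
class NameReport {
    private final Dao dao;

    NameReport(Dao dao) { // the dependency is injected by the caller
        this.dao = dao;
    }

    int countNames() {
        return dao.loadNames().size();
    }
}

class DependencyInjectionDemo {
    public static void main(String[] args) {
        // Swapping TestDao for HiveDao requires no change to NameReport.
        NameReport report = new NameReport(new TestDao());
        System.out.println(report.countNames()); // prints 2
    }
}
```

Because NameReport only sees the Dao interface, the same class can be exercised in a unit test and in production with different implementations.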
Java Annotation
■ Add metadata to a variable, a method or a class.
■ With reflection, a program can read the annotations on these elements at runtime and use them for basic control.
@Column(length = 32) // Truncate column value to 32 characters.
private String name;
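How an annotation and reflection work together can be sketched as follows. The @Column annotation here is declared locally for illustration (it is not the JPA annotation), and User and ColumnTruncator are hypothetical names; the truncation behaviour simply mirrors the comment in the snippet above.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

// Hypothetical annotation; RUNTIME retention keeps it visible to reflection.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Column {
    int length();
}

class User {
    @Column(length = 32)
    String name;

    User(String name) {
        this.name = name;
    }
}

class ColumnTruncator {
    // Truncates every annotated String field to its declared length.
    static void truncate(Object target) {
        for (Field field : target.getClass().getDeclaredFields()) {
            Column column = field.getAnnotation(Column.class);
            if (column == null || field.getType() != String.class) {
                continue; // only annotated String fields are handled
            }
            try {
                field.setAccessible(true);
                String value = (String) field.get(target);
                if (value != null && value.length() > column.length()) {
                    field.set(target, value.substring(0, column.length()));
                }
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        User user = new User("a-name-that-is-definitely-longer-than-thirty-two-characters");
        truncate(user);
        System.out.println(user.name.length()); // prints 32
    }
}
```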
Programming Framework
Pseudo Solution
public class SparkApp {
    // All class variables are injected automatically
    String input;   // args[0]
    String output;  // args[1]
    Double weight;  // args[2]

    public int execute() {
        // Use the arguments and the Spark context directly
        sparkContext.load(input);
        // Computing logic
        ...
        result.saveAsText(output);
        ...
    }
}
How to Make a Class Know Its Parameters?
■ Annotation + Runtime Injection!
@Input(name="input")
String input;
@Output(name="output")
String output;
@ModelParam(name="weight", type="double")
Double weight;
Application Initialization
■ Spark Environment Initialization
● SparkConf
● SparkContext
● SQLContext or HiveContext if the application needs it
■ Variable Initialization
● Inject variables with the corresponding argument
● Handle type casting
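The variable initialization step can be sketched with plain reflection. This is an illustration under stated assumptions rather than the ESPF implementation: @Param is a hypothetical stand-in for the framework's annotations, arguments are passed as a name-to-string map, and ExperimentApp and ArgumentInjector are names invented for the sketch.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Hypothetical marker annotation standing in for @Input, @Output and @ModelParam.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Param {
    String name();
}

// Hypothetical application: annotated fields, no parsing code of its own.
class ExperimentApp {
    @Param(name = "input") String input;
    @Param(name = "weight") Double weight;
    @Param(name = "min_val") Integer minVal;
}

class ArgumentInjector {
    // Sets every annotated field from the argument with the same name.
    static void inject(Object app, Map<String, String> arguments) {
        for (Field field : app.getClass().getDeclaredFields()) {
            Param param = field.getAnnotation(Param.class);
            if (param == null) {
                continue; // field is not managed by the injector
            }
            String raw = arguments.get(param.name());
            if (raw == null) {
                throw new IllegalArgumentException("Missing argument: " + param.name());
            }
            try {
                field.setAccessible(true);
                field.set(app, convert(raw, field.getType()));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    // Type casting: converts the raw string to the declared field type.
    static Object convert(String raw, Class<?> type) {
        if (type == String.class) return raw;
        if (type == Double.class) return Double.valueOf(raw);
        if (type == Integer.class) return Integer.valueOf(raw);
        if (type == Long.class) return Long.valueOf(raw);
        throw new IllegalArgumentException("Unsupported field type: " + type.getName());
    }

    public static void main(String[] args) {
        Map<String, String> arguments = new HashMap<>();
        arguments.put("input", "path_2");
        arguments.put("weight", "0.3");
        arguments.put("min_val", "2");

        ExperimentApp app = new ExperimentApp();
        inject(app, arguments);
        System.out.println(app.input + " " + app.weight + " " + app.minVal); // prints "path_2 0.3 2"
    }
}
```

Naming the arguments, instead of relying on their position, is what removes the need to consult documentation to tell weight from min_val.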
Base Class
■ Initialize Spark environment
■ Inject variables with corresponding arguments
■ Run Spark code section
Definition of Base Class - SparkApplication
public abstract class SparkApplication {
    ....
    public void initialize() { /* initialize Spark-related configuration */ }

    protected abstract int execute() throws Exception; // put your code here

    // use annotations to help the user set the configuration
    protected void setArguments(String[] args) throws Exception { ... }

    // main is final so that subclasses cannot redefine it
    public static final void main(String[] args) throws Exception {
        initialize();
        setArguments(args);
        execute();
        ....
    }
}
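The lifecycle of the base class (initialize, set arguments, execute) is an instance of the template method pattern. The following Spark-free sketch shows one way it could be wired; it is an assumption, not the ESPF source. In particular, because a static method cannot call instance methods directly, the sketch creates the concrete application reflectively in a hypothetical launch helper, and SketchApplication and ArgCountApp are invented names.

```java
// Simplified stand-in for the base class; Spark setup is replaced by a flag.
abstract class SketchApplication {
    protected String[] arguments;
    protected boolean initialized;

    // The real framework would create SparkConf, SparkContext, and so on here.
    protected void initialize() {
        initialized = true;
    }

    protected void setArguments(String[] args) {
        arguments = args.clone();
    }

    // Subclasses put their application logic here.
    protected abstract int execute() throws Exception;

    // Template method: the order of the steps is fixed and cannot be overridden.
    public final int run(String[] args) {
        initialize();
        setArguments(args);
        try {
            return execute();
        } catch (Exception e) {
            throw new IllegalStateException("Application failed", e);
        }
    }

    // Hypothetical entry point: instantiates the concrete class, then runs the lifecycle.
    static int launch(Class<? extends SketchApplication> appClass, String[] args) {
        try {
            return appClass.getDeclaredConstructor().newInstance().run(args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + appClass.getName(), e);
        }
    }
}

// Example subclass: returns the number of arguments it received.
class ArgCountApp extends SketchApplication {
    @Override
    protected int execute() {
        return initialized ? arguments.length : -1;
    }
}

class LifecycleDemo {
    public static void main(String[] args) {
        int result = SketchApplication.launch(ArgCountApp.class, new String[] {"in", "out"});
        System.out.println(result); // prints 2
    }
}
```

Keeping run final means every application goes through the same initialization and injection steps, so subclasses only contain the code in execute.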
Programming Framework
[Diagram: Application 1, Application 2, ..., Application k each extend SparkApplication]
Supported Annotations
■ @Input
● name
● type: table or file path
■ @Output
● name
● type: table or file path
■ @TableParam
● table name
● column
● datatype: used for type casting
■ @ModelParam
● name
● required
Example: Gains Chart Data Generation
■ Definition of gains chart
● The gains chart plots the values in the Gains (%) column from the table.
● Gains are defined as the proportion of hits in each increment relative to the total number of hits in the tree, using the equation: (hits in increment / total number of hits) x 100%.
Example: Gains Chart Data Generation (Cont.)
Sorted Prediction Results
Target  Score
1       880
1       724
1       676
1       556
0       480
0       368

Gains Table
Count  Target Count
100    36
200    54
300    66
400    76
500    85
600    90
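As a worked example, the gains percentage can be computed from the gains table rows. This sketch makes two assumptions that the slides do not state: the Target Count column is cumulative, and the last row shown (600 rows, 90 hits) covers all hits. GainsCalculator is a hypothetical name.

```java
class GainsCalculator {
    // Gains (%) = hits so far / total number of hits * 100.
    static double gainsPercent(long cumulativeHits, long totalHits) {
        if (totalHits <= 0) {
            throw new IllegalArgumentException("totalHits must be positive");
        }
        return 100.0 * cumulativeHits / totalHits;
    }

    public static void main(String[] args) {
        long[] counts = {100, 200, 300, 400, 500, 600};
        long[] targetCounts = {36, 54, 66, 76, 85, 90};
        // Assumption: the last row of the table contains every hit.
        long totalHits = targetCounts[targetCounts.length - 1];
        for (int i = 0; i < counts.length; i++) {
            System.out.printf("%d rows -> %.1f%% of hits%n",
                counts[i], gainsPercent(targetCounts[i], totalHits));
        }
    }
}
```

Under these assumptions the first 100 rows already contain 40% of the hits, which is the kind of early lift a gains chart is meant to show.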
Example: Gains Chart Data Generation (Cont.)
public class GainsChartDataGeneration extends SparkApplication {
    @Input(name="srcTable", type="hive")
    String sortedPredictionResultTable;

    @Output(name="destTable", type="hive")
    String gainTable;

    @Override
    protected int execute() throws Exception {
        DataFrame dataFrame =
            sqlContext.table(sortedPredictionResultTable)
                .select("target", "score");
        // Generate schema
        ...
        StructType schema = DataTypes.createStructType(newFields);
        long totalCount = dataFrame.count();
        List<Row> seqList = new ArrayList<>();
        for (long i = 100; i <= totalCount; ++i) {
            long curCount = dataFrame.limit((int) i)
                .filter("target=1").count();
            seqList.add(RowFactory.create(i, curCount));
        }
        JavaRDD<Row> resultRDD = sparkContext.parallelize(seqList);
        DataFrame resultDf = sqlContext
            .createDataFrame(resultRDD, schema);
        writeOutput(resultDf, gainTable, SaveMode.Overwrite);
        return 0; // matches the int return type declared in SparkApplication
    }
}
Example: Gains Chart Data Generation (Cont.)
Declare variables with annotations; the framework sets their values.
@Input(name="srcTable", type="hive")
String sortedPredictionResultTable;
@Output(name="destTable", type="hive")
String gainTable;
Example: Gains Chart Data Generation (Cont.)
Only the execute method is overridden
@Override
protected int execute() throws Exception {
    DataFrame dataFrame =
        sqlContext.table(sortedPredictionResultTable)
            .select("target", "score");
    // Generate schema
    ...
    StructType schema = DataTypes.createStructType(newFields);
    long totalCount = dataFrame.count();
    List<Row> seqList = new ArrayList<>();
    for (long i = 100; i <= totalCount; ++i) {
        long curCount = dataFrame.limit((int) i)
            .filter("target=1").count();
        seqList.add(RowFactory.create(i, curCount));
    }
    JavaRDD<Row> resultRDD =
        sparkContext.parallelize(seqList);
    DataFrame resultDf = sqlContext
        .createDataFrame(resultRDD, schema);
    writeOutput(resultDf, gainTable, SaveMode.Overwrite);
    return 0; // matches the int return type declared in SparkApplication
}
Framework Sugar
■ Improve code readability
■ Semantic programming
■ Reduce the effort spent on logic unrelated to the application itself
Integration with Components
Power of Annotations
■ Scan / inject class fields at runtime
■ Help other programs easily get arguments
Class - SparkAnnotation
■ Stores the annotation's metadata and the field's value
● type of annotation (Input, Output, …) and its metadata
● the value of the annotated field
■ Serializable
Interface - SparkAnnotationGetter
■ Returns a map containing all Spark annotations
public static Map<String, SparkAnnotation>
getSparkAnnotations(SparkApplication sparkApplication)
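A getter of this kind can be sketched by scanning the declared fields of an application object. The sketch below is a simplified guess at the idea, not the ESPF code: @Input and @Output are redeclared locally with only name and type, AnnotationRecord is a hypothetical stand-in for SparkAnnotation, and the field values are taken from the example configuration in these slides.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Input {
    String name();
    String type();
}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Output {
    String name();
    String type();
}

// Hypothetical stand-in for SparkAnnotation: annotation kind, metadata and field value.
class AnnotationRecord {
    final String kind;
    final String type;
    final Object value;

    AnnotationRecord(String kind, String type, Object value) {
        this.kind = kind;
        this.type = type;
        this.value = value;
    }
}

// Hypothetical application object with two annotated fields.
class GainsChartFields {
    @Input(name = "srcTable", type = "hive")
    String sortedPredictionResultTable = "sortedPredictionResults";

    @Output(name = "destTable", type = "hive")
    String gainTable = "gainChartData";
}

class AnnotationScanner {
    // Returns one record per annotated field, keyed by the annotation's name.
    static Map<String, AnnotationRecord> scan(Object app) {
        Map<String, AnnotationRecord> records = new LinkedHashMap<>();
        for (Field field : app.getClass().getDeclaredFields()) {
            Input input = field.getAnnotation(Input.class);
            Output output = field.getAnnotation(Output.class);
            if (input != null) {
                records.put(input.name(),
                    new AnnotationRecord("input", input.type(), read(field, app)));
            } else if (output != null) {
                records.put(output.name(),
                    new AnnotationRecord("output", output.type(), read(field, app)));
            }
        }
        return records;
    }

    private static Object read(Field field, Object app) {
        try {
            field.setAccessible(true);
            return field.get(app);
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (Map.Entry<String, AnnotationRecord> entry : scan(new GainsChartFields()).entrySet()) {
            AnnotationRecord record = entry.getValue();
            System.out.println(entry.getKey() + ": " + record.kind + " " + record.type + " " + record.value);
        }
    }
}
```

A map like this is what lets other programs, such as a configuration generator or a submitter, discover an application's arguments without reading its source.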
Interface - SparkAnnotationSetter
■ Set a variable with corresponding arguments
■ Handle type casting
■ Used in SparkApplication class
public static void setSparkAnnotations
(Field field, SparkAnnotation sparkAnnotation)
Class - SparkAppMetadata
■ Store application settings
● Application name
● Spark environment setting
● Spark application class
■ Store a map of SparkAnnotation from a SparkApplication
■ Serializable
Configuration Auto-Generator
■ Filling SparkAppMetadata through SparkApplicationGetter and passing it to a JSON serializer generates the application configuration
{
    "name" : "User Prediction Results",
    "input" : [ {
        "name" : "srcTable",
        "value" : "sortedPredictionResults",
        "type" : "hive",
        "fields" : {
            "target": "ta",
            "Score": "sc"
        }
    } ],
    "output" : [ {
        "name" : "destTable",
        "value" : "gainChartData",
        "type" : "hive",
        "fields" : {
            "count": "count"
        }
    } ],
    "sparkConfig" : { },
    "modelArgs" : {},
    "mainClass" : "com.yahoo.ecdata.ml.GainsChartDataGeneration"
}
New Spark Application Submitter
■ Submit job from configuration
● Command line submitter
● Programmatic submitter
■ Translate configuration for annotation-based fields to input arguments
■ Control resources if needed (or use the default settings)
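The translation step can be pictured as building a spark-submit argument list from the stored configuration. The sketch below is illustrative only: SubmitCommandBuilder and its parameters are hypothetical, and it uses only the --class and --conf flags that already appear in the spark-submit command earlier in these slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SubmitCommandBuilder {
    // Builds the spark-submit argument list from configuration values.
    static List<String> build(String mainClass, String jar,
                              Map<String, String> sparkConf, List<String> appArgs) {
        List<String> command = new ArrayList<>();
        command.add("spark-submit");
        command.add("--class");
        command.add(mainClass);
        // Each Spark setting becomes one "--conf key=value" pair.
        for (Map.Entry<String, String> entry : sparkConf.entrySet()) {
            command.add("--conf");
            command.add(entry.getKey() + "=" + entry.getValue());
        }
        command.add(jar);
        command.addAll(appArgs); // application arguments come after the jar
        return command;
    }

    public static void main(String[] args) {
        Map<String, String> sparkConf = new LinkedHashMap<>();
        sparkConf.put("spark.executor.cores", "4");

        List<String> command = build(
            "com.yahoo.ecdata.generation.GainsChartDataGeneration",
            "experiment.jar",
            sparkConf,
            Arrays.asList("mining_predict_result", "stats_gains_table"));
        System.out.println(String.join(" ", command));
    }
}
```

With the command assembled in one place, a change such as a different queue or master only has to be made in the submitter's defaults rather than in every triggering script.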
Example: Command Line Submitter
SparkSubmitter run -class com.yahoo.ecdata.generation.GainsChartDataGeneration
Missing required options: input, output
Configuration format for com.yahoo.ecdata.generation.GainsChartDataGeneration:
 -class <class>                       Spark Job class full name
 -conf <arg>                          JSON file for setting job args and Spark configs
 -sparkConf <key=value>               Spark Configurations
 -output <type=_type,value=_value>    [Output, filepath or hive path]
 -input <type=_type,value=_value>     [Input, filepath or hive path]
Example: Command Line Submitter (Cont.)
SparkSubmitter run -class com.yahoo.ecdata.generation.GainsChartDataGeneration -conf test_setting.json
The config is invalid for this class
Configuration format:
{
    "name" : "User Prediction Results",
    "input" : [
        { "name" : "input", "value" : "...", "type" : "..." }
    ],
    "output" : [
        { "name" : "output", "value" : "...", "type" : "..." }
    ],
    "sparkConfig" : { },
    "modelArgs" : {},
    "mainClass" : "com.yahoo.ecdata.ml.GainsChartDataGeneration"
}
Integration Framework Diagram
[Diagram: the ConfigurationGenerator scans Application 1, Application 2, ..., Application k (subclasses of SparkApplication) and generates a Configuration; the SparkSubmitter reads the Configuration and submits the applications; settings are injected into each application]
Example: Gains Chart Drawing Flow
[Diagram: Data → GainsChartData Generation → GainsChart Drawing]
Example: Spark Web Configurator
■ Users can submit applications from the web UI
Conclusion
Comparison of Frameworks
                         ESPF  Oozie   Airflow
Configuration Generator  v     x       x
Parameter Understanding  v     x       x
Flow Control             x     v       v
Scheduling               x     v       v
Maintenance              Easy  Native  Native
Future Work
■ Provide flow control between applications
■ Control resources automatically
● Collect application statistics to predict the resources
■ Open source to community (work in progress)
Conclusion
■ Simple and flexible framework for JVM-based languages
● e.g., Java, Scala
● PySpark is not currently supported
■ Eases the maintenance effort in large systems
● Easy to test with changing configurations
● Code is the documentation
● Focus on business logic over configuration
Acknowledgement
■ Jason Lin, Lucas Yang, Sas Chen, Norman Huang, Evans Ye
■ Yahoo APAC Data Team
Q&A