Implementation of linear regression and logistic regression on Spark
Parallel implementation of ML algorithms on Spark
Dalei Li EIT Digital
https://github.com/lidalei/LinearLogisticRegSpark
Overview
• Linear regression + l2 regularization
• Normal equation
• Logistic regression + l2 regularization
• Gradient descent
• Newton’s method
• Hyper-parameter optimization
• Experiments
Tools
• IntelliJ + sbt
• Scala 2.11.8 + Spark 2.0.1
Linear regression
• Problem formulation
• Closed-form solution
• Computation reformulation
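A minimal sketch of the three steps above, assuming the standard l2-regularized least-squares setup with a design matrix $X$ whose rows are $x_i^\top$, labels $y$, parameters $\theta$ and regularization strength $\lambda$. The symbols and the scaling of $\lambda$ are assumptions, not taken from the slides:

```latex
\begin{aligned}
\text{Problem formulation:}\quad & \min_{\theta}\; \lVert X\theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2 \\
\text{Closed-form solution:}\quad & \theta = \left( X^\top X + \lambda I \right)^{-1} X^\top y \\
\text{Computation reformulation:}\quad & X^\top X = \sum_{i=1}^{n} x_i x_i^\top, \qquad X^\top y = \sum_{i=1}^{n} y_i\, x_i
\end{aligned}
```

Written as sums over instances, both terms can be accumulated independently on each partition and then added together.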
Linear regression
• Data set - UCI YearPredictionMSD, text file
• 515,345 songs, (90 numerical audio features, year)
• Core computation - normal-equation terms and RMSE
Implemented outer product + vector addition
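The outer product + vector addition idea can be sketched in plain Python. This is not the author's Scala/Spark code: the function names, the list-of-partitions input and the small Gaussian-elimination solver (standing in for LAPACK) are all assumptions made for illustration.

```python
from functools import reduce

def outer_stats(rows):
    """Sum x x^T and y * x over one partition of (features, label) rows."""
    d = len(rows[0][0])
    xtx = [[0.0] * d for _ in range(d)]
    xty = [0.0] * d
    for x, y in rows:
        for i in range(d):
            xty[i] += y * x[i]
            for j in range(d):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty

def merge(a, b):
    """Element-wise addition of two (xtx, xty) partial results."""
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty

def solve(a, b):
    """Solve a * theta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    theta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * theta[j] for j in range(i + 1, n))
        theta[i] = (m[i][n] - s) / m[i][i]
    return theta

def ridge_fit(partitions, lam):
    """Combine per-partition sums, add lam to the diagonal, solve."""
    xtx, xty = reduce(merge, (outer_stats(p) for p in partitions))
    for i in range(len(xty)):
        xtx[i][i] += lam
    return solve(xtx, xty)

def rmse(rows, theta):
    """Root mean squared error of the linear predictions on rows."""
    se = sum((sum(t * xi for t, xi in zip(theta, x)) - y) ** 2 for x, y in rows)
    return (se / len(rows)) ** 0.5

# Toy data generated from y = 2 * x1 + 3 * x2, split into two "partitions".
partitions = [
    [([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0)],
    [([1.0, 1.0], 5.0), ([2.0, 1.0], 7.0)],
]
theta = ridge_fit(partitions, lam=0.0)
error = rmse([row for p in partitions for row in p], theta)
```

Each partition only produces a d x d matrix and a length-d vector, so the amount of data sent to the driver depends on the number of features, not on the number of instances.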
Workflow
• Read file - Spark SQL text
• RegexTokenizer
• StandardScaler - center data
• Solve normal equation - add l2 regularization, LAPACK
• Evaluation - RMSE
Validation
Spark ML linear regression with the normal solver vs. my implementation (both with 0.1 l2 regularization)
Randomly split the data set into train 70% + test 30%. The RMSEs on the test set are also nearly identical, differing by less than 0.5%.
Logistic regression
• Problem formulation
• Gradient descent
• Newton’s method
• Computation reformulation - gradient and Hessian matrix
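A minimal sketch of these four items, assuming labels $y_i \in \{0, 1\}$, the sigmoid hypothesis $h_\theta$, learning rate $\alpha$ and l2 strength $\lambda$. The notation and the $1/n$ scaling are assumptions rather than the slide's own formulas; set $\lambda = 0$ for an unregularized variant:

```latex
\begin{aligned}
h_\theta(x) &= \frac{1}{1 + e^{-\theta^\top x}} \\
J(\theta) &= -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\left(1 - h_\theta(x_i)\right) \right] + \frac{\lambda}{2} \lVert \theta \rVert_2^2 \\
\nabla J(\theta) &= \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x_i) - y_i \right) x_i + \lambda \theta \\
H(\theta) &= \frac{1}{n} \sum_{i=1}^{n} h_\theta(x_i) \left(1 - h_\theta(x_i)\right) x_i x_i^\top + \lambda I \\
\text{Gradient descent:}\quad \theta &\leftarrow \theta - \alpha\, \nabla J(\theta) \\
\text{Newton's method:}\quad \theta &\leftarrow \theta - H(\theta)^{-1} \nabla J(\theta)
\end{aligned}
```

Both the gradient and the Hessian are sums over instances, so each can be computed per partition and then added up.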
Logistic regression
• Data set - UCI HIGGS, csv file
• 11 million instances, (21 low-level + 7 high-level numerical features, binary label)
• Core computation - gradient and Hessian matrix
treeReduce relieves the driver: partial results are combined on the executors in several rounds, so the driver only performs the last few merge operations.
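A plain-Python sketch of the idea, not Spark code: each partition produces a partial gradient sum, and the partial sums are combined pairwise, level by level, in the spirit of treeReduce. The helper names and the list-of-partitions input are hypothetical; the update rule assumes the gradient is averaged over n instances with an added l2 term.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def partial_gradient(rows, theta):
    """Sum (h(x) - y) * x over one partition of (features, label) rows."""
    g = [0.0] * len(theta)
    for x, y in rows:
        err = sigmoid(sum(t * xi for t, xi in zip(theta, x))) - y
        for i, xi in enumerate(x):
            g[i] += err * xi
    return g

def add(u, v):
    """Vector addition, the combine step of the reduction."""
    return [a + b for a, b in zip(u, v)]

def tree_reduce(values, combine):
    """Combine neighbours pairwise, level by level, until one value is left."""
    level = list(values)
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element moves up unchanged
        level = nxt
    return level[0]

def gradient_descent(partitions, dim, rate, lam, steps):
    """Full-batch gradient descent with l2 regularization."""
    n = sum(len(p) for p in partitions)
    theta = [0.0] * dim
    for _ in range(steps):
        total = tree_reduce([partial_gradient(p, theta) for p in partitions], add)
        theta = [t - rate * (g / n + lam * t) for t, g in zip(theta, total)]
    return theta

# Toy data: one feature plus an all-one bias column, label 1 when x1 > 0.
partitions = [
    [([2.0, 1.0], 1), ([-2.0, 1.0], 0)],
    [([1.0, 1.0], 1), ([-1.0, 1.0], 0)],
    [([3.0, 1.0], 1)],
]
theta = gradient_descent(partitions, dim=2, rate=0.1, lam=0.0, steps=200)
```

With p partitions, a flat reduce performs p - 1 merges in one place, whereas the pairwise scheme needs only about log2(p) sequential rounds.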
Workflow
• Read file - Spark SQL csv
• VectorAssembler
• DF to RDD - Scala case class Instance (features, label)
• Gradient descent / Newton’s method - gradient descent: add l2 regularization; Newton’s: append all-one column
• Evaluation - cross entropy, confusion matrix
Validation
Spark ML logistic regression with L-BFGS vs. my implementation of Newton’s method
Randomly split the data set into train 70% + test 30%. The learned parameter vectors (theta) are almost identical; the last element is the bias.
Hyper-parameter optimization
• Grid search to find the hyper-parameters with the best generalization error
• Estimate generalization error
• k-Fold cross validation
A hyper-parameter is a parameter of the training process rather than of the classifier itself. It controls which model parameters can be, or tend to be, selected. For example, polynomial expansion makes it possible to learn a non-linear relationship between the label and the features.
Grid search
• Grid - [polynomial expansion degree] x [l2 regularization]
• Polynomial expansion is a memory killer
• Degree 3 on 7 features results in 119 features
• Be careful with exploiting parallelism
To increase temporal locality - accesses to a data frame are clustered in time.
Polynomial expansion does not include a constant column.
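The figure of 119 features for degree 3 on 7 features follows from counting all monomials of total degree 1 to 3, which is C(7 + 3, 3) - 1 once the constant term is left out. A small plain-Python check; the function names are hypothetical and the output ordering is an assumption that need not match Spark's PolynomialExpansion:

```python
from itertools import combinations_with_replacement
from math import comb

def expanded_feature_count(num_features, degree):
    """Number of monomials of total degree 1..degree (no constant term)."""
    return comb(num_features + degree, degree) - 1

def expand(x, degree):
    """Enumerate every monomial of total degree 1..degree for one vector."""
    out = []
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(x)), d):
            term = 1.0
            for i in idx:
                term *= x[i]
            out.append(term)
    return out

count = expanded_feature_count(7, 3)
example = expand([2.0, 3.0], 2)  # x1, x2, x1^2, x1*x2, x2^2
```

The count grows quickly with the degree, which is why the expansion dominates memory use in the grid search.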
K-Fold
• DF - Spark SQL data frame
• Persist, randomSplit
• map => [([train_i], test)], of type [([DF], DF)]
• map => [(train, test)], of type [(union[DF], DF)]
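The same two map steps can be sketched on plain Python lists instead of data frames. This is an illustration only: `random_split` and `k_fold` are hypothetical helpers, and rows are assigned to folds at random, so fold sizes are only approximately equal.

```python
import random

def random_split(rows, k, seed):
    """Assign every row to one of k folds at random."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for row in rows:
        folds[rng.randrange(k)].append(row)
    return folds

def k_fold(rows, k, seed=0):
    """Return k (train, test) pairs built from k random folds."""
    folds = random_split(rows, k, seed)
    # First map: pair each test fold with the list of remaining folds.
    pairs = [([f for j, f in enumerate(folds) if j != i], folds[i]) for i in range(k)]
    # Second map: union the remaining folds into one training set.
    return [([r for fold in train_folds for r in fold], test) for train_folds, test in pairs]

rows = list(range(30))
splits = k_fold(rows, k=3, seed=42)
```

Every row appears in exactly one test fold, and each training set is the union of the other k - 1 folds.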
k-Fold + polynomial expansion (PE)
Experiments
Spark 2.0.2 standalone mode
3 cores + 5GB mem exact copy of read-in file
http://spark.apache.org/docs/latest/cluster-overview.html
In total, we have 3 physical machines with 12GB mem + 8 cores.
• Driver - executes the Scala program
• Worker - executes tasks
• Executor - each application runs one or more processes on a worker node
• Job - triggered by an action
• Task - a unit of work executed on an executor; there is one task per partition, and the number of partitions >= the number of blocks (128MB). If set manually, use 2-4 partitions for each CPU in your cluster.
• Stage - a set of tasks
Local file - must be available at the same path with the same content on each worker node.
Performance test
• ML Settings
• Logistic regression on HIGGS
• Train-test split, 70% + 30%
• Only the 7 high-level features were used
• Test unit 1 - 100 iterations of full gradient descent + training error on the training set, initial learning rate 0.001, l2 regularization 0.1
• Test unit 2 - make predictions on the test set and compute the confusion matrix
Performance and speedup curve
[Figure: training time (s, axis 0-900) and training speedup (axis 0-5) for local, 1 executor, 2 executors, 3 executors, 4 executors, 5 executors]
Speedup labels, in category order: 1, 1.822, 2.372, 2.693, 3.641, 4.43
Running time vs. number of executors (average of 2 runs). Except for local mode, all tests had enough memory.
Local mode does not have enough memory, so the data cannot be persisted in memory and the running time is much higher.
Adding executors reduces the running time roughly linearly.
Grid search
• 10% of original data, i.e., 1.1 million instances, 7 high-level features only
• Grid
• Polynomial degrees - 1, 2, 3
• l2 regularization - 0, 0.001, 0.01, 0.1, 0.5
• 3-Fold cross validation
• 100 iterations of gradient descent with initial learning rate 0.01
• 2 executors with 10GB mem + 5 cores each
• Result - 4400s training time, final test accuracy 62.4%
Confusion matrix: truePositive: 117605, trueNegative: 88664, falsePositive: 66529, falseNegative: 57786
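The reported test accuracy can be recomputed from these four counts. A short plain-Python check; the `metrics` helper is hypothetical, and precision and recall are added only for illustration:

```python
def metrics(tp, tn, fp, fn):
    """Derive summary metrics from confusion matrix counts."""
    total = tp + tn + fp + fn
    return {
        "total": total,
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

m = metrics(tp=117605, tn=88664, fp=66529, fn=57786)
print(f"instances={m['total']} accuracy={m['accuracy']:.3f}")
```

The counts sum to 330,584 instances, about 30% of the 1.1 million instances used, and the accuracy rounds to 62.4%.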
Conclusion
• Persist data that is used more than once (incl. having branches)
• Change default cluster settings, e.g., the default executor memory is 1GB
• Make use of Spark UI to find bottlenecks
• Use Spark built-in functions if possible
• Good examples for missing functions
• Don’t use accumulators in a transformation, unless approximations are enough
• Always start from small data to debug faster
• Future work - obey train-test split
Q&A
• Thank you!
• Useful links
• Master - spark://ip:7077, e.g., spark://b2.lxd:7077
• Cluster - http://ip:8080/
• Spark UI - http://ip:4040/
• https://spark.apache.org/docs/latest/programming-guide.html
• http://spark.apache.org/docs/latest/submitting-applications.html, package a jar - sbt package
Backup slides
Training time vs. # executors
[Figure: training time (s, axis 0-900) and test accuracy (axis 0-1) for local, 1 executor, 2 executors, 3 executors, 4 executors, 5 executors]
Spark UI
Jobs timeline
Spark UI
Executor summary
Numerical stability