Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal Data Scientist, SAS (MLconf)
Copyright © 2016, SAS Institute Inc. All rights reserved.
LOCAL SEARCH OPTIMIZATION FOR
HYPERPARAMETER TUNING
FUNDA GUNES, PATRICK KOCH
LSO FOR HYPERPARAMETER TUNING – OVERVIEW
1. Training Predictive Models
• Optimization problem
• Parameters – optimization and model training
2. Tuning Predictive Models
• Manual, grid search, optimization
• Optimization challenges
3. Sample Results
4. Distributed / Parallel Processing
5. Opportunities and Challenges
PREDICTIVE MODELS
Transformation of relevant data to high-quality descriptive and predictive models
Ex.: Neural Network
• How to determine model configuration?
• How to determine weights, activation functions?
[Diagram: feed-forward network with input layer x1 … xn, a hidden layer, and output layer f1(x) … fm(x); connection weights wij (input to hidden) and wjk (hidden to output)]
MODEL TRAINING – OPTIMIZATION PROBLEM
• Objective Function
$$f(w) = \frac{1}{n}\sum_{i=1}^{n} L(w; x_i, y_i) + \lambda_1 \|w\|_1 + \frac{\lambda_2}{2}\|w\|_2^2$$
• Loss, $L(w; x_i, y_i)$, for observation $(x_i, y_i)$ and weights, $w$
• Stochastic Gradient Descent (SGD)
• Variation of Gradient Descent: $w_{k+1} = w_k - \eta \nabla f(w_k)$
• Approximate $\nabla f(w_k)$ with the gradient of a mini-batch sample $\{x_{ki}, y_{ki}\}$:
$$\nabla f_t(w) = \frac{1}{m}\sum_{i=1}^{m} \nabla L(w; x_{ki}, y_{ki})$$
• Choice of 𝜆1, 𝜆2, 𝜂, 𝑚 can have significant effects on solution quality
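The objective and mini-batch update above can be sketched in a few lines. This is an illustrative implementation, not SAS code: the function name, the squared-error loss, and the default values are my assumptions. It shows exactly where η, λ1, λ2, and m enter the update.

```python
import numpy as np

def sgd_elastic_net(X, y, eta=0.01, lam1=0.001, lam2=0.001, m=32, epochs=20, seed=0):
    """Mini-batch SGD for a squared-error loss with L1/L2 penalties.

    Toy sketch of the slide's objective:
        f(w) = (1/n) sum_i L(w; x_i, y_i) + lam1*||w||_1 + (lam2/2)*||w||_2^2
    eta, lam1, lam2, and m are the hyperparameters whose choice the
    slides discuss.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, m):
            b = idx[start:start + m]
            Xb, yb = X[b], y[b]
            # Gradient of the mini-batch squared-error loss
            grad = 2.0 / len(b) * Xb.T @ (Xb @ w - yb)
            # Add the penalty gradients (subgradient for the L1 term)
            grad += lam1 * np.sign(w) + lam2 * w
            w -= eta * grad
    return w
```

With the penalties switched off and clean linear data, the iterates recover the true weights; raising λ1/λ2 shrinks them toward zero, and too large an η makes the iterates diverge, as the next slide notes.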
MODEL TRAINING – OPTIMIZATION PARAMETERS (SGD)
• Learning rate, η
• Too high, diverges
• Too low, slow performance
• Momentum, μ
• Too high, could “pass” solution
• Too low, no performance improvement
• Other parameters
• Mini-batch size, m
• Adaptive decay rate, β
• Annealing rate, r
• Communication frequency, c
• Regularization parameters, λ1 and λ2
• Too low, has little effect
• Too high, drives iterates to 0
HYPERPARAMETERS
• Quality of trained model governed by so-called ‘hyperparameters’
No clear defaults agreeable to a wide range of applications
• Optimization options (SGD)
• Neural Net training options
• Number of hidden layers
• Number of neurons in each hidden layer
• Random distribution for initial connection weights (normal, uniform, Cauchy)
• Error function (gamma, normal, Poisson, entropy)
MODEL TUNING
• Traditional Approach: manual tuning
Even with expertise in machine learning algorithms and their parameters, the best settings depend directly on the data used in training and scoring
• Hyperparameter Optimization: grid vs. random vs. optimization
[Illustration: sample points in the (x1, x2) plane under Standard Grid Search, Random Latin Hypercube, and Random Search]
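A Latin hypercube sample can be generated in a few lines; this sketch (function name and unit-cube scaling are my choices) shows why it covers each one-dimensional projection better than a standard grid, which revisits the same few values per dimension.

```python
import numpy as np

def latin_hypercube(n_points, n_dims, rng=None):
    """Random Latin hypercube sample in the unit cube.

    Each dimension is split into n_points equal bins and every bin is
    used exactly once, so each 1-D projection of the sample is evenly
    spread -- unlike a standard grid of the same size.
    """
    rng = np.random.default_rng(rng)
    sample = np.empty((n_points, n_dims))
    for d in range(n_dims):
        # One random point inside each of the n_points bins, in shuffled order
        perm = rng.permutation(n_points)
        sample[:, d] = (perm + rng.random(n_points)) / n_points
    return sample
```

Scaling each column by the hyperparameter's range turns this into the initial search used by the hybrid strategy on a later slide.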
HYPERPARAMETER OPTIMIZATION CHALLENGES
Difficulties for the tuning objective T(x):
• Objective blows up
• Categorical / integer variables
• Noisy / nondeterministic evaluations
• Flat regions
• Node failure
The tuning objective, T(x), is the validation error score (to avoid an increased overfitting effect)
PARALLEL HYBRID DERIVATIVE-FREE OPTIMIZATION STRATEGY
Default hybrid search strategy:
1. Initial Search: Latin Hypercube Sampling (LHS)
2. Global search: Genetic Algorithm (GA)
• Supports integer, categorical variables
• Handles nonsmooth, discontinuous space
3. Local search: Generating Set Search (GSS)
• Similar to Pattern Search
• First-order convergence properties
• Developed for continuous variables
• All methods naturally parallelizable
• Can run any solver combination in parallel
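The three-stage strategy above can be illustrated with a toy serial version. Everything here is an illustrative assumption rather than the SAS solver: the function name, population sizes, and the minimal GA (uniform crossover plus Gaussian mutation) and GSS (compass directions with a shrinking step) details.

```python
import numpy as np

def hybrid_tune(f, bounds, n_init=10, ga_gens=5, gss_iters=20, seed=0):
    """Toy LHS -> GA -> GSS hybrid minimizing f over box `bounds`."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    d = len(bounds)

    # 1. Initial search: Latin hypercube sample over the box
    pop = np.stack([(rng.permutation(n_init) + rng.random(n_init)) / n_init
                    for _ in range(d)], axis=1) * (hi - lo) + lo
    vals = np.array([f(x) for x in pop])

    # 2. Global search: tiny genetic algorithm (crossover + mutation)
    for _ in range(ga_gens):
        order = np.argsort(vals)
        parents = pop[order[: n_init // 2]]             # keep the fitter half
        kids = []
        for _ in range(n_init - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            child = np.where(rng.random(d) < 0.5, a, b)  # uniform crossover
            child += rng.normal(0, 0.05 * (hi - lo), d)  # Gaussian mutation
            kids.append(np.clip(child, lo, hi))
        pop = np.vstack([parents, kids])
        vals = np.array([f(x) for x in pop])

    # 3. Local search: generating set search (compass steps, shrinking pattern)
    x, fx = pop[np.argmin(vals)], vals.min()
    step = 0.1 * (hi - lo)
    for _ in range(gss_iters):
        improved = False
        for i in range(d):
            for sgn in (+1.0, -1.0):
                cand = x.copy()
                cand[i] = np.clip(cand[i] + sgn * step[i], lo[i], hi[i])
                if (fc := f(cand)) < fx:
                    x, fx, improved = cand, fc, True
        if not improved:
            step *= 0.5        # no better point found: contract the pattern
    return x, fx
```

Each stage's evaluations are independent, which is what makes the strategy naturally parallelizable: the LHS points, each GA generation's children, and each GSS pattern can all be trained and scored concurrently.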
SAMPLE TUNING RESULTS – AVERAGE IMPROVEMENT
• 13 test problems
• Single 30% validation partition
• Single machine
• Conservative default tuning process: 5 iterations, 10 configurations per iteration
• Run 10 times each, results averaged

ML     Average Error % Reduction
DT     5.2
GBT    5.0
RF     4.4
DT-P   3.0

(Higher is better; "No Free Lunch")
SAMPLE TUNING RESULTS – TUNING TIME
• 13 test problems
• Single 30% validation partition
• Single machine
• Conservative default tuning process: 5 iterations, 10 configurations per iteration
• Run 10 times each, results averaged

ML     Average % Error    Average Time
DT     15.3               9.2
DT-P   13.9               12.8
RF     12.7               49.1
GBT    11.7               121.8
SINGLE PARTITION VS. CROSS VALIDATION
• For each problem:
  • Tune with a single 30% partition
  • Score the best model on the test set
  • Repeat 10 times
  • Average the difference between test error and validation error
• Repeat the process with 5-fold cross validation
• Cross validation for small-to-medium data sets
• 5x cost increase for the sequential tuning process
  • Manage in a parallel / threaded environment
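The 5-fold evaluation of one configuration can be sketched as below. The names `kfold_score` and `train_and_score` are hypothetical: the latter stands in for any routine that trains a model on one partition and returns its validation error.

```python
import numpy as np

def kfold_score(train_and_score, X, y, k=5, seed=0):
    """Average validation error of one hyperparameter configuration
    over k folds, instead of a single 30% partition.

    Each fold requires a full train/score pass, so sequential tuning
    cost grows k-fold -- the folds are independent, though, so they
    can be managed in a parallel / threaded environment.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        va = folds[i]                                        # hold out fold i
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        errors.append(train_and_score(X[tr], y[tr], X[va], y[va]))
    return float(np.mean(errors))
```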
PARALLEL MODE: DATA PARALLEL VS. MODEL PARALLEL
[Diagram: a Hybrid Solver Manager drives the solvers (LHS, GA, GSS, D, NM, …) in both modes. Data distributed / parallel: each train/score evaluation of a configuration x is spread across nodes 1 … k, returning f(x). Model parallel: each node trains and scores a different configuration x concurrently.]
“DATA PARALLEL” – DISTRIBUTED TRAIN, SEQUENTIAL TUNE
[Chart: IRIS forest tuning time (105 train / 45 validate) vs. number of nodes used for training: roughly 15, 36, 34, 40, 65, 112, 140, and 226 seconds at 1, 2, 4, 8, 16, 32, 64, and 128 nodes. Chart: credit data tuning time (49k train / 21k validate) vs. number of nodes (1–64), time in seconds (0–600).]
“DATA AND MODEL PARALLEL” – DISTRIBUTED TRAIN, PARALLEL TUNE
• Distributed train, parallel tune:
  • Train: 4 workers
  • Tune: n parallel
[Chart: handwritten-digits data (train 60k / validate 10k) tuning time in hours (0–12) vs. maximum models trained in parallel (2, 4, 8, 16, 32). Diagram: the Hybrid Solver Manager (LHS, GA, GSS, D, NM, …) dispatches train/score jobs, each using 4 nodes: nodes 1–4 for one configuration, nodes 5–8 for another, and so on.]
HYPERPARAMETER OPTIMIZATION RESULTS
• Default neural network: 9.6% error
• Iteration 1 – Latin hypercube best: 6.2% error
• Very similar configurations can have very different error
• Best after tuning: 2.6% error
[Chart: iteration 1 Latin hypercube sample error (%, 0–100) per evaluation (0–150). Chart: objective (misclassification error %, 0–12) vs. iteration (0–20) for the handwritten data (train 60k / validate 10k).]

Tuned Neural Net
Hidden Layers    2
Neurons          200, 200
Learning Rate    0.012
Annealing Rate   2.3e-05
MANY TUNING OPPORTUNITIES AND CHALLENGES
• Initial Implementation
• PROCs NNET, TREESPLIT, FOREST, GRADBOOST
• SAS® Viya™ Data Mining and Machine Learning
• Other search methods, extending hybrid solver framework
• Bayesian / surrogate-based Optimization
• New hybrid search strategies
• Selecting best machine learning algorithm
• Parallel tuning across algorithms & strategies for effective node usage
• Combining models
Better (generalized) Models = Better Predictions = Better Decisions
THANK YOU!
sas.com/viya
[Diagram: SAS® Viya™ platform architecture. Microservices (UAA, Query Gen, Folders, CAS Mgmt, Data Source Mgmt, Env Mgr, Model Mgmt, Log/Audit) and GUIs (Analytics, BI, Data Mgmt) communicate via parallel & serial channels, pub/sub, web services, and MQs. Cloud Analytics Services (CAS) provides the distributed in-memory engine, running in-cloud, in-database, in-Hadoop, and in-stream, beneath solutions, APIs, infrastructures, and platforms spanning Analytics, Data Management, Fraud and Security Intelligence, Business Visualization, Risk Management, and Customer Intelligence.]
EXPERIMENTAL RESULTS – EXPERIMENT SETTINGS

Hyperparameters tuned:
• Decision tree: depth; number of bins for interval variables; splitting criterion
• Random forest: number of trees; number of variables at each split; fraction of observations used to build a tree
• Gradient boosting: number of trees; number of variables at each split; fraction of observations used to build a tree; L1 regularization; L2 regularization; learning rate
• Neural networks: number of hidden layers; number of neurons in each hidden layer; LBFGS optimization parameters; SGD optimization parameters

For this study:
• Linux cluster with 144 nodes
• Each node has 16 cores
• Viya Data Mining and Machine Learning
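A search space like the one above is typically encoded as a mapping from algorithm to parameter ranges. The sketch below is purely illustrative: the dictionary layout, key names, and all numeric ranges are my assumptions, not the bounds used in the SAS study.

```python
# Illustrative tuning search space; tuples are (low, high) ranges
# (integers where noted) and lists are categorical choices.
search_space = {
    "decision_tree": {
        "depth": (1, 20),                  # integer
        "n_bins": (20, 200),               # integer, interval variables
        "criterion": ["gini", "entropy"],  # categorical
    },
    "random_forest": {
        "n_trees": (20, 500),              # integer
        "vars_per_split": (1, 50),         # integer
        "sample_fraction": (0.1, 1.0),
    },
    "gradient_boosting": {
        "n_trees": (20, 500),              # integer
        "vars_per_split": (1, 50),         # integer
        "sample_fraction": (0.1, 1.0),
        "lambda1": (0.0, 10.0),            # L1 regularization
        "lambda2": (0.0, 10.0),            # L2 regularization
        "learning_rate": (0.01, 1.0),
    },
    "neural_net": {
        "n_hidden_layers": (1, 3),         # integer
        "neurons_per_layer": (10, 500),    # integer
        "learning_rate": (1e-4, 1e-1),
        "annealing_rate": (1e-6, 1e-3),
    },
}
```

The mix of integer, continuous, and categorical entries is what motivates the derivative-free GA/GSS combination over gradient-based tuning.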