Transcript of Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal Data Scientist, SAS

Copyright © 2016, SAS Institute Inc. All rights reserved.

LOCAL SEARCH OPTIMIZATION FOR

HYPERPARAMETER TUNING

FUNDA GUNES, PATRICK KOCH


OVERVIEW

1. Training Predictive Models

• Optimization problem

• Parameters – optimization and model training

2. Tuning Predictive Models

• Manual, grid search, optimization

• Optimization challenges

3. Sample Results

4. Distributed / Parallel Processing

5. Opportunities and Challenges

PREDICTIVE MODELS

Transformation of relevant data to high-quality descriptive and predictive models

Ex.: Neural Network

• How to determine model configuration?

• How to determine weights, activation functions? …

[Figure: neural network with input layer x1 … xn, hidden layer, output layer f1(x) … fm(x), and connection weights wij, wjk]

MODEL TRAINING – OPTIMIZATION PROBLEM

• Objective Function

  f(w) = (1/n) ∑_{i=1}^{n} L(w; x_i, y_i) + λ₁‖w‖₁ + (λ₂/2)‖w‖₂²

• Loss, L(w; x_i, y_i), for observation (x_i, y_i) and weights, w

• Stochastic Gradient Descent (SGD)

• Variation of Gradient Descent: w_{k+1} = w_k − η ∇f(w_k)

• Approximate ∇f(w_k) with the gradient of a mini-batch sample {x_{ki}, y_{ki}}:

  ∇f_t(w) = (1/m) ∑_{i=1}^{m} ∇L(w; x_{ki}, y_{ki})

• Choice of 𝜆1, 𝜆2, 𝜂, 𝑚 can have significant effects on solution quality
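The update above can be made concrete with a small sketch. This is an illustrative toy (squared-error loss, plain Python lists, a hypothetical `sgd_step` helper), not the SAS implementation; it mirrors the slide's objective f(w) = (1/n) ∑ L + λ₁‖w‖₁ + (λ₂/2)‖w‖₂².

```python
def sgd_step(w, batch, eta, lam1, lam2):
    """One mini-batch SGD step for squared-error loss
    L(w; x, y) = 0.5 * (w.x - y)^2 with elastic-net regularization."""
    m = len(batch)
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sum(wj * xj for wj, xj in zip(w, x)) - y  # w.x - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / m                     # mini-batch loss gradient
    sign = lambda v: (v > 0) - (v < 0)
    # add subgradient of lam1*||w||_1 and gradient of (lam2/2)*||w||_2^2
    return [wj - eta * (gj + lam1 * sign(wj) + lam2 * wj)
            for wj, gj in zip(w, grad)]
```

Running a few hundred steps on data generated by y = 2x recovers w ≈ 2 with the penalties off, while a nonzero λ₂ visibly shrinks the solution, illustrating why these choices affect solution quality.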

MODEL TRAINING – OPTIMIZATION PARAMETERS (SGD)

• Learning rate, η

• Too high, diverges

• Too low, slow performance

• Momentum, μ

• Too high, could “pass” solution

• Too low, no performance improvement

• Other parameters

• Mini-batch size, m

• Adaptive decay rate, β

• Annealing rate, r

• Communication frequency, c

• Regularization parameters, λ1 and λ2

• Too low, has little effect

• Too high, drives iterates to 0
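The learning-rate trade-off above can be seen on a toy quadratic f(w) = 0.5 w² (gradient = w). The `momentum_step` and `run` names here are illustrative, not part of any SAS API:

```python
def momentum_step(w, v, grad, eta, mu):
    """Classical momentum update: the velocity v accumulates past gradients."""
    v = mu * v - eta * grad
    return w + v, v

def run(eta, mu, steps):
    """Iterate momentum SGD on f(w) = 0.5*w^2 from w = 1."""
    w, v = 1.0, 0.0
    for _ in range(steps):
        w, v = momentum_step(w, v, w, eta, mu)  # grad of 0.5*w^2 is w
    return w
```

With a moderate rate (η = 0.1, μ = 0.9) the iterates decay toward the minimizer; pushing η past the stability limit makes them blow up, matching the "too high, diverges" behavior above.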

HYPERPARAMETERS

• Quality of the trained model is governed by so-called ‘hyperparameters’

• No defaults are clearly suitable across a wide range of applications

• Optimization options (SGD)

• Neural Net training options

• Number of hidden layers

• Number of neurons in each hidden layer

• Random distribution for initial connection weights (normal, uniform, Cauchy)

• Error function (gamma, normal, Poisson, entropy)


MODEL TUNING

• Traditional Approach: manual tuning

Even with expertise in machine learning algorithms and their parameters, the best settings depend directly on the data used in training and scoring

• Hyperparameter Optimization: grid vs. random vs. optimization

[Figure: three x1–x2 sampling panels – Standard Grid Search, Random Latin Hypercube, Random Search]
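The Latin hypercube panel can be sketched in a few lines: each of the n strata in every dimension receives exactly one point, unlike a plain random search. A minimal stdlib-only illustration (the `latin_hypercube` name is hypothetical):

```python
import random

def latin_hypercube(n, dims, seed=0):
    """n points in [0,1)^dims with one point per stratum in each dimension."""
    rng = random.Random(seed)
    cols = []
    for _ in range(dims):
        perm = list(range(n))
        rng.shuffle(perm)                       # random stratum order per dimension
        cols.append([(p + rng.random()) / n for p in perm])  # jitter within stratum
    return list(zip(*cols))
```

Projecting the sample onto any single axis therefore covers all n strata, which is why an LHS of candidate hyperparameter settings explores each parameter's range more evenly than an equally sized random sample.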

HYPERPARAMETER OPTIMIZATION CHALLENGES

Challenges for the tuning objective T(x):

• Objective blows up

• Categorical / integer variables

• Noisy / nondeterministic evaluations

• Flat regions

• Node failure

Tuning objective, T(x), is validation error score (to avoid increased overfitting effect)

PARALLEL HYBRID DERIVATIVE-FREE OPTIMIZATION STRATEGY

Default hybrid search strategy:

1. Initial Search: Latin Hypercube Sampling (LHS)

2. Global search: Genetic Algorithm (GA)

• Supports integer, categorical variables

• Handles nonsmooth, discontinuous space

3. Local search: Generating Set Search (GSS)

• Similar to Pattern Search

• First-order convergence properties

• Developed for continuous variables

All methods are naturally parallelizable; any solver combination can run in parallel
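The local-search step can be illustrated with a compass-style poll, a simple member of the generating set search family: poll ± a step along each coordinate and shrink the step when no poll point improves. This `gss_minimize` sketch is a stand-in for the idea, not the SAS solver:

```python
def gss_minimize(f, x0, step=1.0, tol=1e-6, max_iter=10000):
    """Minimal compass search: derivative-free local descent for continuous x."""
    x, fx = list(x0), f(x0)
    for _ in range(max_iter):
        if step <= tol:
            break
        improved = False
        for j in range(len(x)):            # poll +/- step along each coordinate
            for d in (step, -step):
                trial = list(x)
                trial[j] += d
                ft = f(trial)
                if ft < fx:
                    x, fx, improved = trial, ft, True
        if not improved:
            step *= 0.5                    # shrink the pattern when the poll fails
    return x, fx
```

Because every poll point can be evaluated independently, the inner loop parallelizes naturally, which is what makes GSS (like LHS and GA) a good fit for the hybrid strategy above.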

SAMPLE TUNING RESULTS – AVERAGE IMPROVEMENT

• 13 test problems

• Single 30% validation partition

• Single Machine

• Conservative default tuning process:

• 5 Iterations

• 10 configurations per iteration

• Run 10 times each

• Averaged results

ML     Average Error % Reduction
DT     5.2
GBT    5.0
RF     4.4
DT-P   3.0

Higher is better

No Free Lunch

SAMPLE TUNING RESULTS – TUNING TIME

• 13 test problems

• Single 30% validation partition

• Single Machine

• Conservative default tuning process:

• 5 Iterations

• 10 configurations per iteration

• Run 10 times each

• Averaged results

ML     Average % Error   Average Time
DT     15.3              9.2
DT-P   13.9              12.8
RF     12.7              49.1
GBT    11.7              121.8

SINGLE PARTITION VS. CROSS VALIDATION

• For each problem:

• Tune with single 30% partition

• Score best model on test set

• Repeat 10 times

• Average difference between test error and validation error

• Repeat process with 5-fold cross validation

• Cross validation for small-to-medium data sets

• 5x cost increase for sequential tuning process

• Manage in parallel / threaded environment
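The 5x cost factor follows directly from how k-fold splits work: each candidate configuration is trained and scored once per fold. A minimal index-level sketch (the `kfold_splits` helper is illustrative):

```python
def kfold_splits(n, k):
    """Return k (train_indices, val_indices) pairs covering n observations;
    each observation is validated exactly once, so a sequential tuning run
    costs ~k times a single-partition run unless folds run in parallel."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        splits.append((train, val))
        start += size
    return splits
```

Since the k fold evaluations are independent, they are natural candidates for the parallel / threaded environment mentioned above.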

PARALLEL MODE: DATA PARALLEL VS. MODEL PARALLEL

[Diagram: a Hybrid Solver Manager (LHS, GA, GSS, …) submits configurations x and receives objective values f(x). Left, data distributed / parallel: one model is trained and scored at a time, with training distributed across nodes 1–k. Right, model parallel: each node trains and scores a different configuration.]

“DATA PARALLEL” – DISTRIBUTED TRAIN, SEQUENTIAL TUNE

[Charts: tuning time (seconds) vs. number of nodes used for training. Left: IRIS Forest tuning time, 105 train / 45 validate, 1–128 nodes. Right: Credit data tuning time, 49k train / 21k validate, 1–64 nodes.]

“DATA AND MODEL PARALLEL” – DISTRIBUTED TRAIN, PARALLEL TUNE

• Distributed train, parallel tune:

• Train: 4 workers

• Tune: n parallel

[Chart: time (hours) vs. maximum models trained in parallel (2–32), handwritten data, 60k train / 10k validate. Diagram: Hybrid Solver Manager assigning each configuration’s train/score to a group of four nodes.]
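The "model parallel" side of the diagram can be sketched with a worker pool: each worker trains and scores one candidate configuration. Here `tune_parallel` and `evaluate` are illustrative stand-ins for a real train-and-validate routine, not a SAS API:

```python
from concurrent.futures import ThreadPoolExecutor

def tune_parallel(configs, evaluate, max_workers=4):
    """Evaluate candidate configurations concurrently; each call to
    evaluate(config) stands in for one train+score job on the grid."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(evaluate, configs))
    # pick the configuration with the lowest validation score
    best_score, best_config = min(zip(scores, configs), key=lambda t: t[0])
    return best_config, best_score
```

With real training jobs the workers would be node groups rather than threads, but the structure is the same: throughput scales with the number of models trained in parallel, up to cluster capacity.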

HYPERPARAMETER OPTIMIZATION RESULTS

• Default Neural Network: 9.6% error

• Iteration 1 – Latin Hypercube best: 6.2% error

• Very similar configurations can have very different error…

• Best after tuning: 2.6% error

[Charts: Iteration 1 Latin Hypercube sample error (error % vs. evaluation) and objective (misclassification error %) vs. tuning iteration; handwritten data, 60k train / 10k validate.]

Tuned Neural Net

Hidden Layers    2
Neurons          200, 200
Learning Rate    0.012
Annealing Rate   2.3e-05

MANY TUNING OPPORTUNITIES AND CHALLENGES

• Initial Implementation

• PROCs NNET, TREESPLIT, FOREST, GRADBOOST

• SAS® Viya™ Data Mining and Machine Learning

• Other search methods, extending hybrid solver framework

• Bayesian / surrogate-based Optimization

• New hybrid search strategies

• Selecting best machine learning algorithm

• Parallel tuning across algorithms & strategies for effective node usage

• Combining models

Better (generalized) Models = Better Predictions = Better Decisions


THANK YOU!

[email protected]

[email protected]

sas.com/viya


SAS® VIYA™ PLATFORM

[Backup slide – architecture diagram: Cloud Analytics Services (CAS) distributed in-memory engine, deployable in-cloud, in-database, in-Hadoop, and in-stream; source-based engines; microservices (UAA, query gen, folders, CAS mgmt, data source mgmt, env mgr, model mgmt, log/audit) communicating over parallel & serial channels, pub/sub, web services, and MQs; GUIs for analytics, BI, and data management; APIs, infrastructures, and platforms; solutions spanning Analytics, Data Management, Fraud and Security Intelligence, Business Visualization, Risk Management, and Customer Intelligence.]


EXPERIMENTAL RESULTS – EXPERIMENT SETTINGS

Hyperparameters tuned:

• Decision tree:

  • Depth

  • Number of bins for interval variables

  • Splitting criterion

• Random Forest:

  • Number of trees

  • Number of variables at each split

  • Fraction of observations used to build a tree

• Gradient Boosting:

  • Number of trees

  • Number of variables at each split

  • Fraction of observations used to build a tree

  • L1 regularization

  • L2 regularization

  • Learning rate

• Neural Networks:

  • Number of hidden layers

  • Number of neurons in each hidden layer

  • LBFGS optimization parameters

  • SGD optimization parameters

For this study:

• Linux cluster with 144 nodes

• Each node has 16 cores

• Viya Data Mining and Machine Learning
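The per-algorithm lists above amount to a search-space table that a tuner can sample from. The sketch below is hypothetical: the parameter names and ranges are illustrative placeholders, not the bounds used in the study.

```python
import random

# Illustrative {name: (kind, low, high)} search spaces for two of the
# algorithms listed above; ranges are placeholders, not the study's values.
SEARCH_SPACE = {
    "forest": {
        "n_trees": ("int", 20, 500),
        "vars_per_split": ("int", 1, 50),
        "bootstrap_fraction": ("float", 0.1, 1.0),
    },
    "gradient_boosting": {
        "n_trees": ("int", 20, 500),
        "vars_per_split": ("int", 1, 50),
        "subsample_fraction": ("float", 0.1, 1.0),
        "l1": ("float", 0.0, 10.0),
        "l2": ("float", 0.0, 10.0),
        "learning_rate": ("float", 0.01, 1.0),
    },
}

def sample_config(space, seed=0):
    """Draw one random configuration, respecting integer vs. float kinds."""
    rng = random.Random(seed)
    return {name: (rng.randint(lo, hi) if kind == "int" else rng.uniform(lo, hi))
            for name, (kind, lo, hi) in space.items()}
```

Mixing integer and continuous entries in one space is exactly what makes derivative-free solvers like GA and GSS attractive for this problem.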