Accelerating the Random Forest algorithm for commodity parallel hardware

Mark Seligman

Suiji

August 5, 2015




Outline

Introduction

Random Forests

Implementation

Examples and anecdotes

Ongoing work

Summary and future work

Q & A


Introduction


Arborist project

Began as proprietary implementation of Random Forest (TM) algorithm.

Aim was enhanced performance across a wide variety of hardware, data and workflows.

GPU acceleration a key concern.

Open-sourced and rewritten following the dissolution of the venture.

Arborist is the project name.

Pyborist is the Python implementation, under development.

Rborist is the R package.


Project design goals

Language-agnostic, compiled core.

Minimal reliance on call-backs and external libraries.

Minimize data movement.

Ready extensibility.

Common source base for all spins.


Random Forests


Binary decision trees, briefly

Prediction method presenting a series of true/false questions about the data.

Answer to given question determines which (of two) questions to pose next.

Successive T/F branching relationship justifies “tree” nomenclature.

Different data take different paths through the tree.

Terminal (or “leaf”) node in path reports score for that path (and data).

Can build single tree and refine: “boosting”.

Can build “forest” of (typically) 100–1000 trees.
- Overall average (regression) or plurality (classification) derived from each tree’s score.
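A minimal sketch of this traversal and averaging in C++; the node layout and field names are illustrative, not the Arborist's actual representation:

```cpp
#include <vector>

// One node of a binary decision tree. predictor == -1 marks a leaf.
struct Node {
  int predictor;   // column tested at this node
  double cut;      // numerical criterion: go left iff x <= cut
  int left, right; // indices of the two successor nodes
  double score;    // leaf score for the path ending here
};

using Tree = std::vector<Node>;

// Walk one tree: each true/false answer selects the next question.
double predictTree(const Tree& tree, const std::vector<double>& row) {
  int idx = 0;
  while (tree[idx].predictor >= 0) {
    idx = (row[tree[idx].predictor] <= tree[idx].cut) ? tree[idx].left
                                                      : tree[idx].right;
  }
  return tree[idx].score;
}

// Regression forest: average the per-tree scores.
double predictForest(const std::vector<Tree>& forest,
                     const std::vector<double>& row) {
  double sum = 0.0;
  for (const Tree& t : forest) sum += predictTree(t, row);
  return sum / forest.size();
}
```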


Random Forests

Random Forest is trademarked, registered to Leo Breiman (dec.) and Adele Cutler.

Predicts or validates vector of data (“response”):
- Numerical: “regression”.
- Categorical: “classification”.

Trains on design matrix of observations: “predictor” columns.
- Columns individually either numerical or categorical (“factors”).

Trees trained on randomly-selected (“bagged”) set of matrix rows.

Predictors sampled randomly throughout training, separately chosen for each node.

Validation on held-out subset: different for each tree.

Independent prediction on separately-provided test sets.


Training as tree building

Begins with a root node, together with the bagged set.
- Bagging: can view as indicator set of row indices, with multiplicities.

Subnode construction (“splitting”) is driven by information content.
- Nodes with sufficient information branch into two new subnodes.
- Branch is annotated with splitting criterion, determining its sense.
- If no splitting, the node is terminal: leaf.

Tree construction can proceed depth-first, breadth-first, ...

Construction terminates when frontier nodes exhaust information content.

User may also constrain termination: node count, tree depth, node width, ...
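The breadth-first flavor reduces to a frontier queue; a skeleton, with splitNode() as a hypothetical stand-in for the information-driven decision:

```cpp
#include <queue>
#include <vector>

// Skeleton of breadth-first construction, not the Arborist's actual code.
struct NodeState {
  std::vector<int> rows;  // bagged row indices (with multiplicity)
  int depth;
};

// Placeholder: a real implementation fills the successors and returns
// true only when the node has sufficient information content to split.
bool splitNode(const NodeState&, NodeState&, NodeState&) { return false; }

void train(const std::vector<int>& baggedRows, int maxDepth) {
  std::queue<NodeState> frontier;
  frontier.push({baggedRows, 0});
  while (!frontier.empty()) {  // terminates when frontier is exhausted
    NodeState node = frontier.front();
    frontier.pop();
    NodeState left, right;
    // Split only below the user-imposed depth constraint.
    if (node.depth < maxDepth && splitNode(node, left, right)) {
      left.depth = right.depth = node.depth + 1;
      frontier.push(left);
      frontier.push(right);
    }
    // Otherwise the node is terminal: a leaf, annotated with its score.
  }
}
```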


Building trees: splitting as conditioning

Splitting has the consequence of partitioning the training data into progressively smaller subsets.

Operationally, the splitting criterion conditions the data into complementary subspaces.
- The left successor inherits the subspace satisfying the criterion.
- The right successor inherits its complement.

From this perspective, the root node trains on all bagged observations.

Successor nodes, similarly, train on data conditioned by the parent.

As we’ll see, the conditioned subspaces can be characterized as row sections of the design.

In other words, the splitting criteria define successive bipartitions on row indices.

From this perspective, then, the algorithm would seem to terminate naturally.


Splitting: predictor perspective

Splitting criteria are formulated as order or subset relations with respect to some predictor:

- E.g., numerical predictor: p <= 3.2 ? branch left : branch right.
- Factor predictor: q ∈ {3, 8, 17} ? branch left : branch right.

At a given node, candidate criteria obtained over randomly-sampled set of predictors.

Each predictor evaluates a series of (L/R) trial subsets:
- For numerical predictors, trials are distinct cuts in the linear order.
- For factors, trials are partitions over the runs of identical predictor values.
- Criterion derived from trial maximizing “impurity” (separation) on response.

The predictor/criterion pair best “separating” the response is chosen for splitting.
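For regression on a numerical predictor, the trial walk is a single ordered pass. A sketch, assuming the node's responses arrive already arranged in the predictor's sort order; maximizing sumL^2/nL + sumR^2/nR is equivalent to minimizing the weighted variance:

```cpp
#include <cstddef>
#include <vector>

// Sketch: best cut for one numerical predictor under regression. 'y'
// holds the node's responses, pre-arranged in the predictor's sort
// order; ties at the cut are ignored for brevity.
struct Cut { std::size_t pos; double gain; };

Cut bestCut(const std::vector<double>& y) {
  double total = 0.0;
  for (double v : y) total += v;
  const double n = static_cast<double>(y.size());
  const double base = total * total / n;  // no-split baseline

  Cut best{0, 0.0};
  double sumL = 0.0;
  for (std::size_t i = 0; i + 1 < y.size(); ++i) {  // each distinct cut
    sumL += y[i];
    const double nL = static_cast<double>(i + 1);
    const double sumR = total - sumL;
    const double trial = sumL * sumL / nL + sumR * sumR / (n - nL) - base;
    if (trial > best.gain) best = {i + 1, trial};
  }
  return best;  // pos == 0 signals no informative cut
}
```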


Predictor ordering ⇐⇒ row index permutation

Trial score is a function only of the response, evaluated according to predictor order.

Irrespective of predictor, a given node is scored over a unique set of response indices.
- The role of the predictor is to dictate the order in which to walk the indices.
- That is, predictor values play no role in scoring the trials.
- Hence only predictor ranks (and runs) affect scoring.

Each trial, in particular the “winner”, determines a bipartition of predictor ranks.

The predictor ranks, in turn, define a bipartition of row indices:
- One set of indices characterizes the left branch.
- Its complement characterizes the right branch.

Throughout training, then, the frontier nodes train over a (highly disconnected) partition of the original bagged data, as row sections.


Trial generation: 4 cases, divergent work loads

Regression (weighted variance):
- Numerical predictor: index walk.
- Factor predictor: index walk ↦ run sets; run-set sort; run-set walk, O(#runs).

Classification* (Gini gain):
- Numerical predictor: index walk.
- Factor predictor: index walk ↦ run sets; run-set walk, O(2^#runs).

Index walks are linear in node width, but differ in state maintained.

Power-set walks resort to sampling above ∼10 runs.

*Binary classification walks runs linearly.
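The O(2^#runs) power-set walk can be made concrete with a bitmask enumeration; the run layout and names below are illustrative only:

```cpp
#include <cstddef>
#include <vector>

// Per-side Gini-style score: n * sum of squared class proportions.
// Larger totals across the two sides mean better class separation.
double giniScore(const std::vector<double>& classCounts) {
  double n = 0.0, ss = 0.0;
  for (double c : classCounts) { n += c; ss += c * c; }
  return n > 0.0 ? ss / n : 0.0;
}

// Every proper, nonempty subset of runs is a trial left branch.
// Complementary masks give identical splits; a sketch can afford the
// redundancy. Real code samples rather than enumerate above ~10 runs.
double bestRunSubset(const std::vector<std::vector<double>>& runCounts,
                     unsigned& bestMask) {
  const std::size_t nRuns = runCounts.size();
  const std::size_t nClass = runCounts.front().size();
  double best = 0.0;
  bestMask = 0;
  for (unsigned mask = 1; mask + 1 < (1u << nRuns); ++mask) {
    std::vector<double> left(nClass, 0.0), right(nClass, 0.0);
    for (std::size_t r = 0; r < nRuns; ++r) {
      std::vector<double>& side = (mask >> r & 1u) ? left : right;
      for (std::size_t k = 0; k < nClass; ++k) side[k] += runCounts[r][k];
    }
    const double trial = giniScore(left) + giniScore(right);
    if (trial > best) { best = trial; bestMask = mask; }
  }
  return best;
}
```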


Aside: performance is data-dependent

As with linear algebra, the appropriate treatment depends on the contents of the (design) matrix: SVD, for example.

E.g., regression has regular access patterns and tends to run very quickly.

Constraints in response or predictor values may benefit from numerical simplification.

Will ties play a significant role? Sparse data can train very quickly.

Custom implementations rely heavily on the answer to such questions.

It therefore makes sense to strive for extensibility and ease of customization.


Data locality

Computers store data in a hierarchy:
- Registers.
- Caches (L1 through L3).
- RAM.
- Disk.

CPU operates on registers.

Loading registers consumes many clock cycles, depending upon position in hierarchy.

Performance therefore best when data is spatially (hence temporally) local.

Similarly, loops over vectors most efficient when data in consecutive iterations separated by predictable and short(ish) strides.

“Regular” access patterns allow compiler, and hence hardware, to do a good job.

Regularity is crucial for GPUs, which excel at performing identical operations on contiguous data.
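A toy illustration of the stride point, not Arborist code: both loops sum the same row-major matrix, but only the unit-stride version gives the compiler and prefetcher a regular pattern to exploit:

```cpp
#include <cstddef>
#include <vector>

// Unit-stride traversal: consecutive addresses, cache- and SIMD-friendly.
double sumRowMajorFriendly(const std::vector<double>& m,
                           std::size_t rows, std::size_t cols) {
  double s = 0.0;
  for (std::size_t r = 0; r < rows; ++r)
    for (std::size_t c = 0; c < cols; ++c)
      s += m[r * cols + c];  // stride 1
  return s;
}

// Column-first traversal of the same layout: stride of 'cols' doubles,
// touching one element per cache line; markedly slower on large matrices.
double sumRowMajorHostile(const std::vector<double>& m,
                          std::size_t rows, std::size_t cols) {
  double s = 0.0;
  for (std::size_t c = 0; c < cols; ++c)
    for (std::size_t r = 0; r < rows; ++r)
      s += m[r * cols + c];  // stride == cols
  return s;
}
```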


Algorithm: observations

Splitting is “embarrassingly parallel”: trials can be evaluated on all nodes in the frontier, and all candidate predictors, at the same time.

However, ranks corresponding to a node’s row indices vary with predictor.

Naive solution is to sort observations at each splitting step.
- Approach used by early implementations.
- Does not scale well.

Predictor ordering used repeatedly: suggests pre-ordering.
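Pre-ordering amounts to a one-time argsort of each predictor column; a minimal sketch:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// For each predictor column, record the permutation of row indices that
// sorts it. Training then walks these index vectors instead of
// re-sorting observations at every split.
std::vector<std::vector<std::size_t>>
preOrder(const std::vector<std::vector<double>>& columns) {
  std::vector<std::vector<std::size_t>> order(columns.size());
  for (std::size_t p = 0; p < columns.size(); ++p) {
    order[p].resize(columns[p].size());
    std::iota(order[p].begin(), order[p].end(), std::size_t{0});
    std::stable_sort(order[p].begin(), order[p].end(),
                     [&](std::size_t a, std::size_t b) {
                       return columns[p][a] < columns[p][b];
                     });
  }
  return order;
}
```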


Algorithm: cont.

With pre-ordering, index walk accumulates per-node state by row index lookup.
- Original Arborist approach.
- No data locality, as index lookup is irregular.
- Large state budget: must be swapped as indexed node changes.

“Restaging”: maintain separately-sorted state vectors for each predictor, by node.
- Begin with pre-sorted list, update via (stable) bipartition at each node.
- Current Arborist approach. Data locality improves with tree depth.
- Only modest amount of state to move: 16 bytes to include doubles.
- Splitting becomes quite regular: next datum prefetchable by hardware.
- Each node/predictor pair restageable on SIMD hardware:
  - Partition using parallel scan.
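A sketch of one restaging step, under assumed field names: a single stable pass sends each record to a successor, and both sides remain sorted by predictor rank:

```cpp
#include <cstddef>
#include <vector>

// Per-observation state carried through restaging: roughly 16 bytes.
struct Record {
  unsigned row;   // row index within the bagged set
  unsigned rank;  // predictor rank (the value itself is not needed)
  double   y;     // response, carried along for scoring
};

// Stable bipartition: relative order within each side is preserved,
// so both successors stay sorted by predictor rank.
void restage(const std::vector<Record>& parent,
             const std::vector<bool>& goesLeft,  // indexed by row
             std::vector<Record>& left,
             std::vector<Record>& right) {
  left.clear();
  right.clear();
  for (const Record& rec : parent)
    (goesLeft[rec.row] ? left : right).push_back(rec);
}
```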


Implementation


Organization

Compiled code with various language front-ends.

R was the driving language, but Python now under active development.

Front-end “bridges” wherever possible: Rcpp, Cython.

Minimal use of front-end call-backs: PRNG, sampling, sorting.

Common code base also supports GPU version, largely as a subtyped extension.


Look and feel

Guided by existing packages. Many options the same, or similar.

Supports only numeric and factor data: leaves type-wrangling to the user.

Predictor sampling: continuous predProb (Bernoulli) vs. discrete max features (w/o replacement).

Breadth-first implementation; introduces specification of terminating level.

Introduces information-based stopping criterion, minRatio.

Unlimited (essentially) factor cardinality: blessing as well as curse.

Many useful package options remain to be implemented.
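The two predictor-sampling regimes differ only in whether the candidate count is fixed. An illustrative sketch; predProb echoes the Rborist option, the fixed count echoes mtry / max features:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// predProb-style sampling tosses a coin per predictor, so the candidate
// count varies node to node.
std::vector<std::size_t> sampleBernoulli(std::size_t nPred, double predProb,
                                         std::mt19937& gen) {
  std::bernoulli_distribution coin(predProb);
  std::vector<std::size_t> picks;
  for (std::size_t p = 0; p < nPred; ++p)
    if (coin(gen)) picks.push_back(p);
  return picks;
}

// mtry-style sampling draws a fixed-size subset without replacement.
std::vector<std::size_t> sampleFixed(std::size_t nPred, std::size_t mtry,
                                     std::mt19937& gen) {
  std::vector<std::size_t> all(nPred);
  std::iota(all.begin(), all.end(), std::size_t{0});
  std::shuffle(all.begin(), all.end(), gen);
  all.resize(mtry);  // exactly mtry candidates
  return all;
}
```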


Distinguishing features

Decoupling of splitting from row-lookup: restaging + highly regular node walk.

Both stages benefit from resulting data locality and regularity.

Restaging maintained as stable partition (amenable to SIMD parallelization).

Training produces lightweight, serial “pre-tree”.

Rich intermediate state: e.g., frontier maps reveal quantiles.

Amenability to workflow internalization: “loopless” behavior.


Training wheels, early experience

Began with R front end.

Compare performance with randomForest package.

“Medium” data: large, but in-memory.

Speedups typically observed as row counts approach 500–1000.

Linear scaling with # predictors, # trees, as expected.

Log-linear with # rows, also as expected.

Regression much easier to accelerate than classification.


Examples and anecdotes: R


Feature trials: Bernoulli vs. w/o replacement

German credit-scoring data [1]: binary response, 1000 rows.

7 numerical predictors.

13 categorical predictors, cardinalities range from 2 to 10.

Graphs to follow compare misprediction and execution times at various predictor selections.

Accuracy, as a function of trial metric:
- Bernoulli (blue) and w/o replacement (green) appear to track well.

Performance of Rborist generally 2–3× in this regime.

[1] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.


[Figure: Misprediction: predProb (Rborist) vs. mtry (RF). x-axis: mtry (default = 4) or equivalent (default = 4.8); y-axis: misprediction rate, roughly 0.20 to 0.32.]


[Figure: Execution time ratios: randomForest / Rborist. x-axis: (equivalent) mtry; y-axis: ratio, roughly 2.0 to 4.5.]


Instructive user account

Arborist performance problem noted in a recent blog post [2].

Airline flight-delay data tested on various RF packages.

8 predictors; various row counts: 10^4, 10^5, 10^6, 10^7.

Slowdown appears due to an error in large-cardinality sampling.
- In fact, the Github version had already repaired the salient problem.

Nonetheless, the exercise suggests improvements, some already implemented and some still to come.

[2] “Benchmarking Random Forest Implementations”, Szilard Pafka, DataScience.LA, May 19, 2015.


Account, cont.

Splitting now parallelized across all pairs, rather than by-predictor.

Class weighting to treat unbalanced data.

Binary classification with high-cardinality factors:
- Replaces sampling with an n log n method (sketched below).

Points to need for more, and broader, testing.
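The slides do not spell out the n log n method. One standard technique for binary responses, sketched below as an assumption rather than the Arborist's confirmed approach: order the factor's runs by their class-1 proportion, after which a linear scan of the cuts suffices and the exponential subset walk is avoided:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// For a binary response, the optimal subset split lies on a cut of the
// order sorted by class-1 proportion (Breiman et al.), so an n log n
// sort plus a linear walk replaces the 2^#runs subset walk.
struct Run { double n1, n; };  // class-1 count and total count per level

std::size_t bestFactorCut(std::vector<Run> runs) {
  std::sort(runs.begin(), runs.end(), [](const Run& a, const Run& b) {
    return a.n1 / a.n < b.n1 / b.n;  // order levels by response rate
  });
  double tot1 = 0.0, tot = 0.0;
  for (const Run& r : runs) { tot1 += r.n1; tot += r.n; }

  double best = 0.0, l1 = 0.0, l = 0.0;
  std::size_t bestPos = 0;
  for (std::size_t i = 0; i + 1 < runs.size(); ++i) {
    l1 += runs[i].n1; l += runs[i].n;
    const double r1 = tot1 - l1, r = tot - l;
    // Gini-style score: larger means better class separation.
    const double trial = l1 * l1 / l + (l - l1) * (l - l1) / l
                       + r1 * r1 / r + (r - r1) * (r - r1) / r;
    if (trial > best) { best = trial; bestPos = i + 1; }
  }
  return bestPos;  // levels [0, bestPos) of the sorted order branch left
}
```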


GPU: pilot study with University of Washington team

GWAS data provided by Dept. Global Health [3].
- 100 samples, up to ∼10^6 predictors.
- Binary response: HIV detected or not.
- Purely categorical predictors with cardinality 3: SNPs.

Bespoke CPU and GPU versions spun off for the data set.

Each tree trained (almost) entirely on GPU.

Results illustrate potential - for highly regular data sets.

Drop-off on right is an artefact from copying data.frame.

[3] Courtesy of the Lingappam lab.


CPU vs. GPU: execution time ratios of bespoke versions

[Figure: CPU vs GPU timing ratios (1000 trees). x-axis: predictor count, 0 to 250,000; y-axis: ratio, roughly 15 to 50.]


Ongoing work


GPU-centric packages

Ad hoc version not scalable as implemented; since rewritten.

Restaging now implemented as stable partition via parallel scan (sketched below).

Nvidia engineers concur with this solution, anticipate good scaling.

In general, though, data need not be so regular as this.
- Mixed predictor types and multiple cardinalities present load-balancing challenges.
- Dynamic parallelism option available for irregular workloads.
- Predictor selection thwarts data locality: adjacent columns not necessarily used.

Lowest-hanging fruit may be isolated special cases such as SNP data.
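The scan-based stable partition can be illustrated sequentially: an exclusive prefix sum of the left/right flags fixes every record's destination before anything moves, which is exactly what makes the scatter step fully parallel on a GPU. A sketch, not the Arborist's device code:

```cpp
#include <cstddef>
#include <vector>

// flag[i] == 1 sends in[i] left; 0 sends it right. Order is preserved
// on both sides, so this is a stable partition.
template <typename T>
std::vector<T> scanPartition(const std::vector<T>& in,
                             const std::vector<int>& flag) {
  const std::size_t n = in.size();
  std::vector<std::size_t> leftIdx(n);
  std::size_t sum = 0;
  for (std::size_t i = 0; i < n; ++i) {  // exclusive prefix sum of flags
    leftIdx[i] = sum;
    sum += flag[i];
  }
  const std::size_t nLeft = sum;
  std::vector<T> out(n);
  for (std::size_t i = 0; i < n; ++i) {  // scatter: destinations disjoint,
    const std::size_t dest =            // hence safely parallelizable
        flag[i] ? leftIdx[i] : nLeft + (i - leftIdx[i]);
    out[dest] = in[i];
  }
  return out;
}
```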


GPU vs. CPU

Highly regular regression/numeric case may perform well on GPU.

On-GPU transpose: restaging and splitting employ different major orderings.

For now: split on CPU and restage (highly regular) on GPU.

Multiple trees can be restaged at once via software pipeline:
- Masks transfer latency by overlapping training of multiple trees.
- Keeps CPU busy by dispatching less-regular tasks to multiple cores.


CPU-level parallelism

Original work focused on predictor-level parallelism, emphasizing wider data sets.

Node-level parallelism has emerged as an equal player (e.g., flight-delay data).
- But with high core count and closely-spaced data, false sharing looms as a potential threat.

Infrastructure now in place to support hierarchical parallelization:
- Head node orders predictors and scatters copies.
- Multiple nodes each train blocks of trees on multicore hardware.
- GPU participation also possible.
- Head node gathers pretrees, builds forest, validates.
- Remaining implementation chiefly a matter of scheduling and tuning.
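Flattened node-by-predictor parallelism, as mentioned for splitting, can be sketched with OpenMP; splitPair() is a hypothetical stand-in:

```cpp
#include <cstddef>
#include <vector>

// Illustrative only (compile with -fopenmp). splitPair() stands in for
// evaluating one node/predictor candidate's best trial.
struct Pair { int node, predictor; };

double splitPair(const Pair&) { return 0.0; }  // placeholder evaluation

void splitFrontier(const std::vector<Pair>& pairs,
                   std::vector<double>& gain) {
  gain.assign(pairs.size(), 0.0);
  // Flattening nodes x predictors into one loop keeps every core busy
  // even when the frontier is narrow; dynamic scheduling absorbs the
  // divergent work loads of numerical vs. factor candidates.
  #pragma omp parallel for schedule(dynamic)
  for (long i = 0; i < static_cast<long>(pairs.size()); ++i)
    gain[i] = splitPair(pairs[i]);
}
```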


CPU: load balancing

Mixed factor and numerical predictors offer the greatest challenge, especially for classification.

In some cases, may benefit from parallelizing trial generations themselves.

Irrespective of data types, inter-level pass following splitting is inherently sequential.

May make sense to pipeline: overlap splitting of one tree with the inter-level pass of another.

N.B.: Much more performance-testing is needed to investigate these scenarios.


Additional projects

Sparse internal representation.
- Inchoate; main challenge is defining the interface.

NA handling.
- Some variants easier to implement than others.

Post-processing: facilitate use by other utilities.
- Feature contributions.


Pyborist: goals

Encourage flexing, testing by broader ML community.

Honor precedent of scikit-learn: features, style.

Provide R-like abstractions: data frames and factors.

Attempt to minimize impact of host language on user data.

Stress software organization and design.


Pyborist: key ingredients

Cython bridge: emphasis on compilation.

Pandas: “dataframe” and “category” essential.

NumPy: PRNG, sort and sampling call-backs.

Considered other options: SWIG, CFFI, ctypes ...


Summary and future work


Summary

Only a few rigid design principles:
- Constrain data movement.
- Language-agnostic, compiled core implementation.
- Common source base.

Plenty of opportunities for improvement.

Load balancing appears to be lowest-hanging fruit: both CPU and GPU.


Longer term

Solicit help, comments from the community.

Expanded use of templating: large, small types.

Out-of-memory support.

Generalized dispatch, fat binaries.

Plugins for other splitting methods: non-Gini.

Internalize additional workflows.


Acknowledgments

Stephen Elston, Quanta Analytics.

Abraham Flaxman, Dept. Global Health, U.W.

Seattle PuPPy.


Thank you

[email protected]

@suijidata

blog.suiji.org (“Mood Stochastic”)

github.com/suiji/Arborist
