Transcript of "Accelerating a random forest classifier: multi-core, GP-GPU, or FPGA?" (18 slides)

1 Accelerating a random forest classifier: multi-core, GP-GPU, or FPGA?

2 Introduction

Purpose of the paper:

Compare and contrast the effectiveness of FPGAs, GP-GPUs, and multi-core CPUs for accelerating classification with models generated by compact random forest (CRF) machine learning classifiers.

Topics in paper:

Random Forest Classification

Implementation of CRFs on the Selected Devices

Results from Implementation

3 Random Forest Classifier

Definition: A random forest is a classifier consisting of a collection of tree-structured classifiers {h(x, Θk), k = 1, ...}, where the {Θk} are independent, identically distributed random vectors and each tree casts a unit vote for the most popular class at input x [1]. (A code sketch of this voting scheme appears below.)

Key Determining Features:

Number of Decision Trees

Depth of Decision Trees
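
As a concrete illustration of the definition above, here is a minimal C++ sketch (not the paper's code) of tree traversal and unit voting; the Node layout and all names are illustrative assumptions.

```cpp
// Minimal sketch (not the paper's code): each tree h(x, Theta_k) walks
// from root to a leaf and casts one unit vote, and the forest returns
// the most popular class at input x.
#include <algorithm>
#include <vector>

struct Node {
    int   feature = 0;     // index of the feature this node tests
    float thresh  = 0.0f;  // split threshold
    int   left    = -1;    // child indices; -1 marks a leaf
    int   right   = -1;
    int   label   = 0;     // class label, valid only at leaves
};

using Tree = std::vector<Node>;  // node 0 is the root

// Walk one tree to a leaf and return its class label.
int classify(const Tree& tree, const std::vector<float>& sample) {
    int i = 0;
    while (tree[i].left != -1)   // still at an internal node
        i = (sample[tree[i].feature] <= tree[i].thresh) ? tree[i].left
                                                        : tree[i].right;
    return tree[i].label;
}

// Each tree casts a unit vote; the most popular class wins.
int forestVote(const std::vector<Tree>& forest,
               const std::vector<float>& sample, int numClasses) {
    std::vector<int> votes(numClasses, 0);
    for (const Tree& t : forest)
        ++votes[classify(t, sample)];
    return static_cast<int>(
        std::max_element(votes.begin(), votes.end()) - votes.begin());
}
```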

4 Challenges and Solutions

Challenges

Hardware acceleration is hard to apply because decision trees vary significantly in shape and depth.

Tree traversal is data dependent, which makes deterministic memory access into the trees difficult.

Making processing time identical for every sample requires fully populated trees, which is expensive.

Classification is not compute intensive; the computation-to-communication ratio is poor.

Solution: Compact Random Forests

Researchers at LLNL developed an efficient training algorithm that minimizes tree depth to produce a compact random forest.

Allows for fully populating all decision trees.

Makes the forest small enough to fit in the memory of one or more accelerators and to tap their internal memory bandwidth (see the array-layout sketch below).
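
To illustrate why full population helps, here is a hedged C++ sketch, assuming a complete binary tree of fixed depth stored as a flat array; the FlatNode layout and names are illustrative, not LLNL's implementation.

```cpp
// Minimal sketch (an assumption, not LLNL's code) of the payoff: a fully
// populated tree of fixed depth kDepth can be stored as a flat array with
// implicit child indexing, so every sample performs exactly kDepth
// comparisons and memory access becomes deterministic.
#include <array>
#include <vector>

constexpr int kDepth = 6;                        // max depth used in the paper
constexpr int kNodes = (1 << (kDepth + 1)) - 1;  // nodes in a complete tree

struct FlatNode {
    int   feature = 0;
    float thresh  = 0.0f;
    int   label   = 0;  // meaningful only at the leaf level
};

using FlatTree = std::array<FlatNode, kNodes>;   // node 0 is the root

// Node i's children sit at 2i+1 and 2i+2, so the next index is computed
// rather than looked up, and the loop trip count never varies.
int classifyFlat(const FlatTree& tree, const std::vector<float>& sample) {
    int i = 0;
    for (int level = 0; level < kDepth; ++level) {
        bool goRight = sample[tree[i].feature] > tree[i].thresh;
        i = 2 * i + 1 + (goRight ? 1 : 0);
    }
    return tree[i].label;  // i now points at a leaf
}
```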

5 Training Compact Random Forest Classifiers

The CRF training algorithm accepts a maximum-tree-depth parameter and generates trees no deeper than that limit.

The trees are derived using LogitBoost.

A "URL Reputation" data set, which labels each sample as either malicious or benign, is used for training.

Data from 121 days is split: days 0-59 for training, days 60-120 for testing.

6 OpenMP Algorithm on a Shared-Memory Multiprocessor

Doubly nested loop that iterates over samples and trees in the forest.

For the multi-core CPU performance tests, the data was run through sparse, irregular trees that terminated traversal as early as possible.

OpenMP exploits the data parallelism between samples: each sample is processed independently, which gives the best performance. A single OpenMP pragma on the sample loop accomplishes this (see the sketch below).
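
A minimal C++/OpenMP sketch of this doubly nested loop, with the structure assumed from the slide's description rather than taken from the paper's source; it reuses FlatTree and classifyFlat() from the earlier sketch.

```cpp
// Samples are independent, so one pragma on the outer loop distributes
// them across cores while each thread walks every tree serially.
#include <vector>
#include <omp.h>

std::vector<int> classifyAll(const std::vector<FlatTree>& forest,
                             const std::vector<std::vector<float>>& samples,
                             int numClasses) {
    std::vector<int> results(samples.size());
    #pragma omp parallel for schedule(static)
    for (long s = 0; s < static_cast<long>(samples.size()); ++s) {
        std::vector<int> votes(numClasses, 0);  // private per iteration
        for (const FlatTree& t : forest)        // inner loop over trees
            ++votes[classifyFlat(t, samples[s])];
        int best = 0;
        for (int c = 1; c < numClasses; ++c)
            if (votes[c] > votes[best]) best = c;
        results[s] = best;  // each thread writes only its own slots
    }
    return results;
}
```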

7 FPGA Algorithm Implementation

Targeted Hardware:

Hitech Global HTG-V6-PCIE-L240-1 board with an XC6VLX240T-1FFG1759 Virtex-6 FPGA.

Key Parameters:

Depth:

Directly affects flip-flop usage.

Data Width:

Very wide samples are hard to multiplex, taxing FPGA routing resources.

Parameter Sizes Used

Max Depth: 6

Data width: 2048 bits

8 FPGA Implementation Continued: Basic Compact Forest Implementation

There are ‘n’ trees in the system.

Each tree has ‘s’ stages.

Each stage represents all the nodes in that level.

An in-house gigabit Ethernet core is used for communication.

On start-up, the model data is loaded into the trees' pipelines.

Once configured, sample data is streamed to the FPGA, where it enters the CRF's data pipeline, which aligns the data with the stages in each tree.

Small problem with this implementation:

Distributing a wide data pipeline to multiple destinations creates routing issues (a software model of the staged pipeline follows below).
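
The staged design can be modeled in software. The following C++ sketch is an assumed illustration of the idea, not the FPGA design itself, reusing FlatTree/FlatNode from the earlier sketch: stage k holds the level-k comparison logic, each tick advances every in-flight sample one level, and a new sample may enter every cycle, so a classification emerges s cycles after a sample enters.

```cpp
#include <array>
#include <optional>
#include <utility>
#include <vector>

constexpr int kStages = 6;  // one stage per tree level (max depth 6)

struct InFlight {
    std::vector<float> sample;  // the sample moving down the pipeline
    int nodeIndex = 0;          // current node within the flat tree
};

struct StagePipeline {
    FlatTree tree;
    std::array<std::optional<InFlight>, kStages> stage;

    // One clock tick: every occupied stage performs its comparison and
    // passes its sample on; the last stage retires a finished label.
    std::optional<int> tick(std::optional<InFlight> incoming) {
        std::optional<int> retired;
        for (int k = kStages - 1; k >= 0; --k) {    // back to front
            if (!stage[k]) continue;
            InFlight f = std::move(*stage[k]);
            stage[k].reset();
            const FlatNode& n = tree[f.nodeIndex];  // level-k comparison
            bool right = f.sample[n.feature] > n.thresh;
            f.nodeIndex = 2 * f.nodeIndex + 1 + (right ? 1 : 0);
            if (k == kStages - 1)
                retired = tree[f.nodeIndex].label;  // reached a leaf
            else
                stage[k + 1] = std::move(f);
        }
        stage[0] = std::move(incoming);             // next sample enters
        return retired;
    }
};
```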

9 FPGA Implementation Continued

Introducing Clumps:

Clumps have the same architecture as the CRF.

Each clump can contain anywhere from one tree to all of the trees in the forest.

Increasing the number of clumps in the design reduces routing but increases flip-flop usage.

Even with clumps, it was apparent that the full CRF would not fit on one FPGA.

8 trees were placed on an LX240 FPGA, and 16 on an LX550T-2.

10 FPGA Implementation Continued: Tree Implementation

The most direct implementation would be a specialized block of logic for each node in the tree; this would decrease memory requirements but increase routing logic.

Instead, the design uses a single block of logic at each level to implement the functionality of any node at that level.

Memory distribution is now more complicated:

BRAMs are fast but limited, so they are used only on levels with 32 or more nodes (in a depth-6 tree, the two deepest levels, with 32 and 64 nodes).

Flip-flops are slow but plentiful, so they are used for the lower levels.

11 GP-GPU Algorithm Implementation

Each processor uses independent threads to process a small number of samples in parallel on a portion of the CRF.

Memory is broken into two portions, sample data and forest data.

Forest data is small and loaded once and re-used for every sample.

Sample data is constantly changing, so it would consume too many resources if every processor had to handle all of it.

The design is divided into blocks: each processor runs certain samples against certain trees within the CRF, which avoids straining resources (see the tiling sketch below).
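
A host-side C++ sketch of this blocked decomposition (an assumed model, not the paper's GPU code): the samples-by-trees grid is tiled so each block pairs a small tile of samples with a subset of trees, accumulating partial votes that are reduced afterwards. The tile sizes are illustrative, and FlatTree/classifyFlat() come from the earlier sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

void classifyTiled(const std::vector<FlatTree>& forest,
                   const std::vector<std::vector<float>>& samples,
                   int numClasses, std::vector<int>& results) {
    const std::size_t kSampleTile = 32;  // samples per block (assumed)
    const std::size_t kTreeTile   = 8;   // trees per block (assumed)

    // votes[s][c]: partial tally for sample s, class c.
    std::vector<std::vector<int>> votes(
        samples.size(), std::vector<int>(numClasses, 0));

    // Each (i, j) pair models one GPU block: its tree tile is small enough
    // to stay resident in fast memory while its sample tile streams past.
    for (std::size_t i = 0; i < samples.size(); i += kSampleTile)
        for (std::size_t j = 0; j < forest.size(); j += kTreeTile)
            for (std::size_t s = i; s < std::min(i + kSampleTile, samples.size()); ++s)
                for (std::size_t t = j; t < std::min(j + kTreeTile, forest.size()); ++t)
                    ++votes[s][classifyFlat(forest[t], samples[s])];

    // Reduce partial votes to one class per sample.
    results.assign(samples.size(), 0);
    for (std::size_t s = 0; s < samples.size(); ++s)
        for (int c = 1; c < numClasses; ++c)
            if (votes[s][c] > votes[s][results[s]]) results[s] = c;
}
```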

12 Results

Recap of the hardware used for each design:

Multi-core CPU and GP-GPU: a 2-socket Intel X5660 Westmere system with 12 cores at 2.8 GHz and 96 GB of DRAM, plus an attached NVIDIA Tesla M2050 with 3 GB of GDDR5.

FPGA: Hitech Global HTG-V6-PCIE-L240-1

Testing Parameters:

Maximum tree depth of 6

Data width of 2048 bits

Criteria Evaluated to Compare:

Performance

Power and Cost

Scalability

Problems encountered:

The authors were unable to acquire the four FPGA boards required to run the full implementation, so they improvised with the one board plus a smaller board, ran a partial implementation, and made assumptions about the full system.

Power figures were taken from data sheets, since power could not be measured directly given the partial FPGA system and the working environment of the CPU and GPU setup.

13 Results - Performance

For a fair performance test, the trees were fully implemented and populated before measuring results.

The results are in kilosamples per second (KSps):

CPU: 9,291 KSps (12 threads) and 884 KSps (1 thread), roughly a 10.5x speed-up on 12 threads

GPU: 20,398 KSps (14 processors with 1,536 threads per processor)

FPGA: 31,250 KSps (with 4 LX240s)

14 Results – Power and Cost

As mentioned above, power could not be measured directly, so data sheets were used:

Power for the Tesla M2050 is listed as <= 225 W.

Power for the Intel Westmere-EP X5660 is listed as 95 W.

The power consumption of the FPGAs was estimated with the Xilinx XPower Estimator.

15 Results - Scalability

Because this is a machine learning algorithm, the size of the CRF depends on the complexity of the input and the required classification accuracy.

The FPGA-based system can be scaled out to support moderately large forests by adding hardware.

The following are the results in kilosamples per second when the number of trees was increased to 234, almost seven times the original size:

CPU: 1,044 KSps (12 threads) and 93 KSps (1 thread)

GPU: 5,381 KSps (14 processors with 1,536 threads per processor)

FPGA: 31,250 KSps (with 8 LX240s)

16 Discussion and Conclusion

FPGAs offer the highest level of performance and performance per Watt.

FPGAs are built to support a maximum CRF size and require additional hardware to scale to larger classifiers.

GP-GPUs offer good performance that degrades slowly with larger classifiers.

GP-GPUs still have hard resource bounds that are sensitive to classifier or sample size.

Multi-core CPUs with OpenMP make it extremely simple to achieve scalable, near-linear performance.

17 Questions?

18 References

[1] L. Breiman, "Random Forests" (2001), http://oz.berkeley.edu/~breiman/randomforest2001.pdf (source of the definition).

[2] B. Van Essen, C. Macaraeg, M. Gokhale, and R. Prenger (2012). Accelerating a random forest classifier: multi-core, GP-GPU, or FPGA? Livermore, CA: Lawrence Livermore National Laboratory.