IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 14, NO. 5, SEPTEMBER 2003 993

A Digital Architecture for Support Vector Machines: Theory, Algorithm, and FPGA Implementation

Davide Anguita, Member, IEEE, Andrea Boni, and Sandro Ridella, Member, IEEE

Abstract—In this paper, we propose a digital architecture for support vector machine (SVM) learning and discuss its implementation on a field programmable gate array (FPGA). We briefly analyze the quantization effects on the performance of the SVM in classification problems, to show its robustness, in the feedforward phase, with respect to fixed-point math implementations; then, we address the problem of SVM learning. The architecture described here makes use of a new algorithm for SVM learning which is less sensitive to quantization errors than the solutions that have appeared so far in the literature. The algorithm is composed of two parts: the first one exploits a recurrent network for finding the parameters of the SVM; the second one uses a bisection process for computing the threshold. The architecture implementing the algorithm is described in detail and mapped on a real current-generation FPGA (Xilinx Virtex II). Its effectiveness is then tested on a channel equalization problem, where real-time performance is of paramount importance.

Index Terms—Digital neuroprocessors, field programmable gate arrays (FPGAs), quantization effects, support vector machine (SVM).

I. INTRODUCTION

THE hardware implementation of neural networks has recently attracted new interest from the neurocomputing community, despite the skepticism generated by the devices of the first generation, which appeared during the 1980s (see, for example, [1] for a survey of these solutions). Nowadays, two research areas can be easily identified: the first one focuses on biologically inspired devices and builds on Mead's seminal work [2]. In this case, the physical behavior of microelectronic devices is exploited to realize complex information processing with very low requirements in terms of area and power consumption [3].

The second area addresses the hardware implementation of algorithms which are inspired by the neurocomputing framework but not necessarily justified from a biological point of view. Its main target is the design of dedicated analog or digital hardware with improved characteristics (e.g., performance, silicon area, power consumption, etc.) with respect to a software implementation on a general-purpose microprocessor. This approach has been the most criticized in the past [4], mainly because one of the main targets of the research has been raw computational power. There are, however, many other advantages in

Manuscript received September 15, 2002.

D. Anguita and S. Ridella are with the Department of Biophysical and Electronic Engineering, University of Genova, 16145 Genoa, Italy (e-mail: [email protected]; [email protected]).

A. Boni is with the Department of Information and Communication Technologies, University of Trento, Trento, Italy (e-mail: [email protected]).

Digital Object Identifier 10.1109/TNN.2003.816033

designing special-purpose devices, as recalled before, and in some cases (e.g., embedded systems) a dedicated solution can be preferable [5].

Our work fits in this last framework: we propose a new algorithm for support vector machine (SVM) learning and a digital architecture that implements it. The SVM is a learning-by-example paradigm recently proposed by Vapnik and based on his statistical learning theory [6]. After the first preliminary studies, SVMs have shown remarkable efficiency, especially when compared with traditional artificial neural networks (ANNs), like the multilayer perceptron. The main advantage of the SVM, with respect to ANNs, lies in the structure of the learning algorithm, which is characterized by the resolution of a constrained quadratic programming problem (CQP), where the drawback of local minima is completely avoided. Our algorithm improves on the proposals that have appeared so far (e.g., [17]) and consists of two parts: the first one, previously reported in the literature, solves the CQP with respect to all the parameters of the network except the threshold, while the second one allows the computation of such a threshold by using an iterative procedure.

The proposed algorithm can be easily mapped to a digital architecture: to assess the effectiveness of our approach, and to measure the actual performance of both the algorithm and the architecture, we implemented our solution on a field programmable gate array (FPGA) and tested it on a telecommunication application. Our choice is motivated by recent advances in FPGA technology, which allows easy reprogrammability, fast development times, and reduced effort with respect to full-custom very large-scale integration (VLSI) design. At the same time, the advances in microelectronics process technology allow the design of FPGA-based digital systems with performances very close to the ones obtained by a manual full-custom layout (see, for example, [25], which details an efficient implementation of a neural processor).

In Section II, we briefly review the SVM. Section III addresses the hardware implementation of SVMs when targeting digital solutions. New results on the robustness of this learning machine with respect to quantization errors are presented and, after surveying the state of the art of SVM learning algorithms targeted to VLSI implementations, our new proposal is detailed. The digital architecture is described in Section IV, and its FPGA implementation, along with the experiments on a telecommunication application, is described in Section V.

II. SVM

The objective of SVM learning is to find a classification function that approximates the unknown input–output relation on the basis of a set of measures {(x_i, y_i)}, i = 1, …, m, where x_i ∈ R^n is an input pattern and y_i ∈ {−1, +1} is the corresponding target.

To allow for nonlinear classification functions, the training points are mapped from the input space R^n to a feature space R^N, with N > n, through a nonlinear mapping φ(·); then, a simple linear hyperplane is used for separating the points in the feature space. By using clever mathematical properties of nonlinear mappings, SVMs avoid explicitly working in the feature space, so that the advantages of the linear approach are retained even though a nonlinear separating function is found (see [6]–[8] for more details).

The classification function is given by

    f(x) = sign(w · φ(x) + b)    (1)

where w is the normal vector of the separating hyperplane. Among all possible hyperplanes, SVMs find the one that corresponds to a function having maximal margin or, in other words, maximum distance from the points of each class. Since it is not common that all the points can be correctly classified, even if mapped to the high-dimensional feature space, the SVM allows for some errors but penalizes their cardinality.

Formally, this description leads to a CQP, by requiring us to find

    min_{w,b,ξ} (1/2)||w||² + C Σ_{i=1}^{m} ξ_i    (2)

subject to

    y_i (w · φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, m.    (3)

This is usually referred to as the primal CQP. The function to be minimized is composed of two parts: the first one forces the hyperplane to have a maximal margin, while the second term penalizes the presence of misclassified points. The constant C simply sets the tradeoff between the two terms.

One of the reasons for the SVM's success is the fact that this approach can be used to control the complexity of the learning machine, namely through the structural risk minimization principle [6]. This principle provides a method for controlling the complexity of the learning machine and defines upper bounds on its generalization ability, albeit in a statistical framework.

The above primal CQP is usually rewritten in dual form by using Lagrange multiplier theory [6]

    min_α (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} q_ij α_i α_j − Σ_{i=1}^{m} α_i    (4)

subject to

    0 ≤ α_i ≤ C,  i = 1, …, m,   Σ_{i=1}^{m} y_i α_i = 0    (5)

where

    q_ij = y_i y_j φ(x_i) · φ(x_j)    (6)

TABLE I. ALGORITHM 1: DSVM WITH FIXED BIAS

which allows us to write w using the dual variables

    w = Σ_{i=1}^{m} α_i y_i φ(x_i).    (7)

There are several advantages in using the dual formulation: the main one is that there is no need to know explicitly the function φ(·), but only the inner product between two points in the feature space. This is the well-known kernel trick, which allows one to deal implicitly with nonlinear mappings through the use of kernel functions

    K(x_i, x_j) = φ(x_i) · φ(x_j).    (8)

Using the above notation, the dual CQP can be rewritten in a compact form, as follows:

    min_α (1/2) α^T Q α − α^T 1   subject to   0 ≤ α ≤ C1,  α^T y = 0    (9)

where Q = {q_ij} and 1 is a vector of all ones. Since the seminal works on kernel functions, many kernels of the form given by (8) have been found; among them are the linear, the Gaussian, and the polynomial kernels

    K(x_i, x_j) = x_i · x_j
    K(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²))
    K(x_i, x_j) = (1 + x_i · x_j)^p.    (10)

TABLE II. ALGORITHM 2: FIBS

Another advantage of using kernel functions lies in the fact that they are positive semidefinite functionals. Therefore, using this property and the fact that the constraints of the above optimization problem are affine, any local minimum is also a global one, and algorithms exist which, given a fixed tolerance, find the solution in a finite number of steps [9]. Furthermore, if the kernel is strictly positive definite, which is always the case except for pathological situations, the solution is also unique. These properties overcome many typical drawbacks of traditional neural-network approaches, such as the determination of a suitable minimum, the choice of the starting point, the optimal stopping criteria, etc.
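As an illustration of the notation above, here is a minimal Python sketch (not the authors' code; sigma and all function names are our choices) that builds the matrix Q of (6) for the Gaussian kernel of (10) and evaluates the classifier of (1) in its dual form (7)-(8):

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=0.5):
    # Gaussian kernel of (10); sigma = 0.5 is just an example value.
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2.0 * sigma ** 2))

def build_Q(X, y, kernel=gaussian_kernel):
    # q_ij = y_i y_j K(x_i, x_j), cf. (6) and (8).
    m = len(y)
    Q = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            Q[i, j] = y[i] * y[j] * kernel(X[i], X[j])
    return Q

def decide(x, X, y, alpha, b, kernel=gaussian_kernel):
    # f(x) = sign(sum_i alpha_i y_i K(x_i, x) + b), cf. (1) with (7)-(8).
    s = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X))
    return np.sign(s + b)
```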

As a final remark, note that the threshold b does not appear in the dual formulation, but it can be found by using the Karush–Kuhn–Tucker (KKT) conditions at optimality. Let V = {i : 0 < α_i < C} be the set of true support vectors; then the following equality holds:

    y_i ( Σ_{j=1}^{m} α_j y_j K(x_j, x_i) + b ) = 1,  ∀ i ∈ V    (11)

thus, b can be found by

    b = (1/|V|) Σ_{i∈V} ( y_i − Σ_{j=1}^{m} α_j y_j K(x_j, x_i) ).    (12)

Actually, this approach is correct as long as V is not empty; otherwise, one can use a more robust method, suggested in [10] (see also [9]), where the threshold is computed using the KKT conditions, but without resorting to support vectors.
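A direct software rendering of (11)-(12) reads as follows (our sketch, reusing gaussian_kernel from the listing above; the tolerance tol used to detect 0 < α_i < C is an assumption):

```python
def threshold_from_kkt(X, y, alpha, C, kernel=gaussian_kernel, tol=1e-8):
    # b from (12): average over the true support vectors V = {i : 0 < alpha_i < C}.
    m = len(y)
    V = [i for i in range(m) if tol < alpha[i] < C - tol]
    if not V:
        # V may be empty: fall back to a KKT-based method as in [10] (not shown).
        raise ValueError("no true support vectors")
    return sum(
        y[i] - sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(m))
        for i in V
    ) / len(V)
```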

III. HARDWARE IMPLEMENTATION OF SVMs

A. New VLSI-Friendly Algorithm for SVM Learning

Optimization problems like the dual CQP are well known to the scientific community, as they are usually faced when solving several real-world tasks. As a consequence, they have been deeply studied by researchers, and several methods have been proposed for their resolution. Among others, methods that can be easily implemented in hardware are particularly appealing: we refer to them as VLSI-friendly algorithms. The leading idea of these methods is to map the problem on a dynamical system described by a differential equation

    dx(t)/dt = Φ(x(t))    (13)

with x(0) = x_0, and whose stable point, for t → ∞, coincides with the solution of the optimization problem.

Equation (13) can be seen as a recurrent neural network and, from an electronic point of view, can be implemented on analog hardware with simple electronic devices [14], [15].

A digital architecture can be targeted in a similar way, by defining a recurrent relation of the form

    x^(k+1) = Ψ(x^(k)).    (14)

A simple way to obtain (14) from (13) is to use Euler's method for integration, obtaining

    x^(k+1) = x^(k) + η Φ(x^(k))    (15)

where η is the integration step. Unfortunately, finding a suitable η, which guarantees the convergence of x^(k) toward the solution when k → ∞, is not a trivial task.

Recently, a useful convergence result has been presented for a network that solves a CQP with inequality constraints [16]. As shown in [17], this algorithm (digital SVM, or DSVM) can be applied effectively to SVM learning when a Gaussian kernel is chosen. The underlying idea is to exploit the fact that a Gaussian kernel maps the data to an infinite-dimensional feature space, so that the effect of removing one of the parameters of the SVM from the learning process can be negligible. In particular, if we let b = 0, we force the separating hyperplane to pass through the origin of the feature space, and the equality constraint disappears from the dual formulation of the CQP. Then, the DSVM can be used for solving the resulting problem

    min_α (1/2) α^T Q α − α^T 1   subject to   0 ≤ α ≤ C1.    (16)

Fig. 1. SVMblock.

The core of DSVM is very simple:

    β^(k+1) = α^(k) + η (1 − Q α^(k))    (17)
    α^(k+1) = min(C1, max(0, β^(k+1)))    (18)

where η is the learning step. The convergence of the above recurrent relation is guaranteed, provided that 0 < η < 2/λ_max(Q), where λ_max(Q) is the largest eigenvalue of Q [16].
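In software, (17)-(18) amount to a projected-gradient iteration on the box [0, C]^m; the sketch below is ours (it is not the listing of Table I) and picks the conservative step η = 1/λ_max(Q), which lies inside the convergence range:

```python
import numpy as np

def dsvm_fixed_bias(Q, C, r=None, eta=None, max_iter=100_000):
    # Solves min (1/2) a'Qa - a'r  s.t.  0 <= a <= C; r = 1 recovers (16).
    m = Q.shape[0]
    r = np.ones(m) if r is None else r
    if eta is None:
        eta = 1.0 / np.linalg.eigvalsh(Q).max()  # conservative step size
    alpha = np.zeros(m)
    for _ in range(max_iter):
        new = np.clip(alpha + eta * (r - Q @ alpha), 0.0, C)  # (17), then (18)
        if np.array_equal(new, alpha):  # exact-equality test, as in the hardware
            break
        alpha = new
    return alpha
```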

Even if several experiments on real-world datasets have demonstrated the effectiveness of this method, it is greatly penalized by two facts: the generalization of an SVM with a constant threshold does not fit the usual theoretical framework, and the method is not easily applicable to other kernels (e.g., the linear one).

Here, we suggest a new approach, which allows the use of the DSVM algorithm and the computation of the threshold as well. This result is based on a recent work on parametric optimization [18]: the main idea is to design a learning algorithm composed of two parts, working iteratively. The first part addresses the resolution of a CQP with b fixed, whereas the second one implements a procedure for updating the value of b itself, in order to reach iteratively the optimal threshold value.

Fig. 2. counters-block.

If we consider the problem of SVM learning, it is easy to deduce that, whenever the threshold b is considered as an a priori known parameter, the dual formulation becomes

    min_α (1/2) α^T Q α − α^T (1 − b y)   subject to   0 ≤ α ≤ C1    (19)

where y is the vector of the targets.

Fig. 3. dsvm-block.

Such a problem can be solved by a slightly modified version of DSVM, listed in Table I. Given the input kernel matrix Q, a threshold b, and a starting value of α, the algorithm solves the CQP of (19), providing an intermediate solution.

Fig. 4. Generic PE.

As our aim is to solve the CQP with the equality constraint, it is possible to deduce, from the value of S = Σ_i y_i α_i, the range of variation of b. In fact, let b* be the optimal value of the threshold to be found and b a tentative value at a given step: if S > 0 then b < b*; otherwise, if S < 0, then b > b*. In other words, S is a nonlinear function of b that crosses zero only once, when b = b*; therefore, b* lies in a range [b_lo, b_hi] for which S(b_lo) > 0 and S(b_hi) < 0 [18]. Consequently, a simple bisection procedure can be derived for finding b*, that is, the only point where S = 0. Our proposal, called FIxed b Svm (Fibs), is listed in Table II: it has been designed with the goal of an implementation on a digital architecture but, as will be clear subsequently, it can be implemented on a general-purpose floating-point platform as well.

Its functionality is based on the search of a range of values [b_lo, b_hi] to which the threshold belongs. Then, at each step, it proceeds according to a simple bisection process, by finding a tentative value b = (b_lo + b_hi)/2 and by updating b_lo and b_hi on the basis of the sign of S. It terminates when the range becomes smaller than a given tolerance. Note that, when the algorithm starts, both b_lo and b_hi are unknown; therefore, a first search of the feasible range must be performed. We found experimentally that the simple starting choices listed in Table II are effective.
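The whole procedure can then be sketched as follows (our code, building on dsvm_fixed_bias above; the starting bracket [−1, 1] and its doubling rule are assumptions, since the actual initial choices are those listed in Table II; y is a NumPy array of ±1 targets):

```python
import numpy as np

def fibs(Q, y, C, eps=1e-3):
    # Bisection on b, driven by S(b) = y' alpha(b): S decreases with b and
    # crosses zero only at the optimal threshold b*.
    def S(b):
        alpha = dsvm_fixed_bias(Q, C, r=1.0 - b * y)  # solve (19) with b fixed
        return float(y @ alpha), alpha
    b_lo, b_hi = -1.0, 1.0           # assumed starting bracket
    while S(b_lo)[0] <= 0.0:
        b_lo *= 2.0                   # expand until S(b_lo) > 0
    while S(b_hi)[0] >= 0.0:
        b_hi *= 2.0                   # expand until S(b_hi) < 0
    while b_hi - b_lo > eps:
        b = 0.5 * (b_lo + b_hi)
        s, _ = S(b)
        if s == 0.0:
            break                     # S = 0 is itself a stopping condition
        if s > 0.0:
            b_lo = b
        else:
            b_hi = b
    b = 0.5 * (b_lo + b_hi)
    return S(b)[1], b
```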

B. Quantization Effects in SVMs

To the best of our knowledge, no theoretical analysis of the quantization effects on the SVM parameters has appeared so far in the literature. For this reason, we quantify here some of these effects in order to show that the feedforward phase of the SVM can be safely implemented in digital hardware.

We perform a worst-case analysis based on the properties of interval arithmetic [11], [13]. Other techniques could be used as well (e.g., statistical methods [12]), but we believe that worst-case results, when applicable, are of more interest when targeting digital architectures, because they provide guaranteed bounds on the performance of the implementation.

Fig. 5. bias-block.

The effect of quantization can be described by transforming each variable involved in the computation of the feedforward phase into an interval

    α_i → [α_i − Δα, α_i + Δα]    (20)
    b → [b − Δb, b + Δb]    (21)
    K(x_i, x) → [K(x_i, x) − ΔK, K(x_i, x) + ΔK]    (22)

where Δα, Δb, and ΔK are no less than half of the quantization step. Note that, to simplify the notation, the dependence of the quantization steps on the indexes is omitted.

Fig. 6. s-block.

Let F(x) = Σ_{i=1}^{m} α_i y_i K(x_i, x) + b denote the argument of the classification function, and let F̃(x) be its quantized counterpart. Using the properties of interval arithmetic [13] and the fact that α_i ≤ C, it is possible to derive upper bounds on the quantization effects

    |α̃_i K̃(x_i, x) − α_i K(x_i, x)| ≤ α_i ΔK + K(x_i, x) Δα + Δα ΔK    (23)
    ≤ α_i ΔK + Δα + Δα ΔK    (24)
    Σ_{i: y_i = +1} α_i ≤ min(m_+, m_−) C    (25)
    Σ_{i: y_i = −1} α_i ≤ min(m_+, m_−) C    (26)
    |F̃(x) − F(x)| ≤ 2 min(m_+, m_−) C ΔK + m Δα (1 + ΔK) + Δb    (27)

where m_+ (m_−) is the number of patterns with positive (negative) target and m = m_+ + m_−; (25) and (26) follow from the equality constraint of (5). In the above derivation, we have assumed, for simplicity, that 0 ≤ K(·,·) ≤ 1, as in Gaussian kernels, but a similar result can be derived for other kernels as well.

TABLE III. MAIN CHARACTERISTICS OF THE FSMs

Then, the bound on the quantization effect is given by

    ΔF = 2 min(m_+, m_−) C ΔK + m Δα (1 + ΔK) + Δb.    (28)

The above equation can be used to analyze the contribution of each parameter of the SVM to the quantization error. As an example of this analysis, let us assume that each parameter is quantized with a similar step Δ = Δα = Δb = ΔK. According to SVM theory, any pattern lying outside the margin gives an output |F(x)| ≥ 1; therefore, a pattern can be misclassified only if the total quantization error is greater than one and of opposite sign with respect to the correct target value. In other words, using (28), a necessary condition for a misclassification error is given by

    Δ [ 2 min(m_+, m_−) C + m (1 + Δ) + 1 ] ≥ 1.    (29)

If the two classes are balanced (m_+ = m_− = m/2), the above equation simplifies to an intuitive relation between the maximum admissible quantization step and some parameters of the SVM

    Δ [ m (C + 1 + Δ) + 1 ] ≥ 1.    (30)

Fig. 7. Misclassification error due to quantization effects.

Equation (30) shows that the quantization step cannot be greater than a quantity that depends on: 1) the number of patterns and 2) the size of the alphas. This is an intuitive result: in fact, as the number of patterns grows, the summation over them can suffer from the accumulation of quantization errors and, at the same time, larger values of the parameters can heavily affect the classification output.
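Under this reading of (29)-(30), the largest admissible common step solves a simple quadratic; the sketch below (our code) computes it together with the corresponding number of fractional bits. The value C = 10 in the usage note is a placeholder, not the paper's Sonar setting:

```python
import math

def max_common_step(m, C):
    # Largest Delta with Delta * (m*(C + 1 + Delta) + 1) = 1, i.e., the
    # positive root of m*D^2 + (m*(C + 1) + 1)*D - 1 = 0, cf. (30).
    a = m
    b = m * (C + 1.0) + 1.0
    return (-b + math.sqrt(b * b + 4.0 * a)) / (2.0 * a)

def fractional_bits(m, C):
    # Fractional bits n such that the step 2**-n does not exceed the bound.
    return math.ceil(-math.log2(max_common_step(m, C)))

# Example: fractional_bits(104, 10) -> 11 bits for a 104-pattern training set.
```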

IV. DIGITAL ARCHITECTURE

In Section III, we detailed the Fibs algorithm and showed that it can be applied to solve an SVM-based classification problem with any kind of kernel function. In this section, we describe in detail an architecture that implements it, following a top-down approach. At first, we show the main signals and blocks that compose the design; then we detail each one of them. The architecture described in this paper is just a first study on this topic, useful to understand its main properties, such as the number of clock cycles needed to reach a feasible solution, the clock frequency, the device utilization, etc. With this aim, we focus our attention on the design of a prototype of an RBF-SVM, but the same design can be easily extended to different kinds of SVMs.

A. Main Blocks and Signals

The input–output interface of our design is represented in Fig. 1. It is characterized by two data lines, one used for input data and one used for the output, of 16 bits each. Two further lines act as handshake signals for the I/O. A start signal begins a learning session, while the ready signal reports both its termination and the fact that the entity can start to submit the parameters α and b to the output. Finally, three additional lines provide the enable, the asynchronous reset, and the clock signals, respectively.

The functionality of the SVMblock can be subdivided into three basic phases.

1) Loading Phase: When the SVMblock receives a transition on the start line, it begins the loading of the target vector y and of the kernel matrix Q. Actually, without loss of generality, we suppose that the SVMblock receives and stores in its memory directly the negated entries of Q. The values must be delivered by scanning the matrix row by row (or column by column, thanks to the symmetry properties of Q), following a typical handshake protocol. The activation of the asynchronous reset permits one to clean up all the registers and to start a new learning session from the initial value α = 0; otherwise, the final value of the previous learning session is assumed as the new starting point.

2) Learning Phase: As soon as the loading is completed, the SVMblock starts the learning phase, according to the Fibs algorithm detailed in Section III; once it terminates, the ready signal is activated on the output, and the block can begin the output phase.

3) Output Phase: The SVMblock submits the results of the learning, that is, the values of α and b, to the output, using the same communication protocol as the loading phase.

These logical phases are implemented by the general architecture depicted in the block scheme of Fig. 1. It is mainly composed of four computing blocks, namely the dsvm-, counters-, bias-, and s-blocks, and of three controllers, one for each of the loading, learning, and output phases. Whereas all the signals to/from the controllers are connected through a tristate-based connection, data travel on the data lines, while the index information generated by the counters addresses each element of the kernel matrix, as detailed in the following.

B. Structure of Each Block

In this section, we describe the function of each basic block of Fig. 1, each of which implements a specific function of algorithm Fibs: the dsvm-block contains both the memory to store matrix Q and all the digital components needed to implement the algorithm of Table I; the counters-block provides all the counters and indexes needed to select both the entries of the kernel matrix and each alpha value during the flow of the algorithm; the bias-block contains registers and combinatorial logic for the control of the values b_lo, b_hi, and b. Finally, the s-block contains a register where the target vector y is stored, together with the logic components for the computation of the equality constraint, namely the value of S in Table II. As a final remark, let us note that all the connections among the blocks are realized via tristate connections.

Fig. 8. Distribution of data for Models 1a (a) and 1b (b) [D = 2; σ = 0.2; m = 500 (a); m = 32 (b)].

1) Counters-Block: The structure of the counters-block is represented in Fig. 2. It mainly consists of two counters: a column counter, used to index a column of the kernel matrix, and a row counter, used to index a row of the kernel matrix. The row counter is also used to select a particular RAM inside the dsvm-block during loading, and to select a particular value of alpha, both during the computation of S and during the output phase; its output is directly connected to the dsvm-block. The row counter is a simple up-counter, while the column counter counts both up and down, as detailed in Section IV-B2.

Fig. 9. Distribution of data for Models 2a (a) and 2b (b) [D = 0; σ = 0.2; m = 500 (a); m = 32 (b)].

2) dsvm-Block: We implemented the DSVM algorithm on a one-dimensional systolic array composed of m basic processing elements (PEs), each of which has the task of updating one element of the vector α. As previously indicated, we suppose that matrix Q is precomputed and stored in a set of m RAMs, each containing one row of Q, as shown in Fig. 3; one word length is used to code each element of Q, while a second word length is used to code each α_i. Fig. 4 shows the structure of a generic PE.

At a given time step k, each PE holds one component of α^(k) in an internal register, and its role is to compute the corresponding component of α^(k+1); the contribution of the threshold is known a priori at each PE, when the dsvm-block is enabled to operate, being deduced from the value of b, obtained from the bias-block, and from the value of the target, provided by the s-block.

As a first step, each PE computes and stores the product between its locally stored kernel coefficient and the alpha value it currently holds; this alpha value is then delivered to the next PE of the ring, whereas a new one is received from the previous PE. As a second step, the PE computes and accumulates the product involving the newly received alpha value, and so on. Finally, after m steps, the final value of each component of Qα^(k) is ready at the output of each PE.

TABLE IV. FLOATING- AND FIXED-POINT EXPERIMENTS FOR SONAR (LINEAR KERNEL)

TABLE V. FLOATING- AND FIXED-POINT EXPERIMENTS FOR MODEL 1A (σ = 0.5, C = 0.05)

The storing of Q inside the dsvm-block is as follows: each RAM contains the corresponding row of Q, arranged so that the coefficient needed at the first step is stored at location 0, the one needed at the second step at location 1, and so on. In this way, each PE has the proper coefficient ready for the computation each time a new alpha value arrives.

During learning, each RAM is indexed by the address given by the column counter, which counts up; during loading, instead, the same counter, used to store one row at a time, counts down, beginning its count from an initial value provided by the external counter (see Fig. 2). Once each component of Qα^(k) is ready, since we implement an RBF-SVM here and, for such a machine, η can be chosen as a negative power of two, the multiplication by η reduces to a shift of a fixed number of positions. To perform the shift, we directly connect the most significant wires of each register to the less significant wires of the corresponding adder, thus avoiding, in practice, an actual shift. The set of multiplexers following the adder acts as a limiter, in order to constrain each alpha to lie inside the box [0, C] and to compute α_i^(k+1), which is stored in the corresponding register. As soon as the new value is computed and stored, its previous value is saved in a further register: this additional set of registers is useful to verify the termination of the algorithm, signaled by the activation of a dedicated line. In particular, each comparator is a simple binary element providing "1" when its inputs are equal and "0" otherwise; finally, an AND-block combines the outputs of all the comparators.

TABLE VI. FLOATING- AND FIXED-POINT EXPERIMENTS FOR MODEL 1B (σ = 0.5, C = 0.09)

TABLE VII. FLOATING- AND FIXED-POINT EXPERIMENTS FOR MODEL 2A (σ = 0.5, C = 0.02)

If a further iteration must be executed (i.e., α^(k+1) ≠ α^(k)), another cycle begins after the storing of each new alpha value in the corresponding register of each PE.
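A behavioral model of the ring (our sketch, in floating point rather than fixed point) can be used to check that the circulation order indeed reproduces Qα:

```python
import numpy as np

def systolic_matvec(Q, alpha):
    # PE i starts from its own alpha_i and, at each step, passes the value it
    # holds to the next PE of the ring; after m steps every accumulator holds
    # one component of Q @ alpha, matching the storage order of the RAMs.
    m = len(alpha)
    held = list(alpha)                # value currently inside each PE
    acc = np.zeros(m)
    for step in range(m):
        for i in range(m):
            acc[i] += Q[i, (i - step) % m] * held[i]
        held = held[-1:] + held[:-1]  # rotate the ring by one position
    return acc                        # equals Q @ alpha
```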

3) Bias-Block: The bias-block (see Fig. 5) is simply composed of three left-shift registers, containing b_lo, b_hi, and b, an adder, two 2-to-1 multiplexers, a 3-to-1 multiplexer, and a binary comparator, which detects the termination of the learning phase whenever the range [b_lo, b_hi] becomes smaller than the given tolerance or S = 0. Note that, as S = 0 is a feasible condition for stopping the algorithm, the bias-block also receives from the s-block a signal which is equal to "1" when S = 0 and "0" otherwise. Finally, the bisection step b = (b_lo + b_hi)/2 is simply implemented by leaving the LSB of the adder unconnected to the input of the multiplexer (such a connection is represented with a dashed line in Fig. 5), thus avoiding, as in the case of the multiplication by η discussed in Sections IV-B1 and IV-B2, an actual right shift.
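In software terms, the dropped-LSB trick is simply an arithmetic right shift of the adder output, e.g.:

```python
def bisection_midpoint(b_lo: int, b_hi: int) -> int:
    # (b_lo + b_hi) / 2 on two's-complement words: dividing by two costs no
    # hardware, since it is the adder output with its LSB left unconnected.
    return (b_lo + b_hi) >> 1
```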

4) s-Block: The role of the s-block (see Fig. 6) is the computation of the sign of the quantity S = y^T α. It is composed of a serial-input serial-output circular shift register (SISO), containing the target vector y, a "2's complement" block, a 2-to-1 multiplexer, an adder, and a register acting as an accumulator. The sign of S is simply the MSB of the accumulator. The s-block works during the loading phase, when the labels y_i are stored in the SISO, and during the learning phase, after the value α^(k+1) is generated by the dsvm-block. The connection with the dsvm-block is as follows: the serial input of the SISO, read only when the input shldSISO is active high (load), receives each label as it is loaded, while the alpha values coming from the registers of the PEs are accumulated one at a time; in practice, the accumulator holds, at a given time during learning, the partial sum of the values ±α_i selected so far.

5) Controllers: As previously indicated, the control part is composed of three finite-state machines (FSMs), each one having the role of controlling a given phase of the Fibs algorithm: loading, learning, and output. Actually, the learning controller is subdivided into three different modules: a main module, a module supervising the dsvm-block, and a module supervising the bias-block. The main module is the actual supervisor of the architecture and manages all the transitions for the correct flow of algorithm Fibs. In particular, it enables the first submodule to control the functionality of the dsvm-block during a part of the learning and, after the intermediate solution is obtained, enables the submodule which supervises the bias-block. The main module receives all the control signals and decides the correct action according to their values, following the algorithm given in Table II. Table III reports the main characteristics of the FSMs, such as the number of states, transitions, and total inputs/outputs. The choice of subdividing the controller of the system into five different submodules is justified by the fact that, for synthesis purposes, it is better to have small modules working separately, as confirmed by the results discussed in Section V-D.

V. EXPERIMENTAL RESULTS

In order to test the proposed algorithm and the corresponding digital architecture, we chose a well-known benchmarking dataset (Sonar) and several datasets from a telecommunication problem, recently used for the application of SVMs to channel equalization purposes. In Section V-A, we also show how the theoretical results on the quantization effects (Section III-B) relate to the actual misclassification of the SVM on the Sonar dataset. The second problem, described in Section V-B, is particularly appealing because it is a typical case where dedicated hardware can be of great usefulness. The floating- and the fixed-point experiments are discussed in Section V-C; furthermore, in that section we report the comparison of our approach with the well-known SMO algorithm [9], [19], [20] for SVM learning. Finally, in Section V-D, we discuss the implementation on an FPGA.

TABLE VIII. FLOATING- AND FIXED-POINT EXPERIMENTS FOR MODEL 2B (σ = 0.5, C = 0.08)

A. Sonar Dataset

The Sonar dataset is a well-known linearly separable problem, extensively used for benchmarking learning algorithms [21]: its popularity is mainly due to the difficulty of the classification task. As the computation of the threshold is fundamental for the determination of the separating surface, this dataset is particularly suitable for testing the algorithm proposed in this paper. The Sonar dataset is composed of 208 samples of 60 features each, usually subdivided into 104 training and 104 test patterns. It is known that a linear classifier misclassifies 23 test patterns, while an RBF-SVM misclassifies six test patterns if the threshold is used, and eight test patterns otherwise [17].

The Sonar dataset has been used for testing the quality of the quantization-error bounds found in Section III-B. In particular, we solved the problem using a Gaussian kernel and obtained the corresponding set of support vectors. Then, a random perturbation of size Δ has been applied to each parameter of the network, and the average, minimum, and maximum misclassification errors have been registered over many different trials.


TABLE IX. HDL SYNTHESIS REPORT FOR m = 8 AND m = 32

Fig. 10. Convergence of b for (a) m = 8.

The results are shown in Fig. 7, as a function of Δ: the worst-case bound given by (29), as expected, is quite conservative and suggests that approximately ten bits are necessary for avoiding any misclassification error, while the experimental value is eight. However, it is worth noting that these trials are an infinitesimal amount with respect to an exhaustive search for misclassification errors, yet they already require many CPU hours on a conventional PC. The actual number of ways of adding ±Δ to the parameters of the SVM grows exponentially with their number; therefore, an exhaustive search is impossible to perform in practice, and this approach could never provide enough confidence in the result. The worst-case bound, instead, provides a safe value, which can be easily computed.

B. Channel Equalization Problem

The channel equalization problem is a typical application where a special-purpose device can be effectively used on the receiver side, in order to estimate which of two symbols, belonging to an independent sequence emitted from a given source, has been transmitted. All the unknown nonlinear effects of the involved components (transmitter, channel, and receiver) are modeled

as finite-impulse response (FIR) filters, plus Gaussian distributed noise n_k with zero mean and variance σ²

    x_k = F( Σ_i h_i s_{k−i} ) + n_k    (31)

where the function F(·) collects the nonlinear effects of the chain and h is the overall impulse response. The classical theory tackles this problem by finding an optimal classifier (the Bayesian maximum-likelihood detector), which provides an estimate of the transmitted symbol through the observation of an n-dimensional vector of received samples. Whereas these methods require the knowledge of the symbol probabilities and of the analytic structure of the model, neural-network-based approaches have been successfully applied [22], [23] to systems where such information is not known.

Fig. 10. (Continued.) (b) Convergence of b for m = 32; (c) percentage of test errors during learning for m = 32.


TABLE X. POST-PLACE-AND-ROUTE SYNTHESIS REPORT FOR DIFFERENT NUMBERS OF INPUT SAMPLES AND DIFFERENT OPTIMIZATION CRITERIA

In practice, a classifier is selected on the basis of m previous samples and has the following structure:

    f(x_k) = sign( Σ_{i=1}^{m} α_i y_i K(x_i, x_k) + b ).    (32)

Here, we consider the following two different nonlinear models of the channel, which substantially differ in the decision delay (see Figs. 8 and 9):

• Model 1: decision delay D = 2.
• Model 2: decision delay D = 0.

Furthermore, as our aim is to test the behavior of the algorithm and, above all, to study its actual hardware implementation, we consider different numbers of training patterns. In particular, we choose m = 500, as in [23], and m = 32: this last choice guarantees a good tradeoff between the final generalization ability of the learned model and its device utilization. We call Model 1a, 1b, 2a, and 2b the corresponding distributions. Finally, in order to estimate the generalization error of the obtained SVM, we use a separate test set for each model. With reference to (31), we consider a specific nonlinear channel with ISI equal to 2 (33). We choose a noise level σ = 0.2. Figs. 8 and 9 show the distribution of data obtained by (33) with the given parameters.
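Since (33) is not reproduced above, reproducing these datasets requires assumptions; the sketch below (our code) only mirrors the overall structure of (31)-(32): an assumed FIR response h, an optional placeholder nonlinearity, Gaussian noise with σ = 0.2, and a decision delay D as in Models 1 and 2:

```python
import numpy as np

def make_channel_data(n_samples, h=(1.0, 0.5), sigma=0.2, D=2, n_taps=2,
                      nonlin=lambda v: v, seed=0):
    # s: i.i.d. symbols in {-1, +1}; x: FIR-filtered, distorted, noisy output.
    rng = np.random.default_rng(seed)
    s = rng.choice([-1.0, 1.0], size=n_samples + n_taps + D + len(h))
    x = nonlin(np.convolve(s, h, mode="full")[: len(s)])  # FIR channel, cf. (31)
    x = x + sigma * rng.standard_normal(len(x))           # additive noise
    off = D                                               # room for the delay
    X = np.stack([x[k + off : k + off + n_taps] for k in range(n_samples)])
    y = s[np.arange(n_samples) + off + n_taps - 1 - D]    # delayed target symbol
    return X, y
```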

C. Floating- and Fixed-Point Experiments

When designing a digital architecture, one of the most important aspects that must be considered is the length of the words that represent the information inside the design. This parameter has a crucial role because it influences both the length of the registers and, as a consequence, the device utilization and the performance of the digital learning system.

When we faced the design of Fibs and, in particular, of the corresponding architecture, we needed to understand: 1) its behavior when using floating-point math with respect to standard SVM learning algorithms, like SMO, and 2) the required number of bits. To obtain these answers, we designed several experiments, both on the Sonar dataset, using a linear kernel, and on Models 1a, 1b, 2a, and 2b.

The results of our first experiment, on the Sonar dataset, are reported in Table IV. In this table we report, for every architecture, the number of support vectors (nsv), the value of the threshold b, the value of the equality constraint (eq.), the number of training errors (TR), and, finally, the number of test errors (TS). As expected, the floating-point (FP) version of our algorithm obtains good results with respect to the SMO algorithm (the values of C and of the stopping tolerance were kept fixed across the compared methods).

Table IV shows both floating- and fixed-point results for different register lengths. The results of the fixed-point experiments are reported in the second part of the table, where the notation indicates the number of bits used to code each element of the kernel matrix, together with the number of bits used to code the integer part and the fractional part of each α_i and of b.

From the observation of the results, three main properties emerge: 1) Fibs requires a relatively low number of bits, especially to code the kernel matrix; 2) the accuracy obtained on the equality constraint is not very critical; but, above all, 3) the quantization of the kernel matrix can be a benefit for the generalization capability. These results are also confirmed by the experiments on the models generated from the telecommunication problem, as reported in Tables V–VIII. In particular, the values reported in the tables clearly indicate that very few bits can be used to code each kernel entry, and that a coding configuration which outperforms the FP solution can often be found. Note that we used an RBF-SVM with σ = 0.5 for all models and C = 0.05 (Model 1a), C = 0.09 (Model 1b), C = 0.02 (Model 2a), and C = 0.08 (Model 2b). To complete our analysis, we compared our results with the ones reported in [23]. Our algorithm outperforms the results reported there, obtained with polynomial kernels, even when using only 32 samples. In fact, whereas [23] reaches a test error of 4.2% on Model 1, we reached 3.6% with Model 1a and 4.2% with Model 1b. Similar results are obtained with Model 2: case 2a improves on [23], with a test error of 16%; case 2b, instead, is worse, as we measured a test error of 23.5%.
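The fixed-point codings explored in Tables IV-VIII can be emulated in software with a round-to-nearest, saturating quantizer (our sketch; the bit widths are free parameters, not the paper's exact codings):

```python
import numpy as np

def quantize(x, frac_bits, int_bits=0):
    # Two's-complement fixed point with int_bits integer and frac_bits
    # fractional bits: round to the nearest step, then saturate.
    step = 2.0 ** -frac_bits
    lo, hi = -(2.0 ** int_bits), 2.0 ** int_bits - step
    return np.clip(np.round(np.asarray(x) / step) * step, lo, hi)

# Example: quantize the kernel matrix before learning, Qq = quantize(Q, 8).
```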

D. FPGA Implementation, Functional Simulations, and Synthesis Analysis

The digital architectures described here have been implemented and tested on one of the most powerful Xilinx FPGAs, the Virtex-II, particularly suited to embed high-performance digital signal processors [24]. We chose, as target device, the XC2V8000, characterized by 8M system gates, an array of 112 × 104 configurable logic blocks (CLBs) providing 46 592 slices (1 CLB = 4 slices), and a maximum of 1456 Kbit of distributed RAM. Furthermore, it provides 168 18 × 18 multiplier blocks and 168 18-Kbit select-RAM blocks, for a total of 3024 Kbit of RAM. Note that this device is very large for current standards and, as detailed in the rest of this paper, only a small amount of its hardware is actually used by our architecture.

In order to study the main properties of our design, such as the device utilization, we performed several experiments, at first by choosing a small number of patterns (m = 8) and then a more realistic size, that is, m = 32, discussed also, from the generalization point of view, in Section V-C. By using a VHDL description, we could parameterize our design and change the size from m = 8 to m = 32 without any particular effort, thus allowing an efficient study of the implementation properties for different training set sizes.

Table IX shows the synthesis report, that is, the number of instantiated components for each block and for the whole design.

The functional simulations are summarized in Fig. 10. In particular, it shows the convergence of the threshold b toward the optimum for different numbers of clock cycles, in both the m = 8 (a) and m = 32 (b) cases. Each learning phase terminates after 14 000 (a) and 140 000 (b) cycles, while each loading phase terminates after 290 (a) and 4226 (b) cycles. Fig. 10(a) and (b) indicates that a feasible b can be reached quite early during learning: this suggests that an acceptable classification rate can be obtained well before the end of the learning process. In order to validate this assertion, we measured the percentage of test errors during learning. Fig. 10(c) shows the value of the test error at different clock cycles for m = 32. As one can easily verify, after only 90 000 clock cycles (50 000 before the termination of the learning), the obtained performances are quite stable around the value obtained at the end of the algorithm.

The Xilinx synthesis tool reports an indicative clock frequency, whose final estimated value is known only after the place-and-route procedure, which physically assigns the resources of the device, such as slices and the connections among them. Table X lists the characteristics of the architecture after the place-and-route phase and shows the device utilization, expressed in number of slices and number of lookup tables, and the clock frequency, for different architectures and optimization criteria. As expected, our approach is particularly efficient from the device-utilization point of view. This is a remarkable property because several other modules, such as CPU cores, can be designed on the same device, thus allowing the building of stand-alone intelligent systems on chip.

From the observation of Table X, an interesting and unusual behavior emerges. In fact, in the case m = 32, using speed as the optimization criterion, the synthesis tool provides better performances than in the area case, both in clock frequency and in device utilization. An explanation of this could derive from the fact that, when changing the optimization criterion, different synthesis and place-and-route algorithms are applied.

VI. CONCLUSION

In this paper, we have discussed the implementation of an SVM-based learning approach on a digital architecture and a case study of its realization on an FPGA device. To the best of our knowledge, this is the first work on this subject, and several issues need to be studied in more detail. First of all, the obtained performances could be greatly improved. In particular, the whole architecture appears to be quite independent of the target device; therefore, it is likely that slightly modified versions exist which originate mapped circuits working at higher clock frequencies.

The main point, however, is that an efficient digital architecture can be adopted to solve real-world problems with SVMs, where specialized hardware for online learning is of paramount importance, and it can be easily embedded as a part of a larger system on chip (SoC).

Future work will address this issue, in the spirit of previous neural processors [25], with the integration on the same chip of a general-purpose processing unit and a learning coprocessor module.

ACKNOWLEDGMENT

The authors wish to thank the two anonymous reviewers for helping to improve both the structure and the content of the paper and for pointing out some imprecisions in the first draft of this work.

REFERENCES

[1] C. S. Lindsey, Neural Networks in Hardware: Architectures, Products and Applications. Stockholm, Sweden: Roy. Inst. Technol. On-Line Lectures, 1998.

[2] C. Mead, Analog VLSI and Neural Systems. Reading, MA: Addison-Wesley, 1989.

[3] R. Sarpeshkar, "Analog vs. digital: Extrapolations from electronics to neurobiology," Neural Comput., vol. 10, pp. 1601–1638, 1998.

[4] A. R. Omondi, "Neurocomputers: A dead end?," Int. J. Neural Syst., vol. 10, no. 6, pp. 475–482, 2000.

[5] U. Rückert, "ULSI architectures for artificial neural networks," IEEE Micro Mag., pp. 10–19, May–June 2002.

[6] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.

[7] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. Cambridge, U.K.: Cambridge Univ. Press, 2000.

[8] B. Schölkopf and A. Smola, Learning With Kernels. Cambridge, MA: MIT Press, 2002.

[9] C.-J. Lin, "Asymptotic convergence of an SMO algorithm without any assumption," IEEE Trans. Neural Networks, vol. 13, pp. 248–250, Jan. 2002.

[10] P. Laskov, "Feasible direction decomposition algorithms for training support vector machines," Machine Learning, vol. 46, pp. 315–350, 2002.

[11] D. Anguita, S. Ridella, and S. Rovetta, "Worst case analysis of weight inaccuracy effects in multilayer perceptrons," IEEE Trans. Neural Networks, vol. 10, pp. 415–418, Mar. 1999.

[12] Y. Xie and M. A. Jabri, "Analysis of the effects of quantization in multilayer neural networks using a statistical model," IEEE Trans. Neural Networks, vol. 3, pp. 334–338, Mar. 1992.

[13] G. Alefeld and J. Herzberger, Introduction to Interval Computation. Reading, MA: Addison-Wesley, 1986.

[14] D. Anguita, S. Ridella, and S. Rovetta, "Circuital implementation of support vector machines," Electron. Lett., vol. 34, no. 16, pp. 1596–1597, 1998.

[15] D. Anguita and A. Boni, "Improved neural network for SVM learning," IEEE Trans. Neural Networks, vol. 13, pp. 1243–1244, Sept. 2002.

[16] M. J. Perez-Ilzarbe, "Convergence analysis of a discrete-time recurrent neural network to perform quadratic real optimization with bound constraints," IEEE Trans. Neural Networks, vol. 9, pp. 1344–1351, Nov. 1998.

[17] D. Anguita, A. Boni, and S. Ridella, "Learning algorithm for nonlinear support vector machines suited for digital VLSI," Electron. Lett., vol. 35, no. 16, 1999.

[18] S. Fine and K. Scheinberg, "Incremental learning and selective sampling via parametric optimization framework for SVM," in Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds. Cambridge, MA: MIT Press, 2002.

[19] J. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, Eds. Cambridge, MA: MIT Press, 1999.

[20] S. S. Keerthi and E. G. Gilbert, "Convergence of a generalized SMO algorithm for SVM classifier design," Machine Learning, vol. 46, pp. 351–360, 2002.

[21] J. M. Torres-Moreno and M. B. Gordon, "Characterization of the sonar signals benchmark," Neural Processing Lett., vol. 7, pp. 1–4, 1998.

[22] S. Chen, G. J. Gibson, C. F. N. Cowan, and P. M. Grant, "Adaptive equalization of finite nonlinear channels using multilayer perceptrons," Signal Processing, vol. 20, no. 2, pp. 107–119, 1990.

[23] D. J. Sebald and J. A. Bucklew, "Support vector machine techniques for nonlinear equalization," IEEE Trans. Signal Processing, vol. 48, pp. 3217–3226, Nov. 2000.

[24] Virtex-II Platform FPGA Handbook (ver. 1.3), Xilinx.

[25] S. McBader, L. Clementel, A. Sartori, A. Boni, and P. Lee, "SoftTotem: An FPGA implementation of the TOTEM parallel processor," presented at the 12th Int. Conf. Field Programmable Logic and Applications, France, 2002.

Davide Anguita (S'93–M'93) received the Laurea degree in electronic engineering in 1989 and the Ph.D. degree in computer science and electronic engineering from the University of Genova, Genoa, Italy, in 1993.

After working as a Research Associate at the International Computer Science Institute, Berkeley, CA, on special-purpose processors for neurocomputing, he joined the Department of Biophysical and Electronic Engineering, University of Genova, where he teaches digital electronics, programmable logic devices, and smart electronic systems. His current research focuses on industrial applications of artificial neural networks and kernel methods and their implementation on electronic devices.

Andrea Boni received the Laurea degree in electronic engineering in 1996 and the Ph.D. degree in computer science and electronic engineering from the University of Genova, Genoa, Italy, in 2000.

After working as a Research Consultant with the Department of Biophysical and Electronic Engineering, University of Genova, he joined the Department of Information and Communication Technologies, University of Trento, Trento, Italy, where he teaches digital electronics and adaptive electronic systems. His main scientific interests focus on the study and development of digital circuits for advanced information processing.

Sandro Ridella (M'93) received the Laurea degree in electronic engineering from the University of Genova, Genoa, Italy, in 1966.

He is a Full Professor in the Department of Biophysical and Electronic Engineering, University of Genova, Italy, where he teaches inductive learning and statistics and optimization methods. In the last ten years, his scientific activity has been mainly focused on the field of neural networks.