IEEE TRANSACTIONS ON VERY LARGE SCALE … ·  · 2018-01-04Algorithm-Driven Architectural Design...

14
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 Algorithm-Driven Architectural Design Space Exploration of Domain-Specific Medical-Sensor Processors Mohammed Shoaib, Student Member, IEEE, Niraj K. Jha, Fellow, IEEE , and Naveen Verma, Member, IEEE Abstract— Data-driven machine-learning techniques enable the modeling and interpretation of complex physiological signals. The energy consumption of these techniques, however, can be excessive, due to the complexity of the models required. In this paper, we study the tradeoffs and limitations imposed by the energy consumption of high-order detection models implemented in devices designed for intelligent biomedical sensing. Based on the flexibility and efficiency needs at various processing stages in data-driven biomedical algorithms, we explore options for hardware specialization through architectures based on custom instruction and coprocessor computations. We identify the limi- tations in the former, and propose a coprocessor-based platform that exploits parallelism in computation as well as voltage scaling to operate at a subthreshold minimum-energy point. We present results from post-layout simulation of cardiac arrhythmia detec- tion with patient data from the MIT-BIH database. After wavelet- based feature extraction, which consumes 12.28 μJ, we demon- strate classification computations in the 12.00–120.05 μJ range using 10 000–100 000 support vectors. This represents 1170× lower energy than that of a low-power processor with custom instructions alone. After morphological feature extraction, which consumes 8.65 μJ of energy, the corresponding energy numbers are 10.24–24.51 μJ, which is 1548× smaller than one based on a custom-instruction design. Results correspond to V dd = 0.4 V and a data precision of 8 b. Index Terms— Biomedical sensor processors, classification accelerators, embedded machine learning, low-energy design by voltage and precision scaling, structured hardware specialization, support-vector machines. I. I NTRODUCTION B IOMEDICAL devices are striving to provide high-value analysis of physiological signals. Through continuous analysis of such signals, outpatient monitoring networks, for instance, promise comprehensive healthcare delivery over large populations and potentially diverse disease states [1]. The central need, as these systems aim to provide actionable outputs [2], is the ability to detect specific physiological states of interest from signals that are available through minimally invasive and low-power sensors. This poses two essential challenges: 1) the signal correlations to clinically Manuscript received March 1, 2012; revised June 3, 2012; accepted August 31, 2012. This work was supported in part by the Qualcomm Innovation Ph.D. Fellowship and the Gigascale Systems Research Center, one of six research centers funded under the Focus Center Research Program, the Semiconductor Research Corporation Entity. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Seda O. Memik. The authors are with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/TVLSI.2012.2220161 relevant states are often too complex to model based on physiology and 2) the precise correlations are hard to isolate in the presence of normal physiologic activity. In the case of arrhythmia detection based on electrocardiogram (ECG) sens- ing, for instance, abnormal ECG beat morphologies exhibit only subtle variations compared to normal waveform activity. Their discrimination from each other and from normal beats poses a major challenge for signal detection and analysis. Both the targeted and background variations are often expressed through specific manifestations in physiological waveforms. A potential key to accurate signal detection could thus lie in the ability to develop models based on representative or exemplary data. In fact, data-driven modeling techniques are emerging as a powerful approach for overcoming the men- tioned challenges [3]. This, in large part, has been prompted by the recent large-scale availability of data in the healthcare domain as well as the development of machine-learning tech- niques that are capable of exploiting large amounts of data to model specific correlations and then use the models within a decision function [4]. Furthermore, data-driven modeling often involves probabilistic methods for model construction and can thus also tolerate substantial variations due to background physiologic activity [5]. For the systems of interest, however, the computations involved must be achieved at very low power levels (e.g., 1–10 mW for wearable devices and 10–100 μW for implantable devices [6]). In the recent literature, low-power processing platforms for biomedical detection have empha- sized an optimization of signal-processing computations, largely based on synthesis and analysis operations [7]–[9]. In contrast to these techniques, data-driven methods focus on modeling and classifying signals by exploiting potentially rich data that are available. With regard to data-driven methods, applications have been based not on low-power sensors, but rather on high-performance computing clusters with substan- tially relaxed energy constraints [10], [11]. In the biomedical domain, chronic patient monitoring devices are beginning to exploit data-driven techniques, but have thus far been limited in their ability to incorporate the complete computation [2], [12], [13]. It has been shown, for instance, that local feature computations can be per- formed on the signal, reducing the communication data so that computationally intensive data-driven classification can be performed on a separate device [12]. Another direction for medical sensors has been to employ application-specific architectures for low energy [14], [15]. Such systems, however, 1063–8210/$31.00 © 2012 IEEE

Transcript of IEEE TRANSACTIONS ON VERY LARGE SCALE … ·  · 2018-01-04Algorithm-Driven Architectural Design...

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

Algorithm-Driven Architectural DesignSpace Exploration of Domain-Specific

Medical-Sensor ProcessorsMohammed Shoaib, Student Member, IEEE, Niraj K. Jha, Fellow, IEEE, and Naveen Verma, Member, IEEE

Abstract— Data-driven machine-learning techniques enable themodeling and interpretation of complex physiological signals.The energy consumption of these techniques, however, can beexcessive, due to the complexity of the models required. In thispaper, we study the tradeoffs and limitations imposed by theenergy consumption of high-order detection models implementedin devices designed for intelligent biomedical sensing. Based onthe flexibility and efficiency needs at various processing stagesin data-driven biomedical algorithms, we explore options forhardware specialization through architectures based on custominstruction and coprocessor computations. We identify the limi-tations in the former, and propose a coprocessor-based platformthat exploits parallelism in computation as well as voltage scalingto operate at a subthreshold minimum-energy point. We presentresults from post-layout simulation of cardiac arrhythmia detec-tion with patient data from the MIT-BIH database. After wavelet-based feature extraction, which consumes 12.28 µJ, we demon-strate classification computations in the 12.00–120.05 µJ rangeusing 10 000–100 000 support vectors. This represents 1170×lower energy than that of a low-power processor with custominstructions alone. After morphological feature extraction, whichconsumes 8.65 µJ of energy, the corresponding energy numbersare 10.24–24.51 µJ, which is 1548× smaller than one based ona custom-instruction design. Results correspond to Vdd = 0.4 Vand a data precision of 8 b.

Index Terms— Biomedical sensor processors, classificationaccelerators, embedded machine learning, low-energy design byvoltage and precision scaling, structured hardware specialization,support-vector machines.

I. INTRODUCTION

B IOMEDICAL devices are striving to provide high-valueanalysis of physiological signals. Through continuous

analysis of such signals, outpatient monitoring networks,for instance, promise comprehensive healthcare delivery overlarge populations and potentially diverse disease states [1].The central need, as these systems aim to provide actionableoutputs [2], is the ability to detect specific physiologicalstates of interest from signals that are available throughminimally invasive and low-power sensors. This poses twoessential challenges: 1) the signal correlations to clinically

Manuscript received March 1, 2012; revised June 3, 2012; accepted August31, 2012. This work was supported in part by the Qualcomm Innovation Ph.D.Fellowship and the Gigascale Systems Research Center, one of six researchcenters funded under the Focus Center Research Program, the SemiconductorResearch Corporation Entity. The associate editor coordinating the review ofthis manuscript and approving it for publication was Dr. Seda O. Memik.

The authors are with the Department of Electrical Engineering, PrincetonUniversity, Princeton, NJ 08544 USA (e-mail: [email protected];[email protected]; [email protected]).

Digital Object Identifier 10.1109/TVLSI.2012.2220161

relevant states are often too complex to model based onphysiology and 2) the precise correlations are hard to isolatein the presence of normal physiologic activity. In the case ofarrhythmia detection based on electrocardiogram (ECG) sens-ing, for instance, abnormal ECG beat morphologies exhibitonly subtle variations compared to normal waveform activity.Their discrimination from each other and from normal beatsposes a major challenge for signal detection and analysis.

Both the targeted and background variations are oftenexpressed through specific manifestations in physiologicalwaveforms. A potential key to accurate signal detection couldthus lie in the ability to develop models based on representativeor exemplary data. In fact, data-driven modeling techniques areemerging as a powerful approach for overcoming the men-tioned challenges [3]. This, in large part, has been promptedby the recent large-scale availability of data in the healthcaredomain as well as the development of machine-learning tech-niques that are capable of exploiting large amounts of data tomodel specific correlations and then use the models within adecision function [4]. Furthermore, data-driven modeling ofteninvolves probabilistic methods for model construction and canthus also tolerate substantial variations due to backgroundphysiologic activity [5].

For the systems of interest, however, the computationsinvolved must be achieved at very low power levels (e.g.,1–10 mW for wearable devices and 10–100 μW forimplantable devices [6]). In the recent literature, low-powerprocessing platforms for biomedical detection have empha-sized an optimization of signal-processing computations,largely based on synthesis and analysis operations [7]–[9].In contrast to these techniques, data-driven methods focus onmodeling and classifying signals by exploiting potentially richdata that are available. With regard to data-driven methods,applications have been based not on low-power sensors, butrather on high-performance computing clusters with substan-tially relaxed energy constraints [10], [11].

In the biomedical domain, chronic patient monitoringdevices are beginning to exploit data-driven techniques, buthave thus far been limited in their ability to incorporate thecomplete computation [2], [12], [13]. It has been shown,for instance, that local feature computations can be per-formed on the signal, reducing the communication data sothat computationally intensive data-driven classification canbe performed on a separate device [12]. Another directionfor medical sensors has been to employ application-specificarchitectures for low energy [14], [15]. Such systems, however,

1063–8210/$31.00 © 2012 IEEE

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

2 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

lack applicability to a broader range of detection problems.In [16], we presented a specialized computational platform fordata-driven biomedical monitoring. In this paper, we inves-tigate the principles behind a hardware-specialized architec-ture for data-driven biomedical algorithms by quantitativelyevaluating various platform options, from general-purposeCPUs to custom instructions to hardware coprocessors. Forthe coprocessor approaches, we also evaluate the potentialof microarchitecture and circuit-level opportunities, such asdynamic voltage and precision scaling. We thus propose analgorithm-driven methodology that takes advantage of thecomputational structure and the characteristics of data-drivenpatient monitoring algorithms. Our specific contributions areas follows.

1) We present an energy analysis of representativebiomedical detection applications (cardiac arrhythmiadetection is considered in detail). The analysis isbased on patient data from the MIT-BIH database[17] and shows that classification, the complexity ofwhich depends on the characteristics of the data, posesthe primary energy limitation. We also show that thecomputational structure of classification limits theenergy savings attainable through the use of custominstructions.

2) Based on the energy analysis and the computationalrequirements of various parts of the algorithm, wepropose a generalized architecture for a biomedicalcomputation platform. This attempts to employprogrammability where computational flexibility isrequired, while leveraging hardware specialization forclassification, where set computations are required atvery high energy efficiency.

3) We propose a transistor-level design of a classifica-tion coprocessor that leverages a low-power technology(i.e., low-leakage FD-SOI). Specific requirements forcomputational flexibility are identified and incorporatedthrough hardware scalability in a parallelized subthresh-old implementation that operates at the minimum-energysupply voltage.

The rest of this paper is organized as follows. In Section II,we present an analysis of the data-driven biomedical algo-rithms and identify the computational steps involved. InSection III, we explore the architecture of a low-energy data-driven computational platform through custom instructions anda coprocessor. In Section IV, we describe specialized circuitsat the transistor level, including a variable-precision multiply-accumulate (MAC) unit for the coprocessor-based implemen-tation. In Section V, we present post-layout simulation resultsfor the coprocessor. Finally, we conclude in Section VI.

II. APPLICATION-DOMAIN ALGORITHMIC STUDY

In this section, we describe the general computational struc-ture of algorithms used for analyzing physiological signals. Wefocus on arrhythmia detection, which employs morphologicaland wavelet features and performs detection computationsbased on a support vector machine (SVM) classifier.

Fig. 1. Structure of data-driven biomedical detection algorithms, includingoffline training and online detection (employing biomarker extraction andinterpretation).

A. General Structure of Algorithms

Cardiac arrhythmias refer to abnormal heart beats that areindicative of a range of cardiovascular conditions. We focus onarrhythmia detection as an example in our algorithmic study.

Fig. 1 illustrates the structure of a typical machine-learning-based data-driven analysis algorithm. In the arrhyth-mia detection case study, ECG data are preprocessed fornoise removal (band-pass filtering), QRS-complex detection[18], and beat segmentation [19]. Subsequently, the detectionprocess involves the following two primary steps, namely,biomarker extraction, and biomarker interpretation (through aclassifier) [20]. Biomarkers refer to specific signal parametersthat are indicative of the physiological state of interest [21],[22]. For arrhythmia detection, a range of biomarkers havebeen used, including ECG morphology, beat intervals, andspectral features [23]–[26]. The diversity in the choice ofbiomarkers is due to the various clinical tradeoffs introducedby each, which can also be variable across patients [27], [28].In this paper, we use two prominent biomarkers: waveformmorphology [23] and spectral wavelets [29]. The associatedprocessing steps, including segmentation (to isolate individualbeats), are then implemented in software (enabling the energyanalysis presented next). The outputs from these stages formthe feature vectors that are used for classification.

Following the feature extraction process, an SVM is usedfor data-driven modeling and classification. SVMs are popularmachine-learning classifiers that can be efficiently trainedoffline. This process results in a set of vectors, called supportvectors (SVs), which are used to model the data by repre-senting a decision boundary. Although training can be doneoffline, classification, through the application of the SV model,must be performed in real time for chronic detection. Theactual classification computation is shown below for radialbasis function (RBF) and polynomial transformation kernels

DATA CLASS = sgn

[NSV∑i=1

K (�x · �svi) αi yi − b

](1)

where

K (�x · �svi) ={

ex p(−γ ||�x − �svi||2) RBF kernel

F(�x · �svi + β)d Poly. kernel.

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

SHOAIB et al.: DESIGN SPACE EXPLORATION OF DOMAIN-SPECIFIC MEDICAL-SENSOR PROCESSORS 3

(a) (b)

Fig. 2. Reducing model complexity by reducing NSV, degrading theaccuracy of classification for (a) RBF and polynomial kernels of order 3 and(b) polynomial kernels of order 2 and 4.

TABLE I

XTENSA CUSTOM PROCESSOR CONFIGURATION

Parameter Configuration

Instruction width 24 bPipeline length 5 stagesPipeline type Uniscalar

General-purpose registers 16ALUs 1

Branch units 1Core frequency 10 MHz

Instruction RAM 2 kBData RAM 4 kB

Datapath width 32 b

Disabled Options

Multipliers (MUL32/MUL16), Viterbi unit,Single-cycle MAC, zero-overhead loop,Normalized shift, min/max unit,ICache/DCache associativity

Here, sgn[] is the signum function, �x is the feature vector tobe classified, and �svi is the i th support vector (b, d , αi , β, γ ,and yi are training parameters).

As the number of support vectors (NSV ) and the feature-vector dimensionality (DSV ) scale, the classification com-putations are dominated by the dot product between �x and�svi. Further, in (2), K represents a kernel function, whose

choice, along with NSV and DSV , can have a major impacton classifier complexity.

B. Need for Advanced Classification Models

In this section, we present the limitations of using simpleclassification models (which would address the complexitychallenge described above). Experimental results show that,as the model complexity reduces, the classifier performancedegrades, necessitating high-order models. The detection algo-rithms illustrated in Fig. 1 are implemented using SVM-Light [30], which is an open-source implementation of anSVM, for the classifier. To correctly analyze the computationalcomplexities and tradeoffs imposed by the model, we usepatient data from the MIT-BIH database [17].

If the feature vectors were linearly separable, a linearkernel function (k) could be used for classification [31]. Thetest vector could then be pulled out of the summation in(2), enabling precomputation of a single decision vector. Asa result, even when NSV scales, the classification energy

(a) (b)

Fig. 3. Arrhythmia detector. (a) Classification energy scales with NSVdominate those of feature extraction. (b) Classification energy scales withDSV . NSV and DSV represent classification complexity.

can remain constant. However, biomedical applications haveshown to perform poorly when linear decision functions areused with medical datasets [32]. Nonlinear functions, such ashigh-order polynomials, RBFs, or sigmoidal kernels, are thusneeded for acceptable classifier accuracies [33]. Several bio-medical detection systems in the literature employ nonlinearkernels for classification. For instance, using polynomial andRBF kernels, 15.8% and 47.4% improvements in detectionaccuracy are reported in [34] and [35], respectively.

The performance of the classifier depends on the SV modeland the characteristics of the application data. Figs. 2(a)and (b) show how the sensitivity and specificity for arrhythmiadetection degrade as NSV is reduced.1 In order to reduce NSV ,the training parameters b, d , αi , β, γ , and yi are adjustedalong with the choice of the data subset used for training.Thus, the model complexity depends on the characteristicsof the application data and introduces an unavoidable trade-off with respect to accuracy performance. We next presentan energy analysis of the end-to-end arrhythmia detectionalgorithm, which employs high-order detection models foraccurate signal classification. This analysis will enable anarchitectural study toward a low-energy application-domainprocessor.

III. APPLICATION-DOMAIN ARCHITECTURAL STUDY

We take three approaches in our architectural study. First,we implement the entire arrhythmia detection algorithm on anembedded low-power base processor. We find that classifica-tion poses the energy bottleneck due to the complexity of themodels required. We then explore opportunities for hardwarespecialization through custom instructions and finally througha coprocessor.

For the base processor, we use the Xtensa processor fromTensilica, which is a customizable and extensible platform[36]. A family of processors can be built around the baseinstruction set architecture (ISA) of the synthesizable Xtensaprocessor core [37], [38]. As a result, custom processorconfigurations can be obtained with optimized performance,power dissipation, code size, and die size. Design automationalgorithms and tools for extensible and configurable processorsare discussed in [39] and [40]. Design-space exploration of the

1Sensitivity = (TP/TP + FN ) and specificity = (TN /TN + FP ), whereT (F)N (P) is the number of true (false) negatives (positives).

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

4 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

TABLE II

SOFTWARE ENERGY (PER TEST VECTOR) FOR PREPROCESSING AND

FEATURE EXTRACTION ON THE BASE XTENSA PROCESSOR CORE

Computational Step Energy/Test Vector

Preprocessing segmentation 84.02 μJMorphology feature extraction 2.61 μJ

Wavelet feature extraction 29.28 μJ

Tensilica processor parameters allows customizing thebase processor to achieve minimum-energy consumption.The major parameters include the choice of multipliers(MUL32/MUL16), instruction/data cache sizes andassociativities (range of 0–16 kB and 2–16), anddatapath/instruction path widths (range of 8–32 b). Wepick the design parameters that lead to minimum-energyconsumption based on an initial parameter-space exploration.The configuration obtained for the base processor is shown inTable I. We present energy profiling results of the arrhythmiadetector on the configured base processor.

A. Implementation on the Base Processor

In this section, we present energy analysis of a softwareimplementation of the arrhythmia detector on the Tensilicabase processor; initial design profiling leads to the configu-ration parameters shown in Table I. Note that even thoughclassification involves dot product computations, includingmultipliers (MUL16 and MUL32) in the base processor leadsto higher energy consumption. In Section III-B, we show thatthis energy increase is due to the overhead of fetching high-dimensional data for the multipliers.

The energy profiling results for the preprocessing andfeature extraction steps are shown in Table II (results areshown at the operating frequency of 10 MHz and supplyvoltage of 1.2 V). A feature vector is derived every heartbeat and consumes 84.02 μJ for segmentation, which includesthe process of isolating individual heartbeats along with thefiltering of noise and other interference sources. Subsequently,2.61 and 29.28 μJ of energy is consumed for morphologicaland wavelet feature extraction, respectively.

Fig. 3(a) and (b) shows the energy of classification versusNSV and DSV , respectively. Both the number and dimension-ality of the SVs are representative of the model complexity.It can be seen that, because of energy scaling, classificationenergy rapidly dominates that of feature extraction. NSV =10 000, using wavelet features and a fourth-order polynomialkernel, for instance, leads to an energy consumption ratio of941:3:1 for classification, preprocessing, and feature extrac-tion, respectively. At NSV = 100 000, the ratio increases to5172:3:1. As described in Section II-B, to avoid compromisingaccuracy, biomedical applications generally require complexmodels, causing the classification energy to be dominant.The energies reported in other biomedical applications exhibita similar trend (e.g., seizure detection based on electroen-cephalograph classification [12]).

The output of the Xtensa profiling tool for classificationcomputations is shown in Table III. The dot product computa-tions required in (2) are shown in bold (sprod_ns and kernel).

(a) (b)

Fig. 4. Custom-instruction based implementation resulting in only modestenergy improvement. Classification energy still dominates preprocessing +feature extraction (a) by 1146× for wavelet features and (b) by 264× formorphological features.

These functions correspond to the NSV and DSV analysispresented in Section II-B, and constitute over 70% of thecomputations. Thus, classification is the energy-intensive com-putational step and is targeted next for hardware specializationthrough custom instructions and a coprocessor.

B. Custom-Instruction-Based Platform

In this section, the use of custom instructions is explored asa hardware-specialization option. We primarily focus on theclassifier because of its importance in determining the totalenergy. We find that the custom instructions are insufficient forachieving significant classifier energy savings. This is due tothe overheads that remain for fetching high-dimensional datafrom memory. Thus, the energy reductions achievable throughthe use of custom instructions for classification are limitedbecause of the large number of operands involved in the dotproduct computation.

The Tensilica Xpress compiler [36] is used to optimize thesoftware implementation. This involves an automatic designspace exploration of the potential custom instructions. SVMclassification, wavelet transform with Daubechies wavelets oforder four, and morphological feature extraction functions(including threshold selection and QRS isolation) are cho-sen for implementation as custom instructions. We performXpress synthesis using FLIX, Fusion, and SIMD instructionsprovided by Tensilica [36]. These design options provideoptimization techniques, including automatic vectorization inthe custom processor. Fusion instructions enable the lowestenergy implementation. Table IV shows the top three custominstructions obtained for the feature-extraction and classifi-cation computations. The number of calls to each custominstruction is also shown as a percentage of the total instruc-tions. We observe that there is no commonality in the custominstructions across feature-extraction and classification compu-tations. Figs. 4(a) and (b) show the classification energy aftera custom-instruction-based optimization (corresponding towavelet and morphological features, respectively). The energyconsumption is still substantially dominated by classification.At NSV = 10 000, for a fourth-order polynomial kernel, theratios of energy consumption for classification, preprocessing,and feature extraction computations are 4432:3:1 and 720:3:1,for the wavelet and morphological features, respectively.

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

SHOAIB et al.: DESIGN SPACE EXPLORATION OF DOMAIN-SPECIFIC MEDICAL-SENSOR PROCESSORS 5

TABLE III

TENSILICA CODE PROFILER OUTPUT FOR THE SVM CLASSIFIER

Morphological Features, DSV = 26, NSV = 10 000, Frequency = 10 MHzFunction Percentage Self Cumulative Number of

Name Time (%) Secs. Secs. Cyclessprod_ns 59.12 0.09 0.09 9 25 748

kernel 11.67 0.02 0.11 1 82 738smult_s 7.95 0.01 0.12 124487

add_vector_ns 6.54 0.01 0.13 102408

Wavelet Features, DSV = 256, NSV = 10 000, Frequency = 10 MHz

Function Percentage Self Cumulative Number ofName Time (%) Secs. Secs. Cycles

sprod_ns 63.30 0.57 0.57 56 88 984kernel 12.89 0.12 0.69 11 58 467smult_s 8.34 0.08 0.77 7 53 113

add_vector_ns 6.60 0.06 0.83 5 93 164

Self Secs. is the number of seconds accounted for by a particular function alone.Cumulative Secs. is a running sum of the number of seconds accounted for by a function and those listed above it.

TABLE IV

CUSTOM INSTRUCTIONS FROM THE XPRESS COMPILER AT NSV = 10 000

Custom Instruction % of Total Instr.

Preproc. + Morphology

fusion.nop.loopgt.extui 32.6fusion.abs.add8×8.extui 11.4

fusion.nop.neg8.extui 9.0

Preproc. + Wavelet

fusion.ssl.mul8×16_0.extui 21.9fusion.l8rzl.extai 17.2

fusion.movt.z.extui 8.4

Classification

fusion.movi8×16.extui 13.1fusion.add.sdd8×16.simcw.extui 5.4

fusion.sll.sub16×16_0.extui 5.3

Table V summarizes the resulting energy savings obtainedthrough custom-instruction-based optimization. This optimiza-tion leads to roughly 10× energy improvement for thepreprocessing and feature-extraction operations. However,optimization of classification computations leads to roughlyonly 2× reduction in energy. The limited energy reductionfor classification is due to the large amount of data, whichneed to be fetched from memory, involved in the dot productcomputations (i.e., large NSV and DSV ) [37], [39]. This limi-tation is not adequately addressed by custom-instruction-basedoptimization. To gain further intuition, consider [37], which isused to rank candidate templates for custom-instruction-basedimplementation

Priority = OT

max (In − ω, 0) + max (Out − σ, 0) + ρ.

In the above equation, OT is the fraction of the total executiontime of the original program spent in the template, In andOut are the number of inputs and outputs of the template,respectively, ω is the number of inputs that can be encoded inthe instruction, σ is the number of outputs that can be encoded,and ρ is the number of cycles required by the template whenimplemented as a custom instruction. For custom instructioncandidate templates for classification computations, OT has

Fig. 5. General architecture of a biomedical processor.

a large value of 0.70 (according to Table III). However,the high-dimensional input vectors (corresponding to DSV )and the large number of cycles ρ required to fetch a high-order decision model (corresponding to NSV ) reduce thepriority as a potential choice for custom-instruction-basedimplementation. Choosing custom instructions for the dotproduct computation based on an alternate priority functionwould still result in suboptimal energy savings. This is becausethe processor architecture would limit the data width in theclassifier custom instructions, necessitating additional cyclesfor load-store operations. Thus, system memory overheadsoffset the benefits of custom-instruction-based speedup inclassification.

The use of a vector processor core for handling high-dimensional input data can thus provide a potential solu-tion for the classification computations. However, since theapplication data are not inherently vector in nature, such anarchitecture incurs unnecessary overheads for the general-purpose computations required, i.e., the feature-extractioncomputations required across clinical applications. Rather,the representation of data in a vector form is a specifictransformation introduced by the classification framework.Thus, the complexity and associated energy overheads [41]incurred by a vector-processor-based implementation of the

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

6 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

TABLE V

SPEEDUP AND ENERGY SAVING USING CUSTOM INSTRUCTIONS WITH A BASE XTENSA

PROCESSOR RUNNING AT Vdd = 1.2 V AND 10 MHZ, WITH NSV = 10 000

Processor Preproc. +Morphology

Preproc. +Wavelet

Classification

Configuration DSV = 26 DSV = 256

Base XtensaNo. Cycles 27.56k 36.04k 1.57M 8.99M

Energy (μJ) 86.62 113.29 4935.2 27574

Base Xtensa+ Custom Instr.

No. Cycles 2.75k 3.91k 0.78M 4.58M

Energy (μJ) 8.65 12.28 2455.3 14068

ENERGY IMPROVEMENT 10.01× 9.23× 2.01× 1.96×

Fig. 6. Architecture and layout of the classification coprocessor designed in an FD-SOI process. The coprocessor architecture comprises an array of MACunits to compute the dot product of SVs and TVs. The output is then transformed by a kernel function in order to evaluate the classification result.

preprocessing and feature-extraction steps are best avoidedfor a platform-level design. To exploit both the canonicalcomputations and data structuring required in the classificationframework, we next turn our attention to coprocessor-basedhardware specialization.

C. Coprocessor-Based Platform

In this section, we discuss a coprocessor-based special-ization that allows the data structures used by the classifierto be efficiently handled, yielding substantial energy savings.Further, this degree of specialization raises the opportunity formicroarchitectural optimizations based on parallelism, whichcan be readily exploited in the computation.

The architecture proposed in Fig. 5 aims to take advantageof the structure of biomedical algorithms where a high degreeof flexibility is primarily required in the feature computa-tions. The need for flexibility arises because of the rangeof feature computations involved in various applications. Forinstance, morphological and wavelet features are employed inarrhythmia detection [23], [29], proteomic classification [42],and heart-rate estimation [43]; spectral features are employedin seizure detection [12], brain–machine interfaces [44], andsleep disorder analysis [45].

Thus, a general-purpose processor is employed for featurecomputation, while an optimized coprocessor is employedfor kernel-based SVM classification. The feature-extractioncomputations are optimized through custom instructions, pro-viding significant energy savings, as shown in Table V. Thesecomputations involve floating-point operations in the Tensilica

CPU, incurring somewhat higher energy than a fixed-pointimplementation. However, as explained in the previous section,the contribution of the feature-extraction energy to the overallprocessor energy is very small. Thus, we focus on optimizingthe coprocessor. Here, in addition to hardware specialization,circuit and microarchitectural enhancements aim to achieveminimum-energy operation [46] through voltage scaling andparallelism, whereby the throughput constraints for real-timedetection can be met. In addition to energy efficiency, the needfor selective flexibility is also recognized so that the classifi-cation needs across a wide range of biomedical applicationscan be supported. For example, the rate of processing massspectrometry data [42] could be different from EEG [12] orECG signals [23], [29]. For these applications, NSV , DSV , dataprecision, as well as the kernel functions will also be different.The coprocessor thus introduces this flexibility through aprecision-scalable multiplier. It also yields programmabilityin the classification model, computation precision, and thechoice of kernel transformation function. These aspects aresummarized in the block diagram of Fig. 5.

IV. LOW-ENERGY CLASSIFICATION COPROCESSOR

In this section, we describe the architecture of the classifi-cation coprocessor in further detail.

A. Coprocessor Microarchitecture

Fig. 6 shows the architecture and layout of the classifi-cation coprocessor. It has three major functional blocks: 1)SV and test vector (TV) buffers; 2) MAC engine; and 3)

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

SHOAIB et al.: DESIGN SPACE EXPLORATION OF DOMAIN-SPECIFIC MEDICAL-SENSOR PROCESSORS 7

a programmable polynomial kernel core. Following offlinetraining, SVs are loaded into the SV preload buffers. The TVs,produced through feature extraction by the general-purposeprocessor, are loaded into the TV line buffer. TVs and SVsare then fed dimension by dimension to the MAC array inorder to perform the dot product operations in (2). Readoutfrom these buffers is optimized using a multiplexer-basedarray decoder. Hardware parallelism is employed through anarray of MAC units: MAC_1 to MAC_N , each of whichis associated with an SV preload buffer. The coprocessoroperates on integer data. Once multiplication over all thedimensions is complete, the dot products are multiplexed to thekernel transformation block, where a second-, third-, or fourth-order polynomial transformation is computed. Furthermore, aCORDIC module in the kernel block would potentially accom-modate additional transformation functions, such as RBF,sigmoid, etc. The results are scaled and summed by a finalaccumulator whose output sign determines the classificationresult.

B. Voltage Scaling and Paralellism

In this section, we describe the energy optimization pursuedfor the coprocessor through voltage scaling. Since the dotproduct derivation (in the MAC array) dominates the com-putation, we focus on optimizing its energy.

The total energy is determined primarily by the sum ofactive switching (Eact ) and subthreshold leakage (Elk). Thereduction in Eact due to Vdd scaling is opposed by the increasein leakage energy (due to the longer resulting leakage currentintegration time TMAC). The energy-optimal point thus typi-cally occurs in the subthreshold region, since here the circuitspeed begins to degrade rapidly [46]. Although this impliesthat energy optimization leads to low circuit performance,computational throughput constraints can be efficiently met ifthe required computations can be performed in parallel withoutimposing substantial overheads due to parallelization [47]. Wecan thus exploit the parallelism possible in the classifier dotproduct computation (i.e., MAC array) to achieve minimum-energy operation for real-time biomedical detection. To do this,we first determine the minimum-energy Vdd of a MAC unit.We then determine its performance at this Vdd (i.e., secondsper MAC operation, TM AC ). The total rate of MAC operations(RT .M AC ) required in the classifier computation [of (2)] isgiven by

RT .MAC = �NSV × DSV × RCLASS� (2)

where RC L ASS is the classification rate. The required paral-lelism is then RT .M AC ×TM AC . For the application considered,the RT .M AC required ranges from 2.7 M to 7.7 M MACs persecond [16].

Fig. 7 shows Eact and Elk of a MAC unit (based on atransistor-level simulation) implemented in the target 150-nmFD-SOI CMOS process (described further in Section IV-D).The total energy Etotal is minimized at a Vdd of 0.4 V, whichis in the subthreshold region for the technology. Fig. 8 showsthe performance achieved by a MAC unit as Vdd is scaled.Under worst case process and temperature conditions (i.e., lowtemperature in subthreshold), the maximum frequency at the

Fig. 7. Eact and Elk profiles for a MAC unit with the minimum total energyoccurring at Vdd = 0.4V .

Fig. 8. Operating frequency at Vdd = 0.4V is 520 kHz at 285 K (lowtemperature is slowest in subthreshold).

minimum-energy Vdd is 520 kHz (i.e., TM AC = 1.92 μs). Thelevel of parallelism required is thus 6–15 MAC units. Fig. 7,however, shows that the energy minimum is shallow, partic-ularly if Vdd is increased slightly. For instance, to increasethe MAC performance by a factor of three (in order to coverthe target RT .M AC range), Vdd must be increased by less than50 mV (based on Fig. 8), causing a negligible increase intotal energy (based on Fig. 7). We thus optimize for the lowerperformance (by employing six MAC units) and use voltagescaling, with minimal impact on the optimization, to elevatethe performance when required.

C. Circuit-Level Optimization

In this section, we describe how the scalability desired inthe classification coprocessor is achieved.

1) SV and TV Buffers: The energy of the buffers isoptimized for read operations since the SVs are loadedinfrequently (i.e., only when a new classification model isrequired). The coprocessor buffers support a DSV ×NSV of 64.If additional storage is required to represent the classificationmodel, the control block permits expansion by allowing upto 16 384 write sequences from the processor cache or fromoff-chip memory to the local buffers. As an example, 4095

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

8 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

Fig. 9. Variable-precision MAC unit. Partial product additions can be terminated at CBA-0/1/2 to scale the precision for 8/10/12-bit inputs.

Fig. 10. Scaling precision of input data enabling a second level of energyoptimization. The figure shows energy savings of 17.6% while scaling thedata-representation precision from 12 to 8 b.

SVs and 256 feature dimensions can be supported, along withany other combination that results in the same product.

2) Variable-Precision MAC: Due to the wide range ofSVs and feature dimensions across applications, the precisionrequirements of the classifier computation are variable. Sev-eral approaches for scalable-precision multipliers have beenreported (e.g., [46]). The approach used here exploits theefficiency of the Booth encoding algorithm [48].

Fig. 9 shows the architecture of the variable-precision MACunit. In the MAC unit, the BOOTH ENC blocks compute thepartial products based on the select bits of the multiplier (y).The shifted partial products are output as P Pi , i ∈ [0, 5]. Thisallows a maximum precision of 12 b for the input operands(corresponding to six partial products). In Fig. 9, P P0 andP P1 are the partial products used when precisions of 12 and10 b, respectively, are required; otherwise the precision is 8b. The carry-bypass adders (CBAs) consist of M = 4-bit fulladder chains, and N represents the total input bit width ofeach adder. The common partial products required for the8/10/12 precision bits are added using 3:2 and 2:1 compressorsin a Wallace tree. The outputs of C B A-0/1/2 are read outvia a precision-selection multiplexer (for 8/10/12-bit preci-sion, respectively). The unused CBAs can be power-gated.

Fig. 10 shows the energy reductions due to precision scaling.Although the minimum-energy Vdd remains the same, scalingthe precision from 12 to 8 b reduces the energy per multi-plication by 17.6%. Following precision selection, the outputof the multiplier has either a 24-, 20-, or 16-bit output. Thetruncation-selection multiplexer selects a level of truncation(to 12, 10, or 8 b, programmable via the status register). Theoutput of the truncation-selection multiplexer is accumulatedinto an output register using a 16-bit final CBA.

3) MAC Delay Estimation: Based on the proposed MACarchitecture (Fig. 9), the critical path delay through a MACunit can be estimated as follows:

Tdelay = TC Q + TBOOTH + TWAL + TCBA,19-4 + TCBA,23-4

+2TAND + 3TMUX + TCBA,16-4 + TSU

where TCBA,N-M is the delay of a CBA, which has a segmentlength of M and an operator bit length of N ; TBOOTH isthe delay through the Booth encoder unit; TWAL is the delaythrough the Wallace-tree compressor chain; TC Q and TSU arethe clock-to-output and clock setup delays, respectively; andTAND and TMUX are the AND gate and multiplexer delays,respectively. Further, these delays can be simplified using asum of delays through basic subblocks as follows:

TCBA,N-M = TSU + MTC

(N

M− 1

)TMUX

+(M − 1)TC + TS

TBOOTH = TMUX + 4TNOR + TCBA,12-4

TWAL = 3TS + TCBA,14-4

where TC and TS are the delays for the carry and sum pathsin a full adder, respectively, and TNOR is the NOR gate delay.

The delay through the critical path of the MAC unit can thusbe estimated systematically. Estimating the MAC delay basedon the performance of the mentioned subblocks thus facilitatesrapid estimation of the maximum operating frequency, levelof parallelism, and, hence, the associated system parameters,

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

SHOAIB et al.: DESIGN SPACE EXPLORATION OF DOMAIN-SPECIFIC MEDICAL-SENSOR PROCESSORS 9

Fig. 11. Programmable kernel enabling choice among kernels of degree 1–4.

Fig. 12. FD-SOI device having a steep subthreshold slope to minimizeleakage and maintain high transistor on-to-off ratios in subthreshold CMOSgates.

thereby overcoming the need for extensive transistor-levelsimulations in the early phases of system design.

4) Flexible Polynomial Kernel: A polynomial kernel canbe selected to transform the dot product output from theMAC engine. The flexible kernel module comprises two12 × 12 × 24 multipliers to support polynomial transforma-tions of order 2–4. Only one such multiplier is needed for asecond-order polynomial function. Going from a second-orderto a third-order polynomial kernel, however, incurs the costof using an additional 12 × 12 × 24 multiplier. Further, thedifference between a fourth-order and a third-order polynomialfunction is only an additional array of multiplexers. Theseaspects are summarized in Fig. 11.

D. Choice of Technology

Owing to the modest performance requirements of typicalbiomedical applications (i.e., to the relatively low bandwidthof physiological signals), employing a technology that isaggressively optimized for low leakage is beneficial. As anexample, for arrhythmia detection, a performance on the orderof 5 million MACs per second is required. The technologywe use in this paper is thus a 1.5 V 150-nm ultralow-leakageFD-SOI CMOS process [49]. FD-SOI allows the technologyto exhibit reduced process variations due to a reduction inrandom dopant fluctuations and gives the transistors steepsubthreshold slopes; the Id –Vgs characteristic for the devicesis shown in Fig. 12. Additionally, the devices are designedto have high threshold voltages to reduce the leakage current(i.e., Vt,N = 0.65 V , |Vt,P | = 0.53 V ) [49].

V. RESULTS AND ANALYSIS

In this section, we present post-layout simulation results ofthe coprocessor.

Table VI summarizes the impact of the architectural opti-mizations considered (which include feature extraction on theTensilica processor and classification using custom instruc-tions or a coprocessor). A 2× improvement in the energyconsumption per SV dimension is obtained while going froman implementation on a base Tensilica processor (which con-sumes 11.85 nJ) to a design with custom instructions (whichconsumes 5.89 nJ). The limited energy reductions, as men-tioned in Section III-B, is due to the high-dimensional inputdata required for classification. More aggressive specialization,through the use of a coprocessor, leads to a 519× energyreduction over the base case. Subsequent voltage scaling,taking advantage of the parallelism possible in the coprocessorimplementation, leads to energy reductions of 2119× and2231× for Vdd corresponding to 0.6 and 0.4 V, respectively.

A. Coprocessor Energy Measurements

In this section, we present an analysis of the energyconsumption of the classification coprocessor versus NSV

and DSV . Further, we also quantify the energy reductionsachievable through voltage and precision scaling.

1) Energy Versus NSV and DSV : We perform energymeasurements on the post-layout extracted netlist at variousvalues of NSV for the wavelet and morphological features.Fig. 13 shows the scaling in the energy consumption versus thenumber of SVs for the classification computation. The energyfor classification scales roughly linearly with NSV . A similarbehavior is observed with DSV (as shown in Fig. 14 wherethe energy numbers are provided at Vdd = 1.2 V). Fig. 13shows results for a fourth-order polynomial kernel and spectralwavelet features (DSV = 256) for arrhythmia detection. At100 000 SVs, the optimization of a base Tensilica processorwith custom instructions leads to an energy reduction by1.96× (Section III-B). For the FD-SOI Coprocessor-baseddesign, energy reductions by 228× are achieved at a supplyvoltage of 1.2 V. Voltage and precision scaling applied to thecoprocessor leads to further energy reductions. For instance,as shown in Fig. 13, classification computations after waveletfeature extraction (DSV = 256) consume 12.00–120.05 μJusing 10 000–100 000 SVs at 8 b of data precision and asupply voltage of 0.4 V. This represents 1170× lower energythan that of a Tensilica processor with custom instructionsalone. Classification computations after morphological featureextraction, at the same precision and voltage levels, consume10.24–24.51 μJ of energy, which is 1548× smaller than onebased on a custom-instruction-based design. Detailed analysisresults for the voltage- and precision-scaling experiments arepresented next.

2) Energy Versus Vdd: Eact and Elk measurements from thecoprocessor are shown in Fig. 15, demonstrating the benefitof voltage scaling. The results are shown at 12 b of dataprecision. A second-order polynomial kernel is used for thesimulations. On average, for a data precision of 12 b, Eact

accounts for 98.1%, 95.1%, and 71.2% of the total energy at

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

10 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

TABLE VI

ENERGY PER SV DIMENSION

Tensilica(at 1.2 V)

Tensilica +Custom Instr.

Tensilica + Custom Instr. + Coprocessor

V COPROCdd = 1.2 V V COPROC

dd = 0.6 V V COPROCdd = 0.4 V

11.85 nJ – 1× 5.89 nJ – ↓2× 22.84 pJ – ↓519× 5.59 pJ – ↓2119× 5.31 pJ – ↓2231×

Fig. 13. Classification coprocessor enabling energy reductions of 228× at1.2 V. Energy reductions are increased to 1170× at Vdd = 0.4 V and 8 b ofprecision.

Fig. 14. Coprocessor energy versus DSV per TV (at 12-bit data precision).

supply voltages of 1.2, 0.6, and 0.4 V, respectively. Voltagescaling thus enables energy reduction by up to 77%.

3) Energy Versus Precision: Table VII shows the Eactand Elk measurements from post-layout simulation of thecoprocessor using 8 b of data-representation precision.A second-order polynomial kernel is used for the resultsshown. On average, for a data precision of 12 b, Eact accountsfor 98.1%, 95.2%, and 71.3% of the total energy at supplyvoltages of 1.2, 0.6, and 0.4 V, respectively. Consequently,performing computations at a data-representation precision of8 b enables an overall 9.25%, 9.09%, and 9.09% reductionin total energy as compared to an implementation that relies

Fig. 15. Coprocessor energy versus Vdd per TV (at NSV = 10, DSV = 8,and 12-bit data precision). Vdd scaling enables energy reduction by up to77%.

Fig. 16. Total application energy (segmentation + feature extraction +classification) reduced by up to 1062× through the use of a coprocessor forclassification.

on 12 b of data precision [most savings come from a 17.6%reduction from the MAC unit alone (see Fig. 10)].

B. Energy Versus Kernel Selection

Table VIII shows the energy consumption for various clas-sifier kernels at a data precision of 12 and 8 b. Since the kerneltransform is not the dominant computation, the energy scalingis modest.

On average, going from a second- to a third- to a fourth-order polynomial costs 30.92% and 7.56% extra energy for12 b and 33.57% and 4.12% for 8 b of data-representationprecision, respectively. The incremental change in energybetween a third- and a fourth-order polynomial kernel isdue to the optimization enabled by the programmable kernelarchitecture [16].

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

SHOAIB et al.: DESIGN SPACE EXPLORATION OF DOMAIN-SPECIFIC MEDICAL-SENSOR PROCESSORS 11

TABLE VII

PRECISION SCALING ENABLING UP TO 9.25% REDUCTION IN THE COPROCESSOR ENERGY

Vdd (V) DSV NSV Precision Eact (pJ) Elk (pJ) Etotal (pJ) fop , T = 287K

1.2 8 1012 bit 3089.1 53.7 3142.8 }

9.25%↓ 10 MHz8 bit 2799.8 52.4 2852.2

0.6 8 1012 bit 730.5 31.0 761.5 }

9.09%↓ 2 MHz8 bit 667.2 25.1 692.3

0.4 8 10 12 bit 508.7 213.2 721.9 }9.09%↓ 550 kHz8 bit 457.1 199.2 656.3

TABLE VIII

COPROCESSOR ENERGY SCALING WITH RESPECT TO THE KERNEL FUNCTION

Specification Eact (pJ) Elk (pJ) Etotal (pJ) Kernel Order

Vdd = 1.2V, NSV = 10,DSV = 8, 12-bit precision

3078.7 64.1 3142.8 }30.92%↓ Poly2

4036.2 78.2 4114.4 }7.56%↓ Poly3

4344.4 81.1 4425.5 Poly4

Vdd = 1.2V, NSV = 10,DSV = 8, 8-bit precision

2799.8 52.4 2852.2 }33.57%↓ Poly2

3738.7 70.9 3809.6 }4.12%↓ Poly3

3914.6 74.7 3989.3 Poly4

HW-SW CODESIGN

(a) (b) (c) (d) (e)

Fig. 17. Energy proportions for preprocessing, feature extraction, and classi-fication computations, illustrating the benefits of voltage and precision scalingapplied to the classification coprocessor for low-energy operation. (a) BaseTensilica processor operating at Vdd = 1.2 V. (b) Custom-instruction-basedimplementation. (c) Hardware–software codesign with custom instructionsfor preprocessing + feature extraction and the classification computationsimplemented on a coprocessor at Vdd = 1.2 V and 12 b of data precision.(d) Coprocessor at Vdd = 1.2 V and 8 b. (e) Coprocessor at Vdd = 0.4 Vand 8 b.

C. Subblock Energy Measurements

Table IX shows the computational energy contributions forthe coprocessor subblocks during online classification at adata-representation precision of 8 and 12 b. In the table,the energy consumed in the TV and SV buffers, the MACarray engine, kernel, and the control block are shown in theBUF, MAC, KER, and CNTRL columns, respectively. TheMAC engine consists of six MAC units and the KER blockconsists of three MUL 12 × 12 × 24 multipliers, which aresubblocks within a MAC unit. As shown, the MAC + KERenergy dominates (∼ 84%), validating the need to focus onits energy through the optimization discussed in Section IV-B.Although the buffers dominate the transistor count, their lowenergy contribution shown in the table is due to the lowleakage enabled by the choice of technology. The buffers havea very weak influence on the minimum energy point of thecoprocessor due to the low activity factor and low leakageenergy.

D. Processor-Level Energy Measurements

In this section, we present energy measurements for theentire processor. The architecture, which employs custominstructions for feature extraction and a coprocessor for clas-sification, achieves over two orders of magnitude energyreductions as compared to an implementation that employsonly custom instructions for the entire computation.

1) Energy Versus NSV : In this section, we show energyscaling results versus NSV for the full signal detection process(which, along with classification, includes the energy numbersfor a custom-instruction-based implementation of preprocess-ing and feature extraction computations).

Fig. 16 shows simulation results for a fourth-order poly-nomial kernel and spectral wavelet features (DSV = 256).Significant energy reductions are observed for the total detec-tion process. For instance, using NSV = 10 000, the totaldetection energy at 1.2 V and 12 b of data precision is73.98 μJ. It is reduced to 63.69 μJ using 8 b of precision at1.2 V. Voltage scaling applied to this optimized coprocessorconfiguration results in a total detection energy of 24.29 μJat 0.4 V (this is about 580× lower than an implementationusing a base Tensilica processor, which consumes 14.08 mJ atVdd = 1.2 V).

2) Computational-Energy Contributions: Fig. 17 shows theproportions of energy consumption for the preprocessing,feature extraction, and classification computations followingthe various optimizations. The design space ranges from afull software-based Tensilica implementation to a coprocessor-based architecture. The proportions are shown for NSV =20 000. It is observed that, even after a custom-instruction-based optimization, the classification computations dominatefeature extraction and preprocessing (less than 1% of the totalenergy). After an optimization through the use of the FD-SOIcoprocessor, however, the classification energy is reduced sub-stantially. The energy proportions are 1.8:1 for classification

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

12 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

TABLE IX

ENERGY (PER TV) AND AREA OF THE SUB-BLOCKS

Meas. Condition Total BUF MAC, pJ KER, pJ CNTRL

Vdd DSV NSV pJ pJ (% Total) (% Total) pJ

8-bit Data Precision

1.2 V

8 10

2852.2 38.8 1511.8 (53.0) 900.9 (31.6) 400.7

0.6 V 692.3 18.9 360.7 (52.1) 220.2 (31.8) 92.5

0.4 V 656.3 18.5 343.8 (52.4) 203.9 (31.1) 90.1

12-bit Data Precision

1.2 V

8 10

3142.8 40.2 1640.5 (52.2) 990.0 (31.5) 472.1

0.6 V 761.5 19.3 402.8 (52.9) 242.2 (31.8) 97.2

0.4 V 721.9 20.1 383.3 (53.1) 223.8 (31.0) 94.7

Area (in mm2) 2.90 1.62 0.61 0.65 0.02

Fig. 18. Benefits of voltage scaling applied to the coprocessor (with theTensilica processor at 1.2 V) using a fourth-order polynomial kernel and a beatclassification rate (RCLASS) of 3 beats/s with wavelet features (DSV = 256).

and preprocessing + feature extraction computations. How-ever, at NSV = 100 000, the corresponding proportions are 9:1(the total energy consumption for the associated computationsbeing 120.32 μJ at a supply voltage of 0.4 V and a data-representation precision of 8 b).

3) Energy Versus Vdd: We next illustrate the benefits ofvoltage scaling on the coprocessor. This leads to operation ofthe arrhythmia detector at a power level < 500 μW.

Figs. 18 and 19 show the energy optimizations achievedfor the detection process using a polynomial kernel of orderfour for the wavelet and morphological features, respectively.It is observed that voltage and precision scaling applied tothe coprocessor enable computations in data-driven biomedicalmonitoring algorithms within an energy consumption rangeof 24.29–132.33 μJ for NSV = 10 000–100 000, respectively(for wavelet features). Similar experiments with morpholog-ical features demonstrate the full detection process within10.24–24.51 μJ for NSV = 10 000–100 000, respectively (seeFig. 19).

VI. CONCLUSION

Machine-learning-based algorithms for biomedical detectionare emerging as a highly promising means to detect specificstates in physiologically complex signals. The structure inthese algorithms can be exploited toward the design of a gen-eralizable low-energy computation platform. In the detectionalgorithms, kernel-based classification is found to pose theprimary energy bottleneck and is thus targeted for optimization

Fig. 19. Voltage scaling enabling the full detector computations at less than500 μW. Results are shown for morphological features (DSV = 26) with aquartic kernel at RCLASS = 3 beats/s.

through the use of hardware specialization. It is observed that,although feature-extraction computations can be implementedefficiently as custom instructions on a low-power processor,the energy reductions achievable through the use of custominstructions for classification are limited due to the largenumber of operands involved in the dot product computation.We thus explored opportunities for optimization through theuse of a hardware coprocessor. This specialization provides anapproach for hardware–software codesign expanding the scopeof the biomedical processor architecture to a broader rangeof applications. In the classification coprocessor, the fixedkernel computations required were exploited, but selectiveflexibility required across a range of applications was alsoincorporated through specific hardware configurability. Theoptimized coprocessor reduced the computational energy ofthe biomedical platform by over three orders of magnitudecompared to that of a low-power processor with custominstructions alone. Implementation of data-driven algorithmson a base Tensilica processor consumed about 100 mW for theentire computation. Thus, a wearable device (which runs on atypical 3 V, 560-mAh capacity coin-cell battery) employing aTensilica-like processor would have an average recharge cycleof 16.8 h. If the computational power for the entire processorcan be reduced by 2–3 orders of magnitude, continuous patientmonitoring can become more viable (with battery lifetimesextended to 2–24 months). This introduces great promise forhealthcare networks, as an increasingly wide range of real-time patient signal correlations are being discovered with newclinical states of interest.

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

SHOAIB et al.: DESIGN SPACE EXPLORATION OF DOMAIN-SPECIFIC MEDICAL-SENSOR PROCESSORS 13

REFERENCES

[1] E. Dishman, “Inventing wellness systems for aging in place,” IEEEComput., vol. 37, no. 5, pp. 34–41, May 2004.

[2] A. Csavoy, G. Molnar, and T. Denison, “Creating support circuits forthe nervous system: Considerations for brain-machine interfacing,” inProc. Int. Symp. VLSI Circuits Conf., Jun. 2009, pp. 4–7.

[3] D. Hau and E. Coiera, “Learning qualitative models from physiologicalsignals,” in Proc. AAAI Symp. Artif. Intell. Med. Conf., 1994, pp. 67–71.

[4] G. Meyfroidt, F. Guiza, J. Ramon, and M. Bruynooghe, “Machinelearning techniques to examine large patient databases,” Best Pract. Res.Clin. Anaesthesiol., vol. 23, no. 1, pp. 127–143, Mar. 2009.

[5] A. Shoeb, S. Schachter, D. Schomer, B. Bourgeois, S. T. Treves,and J. Guttag, “Detecting seizure onset in the ambulatory setting:Demonstrating feasibility,” in Proc. IEEE Int. Conf. Eng. Med. Biol.,Sep. 2005, pp. 3546–3550.

[6] A. Chandrakasan, N. Verma, and D. Daly, “Ultralow-power electronicsfor biomedical applications,” Annu. Rev. Biomed. Eng., vol. 4, pp. 247–274, Aug. 2008.

[7] J. Kwong and A. P. Chandrakasan, “An energy-efficient biomedicalsignal processing platform,” IEEE J. Solid-State Circuits, vol. 46, no. 7,pp. 1742–1753, Jul. 2011.

[8] H. Kim, R. F. Yazicioglu, T. Torfs, P. Merken, C. V. Hoof, and H.-J. Yoo,“An integrated circuit for wireless ambulatory arrhythmia monitoringsystems,” in Proc. IEEE Int. Conf. Eng. Med. Biol., Sep. 2009, pp.5409–5412.

[9] S. R. Sridhara, M. DiRenzo, S. Lingam, S.-J. Lee, R. Blazquez,J. Maxey, S. Ghanem, Y.-H. Lee, R. Abdallah, P. Singh, and M. Goel,“Microwatt embedded processor platform for medical system-on-chipapplications,” in Proc. Symp. VLSI Circuits Conf., Feb. 2010, pp.15–16.

[10] S. Cadambi, I. Durdanovic, V. Jakkula, M. Sankaradass, E. Cosatto,S. Chakradhar, and H. P. Graf, “A massively parallel FPGA-based coprocessor for support vector machines,” in Proc. Int. Symp.Field Program. Custom Comput. Mach. Conf., Apr. 2009, pp.115–122.

[11] T.-W. Chen, C.-S. Tang, S.-F. Tsai, C.-H. Tsai, S.-Y. Chien, and L.-G.Chen, “Tera-scale performance machine learning SoC (MLSoC) withdual stream processor architecture for multimedia content analysis,”IEEE J. Solid-State Circuits, vol. 45, no. 11, pp. 2321–2329, Nov. 2010.

[12] N. Verma, A. Shoeb, J. Guttag, and A. Chandrakasan, “A micro-powerEEG acquisition SoC with integrated seizure detection processor forcontinuous patient monitoring,” in Proc. Symp. VLSI Circuits Conf., Jun.2009, pp. 62–63.

[13] A. Shoeb, D. Carlson, E. Panken, and T. Denison, “A micropower sup-port vector machine based seizure detection architecture for embeddedmedical devices,” in Proc. IEEE Int. Conf. Eng. Med. Biol., Sep. 2009,pp. 4202–4205.

[14] B. Gyselinckx, R. Vullers, C. V. Hoof, J. Ryckaert, R. F. Yaziciogl,P. Fiorini, and V. Leonov, “Human++: Emerging technology for bodyarea networks,” in Proc. IEEE/IFIP Int. Conf. VLSI, Oct. 2006, pp. 175–180.

[15] Z. D. Nie, L. Wang, W. G. Chen, T. Zhang, and Y. T. Zhang, “A lowpower biomedical signal processor ASIC based on hardware softwarecodesign,” in Proc. IEEE Int. Conf. Eng. Med. Biol., Sep. 2009, pp.2559–2562.

[16] M. Shoaib, N. K. Jha, and N. Verma, “A low-energy computationplatform for data-driven biomedical monitoring algorithms,” in Proc.Design Autom. Conf., Jun. 2011, pp. 591–596.

[17] Physionet. (2012). MIT-BIH Physionet Database, Cape Town,South Africa [Online]. Available: http://www.physionet.org/physiobank/database

[18] J. Pan and W. J. Tompkins, “A real time QRS detection algorithm,”IEEE Trans. Biomed. Eng., vol. 32, no. 3, pp. 232–236, Mar. 1985.

[19] P. Laguna, N. V. Thakor, P. Caminal, R. Jane, H.-R. Yoon, A. B. D. Luna,V. Marti, and J. Guindo, “New algorithm for QT interval analysis in24-hour Holter ECG: Performance and applications,” Med. Biol. Eng.Comput., vol. 28, no. 1, pp. 67–73, Jan. 1990.

[20] P. Sajda, “Machine learning for detection and diagnosis of disease,”Annu. Rev. Biomed. Eng., vol. 8, pp. 537–565, Aug. 2006.

[21] U. Manne, R.-G. Srivastava, and S. Srivastava, “Keynote review: Recentadvances in biomarkers for cancer diagnosis and treatment,” Drug Dis.Today, vol. 10, no. 14, pp. 965–976, Jul. 2005.

[22] T. Sunderland, R. E. Gur, and S. E. Arnold, “The use of biomarkers inthe elderly: Current and future challenges,” Biol. Psych., vol. 58, no. 4,pp. 272–276, Aug. 2005.

[23] P. D. Chazal, M. O’Dwyer, and R. B. Reilly, “Automatic classificationof heartbeats using ECG morphology and heartbeat interval features,”IEEE Trans. Biomed. Eng., vol. 51, no. 7, pp. 1196–1206, Jul. 2004.

[24] T. H. Yeap, F. Johnson, and M. Rachniowski, “ECG beat classificationby a neural network,” in Proc. IEEE Int. Conf. Eng. Med. Biol. Soc.,Nov. 1990, pp. 1457–1458.

[25] O. Stanislaw and T. H. Linh, “ECG beat recognition using fuzzy hybridneural network,” IEEE Trans. Biomed. Eng., vol. 48, no. 11, pp. 1265–1271, Nov. 2001.

[26] L. Senhadji, G. Carrault, J. J. Bellanger, and G. Passariello, “Comparingwavelet transforms for recognizing cardiac patterns,” IEEE Eng. Med.Biol. Mag., vol. 14, no. 2, pp. 167–173, Mar.–Apr. 1995.

[27] A. S. Jaffe, L. Babuin, and F. S. Apple, “Biomarkers in acute cardiacdisease: The present and the future,” J. Amer. Coll. Cardiol., vol. 48,pp. 1–11, Jun. 2006.

[28] P. D. Chazal and R. B. Reilley, “A comparison of the ECG classificationperformance of different feature sets,” in Proc. IEEE Conf. Comput.Cardiol., Aug. 2000, pp. 327–330.

[29] E. D. Ubeyli, “ECG beats classification using multiclass support vectormachines with error correcting output codes,” Digit. Signal Process.,vol. 17, no. 3, pp. 675–684, May 2007.

[30] T. Joachims. (2010). SVM-Light, Support Vector Machine [Online].Available: http://svmlight/joachims.org

[31] O. Chapelle and V. Vapnik, “Model selection for support vectormachines,” in Advances in Neural Information Processing Systems.Cambridge, MA: MIT Press, 2000, pp. 230–236.

[32] R. Somorjai, M. Alexander, R. Baumgartner, S. Booth, C. Bowman,A. Demko, B. Dolenko, M. Mandelzweig, A. Nikulin, and N. Pizzi, “Adata-driven, flexible machine learning strategy for the classification ofbiomedical data,” Artif. Intell. Meth. Tools Syst. Biol., Comput. Biol.,vol. 5, pp. 67–85, Jan. 2004.

[33] K. H. Lee, S.-Y. Kung, and N. Verma, “Improving kernel-energy trade-offs for machine learning in implantable and wearable biomedicalapplications,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Conf.,May 2011, pp. 1597–1600.

[34] C. H. H. Tang, P. M. Middleton, A. V. Savkin, G. S. H. Chan, S. Bishop,and N. H. Lovell, “Non-invasive classification of severe sepsis andsystemic inflammatory response syndrome using a nonlinear supportvector machine: A preliminary study,” Physiol. Meas., vol. 31, no. 6,pp. 775–793, May 2010.

[35] A. H. Shoeb, “Application of machine learning to epileptic seizureonset detection and treatment,” Ph.D. thesis, Dept. Electr. Med. Eng.,Massachusetts Inst. Technol., Boston, Sep. 2009.

[36] Tensilica Inc. (2012). The Xtensa Processor, Santa Clara, CA [Online].Available: http://www.tensilica.com

[37] F. Sun, S. Ravi, A. Raghunathan, and N. K. Jha, “Custom-instructionsynthesis for extensible-processor platforms,” IEEE Trans. Comput.-Aided Design, vol. 23, no. 2, pp. 216–228, Feb. 2004.

[38] F. Sun, S. Ravi, A. Raghunathan, and N. K. Jha, “Application-specificheterogeneous multiprocessor synthesis using extensible processors,”IEEE Trans. Comput.-Aided Design, vol. 25, no. 9, pp. 1589–1602, Sep.2006.

[39] F. Sun, S. Ravi, A. Raghunathan, and N. K. Jha, “A scalable synthesismethodology for application-specific processors,” IEEE Trans. VeryLarge Scale Integr. (VLSI) Syst., vol. 14, no. 11, pp. 1175–1188, Nov.2006.

[40] F. Sun, S. Ravi, A. Raghunathan, and N. K. Jha, “A synthesis method-ology for hybrid custom instruction and coprocessor generation forextensible processors,” IEEE Trans. Comput.-Aided Design, vol. 26,no. 11, pp. 2035–2045, Nov. 2007.

[41] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, P. Dubey, S. Junkins,A. Lake, R. Cavin, R. Espasa, E. Grochowski, T. Juan, M. Abrash,J. Sugerman, and P. Hanrahan, “Larrabee: A many-core ×86 architecturefor visual computing,” IEEE Micro, vol. 29, no. 1, pp. 10–21, Jan. 2009.

[42] F.-M. Schleif, M. Lindemann, M. Diaz, P. Maab, J. Decker, T. Elssner,M. Kuhn, and H. Thiele, “Support vector classification of proteomicprofile spectra based on feature extraction with the bi-orthogonal discretewavelet transform,” Comput. Vis. Sci., vol. 4, no. 12, pp. 189–199,Dec. 2007.

[43] N. Krupa, M. A. Mohammed, E. Zahedi, S. Ahmed, and F. M. Hassan,“Antepartum fetal heart rate feature extraction and classification usingemperical mode decomposition and support vector machine,” Biomed.Eng. Online, vol. 10, pp. 1–15, Jan. 2011.

[44] A.-T. Avestruz, W. Santa, D. Carlson, R. Jensen, S. Stanslaski,A. Helfenstine, and T. Denison, “A 5 μW/channel spectral analysis ICfor chronic bidirectional brain-machine interfaces,” IEEE J. Solid-StateCircuits, vol. 43, no. 12, pp. 3006–3024, Dec. 2008.

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

14 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

[45] W. B. Mendelson, D. A. Sack, S. P. James, J. V. Martin, R. Wagner,D. Garnett, J. Milton, and T. A. Wehr, “Frequency analysis of the sleepEEG in depression,” Psych. Res., vol. 21, no. 2, pp. 89–94, 2007.

[46] A. Wang and A. P. Chandrakasan, “A 180-mV subthreshold FFTprocessor using a minimum energy design methodology,” IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 310–319, Jan. 2005.

[47] V. Sze and A. P. Chandrakasan, “A 0.4 V UWB baseband processor,” inProc. IEEE Int. Symp. Low Power Electron. Design Conf., Aug. 2007,pp. 262–267.

[48] I. S. Abu-Khater, A. Bellaouar, and M. I. Elmasry, “Circuit techniquesfor CMOS low-power high-performance multipliers,” IEEE J. Solid-State Circuits, vol. 31, no. 10, pp. 1535–1546, Oct. 1996.

[49] S. A. Vitale, P. W. Wyatt, N. Checka, J. Kedzierski, and C. L. Keast,“FD-SOI process technology for subthreshold-operation ultralow-powerelectronics,” Proc. IEEE, vol. 98, no. 2, pp. 333–342, Feb. 2010.

Mohammed Shoaib (S’08) received the B.Tech.and M.Tech. degrees in electrical engineering with aspecialization in microelectronics and VLSI designfrom the Indian Institute of Technology (IIT)Madras, Chennai, India, in 2007 and 2008, respec-tively, and the M.A. degree in electrical engineeringfrom Princeton University, Princeton, NJ, in 2010,where he is currently pursuing the Ph.D. degree inelectrical engineering.

His current research interests include data-drivenplatforms for intelligent biomedical systems, with an

emphasis on low-energy circuits and signal processing. He has authored orco-authored more than 20 technical papers, and he holds 4 U.S. patents.

Dr. Shoaib was a recipient of the IIT Madras MCM Scholarship from 2004to 2007, the Ph.D. Fellowship and an Excellence in Electrical EngineeringAward from Princeton University in 2008 and 2011, respectively, the Roberto-Padovani Scholarship in 2011, the Harold W. Dodds (Princeton UniversityHonorific) Fellowship in 2012, and the Princeton University Gordon Wu Prizefor Excellence in 2012. He was a co-recipient of the Qualcomm InnovationPh.D. Fellowship in 2011, a nominee for the Best Paper Award at Healthcommin 2011. He is a fellow of the McGraw Center for Teaching and Learningand has had industrial experience through internships with research groupsat IBM Zurich Research Laboratory, Ricoh California Research Center, andQualcomm Corporate Research and Development, CA.

Niraj K. Jha (S’85–M’85–SM’93–F’98) receivedthe B.Tech. degree in electronics and electrical com-munication engineering from the Indian Institute ofTechnology Kharagpur, Kharagpur, India, the M.S.degree in electrical engineering from the State Uni-versity of New York at Stony Brook, Stony Brook,and the Ph.D. degree in electrical engineering fromthe University of Illinois, Urbana, in 1981, 1982, and1985, respectively.

He is currently a Professor of electrical engineer-ing with Princeton University, Princeton, NJ. He

has authored or co-authored more than 360 technical papers, and authored

12 book chapters. He has co-authored or co-edited five books entitled Testingand Reliable Design of CMOS Circuits (Kluwer, 1990), High-Level PowerAnalysis and Optimization (Kluwer, 1998), Testing of Digital Systems (Cam-bridge University Press, 2003), Switching and Finite Automata Theory, 3rdedn. (Cambridge University Press, 2009), and Nanoelectronic Circuit Design(Springer, 2010). He holds 13 U.S. patents. His research interests includeFinFETs, low-power hardware and software design, computer-aided design ofintegrated circuits and systems, secure computing, quantum computing, andenergy-efficient buildings.

Dr. Jha is a recipient of the AT&T Foundation Award, the NEC Preceptor-ship Award for Research Excellence, the NCR Award for Teaching Excellence,and the Princeton University Graduate Mentoring Award. He is a fellow ofACM. He was the Editor-in-Chief of the IEEE TRANSACTIONS ON VLSISYSTEMS and an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS

AND SYSTEMS I AND II, the IEEE TRANSACTIONS ON COMPUTER-AIDED-DESIGN, the IEEE TRANSACTIONS ON VLSI SYSTEMS, and the Journalof Electronic Testing: Theory and Applications. He is currently an AssociateEditor of the IEEE TRANSACTIONS ON COMPUTERS, the Journal of LowPower Electronics, and the Journal of Nanotechnology. He was the ProgramChairman of the Workshop on Fault-Tolerant Parallel and Distributed Sys-tems in 1992, the International Conference on Embedded and UbiquitousComputing in 2004, and the International Conference on VLSI Design in2010. He was the Director of the Center for Embedded System-on-a-ChipDesign funded by the New Jersey Commission on Science and Technology.He has co-authored 14 papers, which have won various awards, including theBest Paper Award at ICCD’93, FTCS’97, ICVLSID’98, DAC’99, PDCS’02,ICVLSID’03, CODES’06, ICCD’09, and CLOUD’10. A paper of his wasselected for “The Best of ICCAD: A collection of the best IEEE InternationalConference on Computer-Aided Design papers of the past 20 years,” twopapers by the IEEE Micro Magazine as one of the top picks from the ComputerArchitecture conferences in 2005 and 2007, and two others as being amongthe most influential papers of the last 10 years at the IEEE Design Automationand Test Conference, in Europe. He has co-authored another six papers thatare nominated for Best Paper or Most Influential Paper Awards. He has givenseveral keynote speeches in the area of nanoelectronic design and test.

Naveen Verma (M’05) received the B.A.Sc. degreein electrical and computer engineering from theUniversity of British Columbia, Vancouver, Canada,in 2003, and the M.S. and Ph.D. degrees in electri-cal engineering from the Massachusetts Institute ofTechnology, Cambridge, in 2005 and 2009, respec-tively.

He has been an Assistant Professor of electricalengineering with Princeton University, Princeton,NJ, since 2009. His research focuses on ultralow-power integrated circuits, including low-voltage dig-

ital logic and SRAMs, low-noise analog instrumentation and data conversion,and energy-efficient processing algorithms especially for biomedical applica-tions.

Prof. Verma was a co-recipient of the ISSCC Jack Kilby Award forOutstanding Student Paper in 2008 and the DAC/ISSCC Student DesignContest Award in 2006.