
INTEGRATION, the VLSI journal 46 (2013) 89–103

Contents lists available at SciVerse ScienceDirect

INTEGRATION, the VLSI journal

0167-9260/$ - see front matter &amp; 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.vlsi.2012.04.002

ⁿ Corresponding author.

E-mail addresses: [email protected] (C. González), [email protected] (S. Sánchez), [email protected] (A. Paz), [email protected] (J. Resano), [email protected] (D. Mozos), [email protected] (A. Plaza).

journal homepage: www.elsevier.com/locate/vlsi

Use of FPGA or GPU-based architectures for remotely sensed hyperspectral image processing

Carlos González a,ⁿ, Sergio Sánchez b, Abel Paz b, Javier Resano c, Daniel Mozos a, Antonio Plaza b

a Department of Computer Architecture and Automatics, Computer Science Faculty, Complutense University of Madrid, 28040 Madrid, Spain
b Hyperspectral Computing Laboratory, Department of Technology of Computers and Communications, University of Extremadura, 10003 Cáceres, Spain
c Department of Computer and Systems Engineering (DIIS), Engineering Research Institute of Aragon (I3A), University of Zaragoza, 50018 Zaragoza, Spain

article info

Article history:

Received 22 January 2011

Received in revised form 5 April 2012

Accepted 5 April 2012

Available online 19 April 2012

Keywords:

Hyperspectral imaging

Hardware accelerators

FPGAs

GPUs

Application development experience


abstract

Hyperspectral imaging is a growing area in remote sensing in which an imaging spectrometer collects hundreds of images (at different wavelength channels) for the same area on the surface of the Earth. Hyperspectral images are extremely high-dimensional, and require advanced on-board processing algorithms able to satisfy near real-time constraints in applications such as wildland fire monitoring, mapping of oil spills and chemical contamination, etc. One of the most widely used techniques for analyzing hyperspectral images is spectral unmixing, which allows for sub-pixel data characterization. This is particularly important since the available spatial resolution in hyperspectral images is typically of several meters, and therefore it is reasonable to assume that several spectrally pure substances (called endmembers in hyperspectral imaging terminology) can be found within each imaged pixel. In this paper we explore the role of hardware accelerators in hyperspectral remote sensing missions and further inter-compare two types of solutions: field programmable gate arrays (FPGAs) and graphics processing units (GPUs). A full spectral unmixing chain is implemented and tested in this work, using both types of accelerators, in the context of a real hyperspectral mapping application using hyperspectral data collected by NASA's Airborne Visible Infra-Red Imaging Spectrometer (AVIRIS). The paper provides a thoughtful perspective on the potential and emerging challenges of applying these types of accelerators in hyperspectral remote sensing missions, indicating that the reconfigurability of FPGA systems (on the one hand) and the low cost of GPU systems (on the other) open many innovative perspectives toward fast on-board and on-the-ground processing of remotely sensed hyperspectral images.

& 2012 Elsevier B.V. All rights reserved.

1. Introduction

Hyperspectral imaging is concerned with the measurement, analysis, and interpretation of spectra acquired from a given scene (or specific object) at a short, medium or long distance by an airborne or satellite sensor [1]. The wealth of spectral information available from latest-generation hyperspectral imaging instruments, which have substantially increased their spatial, spectral and temporal resolutions, has quickly introduced new challenges in the analysis and interpretation of hyperspectral data sets. For instance, the NASA Jet Propulsion Laboratory's Airborne Visible Infra-Red Imaging Spectrometer (AVIRIS) [2] is now able to record the visible and near-infrared spectrum (wavelength region from 0.4 to 2.5 μm) of the reflected light of an area 2–12 km wide and several kilometers long using 224 spectral bands. The resulting data cube (see Fig. 1) is a stack of images in which each pixel (vector) has an associated spectral signature or 'fingerprint' that uniquely characterizes the underlying objects, and the resulting data volume typically comprises several gigabytes per flight. This often leads to the requirement of hardware accelerators to speed up computations, in particular in analysis scenarios with real-time constraints in which on-board processing is generally required [3]. It is expected that, in future years, hyperspectral sensors will continue increasing their spatial, spectral and temporal resolutions (images with thousands of spectral bands are currently in operation or under development). Such wealth of information has opened groundbreaking perspectives in several applications [4] (many of which with real-time processing requirements) such as environmental modeling and assessment for Earth-based and atmospheric studies, risk/hazard prevention and response including wild land fire tracking, biological threat detection, monitoring of oil spills and other types


Fig. 1. An illustration of the processing demands introduced by the ever increasing dimensionality of remotely sensed hyperspectral imaging instruments.

C. González et al. / INTEGRATION, the VLSI journal 46 (2013) 89–103

of chemical contamination, target detection for military and defense/security purposes, urban planning and management studies, etc. [5].

Even though hyperspectral image processing algorithms generally map quite nicely to parallel systems such as clusters or networks of computers [6,7], these systems are generally expensive and difficult to adapt to on-board data processing scenarios, in which low-weight and low-power integrated components are essential to reduce mission payload and obtain analysis results in real-time, i.e. at the same time as the data is collected by the sensor [3]. Enabling on-board data processing introduces many advantages, such as the possibility to reduce the data down-link bandwidth requirements by both pre-processing data and selecting data to be transmitted based upon some predetermined content-based criteria [8]. In this regard, an exciting new development in the field of commodity computing is the emergence of programmable hardware accelerators such as field programmable gate arrays (FPGAs) [9] and graphic processing units (GPUs) [10], which can bridge the gap toward on-board and real-time analysis of hyperspectral data [8,11].

The appealing perspectives introduced by hardware accelerators such as FPGAs (on-the-fly reconfigurability [12] and software-hardware co-design [13]) and GPUs (very high performance at low cost [14]) also introduce significant advantages with regards to more traditional cluster-based systems. First and foremost, a cluster occupies much more space than an FPGA or a GPU. This aspect significantly limits the exploitation of cluster-based systems in on-board processing scenarios, in which the weight (and the power consumption) of processing hardware must be limited in order to satisfy mission payload requirements [3]. On the other hand, the maintenance of a large cluster represents a major investment in terms of time and finance. Although a cluster is a relatively inexpensive parallel architecture, the cost of maintaining a cluster can increase significantly with the number of nodes [6]. Quite the opposite, FPGAs and GPUs are characterized by their low weight and size, and by their capacity to provide similar computing performance at lower costs in the context of hyperspectral imaging applications [11,12,14–18]. In addition, FPGAs offer the appealing possibility of adaptively selecting a hyperspectral processing algorithm to be applied (out of a pool of available algorithms) from a control station on Earth. This feature is possible thanks to the inherent re-configurability of FPGA devices [9], which are generally more expensive than GPU devices [14]. In this regard, the adaptivity of FPGA systems for on-board operation, as well as the low cost and portability of GPU systems, open innovative perspectives.

In this paper, we discuss the role of FPGAs and GPUs in the task of accelerating hyperspectral imaging computations. A full spectral unmixing chain is used as a case study throughout the paper and implemented using both types of accelerators in a real hyperspectral application (extraction of geological features at the Cuprite mining district in Nevada, USA) using hyperspectral data collected by AVIRIS. The remainder of the paper is organized as follows. Section 2 describes a hyperspectral processing chain based on spectral unmixing, a widely used technique to analyze hyperspectral data with sub-pixel precision. Section 3 describes an FPGA implementation of the considered chain. Section 4 describes a GPU implementation of the same chain. Section 5 provides an experimental comparison of the proposed parallel implementations using AVIRIS hyperspectral data. Finally, Section 6 concludes with some remarks and hints at plausible future research lines.

2. Hyperspectral unmixing chain

In this section, we present spectral unmixing [19] as a hyperspectral image processing case study. No matter the spatial resolution, the spectral signatures collected in natural environments are invariably a mixture of the signatures of the various materials found within the spatial extent of the ground instantaneous field of view of the imaging instrument [20]. The availability of hyperspectral instruments with a number of spectral bands that exceeds the number of spectral mixture components allows us to approach this problem as follows. Given a set of spectral vectors acquired from a given area, spectral unmixing aims at inferring the pure spectral signatures, called endmembers [21,22], and the material fractions, called fractional abundances [23], at each pixel. Let us assume that a


Fig. 2. Hyperspectral unmixing chain.

Fig. 3. Toy example illustrating the PPI endmember extraction algorithm in a two-dimensional space.


hyperspectral scene with N bands is denoted by F, in which a pixel of the scene is represented by a vector $f_i = [f_{i1}, f_{i2}, \ldots, f_{iN}] \in \mathbb{R}^N$, where $\mathbb{R}$ denotes the set of real numbers in which the pixel's spectral response $f_{ik}$ at sensor wavelengths $k = 1, \ldots, N$ is included. Under the linear mixture model assumption [19], each pixel vector can be modeled using the following expression [24]:

$$f_i = \sum_{j=1}^{p} e_j \cdot \Phi_j + n, \qquad (1)$$

where $e_j = [e_{j1}, e_{j2}, \ldots, e_{jN}]$ denotes the spectral response of an endmember, $\Phi_j$ is a scalar value designating the fractional abundance of the endmember $e_j$, p is the total number of endmembers, and n is a noise vector. The solution of the linear spectral mixture problem described in (1) relies on the correct determination of a set $\{e_j\}_{j=1}^{p}$ of endmembers and their abundance fractions $\{\Phi_j\}_{j=1}^{p}$ at each pixel $f_i$.

In this work we have considered a standard hyperspectral unmixing chain which is available in commercial software packages such as ITTVis Environment for Visualizing Images (ENVI).1 The unmixing chain is graphically illustrated by a flowchart in Fig. 2 and consists of two main steps: (1) endmember extraction, implemented in this work using the pixel purity index (PPI) algorithm [25], and (2) non-negative abundance estimation, implemented in this work using the image space reconstruction algorithm (ISRA), a technique for solving linear inverse problems with positive constraints [26]. It should be noted that an alternative approach to the hyperspectral unmixing chain described in Fig. 2 is based on including a dimensionality reduction step prior to the analysis. However, this step is mainly intended to reduce processing time but often discards relevant information in the spectral domain. As a result, in our implementation we do not include the dimensionality reduction step in order to work with the full spectral information available in the hyperspectral data cube. With the aforementioned observations in mind, we describe below the two steps of the considered hyperspectral unmixing chain.
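As a point of reference for the discussion that follows, the linear mixture model of Eq. (1) can be sketched in a few lines of plain Python. This is an illustrative toy, not part of the paper's implementation; the endmember signatures and abundance values below are made up.

```python
import random

def mix_pixel(endmembers, abundances, noise_sigma=0.0):
    """Synthesize one pixel under the linear mixture model of Eq. (1):
    f_i = sum_j e_j * Phi_j + n (illustrative sketch only)."""
    n_bands = len(endmembers[0])
    pixel = [0.0] * n_bands
    for e_j, phi_j in zip(endmembers, abundances):
        for k in range(n_bands):
            pixel[k] += e_j[k] * phi_j          # accumulate e_j * Phi_j per band
    if noise_sigma > 0:
        # optional additive noise vector n
        pixel = [v + random.gauss(0.0, noise_sigma) for v in pixel]
    return pixel

# Two toy 4-band endmembers mixed 70/30, noise-free:
e1, e2 = [1.0, 0.8, 0.2, 0.1], [0.1, 0.3, 0.9, 1.0]
f = mix_pixel([e1, e2], [0.7, 0.3])
```

The unmixing problem discussed in the remainder of this section is the inverse task: given pixels like `f`, recover the endmembers and the per-pixel abundances.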

1 http://www.ittvis.com.

2.1. Endmember extraction

In this stage we use the PPI algorithm, a popular approach which calculates a spectral purity score for each n-dimensional pixel in the original data by generating random unit vectors (called skewers), so that all pixel vectors are projected onto the skewers and the ones falling at the extremes of each skewer are counted. After many repeated projections onto different skewers, those pixels that count above a certain cut-off threshold are declared "pure" (see Fig. 3).

The inputs to the PPI algorithm in Fig. 3 are a hyperspectral image cube F with N spectral bands; a maximum number of projections, K; a cut-off threshold value, $v_c$, used to select as final endmembers only those pixels that have been selected as extreme pixels at least $v_c$ times throughout the process; and a threshold angle, $v_a$, used to discard redundant endmembers during the



process. The output is a set of p endmembers $\{e_j\}_{j=1}^{p}$. The algorithm can be summarized by the following steps:

1. Skewer generation. Produce a set of K randomly generated unit vectors, denoted by $\{skewer_j\}_{j=1}^{K}$.

2. Extreme projections. For each $skewer_j$, all sample pixel vectors $f_i$ in the original data set F are projected onto $skewer_j$ via dot products $f_i \cdot skewer_j$ to find sample vectors at its extreme (maximum and minimum) projections, forming an extrema set for $skewer_j$ which is denoted by $S_{extrema}(skewer_j)$.

3. Calculation of pixel purity scores. Define an indicator function of a set S, denoted by $I_S(f_i)$, to denote membership of an element $f_i$ to that particular set as $I_S(f_i) = 1$ if $f_i \in S$, else 0. Using the function above, calculate the number of times that a given pixel has been selected as extreme using the following equation:

$$N_{PPI}(f_i) = \sum_{j=1}^{K} I_{S_{extrema}(skewer_j)}(f_i). \qquad (2)$$

4. Endmember selection. Find the pixels with value of $N_{PPI}(f_i)$ above $v_c$ and form a unique set of p endmembers $\{e_j\}_{j=1}^{p}$ by calculating the spectral angle (SA) [20,27] for all possible endmember pairs and discarding those which result in an angle value below $v_a$. The SA is invariant to multiplicative scalings that may arise due to differences in illumination and sensor observation angle [24].
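The four steps above can be condensed into a small software sketch (pure Python with hypothetical helper names; step 4's spectral-angle pruning is omitted for brevity, so the function returns candidate endmember indices only):

```python
import math
import random

def ppi(pixels, K, v_c, seed=0):
    """Sketch of the PPI steps above: project every pixel onto K random
    unit skewers, count extreme hits (Eq. (2)), and keep pixels whose
    purity score exceeds the cut-off v_c. Illustrative only."""
    rng = random.Random(seed)
    n_bands = len(pixels[0])
    scores = [0] * len(pixels)                       # N_PPI(f_i)
    for _ in range(K):
        # Step 1: random unit skewer
        skewer = [rng.gauss(0.0, 1.0) for _ in range(n_bands)]
        norm = math.sqrt(sum(s * s for s in skewer))
        skewer = [s / norm for s in skewer]
        # Step 2: project every pixel onto the skewer
        projections = [sum(f_k * s_k for f_k, s_k in zip(f, skewer))
                       for f in pixels]
        # Step 3: the max and min projections count as extreme hits
        scores[projections.index(max(projections))] += 1
        scores[projections.index(min(projections))] += 1
    # Step 4 (partial): threshold on the purity score
    return [i for i, c in enumerate(scores) if c > v_c]

# Toy 2-band scene: two pure "corner" pixels and one 50/50 mixture.
pixels = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
pure = ppi(pixels, K=50, v_c=5)
```

In this toy scene the mixed pixel always projects strictly between the two pure pixels, so only the pure ones accumulate extreme counts, matching the intuition of Fig. 3.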

The most time-consuming stage of the PPI algorithm is step 2 (extreme projections). For example, running step 2 on a hyperspectral image with 614 × 512 pixels (the standard number of pixels produced by NASA's AVIRIS instrument in a single frame, each with N = 224 spectral bands) using K = 400 skewers (a configuration which empirically provides good results based on extensive experimentation reported in previous work [21]) requires the calculation of more than 2 × 10^11 multiplication/accumulation (MAC) operations, i.e. a few hours of non-stop computation on a 500 MHz microprocessor with 256 MB SDRAM [6]. In [12], another example is reported in which step 2 of the PPI algorithm, as implemented in ENVI version 4.5, took more than 50 min of computation to project every data sample vector of a hyperspectral image of the same size reported above onto K = 10^4 skewers in a PC with an AMD Athlon 2.6 GHz processor and 512 MB of RAM. Fortunately, the PPI algorithm is well suited for parallel implementation. In [28,29], two parallel architectures for implementation of the PPI are proposed. Both are based on a 2-dimensional processor array tightly connected to a few memory banks. A speedup of 80 is obtained through an FPGA implementation on the Wildforce board (4 Xilinx XC4036EX plus 4 memory banks of 512 KB) [30]. More recent work presented in [31,13] uses the concept of blocks of skewers (BOSs) to generate the skewers more efficiently. Although these works have demonstrated the efficiency of a hardware implementation on reconfigurable FPGA boards, they rely on a different strategy for generating the skewers and, hence, we do not use them for comparisons at this point. To the best of our knowledge, no GPU implementations of the PPI algorithm are reported in the literature.

Fig. 4. (a) Basic unit. (b) Parallelization strategy by pixels. (c) Parallelization strategy by skewers. (d) Parallelization strategy by skewers and pixels.

2.2. Abundance estimation

Once a set of p endmembers $E = \{e_j\}_{j=1}^{p}$ have been identified (note that E can also be seen as an $n \times p$ matrix, where n is the number of spectral bands and p is the number of endmembers), a positively constrained abundance estimation, i.e. $\Phi_j \geq 0$ for $1 \leq j \leq p$, can be obtained using ISRA [26,32], a multiplicative algorithm based on the following iterative expression:

$$\Phi_j^{k+1} = \Phi_j^k \left( \frac{\sum_{l=1}^{n} (e_{jl} \cdot f_{il})}{\sum_{l=1}^{n} (e_{jl} \cdot e_{jl}) \cdot \Phi_j^k} \right), \qquad (3)$$

where the endmember abundances at pixel $f_i = [f_{i1}, f_{i2}, \ldots, f_{in}]$ are iteratively estimated, so that the abundances at the (k+1)-th iteration for a given endmember $e_j$, $\Phi_j^{k+1}$, depend on the abundances estimated for the same endmember at the k-th iteration, $\Phi_j^k$. The procedure starts with an initial estimation (e.g. equal abundance for all endmembers in the pixel) which is progressively refined in a given number of iterations. It is important to emphasize that the calculations of the fractional abundances for each pixel are independent, so they can be calculated simultaneously without data dependencies, thus increasing the possibility of parallelization.
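A minimal software rendering of the update in Eq. (3) may help fix ideas (illustrative only; plain Python lists and made-up toy endmembers, not the paper's implementation):

```python
def isra_update(E, f, phi):
    """One ISRA iteration per Eq. (3), for a single pixel.
    E is a list of endmember signatures, f the pixel vector,
    phi the current abundance estimates Phi_j^k (sketch only)."""
    new_phi = []
    for e_j, phi_j in zip(E, phi):
        num = sum(e * v for e, v in zip(e_j, f))   # sum_l e_jl * f_il
        den = sum(e * e for e in e_j) * phi_j      # sum_l e_jl^2 * Phi_j^k
        new_phi.append(phi_j * (num / den))        # multiplicative update
    return new_phi

# Pixel mixed from two orthogonal toy endmembers with abundances 0.7 / 0.3:
e1, e2 = [1.0, 0.0], [0.0, 1.0]
f = [0.7, 0.3]
phi = [0.5, 0.5]            # equal initial estimate, as suggested in the text
for _ in range(10):
    phi = isra_update([e1, e2], f, phi)
```

Note that with the per-endmember denominator exactly as printed in Eq. (3), each abundance settles at the projection $\sum_l e_{jl} f_{il} / \sum_l e_{jl}^2$ independently of the others; the multiplicative form nevertheless preserves the positivity constraint $\Phi_j \geq 0$ at every iteration.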

3. FPGA implementation

In this section we present an FPGA implementation for the endmember extraction and the abundance estimation parts of the spectral unmixing chain described in Section 2. The proposed architecture specification can be easily adapted to different platforms. Also, our architecture is scalable depending on the amount of available resources.

3.1. FPGA implementation of endmember extraction

The most time-consuming stage of the PPI algorithm (extreme projections) computes a very large number of dot products, all of which can be performed simultaneously. At this point, it is important to recall that the dot product unit in our FPGA implementation is intended to compute the dot product between a pixel vector $f_i$ and a $skewer_j$. If we consider a simple basic unit such as the one displayed in Fig. 4(a) as the baseline for parallel computations, then we can perform the parallel computations by pixels [see Fig. 4(b)], by skewers [see Fig. 4(c)], or by pixels and skewers [see Fig. 4(d)]. The max/min units are intended to calculate the maximum on the pixel vector $f_i$ for a given $skewer_j$.



The sequential design of the basic units ensures that the increase in the number of parallel computations (in any considered parallel scenario) does not increase the critical path; therefore the clock cycle remains constant. Furthermore, if we increased the number of parallel computations, the required area would grow proportionally with the number of basic units (again, in any considered parallel scenario). In parallelization by skewers [see Fig. 4(c)], once we have calculated the projections of all hyperspectral pixels with a skewer, we know the maximum and the minimum for this skewer. However, in the parallelization by pixels [see Fig. 4(b)] or by skewers and pixels [see Fig. 4(d)], a few extra clock cycles are required to evaluate the maximum and minimum of a skewer using a cascade computation.

Bearing in mind the above rationale, in this work we have selected the parallelization strategy based on skewers. Apart from the aforementioned advantages with regards to other possible strategies, the main reason for our selection is that the parallelization strategy based on skewers fits very well the procedure for data collection (in a pixel-by-pixel fashion) at the imaging instrument. Therefore, parallelization by skewers is the one that best fits the data entry mechanism, since each pixel can be processed immediately as collected. Specifically, our hardware system should be able to compute K dot products against the same pixel $f_i$ at the same time, K being the number of skewers.
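In software terms, the skewer-parallel strategy amounts to updating all K running dot products (and the per-skewer min/max records) as each pixel arrives. This can be sketched as follows (illustrative pure Python with made-up data; in the actual hardware the K updates happen concurrently rather than in a loop):

```python
def stream_ppi_extremes(pixel_stream, skewers):
    """Sketch of the skewer-parallel strategy: as each pixel arrives
    (pixel-by-pixel, as from the sensor), all K dot products are updated
    and the per-skewer min/max projections and pixel indices are tracked."""
    K = len(skewers)
    best = [(float("-inf"), -1)] * K      # (max projection, pixel index)
    worst = [(float("inf"), -1)] * K      # (min projection, pixel index)
    for i, f in enumerate(pixel_stream):
        for j, s in enumerate(skewers):
            p = sum(a * b for a, b in zip(f, s))   # one dot product per skewer
            if p > best[j][0]:
                best[j] = (p, i)
            if p < worst[j][0]:
                worst[j] = (p, i)
    return best, worst

# Toy 2-band scene streamed against two axis-aligned toy skewers:
pixels = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
skewers = [[1.0, 0.0], [0.0, 1.0]]
best, worst = stream_ppi_extremes(pixels, skewers)
```

Each pixel is consumed once and never revisited, which is exactly why this strategy matches the pixel-by-pixel data entry mechanism described above.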

Fig. 5 shows the architecture of the hardware used to implement the PPI algorithm, along with the I/O communications. For data input, we use a DDR2 SDRAM and a DMA (controlled by a PowerPC) with a FIFO to store pixel data. For data output, we use a PowerPC to write the position of the endmembers in the DDR2 SDRAM. Finally, a systolic array, a random generation module and an occurrences memory are also used. For illustrative purposes, Fig. 6 describes the architecture of the dot product processors

Fig. 5. Hardware architecture to implement the endmember extraction step.

Fig. 6. Hardware architecture of a dot product processor.

used in our systolic array. Basically, a systolic cycle consists of the computation of a single dot product between a pixel and a skewer, and memorization of the index of the pixel if the dot product is higher or smaller than a previously computed max/min value. It has been shown in previous work [28,29] that the skewer values can be limited to a very small set of integers when their dimensionality is large, as in the case of hyperspectral images. A particular and interesting set is {1, −1}, since it avoids the multiplication. The dot product is thus reduced to an accumulation of positive and negative values. As a result, each dot product processor only needs to accumulate the positive or negative values of the pixel input according to the skewer input. These units are thus only composed of a single addition/subtraction operator and a register. The min/max unit receives the result of the dot product and compares it with the previous minimum and maximum values. If the result is a new minimum or maximum, it will be stored for future comparisons.
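The multiplier-free dot product can be illustrated as follows (a software sketch of the accumulate-only behaviour described above; the pixel values and skewer bits are made up):

```python
def dot_with_sign_skewer(pixel, sign_bits):
    """Dot product against a {1, -1} skewer, as in the dot product
    processors: no multiplications, just accumulate +v or -v per band
    according to the skewer bit (sketch of the hardware behaviour)."""
    acc = 0.0
    for v, bit in zip(pixel, sign_bits):
        acc += v if bit else -v    # bit == 1 -> +1 component, bit == 0 -> -1
    return acc

# 4-band toy pixel against skewer (+1, -1, +1, -1):
p = dot_with_sign_skewer([3.0, 1.0, 2.0, 0.5], [1, 0, 1, 0])
```

This is why each processor reduces to a single addition/subtraction operator and an accumulator register: the skewer bit simply selects the sign of the incoming band value.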

On the other hand, the incorporation of a hardware-based random generation module is one of the main features of our system. This module significantly reduces the I/O communications that were the main bottleneck of the system in the previous implementations [15,28,29]. It should be noted that, in a digital system, it is not possible to generate 100% random numbers. In our design, we have implemented a random generator module which provides pseudo-random and uniformly distributed sequences using registers and XOR gates. It requires an affordable amount of space (178 slices for 100 skewers), is able to generate the next component of every skewer in only one clock cycle, and operates at a high clock frequency (328 MHz).
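A registers-plus-XOR-gates generator of this kind behaves like a linear feedback shift register (LFSR). A software model of a generic Fibonacci LFSR is sketched below; the 16-bit width, seed and tap positions are illustrative assumptions, not the paper's actual design:

```python
def lfsr_bits(seed, taps, n):
    """Sketch of a register + XOR pseudo-random bit generator (a Fibonacci
    LFSR over a 16-bit state). Emits one bit per simulated clock cycle,
    mirroring the one-skewer-component-per-cycle behaviour described above."""
    state = seed
    out = []
    for _ in range(n):
        out.append(state & 1)              # output bit = LSB of the register
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1         # XOR of the tapped register bits
        state = (state >> 1) | (fb << 15)  # shift right, feed back at the top
    return out

# 64 pseudo-random skewer bits from an arbitrary non-zero seed:
bits = lfsr_bits(seed=0xACE1, taps=(0, 2, 3, 5), n=64)
```

Producing one bit per skewer per clock, as the hardware module does, would simply replicate this structure once per skewer with different seeds.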

To calculate the number of times each pixel has been selected as extreme (step 3 of the algorithm), we use the occurrences memory, which is initialized to zero. Once we have calculated the minimum and maximum value for each of the skewers, we update the number of occurrences by reading the previous values stored for the extremes in the occurrences memory and then writing these values increased by one. When this step is completed, the PowerPC reads the total number of occurrences for each pixel. If this number exceeds the threshold value $v_c$, the pixel is selected as an endmember. After that, the PowerPC calculates the SA for all possible endmember pairs and discards those which result in an angle value below $v_a$ (step 4 of the algorithm). Finally, the PowerPC writes the non-redundant endmember positions in the DDR2 SDRAM.
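The spectral angle test the PowerPC applies to discard redundant endmembers can be written compactly (an illustrative sketch; the SA is the arccosine of the normalized dot product of the two signatures, following [20,27]):

```python
import math

def spectral_angle(a, b):
    """Spectral angle (SA) between two signatures, used to discard
    redundant endmembers: arccos(a.b / (|a| |b|)). Illustrative sketch."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    # clamp against floating-point drift just outside [-1, 1]
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))

# SA is invariant to multiplicative scalings: a and 3*a have angle ~0.
a = [0.2, 0.5, 0.9]
same = spectral_angle(a, [3 * x for x in a])
diff = spectral_angle([1.0, 0.0], [0.0, 1.0])
```

Because the SA is scale-invariant, two endmembers that differ only by illumination are flagged as redundant even though their raw radiance values differ.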

In order to make a flexible description of the design, most parts of it are written using generic parameters that allow us to instantiate the design following the characteristics of the FPGA without the need to make any change in the code. We only have to set the appropriate value for each design parameter (e.g. the number of parallel units). This reconfigurability property is one of the most important advantages of FPGAs (dynamic and flexible design) over application-specific integrated circuits (ASICs), characterized by their static design. Moreover, in our FPGA-based architecture we first read all configuration parameters from the DDR2 SDRAM. With this strategy we provide our design with the flexibility needed to adapt (at run-time) to different hyperspectral scenes, different acquisition conditions, and even different hyperspectral sensors.

To conclude this section, we provide a step-by-step description of how the proposed architecture performs the extraction of a set of endmembers from a hyperspectral image:

• Firstly, we read from the DDR2 SDRAM all configuration parameters, i.e. the number of skewers K, the number of pixels and bands N of the hyperspectral scene, the cut-off threshold value $v_c$ to decide which pixels are to be selected as endmembers, and the threshold angle $v_a$ to discard redundant endmembers. These values are stored in registers.


• To initialize the random generation module, the PowerPC generates two seeds of K bits (where K is the number of skewers) and writes them to the FIFO.

• Afterwards, the control unit reads the seeds and sends them to the random generation module, which stores them. Hence, the random generation module can provide the systolic array with 1 bit for each skewer on each clock cycle.

• After the PowerPC has written the two seeds, it places an order to the DMA so that it starts copying a piece of the image from the DDR2 SDRAM to the FIFO. As mentioned before, the main bottleneck in this kind of system is data input, which is addressed in our implementation by the incorporation of a DMA that eliminates most I/O overheads. Moreover, the PowerPC monitors the input FIFO and sends a new order to the DMA every time that it detects that the FIFO is half-empty. This time, the DMA will bring a piece of the image that occupies half of the FIFO.

Fig. 7. Hardware architecture to implement the abundance estimation step.

- When the data of the first pixel has been written into the FIFO, the systolic array and the random generation module start working. Every clock cycle a new pixel is read by the control unit and sent to the systolic array. In parallel, the k-th component of each skewer is also sent to the systolic array by the random generation module.

- During N clock cycles the data of a pixel are accumulated positively or negatively depending on the skewer component. In the next clock cycle, the min/max unit updates the pixel and the random generation module restores the original two seeds, concluding the systolic cycle. In order to process the hyperspectral image, we need as many systolic cycles as there are pixels in the image.

- When the entire image has been processed, we update the number of occurrences as extreme for all minima and maxima through reads and writes to the occurrences memory.

- The aforementioned steps are repeated several times, according to the number of skewers that can be parallelized and the total number of skewers we want to evaluate.

Fig. 8. Schematic overview of a GPU architecture.

- Finally, the PowerPC reads the pixel purity score associated with each pixel. If it exceeds the preset threshold value vc, the pixel is selected as an endmember. After this, the PowerPC discards redundant endmembers and writes the non-redundant endmember positions to the DDR2 SDRAM.
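The steps above can be sketched as a simplified serial software analogue. The LFSR-based bit generator and all names below are ours for illustration only; they do not reproduce the actual VHDL modules:

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Serial software analogue of the systolic endmember-extraction flow.
// A 16-bit Fibonacci LFSR stands in for the seeded random generation module.
struct Lfsr {
    uint16_t state;
    int next_bit() {
        // taps at bits 16, 14, 13, 11 (right-shift form)
        int b = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1;
        state = static_cast<uint16_t>((state >> 1) | (b << 15));
        return b;
    }
};

// image: P pixels by N bands. Returns the pixel purity score of each pixel:
// the number of times it is the minimum or maximum projection of a skewer.
std::vector<int> purity_scores(const std::vector<std::vector<float>>& image,
                               int num_skewers, uint16_t seed) {
    const int P = static_cast<int>(image.size());
    const int N = static_cast<int>(image[0].size());
    std::vector<int> score(P, 0);
    Lfsr rng{seed};
    for (int k = 0; k < num_skewers; ++k) {
        // the random module supplies one bit per band: accumulate + or -
        std::vector<int> sign(N);
        for (int j = 0; j < N; ++j) sign[j] = rng.next_bit() ? 1 : -1;

        float pemin = std::numeric_limits<float>::max();
        float pemax = std::numeric_limits<float>::lowest();
        int imin = 0, imax = 0;
        for (int i = 0; i < P; ++i) {          // one "systolic cycle" per pixel
            float acc = 0.0f;
            for (int j = 0; j < N; ++j) acc += sign[j] * image[i][j];
            if (acc < pemin) { pemin = acc; imin = i; }
            if (acc > pemax) { pemax = acc; imax = i; }
        }
        ++score[imin];                         // update the occurrences memory
        ++score[imax];
    }
    return score;  // pixels whose score exceeds vc are selected as endmembers
}
```

Here `score` plays the role of the occurrences memory; the PowerPC's final thresholding with vc and the redundancy check with va are omitted.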

3.2. FPGA implementation of abundance estimation

Fig. 7 shows the architecture of the hardware used to implement the ISRA for abundance estimation, along with the I/O communications, following a scheme similar to that of the previous subsection. The general structure can be seen as a three-stage pipelined architecture: the first stage provides the necessary data for the system (endmembers and image data), the second stage carries out the abundance estimation process for each pixel, and the third stage sends the resulting fractional abundances via an RS232 port. Furthermore, in our implementation these three stages work in parallel. Additional details about the hardware implementation in Fig. 7 can be found in [32].

Since the fractional abundances can be calculated simultaneously for different pixels, the ISRA algorithm can be parallelized by replicating the ISRA basic unit u times and by adding two referees: one to read from the write FIFO and send data to the corresponding basic unit or units, and another to write the abundance fractions to the read FIFO. The number of times that we can replicate the ISRA basic unit is determined by the amount of available hardware resources and fixes the speedup of the parallel implementation. Additional details about the behavior of the referees, along with a step-by-step description of how the proposed architecture performs the abundance estimation, can be found in [32].

4. GPU implementation

GPUs can be abstracted in terms of a stream model, under which all data sets are represented as streams (i.e. ordered data sets) [33]. Fig. 8 shows the architecture of a GPU, which can be seen as a set of multiprocessors (MPs). Each multiprocessor is characterized by a single instruction multiple data (SIMD) architecture, i.e. in each clock cycle each processor executes the same instruction but operates on multiple data streams. Each processor has access to a local shared memory and to local cache memories in the multiprocessor, while the multiprocessors have access to the global GPU (device) memory. Algorithms are constructed by chaining so-called kernels, which operate on entire streams and are executed by a multiprocessor, taking one


or more streams as inputs and producing one or more streams as outputs. The kernels can perform a kind of batch processing arranged in the form of a grid of blocks (see Fig. 9), where each block is composed of a group of threads which share data efficiently through the shared local memory and synchronize their execution to coordinate accesses to memory. As a result, there are different levels of memory in the GPU for the thread, block and grid concepts (see Fig. 10). There is also a maximum number of threads that a block can contain, but the number of threads that can be concurrently executed is much larger (several blocks executed by the same kernel can be managed concurrently, at the expense of reduced cooperation between threads, since threads in different blocks of the same grid cannot synchronize with each other, as explored in the context of hyperspectral imaging implementations in [34]). Our implementation of the

Fig. 9. Batch processing in the GPU: grids of blocks of threads.

Fig. 10. Different levels of memory in the GPU for the thread, block and grid concepts.

endmember extraction and the abundance estimation parts of the considered hyperspectral unmixing chain has been carried out using the compute unified device architecture (CUDA).2

4.1. GPU implementation of endmember extraction

The first issue that needs to be addressed is how to map a hyperspectral image onto the memory of the GPU. If the size of the hyperspectral image exceeds the capacity of the GPU memory, we split the image into multiple spatial-domain partitions [6] made up of entire pixel vectors. In order to perform random generation of skewers in the GPU, we have adopted one of the most widely used methods for random number generation in software, the Mersenne twister, which has extremely good statistical quality [35]. This method is implemented in the RandomGPU kernel available in CUDA.

Once the skewer generation process is completed, the most time-consuming kernel in our GPU implementation is the one taking care of the extreme projections step of the endmember extraction. This kernel is denoted PPI and is displayed in Fig. 11 for illustrative purposes. The first parameter of the PPI kernel, d_image, is the original hyperspectral image. The second parameter is a structure that contains the randomly generated skewers. The third parameter, d_res_partial, is the structure in which the output of the kernel will be stored. The kernel also receives as input parameters the dimensions of the hyperspectral image, i.e. num_lines, num_samples and num_bands. The structure l_rand (local to each thread) is used to store a skewer through a processing cycle. It should be noted that each thread works with a different skewer, and the values of the skewer are continuously used by the thread in each iteration; hence it is reasonable to store skewers in the registers associated with each thread, since these memories are hundreds of times faster than the global GPU memory. The second structure used by the kernel is s_pixels, which is shared by all threads. Since each thread needs to perform the dot product using the same image pixels, it is reasonable to use a shared structure which can be accessed by all threads, thus avoiding separate accesses to the global GPU memory by each thread. Since the s_pixels structure is stored in shared memory, these accesses are also much faster. It should be noted that the local memories in a GPU are usually quite small in order to guarantee very fast accesses; therefore the s_pixels structure can only accommodate a block of v pixels.

Once a skewer and a group of v pixels are stored in the local memory associated with a thread, the next step is to perform the dot product between each pixel vector and the skewer. For this purpose, each thread uses two variables, pemin and pemax, which store the minima and maxima projection values, respectively, and two other variables, imin and imax, which store the relative indices of the pixels resulting in the maxima and minima projection values. Once the projection process is finalized for a group of v

pixels, another group is loaded. The process finalizes when all hyperspectral image pixels have been projected onto the skewer. Finally, the d_res_partial structure is updated with the minima and maxima projection values. This structure is reshaped into a matrix of pixel purity indices by simply counting the number of times that each pixel was selected as extreme during the process and updating its associated score with that number, thus producing a final structure d_res_total that stores the final values of N_PPI(f_i) for a given pixel f_i. Additional CUDA kernels (not described here for space considerations) have been developed to perform the other steps involved in the PPI algorithm, thus leading to the selection of a final set of p endmembers {e_j}_{j=1}^p.

2 http://www.nvidia.com/object/cuda_home_new.html.


Fig. 11. CUDA kernel PPI developed to implement the extreme projections step of the PPI endmember extraction algorithm on the GPU.


4.2. GPU implementation of abundance estimation

Our GPU version of the abundance estimation step is based on a CUDA kernel called ISRA, which implements the ANC-constrained abundance estimation procedure described in Section 2.2. The ISRA kernel is shown in Fig. 12. Here, d_image_vector is the structure that stores the hyperspectral image in the device (GPU) and d_image_unmixed is the final outcome of the process, i.e. the abundances associated with the p endmembers {e_j}_{j=1}^p derived by the PPI algorithm. These endmembers are stored in a structure called s_end. The kernel performs (in parallel) the calculations associated with Eq. (3) in iterative fashion for each pixel in the hyperspectral image, providing a set of positive abundances.

To conclude this section, we emphasize that our GPU implementation of the full unmixing chain (endmember extraction plus abundance estimation) has been carefully optimized taking into account the considered architecture (summarized in Fig. 8), including the available global memory, the local shared memory in each multiprocessor, and the local cache memories. Whenever possible, we have accommodated blocks of pixels in small local memories in the GPU in order to guarantee very fast accesses, thus performing block-by-block processing to speed up the computations as much as possible. In the following, we analyze the performance of our parallel implementations on different architectures.

5. Experimental results

This section is organized as follows. In Section 5.1 we describe the hardware used in our experiments. Section 5.2 describes the hyperspectral data set that will be used for demonstration purposes. Section 5.3 evaluates the unmixing accuracy of the considered implementations. Section 5.4 inter-compares the parallel


Fig. 12. CUDA kernel ISRA that computes endmember abundances in each pixel of the hyperspectral image.


performance of the FPGA and GPU implementations described in Sections 3 and 4, respectively. Finally, Section 5.5 discusses the obtained results.

5.1. Hardware accelerators

The hardware architecture described in Section 3 has been implemented using the VHDL language for the specification of the systolic array. Further, we have used the Xilinx ISE environment and the Embedded Development Kit (EDK) environment3 to specify the complete system. The full system has been implemented on a low-cost reconfigurable board (XUPV2P type) with a single Virtex-II PRO XC2VP30 FPGA component, a DDR SDRAM DIMM memory slot with 2 GB of main memory, an RS232 port, and some additional components not used by our implementation

3 http://www.xilinx.com/ise/embedded/edk_pstudio.htm.

(see Fig. 13). This FPGA has a total of 13,696 slices, 27,392 slice flip flops, and 27,392 four-input LUTs available. In addition, the FPGA includes some heterogeneous resources, such as two PowerPCs and distributed block RAMs. This is a rather old low-cost board, but we have selected it for this study because it is similar to other FPGAs that have been certified by several international agencies for remote sensing applications.4

On the other hand, the implementation described in Section 4 has been tested on the NVidia Tesla C1060 GPU (see Fig. 14), which features 240 processor cores operating at 1.296 GHz, with single precision floating point performance of 933 Gflops, double precision floating point performance of 78 Gflops, total dedicated memory of 4 GB, 800 MHz memory (with a 512-bit GDDR3 interface) and memory bandwidth of 102 GB/s. The GPU is connected

4 http://www.xilinx.com/publications/prod_mktg/virtex5qv-product-table.pdf


Fig. 13. Xilinx reconfigurable board XUPV2P (http://www.xilinx.com/univ/XUPV2P/Documentation/ug069.pdf).

Fig. 14. NVidia Tesla C1060 GPU (http://www.nvidia.com/object/product_tesla_c1060_us.html).


to an Intel Core i7 920 CPU at 2.67 GHz (four cores, eight hardware threads), on an Asus P6T7 WS SuperComputer motherboard.

5.2. Hyperspectral data

The hyperspectral data set used in the experiments is the well-known AVIRIS Cuprite scene [see Fig. 15(a)], available online in reflectance units.5 This scene has been widely used to validate the performance of endmember extraction and unmixing algorithms. The scene comprises a relatively large area (350 lines by 350 samples with 20-m pixels) and 224 spectral bands between 0.4 and 2.5 μm, with a nominal spectral resolution of 10 nm. Bands

5 http://aviris.jpl.nasa.gov/html/aviris.freedata.html.

1–3, 105–115 and 150–170 were removed prior to the analysis due to water absorption and low SNR in those bands. The site is well understood mineralogically, and has several exposed minerals of interest, including alunite, buddingtonite, calcite, kaolinite and muscovite. Reference ground signatures of these minerals [see Fig. 15(b)], available in the form of a U.S. Geological Survey (USGS) library,6 will be used to assess endmember signature purity in this work.

5.3. Unmixing accuracy

Before empirically investigating the parallel performance of the proposed implementations, we first evaluate their unmixing accuracy in the context of the considered application. Prior to a full examination and discussion of results, it is important to outline the parameter values used for the considered unmixing chain. In all the considered implementations, the number of endmembers to be extracted was set to p = 22 after estimating the dimensionality of the data using the virtual dimensionality concept [36]. In addition, the number of skewers was set to K = 10^4 (although values of K = 10^3 and K = 10^5 were also tested, we experimentally observed that the use of K = 10^3 resulted in the loss of important endmembers, while the endmembers obtained using K = 10^5 were essentially the same as those found using K = 10^4). The threshold angle parameter was set to va = 10°, which is a reasonable limit of tolerance for this metric, while the cut-off threshold value parameter vc was set to the mean of the N_PPI scores obtained after K = 10^4 iterations. These parameter values are in agreement with those used previously in the literature [21]. The number of iterations of the ISRA algorithm for abundance estimation was set to 50 after empirically observing that the root mean square error (RMSE) between the original and the

6 http://speclab.cr.usgs.gov/spectral-lib.html.


Fig. 15. (a) False color composition of the AVIRIS hyperspectral image collected over the Cuprite mining district in Nevada. (b) U.S. Geological Survey mineral spectral signatures used for validation purposes. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)

Table 1
Spectral angle scores (in degrees) between the endmembers extracted from the AVIRIS Cuprite hyperspectral image by different implementations and some selected USGS reference signatures.

USGS mineral    ENVI software (deg.)   C++ implementation (deg.)   FPGA implementation (deg.)   GPU implementation (deg.)
Alunite         4.81                   4.81                        4.81                         4.81
Buddingtonite   4.07                   4.06                        4.06                         4.06
Calcite         5.09                   5.09                        5.09                         5.09
Kaolinite       7.72                   7.67                        7.67                         7.67
Muscovite       5.27                   6.47                        6.47                         6.47


reconstructed hyperspectral image (using the p = 22 endmembers derived by PPI and their per-pixel abundances estimated by ISRA) was very low, and further decreased only very slowly for a higher number of iterations.

Table 1 shows the SA between the most similar endmembers detected by the original implementation of the chain available in ENVI 4.5, an optimized implementation of the same chain described in Section 2 (coded using the C++ programming language), and the proposed FPGA and GPU-based implementations. The SA between an endmember e_j selected by the PPI and a reference spectral signature s_i is given by

SA(e_j, s_i) = cos^-1 [ (e_j · s_i) / (||e_j|| ||s_i||) ].    (4)

It should be noted that Table 1 only reports the SA scores associated with the most similar spectral endmember with regard to its corresponding USGS signature. Smaller SA values indicate higher spectral similarity. As shown in Table 1, the endmembers found by our parallel implementations were exactly the same as the ones found by the serial implementation of the processing chain implemented in C++. However, our parallel implementations produced slightly different endmembers than those found by ENVI's implementation. In any event, we experimentally verified that the SA scores between the endmembers that differed between the ENVI and the parallel algorithms were always very low (below 0.85°), a fact that reveals that the final endmember sets were almost identical in the spectral sense. Finally, it is worth noting that an evaluation of abundance estimation in real analysis scenarios is very difficult due to the lack of ground-truth to quantitatively substantiate the obtained results. However, we qualitatively observed that the abundance maps derived by ISRA for the p = 22 endmembers derived by PPI were in good agreement with those derived by other methods in previous work (see for instance [24,37,38]). These maps are not displayed here for space considerations.
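Eq. (4) translates directly into code; a minimal sketch, returning degrees as reported in Table 1 (names are ours):

```cpp
#include <cmath>
#include <vector>

// Spectral angle (Eq. (4)) between an endmember e and a reference signature s,
// in degrees; smaller angles mean higher spectral similarity.
double spectral_angle_deg(const std::vector<double>& e,
                          const std::vector<double>& s) {
    double dot = 0.0, ne = 0.0, ns = 0.0;
    for (std::size_t j = 0; j < e.size(); ++j) {
        dot += e[j] * s[j];
        ne  += e[j] * e[j];
        ns  += s[j] * s[j];
    }
    double c = dot / (std::sqrt(ne) * std::sqrt(ns));
    if (c > 1.0)  c = 1.0;      // guard against rounding past +/-1
    if (c < -1.0) c = -1.0;
    return std::acos(c) * 180.0 / std::acos(-1.0);
}
```

Applying this to each extracted endmember against every USGS library signature, and keeping the smallest angle, reproduces the per-mineral scores of Table 1.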

5.4. Parallel performance

5.4.1. Endmember extraction

Table 2 shows the resources used by our FPGA implementation of the PPI-based endmember extraction process for different numbers of skewers (ranging from K = 10 to K = 100), tested on the Virtex-II PRO XC2VP30 FPGA of the XUPV2P board. As shown in Table 2, we can scale our design up to 100 skewers (therefore, P = 100 algorithm passes are needed in order to process K = 10^4

skewers). In our design the clock frequency remains constant at 187 MHz. It should be noted that, in the current implementation, the complete AVIRIS hyperspectral image is stored in an external DDR2 SDRAM. However, with an appropriate controller, other options could be supported, such as using flash memory to store the hyperspectral data.

In previous FPGA designs of the PPI algorithm [15,31], the module for random generation of skewers was situated in an external processor. Hence, frequent communications between the host (CPU) and the device (FPGA) were needed. However, in our implementation the hardware random generation module is implemented internally on the FPGA board. This approach significantly reduces the communications, leading to increased parallel performance. To further reduce the communication overheads we have included a DMA and applied a prefetching approach in order to hide the communication latency. Basically, while the systolic array is processing a set of data, the DMA is fetching the following set and storing it in the FIFO. Bearing in mind the proposed optimization concerning the use of available resources, it is important to find a balance between the number of DMA operations and the capacity of the destination FIFO. In other words, we need to fit enough information in the FIFO so that the systolic array never needs to stop. In addition, the greater the FIFO capacity, the fewer DMA operations will be required. We have evaluated several FIFO sizes and identified that, for 1024 positions or more, there are no penalties due to reading of the input data. With the above considerations in mind, we emphasize that our


Table 2
Summary of resource utilization for the FPGA-based implementation of endmember extraction from the AVIRIS Cuprite hyperspectral image.

Number of   Number of slice   Number of four-   Number of   Percentage   Maximum operation
skewers     flip flops        input LUTs        slices      of total     frequency (MHz)
10          1120              1933              1043        7.61         187
20          2240              3865              2085        15.22        187
30          3360              5796              3127        22.83        187
40          4480              7728              4170        30.44        187
50          5600              9659              5212        38.05        187
60          6720              11,591            6254        45.66        187
70          7840              13,522            7296        53.27        187
80          8960              15,454            8339        60.88        187
90          10,080            17,385            9381        68.49        187
100         11,200            19,317            10,423      76.1         187


implementation on the considered FPGA board was able to achieve a total processing time for the considered AVIRIS scene of 31.23 s. Compared with a recent FPGA-based implementation using the same hyperspectral scene, presented in [39], our FPGA implementation of the PPI shows a significant increase in performance, with a speedup of 2 with respect to that implementation. We must consider that the FPGA used in [39] has 2.5 times more slices than the one used in our implementation of the PPI algorithm. This result is far from real-time endmember extraction, since the cross-track line scan time in AVIRIS, a push-broom instrument, is quite fast (8.3 ms to collect 512 full pixel vectors). This introduces the need to process the considered scene (350×350 pixels) in approximately 2 s in order to fully achieve real-time performance. In this regard, it is worth noting that we used the internal clock of 100 MHz (the maximum frequency for the XUPV2P board) for the execution instead of any external clock. Therefore, if faster FPGAs are certified for aerospace use, it would be straightforward to achieve a 1.9 speedup with respect to our current implementation. Moreover, as FPGA technology has greatly improved in recent years, larger FPGAs will soon be ready for aerospace applications, and since our design is fully scalable we could also achieve important speedups. Theoretically, if we scale our design to an FPGA with 10 times more area using a 187 MHz clock, it should be possible to fully achieve real-time endmember extraction performance. These FPGAs are already available in the market, although they have not yet been certified for aerospace applications.

On the other hand, the execution time for the endmember extraction stage implemented on the NVidia C1060 Tesla GPU was 17.59 s (closer to real-time performance than the FPGA implementation, but still far from it). In our experiments on the Intel Core i7 920 CPU, the C++ implementation took 3454.53 s to process the considered AVIRIS scene (using just one of the available cores). This means that the speedup achieved by our GPU implementation with respect to the serial implementation was approximately 196.39.

For illustrative purposes, Fig. 16 shows the percentage of the total execution time consumed by the PPI kernel described in Fig. 11, which implements the extreme projections step of the unmixing chain, and by the RandomGPU kernel, which implements the skewer generation step in the chain. Fig. 16 also displays the number of times that each kernel was invoked (in parentheses). These values were obtained after profiling the implementation using the CUDA Visual Profiler tool.7 In the figure, the percentage of time for data movements from host (CPU) to

7 http://developer.nvidia.com/object/cuda_3_1_downloads.html.

device (GPU) and from device to host is also displayed. It should be noted that two data movements from host to device are needed, to transfer the original hyperspectral image and the skewers to the GPU, while only one movement from device to host (final result) is needed. As shown in Fig. 16, the PPI kernel consumes about 99% of the total GPU time devoted to endmember extraction from the AVIRIS Cuprite image, while the RandomGPU kernel occupies a comparatively much smaller fraction. Finally, the data movement operations are not significant, which indicates that most of the GPU processing time is invested in the most time-consuming operation, i.e. the calculation of skewer projections and the identification of the maxima and minima projection values leading to endmember identification.

5.4.2. Abundance estimation

Table 3 shows the resources used by our hardware implementation of the proposed ISRA design for different numbers u of ISRA basic units operating in parallel. As shown in Table 3, our design can be scaled up to u = 10 ISRA basic units in parallel. An interesting property of the proposed parallel ISRA module is that it can be scaled without a significant increase in the critical path delay (the clock frequency remains almost constant). For the maximum number of basic units used in our experiments (u = 10), the percentage of total hardware utilization is 90.07%, thus reaching almost full hardware occupancy. However, other values of u tested in our experiments, e.g. u = 7, still leave room in the FPGA for additional algorithms. In any case, this value could be significantly increased on more recent FPGA boards. For example, on a Virtex-4 FPGA XQR4VLX200 (89,088 slices) certified for space, we could have 6.5 times more ISRA units in parallel simply by synthesizing the second stage of the hardware architecture for the new number of units in parallel. In this way, we could achieve a speedup of 6.5 without modifications to our proposed design. In the case of an airborne platform without the need for space-certified hardware, we could use a Virtex-6 XQ6VLX550T (550,000 logic cells), which has nearly 40 times more logic cells than the FPGA used in our experiments.

In our current implementation, the processing time of ISRA for 50 iterations was 1303.1 s in the FPGA version (which is far from real-time abundance estimation performance) and 47.84 min in the CPU version, which has been extensively fine-tuned using autovectorization capabilities and optimization flags. This results in a speedup of 2.2. We emphasize that other FPGA implementations of ISRA are available in the literature [40,41]. However, the authors indicate that their full FPGA design and implementation (tested on a Virtex-II Pro FPGA) required more execution time than the serial implementation presented in [42]. This was due to data transmission and I/O bottlenecks. Since our approach optimizes I/O significantly, a comparison with the approach in [40,41] was not deemed suitable.

Regarding the execution time of the abundance estimation step on the GPU, we measured a processing time of 24.37 s on the NVidia C1060 Tesla GPU. This means that the speedup achieved by the GPU implementation with respect to the Intel Core i7 920 CPU implementation (using just one of the available cores) was approximately 60.81. In another study focused on a GPU implementation of ISRA [43], the execution time obtained after running the algorithm with 50 iterations for a smaller hyperspectral image also collected by AVIRIS (132 samples by 167 lines and 224 spectral bands, for a total size of 9.73 MB) was much higher (205 s, and about a 2.67 speedup with respect to the serial version). Although our results are more encouraging in terms of speedup, it should be noted that the times measured in [43] were reported for an older NVidia GPU, the 8800 GTX, with 128 cores operating at 1.33 GHz. Note also the difference between the speedup achieved on the NVidia C1060 Tesla GPU and the speedup


Fig. 16. Summary plot describing the percentage of the total GPU time consumed by the endmember extraction step of the unmixing chain on the NVidia Tesla C1060 GPU.

Table 3
Summary of resource utilization for the FPGA-based implementation of ISRA for different numbers of modules in parallel.

Component              Number of     Number of    Number of slice   Number of four-   Number of   Percentage   Maximum operation
                       modules (u)   MULT18X18s   flip flops        input LUTs        slices      of total     frequency (MHz)
Parallel ISRA module   4             32           1334              5410              3084        22.51        44.7
                       5             40           2001              8115              4626        33.77        44.5
                       6             48           2668              10,820            6168        45.03        44.4
                       7             56           3335              13,525            7710        56.29        44.3
                       8             64           4002              16,230            9252        67.55        44.1
                       9             72           4669              18,935            10,794      78.81        43.9
                       10            80           5336              21,640            12,336      90.07        43.8
RS232 transmitter      –             0            69                128               71          0.28         208
DMA controller         –             0            170               531               367         1.45         102

Fig. 17. Summary plot describing the percentage of the total GPU time consumed by the abundance estimation step of the unmixing chain on the NVidia Tesla C1060 GPU.


reported for the Virtex-II PRO XC2VP30 FPGA. This is mainly because the ISRA algorithm relies on iterative multiplication operations, which cannot be implemented in this small FPGA as efficiently as in the GPU. For illustrative purposes, Fig. 17 shows the percentage of the total execution time employed by each of the CUDA kernels, obtained after profiling the abundance estimation step for the AVIRIS Cuprite scene on the NVidia Tesla C1060 architecture. The first bar represents the percentage of time employed by the ISRA kernel, while the second and third bars respectively denote the percentage of time for data movements from host (CPU) to device (GPU) and from device to host. The reason why there are four data movements from host to device is that not only the hyperspectral image, but also the endmember signatures, the number of endmembers, and the structure where the abundances are stored need to be forwarded to the GPU, which returns the structure with the estimated abundances after the calculation is completed. As shown in Fig. 17, the ISRA kernel dominates the computation and consumes more than 99% of the GPU time.

5.5. Discussion and result comparison

The performance results obtained by our two proposed implementations of the full spectral unmixing chain are quite satisfactory, achieving important speedups when compared to the results obtained by a high performance CPU. This demonstrates that both options represent promising technologies for aerospace applications, although the number of GPU platforms which have been certified for space operation is still very limited as compared to FPGAs. Although in our experiments the GPU implementation clearly achieved the best performance results (17.59 + 24.37 s for unmixing (endmember extraction + abundance estimation) the considered hyperspectral data set), it should be noted that the FPGA implementation uses an older technology. As a result, the performance results reported for the FPGA (31.23 + 1303.1 s for unmixing the same scene) are also reasonable.

Regarding platform cost, the FPGA used costs just $300 (although an equivalent board certified for aerospace applications may be significantly more expensive), whereas the GPU costs around $1000. Hence, both architectures are very affordable solutions for aerospace applications. Another important feature to be discussed is the scalability of the provided solutions. In this case, the FPGA implementation has been designed to be fully scalable, as it can work in parallel with any number of skewers as long as enough hardware area is available. In the case of the GPU implementation, the developed code can also be made scalable, but it is certainly more dependent on the considered GPU platform; hence its portability to other GPU platforms is more challenging than in the case of the FPGA implementation, which could easily be adapted to other boards.

Flexibility is also an important parameter to be discussed in aerospace applications, since the available space is very restricted. In this regard, both platforms offer a low-weight solution compliant with mission payload requirements, although power consumption is higher in the GPU solution than in the FPGA solution. In both cases the solutions can be easily reused to deal with other processing problems. In the GPU case, this is as simple as executing a different code, whereas in the FPGA case this can be achieved by reconfiguring the platform.

Finally, it is also important to discuss the design effort needed in both cases. Compared to a C++ code developed for conventional CPUs, both solutions require additional design effort, especially since designers must learn a different design paradigm and development


C. Gonzalez et al. / INTEGRATION, the VLSI journal 46 (2013) 89–103

environments, and also take into account several low-level implementation details. After comparing the design effort of the two options that we have implemented, we believe that FPGA design is probably a little more complex than GPU design, for two main reasons. On the one hand, in the FPGA implementation the platform needs to be designed, whereas in the GPU implementation the platform just needs to be used. On the other hand, for large FPGA designs hardware debugging is a complex issue. However, we believe that, in both cases, the performance achievements are more significant than the increase in design complexity.

6. Conclusions and future research lines

Through the detailed analysis of a representative spectral unmixing chain, we have illustrated the increase in computational performance that can be achieved by incorporating hardware accelerators into hyperspectral image processing problems. A major advantage of incorporating such hardware accelerators aboard airborne and satellite platforms for Earth observation is to overcome an existing limitation in many remote sensing and observatory systems: the bottleneck introduced by the bandwidth of the down-link connection from the observatory platform. Experimental results demonstrate that our hardware implementations make appropriate use of computing resources in the considered FPGA and GPU architectures, and further provide significant speedups using only one hardware device as opposed to previous efforts using clusters, and with fewer on-board restrictions in terms of cost, power consumption and size, which are important when defining mission payload in remote sensing missions (defined as the maximum load allowed in the airborne or satellite platform that carries the imaging instrument). Although the response times measured are not strictly real-time, they are still believed to be acceptable in most remote sensing applications. In addition, the reconfigurability of FPGA systems (on the one hand) and the low cost of GPU systems (on the other) open many innovative perspectives from an application point of view, ranging from the appealing possibility of adaptively selecting one out of a pool of available data processing algorithms (which could be applied on the fly aboard the airborne/satellite platform, or even from a control station on Earth), to the possibility of providing a response in applications with real-time constraints. It should be noted that on-board processing is not an absolute requirement for our proposed developments, which could also be used in on-ground processing systems.

Although the experimental results presented in this work are encouraging, further work is still needed to optimize the proposed implementations and fully achieve real-time performance. Radiation-tolerance and power consumption issues for these hardware accelerators should be explored in more detail in future developments. Optimal parallel designs and implementations are still needed for other, more sophisticated processing algorithms that have been used extensively in the hyperspectral remote sensing community, such as advanced supervised classifiers (e.g. support vector machines) which do not exhibit regular patterns of computation and communication [4]. Future work will also focus on optimizing the ISRA abundance estimation algorithm discussed in this work on FPGA platforms, and we will study other abundance estimation algorithms, such as FCLSU, to check if they are more suitable for FPGA implementation.

Acknowledgments

This work has been supported by the European Community's Marie Curie Research Training Networks Programme under reference MRTN-CT-2006-035927, Hyperspectral Imaging Network (HYPER-I-NET). This work has also been supported by the Spanish Ministry of Science and Innovation (HYPERCOMP/EODIX project, reference AYA2008-05965-C04-02), and by projects AYA2009-13300-C03-02 and TIN2009-09806.

References

[1] A.F.H. Goetz, G. Vane, J.E. Solomon, B.N. Rock, Imaging spectrometry for Earth remote sensing, Science 228 (1985) 1147–1153.

[2] R.O. Green, et al., Imaging spectroscopy and the airborne visible/infrared imaging spectrometer (AVIRIS), Remote Sensing of Environment 65 (1998) 227–248.

[3] A. Plaza, Special issue on architectures and techniques for real-time processing of remotely sensed images, Journal of Real-Time Image Processing 4 (2009) 191–193.

[4] A. Plaza, J.A. Benediktsson, J. Boardman, J. Brazile, L. Bruzzone, G. Camps-Valls, J. Chanussot, M. Fauvel, P. Gamba, J. Gualtieri, M. Marconcini, J.C. Tilton, G. Trianni, Recent advances in techniques for hyperspectral image processing, Remote Sensing of Environment 113 (2009) 110–122.

[5] A. Plaza, C.-I. Chang, High Performance Computing in Remote Sensing, Taylor & Francis, Boca Raton, FL, 2007.

[6] A. Plaza, D. Valencia, J. Plaza, P. Martinez, Commodity cluster-based parallel processing of hyperspectral imagery, Journal of Parallel and Distributed Computing 66 (2006) 345–358.

[7] A. Plaza, D. Valencia, J. Plaza, An experimental comparison of parallel algorithms for hyperspectral analysis using homogeneous and heterogeneous networks of workstations, Parallel Computing 34 (2008) 92–114.

[8] Q. Du, R. Nekovei, Fast real-time onboard processing of hyperspectral imagery for detection and classification, Journal of Real-Time Image Processing 22 (2009) 438–448.

[9] S. Hauck, The roles of FPGAs in reprogrammable systems, Proceedings of the IEEE 86 (1998) 615–638.

[10] E. Lindholm, J. Nickolls, S. Oberman, J. Montrym, NVIDIA Tesla: a unified graphics and computing architecture, IEEE Micro 28 (2008) 39–55.

[11] U. Thomas, D. Rosenbaum, F. Kurz, S. Suri, P. Reinartz, A new software/hardware architecture for real time image processing of wide area airborne camera images, Journal of Real-Time Image Processing 5 (2009) 229–244.

[12] C. Gonzalez, J. Resano, D. Mozos, A. Plaza, D. Valencia, FPGA implementation of the pixel purity index algorithm for remotely sensed hyperspectral image analysis, EURASIP Journal on Advances in Signal Processing 969806 (2010) 1–13.

[13] J. Theiler, J. Frigo, M. Gokhale, J.J. Szymanski, Co-design of software and hardware to implement remote sensing algorithms, Proceedings of SPIE 4480 (2001) 200–210.

[14] A. Paz, A. Plaza, Clusters versus GPUs for parallel automatic target detection in remotely sensed hyperspectral images, EURASIP Journal on Advances in Signal Processing 915639 (2010) 1–18.

[15] A. Plaza, C.-I. Chang, Clusters versus FPGA for parallel processing of hyperspectral imagery, International Journal of High Performance Computing Applications 22 (2008) 366–385.

[16] E. El-Araby, T. El-Ghazawi, J.L. Moigne, R. Irish, Reconfigurable processing for satellite on-board automatic cloud cover assessment, Journal of Real-Time Image Processing 5 (2009) 245–259.

[17] J. Setoain, M. Prieto, C. Tenllado, F. Tirado, GPU for parallel on-board hyperspectral image processing, International Journal of High Performance Computing Applications 22 (2008) 424–437.

[18] Y. Tarabalka, T.V. Haavardsholm, I. Kasen, T. Skauli, Real-time anomaly detection in hyperspectral images using multivariate normal mixture models and GPU processing, Journal of Real-Time Image Processing 4 (2009) 1–14.

[19] J.B. Adams, M.O. Smith, P.E. Johnson, Spectral mixture modeling: a new analysis of rock and soil types at the Viking Lander 1 site, Journal of Geophysical Research 91 (1986) 8098–8112.

[20] N. Keshava, J.F. Mustard, Spectral unmixing, IEEE Signal Processing Magazine 19 (2002) 44–57.

[21] A. Plaza, P. Martinez, R. Perez, J. Plaza, A quantitative and comparative analysis of endmember extraction algorithms from hyperspectral data, IEEE Transactions on Geoscience and Remote Sensing 42 (2004) 650–663.

[22] Q. Du, N. Raksuntorn, N.H. Younan, R.L. King, End-member extraction for hyperspectral image analysis, Applied Optics 47 (2008) 77–84.

[23] D. Heinz, C.-I. Chang, Fully constrained least squares linear mixture analysis for material quantification in hyperspectral imagery, IEEE Transactions on Geoscience and Remote Sensing 39 (2000) 529–545.

[24] C.-I. Chang, Hyperspectral Data Exploitation: Theory and Applications, John Wiley & Sons, New York, 2007.

[25] J.W. Boardman, F.A. Kruse, R.O. Green, Mapping target signatures via partial unmixing of AVIRIS data, in: Proceedings of the JPL Airborne Earth Science Workshop, 1995, pp. 23–26.

[26] M.E. Daube-Witherspoon, G. Muehllehner, An iterative image space reconstruction algorithm suitable for volume ECT, IEEE Transactions on Medical Imaging 5 (1985) 61–66.

[27] C.-I. Chang, Hyperspectral Imaging: Techniques for Spectral Detection and Classification, Kluwer Academic/Plenum Publishers, New York, 2003.



[28] D. Lavenier, E. Fabiani, S. Derrien, C. Wagner, Systolic array for computing the pixel purity index algorithm on hyperspectral images, Proceedings of SPIE 4480 (1999) 130–138.

[29] D. Lavenier, J. Theiler, J. Szymanski, M. Gokhale, J. Frigo, FPGA implementation of the pixel purity index algorithm, Proceedings of SPIE 4693 (2002) 30–41.

[30] Wildforce Reference Manual, Technical Report, Revision 3.4, Annapolis Micro Systems Inc., 1999.

[31] M. Hsueh, C.-I. Chang, Field programmable gate arrays (FPGA) for pixel purity index using blocks of skewers for endmember extraction in hyperspectral imagery, International Journal of High Performance Computing Applications 22 (2008) 408–423.

[32] C. Gonzalez, J. Resano, A. Plaza, D. Mozos, FPGA implementation of abundance estimation for spectral unmixing of hyperspectral data using the image space reconstruction algorithm, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 5 (2012) 248–261.

[33] J. Setoain, M. Prieto, C. Tenllado, A. Plaza, F. Tirado, Parallel morphological endmember extraction using commodity graphics hardware, IEEE Geoscience and Remote Sensing Letters 43 (2007) 441–445.

[34] S. Sanchez, A. Paz, G. Martin, A. Plaza, Parallel unmixing of remotely sensed hyperspectral images on commodity graphics processing units, Concurrency and Computation: Practice and Experience 23 (2011) 1538–1557.

[35] M. Matsumoto, T. Nishimura, Mersenne twister: a 623-dimensionally equidistributed uniform pseudorandom number generator, ACM Transactions on Modeling and Computer Simulation 8 (1998) 3–30.

[36] Q. Du, C.-I. Chang, Estimation of number of spectrally distinct signal sources in hyperspectral imagery, IEEE Transactions on Geoscience and Remote Sensing 42 (2004) 608–619.

[37] M.E. Winter, N-FINDR: an algorithm for fast autonomous spectral endmember determination in hyperspectral data, Proceedings of SPIE Image Spectrometry V 3753 (2003) 266–277.

[38] J.M.P. Nascimento, J.M. Bioucas-Dias, Vertex component analysis: a fast algorithm to unmix hyperspectral data, IEEE Transactions on Geoscience and Remote Sensing 43 (2005) 898–910.

[39] D. Valencia, A. Plaza, M.A. Vega-Rodriguez, R.M. Perez, FPGA design and implementation of a fast pixel purity index algorithm for endmember extraction in hyperspectral imagery, Chemical and Biological Standoff Detection III, Proceedings of SPIE 5995 (2006) 69–78.

[40] J. Morales, N. Medero, N.G. Santiago, J. Sosa, Hardware implementation of image space reconstruction algorithm using FPGAs, in: Proceedings of the IEEE International Midwest Symposium on Circuits and Systems, vol. 1, 2006, pp. 433–436.

[41] J. Morales, N.G. Santiago, A. Morales, An FPGA implementation of image space reconstruction algorithm for hyperspectral imaging analysis, Proceedings of SPIE 6565 (2007) 1–18.

[42] S. Rosario, Iterative Algorithms for Abundance Estimation on Unmixing of Hyperspectral Imagery, Master Thesis, University of Puerto Rico, Mayaguez, 2004.

[43] D. Gonzalez, C. Sanchez, R. Veguilla, N.G. Santiago, S. Rosario-Torres, M. Velez-Reyes, Abundance estimation algorithms using NVIDIA CUDA technology, Proceedings of SPIE 6966 (2008) 1–9.

Carlos Gonzalez received the M.Sc. degree in 2008 and is currently a Teaching Assistant in the Department of Computer Architecture and Automatics, Complutense University of Madrid, Spain. In his work he mainly focuses on applying run-time reconfiguration in aerospace applications. He has recently started working on this topic, with algorithms that deal with hyperspectral images. He is also interested in the acceleration of artificial intelligence algorithms applied to games. He won the Design Competition of the IEEE International Conference on Field Programmable Technology in 2009 (FPT'09) and in 2010 (FPT'10).

Sergio Sanchez received the M.Sc. degree in 2010 and is currently a Research Associate with the Hyperspectral Computing Laboratory, Department of Technology of Computers and Communications, University of Extremadura, Spain. His main research interests comprise hyperspectral image analysis and efficient implementations of large-scale scientific problems on commodity graphics processing units (GPUs).

Abel Paz received the M.Sc. degree in 2007 and is currently a staff member of Bull Spain working in the Center for Advanced Technologies of Extremadura (CETA), and also a Research Associate with the Hyperspectral Computing Laboratory, Department of Technology of Computers and Communications, University of Extremadura, Spain. His main research interests comprise hyperspectral image analysis and efficient implementations of large-scale scientific problems on commodity Beowulf clusters, heterogeneous networks of computers and grids, and specialized computer architectures such as clusters of graphics processing units (GPUs).

Javier Resano received the Bachelor degree in Physics in 1997, the Master degree in Computer Science in 1999, and the Ph.D. degree in 2005, all at the Universidad Complutense de Madrid, Spain. Currently he is an Associate Professor at the Computer Engineering Department of the Universidad de Zaragoza, and he is a member of the GHADIR research group, from Universidad Complutense, and the GAZ research group, from Universidad de Zaragoza. He is also a member of the Engineering Research Institute of Aragon (I3A). His research has been focused on hardware/software co-design, task scheduling techniques, dynamically reconfigurable hardware and FPGA design. His FPGA designs have received several international awards, including the first prize in the Design Competition of the IEEE International Conference on Field Programmable Technology in 2009 (FPT'09) and in 2010 (FPT'10).

Daniel Mozos is a permanent professor in the Department of Computer Architecture and Automatics of the Complutense University of Madrid, where he leads the GHADIR research group on dynamically reconfigurable architectures. His research interests include design automation, computer architecture, and reconfigurable computing. Mozos has a B.S. in physics and a Ph.D. in computer science from the Complutense University of Madrid.

Antonio Plaza received the M.Sc. degree in 1999 and the Ph.D. degree in 2002, both in Computer Engineering. Dr. Plaza is the Head of the Hyperspectral Computing Laboratory and an Associate Professor with the Department of Technology of Computers and Communications, University of Extremadura, Spain. His main research interests comprise hyperspectral image analysis, signal processing, and efficient implementations of large-scale scientific problems on high performance computing architectures, including commodity Beowulf clusters, heterogeneous networks of computers and grids, and specialized computer architectures such as field programmable gate arrays (FPGAs) or graphics processing units (GPUs). He is an Associate Editor for the IEEE Transactions on Geoscience and Remote Sensing journal in the areas of Hyperspectral Image Analysis and Signal Processing. He is also an Associate Editor for the Journal of Real-Time Image Processing. He is the project coordinator of the Hyperspectral Imaging Network, a four-year Marie Curie Research Training Network designed to build an interdisciplinary European research community focused on hyperspectral imaging activities. He is also a member of the Management Committee and coordinator of a working group in the Open European Network for High Performance Computing on Complex Environments, funded by the European Cooperation in Science and Technology programme. Dr. Plaza is a Senior Member of IEEE, and received the recognition of Best Reviewers of the IEEE Geoscience and Remote Sensing Letters journal in 2009, a Top Cited Article award of Elsevier's Journal of Parallel and Distributed Computing in 2005–2010, and the 2008 Best Paper award at the IEEE Symposium on Signal Processing and Information Technology. Additional information about the research activities of Dr. Plaza is available at http://www.umbc.edu/rssipl/people/aplaza.