
Dr Ing. Carl James Debono

12th November 2009

CCE 3013 – Computer Architecture Assignment: VisoMT: A Collaborative Multithreading Multicore Processor for Multimedia Applications with a Fast Data Switching Mechanism

The paper attached deals with a multicore solution for multimedia applications.

(a) Discuss the contents of this paper, highlighting the advantages and disadvantages of the proposed architecture.

(60% of marks)

(b) Suggest methods to enhance the efficiency of this architecture and single out any defects that you think are incorporated within this design.

(40% of marks)

Use any published material to sustain your arguments. The submitted report should follow A4 IEEE double-column format with single-spaced, twelve-point font in the text. The maximum report length is five (5) pages. Reports in excess of five pages will not be read and a zero mark will be assigned. All figures, tables, references, etc. are included in the page limit. A template in Word or LaTeX can be downloaded from http://www.ieee.org/go/conferencepublishing/templates.

Hard deadline for the submission of the assignment: 15th January 2010 at 12:00. No assignment will be accepted after this date and time. Students can work in a group, but each group is limited to a maximum of two.

UNIVERSITY OF MALTA

Msida − Malta

DEPARTMENT OF COMMUNICATIONS

AND COMPUTER ENGINEERING

L-UNIVERSITA` TA` MALTA Msida − Malta DIPARTIMENT TA’ L-INGINERIJA TAL-KOMUNIKAZZJONI U KOMPJUTER


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 19, NO. 11, NOVEMBER 2009 1633

VisoMT: A Collaborative Multithreading Multicore Processor for Multimedia Applications with a Fast Data Switching Mechanism

Wei-Chun Ku, Shu-Hsuan Chou, Jui-Chin Chu, Chi-Lin Liu, Tien-Fu Chen, Jiun-In Guo, and Jinn-Shyan Wang

Abstract—Multithreading and multicore processing are powerful ways to take advantage of parallelism in applications in order to boost a system's performance. However, exploring sufficient parallelism and achieving data locality with low communication overhead are still important research issues in embedded multithreading/multicore design. This paper introduces the design of a fast data switching mechanism between multilevel storage structures in a new multicore architecture. This paper makes several contributions to the development of contemporary sophisticated multimedia applications with advanced standards such as H.264. The first contribution, collaborative multithreading, tightly unifies a reduced instruction set computer and collaborative multithreading digital signal processing (DSP) in order to exploit high parallelism to provide sufficient computing power to applications. Each collaborative thread of our DSP is constructed by a heterogeneous simultaneous-multithreading single instruction, multiple data structure, and four media processing cores, which are connected by a fast switch providing a fast data exchange mechanism among correlative streams on a thread-level basis. Our second contribution is one-stop streaming processing, which aims to keep data in the system for as long as possible until it is no longer needed, thus making data more efficient to access. Our third contribution is a chunk threading programming model, including a thread management library and threading communication directives for reducing data communication and synchronization overhead. By a combination of coarse-grained and fine-grained threading, programmers can choose various threading levels based on the amount of data exchange in a program. With our proposed techniques and an appropriate programming model, we can reduce processing time by 54.9% in H.264 video encoding (common intermediate format video at 16.574 f/s) with the 1-virtual independent and streaming processing by open collaborative multithreading configuration, compared to the Texas Instruments C62 core that owns 8 function units. We realize our design as a prototype by chip implementation, and fabricate it as a chip based on the Taiwan Semiconductor Manufacturing Company Ltd. 0.13 µm process. The die size of the processor core is 16.12 mm2, including 414k logic transistors and 34.4 kB of on-chip static random access

Manuscript received February 16, 2009; revised May 24, 2009 and August 2, 2009. First version published September 1, 2009; current version published October 30, 2009. This paper was recommended by Associate Editor M. Mattavelli.

The authors are with the Department of Computer Science and Information Engineering, National Chung Cheng University, Chia-Yi 621, Taiwan (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2009.2031524

memory. The processor runs at 180 MHz at 1.2 V and consumes 245 mW according to post-simulation results.

Index Terms—Computer architecture, digital signal processors, multiprocessor interconnection, programming.

I. Introduction

AS CONTEMPORARY portable and home audio/video electronics demand increasing multimedia processing power and support multimode functionality on a single system, a powerful programmable processor-based solution is preferable to a solution using dedicated hardware. Moreover, such applications also necessitate a configurable architecture that can be adapted to use a low-cost embedded processor for various embedded applications without significantly changing its basic architecture. At present, several limitations of embedded multimedia applications make the design of embedded processors complex, and these difficulties arise from the trade-offs between performance, chip area, and power consumption.

Multimedia applications usually have two properties: good data locality and high data parallelism. Data parallelism is good for multithreaded programming, as it improves performance. Good data locality helps reduce external memory traffic by taking advantage of the cache, local memory, and the hierarchical memory system. There are many ways to exploit the computational power of data parallelism, such as superscalar, very long instruction word (VLIW), simultaneous multithreading (SMT) [1], vector machines [2]–[5], and multicore on chip [6]. Multithreading and multicore processing are powerful ways to boost a system's performance by taking advantage of parallelism in applications. Many phenomena about multimedia programs that run on multicore architectures are worthy of discussion.

We choose H.264 program segments as an example. The first phenomenon is intra-prediction, as illustrated in Fig. 1(a). A macroblock (MB) references other MBs that have been processed. Generally, a multicore architecture exchanges data between different cores through local storage as relay stations, without directly accessing other cores' data register files. Another example is motion estimation, shown in Fig. 1(b). Motion estimation transmits a large number of data blocks. The P-frame and B-frame may reference many frames in H.264 programming, but in most cases, these frames are

1051-8215/$26.00 © 2009 IEEE

Authorized licensed use limited to: University of Malta. Downloaded on November 12, 2009 at 03:14 from IEEE Xplore. Restrictions apply.


Fig. 1. Issues of the H.264 encoder in the multithreading style. (a) Small-block data communication: reference in the current frame. (b) Large-block data communication: reference in different frames.

stored in a different core's internal local storage or external memory. Therefore, efficient memory management is the most important issue in media applications. Overall, several factors determine the success of a multithreading or multicore architecture, including: 1) the parallel architecture; 2) the memory system; and 3) the efficiency of the programming model.

This paper introduces a multithreading multicore processor architecture. We aim at the design of fast data switching mechanisms between different levels of storage structures. Each core is an embedded collaborative multithreading core, called virtual independent and streaming processing by open collaborative multithreading (VisoMT), as shown in Fig. 2. We also present a programming model for software development. This paper makes several key contributions to the development of contemporary sophisticated multimedia applications with advanced standards such as H.264.

The first contribution, collaborative multithreading, unifies a reduced instruction set computer (RISC) and multiple multithreading digital signal processors (DSPs) by tightly coupled data switches to provide closely collaborative load sharing among threads. Each DSP core is constructed on a heterogeneous-SMT single instruction, multiple data (SIMD) structure, and four DSP cores are connected by a fast switch, providing a fast data exchange mechanism among correlative streams on a thread-level basis. The multithreading DSP is based on a heterogeneous-SMT structure, which is an efficient hybrid design that combines simultaneous multithreading data-paths (program control), data-intensive data-paths (computing power), and large numbers of banked register files for fast data exchange (we call these "streaming RFs"). Conceptually, the four DSPs allow concurrent execution of several threads, where each thread is a two-way VLIW architecture that issues two instructions at the same time. Because of the thread resource sharing at several levels in the middle of execution, the architecture allows multiple heterogeneous threads to be executed tightly and collaboratively.

Our second contribution is one-stop streaming processing, which aims to keep data circulating in the system for as long as possible, until it is no longer needed, thus increasing the efficiency of data access. This is accomplished by a fast data switching mechanism between multilevel storage structures to

Fig. 2. Fast data switching mechanism in multicore system architecture.

solve the huge bandwidth requirements of multimedia. It supports multilevel storage, including streaming register files, local storage, external memory, and a nonuniform cache switch. The above storage switching operations are controlled by a fast switching and migrating unit. The functionality of this unit includes: 1) core direct access to storage at each level; 2) data migration between any two levels; and 3) data communication between cores. In order to reduce data packing overhead and gain more parallelism, this unit also supports sophisticated data transpositions when transferring data between different levels.
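The three roles of the switching-and-migrating unit can be sketched as a toy software model. The level names and operations below are our own illustrative assumptions, not the unit's actual interface:

```python
# Toy model of the fast switching-and-migrating unit's three roles:
# 1) direct access to storage at one level, 2) data migration between
# two levels, and 3) (implicitly) data hand-off between cores via a
# shared level. Level names ("srf", "local", "ext") are hypothetical.

levels = {"srf": {}, "local": {}, "ext": {}}

def write(level, key, val):
    """Role 1: a core directly accesses storage at a given level."""
    levels[level][key] = val

def migrate(src, dst, key):
    """Role 2: move a block between any two levels of the hierarchy."""
    levels[dst][key] = levels[src].pop(key)

write("ext", "mb0", b"pixels")        # block arrives in external memory
migrate("ext", "srf", "mb0")          # pulled up into a streaming RF
assert "mb0" in levels["srf"] and "mb0" not in levels["ext"]
```

The point of the model is the invariant in the last line: after a migration the block exists at exactly one level, which is what lets the real unit avoid redundant external-memory traffic.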

The third contribution is a chunk threading programming model, including a thread management library and threading communication directives for reducing data communication and synchronization overhead. The chunk threading programming model is a combination of coarse-grained and fine-grained threading. In general, the data transmission between coarse-grained threads is less than that between fine-grained threads. Another observation is that the speed of data transfer between cores is low, so a coarse-grained thread is assigned to a VisoMT core. In collaborative fine-grained threading, threads process work collaboratively, and this threading level is used as much as possible in order to exchange large amounts of data in a streaming register file. Programmers can choose the threading level according to the amount of data exchanged in the program. This paper adopts streaming processing to construct the parallelism of programs, where data transmission goes through the one-stop mechanism to achieve the best performance.
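The granularity rule above can be sketched as a small decision function. The function name and the byte threshold are our own assumptions for illustration, not part of the paper's thread management library:

```python
# Hypothetical sketch of the chunk-threading granularity choice:
# tasks that exchange large blocks stay fine-grained, so the exchange
# happens in the fast streaming RFs; loosely coupled tasks go
# coarse-grained onto separate VisoMT cores, where the inter-core
# path is slower but rarely used.

COARSE = "coarse"   # assign to a separate VisoMT core
FINE = "fine"       # collaborative threads sharing a streaming RF

def choose_granularity(bytes_exchanged_per_block, srf_bank_bytes=128):
    """Pick a threading level from the data-exchange volume.
    srf_bank_bytes is an illustrative threshold, not a real parameter."""
    return FINE if bytes_exchanged_per_block >= srf_bank_bytes else COARSE

assert choose_granularity(1024) == FINE    # heavy exchange: keep it close
assert choose_granularity(16) == COARSE    # light exchange: separate core
```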

The rest of this paper is organized as follows. In Section II, we review several papers on multicore and multithreaded architectures and their programming models, and compare them to VisoMT. Section III introduces the VisoMT processor architecture. Section IV describes the fast data switching mechanism. Section V describes the programming model of VisoMT. As a case study, Section VI presents how we realized the H.264 video encoder [7] on the VisoMT. Section VII gives implementation results. Finally, Section VIII concludes the paper.

II. Related Work

The CELL processor [8], [9] consists mainly of a PowerPC processor core (PPE) and eight synergistic processor elements (SPEs). The job of the PPE core is control: for example, task-scheduling management and task assignment. In a streaming processing model, each task in an SPE accepts input data,


KU et al.: VISOMT: A COLLABORATIVE MULTITHREADING MULTICORE PROCESSOR FOR MULTIMEDIA APPLICATIONS WITH FAST DATA SWITCHING 1635

processes it, and then passes it to the next SPE in the chain. The streaming chain is implemented by the CELL's element interconnect bus and a large internal memory in each SPE. Thus, it can successfully minimize memory bandwidth and execution time. In the Sun Microsystems UltraSPARC T1 [10], [11], eight four-threaded cores share a four-banked L2 cache through a crossbar for good inter-core communication. Intra-core multithreading increases resource utilization to hide memory latency and dependency hazards, and there are up to 32 physical threads in the SPARC T1 [10], [11]. This design is efficient because it shares resources to a high degree, thus generating unprecedented levels of throughput per watt. The vector-thread architecture [2] exemplifies a lower-communication-overhead embedded multicore design, and its seamless intermixing of vector and multithreading computing models encodes application parallelism and locality flexibly and compactly. The multiple instruction stream processor [12] provides an operating system (OS)-based threading interface that allows multithreaded applications to scale to a larger number of cores regardless of OS support. It is a novel multiple instruction, multiple data extension, providing an alternative to the OS-based approach by which a multithreaded application program can directly manage multiple architectural processor-core resources.

CELL [8], [9], Vector [2], and this paper belong to the heterogeneous multicore class, while SPARC T1 [10], [11] is a homogeneous multicore. CELL [8], [9] and Vector [2] use a master core to control accelerated threads in slave cores. This paper takes the strategy of separating the flow-control jobs between the RISC and the T-core: the RISC maintains the different coarse-grained threads, and the T-core controls all fine-grained threads within one coarse-grained thread. The advantage of this design is that it can avoid workload imbalance and improve performance. For the cache system, SPARC T1 [10], [11] is designed for general-purpose applications, so each core has its own L1 cache with a cache-coherency protocol and shares the L2 cache. In contrast, the DSP-like processors CELL [8], [9], Vector [2], and the proposed VisoMT do not need a cache coherence mechanism, because these architectures provide distributed local buffers and a unified cache. Each SPE in CELL [8], [9] supports one large local buffer (256 kB). There are many small, distributed buffers in Vector [2]. VisoMT's local buffers, shared by all cores, are banked and hierarchical. In the programming model, CELL [8], [9], Vector [2], and this paper all support streaming processing, but their methodologies differ. CELL [8], [9] relies on a high-bandwidth ring bus: it can pass data to any core, but transfer time still exists. Vector [2] uses dedicated data paths to implement it; those paths are fixed, so the transfer path cannot be set dynamically. This paper provides a configurable parallel access switch (CPAS) to support it, with the advantages of low latency, high bandwidth, and high flexibility. We describe these advantages in Section IV, on the fast data switching mechanism.

Programming strategies are important for exploiting performance on multicore architectures. Several papers in the literature focus on optimizing multimedia applications or rewriting programs in a multithreading style. Meenderinck et al. [13] discuss thread parallelism that can have one of two parallel styles: task-level parallelism or data-level parallelism. Task-level parallelism treats a basic block or function as a scheduling unit. Although task-level partitioning is very easy, the disadvantage is that the number of tasks is often small and it is hard to achieve a balanced workload. In the H.264 program, data-level parallelism has several levels, including the group-of-picture (GOP) level, frame level, slice level, MB level, and block level. Rodriguez et al. [14] adopt a combination of the GOP level and slice level. They assigned independent GOPs to different processor groups; because each slice is data-independent, each processor within a group deals with one slice. The benefits of such an approach are an effective program with multimedia features, rapid two-level independent distribution of information to all of the processors, and improved workload balance. The disadvantages are that information distribution and transmission time are too long, and that increasing the number of slices per frame increases the bit rate for the same quality level. Momcilovic and Sousa [15] implement motion estimation (ME) for the H.264 encoder on a CELL processor [8], [9]. The power processor unit runs the main task, which includes almost all video encoding processes except for the ME part, and the control procedure. All SPE processes use the same motion estimation procedure, but on different MBs.
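The two-level GOP/slice partitioning of Rodriguez et al. can be sketched as a simple work-assignment function. The function and its parameters are our own illustration, not code from that paper:

```python
# Illustrative sketch of GOP/slice-level data parallelism: each
# processor group takes one independent GOP, and each processor inside
# the group encodes one slice of the current frame. Names are assumed.

def partition(num_gops, slices_per_frame):
    """Map processor group g to its list of (gop, slice) work units."""
    return {g: [(g, s) for s in range(slices_per_frame)]
            for g in range(num_gops)}

work = partition(num_gops=2, slices_per_frame=3)
assert work[0] == [(0, 0), (0, 1), (0, 2)]   # group 0 owns GOP 0's slices
assert len(work) == 2                        # one entry per GOP/group
```

The sketch also makes the stated disadvantage concrete: raising `slices_per_frame` adds work units (and parallelism) but, in a real encoder, each extra slice carries header overhead that raises the bit rate at the same quality.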

Surveying the existing H.264 video decoders [16]–[22], we conclude that there are several key design considerations in an H.264 video decoder. First, the decoding symbol rate of context-adaptive variable-length coding/context-adaptive binary arithmetic coding will influence the performance of the bitstream parser, especially in a processor-based solution, because bit-manipulation operations are not well suited to execution on a processor. Second, the coding tools used by H.264 are much more complex than the tools in previous standards, like MPEG-2 and MPEG-4. Therefore, the processor should incorporate instruction set architecture (ISA) extensions and a load/store unit optimized for the video-processing domain. The ISA extensions improve performance on video-processing kernels. Third, the memory bandwidth requirement of motion compensation is quite large. Hence, the data cache policies and prefetching techniques should be designed to allow efficient access to multimedia data.

III. VisoMT Processor Architecture

To enable efficient communication and lightweight synchronization in embedded systems, we have to reorganize this multicore architecture and minimize its complexity. In fact, the flow-control instructions in a DSP are generally used less often than data-computation instructions. Hence, we separate the DSP into two parts: the SIMD data-path and the general flow-control path. The four control paths of the four DSPs are merged into a four-way simultaneous multithreading microcontroller unit, while the other four SIMD paths still independently offer high data parallelism for computationally intensive processing. They are tightly coupled by shared streaming registers, which


Fig. 3. Block diagram of the VisoMT processor.

achieve fast data communication and streaming synchronizedprocessing.

As VisoMT targets multimedia embedded systems, several complex multicore design elements, such as cache coherency, the memory hierarchy, and the interconnect, should be tailored accordingly. Therefore, the proposed one-stop streaming processing preserves both fine-grained and coarse-grained data locality with efficient sharing, and a large percentage of the communication overhead is removed, as shown in our experiments. In addition, a unified L1 cache is shared by the RISC and the multithreading DSP, and two-port static random access memory (SRAM) supports a true two-port cache design without performance degradation from conflicts. Sharing the L1 cache also helps to reduce communication overhead and cache misses, especially in fine-grained collaborative multithreading.

A. Architecture Overview

Fig. 3 presents the block diagram of the VisoMT processor, which contains a VisoMT RISC, the proposed multithreading DSPs, a two-ported I/D-cache, and our one-stop streaming processing. The VisoMT RISC is a scalar RISC architecture with a thread-management ISA and interface, and it plays the master processor role in coordinating the main control. The multithreading DSP includes four data-intensive media cores (64-bit customized SIMD paths, called M-cores) to provide four SIMD execution streams in parallel, and a multithreading control core with four physical thread contexts (called the T-core) to handle all the non-data-intensive instructions (program flow control) in SMT fashion. Moreover, the four SIMD threads share multiple streaming register files to achieve fast data communication on a thread-level basis through CPAS. The CPAS switch supports a maximum bandwidth of 160 B/cycle for parallel access by multiple SIMD streams.

The scalar RISC and SIMD multithreading DSP are tightly coupled by a directive event-trigger connection and a shared L1 cache. The multithreading DSP is a heterogeneous-SMT architecture with a two-way VLIW instruction bundle. Furthermore, the background bulk data-transfer unit links multiple streaming register files with efficient streaming buffers to keep data in the core as long as possible (called one-stop streaming processing). Overall, our design provides resource sharing for common control components and combines multithreading with data-intensive media cores, which improves performance and reduces cost for embedded multimedia systems. The design is open and configurable for different application requirements: users can add customized SIMD media cores as plug-ins and collaboratively share the streaming register file. As a result, VisoMT stands for the concept of virtual independent and streaming processing by open collaborative multithreading.

The entire processor adopts a collaborative threading mechanism with six physical threads, including a main thread for program control (running on the VisoMT RISC), several chunk threads for media processing acceleration (running on the T-core and M-cores), and a data-transfer thread (running on the BB Data Transfer unit). The main thread running on the VisoMT RISC is responsible for control and synchronization of data communication among threads: that is, the main thread plays a scheduler role managing the other hetero-threads. The chunk threads provide high computation capability and lightweight program flow-control capability for multistandard media application acceleration. Efficient data communication at the thread level is realized by multiple banked register files and a configurable parallel access switch. The approach moves data communication to the register-file level and preserves fine-grained data-locality sharing at run time. Data in register banks can be sent to other chunk threads for collaboration by four bank-usage instructions: split, copy, share, and swap. With fast data communication, multimedia applications can enjoy a high degree of data reuse at the thread level.
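Two of the bank-usage instructions can be modeled as ownership operations on the banked register files. This is a toy software model of the idea, not the actual ISA semantics; the hardware performs these hand-offs by register renaming:

```python
# Toy model of two of the four bank-usage instructions (split, copy,
# share, swap): we track which chunk thread owns each streaming-RF
# bank. "swap" exchanges ownership of two banks; "share" lets a second
# thread read a bank. Thread/bank names are illustrative.

owner = {0: "t0", 1: "t0", 2: "t1", 3: "t1"}  # bank -> owning thread

def swap(bank_a, bank_b):
    """Exchange two banks between their owning threads (no data copy)."""
    owner[bank_a], owner[bank_b] = owner[bank_b], owner[bank_a]

def share(bank, thread):
    """Give a second thread read access; model owners as a tuple."""
    owner[bank] = (owner[bank], thread)

swap(1, 2)                                  # t0's bank 1 <-> t1's bank 2
assert owner[1] == "t1" and owner[2] == "t0"
share(3, "t0")                              # t0 and t1 both read bank 3
assert owner[3] == ("t1", "t0")
```

The design point the model captures is that a "transfer" between chunk threads changes only the bank mapping, so whole register banks move between threads without copying their contents.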

One-stop streaming processing aims to keep media data in the memory hierarchy for as long as possible until it is no longer needed. The memory system is implemented by a background bulk data-transfer (BB Data Transfer) unit on eight streaming register files and three levels of streaming buffers. The streaming register files with a CPAS switch offer fine-grained locality for data such as MBs, and on-chip SRAM usable over several search ranges provides the coarse-grained locality. The goal of the mechanism is that once data are brought into the system, they will not be moved into or out of external memory during their lifetimes.

Finally, we target flexible configurations for different application requirements, including the number of M-cores (maximum 4) and the size of the streaming register files in the SIMD multithreading DSP. The 2-VisoMT configuration, delivering higher performance, is proposed for better scalability by extending one-stop streaming processing. Briefly, the 2-VisoMT is composed of one scalar RISC and two SIMD multithreading DSPs, and each DSP has its own instruction cache, streaming register files, local memory, and data-transfer unit as an independent subsystem. Between the RISC and the DSPs, there are still directive event-trigger connections for fast synchronization, and a simple interconnect is provided for inter-core communication. In order to reduce coherency complexity and cost, the original two-port data cache is shared through a crossbar. Therefore, the RISC and DSPs perform independent processing in their subsystems without structure-dependent interference. In addition, the low communication/synchronization overhead achieved by extending one-stop streaming processing ensures efficiency for both fine-grained and coarse-grained multithreading.


Fig. 4. Micro-architecture of the VisoMT multithreading DSP (M-cores: data-intensive media cores; T-core: threading control core).

B. VisoMT Multithreading DSP

The VisoMT multithreading DSP is constructed on the heterogeneous-SMT structure, which is an efficient hybrid internal multicore design that combines simultaneous multithreading data-paths (four thread contexts), data-intensive data-paths (four 64-bit SIMD streams), and a large number of banked register files for fast data exchange (160 B/cycle, parallel access). In addition, it is open and configurable for different applications' requirements; different customized SIMD cores can plug into a unified interface and a high-bandwidth data-path. Fig. 4 illustrates the micro-architecture of the VisoMT multithreading DSP in detail, including the data-path of the heterogeneous-SMT front-end, the threading control core (T-core), the data-intensive media cores (M-cores), and the streaming register files.

Four physical chunk threads are defined in our programming model, and every chunk thread has its own flow control and data computation. A chunk thread's instruction stream consists of two-way VLIW bundles, each consisting of one control micro-operation (control µ-op, or Cµ-op for short) and one data-computation micro-operation (data µ-op, or Dµ-op for short). Each VLIW instruction in the stream separately issues its Cµ-op to the T-core and directs the Dµ-op to the corresponding M-core through the heterogeneous-SMT front-end and the thread scheduler. The thread scheduler in the T-core serves the four chunk threads' Cµ-ops with a fair round-robin scheduling policy and synchronizes with the heterogeneous-SMT front-end for flow control. Compared to the design of a traditional SMT front-end, our SMT front-end only partially shares some critical function units (the T-core) in order to reduce the overall complexity, and it also provides a simple control data-path and easily configurable properties for customized SIMD cores (the M-cores).

As control µ-ops are executed less frequently than data-computation µ-ops in most multimedia applications, two simultaneously multithreaded physical concurrent data-paths are enough for the T-core to process the Cµ-ops of the four threads' contexts in our experiments. Fig. 4 (T-core) shows the data-path partition, which efficiently separates data processing

Fig. 5. Dataflow scheme of VisoMT multithreading DSP.

and load/store processing into two functional units in order to parallelize threads. We support a nonblocking load/store mechanism on cache misses in order to hide memory latency efficiently and improve parallelism. Because the T-core is the kernel of the VisoMT multithreading DSP, it is not only responsible for internal control, but also communicates with the VisoMT RISC. Therefore, it supports several customized instructions, including thread management among threads, renaming of the streaming RFs, special moves between the streaming RFs and the local RF, and commands for background bulk data transfer. The thread-management instructions handle events between the VisoMT RISC and the multithreading DSP, including thread creation, thread halting, parameter passing, and thread synchronization. An interrupt-based mechanism makes programming flexible and reduces useless waiting time.
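The RISC-to-DSP thread-management handshake can be sketched as an event queue. The event names and payloads below are assumptions made for illustration; the real mechanism is interrupt-based hardware, not a software queue:

```python
# Minimal sketch of the thread-management events exchanged between the
# VisoMT RISC (main thread) and the multithreading DSP: creation with
# parameter passing, then halting. Event tuples are hypothetical.

import queue

events = queue.Queue()   # stands in for the directive event-trigger link

def create_thread(tid, params):
    """RISC asks the DSP to start chunk thread `tid` with parameters."""
    events.put(("create", tid, params))

def halt_thread(tid):
    """RISC asks the DSP to halt chunk thread `tid`."""
    events.put(("halt", tid))

create_thread(2, {"mb_row": 5})          # parameter passing at creation
halt_thread(2)
assert events.get() == ("create", 2, {"mb_row": 5})
assert events.get() == ("halt", 2)       # events arrive in issue order
```

The interrupt-based design the paper describes has the same effect as this queue: the main thread issues an event and continues, instead of busy-waiting for the DSP to acknowledge it.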

Eight banked streaming RFs (2 kB in total; each bank is 2R1W, 32 bits wide, with 32 entries) with a configurable parallel access switch (CPAS switch) mainly serve the M-cores' access and computing. However, data input–output among the streaming RFs and control of the CPAS switch should also be considered for flexibility and efficiency. Therefore, the T-core can access the streaming RFs (maximum 48 B/cycle) through special move instructions. The T-core may process data first and then transfer it to the streaming RFs for M-core processing, or the M-core can process it first and then transfer it to the T-core for program-flow judgments. Furthermore, data can be processed in the T-core and M-core by turns, and this can be done easily through the VLIW bundle without any synchronization. Moreover, several fast data exchanges (SWAP, SHARE, and so on) between chunk threads are done for streaming processing by register-bank selection. The instructions supported by the BB Data Transfer unit include concurrent prefetches, post-stores, and sophisticated data transpositions between the streaming RFs and the streaming buffers.

Fig. 5 shows the dataflow scheme of the heterogeneous-SMT with streaming-RF architecture. The dataflow scheme shows several slices of instruction slots interacting with the streaming RFs through the parallel execution of the four chunk threads and the transfer thread, and it also illustrates the fast data exchange mechanism. The T-core exhibits two-issue (data-processing

Authorized licensed use limited to: University of Malta. Downloaded on November 12, 2009 at 03:14 from IEEE Xplore. Restrictions apply.


1638 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 19, NO. 11, NOVEMBER 2009

Fig. 6. Concept of fast data switching and data movement.

and load/store processing) simultaneous multithreaded processing of instruction slots with the four chunk threads' contexts, and dynamically accesses different banked streaming RFs with different threads. The T-core sends commands to the BB Data Transfer unit for prefetching, post-storing, and sophisticated data transposition, serving as the input and output of the streaming RFs.

The VLIW instruction bundle is implemented as one control µ-op in the T-core and one data computation µ-op in the M-core, and a bundle can be issued as only the control µ-op, only the data computation µ-op, or both µ-ops at the same time. As shown for M-core 0, the first slice of instruction slots accesses banks 1 and 2 of the streaming RFs, and bank 2 is then swapped to M-core 1 for another function computation by register renaming and thread synchronization. The second instruction slice in M-core 0 shares bank 3 with M-core 1 by renaming to the same register bank; the bank may be read-only or have its entries partitioned for the different M-cores. In the dataflow example, four chunk threads are responsible for four different functions as software pipelining with block data independence; this is also called streaming processing. The example illustrates independent data blocks propagating among threads: they are prefetched by the transfer thread, passed through chunk thread 0, thread 1, thread 2, and thread 3, and finally stored into memory by the transfer thread. Overall, at most six instructions execute in parallel across the four chunk threads, and they dynamically share the streaming RFs while the transfer thread prefetches and post-stores. The proposed architecture is highly suitable for streaming processing of independent data blocks, running several collaborative threads to do the job efficiently.
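The bundle behavior just described can be sketched in a few lines. This is an illustrative model, not the authors' encoding; the field names (control_uop, data_uop) are assumptions for illustration only.

```python
# Illustrative sketch (assumed field names, not the paper's encoding) of a VLIW
# bundle that pairs an optional T-core control micro-op with an optional M-core
# data computation micro-op, as described in the text.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Bundle:
    control_uop: Optional[str] = None   # executed on the T-core
    data_uop: Optional[str] = None      # executed on an M-core

    def issue(self):
        """Return the micro-ops issued this cycle: one, the other, or both."""
        return [u for u in (self.control_uop, self.data_uop) if u is not None]

# A bundle may carry only control, only data, or both micro-ops at once.
assert len(Bundle(control_uop="branch").issue()) == 1
assert len(Bundle(control_uop="loop", data_uop="simd_add").issue()) == 2
```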

IV. Fast Data Switching Mechanism

This mechanism includes two parts: data switching and data movement. In data communication on a general multicore platform, data usually have to move through global storage as a relay station, which wastes a great deal of access time. We therefore take advantage of multilevel switching to lower the access time. When data communication

Fig. 7. Data access path of one-stop streaming processing.

occurs, cores can read each other's data directly by switching, without moving bulk data. Data switching supports bank-to-bank, local-storage-to-local-storage, and cache-entry switching. RF bank switching is achieved by the CPAS; the time for reading data from a different RF bank within the same VisoMT core equals the time for each core to read its own RF bank. The remaining types of switching are supported by the BB data transfer unit; however, when a core reads data many times, or reads bulk data, through switching, it saves more time by using the data movement mechanism to move the data into its own local storage.
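The switching-versus-movement trade-off above can be captured with a minimal cost model. The cycle counts used here are invented for illustration only (the paper does not give per-access switch costs); the model only encodes the stated rule that repeated or bulk reads amortize a one-time move into local storage.

```python
# Minimal cost model (all cycle numbers are assumptions, not from the paper)
# for the trade-off described in the text: read remote data through the switch
# on every access, or move it once into local storage and read locally after.
def best_strategy(n_reads, switch_cost=4, move_cost=18, local_cost=1):
    via_switch = n_reads * switch_cost            # pay the switch on each read
    via_move = move_cost + n_reads * local_cost   # one bulk move, then local
    return "switch" if via_switch <= via_move else "move"

# Few reads favour direct switching; many reads amortize the one-time move.
assert best_strategy(2) == "switch"
assert best_strategy(20) == "move"
```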

As shown in Fig. 6, the BB data transfer unit first transfers medium-sized data blocks from off-chip memory to the local storage and RF banks in each VisoMT core. After M-core 1 processes the data in VisoMT core 1, M-core 2 can read the data directly through the CPAS, without moving it into its own bank. Next, VisoMT core 3 also reads the RF bank of VisoMT core 2 directly through the switch. This switching between different cores is not as fast as the CPAS, so when the number of reads increases, the BB data transfer unit transfers the data directly into the RF bank of VisoMT core 3. In addition to RF bank switching, the mechanism also supports local storage switching. Regarding flow control, the mechanism also supports variable switching by nonuniform cache switching. Finally, the BB data transfer unit stores the processed data back into off-chip memory.

A. One-stop Streaming Processing

Multimedia applications usually have the properties of block data independence and high data locality. The block data independence property, exploited as data-level parallelism (DLP), improves the performance of multithreaded programs. Therefore, in this paper, we propose four collaborative SIMD threads to maximize the DLP. The high computational requirements of multimedia applications are addressed by the multiple SIMD threads, but high memory bandwidth is still required of programmable processors. Fortunately, we can exploit the high data locality property to reduce the external memory bandwidth requirements with a hierarchical memory system, in which storage has the advantages of larger bandwidth and shorter latency the closer it is to the core. Our proposed one-stop streaming buffers (memory system) include the streaming RFs (L1), internal memory (L2), zero-bus-turnaround (ZBT) memory (L3), and external memory. Fig. 7 shows the specification of each level of streaming buffer. The BB Data Transfer unit is responsible for the control of the proposed streaming buffers, and it



KU et al.: VISOMT: A COLLABORATIVE MULTITHREADING MULTICORE PROCESSOR FOR MULTIMEDIA APPLICATIONS WITH FAST DATA SWITCHING 1639

performs concurrent prefetches, post-stores, and sophisticated data transpositions between them. Several types of sophisticated data transposition are supported, for example, linear-to-square, square-to-linear, row transfer, column transfer, transpose, and data packing. This not only hides memory latency but also reduces data transposition overhead, both of which are time-consuming on programmable processors.
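Functionally, three of the transposition modes named above can be sketched as follows. The hardware performs these in the BB Data Transfer unit during the move itself; this is only a behavioral model of what the rearrangements do.

```python
# Functional sketches of three transposition modes from the text:
# linear-to-square folds a flat stream into an n x n block, square-to-linear
# flattens it back, and transpose swaps rows and columns.
def linear_to_square(stream, n):
    return [stream[i * n:(i + 1) * n] for i in range(n)]

def square_to_linear(block):
    return [v for row in block for v in row]

def transpose(block):
    return [list(col) for col in zip(*block)]

block = linear_to_square(list(range(16)), 4)
assert block[1] == [4, 5, 6, 7]                  # second row of the 4x4 block
assert transpose(block)[0] == [0, 4, 8, 12]      # first column becomes a row
assert square_to_linear(block) == list(range(16))  # round trip is lossless
```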

Fig. 7 shows the data access path of one-stop streaming processing compared to the conventional direct synchronous dynamic random access memory path. It illustrates the central idea of the proposed one-stop approach: keeping data in the core for as long as possible. In conventional embedded memory systems, idling caused by memory latency occupies a large percentage of the execution time, reducing performance and jamming the bus. In contrast, one-stop streaming processing may prefetch a frame for motion search into the ZBT memory (L3) and then dynamically bulk-move the selected search range into internal memory (L2). Data transfers then occur frequently between the streaming RFs and internal memory: several MBs are transferred into the streaming RFs for chunk-thread computing, and several MBs are transferred out and kept in internal memory for data reuse.

One-stop streaming processing exploits three levels of data reuse to minimize the data bandwidth requirements: the fastest data exchange among M-cores, using the CPAS for the streaming register files; mid-grained data movement and transposition between the register banks and internal memory, via the BB Data Transfer unit; and direct-memory-access-like bulk movement between external and internal memory, also via the BB Data Transfer unit. An independent thread handles all of the one-stop streaming processing.

As a data transfer example, our system supports two kinds of fast inter-process data communication over the 64-bit internal bus: bulk data transfer from local memory and switching of streaming register banks. When sharing 16 × 64-bit data (one MB) between two processes, bank switching costs only 1 cycle, and bulk data transfer takes 18 cycles (2 cycles to set up the BB Data Transfer unit and 16 cycles for 16 data transactions). In contrast, conventional inter-process communication on a shared-bus (32-bit) multicore architecture pays a much higher penalty, as shown in Fig. 8, once the impact of external memory latency, bus bandwidth, and bus contention is considered. In our estimation, an average 32-bit transaction takes 40 cycles (mixing burst and nonburst transfers). Hence, the same 16 × 64-bit data communication through external memory (worst case) might need 1280 cycles (40 × 32).
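The cycle counts quoted above follow directly from the stated figures; the arithmetic can be reproduced as:

```python
# Reproducing the paper's data-transfer arithmetic for sharing a 16 x 64-bit
# macroblock between two processes.
setup, transactions = 2, 16
bank_switch = 1                       # one CPAS bank-switch cycle
bulk_transfer = setup + transactions  # BB Data Transfer unit: 2 + 16 = 18

# Conventional shared 32-bit bus through external memory (worst case):
avg_cycles_per_32b = 40               # paper's estimate, burst/nonburst mix
words_32b = 16 * 64 // 32             # 16 x 64-bit payload = 32 transactions
shared_bus = avg_cycles_per_32b * words_32b

assert bulk_transfer == 18
assert shared_bus == 1280
```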

B. Configurable Parallel Access Switch

The configurable parallel access switch connects the eight streaming register files, the T-core (4R2W, each port 8 B wide), the M-cores (8R4W, each port 8 B wide), and the BB data transfer unit (1R1W, each port 8 B wide). It fully supports parallel access between eight masters and eight slaves, with a bandwidth of up to 160 B/cycle. The four chunk threads thus logically share the 2 kB of streaming register files, and the switch provides a mechanism for fast data

Fig. 8. Configurable parallel access switch with four-node ring interconnections.

exchange at the thread level, very high bandwidth, low access time, and larger storage for exploiting locality.

However, point-to-point interconnection causes high area complexity and long wire delays in the physical layout. This paper therefore proposes a low-cost, simple, and highly efficient interconnection: a circuit-switch design with a four-node ring structure, containing four router nodes with one clockwise and one counterclockwise datapath. This topology offers maximum parallel access, the shortest transfer time, easy arbitration, and low design complexity. There are only 16 possible path compositions, so the complexity of the arbiter is low. Furthermore, a direction prediction mechanism is provided to reduce the processing time, and it costs only one cycle to read or write the streaming register files. Fig. 8 shows the configurable parallel access switch with two four-node ring structures. One ring is responsible for streaming RFs 0, 2, 4, and 6, and the other for streaming RFs 1, 3, 5, and 7. The T-core, M-cores, and BB data transfer unit can access the streaming RFs in parallel as long as they do not conflict on the same destination or the same ring.
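On a four-node ring with one datapath per direction, direction selection can be sketched as a shortest-hop choice. The paper does not spell out its arbitration policy, so the routing rule below is an assumption used only to illustrate why arbitration on this topology stays simple.

```python
# Illustrative sketch (assumed routing policy, not the paper's arbiter) of
# direction selection on a four-node ring with one clockwise and one
# counterclockwise datapath: take whichever direction needs fewer hops.
def ring_route(src, dst, n=4):
    cw = (dst - src) % n    # hops going clockwise
    ccw = (src - dst) % n   # hops going counterclockwise
    return ("clockwise", cw) if cw <= ccw else ("counterclockwise", ccw)

assert ring_route(0, 1) == ("clockwise", 1)
assert ring_route(0, 3) == ("counterclockwise", 1)
assert ring_route(0, 2)[1] == 2   # the opposite node is two hops either way
```

With only four nodes, the worst case is two hops, which is consistent with the "shortest transfer time" and "easy arbitration" properties claimed in the text.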

V. Chunk Threading Programming Model

A. Parallelization Approach

In computer architecture, designing multithreaded programs for multicore platforms is becoming the trend, but it requires programmers to change their style from sequential to multithreaded coding. Many works approach this from the software or the hardware viewpoint, such as Pthreads, OpenMP, or transactional memory [23]. This paper provides chunk threading, which comprises coarse-grained threading and fine-grained threading. As shown in Fig. 9, coarse-grained threading adopts GOP-level parallelism, and all parallel threads operate on independent GOPs. Each coarse-grained thread comprises many small fine-grained threads. For instance, in Fig. 9 each coarse-grained thread is an H.264 encoder program, and this program can be decomposed into various fine-grained threads. Fine-grained threads are designed for task-level parallelism, and they achieve task parallelism with streaming processing. Texture coding, for example, includes transform (T), quantization (Q), inverse transform (IT), inverse quantization (IQ), and lossless compression




Fig. 9. Concept of chunk threading programming model.

(compression) operations. The main thread controls all threads, and data communication works by streaming register file bank exchange. In the following sections, we discuss the programming model, how it supports one-stop streaming processing, streaming register file bank management, and an example of chunk threading.

B. Programming Design Flow

This section describes how to write a program with our programming model, including connecting threads, describing the threads' execution sequence, the data bank status, and the application programming interface (API). The chunk threading programming design flow has four steps, which we list below and show in Fig. 10.

1) Step 1: The top thread creates all child threads. These child threads block until someone activates them. Of course, a child thread can itself be a top thread that creates further child threads.

2) Step 2: Describe the relationships between the threads. Before any thread executes, we describe the execution flow. When the top thread is active, it does not need to know each child thread's status; it only needs to look after the one child thread that will return the processed data.

3) Step 3: Describe how the data banks are passed between threads. Each thread owns several data banks, but we need a way to describe which banks are inputs and which are outputs. We describe this with “ports.” In Fig. 10, the port type contains an “in port” and an “out port.” A thread processes data from its in-port and then sends the data to the next thread through its out-port. A thread sometimes needs just one bank, in which case it gets and puts data through port 0. For data bank passing, we take advantage of the architecture: bank switching by the CPAS and the BB data transfer unit.

4) Step 4: Prepare the data and run. Once the above initialization is complete, we prepare the data and place it into the banks. The next step is to start the top thread in the execution flow. Finally, the top thread only needs to wait for the data that the threads return; when the top thread has collected all the data, it returns the data and terminates.
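The four steps above can be sketched as a pseudo-API. Every name here (Thread, create_child, connect, bind_port) is hypothetical, invented for illustration; the paper's actual API appears in its Figs. 10 and 12.

```python
# Pseudo-API sketch of the four-step design flow. All class/method names are
# assumptions for illustration; only the flow itself comes from the text.
class Thread:
    def __init__(self, name):
        self.name, self.children, self.banks = name, [], {}

    def create_child(self, name):       # Step 1: create (blocked) children
        child = Thread(name)
        self.children.append(child)
        return child

    def connect(self, other):           # Step 2: describe the execution flow
        self.next = other

    def bind_port(self, port, bank):    # Step 3: map in/out ports to banks
        self.banks[port] = bank

top = Thread("main")
t_q = top.create_child("quant")
t_iq = top.create_child("inv_quant")
t_q.connect(t_iq)
t_q.bind_port("in0", "bank1")
t_q.bind_port("out0", "bank2")

# Step 4 (prepare data, start the top thread, wait for results) is not
# modelled here; we only check the constructed flow description.
assert [c.name for c in top.children] == ["quant", "inv_quant"]
assert t_q.banks["out0"] == "bank2"
```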

Fig. 10. Design flow based on thread programming model.

Fig. 11. Five hardware directives for passing register banks among chunk threads based on the CPAS.

C. Register File Banks Management

Based on the CPAS and the BB data transfer unit, we provide versatile functions for managing the register file data banks inexpensively. As shown in Fig. 11, we provide data split, merge, and copy operations for different threads. Two collaborative threads can perform the same task on different data with the split and merge operations, or perform different tasks on the same data with the copy operation, without any memory access. When several collaborative threads cooperate on a single job concurrently, they usually have some overlapping data locality; in this case, we can use the share or swap data operations, with careful synchronization control.

The design considerations in data bank management are performance trade-offs, the size of the data banks, and hardware cost. The functions are divided into two sets: RISC-supported and T-core-supported. The RISC-supported functions include the assign, swap, and share primitives, because these primitives involve bank selector setting and scheduling, which require a centralized unit to handle the global settings. The T-core-supported functions include the split, copy, and merge primitives, which also need the RISC for the global settings; however, these functions have to move data between different data banks, so they need more time to finish transferring data among the register banks.
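The distinction between the two primitive sets can be made concrete with a toy model: swap (and assign/share) only retargets bank selectors, while split/copy/merge actually moves data between banks. The data structures below are assumptions for illustration, not the hardware's representation.

```python
# Toy model (assumed data structures) of the bank-passing primitives named in
# the text. swap only exchanges two threads' bank selectors (cheap, no data
# moves); copy and split move data between banks (slower in the hardware).
def swap(owners, a, b):
    owners[a], owners[b] = owners[b], owners[a]

def copy(banks, src, dst):
    banks[dst] = list(banks[src])

def split(banks, src, d0, d1):
    data = banks[src]
    mid = len(data) // 2
    banks[d0], banks[d1] = data[:mid], data[mid:]

owners = {"thread0": "bank1", "thread1": "bank2"}
swap(owners, "thread0", "thread1")
assert owners == {"thread0": "bank2", "thread1": "bank1"}

banks = {"bank1": [1, 2, 3, 4]}
split(banks, "bank1", "bank2", "bank3")
assert banks["bank2"] == [1, 2] and banks["bank3"] == [3, 4]
```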

D. A Program Example

Fig. 12 lists the texture coding program pseudo-code to demonstrate the concept. The Main




Fig. 12. Example program.

function constructs the coarse-grained threads, prepares and passes the data, and activates the threads. The coarse-grained keyword is designed to help the library assign a thread to the VisoMT cores; it indicates that these threads do not have tight data dependencies. The temp block has the switchable property, and this variable is controlled by the BB data transfer unit so that it can move between the different storage layers. Texture coding, a coarse-grained thread, contains several fine-grained threads. These fine-grained threads are connected by ports, and the ports also describe the data-passing sequence. The quant 4 × 4 thread first arranges the data with the data bank API, improves performance with the M-core API, and finally passes the data to the other threads according to the out-port description. The library keeps the output data in the most recently used register file bank. We show the case study (the H.264 video encoder), designed according to our programming model and library API, in the next section.

VI. Case Study: H.264 Video Encoder

We adopt JM9.7 [24] as the reference software to implement the H.264 baseline profile video encoder on VisoMT, exploiting the thread-parallelism features of VisoMT to improve performance. In addition, in order to eliminate data dependencies between processing threads, we adopt an MB-level parallel coding strategy. To overcome the design challenge of the large memory bandwidth needed for storing intermediate results, we configure the CPAS to select the different register file banks adaptively for the M-cores. Moreover, we adopt the one-stop streaming processing technique to maximize data

Fig. 13. H.264 program design with one-stop streaming.

Fig. 14. Performance evaluation of the proposed design when realizing the H.264 video encoder.

reuse for reducing memory bandwidth, as in Fig. 13. In this case study, we implement the three major functions of H.264: motion estimation, intra-prediction, and texture coding, which together account for about 75% of the complexity of the H.264 video encoder. In the following, we describe the realization of these three functions in detail.

A. Motion Estimation

ME is the most complex task in the H.264 video encoder. It runs a block-matching algorithm between successive frames to determine a motion vector (MV) for each MB in the current frame, and then obtains the residual data to be compressed. First, we adopt our low-complexity, high-quality fast algorithms for H.264 integer motion estimation (IME) and fractional motion estimation (FME), reducing the complexity by over 90% compared to the full-search ME algorithms adopted in JM9.7 [24]. The key operation of ME, the sum of absolute differences (SAD), is used in the matching criterion, and it can be accelerated by the thread-parallel computation of the M-cores. We load the current MB and the candidate block to perform the SAD operations. If the search range of ME is from −15 to +16, the candidate block is a 48 × 48 block. For H.264 encoding of common intermediate format (CIF) video at 30 f/s, the total memory bandwidth is about 6685 MB, which would cause many cache misses in the processor. Since there is much overlapping data when loading the candidate blocks for ME, we reduce the memory bandwidth by using the proposed one-stop streaming processing technique. Fig. 13 shows the one-stop streaming processing technique for ME. First, we load multiple frames' data into the off-chip SRAM, and then move the candidates for the current MB into the on-chip SRAM. Hence, we can perform the SAD operations by loading data from on-chip SRAM, reducing the number of cache misses. To determine the MV for the next MB, we move only the data needed for the new candidate block, because




Fig. 15. Comparison of RISC and single media-core.

of the high degree of data overlap between adjacent candidates. With this technique, we can eliminate about 88% of the data bandwidth for H.264 ME, as indicated in Fig. 14.
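The SAD matching criterion used above is a straightforward computation; a plain-Python sketch of one candidate evaluation inside the 48 × 48 search area is:

```python
# The SAD matching criterion from the text, sketched for one 16x16 macroblock
# against a candidate position at offset (dy, dx) inside the 48x48 search
# area. (The hardware parallelizes this across the M-core SIMD threads.)
def sad(cur, cand_area, dy, dx, n=16):
    """Sum of absolute differences between the current MB and the candidate
    block at offset (dy, dx) inside the candidate search area."""
    return sum(abs(cur[y][x] - cand_area[y + dy][x + dx])
               for y in range(n) for x in range(n))

cur = [[10] * 16 for _ in range(16)]
area = [[10] * 48 for _ in range(48)]
assert sad(cur, area, 15, 15) == 0    # identical content matches exactly
area[15][15] = 13
assert sad(cur, area, 15, 15) == 3    # one pixel now differs by 3
```

ME then picks the offset with the minimum SAD as the motion vector, which is why the operation dominates the encoder's complexity.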

B. Intra-Prediction

In implementing the H.264 intra-prediction, we adopt a simple, high-quality intra-prediction algorithm [13], which eliminates over 50% of the complexity of H.264 JM9.7. Similar to ME, intra-prediction in H.264 also performs SAD operations to determine the best prediction mode. The main difference between intra-prediction and ME is that the candidate is calculated from the adjacent row/column pixels of the current frame. This part can be sped up by using the M-cores in VisoMT. By using the one-stop streaming processing technique, we store the calculated candidates for H.264 intra-prediction in the on-chip memory instead of the cache or external memory, thus reducing the number of cache misses. The simulation results shown in Fig. 14 indicate that this technique reduces bandwidth usage by about 32%.

C. Texture Coding

Texture coding includes four functions: integer transform (IT), quantization (Q), inverse quantization (IQ), and inverse integer transform (IIT). The implementation of texture coding requires four block-load operations and four block-store operations. In order to reduce the number of data load/store operations, we exchange the block data directly through the CPAS when computing these functions in the M-cores. The simulation results shown in Fig. 14 indicate that the use of one-stop streaming processing reduces bandwidth consumption by about 65.2%.
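To make the IT stage of the chain concrete, the sketch below implements the standard H.264 4 × 4 forward integer transform, W = C·X·Cᵀ (this is the well-known H.264 core transform, not code from the paper; the Q/IQ/IIT stages would chain onto it through the CPAS banks).

```python
# The standard H.264 4x4 forward integer transform, W = C * X * C^T, as a
# functional model of the IT stage of texture coding. Pure Python, 4x4 only.
C = [[1,  1,  1,  1],
     [2,  1, -1, -2],
     [1, -1, -1,  1],
     [1, -2,  2, -1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def forward_transform(x):
    return matmul(matmul(C, x), transpose(C))

# A flat (constant) residual block puts all its energy in the DC coefficient:
# W[0][0] = 16 * c, every other coefficient is zero.
w = forward_transform([[3] * 4 for _ in range(4)])
assert w[0][0] == 48
assert all(w[i][j] == 0 for i in range(4) for j in range(4) if (i, j) != (0, 0))
```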

Fig. 14 shows the performance gain from realizing the three main functions of H.264 on the VisoMT with our techniques. The left-hand side of Fig. 14 compares the external memory bandwidth requirements with and without the one-stop streaming processing technique, and the right-hand side shows the improvement in processing time compared to a traditional single-stream microprocessor.

VII. Implementation Results

In order to evaluate the critical factors in the performance improvement from our design techniques, we built an instruction-level simulator of the VisoMT processor. We chose the H.264 video encoder as our benchmark, compiling it with optimization for the RISC ISA. As our experiments showed,

Fig. 16. H.264's main functions in different VisoMT configurations.

TABLE I

Speed-Up and Access-Time Reduction With CPAS Support (Texture Coding)

Foreman, VGA, 100 frames | Without CPAS  | With CPAS    | Reduction
Cycle count              | 860 155 459   | 622 097 775  | 27.68%
LD/ST count              | 141 120 000   | 36 000 000   | 74.49%

Fig. 17. Comparison of ARM11 MPCore, TI C62, and this paper.

motion estimation, intra-prediction, and texture coding account for about 75% of the complexity of the H.264 video encoder. Hence, we rewrite these functions, which exhibit data parallelism and high data locality, as multithreaded programs. We simulate an embedded multimedia system with a VisoMT processor attached to a standard bus. VisoMT processes the major functions, and other IPs process the remaining functions (variable-length coding, deblocking). These functions can be processed in a pipelined fashion, so the data communication penalty is very slight.

Fig. 15 shows the speed-up of a single media-core working with one RISC; we use segments of H.264 as our test patterns. Because the media-core uses SIMD instructions and the smart BB Data Transfer unit to pre-transform the data layout, the speed-up of the media-core ranges from 7.9 to 16.99.

The experiment in Fig. 16 is based on different VisoMT core design parameters: 1) the different main functions of the H.264 encoder; 2) the number of SIMD threads in parallel; and 3) a 1- or 2-issue heterogeneous-SMT T-core. As the figure shows, first, performance improves with the number of threads, by about 23.36% on average in the 4-thread configuration. We can also observe another phenomenon in the 4-thread configuration: the performance of 4-thread hardware multithreading (HMT) is only 3.76% higher than that of the 2-HMT-thread configuration. Second, according




Fig. 18. Die photo and summary of the VisoMT processor.

to the statistics, the heterogeneous-SMT architecture in the T-core can break through the performance bottleneck efficiently. On average, four heterogeneous-SMT threads perform 30.96% better than four HMT threads and 54.3% better than the one-thread configuration.

Table I shows a comparison of CPAS usage: the cycle count and the number of load/store accesses for the texture coding program segment. From this table, we can see that the CPAS mechanism efficiently reduces the number of memory accesses (by 74.49%) and raises the execution speed (by 27.68%).
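The percentages in Table I are relative reductions, which can be checked directly from the raw counts. (The with-CPAS load/store count is taken as 36 000 000, the value consistent with the stated 74.49% reduction from 141 120 000.)

```python
# Checking the Table I percentages as arithmetic: each reported figure is the
# relative reduction achieved with CPAS support on the texture coding segment.
cycles_without, cycles_with = 860_155_459, 622_097_775
ldst_without, ldst_with = 141_120_000, 36_000_000  # with-CPAS count assumed
                                                   # consistent with 74.49%

cycle_reduction = (cycles_without - cycles_with) / cycles_without
ldst_reduction = (ldst_without - ldst_with) / ldst_without

assert round(cycle_reduction * 100, 2) == 27.68
assert round(ldst_reduction * 100, 2) == 74.49
```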

The statistics in Fig. 17 compare the ARM11 MPCore, the Texas Instruments (TI) C62, and this paper. The ARM11 MPCore has four ARM RISC cores, and the TI C62 has eight function units. The figure shows that if an architecture merely raises the core count, as in the ARM11 MPCore, the performance improvement is limited; special instructions or function units are needed for further speed-up. For the TI C62 processor, the code generated by the VLIW compiler is complex and inefficient, so in most situations like this one the programmer has to write assembly code. We can also see that the remaining code, including flow control and memory handling, becomes a performance bottleneck; a multithreading architecture is needed to overcome this disadvantage, and a good memory architecture is also important. In the VisoMT core, we use the media-cores to speed up motion compensation, ME, the discrete cosine transform, and Q, and the smart BB Data Transfer unit decreases the percentage of memory accesses. In the 2-VisoMT configuration, we face a further barrier: the two cores contend for ownership of memory accesses. We use one-stop streaming processing to largely eliminate this contention. The program uses GOP-level data parallelism, so different VisoMT cores process different GOP-level data, and the data circulates between the on-chip SRAM and the register files; a VisoMT core only accesses GOP-level source data from off-chip memory through the BB data transfer unit. Because the two cores do the same work, they access off-chip memory equally often, so we can arrange for the two cores to access the off-chip memory sequentially, which largely reduces the contention for memory-access ownership.

As shown in Fig. 18, the chip is fabricated in the Taiwan Semiconductor Manufacturing Company Ltd. 0.13 µm 1P8M complementary metal oxide semiconductor general-purpose standard process. The die size of the processor core is 16.12 mm², including 414k logic transistors and 34.4 kB of on-chip SRAM. It consumes 245 mW at 180 MHz and 1.2 V in our postsimulation result. As shown in the die photo, about 70% of the core area is filled with storage components, and the internal memory is relatively low-cost compared to the same amount of cache. Hence, another efficient solution is to build an appropriate memory buffer to reduce the cache size.

VIII. Conclusion

This paper has presented a processor consisting of several VisoMT cores, each designed by unifying a RISC and a multithreading DSP for sophisticated multimedia applications. The paper also provided a fast data switching mechanism between multilevel storage structures to address the huge bandwidth requirements of multimedia applications. The proposed design not only minimizes integration costs for embedded multithreading and multicore designs through independent collaborative threads, but also reduces memory bandwidth requirements with a one-stop streaming buffer and a very fast data exchange mechanism.

We rewrote the major functions of H.264 video encoding (motion estimation, texture coding, and intra-prediction) as multithreaded programs using our programming model. In our simulation experiments, the 2-VisoMT configuration encodes CIF video in H.264 at 33.2 f/s, compared with 10.695 f/s for a TI C62 core with eight function units. We implemented the H.264 video encoder for portable multimedia applications, and we also presented a chip implementation of our design to validate the design techniques.

References

[1] I. Park, B. Falsafi, and T. N. Vijaykumar, “Implicitly-multithreaded processors,” in Proc. 30th Annu. ACM Int. Symp. Computer Architecture (ISCA), vol. 31, no. 2, May 2003, pp. 39–51.

[2] R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper, and K. Asanovic, “The vector-thread architecture,” in Proc. 37th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO-37), vol. 24, no. 6, Nov. 2004, pp. 84–90.

[3] C. Kozyrakis and D. Patterson, “Overcoming the limitations of conventional vector processors,” in Proc. 30th Annu. ACM Int. Symp. Computer Architecture (ISCA), vol. 31, no. 2, May 2003, pp. 399–409.

[4] S. Ciricescu, R. Essick, B. Lucas, P. May, K. Moat, J. Norris, M. Schuette, and A. Saidi, “The reconfigurable streaming vector processor (RSVP),” in Proc. 36th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO-36), vol. 23, no. 6, Dec. 2003, pp. 141–150.

[5] C. G. Lee and M. G. Stoodley, “Simple vector microprocessors for multimedia applications,” in Proc. 31st Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO-31), vol. 18, no. 6, Dec. 1998, pp. 25–36.

[6] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas, “Single-ISA heterogeneous multicore architectures for multithreaded workload performance,” in Proc. 36th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO-36), vol. 23, no. 6, Dec. 2003, pp. 81–92.

[7] Recommendation and Final Draft International Standard of Joint Video Specification, ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC, Geneva, Switzerland, May 2003.

[8] D. C. Pham, T. Aipperspach, D. Boerstler, M. Bolliger, R. Chaudhry, D. Cox, P. Harvey, P. M. Harvey, H. P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Pham, J. Pille, S. Posluszny, M. Riley, D. L. Stasiak, M. Suzuoki, O. Takahashi, J. Warnock, S. Weitzel, D. Wendel, and K. Yazawa, “Overview of the architecture, circuit design, and physical implementation of a first-generation Cell processor,” IEEE J. Solid-State Circuits, vol. 41, no. 1, pp. 179–196, Jan. 2006.

[9] M. Gschwind, H. P. Hofstee, B. Flachs, M. Hopkins, Y. Watanabe, and T. Yamazaki, “Synergistic processing in Cell’s multicore architecture,” in Proc. 39th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO-39), vol. 26, no. 2, Mar. 2006, pp. 10–24.




[10] A. S. Leon, K. W. Tam, J. L. Shin, D. Weisner, and F. Schumacher, “A power-efficient high-throughput 32-thread SPARC processor,” IEEE J. Solid-State Circuits, vol. 42, no. 1, pp. 7–16, Jan. 2007.

[11] P. Kongetira, K. Aingaran, and K. Olukotun, “Niagara: A 32-way multithreaded SPARC processor,” in Proc. 38th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO-38), vol. 25, no. 2, Mar. 2005, pp. 21–29.

[12] R. Hankins, G. Chinya, J. Collins, P. Wang, R. Rakvic, H. Wang, and J. Shen, “Multiple instruction stream processor,” in Proc. 33rd Annu. ACM Int. Symp. Computer Architecture (ISCA), vol. 34, no. 2, Aug. 2006, pp. 114–127.

[13] C. Meenderinck, A. Azevedo, M. Alvarez, B. Juurlink, and A. Ramirez, “Parallel scalability of H.264,” in Proc. 1st Workshop Programmability Issues Multicore Computers, Jan. 2008.

[14] A. Rodriguez, A. Gonzalez, and M. P. Malumbres, “Hierarchical parallelization of an H.264/AVC video encoder,” in Proc. Int. Symp. Parallel Computing Electrical Engineering, Sep. 2006, pp. 363–368.

[15] S. Momcilovic and L. Sousa, “A parallel algorithm for advanced video motion estimation on multicore architectures,” in Proc. Int. Conf. Complex Intell. Software Intensive Syst., Barcelona, Spain, Mar. 2008, pp. 831–836.

[16] H.-Y. Kang, K.-A. Jeong, J.-Y. Bae, Y.-S. Lee, and S.-H. Lee, “MPEG4 AVC/H.264 decoder with scalable bus architecture and dual memory controller,” in Proc. Int. Symp. Circuits Syst., vol. 2, May 2004, pp. II-145–II-148.

[17] Y. Hu, A. Simpson, K. McAdoo, and J. Cush, “A high definition H.264/AVC hardware video decoder core for multimedia SoCs,” in Proc. IEEE Int. Symp. Consumer Electron., Sep. 2004, pp. 385–389.

[18] T.-W. Chen, Y.-W. Huang, T.-C. Chen, Y.-H. Chen, C.-Y. Tsai, and L.-G. Chen, “Architecture design of H.264/AVC decoder with hybrid task pipelining for high definition videos,” in Proc. IEEE Int. Symp. Circuits Syst., vol. 3, May 2005, pp. 2931–2934.

[19] T. M. Liu, T. A. Lin, S.-Z. Wang, W. P. Lee, K. C. Hou, J. Y. Yang, and C. Y. Lee, “An 865-µW H.264/AVC video decoder for mobile applications,” in Proc. Asian Solid-State Circuits Conf., Nov. 2005, pp. 301–304.

[20] J.-W. van de Waerdt, S. Vassiliadis, S. Das, S. Mirolo, C. Yen, B. Zhong, C. Basto, J.-P. van Itegem, D. Amirtharaj, K. Kalra, P. Rodriguez, and H. van Antwerpen, “The TM3270 media-processor,” in Proc. 38th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO-38), vol. 25, no. 2, Mar. 2005, pp. 331–342.

[21] D. Wu, T. Hu, and D. Liu, “A single-issue DSP-based multi-standard media processor for mobile platform,” in Proc. 8th Workshop Parallel Syst. Algorithms, Mar. 2006, pp. 333–342.

[22] H.-K. Peng, C.-H. Lee, J.-W. Chen, T.-J. Lo, Y.-H. Chang, S.-T. Hsu, Y.-C. Lin, P. Chao, W.-C. Hung, and K.-Y. Jan, “A highly integrated 8-mW H.264/AVC main profile real-time CIF video decoder on a 16 MHz SoC platform,” in Proc. Asia South Pacific Design Automation Conf. (ASP-DAC ’07), Jan. 2007, pp. 112–113.

[23] M. Herlihy and J. E. B. Moss, “Transactional memory: Architectural support for lock-free data structures,” in Proc. 20th Annu. ACM Int. Symp. Computer Architecture (ISCA), vol. 21, no. 2, May 1993, pp. 289–300.

[24] Joint Video Team (JVT) Reference Software JM 9.7 [Online]. Available: http://iphome.hhi.de/suehring/tml/

Wei-Chun Ku was born in Miaoli, Taiwan, in 1981. He received the B.S. degree from the Department of Computer Science and Information Engineering, Tamkang University, Taipei, Taiwan, in 2004. He is currently working toward the Ph.D. degree in the Department of Computer Science and Information Engineering, National Chung Cheng University, Chia-Yi, Taiwan.

His research interests include embedded system design, computer architecture, and system-on-chip design.

Shu-Hsuan Chou was born in Taipei, Taiwan, in 1981. He received the B.S. degree from the Department of Computer Science and Information Engineering, National Chung Cheng University, Chia-Yi, Taiwan, in 2004, where he is currently working toward the Ph.D. degree.

His research interests include multicore system-on-chip design, embedded system design, and processor architecture.

Jui-Chin Chu was born in Kaohsiung, Taiwan, in 1979. He received the B.S. and M.S. degrees from the Department of Computer Science and Information Engineering, National Chung Cheng University, Chia-Yi, Taiwan, in 2001 and 2003, respectively, where he is currently working toward the Ph.D. degree.

His research interests include video processing algorithms, VLSI architecture design, digital intellectual property design, and system-on-chip design.

Chi-Lin Liu was born in Taiwan in 1980. He received the B.S. and M.S. degrees from the Department of Computer Science and Information Engineering, National Chung Cheng University, Chia-Yi, Taiwan, in 2003 and 2005, respectively.

Tien-Fu Chen received the B.S. degree in computer science from National Taiwan University, Taipei, Taiwan, in 1983, and the M.S. and Ph.D. degrees in computer science and engineering from the University of Washington, Seattle, in 1991 and 1993, respectively.

He joined Wang Computer Ltd., Taiwan, where he worked as a System Software Engineer for three years. Currently, he is a Professor with the Department of Computer Science and Information Engineering, National Chung Cheng University, Chia-Yi, Taiwan. He has published several widely cited papers on dynamic hardware prefetching algorithms and designs. He has made contributions to processor design and system-on-chip (SoC) design methodology. His recent research has produced multithreading/multicore media processors, on-chip networks, and low-power architecture techniques, as well as related software support tools and SoC design environments. His current research interests include computer architectures, SoC design, and embedded systems.

Jiun-In Guo was born in Kaohsiung, Taiwan, in 1966. He received the B.S. and Ph.D. degrees in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1989 and 1993, respectively.

He is currently a Professor and the Chair of the Department of Computer Science and Information Engineering, National Chung Cheng University, Chia-Yi, Taiwan. From 2008 to 2009, he was a Research Distinguished Professor at National Chung Cheng University. He joined the System-on-Chip (SoC) Research Center in March 2003, where he was involved in several Grand Research Projects on low-power, high-performance processor design and multimedia intellectual property/SoC design.




Jinn-Shyan Wang was born in Taiwan in 1959. He received the B.S. degree in electrical engineering from National Cheng Kung University, Tainan, Taiwan, in 1982, and the M.S. and Ph.D. degrees from the Institute of Electronics, National Chiao Tung University, Hsinchu, Taiwan, in 1984 and 1988, respectively.

From 1988 to 1995, he was with the Industrial Technology Research Institute (ITRI), where he was engaged in application-specific integrated circuit and system design. After leaving ITRI, he was the Manager of the Department of VLSI Design, Computer and Communication Laboratory. Since 1995, he has been with the Department of Computer Science and Information Engineering, National Chung Cheng University, Chia-Yi, Taiwan, where he is currently a Full Professor. He has published over 30 journal papers and 50 conference papers, and holds over 30 patents on VLSI circuits and architectures. His research interests are in low-power, low-voltage, and high-speed digital integrated circuits and systems, intellectual property and system-on-chip design, analog integrated circuits, and complementary metal–oxide–semiconductor (CMOS) image sensors.

Dr. Wang has served as an International Technical Program Committee member of the International Solid-State Circuits Conference (ISSCC) since 2008.
