[IEEE 2005 International Conference on Information and Communication Technology - Cairo, Egypt...


IMPLEMENTATION OF AN ARM COMPATIBLE PROCESSOR CORE FOR SOC DESIGNS

Ahmed A. Morgan (aahmorgan@yahoo.com), Mahmoud E. Allam, May A. Salama, Hala A. K. Mansour

Faculty of Engineering at Shoubra, Benha University, Egypt

Abstract: Hardware Description Languages (HDLs) are commonly used to construct hardware systems. Design reuse is now a common practice for improving productivity. In this paper, an implementation of a fully pipelined ARM compatible processor core, which can be embedded into System-on-Chips (SOCs), is presented. The implementation aims to support research, education, and development by opening the source code. The logic description of the core is written in VHDL. Therefore, the core can be used with the design tools of many vendors and can be easily reused.

Keywords: ARM core, IP core-based SOC design, Top-down design.

1. INTRODUCTION

In recent years, Intellectual Property (IP) core-based System-on-Chip (SOC) design has emerged as a new design paradigm [1]. Design teams face increasing design sizes and shrinking time scales. Many sub-systems have to be implemented, and design-from-scratch methodologies struggle to cope with the challenge of SOC design. This encourages design engineers to reuse existing IP cores. For these IP cores to be as reusable as possible, they should be soft cores in a synthesizable hardware description language like VHDL [2]. This allows the designer to conceive the design at the level of Register Transfer Logic (RTL) without reference to the final technology. This high-level design process has created a market of IPs, or complete pre-designed and tested macro-cells, which can be incorporated within the designer's project hierarchy.

On the other hand, today's Field Programmable Gate Arrays (FPGAs) offer a large capacity of logic gates. Therefore, it is possible for a single chip to accommodate a processor core for a controller and peripheral circuits for the application [3]. The development period can be reduced by employing an existing processor core. Although processor cores are available from IP vendors, they may be expensive or have an unsuitable interface. Therefore, it is difficult to employ a commercial core for research or education. Since the most common element in designs is a processor core, one that has a flexible interface and is reusable with open source code is desirable to reduce the development period and costs.

One issue is selecting a processor core from the many available processor cores. Selecting a processor core for SOC design is a fine balance between many conflicting goals such as higher performance, simpler interface, higher reliability, and smaller size. The following benefits of the ARM support the decision to select the ARM core [4]-[6]:

* ARM is one of the most licensed and thus most widespread processor cores.
* ARM processors are usually embedded in portable devices.
* A wide range of internal and third-party tools support ARM processors, forming a large software infrastructure.
* ARM processors are able to find new and more efficient solutions to existing problems by employing core extensions like Thumb, Enhanced Digital Signal Processor (DSP), and the Jazelle Java machine.
* ARM enhances the basic Reduced Instruction Set Computer (RISC) architecture to achieve high performance. Some of these enhancements are conditional execution of instructions, conditional setting of flags, combined shift and ALU execution, and load/store multiple instructions.
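Two of the enhancements listed above, conditional execution and combined shift and ALU execution, can be illustrated with a short behavioural sketch in Python. The function names `condition_passed` and `add_shifted` are hypothetical and are not taken from the core's VHDL source; the condition table follows the standard ARM condition mnemonics over the N, Z, C, and V flags.

```python
MASK32 = 0xFFFFFFFF  # ARM registers are 32 bits wide


def condition_passed(cond: str, n: bool, z: bool, c: bool, v: bool) -> bool:
    """Return True if an instruction with condition `cond` should execute.

    n, z, c, v are the Negative, Zero, Carry, and oVerflow flags.
    """
    table = {
        "EQ": z,                     # equal
        "NE": not z,                 # not equal
        "CS": c,                     # carry set / unsigned higher or same
        "CC": not c,                 # carry clear / unsigned lower
        "MI": n,                     # negative
        "PL": not n,                 # positive or zero
        "VS": v,                     # overflow
        "VC": not v,                 # no overflow
        "HI": c and not z,           # unsigned higher
        "LS": (not c) or z,          # unsigned lower or same
        "GE": n == v,                # signed greater than or equal
        "LT": n != v,                # signed less than
        "GT": (not z) and n == v,    # signed greater than
        "LE": z or n != v,           # signed less than or equal
        "AL": True,                  # always
    }
    return table[cond]


def add_shifted(rn: int, rm: int, lsl: int) -> int:
    """Model of ADD Rd, Rn, Rm, LSL #lsl: the shift and the add are one instruction."""
    return (rn + ((rm << lsl) & MASK32)) & MASK32


# ADDEQ r0, r1, r2, LSL #2 executes only when the Z flag is set.
if condition_passed("EQ", n=False, z=True, c=False, v=False):
    r0 = add_shifted(1, 3, 2)  # 1 + (3 << 2)
```

In this sketch a failed condition simply skips the operation, which mirrors how a conditionally executed instruction behaves as a no-op without requiring a branch.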

Carmona et al. implemented part of ARMv4 [7]. Parts of the architecture that disturb the flow of data in the pipeline were omitted: instructions such as multiply, coprocessor, and multiple register transfer were not implemented. This paper addresses the implementation of a full-featured, fully pipelined ARMv5 compatible processor core using VHDL on an FPGA. The core can be used as a single processor as well as a template for SOC designs. Parts of the core are developed, validated, and synthesized using FPGA Advantage [8], integrated software from Mentor Graphics Corporation. The synthesis process converts the VHDL code into a netlist. The netlist is then used as the input for the Xilinx ISE [9], integrated software from Xilinx Inc. The output of the place and route utilities of the software is in a binary format that can be used to configure the FPGA.


2. ARMV5 ARCHITECTURE

The name ARM is an acronym for "Advanced RISC Machine". ARMv5 has the typical RISC architecture features [10]. It also enhances the basic RISC architecture by incorporating some additional features [5]. The processor core has five pipeline stages: instruction fetch, instruction decode, execute, data cache access, and write back. ARMv5 supports user mode and six privileged modes of operation. Five of them are exception modes: abort, fast interrupt (FIQ), normal interrupt (IRQ), supervisor, and undefined. The remaining mode is system mode. ARMv5 supports byte, halfword, and word data types. The processor core has 31 general-purpose 32-bit registers and 6 status registers. Registers are arranged in partially overlapping banks, with a different register bank for each processor mode. At any time, 15 general-purpose registers, one or two status registers, and the program counter are visible. The ARMv5 architecture has a Harvard bus architecture. It uses two independent flat address spaces of 2^32 bytes for data and instructions. These addresses should be word-aligned. The ARM instruction set can be divided into many classes: branch instructions, data-processing instructions, multiply instructions, status register access instructions, load/store single register instructions, load/store multiple registers instructions, semaphore instructions, exception-generating instructions, and coprocessor instructions.
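The register counts quoted above can be checked by enumerating the banks. The sketch below follows the banking scheme of the ARM Architecture Reference Manual [5] (FIQ mode has private copies of r8 to r14; IRQ, supervisor, abort, and undefined modes each have private copies of r13 and r14). The list names are illustrative only and do not come from the core's source code.

```python
# Registers with a single physical copy, seen by every mode.
SHARED = [f"r{i}" for i in range(8)]               # r0-r7

# r8-r14 as seen from user and system modes (these two modes share a bank).
USER_SYSTEM = [f"r{i}" for i in range(8, 15)]

# FIQ mode has its own private r8-r14.
FIQ_BANK = [f"r{i}_fiq" for i in range(8, 15)]

# The other exception modes bank only the stack pointer (r13) and link register (r14).
OTHER_BANKS = [
    f"r{i}_{mode}"
    for mode in ("irq", "svc", "abt", "und")
    for i in (13, 14)
]

PROGRAM_COUNTER = ["r15"]

general_purpose = SHARED + USER_SYSTEM + FIQ_BANK + OTHER_BANKS + PROGRAM_COUNTER

# One current status register plus one saved status register per exception mode.
status = ["CPSR"] + [f"SPSR_{m}" for m in ("fiq", "irq", "svc", "abt", "und")]

print(len(general_purpose))  # 8 + 7 + 7 + 8 + 1 = 31
print(len(status))           # 1 + 5 = 6
print(2 ** 32)               # size of each flat address space in bytes
```

User and system modes see only the current status register, while each exception mode also sees its own saved status register, which is why one or two status registers are visible at any time.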

3. HANDLING HAZARDS

For any pipelined architecture, it is necessary to handle hazards and exceptions [11]. In implementing the processor core, hazards are resolved in a way that minimizes stalls as much as possible. This requires more hardware to be implemented. Despite the increase in chip utilization, the target FPGA is capable of implementing the additional hardware. The following methods are used to handle hazards:

3.1 Structural Hazard

Structural hazards are resolved by duplicating the resources, which prevents any stall due to a structural hazard.

3.2 Data Hazard

In ARMv5, only the Read-After-Write (RAW) [12] hazard can occur. The RAW hazard is dealt with by restricting register writing to the first half-cycle and register reading to the second half-cycle. This ensures that the data to be read has actually been written. The data forwarding technique [11] is also used to pass on data that has not been written back yet. If an instruction requests data that is not yet loaded by the preceding instruction, the pipeline is stalled. In this case, the duration of the stall is only one clock cycle. In the next cycle, the requested data is loaded and hence becomes available to be forwarded.
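The forwarding and stall decisions described in this subsection can be sketched as two small functions. This is a textbook-style model in the spirit of [11], not the logic of the implemented core: the pipeline register names "EX/MEM" and "MEM/WB" and the function names are assumptions introduced for illustration.

```python
from typing import Optional, Sequence


def select_operand_source(
    src: int,
    ex_mem_dest: Optional[int],
    mem_wb_dest: Optional[int],
) -> str:
    """Choose where the execute stage should take the value of register `src` from.

    ex_mem_dest / mem_wb_dest are the destination registers of the two older
    instructions still in flight (None if that instruction writes no register).
    """
    if ex_mem_dest is not None and src == ex_mem_dest:
        return "EX/MEM"      # newest result wins: forward from the previous instruction
    if mem_wb_dest is not None and src == mem_wb_dest:
        return "MEM/WB"      # forward from the instruction two places ahead
    return "REGFILE"         # no hazard: the register file value is up to date


def must_stall(
    decode_srcs: Sequence[int],
    execute_is_load: bool,
    execute_dest: Optional[int],
) -> bool:
    """Load-use check: stall one cycle if the instruction in decode needs a
    register that the load currently in execute has not fetched yet."""
    return execute_is_load and execute_dest in decode_srcs


# LDR r2, [r0] followed immediately by ADD r3, r2, r1 needs one stall cycle.
print(must_stall(decode_srcs=[2, 1], execute_is_load=True, execute_dest=2))
```

After the single stall cycle the loaded value sits in the later pipeline register, so `select_operand_source` would then report "MEM/WB" and the dependent instruction proceeds without waiting for write back.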

3.3 Control Hazard

Exceptions, being sudden events, are harder to handle in a pipelined processor. They cause performance degradation as they require some pipeline stages to be flushed [11]. Branch hazards, on the other hand, are dealt with by using the predict-never-taken static branch prediction technique [13]. On a misprediction, the pipeline is flushed. The branch target address calculation is moved from the execute stage to the decode stage. This reduces the cost of a misprediction to one clock cycle, as only the fetch stage has to be flushed.
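The effect of these penalties on run time can be estimated with a simple cycle-count model. The sketch below is an idealized approximation that ignores exceptions and memory wait states; the function name and parameters are hypothetical, and the two-cycle figure used for comparison assumes that resolving a branch in the execute stage would flush both the fetch and decode stages.

```python
def total_cycles(
    instructions: int,
    taken_branches: int,
    load_use_stalls: int,
    stages: int = 5,
    branch_penalty: int = 1,
    load_penalty: int = 1,
) -> int:
    """Estimate the cycles needed to run a trace on an in-order pipeline.

    With predict-never-taken, only taken branches are mispredicted, and each
    one costs `branch_penalty` flushed cycles.
    """
    fill = stages - 1  # cycles before the first instruction completes
    return (
        fill
        + instructions
        + taken_branches * branch_penalty
        + load_use_stalls * load_penalty
    )


# Branch target computed in decode (one flushed stage per taken branch).
print(total_cycles(100, taken_branches=10, load_use_stalls=5))

# Assumed alternative: target computed in execute (two flushed stages).
print(total_cycles(100, taken_branches=10, load_use_stalls=5, branch_penalty=2))
```

Under these assumptions, a 100-instruction trace with 10 taken branches and 5 load-use stalls takes 119 cycles with the one-cycle penalty and 129 cycles with a two-cycle penalty.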

4. DESIGNING THE PROCESSOR CORE

A top-down approach is taken [14], [15]: the logic blocks are partitioned into a collection of smaller and smaller modules until a point is reached at which a VHDL description can be implemented. This description is then verified to ensure that it is consistent with the design specification. The verification techniques in this design are simulation-based. Verification begins with the creation of a test bench. This test bench usually captures the specification of the component functionality. The actual designs are compared to these test benches. Verification is done with a bottom-up approach. By verifying the hierarchy from one level of detail to the next, a synthesizable processor core that is error-free before downloading onto the FPGA can be obtained. The design is accomplished through five steps:

4.1 Top-Level Design

The specifications of the design are defined. The architecture imposes some specifications that have to be met. The rest of the specifications are left to the designer. In this step of the design, precise definitions are proposed for both the unpredictable actions and the implementation-defined features of the architecture. The instruction set is then analyzed from the functional perspective to give a description of what the instructions do. From this, the top-level datapaths are produced.

4.2 Register Transfer Logic Design

This design level is much more detailed than the top level. The instruction set and the top-level datapaths are further explored and analyzed to explain how the operations could be implemented in hardware. The instructions are decomposed into elementary micro-operations in each pipeline stage. The required control signals are also specified. By reducing the operations of the processor core to their fundamental level, it is possible to develop an architecture that supports the ARM instruction set. The internal components, buses, and control structure are produced. The function and interface requirements of each component are accurately defined. Figure 1 shows the structure used. Once the instruction is decoded in the decode stage, a configuration word is produced by the control unit. This word is passed from stage to stage, enabling the appropriate logic as it goes. The data forwarding unit is used to solve data hazards. It detects data hazards and passes the correct data through forwarding paths. To handle control hazards, a hazard detection unit is used that has complete supervision over all the components. It provides the required control signals for each hazard. It is responsible for maintaining the proper operation of the whole pipeline.


Figure 1. Processor Core Structure
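The configuration word described in Section 4.2, produced once in the decode stage and then carried along with the instruction, can be pictured with a small behavioural sketch. The field names in `ControlWord`, the three-entry decode table, and the `clock` function are all hypothetical simplifications chosen for illustration; the actual control signals of the core are not listed in this paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ControlWord:
    """Hypothetical configuration word: one field per downstream enable."""
    alu_op: str       # used in the execute stage
    mem_read: bool    # used in the data cache access stage
    mem_write: bool   # used in the data cache access stage
    reg_write: bool   # used in the write back stage


# A bubble enables nothing; it is what a stall or flush would inject.
BUBBLE = ControlWord("NOP", False, False, False)


def decode(mnemonic: str) -> ControlWord:
    """Toy control unit: map a mnemonic to its configuration word."""
    table = {
        "ADD": ControlWord("ADD", False, False, True),
        "LDR": ControlWord("ADD", True, False, True),   # address = base + offset
        "STR": ControlWord("ADD", False, True, False),
    }
    return table.get(mnemonic, BUBBLE)


def clock(stages: List[ControlWord], decoded: ControlWord) -> List[ControlWord]:
    """Advance one cycle. `stages` holds the words currently in
    [execute, data cache access, write back]; the oldest word drops out."""
    return [decoded] + stages[:-1]


pipeline = [BUBBLE, BUBBLE, BUBBLE]
for mnemonic in ("LDR", "ADD", "STR"):
    pipeline = clock(pipeline, decode(mnemonic))

# After three cycles the LDR word has reached write back and enables the register write.
print(pipeline[2].reg_write)
```

Carrying the whole word forward means each stage only has to inspect its own fields, so no stage needs to re-decode the instruction.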

4.3 VHDL Implementation

After the design is developed and analyzed in detail, VHDL code is written and compiled for each component. To ensure the required reusability, some elements of quality are considered in coding the design. Each component of the design is implemented in a separate file. This allows the use of precompiled mega-cells, or the addition and removal of some features. Meaningful names are selected for the entities, ports, and signals. The VHDL code is also extensively commented to facilitate future use.

4.4 Component Verification

Each component is individually verified. First, the VHDL code for each component is functionally simulated to discover the logical errors in it. Then, the test bench is run to further validate the component. After the component passes the functional simulation, it is synthesized and converted into a netlist. The netlist is then used as the input for the place and route utilities. Timing back-annotation is performed to ensure the correctness of the gate-level implementation. The results and behavior of the gate-level timing simulation are compared to those of the functional simulation. The test benches are also used to verify the gate-level implementation. Figure 2 explains the component verification method.
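The idea of checking a component against a test bench that encodes its specification can be shown in miniature. The sketch below is only an analogy in Python, not the VHDL test benches used in this work: `spec_add` stands in for the specification of a 32-bit adder, and `buggy_add` is a deliberately faulty stand-in for a component under test.

```python
from typing import Callable, Iterable, List, Tuple

MASK32 = 0xFFFFFFFF
Vector = Tuple[int, int]


def spec_add(a: int, b: int) -> int:
    """Specification: 32-bit addition with wrap-around."""
    return (a + b) & MASK32


def run_test_bench(
    dut: Callable[[int, int], int],
    vectors: Iterable[Vector],
) -> List[Vector]:
    """Apply every stimulus to the design under test and return the
    vectors whose response differs from the specification."""
    return [(a, b) for a, b in vectors if dut(a, b) != spec_add(a, b)]


def good_add(a: int, b: int) -> int:
    """A correct implementation, written differently from the specification."""
    total = a + b
    return total - (1 << 32) if total > MASK32 else total


def buggy_add(a: int, b: int) -> int:
    """Faulty implementation: the carry into the upper halfword is lost."""
    return (a + b) & 0xFFFF


vectors = [(0, 0), (1, 2), (0xFFFF, 1), (MASK32, 1)]
print(run_test_bench(good_add, vectors))   # no mismatches expected
print(run_test_bench(buggy_add, vectors))  # reports the failing stimulus
```

The same set of vectors can be reused at each level, which corresponds to running the test benches against both the functional and the gate-level simulations.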

4.5 Whole Processor Verification

After each component is validated in a stand-alone manner, the verification procedure is repeated in full for the whole processor core. The top-level design is extensively simulated with many test programs to increase the confidence that the processor core is an error-free design.

5. RESULTS

The design is originally oriented toward an FPGA implementation of the core. However, due to the process-independent nature of VHDL, it is a simple task to produce a standard cell Application Specific Integrated Circuit (ASIC) implementation of the design. The processor core is synthesized and mapped onto a Xilinx Virtex-II family device [16]. The target device is the XC2V1500. In the trade-off between speed and area, the design is synthesized for speed. The target FPGA is more than capable of implementing the design. The processor requires 4722 CLB slices (61.4%) and achieves a maximum clock frequency of 80 MHz with a power consumption of 120 mW. Table I summarizes the whole processor synthesis results. ARM Limited offers a synthesizable ASIC soft core for the same architecture [17]. For the purpose of comparison, the characteristics of this soft core are also listed in Table I.


Figure 2. Component Verification

As expected, the clock frequency of the implemented design is not as fast as that of the core from ARM Limited, for the following reasons:

* An FPGA version of a digital circuit is always likely to be slower than an equivalent ASIC version [14]. This is because an FPGA is constructed from a series of logic blocks interconnected via regularly structured wiring channels rather than custom-built logic.
* The setup and hold times of the FPGA are usually larger than those of a real ASIC [3].


However, the clock frequency is adequate because the design is intended to be used as an embedded controller in SOC designs. In this respect, the controller frequency need not be so high. The controller will also be tightly coupled with the other modules within the FPGA, and thus the overhead of the handshake is minimal.

Table I. Synthesis Results

                                        This design    ARM Limited soft core [17]
                                        (FPGA)
Voltage (V)                             3.3            1.62         1.08
Area                                                   85 mm2       4.7 mm2
  CLB slices            Used            4722
                        Available       7680
                        %               61.4
  Function generators   Used            9443
                        Available       15360
                        %               6.
  Latches               Used            2106
                        Available       16944
                        %               12.4
Maximum frequency (MHz)                 80             200          250
Power (mW)                              120            200          112.5
Standardized power (mW/MHz)             1.5            1            0.45
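The derived figures in Table I can be reproduced from the raw numbers. The sketch below assumes that the utilization percentage is used resources divided by available resources and that standardized power is power divided by maximum frequency (mW per MHz); the function names are illustrative.

```python
def utilization_percent(used: int, available: int) -> float:
    """Share of a device resource that the design occupies, in percent."""
    return 100.0 * used / available


def standardized_power(power_mw: float, frequency_mhz: float) -> float:
    """Power normalized by clock frequency, in mW per MHz."""
    return power_mw / frequency_mhz


# Resource utilization of the implemented core on the target FPGA.
print(round(utilization_percent(4722, 7680), 2))    # CLB slices
print(round(utilization_percent(2106, 16944), 2))   # latches

# Standardized power for the three columns of the table.
for power_mw, frequency_mhz in ((120, 80), (200, 200), (112.5, 250)):
    print(standardized_power(power_mw, frequency_mhz))
```

The slice utilization evaluates to about 61.48%, which the paper reports as 61.4%, and the three standardized power values match the 1.5, 1, and 0.45 entries of the table.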

6. CONCLUSION

A fully pipelined ARM compatible processor core for SOC design has been successfully implemented. A top-down design approach is used, in which the processor core at the specification level is decomposed into sub-modules at lower levels until VHDL code is generated. Using VHDL makes the design independent of the final technology and offers the opportunity of configuring the features of the processor, adding special hardware blocks for various applications, and using the core with different Electronic Design Automation (EDA) tools. These VHDL descriptions have been verified, and the timing simulations show that the design acts as expected.

Future work on the design can include the employment of the core in building SOC designs. The interface can be optimized according to the required application. Using state-of-the-art FPGAs, it is possible to implement multiprocessor architectures and applications.


7. REFERENCES

[1] S. Agun, J. Chang, "Design of a Reusable Memory Management System", Proc. 14th Annu. IEEE Int. ASIC/SOC Conf., Washington DC, USA, September 12-15, 2001, pp. 369-373.

[2] J. Chang, S. Agun, "Designing Reusable Components in VHDL", Proc. 13th Annu. IEEE Int. ASIC/SOC Conference, Washington DC, USA, September 13-16, 2000, pp. 165-169.

[3] N. Ohba, K. Takano, "An SoC Design Methodology Using FPGAs and Embedded Microprocessor", Proc. 41st Annu. Conf. Design Automation, California, USA, June 7-11, 2004, pp. 747-752.

[4] D. Jagger, "ARM Architecture and Systems", IEEE Micro, vol. 17, issue 4, July/August 1997, pp. 9-11.

[5] ARM Ltd., "ARM Architecture Reference Manual", Doc. no. ARM DDI 0100E, 2000.

[6] L. Adams, M. Ou, "Processor Integration in a Disk Controller", IEEE Micro, vol. 17, issue 4, July/August 1997, pp. 44-48.

[7] F. Carmona, J. Tombs, M. Echanove, A. Torralba, "Implementation of a Fully Pipelined ARM Compatible Microprocessor Core", 17th Conf. Design of Circuits and Integrated Systems, DCIS'2002, Santander, Spain, November 2002, pp. 559-563.

[8] Mentor Graphics Corp., "Designing with FPGA Advantage", 2002.

[9] Xilinx Inc., "Xilinx ISE 7 software manuals", 2005, "http://toolbox.xilinx.com/docsan/xilinx7/books/manuals.pdf".

[10] M. Mano, "Computer System Architecture", 3rd edition, Prentice-Hall, 1993, pp. 282-285.

[11] J. Hennessy, D. Patterson, "Computer Architecture: A Quantitative Approach", 3rd edition, Morgan Kaufmann Pubs, 2003, Appendix A.

[12] C. Ramamoorthy, H. Li, "Pipeline Architecture", ACM Computing Surveys, vol. 9, issue 1, March 1977, pp. 61-102.

[13] P. Dubey, M. Flynn, "Branch Strategies: Modeling and Optimization", IEEE Trans. Computers, vol. 40, issue 10, October 1991, pp. 1159-1167.

[14] L. Lloyd, K. Heron, A. Koelmans, A. Yakovlev, "Asynchronous Microprocessor: From High Level Model to FPGA Implementation", Journal System Architecture, vol. 45, issue 12-13, June 1999, pp. 975-1000.

[15] R. McGraw, J. Aylor, R. Klenke, "A Top-Down Design Environment for Developing Pipelined Datapaths", Proc. 35th Annu. Conf. Design Automation, California, USA, Jun. 15-19, 1998, pp. 236-241.

[16] Xilinx Inc., "Virtex-II Platform FPGAs", 2005, "http://www.xilinx.com/support/librar.htm".

[17] ARM Ltd., "http://www.arm.com".
