ARM
Politecnico di Torino
Dipartimento di Automatica e Informatica
M. Sonza Reorda – M. Rebaudengo
M. Sonza Reorda – a.a. 2006/07
Outline
• Introduction
• The instruction set
• The ARM architecture
• ARM systems
Introduction
• The ARM processor was first developed (between 1983 and 1985) by Acorn Computers, Ltd., based in Cambridge (UK).
• ARM designers were heavily influenced by Berkeley RISC I.
• In 1990, ARM Ltd. was founded by Acorn, Apple and VLSI.
• Several versions of ARM processors were designed in the following years.
• Today, ARM cores are widely popular among SoC designers, mainly because they offer a very good trade-off between performance and power consumption.
ARM processors
• They are mainly sold as cores, to be integrated into Systems on Chip (SoCs).
• Cores can be:
  • Hard cores: ARM provides a physical layout, implemented in a given technology. They are generally more efficient (in terms of area, speed and power), but require significant implementation work to be mapped onto a new technology.
  • Soft cores: ARM provides a high-level description that can then be synthesized to any technology by the designer. They are generally less efficient, but moving to a new technology is easier and can be performed by the designer (i.e., they provide a higher return on investment for ARM customers).
• In a few cases, ARM processors have been delivered as stand-alone devices.
Characteristics
• Very simple design
• Load-store architecture
• Fixed-length 32-bit instructions
• 3-address instruction formats.
Programmer’s model
[Figure: ARM register organization by operating mode.]
• Usable in user mode: r0 to r14, r15 (PC), CPSR
• System modes only (banked registers):
  • fiq mode: r8_fiq to r14_fiq, SPSR_fiq
  • svc mode: r13_svc, r14_svc, SPSR_svc
  • abort mode: r13_abt, r14_abt, SPSR_abt
  • irq mode: r13_irq, r14_irq, SPSR_irq
  • undefined mode: r13_und, r14_und, SPSR_und
CPSR
CPSR stands for Current Program Status Register.
Bits        Field      Meaning
31 to 28    N Z C V    Condition codes: Negative, Zero, Carry, Overflow
27 to 8     unused
7 to 5      I F T      Affect some processor features (I and F disable IRQ and FIQ interrupts, T selects the Thumb instruction set)
4 to 0      mode       Shows the processor operation mode
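The field layout can be exercised with a short Python sketch. The function name decode_cpsr and the sample value 0x600000D3 are illustrative assumptions, not part of the slides; the mode encodings are the ones listed on the Operating modes (II) slide.

```python
# Mode encodings for CPSR[4:0], as listed on the Operating modes (II) slide.
MODE_NAMES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "SVC",
    0b10111: "Abort",
    0b11011: "Undef",
    0b11111: "System",
}

def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into its named fields (hypothetical helper)."""
    return {
        "N": (cpsr >> 31) & 1,
        "Z": (cpsr >> 30) & 1,
        "C": (cpsr >> 29) & 1,
        "V": (cpsr >> 28) & 1,
        "I": (cpsr >> 7) & 1,
        "F": (cpsr >> 6) & 1,
        "T": (cpsr >> 5) & 1,
        "mode": MODE_NAMES.get(cpsr & 0x1F, "unknown"),
    }

# Sample value chosen for illustration: Z and C set, I and F set, SVC mode.
fields = decode_cpsr(0x600000D3)
```

Bits 27 to 8 are simply ignored by the sketch, since the slide marks them as unused.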
Memory Organization
Data items may be:
• 8-bit bytes
• 16-bit half-words (aligned on even byte boundaries)
• 32-bit words (aligned on 4-byte boundaries).
[Figure: memory drawn as rows of 32-bit words (bit 31 on the left, bit 0 on the right) with byte addresses 0 to 23. Example items, each named after its address: byte0 to byte3, byte6, half-word4, half-word12, half-word14, word8 and word16.]
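The alignment rules can be checked with a minimal Python sketch. The helper names is_aligned and word_address are hypothetical and only serve the illustration.

```python
def is_aligned(address, size_bytes):
    """Return True if an item of size_bytes (1, 2 or 4) may start at address."""
    return address % size_bytes == 0

def word_address(address):
    """Address of the 32-bit word that contains the given byte address."""
    return address & ~0x3
```

For example, address 14 is a valid half-word address but not a valid word address, and the byte at address 6 lives in the word at address 4.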
Load-store architecture
The instruction set only processes values which are in registers (or specified directly within the instruction itself), and places the results of such processing into a register.
The only operations which apply to memory state are ones which copy memory values into registers (load instruction) or copy register values into memory (store instruction).
The ARM Assembly Language
The ARM instruction set is composed of the following types of instructions:
• Data processing instructions
• Data transfer instructions
• Control flow instructions.
Data processing instructions
The following rules apply:
• All operands are 32 bits wide
• They may be either registers or immediates
• The result is always 32 bits wide and is placed in a register
• The two operands and the result are independently specified in the instruction.
Examples
ADD r0, r1, r2 ; r0 := r1 + r2
ADC r0, r1, r2 ; r0 := r1 + r2 + C
AND r0, r1, r2 ; r0 := r1 and r2
MOV r0, r2 ; r0 := r2
CMP r1, r2 ; set cc on r1 – r2
ADD r3, r3, #1 ; r3 := r3 + 1
Shifted operands
The second operand of a data processing instruction can be shifted before being used.
Example
ADD r3, r2, r1, LSL #3 ; r3 := r2 + 8 × r1
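Available shift operations
[Figure: effect of each shift operation on a 32-bit operand, bits 31 to 0.]
• LSL #5: logical shift left; the five vacated low-order bits are filled with 0s
• LSR #5: logical shift right; the five vacated high-order bits are filled with 0s
• ASR #5, negative operand: arithmetic shift right; the vacated high-order bits are filled with 1s (copies of the sign bit)
• ASR #5, positive operand: arithmetic shift right; the vacated high-order bits are filled with 0s (copies of the sign bit)
• ROR #5: rotate right; the five bits shifted out on the right re-enter on the left
• RRX: rotate right extended by one bit through the carry flag C; the old C enters bit 31 and bit 0 is shifted out towards C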
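The shift operations can be modelled on 32-bit values with a few Python functions. This is a sketch under stated assumptions: the function names are illustrative, shift amounts are taken in the range 0 to 31, and rrx returns the bit shifted out as a second value instead of updating a real flag.

```python
MASK = 0xFFFFFFFF  # keep every result within 32 bits

def lsl(x, n):
    """Logical shift left: zeros enter from the right."""
    return (x << n) & MASK

def lsr(x, n):
    """Logical shift right: zeros enter from the left."""
    return (x & MASK) >> n

def asr(x, n):
    """Arithmetic shift right: copies of the sign bit enter from the left."""
    x &= MASK
    if x & 0x80000000:
        return ((x >> n) | (MASK << (32 - n))) & MASK
    return x >> n

def ror(x, n):
    """Rotate right: bits leaving on the right re-enter on the left."""
    x &= MASK
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK

def rrx(x, carry):
    """Rotate right by one bit through carry; returns (result, bit shifted out)."""
    x &= MASK
    return ((carry << 31) | (x >> 1)) & MASK, x & 1

# ADD r3, r2, r1, LSL #3 with sample register values r1 = 5, r2 = 2
r1, r2 = 5, 2
r3 = (r2 + lsl(r1, 3)) & MASK
```

With these sample values r3 equals 2 + 8 × 5.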
Condition codes
A data processing instruction sets the condition codes (N, Z, C and V) only if the programmer requests it, by appending the S suffix to the mnemonic. Comparison instructions such as CMP always set them.
Example
ADDS r1, r2, r3 ; sets the cc
ADD r1, r2, r3 ; does not set the cc
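How ADDS derives the four condition codes can be sketched in Python. The function name adds and the dictionary representation of the flags are assumptions made for the illustration.

```python
def adds(a, b):
    """Return (result, flags) for a 32-bit ADDS of a and b."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    full = a + b                      # unbounded sum, used to detect carry-out
    result = full & 0xFFFFFFFF
    flags = {
        "N": result >> 31,            # sign bit of the result
        "Z": int(result == 0),
        "C": int(full > 0xFFFFFFFF),  # carry out of bit 31
        # signed overflow: both operands have a sign that the result lacks
        "V": ((a ^ result) & (b ^ result)) >> 31,
    }
    return result, flags
```

A plain ADD would compute the same result and leave the flags untouched.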
Data transfer instructions
There are three groups of these instructions:
• Single register load and store
• Multiple register load and store
• Single register swap.
Addressing modes
Register-indirect addressing
Example
LDR r0, [r1] ; r0 := mem32 [r1]
Pre-indexed
Example
LDR r0, [r1, #4] ; r0 := mem32 [r1+4]
Addressing modes (II)
Auto-indexing
Example
LDR r0, [r1, #4]! ; r0 := mem32 [r1+4]
; r1 := r1 + 4
Post-indexed
Example
LDR r0, [r1], #4 ; r0 := mem32 [r1]
; r1 := r1 + 4
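The three indexed forms differ only in which address is used and whether the base register is updated. A minimal Python model follows; memory is a dictionary keyed by word address, and the function ldr with its writeback and post parameters is a hypothetical helper, not an ARM API.

```python
def ldr(mem, regs, rd, rn, offset=0, writeback=False, post=False):
    """Model LDR rd, [rn, #offset] in its pre-indexed, auto-indexed
    (writeback=True) and post-indexed (post=True) forms."""
    base = regs[rn]
    address = base if post else base + offset   # post-indexed uses the old base
    regs[rd] = mem[address]
    if writeback or post:
        regs[rn] = base + offset                # base register update
    return regs

# Sample memory contents chosen for illustration.
mem = {0x100: 11, 0x104: 22}
```

Calling ldr(mem, regs, "r0", "r1", 4, post=True) corresponds to LDR r0, [r1], #4.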
Multiple register data transfer
When considerable quantities of data are to be transferred it is preferable to move several registers at a time.
Example: LoaD Multiple Increment After
LDMIA r1, {r0, r2, r5} ; r0 := mem32 [r1]
; r2 := mem32 [r1 + 4]
; r5 := mem32 [r1 + 8]
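Multiple register data transfer (cont.)
[Figure: memory contents after each store multiple instruction, with r9 = 0x100C before the transfer and r9' its value afterwards; the memory shown spans 0x1000 to 0x1018.]
Instruction              r0 stored at   r1 stored at   r5 stored at   r9'
STMIA r9!, {r0,r1,r5}    0x100C         0x1010         0x1014         0x1018
STMIB r9!, {r0,r1,r5}    0x1010         0x1014         0x1018         0x1018
STMDA r9!, {r0,r1,r5}    0x1004         0x1008         0x100C         0x1000
STMDB r9!, {r0,r1,r5}    0x1000         0x1004         0x1008         0x1000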
Stack addressing
A stack is a form of LIFO store which supports simple dynamic memory allocation.
A stack is implemented as a linear data structure which grows up (an ascending stack) or down (a descending stack) as data is added to it and shrinks back as data is removed.
A stack pointer holds the address of the current top of the stack, either by pointing to the last valid data item pushed onto the stack (the full stack) or by pointing to the vacant slot where the next data item will be placed (the empty stack).
Stack addressing (cont.)
There are 4 variations on a stack:
• full ascending (suffix FA): the stack grows up and the base register points to the highest address containing a valid item
• empty ascending (suffix EA): the stack grows up and the base register points to the first empty location above the stack
• empty descending (suffix ED): the stack grows down and the base register points to the first empty location below the stack
• full descending (suffix FD): the stack grows down and the base register points to the lowest address containing a valid item.
Stack addressing (cont.)
Example:
STMFD r13!, {r2-r9} ; save regs onto stack
LDMFD r13!, {r2-r9} ; restore regs from stack
Note that the same stack model is used for both the store and load, ensuring that the correct values will be collected.
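A full descending stack can be modelled in a few lines of Python to show why the load retrieves exactly what the store saved. The function names mirror the mnemonics but are only illustrative; memory is a dictionary keyed by address and the initial stack pointer 0x2000 is an arbitrary sample value.

```python
def stmfd(mem, sp, values):
    """Push values (lowest register first) onto a full descending stack."""
    sp -= 4 * len(values)            # decrement before storing
    for i, v in enumerate(values):
        mem[sp + 4 * i] = v          # lowest register at lowest address
    return sp

def ldmfd(mem, sp, count):
    """Pop count words from a full descending stack."""
    values = [mem[sp + 4 * i] for i in range(count)]
    return values, sp + 4 * count    # increment after loading

mem = {}
sp = stmfd(mem, 0x2000, [2, 3, 4, 5, 6, 7, 8, 9])   # sample contents of r2-r9
restored, sp_after = ldmfd(mem, sp, 8)
```

After the pop, the stack pointer is back at its initial value and the values come out in the original register order.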
Single register swap
The swap instruction allows a value in a register to be exchanged with a value in memory, doing both a load and a store operation in one instruction.
The principal use is to implement semaphores to ensure mutual exclusion on accesses to shared data structures in multi-processor systems.
SWP Rd, Rm, [Rn] ; Rd := mem32 [Rn]
                 ; mem32 [Rn] := Rm
Rd and Rm may be the same register: memory and register are exchanged.
Example
SWP r1, r1, [r0]
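The semaphore use can be illustrated with a small Python model. It is only a sketch: Python does not reproduce the atomicity that SWP provides in hardware, and the names swp, try_acquire, LOCKED and FREE, as well as the lock encoding, are assumptions made for the example.

```python
LOCKED, FREE = 1, 0   # hypothetical encoding of the lock word

def swp(mem, address, new_value):
    """Model SWP: return the old memory word and store new_value in its place."""
    old = mem[address]
    mem[address] = new_value
    return old

def try_acquire(mem, lock_addr):
    """Write LOCKED and report whether the lock was FREE beforehand."""
    return swp(mem, lock_addr, LOCKED) == FREE

def release(mem, lock_addr):
    """Mark the lock as free again."""
    mem[lock_addr] = FREE
```

Because the old value is read and the new one written in a single instruction, only one contender can observe FREE.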
Control flow instructions
They include:
• Branch instructions (unconditional and conditional)
• Branch and link instructions (to call subroutines).
Branch instruction
It performs an unconditional branch.
Example
B LABEL
…
LABEL …
Conditional branches
The branch is taken or not depending on the value of the condition codes.
Branch   Interpretation      Normal uses
B        Unconditional       Always take this branch
BAL      Always              Always take this branch
BEQ      Equal               Comparison equal or zero result
BNE      Not equal           Comparison not equal or non-zero result
BPL      Plus                Result positive or zero
BMI      Minus               Result minus or negative
BCC      Carry clear         Arithmetic operation did not give carry-out
BLO      Lower               Unsigned comparison gave lower
BCS      Carry set           Arithmetic operation gave carry-out
BHS      Higher or same      Unsigned comparison gave higher or same
BVC      Overflow clear      Signed integer operation; no overflow occurred
BVS      Overflow set        Signed integer operation; overflow occurred
BGT      Greater than        Signed integer comparison gave greater than
BGE      Greater or equal    Signed integer comparison gave greater or equal
BLT      Less than           Signed integer comparison gave less than
BLE      Less or equal       Signed integer comparison gave less than or equal
BHI      Higher              Unsigned comparison gave higher
BLS      Lower or same       Unsigned comparison gave lower or same
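Each condition is a fixed function of the N, Z, C and V flags. The Python sketch below states those functions explicitly and derives the flags the way CMP does (by computing a - b); the names cmp_flags, CONDITIONS and branch_taken are illustrative.

```python
def cmp_flags(a, b):
    """Flags set by CMP a, b (computes a - b) on 32-bit values."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    result = (a - b) & 0xFFFFFFFF
    return {
        "N": result >> 31,
        "Z": int(result == 0),
        "C": int(a >= b),                      # carry set means no borrow
        "V": ((a ^ b) & (a ^ result)) >> 31,   # signed overflow
    }

CONDITIONS = {
    "AL": lambda f: True,
    "EQ": lambda f: f["Z"] == 1,
    "NE": lambda f: f["Z"] == 0,
    "PL": lambda f: f["N"] == 0,
    "MI": lambda f: f["N"] == 1,
    "CC": lambda f: f["C"] == 0,
    "LO": lambda f: f["C"] == 0,
    "CS": lambda f: f["C"] == 1,
    "HS": lambda f: f["C"] == 1,
    "VC": lambda f: f["V"] == 0,
    "VS": lambda f: f["V"] == 1,
    "GT": lambda f: f["Z"] == 0 and f["N"] == f["V"],
    "GE": lambda f: f["N"] == f["V"],
    "LT": lambda f: f["N"] != f["V"],
    "LE": lambda f: f["Z"] == 1 or f["N"] != f["V"],
    "HI": lambda f: f["C"] == 1 and f["Z"] == 0,
    "LS": lambda f: f["C"] == 0 or f["Z"] == 1,
}

def branch_taken(cond, flags):
    """True if a branch with the given condition suffix would be taken."""
    return CONDITIONS[cond](flags)
```

The same bit pattern can compare differently as signed and unsigned: 0xFFFFFFFF against 1 satisfies both LT (signed, -1 < 1) and HI (unsigned).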
Conditional execution
All ARM instructions can be executed conditionally.
Example
        CMP   r0, #5
        BEQ   BYPASS
        ADD   r1, r1, r0
        SUB   r1, r1, r2
BYPASS  …
is equivalent to
        CMP   r0, #5
        ADDNE r1, r1, r0
        SUBNE r1, r1, r2
        …
Conditional execution (cont.)
; if ( (a == b) && (c == d) ) e++;
CMP   r0, r1
CMPEQ r2, r3
ADDEQ r4, r4, #1
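The three-instruction sequence can be traced in Python to see why it implements the && of the two comparisons. The function name run and the mapping of a, b, c, d, e onto r0 to r4 are assumptions made for the illustration; only the Z flag is modelled.

```python
def run(r0, r1, r2, r3, r4):
    """Trace CMP / CMPEQ / ADDEQ using only the Z flag."""
    z = (r0 == r1)      # CMP   r0, r1      : Z set when a == b
    if z:               # CMPEQ r2, r3      : executed only if Z is set,
        z = (r2 == r3)  #                     otherwise Z stays clear
    if z:               # ADDEQ r4, r4, #1  : executed only if Z is set
        r4 += 1
    return r4
```

If the first comparison fails, CMPEQ is skipped and Z remains clear, so ADDEQ is skipped as well.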
Branch and link
Supports the call to a subroutine.
The address of the following instruction is saved in the link register r14.
Therefore, the return operation can be performed by a simple MOV instruction.
Branch and link
Example
        BL    SUBR        ; branch to SUBR
        . . .
SUBR    . . .
        MOV   pc, r14     ; return: copy r14 into pc
Note that since the return address is held in a register, the subroutine should not call a further, nested, subroutine without first saving r14.
However, a subroutine that does not call another subroutine (a leaf subroutine) need not save r14, since it will not be overwritten.
Nested calls
When a nested procedure is called, r14 is pushed onto a stack in memory.
Since the subroutine will often also require some work registers, the old values in these registers can be saved at the same time using a store multiple instruction.
        BL    SUB1
        . . .
SUB1    STMFD r13!, {r0-r2, r14}   ; save work regs and link
        BL    SUB2
        . . .
        LDMFD r13!, {r0-r2, pc}    ; restore work regs and return
The ARM architecture
Several ARM processors have been developed and sold.
Core                                   Architecture
ARM1                                   v1
ARM2                                   v2
ARM2aS, ARM3                           v2a
ARM6, ARM600, ARM610                   v3
ARM7, ARM700, ARM710                   v3
ARM7TDMI, ARM710T, ARM720T, ARM740T    v4T
StrongARM, ARM8, ARM810                v4
ARM9TDMI, ARM920T, ARM940T             v4T
ARM9ES                                 v5TE
ARM10TDMI, ARM1020E                    v5TE
3-stage ARM
This architecture was employed up to ARM7.
The 3 stages are:
• Fetch
• Decode
• Execute.
Some instructions (e.g., those accessing memory) require more than 3 clock cycles to be executed.
Memory is accessed at most once per clock cycle.
Branch instructions flush and refill the pipeline.
Pipeline behavior
[Figure: pipeline timing diagram (instruction vs. time) for a sequence of five instructions: 1 ADD, 2 STR, 3 ADD, 4 ADD, 5 ADD. Each ADD goes through fetch, decode and execute; the STR goes through fetch, decode, address calculation and data transfer.]
Architecture
[Figure: 3-stage ARM datapath organization. Blocks shown: address register, incrementer, register bank (with the PC), multiply register, barrel shifter, ALU, data out register, data in register, instruction decode & control. Buses and ports shown: A[31:0], D[31:0], A bus, B bus, ALU bus.]
5-stage ARM
The new architecture was adopted starting from ARM9.
It uses separate data and code memories (i.e., caches).
The 5 stages are:
• Fetch
• Decode
• Execute
• Buffer/data
• Write-back.
The higher number of stages allows for a faster clock.
The Thumb Instruction Set
Some of the ARM processors (those with a T in the acronym) support the Thumb instruction set (together with the standard ARM instruction set).
In the Thumb instruction set:
• Instructions are encoded on 16 bits
• Instructions are less powerful
• There are fewer instructions.
As a result, encoding an algorithm in Thumb instructions:
• Requires more instructions, but less code memory
• Results in slower execution, but requires less power.
Thumb instructions are therefore used for low-cost, low-performance applications.
The T bit
The mechanism to switch to/from Thumb instructions is driven by the T bit in the CPSR:
• If T=1, the processor interprets the fetched code as a sequence of Thumb instructions
• If T=0, the processor interprets the fetched code as a sequence of usual ARM instructions.
The value of T can be changed via software.
Thumb implementation
The Thumb instruction set requires some additional logic to translate Thumb instructions into ARM instructions.
This operation is performed in the decode stage, without significant effects on performance.
[Figure: Thumb instruction decompressor organization. Blocks and signals shown: data in from memory, mux (select high or low half-word), Thumb decompressor, mux (select ARM or Thumb stream), instruction pipeline, ARM instruction decoder, immediate fields, B operand bus.]
Operating modes
The ARM processor may work in several modes:
• The user mode is the usual one
• Privileged modes are used to handle exceptions and supervisor calls.
The current operating mode is defined by the bottom five bits of the CPSR.
SPSR
Each privileged mode (except system mode) has associated with it a Saved Program Status Register (SPSR).
This register is used to save the state of the CPSR when the privileged mode is entered.
In this way the user state can be fully restored when the user process is resumed.
Operating modes (II)
CPSR[4:0]   Mode     Use                                          Registers
10000       User     Normal user code                             user
10001       FIQ      Processing fast interrupts                   _fiq
10010       IRQ      Processing standard interrupts               _irq
10011       SVC      Processing software interrupts (SWIs)        _svc
10111       Abort    Processing memory faults                     _abt
11011       Undef    Handling undefined instruction traps         _und
11111       System   Running privileged operating system tasks    user
I/O
Peripherals are accessed as memory-mapped devices.
Exceptions
Exceptions include interrupts (from the outside), traps and supervisor calls.
They may be categorized in 3 groups:
• Exceptions that are a direct effect of an instruction:
  • Software interrupts
  • Undefined instructions
  • Prefetch aborts (i.e., memory faults during fetch)
• Exceptions that are a side-effect of an instruction:
  • Data aborts (i.e., memory faults during a load/store data access)
• Exceptions generated externally:
  • Reset
  • IRQ
  • FIQ.
Exception priorities
If multiple exceptions arise at the same time, the following priorities are used (in decreasing order):
• Reset (highest priority)
• Data abort
• FIQ
• IRQ
• Prefetch abort
• SWI and undefined instruction.
Exceptions management
When an exception is served:
• PC and CPSR are saved in the proper registers (r14 and the SPSR of the exception mode)
• The operating mode is changed to the appropriate exception mode
• The PC is forced to a value between 0x00 and 0x1C, depending on the exception type.
Locations from 0x00 to 0x1C are called vector addresses, and usually contain branches to exception handlers.
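The priority rule and the vector lookup can be combined in a short Python sketch. The individual vector addresses below are the standard ARM assignment, which the slides do not list (they only give the 0x00 to 0x1C range); the names VECTORS, PRIORITY and next_exception are illustrative. SWI and undefined instruction share the lowest priority level and cannot occur together, so their relative order in the list is arbitrary.

```python
# Standard ARM vector addresses (0x14 is reserved and omitted here).
VECTORS = {
    "reset": 0x00,
    "undefined instruction": 0x04,
    "software interrupt": 0x08,
    "prefetch abort": 0x0C,
    "data abort": 0x10,
    "IRQ": 0x18,
    "FIQ": 0x1C,
}

# Decreasing priority, as on the Exception priorities slide.
PRIORITY = [
    "reset",
    "data abort",
    "FIQ",
    "IRQ",
    "prefetch abort",
    "software interrupt",
    "undefined instruction",
]

def next_exception(pending):
    """Return (name, vector address) of the pending exception served first."""
    name = min(pending, key=PRIORITY.index)
    return name, VECTORS[name]
```

For example, if an IRQ and a data abort are pending together, the data abort is served first and the PC is forced to 0x10.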
ARM system development
In order to support the development of systems based on ARM cores, the following features have been developed
• A memory interface
• A bus architecture
• A reference peripheral specification
• A debugging mechanism.
Memory interface
The memory bus interface signals include:
• A 32-bit address bus
• A 32-bit bidirectional data bus
• Some control signals: mreq, seq, r/w, b/w, wait, etc.
Bus architecture
ARM released a standard bus architecture (named AMBA, or Advanced Microcontroller Bus Architecture) to be used by developers of cores to be connected to ARM processors.
The AMBA specification includes 3 buses:
• The Advanced High-performance Bus (AHB): it is used to connect high-performance modules. It supports burst-mode data transfers and split transactions. All timing is referenced to a single clock edge.
Bus architecture (II)
• The Advanced System Bus (ASB): it is an older specification, superseded by AHB
• The Advanced Peripheral Bus (APB): it offers a simpler interface for low-performance peripherals. APB is generally used as a local secondary bus which appears as a slave module on the AHB.
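Typical AMBA-based system
[Figure: typical AMBA-based system. Modules on the AHB or ASB: ARM core/CPU, on-chip RAM, DMA controller, external bus interface, test interface controller. A bridge connects this bus to the APB, which hosts the parallel interface, the timer and the UART.]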
Bus arbitration
Arbitration is performed in a centralized way, using one pair of signals AREQx/AGNTx for each module connected to the AHB.
The policy implemented by the arbiter is not specified by the standard.
AMBA reference peripheral specification
A system developer who wishes to build a system that can more easily support an existing operating system should follow the ARM reference peripheral specification, which defines the following components:
• A memory map
• An interrupt controller
• A counter timer
• A reset controller.
Debugging mechanism
Debugging a SoC is particularly difficult, since the developer has no access to internal signals and the code is often stored in a ROM.
ARM provides a debug solution based on:
• An EmbeddedICE module, which can be programmed to halt the processor when a given instruction is executed
• The JTAG port, used for programming the EmbeddedICE module and accessing internal core elements
• An embedded trace macrocell, which allows tracing the values passing on the buses.
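Real-time debug system organization
[Figure: real-time debug system organization. Inside the system on chip: ARM core (data, address and control signals), EmbeddedICE, embedded trace macrocell, JTAG TAP. Outside the chip: EmbeddedICE controller on the JTAG port, trace port analyzer on the trace port, and the host system.]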
ARM CPU cores
In many cases, designers need not just a processor core, but a whole CPU, including caches, Memory Management Units, bus interface, etc.
Therefore, ARM delivers not only processor cores, but also CPU cores.
Example
The ARM710T CPU core is based on the ARM7TDMI processor core.
It also includes an 8 Kbyte code/data cache, an AMBA bus master unit, a write buffer and an MMU.
ARM710T
[Figure: ARM710T organization. Blocks shown: ARM7TDMI core with EmbeddedICE & JTAG, CP15, MMU, instruction & data cache, write buffer, AMBA interface. Signals shown: virtual address, physical address, instructions & data, AMBA address, AMBA data.]
Examples of ARM-based SoCs
ARM is very popular among SoC designers.
Ruby II
It is a chip to be used in portable communication devices.
It is produced by VLSI Technology, Inc. and delivered in 144- or 176-pin thin quad flat packs.
Ruby II architecture
[Figure: Ruby II block diagram. Labels shown: ARM core, 512 x 32 SRAM, counter/timers, interrupt controller, external interrupts (3), UART1, UART2, serial controller, serial FIFOs (16 x 8), high-speed serial i/f, I2C, PCMCIA host interface, host FIFOs (16 x 8), parallel interface 0, parallel i/f 1, 2, 3, 4, 8 data bits & control, external bus control, address (22), data (8/16/32), I/O mode select, clock control, clock.]