ARM Introduction & Instruction Set Architecture
Aleksandar Milenkovic
E-mail: [email protected]
Web: http://www.ece.uah.edu/~milenka

Outline
Ø ARM Architecture
Ø ARM Organization and Implementation
Ø ARM Instruction Set
Ø Thumb Instruction Set
Ø Architectural Support for System Development
Ø ARM Processor Cores
Ø Memory Hierarchy
Ø Architectural Support for Operating Systems
Ø ARM CPU Cores
Ø Embedded ARM Applications

ARM History
Ø ARM – Acorn RISC Machine (1983 – 1985)
§ Acorn Computers Limited, Cambridge, England
Ø ARM – Advanced RISC Machine (1990)
§ ARM Limited, 1990
§ ARM has been licensed to many semiconductor manufacturers

ARM’s visible registers
Ø User level
§ 15 GPRs, PC, CPSR (current program status register)
Ø Remaining registers are used for system-level programming and for handling exceptions

[Figure: ARM register organization by mode]
Mode        Registers                              Availability
user        r0–r14, r15 (PC), CPSR                 usable in user mode
fiq         r8_fiq–r14_fiq, SPSR_fiq               system modes only
svc         r13_svc, r14_svc, SPSR_svc             system modes only
abort       r13_abt, r14_abt, SPSR_abt             system modes only
irq         r13_irq, r14_irq, SPSR_irq             system modes only
undefined   r13_und, r14_und, SPSR_und             system modes only

ARM CPSR format
Ø N (Negative), Z (Zero), C (Carry), V (oVerflow)
Ø mode – control processor mode
Ø T – control instruction set
§ T = 1 – instruction stream is 16-bit Thumb instructions
§ T = 0 – instruction stream is 32-bit ARM instructions
Ø I F – interrupt enables

Bits:   31 30 29 28 | 27 ... 8 | 7 6 5 | 4 ... 0
Field:  N  Z  C  V  | unused   | I F T | mode
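
The CPSR layout can be made concrete with a small field decoder. This is a minimal Python sketch based only on the bit positions given on the slide (N, Z, C, V in bits 31 to 28; I, F, T in bits 7, 6, 5; mode in bits 4 to 0); the function name `decode_cpsr` is illustrative and not part of the lecture.

```python
def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into the fields shown on the slide."""
    return {
        "N": (cpsr >> 31) & 1,   # negative
        "Z": (cpsr >> 30) & 1,   # zero
        "C": (cpsr >> 29) & 1,   # carry
        "V": (cpsr >> 28) & 1,   # overflow
        "I": (cpsr >> 7) & 1,    # IRQ control bit
        "F": (cpsr >> 6) & 1,    # FIQ control bit
        "T": (cpsr >> 5) & 1,    # 1 = Thumb, 0 = ARM
        "mode": cpsr & 0x1F,     # 5-bit processor mode
    }


if __name__ == "__main__":
    print(decode_cpsr(0x600000D3))
```

Bits 27 to 8 are marked unused on the slide, so the sketch ignores them.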

ARM memory organization
Ø Linear array of bytes numbered from 0 to 2^32 – 1
Ø Data items
§ bytes (8 bits)
§ half-words (16 bits) – always aligned to 2-byte boundaries (start at an even byte address)
§ words (32 bits) – always aligned to 4-byte boundaries (start at a byte address which is a multiple of 4)

[Figure: memory map from bit 31 to bit 0, byte addresses 0–23 in rows of four bytes; example items: byte0–byte3 at addresses 0–3, half-word4, byte6, word8, half-word12, half-word14, word16]

ARM instruction set
Ø Load-store architecture
§ operands are in GPRs
§ load/store – the only instructions that operate on memory
Ø Instructions
§ Data Processing – use and change only register values
§ Data Transfer – copy memory values into registers (load) or copy register values into memory (store)
§ Control Flow
o branch
o branch-and-link – save return address to resume the original sequence
o trapping into system code – supervisor calls

ARM instruction set (cont’d)
Ø Three-address data processing instructions
Ø Conditional execution of every instruction
Ø Powerful load/store multiple register instructions
Ø Ability to perform a general shift operation and a general ALU operation in a single instruction that executes in a single clock cycle
Ø Open instruction set extension through coprocessor instruction set, including adding new registers and data types to the programmer’s model
Ø Very dense 16-bit compressed representation of the instruction set in the Thumb architecture

I/O system
Ø I/O is memory mapped
§ internal registers of peripherals (disk controllers, network interfaces, etc.) are addressable locations within the ARM’s memory map and may be read and written using the load-store instructions
Ø Peripherals may use either the normal interrupt (IRQ) or fast interrupt (FIQ) input
§ normally most interrupt sources share the IRQ input, while just one or two time-critical sources are connected to the FIQ input
Ø Some systems may include external DMA hardware to handle high-bandwidth I/O traffic

ARM exceptions
Ø ARM supports a range of interrupts, traps, and supervisor calls – all are grouped under the general heading of exceptions
Ø Handling exceptions
§ current state is saved by copying the PC into r14_exc and CPSR into SPSR_exc (exc stands for exception type)
§ processor operating mode is changed to the appropriate exception mode
§ PC is forced to a value between 0x00 and 0x1C, the particular value depending on the type of exception
§ the instruction at the location the PC is forced to (the vector address) usually contains a branch to the exception handler; the exception handler will use r13_exc, which is normally initialized to point to a dedicated stack in memory, to save some user registers
§ return: restore the user registers and then restore PC and CPSR atomically
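
The entry sequence above can be sketched as a state transformation. This is a simplified Python model, not the hardware behaviour in full: the vector addresses are the standard ARM ones (all within 0x00 to 0x1C), while the dictionary layout, the string mode names and `enter_exception` are assumptions made for illustration. Real cores also adjust the saved PC by an exception-dependent offset and update the interrupt bits, which the sketch leaves out.

```python
# Standard ARM exception vectors (reset at 0x00 is omitted: it saves no state).
VECTORS = {
    "undefined": 0x04,
    "swi": 0x08,
    "prefetch_abort": 0x0C,
    "data_abort": 0x10,
    "irq": 0x18,
    "fiq": 0x1C,
}

# Register bank (the "exc" suffix) used by each exception type.
BANKS = {
    "undefined": "und",
    "swi": "svc",
    "prefetch_abort": "abt",
    "data_abort": "abt",
    "irq": "irq",
    "fiq": "fiq",
}


def enter_exception(state, kind):
    """Return the processor state after taking exception `kind`."""
    bank = BANKS[kind]
    new_state = dict(state)
    new_state["r14_" + bank] = state["pc"]      # save the PC into r14_exc
    new_state["SPSR_" + bank] = state["CPSR"]   # save the CPSR into SPSR_exc
    new_state["mode"] = bank                    # switch to the exception mode
    new_state["pc"] = VECTORS[kind]             # force the PC to the vector
    return new_state


if __name__ == "__main__":
    before = {"pc": 0x8000, "CPSR": 0x10, "mode": "user"}
    print(enter_exception(before, "irq"))
```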

ARM cross-development toolkit
Ø Software development
§ tools developed by ARM Limited
§ public domain tools (ARM back end for gcc C compiler)
Ø Cross-development
§ tools run on a different architecture from the one for which they produce code

[Figure: toolkit structure. C source (C compiler) and asm source (assembler) produce .aof object files; the linker combines them with C libraries and object libraries into an .axf image; ARMsd debugs the image on the ARMulator system model or on a development board]

Outline
Ø ARM Architecture
Ø ARM Assembly Language Programming
Ø ARM Organization and Implementation
Ø ARM Instruction Set
Ø Architectural Support for High-level Languages
Ø Thumb Instruction Set
Ø Architectural Support for System Development
Ø ARM Processor Cores
Ø Memory Hierarchy
Ø Architectural Support for Operating Systems
Ø ARM CPU Cores
Ø Embedded ARM Applications

ARM Instruction Set
Ø Data Processing Instructions
Ø Data Transfer Instructions
Ø Control flow Instructions

Data Processing Instructions
Ø Classes of data processing instructions
§ Arithmetic operations
§ Bit-wise logical operations
§ Register-movement operations
§ Comparison operations
Ø Operands: 32 bits wide; there are 3 ways to specify operands
§ come from registers
§ the second operand may be a constant (immediate)
§ shifted register operand
Ø Result: 32 bits wide, placed in a register
§ long multiply produces a 64-bit result

Data Processing Instructions (cont’d)

Arithmetic Operations
    ADD r0, r1, r2    ; r0 := r1 + r2
    ADC r0, r1, r2    ; r0 := r1 + r2 + C
    SUB r0, r1, r2    ; r0 := r1 - r2
    SBC r0, r1, r2    ; r0 := r1 - r2 + C - 1
    RSB r0, r1, r2    ; r0 := r2 - r1
    RSC r0, r1, r2    ; r0 := r2 - r1 + C - 1

Bit-wise Logical Operations
    AND r0, r1, r2    ; r0 := r1 and r2
    ORR r0, r1, r2    ; r0 := r1 or r2
    EOR r0, r1, r2    ; r0 := r1 xor r2
    BIC r0, r1, r2    ; r0 := r1 and (not r2)

Register Movement
    MOV r0, r2        ; r0 := r2
    MVN r0, r2        ; r0 := not r2

Comparison Operations
    CMP r1, r2        ; set cc on r1 - r2
    CMN r1, r2        ; set cc on r1 + r2
    TST r1, r2        ; set cc on r1 and r2
    TEQ r1, r2        ; set cc on r1 xor r2

Data Processing Instructions (cont’d)
Ø Immediate operands: immediate = (0->255) x 2^(2n), 0 <= n <= 12
    ADD r3, r3, #3          ; r3 := r3 + 3
    AND r8, r7, #&ff        ; r8 := r7[7:0], & for hex
Ø Shifted register operands
§ the second operand is subject to a shift operation before it is combined with the first operand
    ADD r3, r2, r1, LSL #3  ; r3 := r2 + 8 x r1
    ADD r5, r5, r3, LSL r2  ; r5 := r5 + 2^r2 x r3
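
Whether a constant fits the immediate rule can be checked mechanically. This is a minimal Python sketch, assuming the usual ARM encoding of an 8-bit value rotated right by an even amount (2 x rot, with a 4-bit rot field); this covers the slide's (0->255) x 2^(2n) values and also constants that wrap around the word boundary. The function names are illustrative and not from the lecture.

```python
MASK32 = 0xFFFFFFFF


def rotate_left32(value, amount):
    """Rotate a 32-bit value left by `amount` bits."""
    amount %= 32
    value &= MASK32
    return ((value << amount) | (value >> (32 - amount))) & MASK32


def encode_arm_immediate(value):
    """Return (imm8, rot) such that value == imm8 rotated right by 2*rot,
    or None if the constant cannot be encoded as an immediate."""
    value &= MASK32
    for rot in range(16):
        imm8 = rotate_left32(value, 2 * rot)   # undo a right rotation by 2*rot
        if imm8 <= 0xFF:
            return imm8, rot
    return None


if __name__ == "__main__":
    for constant in (3, 0xFF, 0xFF000000, 0x101):
        print(hex(constant), encode_arm_immediate(constant))
```

A constant such as 0x101 needs nine significant bits, so it has no encoding and would have to be built in two instructions or loaded from memory.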
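
ARM shift operations
Ø LSL – Logical Shift Left
Ø LSR – Logical Shift Right
Ø ASR – Arithmetic Shift Right
Ø ROR – Rotate Right
Ø RRX – Rotate Right Extended by 1 place

[Figure: shift operations on a 32-bit word (bit 31 to bit 0). LSL #5: low five bits filled with zeros. LSR #5: high five bits filled with zeros. ASR #5, negative operand: high bits filled with ones. ASR #5, positive operand: high bits filled with zeros. ROR #5: bits shifted out on the right re-enter on the left. RRX: rotate right by one place through the carry flag C]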
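
The five shift types can be modelled on 32-bit integers. This is a minimal Python sketch for shift amounts between 0 and 31; the function names simply mirror the mnemonics, and `rrx` returns the new carry as a second value, which is a modelling choice and not ARM syntax.

```python
MASK32 = 0xFFFFFFFF


def lsl(x, n):
    """Logical shift left: zeros enter at bit 0."""
    return (x << n) & MASK32


def lsr(x, n):
    """Logical shift right: zeros enter at bit 31."""
    return (x & MASK32) >> n


def asr(x, n):
    """Arithmetic shift right: copies of the sign bit enter at bit 31."""
    x &= MASK32
    if x & 0x80000000:
        x -= 1 << 32              # reinterpret as a negative number
    return (x >> n) & MASK32


def ror(x, n):
    """Rotate right: bits leaving bit 0 re-enter at bit 31."""
    n %= 32
    x &= MASK32
    return ((x >> n) | (x << (32 - n))) & MASK32


def rrx(x, carry_in):
    """Rotate right extended by one place through the carry flag.
    Returns (result, carry_out)."""
    x &= MASK32
    return ((carry_in & 1) << 31) | (x >> 1), x & 1


if __name__ == "__main__":
    print(hex(asr(0x80000000, 5)), hex(ror(0x1F, 5)), rrx(1, 1))
```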

Setting the condition codes
Ø Any DPI can set the condition codes (N, Z, V, and C)
§ for all DPIs except the comparison operations a specific request must be made
§ at the assembly language level this request is indicated by adding an `S` to the opcode
§ Example: 64-bit addition on register pairs, [r3,r2] := [r1,r0] + [r3,r2]
    ADDS r2, r2, r0 ; carry out to C
    ADC  r3, r3, r1 ; ... add into high word
Ø Arithmetic operations set all the flags (N, Z, C, and V)
Ø Logical and move operations set N and Z
§ preserve V and either preserve C when there is no shift operation, or set C according to the shift operation (fall off bit)
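
The ADDS/ADC pair implements a 64-bit addition on two 32-bit halves. This is a minimal Python sketch of that carry chain; the helper names `adds`, `adc` and `add64` are illustrative, and only the C flag is modelled.

```python
MASK32 = 0xFFFFFFFF


def adds(a, b):
    """ADDS: 32-bit add that also returns the carry-out (C flag)."""
    total = (a & MASK32) + (b & MASK32)
    return total & MASK32, total >> 32


def adc(a, b, carry):
    """ADC: 32-bit add that includes the incoming carry."""
    return ((a & MASK32) + (b & MASK32) + carry) & MASK32


def add64(low_a, high_a, low_b, high_b):
    """Add two 64-bit values held as (low, high) register pairs."""
    low, carry = adds(low_a, low_b)      # ADDS r2, r2, r0 ; carry out to C
    high = adc(high_a, high_b, carry)    # ADC  r3, r3, r1 ; add into high word
    return low, high


if __name__ == "__main__":
    print(add64(0xFFFFFFFF, 0, 1, 0))
```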

Multiplies
Ø Example (Multiply, Multiply-Accumulate)
    MUL r4, r3, r2      ; r4 := [r3 x r2] <31:0>
    MLA r4, r3, r2, r1  ; r4 := [r3 x r2 + r1] <31:0>
Ø Note
§ least significant 32 bits are placed in the result register, the rest are ignored
§ immediate second operand is not supported
§ result register must not be the same as the first source register
§ if the `S` bit is set, V is preserved and C is rendered meaningless
Ø Example (r0 = r0 x 35)
    ADD r0, r0, r0, LSL #2 ; r0’ = r0 x 5
    RSB r0, r0, r0, LSL #3 ; r0’’ = 7 x r0’
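
The multiply-by-35 example can be checked with a short model. This Python sketch assumes the intended sequence is `ADD r0, r0, r0, LSL #2` followed by `RSB r0, r0, r0, LSL #3`, which is what the comments r0’ = r0 x 5 and r0’’ = 7 x r0’ describe; `times35` is an illustrative name.

```python
MASK32 = 0xFFFFFFFF


def times35(r0):
    """Multiply by 35 using two shift-and-add style instructions."""
    r0 = (r0 + (r0 << 2)) & MASK32   # ADD r0, r0, r0, LSL #2 ; r0' = 5 x r0
    r0 = ((r0 << 3) - r0) & MASK32   # RSB r0, r0, r0, LSL #3 ; r0'' = 7 x r0'
    return r0


if __name__ == "__main__":
    print(times35(3))
```

As with MUL, only the least significant 32 bits of the product are kept.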

Data transfer instructions
Ø Single register load and store instructions
§ transfer of a data item (byte, half-word, word) between ARM registers and memory
Ø Multiple register load and store instructions
§ enable transfer of large quantities of data
§ used for procedure entry and exit, to save/restore workspace registers, to copy blocks of data around memory
Ø Single register swap instructions
§ allow exchange between a register and memory in one instruction
§ used to implement semaphores to ensure mutual exclusion on accesses to shared data in multis

Data Transfer Instructions (cont’d)
Single register load and store

Register-indirect addressing
    LDR r0, [r1]        ; r0 := mem32[r1]
    STR r0, [r1]        ; mem32[r1] := r0
Note: r1 keeps a word address (2 LSBs are 0)
    LDRB r0, [r1]       ; r0 := mem8[r1]
Note: no restrictions for r1

Base+offset addressing (offset of up to 4 Kbytes)
    LDR r0, [r1, #4]    ; r0 := mem32[r1 + 4]

Auto-indexing addressing
    LDR r0, [r1, #4]!   ; r0 := mem32[r1 + 4]
                        ; r1 := r1 + 4

Post-indexed addressing
    LDR r0, [r1], #4    ; r0 := mem32[r1]
                        ; r1 := r1 + 4
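
The four addressing forms differ only in which address is used and whether the base register is updated. This is a minimal Python sketch using dictionaries for the register file and memory; the `ldr` helper and its `writeback` and `post` parameters are modelling assumptions, not assembler syntax.

```python
def ldr(regs, mem, rd, rn, offset=0, writeback=False, post=False):
    """Model LDR rd, [rn], LDR rd, [rn, #offset], LDR rd, [rn, #offset]!
    and LDR rd, [rn], #offset on dict-based registers and memory."""
    base = regs[rn]
    address = base if post else base + offset   # post-indexed uses the old base
    regs[rd] = mem[address]
    if post or writeback:
        regs[rn] = base + offset                # base register is updated


if __name__ == "__main__":
    mem = {0x100: 11, 0x104: 22}
    regs = {"r0": 0, "r1": 0x100}
    ldr(regs, mem, "r0", "r1", offset=4, post=True)   # LDR r0, [r1], #4
    print(regs)
```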

Data Transfer Instructions (cont’d)

COPY:   ADR r1, TABLE1      ; r1 points to TABLE1
        ADR r2, TABLE2      ; r2 points to TABLE2
LOOP:   LDR r0, [r1]
        STR r0, [r2]
        ADD r1, r1, #4
        ADD r2, r2, #4
        ...
TABLE1: ...
TABLE2: ...

COPY:   ADR r1, TABLE1      ; r1 points to TABLE1
        ADR r2, TABLE2      ; r2 points to TABLE2
LOOP:   LDR r0, [r1], #4
        STR r0, [r2], #4
        ...
TABLE1: ...
TABLE2: ...

Data Transfer Instructions
Multiple register data transfers
    LDMIA r1, {r0, r2, r5}  ; r0 := mem32[r1]
                            ; r2 := mem32[r1 + 4]
                            ; r5 := mem32[r1 + 8]
Note: any subset (or all) of the registers may be transferred with a single instruction
Note: the order of registers within the list is insignificant
Note: including r15 in the list will cause a change in the control flow

Ø Block copy view
§ data is to be stored above or below the address held in the base register
§ address incrementing or decrementing begins before or after storing the first value
Ø Stack organizations
§ FA – full ascending
§ EA – empty ascending
§ FD – full descending
§ ED – empty descending

Multiple register transfer addressing modes
[Figure: four panels showing memory between 0x1000 and 0x1018, the locations where r0, r1 and r5 are stored, and the base register before (r9) and after (r9’) each instruction: STMIA r9!, {r0,r1,r5}; STMIB r9!, {r0,r1,r5}; STMDA r9!, {r0,r1,r5}; STMDB r9!, {r0,r1,r5}]
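
The four store-multiple modes can be compared by computing where each register lands. This is a minimal Python sketch; it relies on the ARM rule that the lowest-numbered register always goes to the lowest address, and the starting base value 0x100c in the usage example is an assumption chosen to fit the address range shown in the figure. The `stm` helper is illustrative.

```python
def stm(mode, base, registers):
    """Return ({address: register_number}, new_base) for STM<mode> base!, {...}.

    mode is one of "IA", "IB", "DA", "DB"; registers are register numbers.
    """
    regs = sorted(registers)          # lowest register -> lowest address
    size = 4 * len(regs)
    if mode == "IA":                  # increment after
        start, new_base = base, base + size
    elif mode == "IB":                # increment before
        start, new_base = base + 4, base + size
    elif mode == "DA":                # decrement after
        start, new_base = base - size + 4, base - size
    elif mode == "DB":                # decrement before
        start, new_base = base - size, base - size
    else:
        raise ValueError("unknown mode: " + mode)
    layout = {start + 4 * i: reg for i, reg in enumerate(regs)}
    return layout, new_base


if __name__ == "__main__":
    for mode in ("IA", "IB", "DA", "DB"):
        layout, new_base = stm(mode, 0x100C, [0, 1, 5])
        print(mode, {hex(a): "r%d" % r for a, r in layout.items()}, hex(new_base))
```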

The mapping between the stack and block copy views

                      Ascending                   Descending
                      Full          Empty         Full          Empty
Increment  Before     STMIB/STMFA                               LDMIB/LDMED
           After                    STMIA/STMEA   LDMIA/LDMFD
Decrement  Before                   LDMDB/LDMEA   STMDB/STMFD
           After      LDMDA/LDMFA                               STMDA/STMED

Control flow instructions

Branch     Interpretation     Normal uses
B, BAL     Unconditional,     Always take this branch
           Always
BEQ        Equal              Comparison equal or zero result
BNE        Not equal          Comparison not equal or non-zero result
BPL        Plus               Result positive or zero
BMI        Minus              Result minus or negative
BCC, BLO   Carry clear,       Arithmetic operation did not give carry-out;
           Lower              unsigned comparison gave lower
BCS, BHS   Carry set,         Arithmetic operation gave carry-out;
           Higher or same     unsigned comparison gave higher or same
BVC        Overflow clear     Signed integer operation; no overflow occurred
BVS        Overflow set       Signed integer operation; overflow occurred
BGT        Greater than       Signed integer comparison gave greater than
BGE        Greater or equal   Signed integer comparison gave greater or equal
BLT        Less than          Signed integer comparison gave less than
BLE        Less or equal      Signed integer comparison gave less than or equal
BHI        Higher             Unsigned comparison gave higher
BLS        Lower or same      Unsigned comparison gave lower or same

Conditional execution
Ø Conditional execution to avoid branch instructions used to skip a small number of non-branch instructions
Ø Example
        CMP r0, #5          ;
        BEQ BYPASS          ; if (r0!=5) {
        ADD r1, r1, r0      ;   r1:=r1+r0-r2
        SUB r1, r1, r2      ; }
BYPASS: ...

With conditional execution
        CMP r0, #5          ;
        ADDNE r1, r1, r0    ;
        SUBNE r1, r1, r2    ;
        ...
Note: add the 2-letter condition after the 3-letter opcode

; if ((a==b) && (c==d)) e++;
        CMP r0, r1
        CMPEQ r2, r3
        ADDEQ r4, r4, #1
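
Each condition suffix is a predicate on the N, Z, C and V flags. This is a minimal Python sketch that computes the flags CMP would set and evaluates the standard ARM condition tests, then replays the CMP / CMPEQ / ADDEQ sequence; the names `cmp_flags`, `CONDITIONS` and `increment_if_both_equal` are illustrative.

```python
MASK32 = 0xFFFFFFFF


def cmp_flags(a, b):
    """Flags (N, Z, C, V) set by CMP a, b, which computes a - b."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    n = result >> 31
    z = int(result == 0)
    c = int(a >= b)                          # C set means no borrow
    v = ((a ^ b) & (a ^ result)) >> 31       # signed overflow on subtraction
    return n, z, c, v


CONDITIONS = {
    "EQ": lambda n, z, c, v: z == 1,
    "NE": lambda n, z, c, v: z == 0,
    "CS": lambda n, z, c, v: c == 1,         # also written HS
    "CC": lambda n, z, c, v: c == 0,         # also written LO
    "MI": lambda n, z, c, v: n == 1,
    "PL": lambda n, z, c, v: n == 0,
    "VS": lambda n, z, c, v: v == 1,
    "VC": lambda n, z, c, v: v == 0,
    "HI": lambda n, z, c, v: c == 1 and z == 0,
    "LS": lambda n, z, c, v: c == 0 or z == 1,
    "GE": lambda n, z, c, v: n == v,
    "LT": lambda n, z, c, v: n != v,
    "GT": lambda n, z, c, v: z == 0 and n == v,
    "LE": lambda n, z, c, v: z == 1 or n != v,
    "AL": lambda n, z, c, v: True,
}


def increment_if_both_equal(a, b, c, d, e):
    """if ((a==b) && (c==d)) e++; using conditional execution."""
    flags = cmp_flags(a, b)                  # CMP   r0, r1
    if CONDITIONS["EQ"](*flags):
        flags = cmp_flags(c, d)              # CMPEQ r2, r3
    if CONDITIONS["EQ"](*flags):
        e = (e + 1) & MASK32                 # ADDEQ r4, r4, #1
    return e


if __name__ == "__main__":
    print(increment_if_both_equal(1, 1, 2, 2, 0))
```

When the first comparison fails, CMPEQ is skipped and the flags still say "not equal", so ADDEQ is skipped as well; no branch is needed.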

Branch and link instructions
Ø Branch to subroutine (r14 serves as a link register)
        BL SUBR             ; branch to SUBR
        ..                  ; return here
SUBR:   ..                  ; SUBR entry point
        MOV pc, r14         ; return

Ø Nested subroutines
        BL SUB1
        ..
SUB1:                       ; save work and link register
        STMFD r13!, {r0-r2,r14}
        BL SUB2
        ..
        LDMFD r13!, {r0-r2,pc}
SUB2:   ..
        MOV pc, r14         ; copy r14 into r15

Supervisor calls
Ø Supervisor is a program which operates at a privileged level – it can do things that a user-level program cannot do directly
§ Example: send text to the display
Ø ARM ISA includes SWI (SoftWare Interrupt)
        ; output r0[7:0]
        SWI SWI_WriteC
        ; return from a user program back to monitor
        SWI SWI_Exit

Jump tables
Ø Call one of a set of subroutines depending on a value computed by the program
        BL JTAB
        ...
JTAB:   CMP r0, #0
        BEQ SUB0
        CMP r0, #1
        BEQ SUB1
        CMP r0, #2
        BEQ SUB2
Note: slow when the list is long, and all subroutines are equally frequent

        BL JTAB
        ...
JTAB:   ADR r1, SUBTAB
        CMP r0, #SUBMAX             ; overrun?
        LDRLS pc, [r1, r0, LSL #2]
        B ERROR
SUBTAB: DCD SUB0
        DCD SUB1
        DCD SUB2
        ...

Hello ARM World!
        AREA HelloW, CODE, READONLY ; declare code area
SWI_WriteC EQU &0               ; output character in r0
SWI_Exit   EQU &11              ; finish program
        ENTRY                   ; code entry point
START:  ADR r1, TEXT            ; r1 <- Hello ARM World!
LOOP:   LDRB r0, [r1], #1       ; get the next byte
        CMP r0, #0              ; check for text end
        SWINE SWI_WriteC        ; if not end of string, print
        BNE LOOP
        SWI SWI_Exit            ; end of execution
TEXT    = "Hello ARM World!", &0a, &0d, 0
        END

ARM Organization and Implementation

Outline
Ø ARM Architecture
Ø ARM Organization and Implementation
Ø ARM Instruction Set
Ø Architectural Support for High-level Languages
Ø Thumb Instruction Set
Ø Architectural Support for System Development
Ø ARM Processor Cores
Ø Memory Hierarchy
Ø Architectural Support for Operating Systems
Ø ARM CPU Cores
Ø Embedded ARM Applications

ARM organization
Ø Register file
§ 2 read ports, 1 write port + 1 read, 1 write port reserved for r15 (pc)
Ø Barrel shifter – shift or rotate one operand by any number of bits
Ø ALU – performs the arithmetic and logic functions required
Ø Memory address register + incrementer
Ø Memory data registers
Ø Instruction decoder and associated control logic

[Figure: ARM datapath organization. Blocks: address register, incrementer, register bank (with PC), multiply register, barrel shifter, ALU, data out register, data in register, instruction decode & control. Buses: A bus, B bus, ALU bus, A[31:0], D[31:0]]

Three-stage pipeline
Ø Fetch
§ the instruction is fetched from memory and placed in the instruction pipeline
Ø Decode
§ the instruction is decoded and the datapath control signals prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath
Ø Execute
§ the instruction owns the datapath; the register bank is read, an operand shifted, the ALU result generated and written back into a destination register

ARM single-cycle instruction pipeline
[Figure: instructions 1, 2 and 3 plotted against time; each passes through fetch, decode and execute, starting one cycle after the previous instruction]

ARM single-cycle instruction pipeline
[Figure: add r0,r1,#5; sub r2,r3,r6; cmp r2,#3 plotted against time; while add executes, sub is decoded and cmp is fetched, so execute add, execute sub and execute cmp fall in consecutive cycles]
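
The overlap in the figure can be tabulated. This is a minimal Python sketch of the ideal three-stage schedule for single-cycle instructions, with no stalls; `pipeline_schedule` is an illustrative name and cycles are numbered from 1.

```python
def pipeline_schedule(instructions):
    """Cycle in which each single-cycle instruction is fetched, decoded
    and executed in an ideal 3-stage pipeline."""
    schedule = {}
    for index, instruction in enumerate(instructions):
        schedule[instruction] = {
            "fetch": index + 1,
            "decode": index + 2,
            "execute": index + 3,
        }
    return schedule


if __name__ == "__main__":
    program = ["add r0,r1,#5", "sub r2,r3,r6", "cmp r2,#3"]
    for name, stages in pipeline_schedule(program).items():
        print(name, stages)
```

Once the pipeline is full, one instruction completes per cycle even though each takes three cycles from fetch to execute.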

ARM multi-cycle instruction pipeline
[Figure: instructions 1 to 5 (ADD, STR, ADD, ADD, ADD) plotted against time; each ADD uses fetch, decode, execute, while STR uses fetch, decode, calc. addr., data xfer, which delays the instructions behind it]
Decode logic is always generating the control signals for the datapath to use in the next cycle

ARM multi-cycle LDMIA (load multiple) instruction
[Figure: ldmia r0,{r2,r3}; sub r2,r3,r6; cmp r2,#3 plotted against time; ldmia uses fetch, decode, ex ld r2, ex ld r3; sub is fetched, then decoded late, then ex sub; cmp follows with fetch, decode, ex cmp]
Ø Decode stage occupied since ldmia must continue to remember the decoded instruction
Ø sub fetched at normal time but not decoded until LDMIA is finishing
Ø Instruction delayed

Control stalls: due to branches
Ø Branches often introduce stalls (branch penalty)
§ Stall time may depend on whether branch is taken
Ø May have to squash instructions that already started executing
Ø Don’t know what to fetch until condition is evaluated

ARM pipelined branch
[Figure: bne foo; sub r2,r3,r6; foo add r0,r1,r2 plotted against time; bne uses fetch, decode and three ex bne cycles; sub is fetched and decoded but discarded; add at foo is then fetched, decoded and executed]
Ø Decision not made until the third clock cycle
Ø Two cycles of work thrown away if bne is taken

Pipeline: how it works
Ø All instructions occupy the datapath for one or more adjacent cycles
Ø For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle
Ø During the first datapath cycle each instruction issues a fetch for the next instruction but one
Ø Branch instructions flush and refill the instruction pipeline

ARM9TDMI 5-stage pipeline
[Figure: ARM9TDMI datapath by stage. fetch: I-cache, next pc (+4). instruction decode: I decode, register read, immediate fields. execute: reg shift, shift, mul, ALU, forwarding paths, pre-index and post-index load/store address. buffer/data: D-cache, byte repl., rot/sgn ex. write-back: register write. PC sources into the mux: pc + 4, pc+8 (r15), B, BL, MOV pc, SUBS pc, LDR pc, LDM/STM]
Ø Fetch
Ø Decode
§ instruction is decoded
§ register operands read (3 read ports)
Ø Execute
§ an operand is shifted and the ALU result generated, or
§ address is computed
Ø Buffer/data
§ data memory is accessed (load, store)
Ø Write-back
§ write to register file

ARM9TDMI Data Forwarding
[Figure: ARM9TDMI 5-stage pipeline datapath with forwarding paths]

Data Forwarding
        ADD r3, r2, r1, LSL #3  ; r3 := r2 + 8 x r1
        ADD r5, r5, r3, LSL r2  ; r5 := r5 + 2^r2 x r3

        ADD r3, r2, r1, LSL #3  ; r3 := r2 + 8 x r1
        ADD r8, r9, r10         ; r8 := r9 + r10
        ADD r5, r5, r3, LSL r2  ; r5 := r5 + 2^r2 x r3

Stall?
        LDR r3, [r2]            ; r3 := mem[r2]
        ADD r1, r2, r3          ; r1 := r2 + r3
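
The "Stall?" question can be explored with a small dependency check. This Python sketch rests on a simplifying assumption that is not stated on the slide: ALU results are forwarded in time for the next instruction, while a loaded value only becomes available after the buffer/data stage, so using it in the very next instruction costs one stall cycle. The tuple format `(op, dest, sources)` and the name `load_use_stalls` are illustrative.

```python
def load_use_stalls(program):
    """Count stall cycles for a list of (op, dest, sources) tuples.

    Assumption: only a load immediately followed by a consumer of the
    loaded register stalls (one cycle); everything else is forwarded.
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_dest, _ = previous
        _, _, cur_sources = current
        if prev_op == "LDR" and prev_dest in cur_sources:
            stalls += 1
    return stalls


if __name__ == "__main__":
    forwarded = [("ADD", "r3", ("r2", "r1")), ("ADD", "r5", ("r5", "r3", "r2"))]
    load_use = [("LDR", "r3", ("r2",)), ("ADD", "r1", ("r2", "r3"))]
    print(load_use_stalls(forwarded), load_use_stalls(load_use))
```

Under this assumption the two ADD sequences need no stall, and the LDR followed by ADD needs one; the model ignores any extra cycles a real core may spend on register-specified shifts.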

ARM9TDMI PC generation
[Figure: ARM9TDMI 5-stage pipeline datapath, highlighting the PC sources: pc + 4, pc+8 (r15), B, BL, MOV pc, SUBS pc, LDR pc, LDM/STM]
Ø 3-stage pipeline
§ PC behavior: operands are read in the execution stage, r15 = PC + 8
Ø 5-stage pipeline
§ operands are read in the decode stage and r15 = PC + 4?
§ incompatibilities between 3-stage and 5-stage implementations => unacceptable
§ to avoid this, 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs

Data processing instruction datapath activity (Ex)
[Figure: datapath activity for (a) register – register operations (Rn and Rm read, Rd written) and (b) register – immediate operations (Rn read, immediate from instruction bits [7:0], Rd written); shifter and ALU operate as the instruction specifies]
Ø Reg-Reg
§ Rd = Rn op Rm
§ r15 = AR + 4; AR = AR + 4
Ø Reg-Imm
§ Rd = Rn op Imm
§ r15 = AR + 4; AR = AR + 4

STR (store register) datapath activity (Ex1, Ex2)
[Figure: (a) 1st cycle – compute address: Rn combined with the offset from instruction bits [11:0] (lsl #0), ALU = A / A + B / A – B; (b) 2nd cycle – store data & auto-index: Rd sent to data out (byte?), ALU = A + B / A – B]
Ø Compute address (Ex1)
§ AR = Rn op Disp
§ r15 = AR + 4
Ø Store data (Ex2)
§ AR = PC
§ mem[AR] = Rd<x:y>
§ If autoindexing => Rn = Rn +/- 4

The first two (of three) cycles of a branch instruction
[Figure: (a) 1st cycle – compute branch target: PC plus the offset from instruction bits [23:0] (lsl #2), ALU = A + B; (b) 2nd cycle – save return address: PC copied to R14, ALU = A]
Ø Compute target address
§ AR = PC + Disp, lsl #2
Ø Save return address (if required)
§ r14 = PC
§ AR = AR + 4
Ø Third cycle: apply a small correction to the value stored in the link register so that it points directly at the instruction which follows the branch

ARM Implementation
Ø Datapath
§ RTL (Register Transfer Level)
Ø Control unit
§ FSM (Finite State Machine)

2-phase non-overlapping clock scheme
Ø Most ARMs do not operate on edge-sensitive registers
Ø Instead the design is based around 2-phase non-overlapping clocks which are generated internally from a single clock signal
Ø Data movement is controlled by passing the data alternately through latches which are open during phase 1 and latches which are open during phase 2

[Figure: one clock cycle divided into phase 1 and phase 2, with a gap between the two phases]

ARM datapath timing
Ø Register read
§ Register read buses – dynamic, precharged during phase 2
§ During phase 1 selected registers discharge the read buses which become valid early in phase 1
Ø Shift operation
§ second operand passes through barrel shifter
Ø ALU operation
§ ALU has input latches which are open in phase 1, allowing the operands to begin combining in the ALU as soon as they are valid, but they close at the end of phase 1 so that the phase 2 precharge does not get through to the ALU
§ ALU processes the operands during phase 2, producing the valid output towards the end of the phase
§ the result is latched in the destination register at the end of phase 2

ARM datapath timing (cont’d)
[Figure: timing diagram over phase 1 and phase 2 showing register read time, read bus valid, shift time, shift out valid, ALU operands latched, ALU time, ALU out, register write time, and the point where precharge invalidates the buses]
Minimum Datapath Delay = Register read time + Shifter Delay + ALU Delay + Register write set-up time + Phase 2 to phase 1 non-overlap time

The original ARM1 ripple-carry adder
Ø Carry logic: use CMOS AOI (And-Or-Invert) gate
Ø Even bits use the circuit shown below
Ø Odd bits use the dual circuit with inverted inputs and outputs and AND and OR gates swapped around
Ø Worst case path: 32 gates long

[Figure: one-bit adder circuit with inputs A, B, Cin and outputs sum, Cout]

ARM2 4-bit carry look-ahead scheme
Ø Carry Generate (G), Carry Propagate (P)
Ø Cout[3] = Cin[0].P + G
Ø Use AOI and alternate AND/OR gates
Ø Worst case: 8 gates long

[Figure: 4-bit adder logic with inputs A[3:0], B[3:0], Cin[0], outputs sum[3:0] and group signals P and G feeding the Cout[3] logic]
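
The relation Cout[3] = Cin[0].P + G can be verified exhaustively for a 4-bit group. This is a minimal Python sketch; it takes the per-bit propagate as A OR B and the per-bit generate as A AND B, which is one common definition and an assumption here, and the function names are illustrative.

```python
def group_propagate_generate(a, b):
    """Group propagate P and generate G for 4-bit operands a and b."""
    propagate = 1
    generate = 0
    for i in range(4):
        p_i = ((a >> i) | (b >> i)) & 1      # bit i passes a carry along
        g_i = ((a >> i) & (b >> i)) & 1      # bit i creates a carry itself
        generate = g_i | (p_i & generate)    # carry out of bits 0..i when Cin = 0
        propagate &= p_i                     # every bit must propagate
    return propagate, generate


def lookahead_carry_out(a, b, carry_in):
    """Cout[3] = Cin[0].P + G, without rippling through the sum bits."""
    p, g = group_propagate_generate(a, b)
    return (carry_in & p) | g


if __name__ == "__main__":
    print(lookahead_carry_out(0b1111, 0b0000, 1))
```

Because P and G depend only on A and B, the carry into the next 4-bit group is ready one gate stage after Cin[0] arrives, which is how the worst-case path shrinks from 32 gates to 8.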
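
The ARM2 ALU logic for one result bit
Ø ALU functions
§ data operations (add, sub, ...)
§ address computations for memory accesses
§ branch target computations
§ bit-wise logical operations
§ ...

[Figure: logic for one ALU result bit, with inputs NA bus and NB bus, function select lines fs: 5 4 3 2 1 0, carry logic producing G and P, and output to the ALU bus]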

ARM2 ALU function codes

fs5 fs4 fs3 fs2 fs1 fs0   ALU output
 0   0   0   1   0   0    A and B
 0   0   1   0   0   0    A and not B
 0   0   1   0   0   1    A xor B
 0   1   1   0   0   1    A plus not B plus carry
 0   1   0   1   1   0    A plus B plus carry
 1   1   0   1   1   0    not A plus B plus carry
 0   0   0   0   0   0    A
 0   0   0   0   0   1    A or B
 0   0   0   1   0   1    B
 0   0   1   0   1   0    not B
 0   0   1   1   0   0    zero

The ARM6 carry-select adder scheme
Ø Compute sums of various fields of the word for carry-in of zero and carry-in of one
Ø Final result is selected by using the correct carry-in value to control a multiplexor
Ø Worst case: O(log2[word width]) gates long

[Figure: fields a,b[3:0] up to a,b[31:28] feed adders producing s and s+1 (+, +1); multiplexors controlled by the carry c select sum[3:0], sum[7:4], sum[15:8] and sum[31:16]]
Note: Be careful! Fan-out on some of these gates is high so direct comparison with previous schemes is not applicable.
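
The selection idea can be sketched in software. This is a minimal Python model, not a gate-level description: each field's sum is computed for both carry-in values and the real carry then picks one, as a multiplexor would. The field boundaries follow the sum[3:0], sum[7:4], sum[15:8], sum[31:16] labels from the figure; `carry_select_add` is an illustrative name.

```python
# (low bit, width) of each field, following the figure's sum labels.
FIELDS = ((0, 4), (4, 4), (8, 8), (16, 16))


def carry_select_add(a, b, carry_in=0):
    """32-bit add where every field precomputes s and s+1.
    Returns (sum, carry_out)."""
    result = 0
    carry = carry_in
    for low, width in FIELDS:
        mask = (1 << width) - 1
        x = (a >> low) & mask
        y = (b >> low) & mask
        sum_if_zero = x + y          # computed for carry-in = 0
        sum_if_one = x + y + 1       # computed for carry-in = 1
        chosen = sum_if_one if carry else sum_if_zero   # the multiplexor
        result |= (chosen & mask) << low
        carry = chosen >> width      # carry into the next field
    return result, carry


if __name__ == "__main__":
    print([hex(v) for v in carry_select_add(0x12345678, 0x0FEDCBA8)])
```

In hardware the two candidate sums of all fields are produced at the same time, so only the multiplexor chain lies on the critical path.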

The ARM6 ALU organization
Ø Not easy to merge the arithmetic and logic functions => a separate logic unit runs in parallel with the adder, and a multiplexor selects the output

[Figure: A operand latch and B operand latch feed XOR gates (invert A, invert B), then the adder (with C in) and the logic functions unit in parallel; the result mux (logic/arithmetic, function) produces the result; zero detect and the adder produce the N, Z, C, V flags]
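
ARM9 carry arbitration encoding
Ø Carry arbitration adder
§ v_i = a_i + b_i
§ w_i = a_i · b_i

One-bit carry arbitration (u: carry not yet known)
a_i  b_i   C_i   v_i, w_i
 0    0     0     0, 0
 1    1     1     1, 1
 0    1     u     1, 0
 1    0     u     1, 0

Two-bit carry arbitration
a_i    b_i    a_i-1  b_i-1   C_i   v_i, w_i
 0      0      -      -       0     0, 0
 1      1      -      -       1     1, 1
 1(0)   0(1)   0      0       0     0, 0
 1(0)   0(1)   1      1       1     1, 1
 1(0)   0(1)   1(0)   0(1)    u     1, 0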

The cross-bar switch barrel shifter
Ø Shifter delay is critical since it contributes directly to the datapath cycle time
Ø Cross-bar switch matrix (32 x 32)
Ø Principle for 4x4 matrix

[Figure: 4x4 switch matrix connecting in[0]–in[3] to out[0]–out[3]; each diagonal of switches implements one shift: no shift, right 1, right 2, right 3, left 1, left 2, left 3]

The cross-bar switch barrel shifter (cont’d)
Ø Precharged logic is used => each switch is a single NMOS transistor
Ø Precharging sets all outputs to logic 0, so those which are not connected to any input during switching remain at 0, giving the zero filling required by the shift semantics
Ø For rotate right, the right shift diagonal is enabled + the complementary shift left diagonal (e.g., ‘right 1’ + ‘left 3’)
Ø Arithmetic shift right: use sign-extension => separate logic is used to decode the shift amount and discharge those outputs appropriately
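
The diagonal principle can be modelled for the 4x4 case. This is a minimal Python sketch, not a circuit description: outputs start at 0 as after precharge, each enabled diagonal connects in[i] to one out[j], and a rotate enables two complementary diagonals. The `crossbar` function and the `("right", k)` / `("left", k)` encoding of diagonals are illustrative.

```python
def crossbar(inputs, enabled):
    """inputs: list of bits, index = bit position.
    enabled: iterable of diagonals such as ("right", 1) or ("left", 3)."""
    n = len(inputs)
    outputs = [0] * n                      # precharged value: logic 0
    for direction, amount in enabled:
        for i in range(n):
            j = i - amount if direction == "right" else i + amount
            if 0 <= j < n and inputs[i]:   # switch on this diagonal conducts
                outputs[j] = 1
    return outputs


if __name__ == "__main__":
    word = [1, 0, 1, 1]                    # bit 0 first, i.e. binary 1101
    print(crossbar(word, [("right", 1)]))                  # logical shift right 1
    print(crossbar(word, [("right", 1), ("left", 3)]))     # rotate right 1
```

Outputs that no enabled switch reaches keep their precharged 0, which gives the zero fill for logical shifts without any extra logic.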

Multiplier design
Ø All ARMs apart from the first prototype have included support for integer multiplication
§ older ARM cores include low-cost multiplication hardware that supports only the 32-bit result multiply and multiply-accumulate
§ recent ARM cores have high-performance multiplication hardware and support 64-bit result multiply and multiply-accumulate
Ø Low cost implementation
§ Use the datapath iteratively, employing the barrel shifter and ALU to generate a 2-bit product in each clock cycle
§ use early termination to stop the iterations when there are no more ones in the multiply register

The 2-bit multiplication algorithm, Nth cycle
Ø Control settings for the Nth cycle of the multiplication
Ø Use existing shifter and ALU + additional hardware
§ dedicated two-bits-per-cycle shift register for the multiplier and a few gates for the Booth’s algorithm control logic (overhead is a few per cent on the area of the ARM core)

Carry-in   Multiplier   Shift           ALU     Carry-out
   0          x0        LSL #2N         A + 0      0
              x1        LSL #2N         A + B      0
              x2        LSL #(2N + 1)   A – B      1
              x3        LSL #2N         A – B      1
   1          x0        LSL #2N         A + B      0
              x1        LSL #(2N + 1)   A + B      0
              x2        LSL #2N         A – B      1
              x3        LSL #2N         A + 0      1
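
The control table can be executed directly. This is a minimal Python sketch that transcribes the eight table rows into a lookup and iterates over the multiplier two bits at a time, with early termination once no multiplier bits or carry remain; the 32-bit truncation and the names `CONTROL` and `booth_multiply` are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF

# (carry_in, multiplier bits) -> (ALU sign, extra shift, carry_out)
# sign +1 means A + B, -1 means A - B, 0 means A + 0.
CONTROL = {
    (0, 0): (0, 0, 0),
    (0, 1): (+1, 0, 0),
    (0, 2): (-1, 1, 1),
    (0, 3): (-1, 0, 1),
    (1, 0): (+1, 0, 0),
    (1, 1): (+1, 1, 0),
    (1, 2): (-1, 0, 1),
    (1, 3): (0, 0, 1),
}


def booth_multiply(multiplicand, multiplier):
    """Return (low 32 bits of the product, number of cycles used)."""
    multiplier &= MASK32
    accumulator = 0          # the A input of the ALU
    carry = 0
    cycle = 0
    # Early termination: stop when no ones remain and no carry is pending.
    while cycle < 16 and ((multiplier >> (2 * cycle)) != 0 or carry):
        bits = (multiplier >> (2 * cycle)) & 3
        sign, extra_shift, carry = CONTROL[(carry, bits)]
        shifted = multiplicand << (2 * cycle + extra_shift)   # B after LSL
        accumulator = (accumulator + sign * shifted) & MASK32
        cycle += 1
    return accumulator, cycle


if __name__ == "__main__":
    print(booth_multiply(7, 35))
```

A carry left over after the sixteenth cycle would correspond to a multiple of 2^32, which does not affect a 32-bit result.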

High speed multiplication
Ø Where multiplication performance is very important, more hardware resources must be dedicated
§ in some embedded systems the ARM core is used to perform real-time digital signal processing (DSP) – DSP programs are typically multiplication intensive
Ø Use intermediate results which include partial sums and partial carries
§ Carry-save adders are used for this
Ø These two binary results are added together at the end of multiplication
§ The main ALU is used for this

Carry-propagate (a) and carry-save (b) adder structures
Ø Carry propagate adder takes two conventional (irredundant) binary numbers as inputs and produces a binary sum
Ø Carry save adder takes one binary and one redundant (partial sum and partial carry) input and produces a sum in redundant binary representation (sum and carry)

[Figure: (a) four full adders (inputs A, B, Cin; outputs Cout, S) with each Cout chained to the next Cin; (b) four full adders with independent Cin inputs and separate Cout and S outputs]
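
The redundant sum-and-carry representation can be illustrated on whole words. This is a minimal Python sketch in which every bit position acts as an independent full adder and the single carry-propagate addition happens only at the end; the function names and the 32-bit width are illustrative assumptions.

```python
MASK32 = 0xFFFFFFFF


def carry_save_add(partial_sum, partial_carry, addend):
    """One layer of full adders: three inputs in, (sum, carry) out.
    No carry ripples between bit positions."""
    new_sum = partial_sum ^ partial_carry ^ addend
    majority = ((partial_sum & partial_carry)
                | (partial_sum & addend)
                | (partial_carry & addend))
    return new_sum & MASK32, (majority << 1) & MASK32


def sum_with_carry_save(values):
    """Accumulate values in redundant form, then resolve once."""
    partial_sum, partial_carry = 0, 0
    for value in values:
        partial_sum, partial_carry = carry_save_add(
            partial_sum, partial_carry, value & MASK32)
    return (partial_sum + partial_carry) & MASK32   # final carry-propagate add


if __name__ == "__main__":
    print(sum_with_carry_save([1, 2, 3, 4]))
```

Each carry-save layer has the delay of one full adder regardless of word width, which is why it suits the repeated additions inside a multiplier.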

ARM high-speed multiplier organization
Ø CSA has 4 layers of adders each handling 2 multiplier bits => multiplies 8 bits per clock cycle
Ø Partial sum and carry are cleared at the beginning or initialized to accumulate a value
Ø Multiplier is shifted right 8 bits per cycle in the ‘Rs’ register
Ø Partial sum and carry are rotated right 8 bits per cycle
Ø Performance: up to 4 clock cycles (early termination is possible)
Ø Complexity: 160 bits in shift registers, 128 bits of carry-save adder logic (up to 10% of simpler cores)

ARM high-speed multiplier organization
[Figure: Rs shifted right 8 bits/cycle and Rm feed the carry-save adders; partial sum and partial carry registers (with initialization for MLA) are rotated 8 bits/cycle; the ALU adds the partials at the end]

ARM2 register cell circuit
[Figure: register cell connected to the ALU bus through a write line and to the A bus and B bus through the read A and read B lines]

ARM register bank floorplan
[Figure: floorplan with A bus read decoders, B bus read decoders and write decoders alongside the register cells and the PC; supply rails Vdd and Vss; buses: ALU bus, PC bus, INC bus, A bus, B bus]

ARM core datapath buses
[Figure: blocks: address register, incrementer, register bank, multiplier, shifter, ALU, data in, data out, instruction pipe; buses and signals: A, B, W, instruction, Din, shift out, PC, Ad, inc]

ARM control logic structure
[Figure: the instruction and coprocessor signals feed the decode PLA and cycle count logic, which drive the multiply control, load/store multiple, address control, register control, ALU control and shifter control blocks]