Post on 18-Feb-2017
AN OVERVIEW OF ARM FAMILIES
ARM7TDMI ARM9TDMI ARM10 ARM11
ARM ARCHITECTURE FAMILIES
ARM's architecture is compatible with all four major platform operating systems: Symbian OS, Palm OS, Windows CE, and Linux.
ARM is the industry-standard embedded microprocessor architecture, and is a leader in low-power, high-performance cores.
The ARM7 and ARM9 families have contributed to ARM's success. Each core family has several "children" that incorporate many different value-added features and combinations. Essentially, there are four main families available now for license: ARM7, ARM9, ARM10, and ARM11.
Some words about ARM
ARM7
• The ARM7 family features hardened and synthesizable macrocells with variants that incorporate cache with either a memory protection unit (MPU) or memory management unit (MMU).
• Other features include real-time debug (RTD) and real-time trace (RTT) technology.
ARM9
The ARM9 family consists of hardened macrocells with variants also including cache with an MPU or MMU, as well as the RTD and the RTT. Although the ARM9E-S family was released under a different architecture version, ARMv5TE, the fundamental design of the core is based on the ARM9TDMI family. The "E" identifies that the family is a DSP-enhanced architecture and the "S" identifies that the family is synthesizable.
• Decreased heat production and lower overheating risk.
• Clock frequency improvements. Shifting from a three-stage pipeline to a five-stage one lets the clock speed be approximately doubled, on the same silicon fabrication process.
• Cycle count improvements. Many unmodified ARM7 binaries were measured as taking about 30% fewer cycles to execute on ARM9 cores. Key improvements include:
• Faster loads and stores; many instructions now cost just one cycle. This is helped by both the modified Harvard architecture (reducing bus and cache contention) and the new pipeline stages.
• Exposing pipeline interlocks, enabling compiler optimizations to reduce blockage between stages.
Difference from ARM7
Comparison of the ARM7TDMI with the ARM9TDMI families
(1) Pipeline Comparison
To increase performance, the pipeline of the ARM9TDMI core was re-engineered from the three-stage system used by the ARM7TDMI family to five stages. Operations previously performed in the execute stage of ARM7 are spread across four stages in the ARM9 pipeline: decode, execute, memory, and write. The reorganization and removal of these critical paths resulted in a much higher clock frequency.
Another performance improvement is the reduced cycles-per-instruction rating of the processor. This is due to improved load and store instruction cycle counts. Single load and store instructions are now single-cycle operations. This is an enhancement over the ARM7 operation, which used the execute stage three times: first, to calculate the address; second, to access the memory and cache; and third, to write the data to the register bank. On ARM9, each step has a separate pipeline stage requiring only one cycle, avoiding pipeline stalls.
The Harvard bus architecture creates separate instruction and data memory interfaces, enabling simultaneous access to instructions and data. The ARM9TDMI represents a new family of CPU technology. The enhancements made to this core family double the performance of the ARM7TDMI family. The ARM7TDMI family is popular with applications where small die size, high performance, and low power consumption help reduce system costs, especially when the system does not require cache. The ARM9TDMI family is used for high-performance applications that previously could not be implemented at the same cost.
[Figure: Harvard architecture — processor connected to memory through separate program and data buses (address, data, read/write signals); example memory map from address 0 to 4294967295 showing Windows XP, Microsoft Word, Adobe Photoshop, a Word document, and an image being edited in Photoshop resident in memory.]
ARM10
ARM10E implements:
• Harvard 6-stage pipeline
• Supports the v5TE instruction set
• EmbeddedICE RT-II debug logic
• Fully compatible with the v4T architecture
• 390-700 MIPS integer performance based on Dhrystone 2.1
• Branch prediction: eliminates 70% of branches on typical code sequences
• Separate load/store unit: 64-bit path to the register bank, loading two registers simultaneously
• Hit-under-miss caches: significantly reduce pipeline stalls
• Write buffer: holds up to 8 double-words (16 register values)
• New energy-saving power-down modes
The pipeline was widened to add an additional stage, and improvements were made to the EmbeddedICE logic to provide support for real-time debug. All the while, compatibility was maintained with ARMv5TE and v4T for ease of code migration. Performance enhancements include the introduction of branch prediction, hit-under-miss support in the MMU and cache architecture, an improved write buffer that holds up to eight double-words, and a separate load and store unit. These features improve code performance by lowering the average number of cycles per instruction of the processor, and also help when code is heavily dependent on cache operations.
• It also supports an optional Vector Floating Point (VFP) unit. The VFP significantly increases floating-point performance.
• VFP (Vector Floating Point) technology is an FPU (Floating-Point Unit) coprocessor extension to the ARM architecture
• It provides low-cost single-precision and double-precision floating-point computation
• VFP provides floating-point computation suitable for a wide spectrum of applications such as PDAs, smartphones, voice compression and decompression, three-dimensional graphics and digital audio, printers, set-top boxes, and automotive applications.
• The VFP architecture was intended to support execution of short "vector mode" instructions, but these operated on each vector element sequentially.
ARM11
• ARM11 is designed for high-performance and power-efficient applications.
• ARM1136J-S was the first processor implementation to execute architecture ARMv6 instructions.
• Incorporates an 8-stage pipeline with separate load/store and arithmetic pipelines.
DIFFERENCE FROM ARM9
• SIMD instructions, which can double MPEG-4 and audio digital signal processing algorithm speed.
• Cache is physically addressed, solving many cache aliasing problems and reducing context switch overhead.
• Unaligned and mixed-endian data access is supported.
• Reduced heat production and lower overheating risk.
• Redesigned pipeline, supporting faster clock speeds (target up to 1 GHz):
• Longer: 8 stages (vs 5)
• Out-of-order completion for some operations (e.g. stores)
• Dynamic branch prediction/folding (like XScale)
• Cache misses don't block execution of non-dependent instructions
• Load/store parallelism, ALU parallelism
• 64-bit data paths