Posted on 16-Dec-2015
Transmeta’s Crusoe Architecture
Umran A. Khan
Microprocessors
Generations of Crusoe’s Processors
Original architecture: TM3120, TM5400
Later versions: TM5600-TM5800
The architecture is mostly the same, but improved:
Faster clock rate (up to 800 MHz now)
Smaller core/die size (0.13-micron process)
Has special instructions for the OS it is emulating
Lower power consumption
Wider range of applications (from internet appliances to high-density servers)
We will look at the TM5400 here
Instruction Set
Uses a VLIW (Very Long Instruction Word) instruction format/engine
Each instruction word, called a molecule, is a 64- or 128-bit packet
A molecule holds up to four individual operations called atoms, each issued to its own execution unit
The atoms in a molecule execute in parallel (up to 4 operations per clock)
These operations must be independent of one another
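To make the molecule/atom layout concrete, here is a minimal Python sketch that packs atom words into a 64- or 128-bit molecule and unpacks them again. It assumes 32-bit atoms (which is what four atoms per 128 bits implies) and puts atom 0 in the low bits; the slot ordering and the atom values are hypothetical, not Transmeta's actual encoding.

```python
ATOM_BITS = 32                      # assumed: 128-bit molecule / 4 atoms
ATOM_MASK = (1 << ATOM_BITS) - 1

def pack_molecule(atoms: list[int]) -> tuple[int, int]:
    """Pack 2 or 4 atom words into one molecule; return (molecule, width_in_bits)."""
    if len(atoms) not in (2, 4):
        raise ValueError("this sketch supports 2 atoms (64-bit) or 4 atoms (128-bit)")
    molecule = 0
    for slot, atom in enumerate(atoms):
        if not 0 <= atom <= ATOM_MASK:
            raise ValueError(f"atom {slot} does not fit in {ATOM_BITS} bits")
        molecule |= atom << (slot * ATOM_BITS)   # slot 0 occupies the low bits
    return molecule, len(atoms) * ATOM_BITS

def unpack_molecule(molecule: int, width: int) -> list[int]:
    """Split a molecule back into its atom words, slot 0 first."""
    return [(molecule >> (slot * ATOM_BITS)) & ATOM_MASK
            for slot in range(width // ATOM_BITS)]

mol, width = pack_molecule([0x11111111, 0x22222222, 0x33333333, 0x44444444])
print(width, hex(mol))   # 128 0x44444444333333332222222211111111
```

A molecule with fewer useful operations would, in this model, be padded with no-op atoms to fill a 64- or 128-bit slot.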
Four Types of Execution Units
FPU (Floating-Point Unit): 10-stage floating-point pipeline, conventional x86 80-bit register format, 32 FP registers
2 integer ALUs (Arithmetic-Logic Units): 7-stage integer pipeline, 64 32-bit registers dedicated to them
LSU (Load/Store Unit)
Branch Unit
Sample Instruction
128-bit molecule (one atom per execution unit):
Atom: FADD | ADD            | LD               | BRCC
Unit: FPU  | Integer ALU #0 | LSU (Load/Store) | BU (Branch)
Figure copied from reference #1
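The sample molecule sends each atom to a different functional unit. The sketch below checks whether a molecule's atoms fit the TM5400 units listed above (one FPU, two integer ALUs, one LSU, one branch unit) in a single clock. The opcode-to-unit table is a hypothetical illustration that covers only opcodes appearing in these slides.

```python
from collections import Counter

# Units available per molecule, from the execution-unit list (TM5400).
UNIT_CAPACITY = {"FPU": 1, "ALU": 2, "LSU": 1, "BU": 1}

# Hypothetical opcode -> unit mapping, limited to opcodes used in these slides.
OPCODE_UNIT = {"fadd": "FPU", "add": "ALU", "add.c": "ALU", "sub.c": "ALU",
               "ld": "LSU", "brcc": "BU"}

def fits_units(opcodes: list[str]) -> bool:
    """True if at most four atoms and every atom gets its own unit this clock.

    Raises KeyError for opcodes missing from the illustrative table.
    """
    if len(opcodes) > 4:
        return False
    demand = Counter(OPCODE_UNIT[op.lower()] for op in opcodes)
    return all(count <= UNIT_CAPACITY[unit] for unit, count in demand.items())

print(fits_units(["FADD", "ADD", "LD", "BRCC"]))   # sample molecule: True
print(fits_units(["LD", "LD"]))                    # two loads, one LSU: False
```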
Introduction to Code Morphing
Code Morphing Software is a clever software translation layer that dynamically recompiles an x86 program into the processor’s native VLIW instruction format
It is stored in the BIOS ROM and runs in main memory
An entire group of instructions is translated at once and then put into the translation cache
Basically, an emulation mechanism
It can be used for targets other than x86, such as Linux (TM3120) and Alpha’s FX!32, but the TM5400’s Code Morphing is known for its x86 compatibility
Great potential!
Crusoe Translation Layers (top to bottom)
x86 Applications and Operating System
x86 BIOS
Code Morphing Layer
CPU Core
Traditional x86 Architecture
IA-32 instructions are translated by the CPU, one instruction at a time, into simpler, more uniform RISC-like operations
This translation is elaborate and complicated
It has dedicated hardware for:
x86 instruction translation
Branch prediction
Register renaming
Instruction reordering
Transmeta’s Simplified Core
A lot of the processor’s functionality is implemented in software
Its hardware is made up of the execution units, the instruction decode unit and, of course, the caches
The rest of the dedicated hardware (listed on the previous slide) is done in software
Advantages:
The CPU takes less die space
It is less power-hungry
It is less expensive to produce and upgrade
Hardware vs. Software
Implementing hardware functions in software comes with a cost
Software is slower than hardware, but how much slower?
It is not so easy to say: reordering instructions, renaming registers, predicting branches on the fly, etc. on the same hardware used for addition and instruction execution adds complications
Do the benefits outweigh the costs? According to Transmeta, they do!
Execution, Decoding and Scheduling
In x86, instructions are translated individually
An instruction’s binary is fetched and decoded into n operations
These operations are reordered and fed to the execution units (e.g. FPU, ALU) in parallel
The results of this out-of-order execution must then be put back into program order, and the same instructions are decoded again every time they run (complicated and costly)
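Putting results back in program order is the job of a reorder buffer in a conventional out-of-order x86 core. The minimal Python sketch below models only that bookkeeping, identifying operations by their program-order index; real hardware also tracks results, exceptions and register state, which this illustration ignores.

```python
def retire_in_order(completion_order: list[int]) -> list[list[int]]:
    """Simulate in-order retirement of operations that finish out of order.

    completion_order lists operation indices (0 = oldest in program order) in
    the order the execution units finish them. For each completion, return
    the batch of operations that may now retire, strictly in program order.
    """
    finished: set[int] = set()
    next_to_retire = 0                  # oldest operation not yet retired
    batches: list[list[int]] = []
    for op in completion_order:
        finished.add(op)
        batch = []
        while next_to_retire in finished:   # retire only from the head
            batch.append(next_to_retire)
            next_to_retire += 1
        batches.append(batch)
    return batches

# Four operations finish in the order 2, 0, 3, 1.
print(retire_in_order([2, 0, 3, 1]))   # [[], [0], [], [1, 2, 3]]
```

Operation 2 finishes first but cannot retire until operations 0 and 1 have, which is exactly the reconstruction work the slide calls complicated and costly.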
Execution, Decoding and Scheduling (Continued)
In Crusoe, a group of instructions is translated at once
Instructions are translated once and placed in the translation cache
If the same code runs again, the processor can fetch it from the translation cache
The scheduler can reorder instructions by looking at the generated code
Thus, the number of instructions executed can be minimized
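The translate-once-and-reuse behavior can be sketched as a cache keyed by x86 block address. This is a toy model: `fake_translate` is a hypothetical stand-in for the real translator, and the counter shows how rarely translation is paid for compared with decoding on every execution.

```python
from typing import Callable, Dict, List

class TranslationCache:
    """Toy translation cache: x86 block address -> translated native block."""

    def __init__(self, translate_block: Callable[[int], List[str]]):
        self._translate = translate_block      # hypothetical translator hook
        self._cache: Dict[int, List[str]] = {}
        self.translations = 0                  # how many times translation ran

    def lookup(self, x86_addr: int) -> List[str]:
        block = self._cache.get(x86_addr)
        if block is None:                      # miss: translate the whole group once
            block = self._translate(x86_addr)
            self._cache[x86_addr] = block
            self.translations += 1
        return block                           # hit: reuse the earlier translation

def fake_translate(addr: int) -> List[str]:
    # Placeholder output; a real translator would emit VLIW molecules.
    return [f"molecule@{addr:#x}"]

tc = TranslationCache(fake_translate)
for _ in range(1000):          # run the same loop body 1000 times
    tc.lookup(0x401000)
print(tc.translations)         # 1
```

A decoder that works one instruction at a time, as in the x86 flow, would repeat its decoding work on all 1000 iterations.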
Caching and Optimization
The translation cache is used more efficiently
A translation is optimized further as it is executed again; it will probably take more than one pass to be truly optimized
Optimization is done in steps
Sections of code that occur only once usually don’t get optimized
Code is recompiled quickly to keep the processor and program running
Uses common optimizations done by an ordinary compiler
The optimizer is basically a simple compiler
Optimization Strategies
The Code Morphing software has many ways to gather feedback about a running program
“Instrument Translation”: special code collects information about the block that is going to be executed
This information is later used for optimization and translation
Branch prediction, path speculation and the reordering of loads and stores are done by the Code Morphing layer, with some hardware support (alias hardware) and some condition code
Filtering: determines how much effort to spend on translating and optimizing a piece of code
Execution modes: interpretation, translation without optimization, or translation with optimization
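Filtering and the three execution modes can be pictured as a per-block counter that promotes hot code to more expensive treatment. The thresholds `TRANSLATE_AFTER` and `OPTIMIZE_AFTER` are invented for illustration; the slides do not give Transmeta’s actual values or policy.

```python
from collections import defaultdict

TRANSLATE_AFTER = 50    # hypothetical: interpret a block this many times first
OPTIMIZE_AFTER = 500    # hypothetical: optimize once a block is this hot

exec_count: defaultdict[int, int] = defaultdict(int)   # block address -> executions

def choose_mode(block_addr: int) -> str:
    """Record one execution of a block and pick how to run it this time."""
    exec_count[block_addr] += 1
    n = exec_count[block_addr]
    if n <= TRANSLATE_AFTER:
        return "interpret"              # rare code: not worth translating
    if n <= OPTIMIZE_AFTER:
        return "translate"              # warm code: quick, unoptimized translation
    return "translate+optimize"         # hot code: worth spending optimization effort

modes = [choose_mode(0x401000) for _ in range(600)]
print(modes[0], modes[50], modes[599])   # interpret translate translate+optimize
```

Code that runs only once never leaves the interpreter in this model, matching the point that one-off sections usually don’t get optimized. A real system would also cache the translations rather than re-deciding on every execution.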
Translation Example
x86 code:
addl %eax, (%esp)
addl %ebx, (%esp)
movl %esi, (%ebp)
subl %ecx, 5
FRONTEND:
ld %r30, [%esp]
add.c %eax, %eax, %r30
ld %r31, [%esp]
add.c %ebx, %ebx, %r31
ld %esi, [%ebp]
sub.c %ecx, %ecx, 5
OPTIMIZER (the second load from [%esp] reuses %r30, and the adds no longer set condition codes because sub.c overwrites them before they are read):
ld %r30, [%esp]
add %eax, %eax, %r30
add %ebx, %ebx, %r30
ld %esi, [%ebp]
sub.c %ecx, %ecx, 5
SCHEDULER (each line is one molecule):
ld %r30, [%esp]; sub.c %ecx, %ecx, 5
ld %esi, [%ebp]; add %eax, %eax, %r30; add %ebx, %ebx, %r30
KEY
ld: load; movl: load
addl: load and add; add.c: add with condition codes set
subl: subtract; sub.c: subtract with condition codes set
Example from reference #2
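The FRONTEND-to-OPTIMIZER step can be reproduced with two textbook passes: redundant load elimination (the second load from `[%esp]` reuses `%r30`) and dead condition-code elimination (the flags set by the adds are overwritten by `sub.c` before anything reads them). This Python sketch illustrates those standard techniques; it is not Transmeta’s optimizer, and the tuple atom format, the `FLAG_READERS` set and the `mov` copy atom are hypothetical.

```python
from typing import Dict, List, Tuple

Atom = Tuple[str, ...]   # (opcode, destination, source operands...)
FLAG_READERS = {"brcc"}  # hypothetical: atoms that consume condition codes

def eliminate_redundant_loads(atoms: List[Atom]) -> List[Atom]:
    """Reuse an earlier load of the same address instead of loading again.

    Assumes straight-line code with no stores (true for this example); a real
    pass must also invalidate on any store that might alias the address.
    """
    available: Dict[str, str] = {}   # address expression -> register holding its value
    rename: Dict[str, str] = {}      # dropped load's destination -> register to use
    out: List[Atom] = []
    for op, dest, *srcs in atoms:
        srcs = [rename.get(s, s) for s in srcs]
        if op == "ld" and srcs[0] in available:
            rename[dest] = available[srcs[0]]   # drop the load, redirect later uses
            continue
        rename.pop(dest, None)                  # dest holds a fresh value from here on
        for alias in [k for k, v in rename.items() if v == dest]:
            out.append(("mov", alias, dest))    # save the old value before dest is clobbered
            del rename[alias]
        # Forget loads whose value register or address register is overwritten.
        available = {addr: reg for addr, reg in available.items()
                     if reg != dest and dest not in addr}
        if op == "ld" and dest not in srcs[0]:
            available[srcs[0]] = dest
        out.append((op, dest, *srcs))
    return out

def drop_dead_flag_updates(atoms: List[Atom]) -> List[Atom]:
    """Turn 'op.c' into 'op' when later atoms overwrite the flags before any read."""
    out: List[Atom] = []
    flags_live = True                  # flags must be correct when the block exits
    for op, *operands in reversed(atoms):
        reads_flags = op in FLAG_READERS
        if op.endswith(".c"):
            if not flags_live:
                op = op[:-2]           # nobody reads these flags: use the plain opcode
            flags_live = False         # earlier flag results are overwritten here
        if reads_flags:
            flags_live = True
        out.append((op, *operands))
    out.reverse()
    return out

frontend = [
    ("ld", "%r30", "[%esp]"),
    ("add.c", "%eax", "%eax", "%r30"),
    ("ld", "%r31", "[%esp]"),
    ("add.c", "%ebx", "%ebx", "%r31"),
    ("ld", "%esi", "[%ebp]"),
    ("sub.c", "%ecx", "%ecx", "5"),
]
optimized = drop_dead_flag_updates(eliminate_redundant_loads(frontend))
for op, dest, *srcs in optimized:
    print(f"{op} {dest}, {', '.join(srcs)}")   # prints the OPTIMIZER sequence
```

Packing the optimized atoms into molecules (the SCHEDULER step) is not modeled here.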
Power Management
Typical power-saving approaches:
Switching off the processor or duty cycling it (causes glitches)
Changing the clock rate by suspending to and resuming from RAM
Crusoe power-saving approaches:
LongRun power management (next slide)
Integrating the chipset’s north bridge and the RAM controllers onto the CPU core
Video and sound could also be integrated
Saves power in the overall system
Longrun Power Management
A feature of the Code Morphing software layer, driven by the CPU load it detects
Can adjust the clock frequency on the fly
Can dynamically change the CPU voltage
Can reduce power consumption by about 30% by lowering the CPU clock rate by 10%, assuming the voltage is lowered by 10% as well: with dynamic power $P \propto f V^2$, the saving is $1 - 0.9 \times 0.9^2 = 0.271 \approx 30\%$
Fewer heat problems
No need for extra fans, which take up more power and space
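The power estimate rests on the usual CMOS approximation that dynamic power scales as $P \propto f V^2$. The sketch below evaluates that ratio and picks the slowest operating point whose clock still covers the demanded load, as a LongRun-style governor might. The operating-point table is hypothetical; only the 10% and 30% figures come from the slide.

```python
# Hypothetical (frequency in MHz, core voltage in V) operating points.
OPERATING_POINTS = [(300, 1.20), (400, 1.30), (500, 1.40), (600, 1.50), (700, 1.60)]

def relative_power(f_scale: float, v_scale: float) -> float:
    """Dynamic power relative to baseline, using P proportional to f * V**2."""
    return f_scale * v_scale ** 2

def pick_operating_point(load_mhz: float) -> tuple[int, float]:
    """Slowest operating point whose clock covers the demanded work rate."""
    for freq, volts in OPERATING_POINTS:
        if freq >= load_mhz:
            return freq, volts
    return OPERATING_POINTS[-1]        # saturated: run flat out

# The slide's case: 10% lower clock and (assumed) 10% lower voltage.
print(f"{1 - relative_power(0.9, 0.9):.1%} saving")   # 27.1% saving

f, v = pick_operating_point(350)
top_f, top_v = OPERATING_POINTS[-1]
print(f, v, f"{relative_power(f / top_f, v / top_v):.2f} of full power")
```

Because voltage enters squared, most of the saving comes from lowering the voltage, which a lower clock makes possible; lowering the frequency alone would save only about 10%.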
Conclusion
Advantages
Low-power-consumption technology
Low cost
Longer battery life
Great for mobile users, embedded systems and even high-density servers
Smaller and lighter computers
Code Morphing technology
Can, in principle, emulate other target architectures
Compatibility
Uses special optimization techniques for the target operating systems
Easier software debugging (see reference #1)
Cheaper and simpler upgrades
Conclusion (Continued)
Disadvantages
An emulation cannot be faster than the real thing
Code translation requires extra cycles
Code Morphing software runs in main memory and takes up memory bandwidth
Heavy coding
Inherits some of the same problems as other VLIW processors:
Needs clever compilers for parallelism
Too much fix-up code (for speculation, predictions, rollbacks, etc.)
The technology seems to be geared toward mobile users
For desktops (power users) and servers, performance outweighs power consumption
For mobile users, performance is measured against power consumption
Final Thoughts
Transmeta reported net revenue of only $4.1 million for the first quarter of 2002
No significant share of the mobile market
Even though Transmeta has clever technology, AMD’s and Intel’s higher clock speeds have overshadowed its impact, just like Multiflow (their clock speeds are about 1.0 GHz faster than the Crusoe’s)
AMD and Intel have also developed their own power-efficient mobile processors (mobile Athlon XP with AMD PowerNow!™ technology and mobile Pentium 4 with Intel® SpeedStep® technology)
Stay tuned for the next exciting episode
VS.
“AMD, I am your father!” “Not any more!!!”
References
[1] http://www.hardwareanalysis.com/content/editorials/article/1237.4/
[2] http://www.transmeta.com/pdf/white_papers/paper_aklaiber_19jan00.pdf
[3] http://www.arstechnica.com/cpu/1q00/crusoe/crusoe-1.html
[4] http://www.erc.msstate.edu/~reese/EE8063/html/transmeta/transmeta.pdf