Post on 12-Feb-2022
Towards a Java-enabled 2Mbps wireless handheld device
John Glossner1, Michael Schulte2, and
Stamatis Vassiliadis3
1Sandbridge Technologies, White Plains, NY2Lehigh University, Bethlehem, PA
3Delft University of Technology, Delft, Netherlands
glossner@SandbridgeTech.com
IntroductionMarket
DSPWireless
Application DomainBroadband CommunicationsPerformance RequirementsCompute + Control
Low Power TechniquesSaturating Multi-operand Arithmetic
ProgrammabilityDSP Compilation
Java executionHardware vs. Software
Programmable DSP Market
CAGR 34.4%
Growing faster than the general semiconductor market
Communications64%
Computer13%
Consumer10%
Industrial6%
Military2% Office Automation
1%
Instrumentation4%
4.4 $6.1$8.2
$10.9
$14.5
$19.2
$25.4
0
5
10
15
20
25
30
($B)
1999 2000 2001 2002 2003 2004 2005
General Purpose DSP Market
Source: Forward Concepts
Wireless Market
0
5
10
15
20
25
Bill
ions
95 96 97 98 99 2000 2001 2002 2003
AnalogGSMIS-95IS-136PDC3G
Source: Micrologic Research / Forward Concepts Worldwide
Wireless Handset Market416M units shipped in 2000
1B per year estimated by 2005!
70% using traditional DSPsTI – 83% (242M units)MOT – 10%ADI/LU – 7%
30% using DSP core ASICs
Datarate Evolution
1.E+03
1.E+04
1.E+05
1.E+06
1.E+07
1.E+08
1.E+09
1.E+10
1.E+11
1.E+12
1.E+13
1980 1985 1990 1995 2000 2005
bits
/sec
Voice Modems Data Modems Wireless Ethernet Fiber - TDM Fiber - WDM
G.lite
2.5G
3G
4G
V.90ISDN
GSM
G.dmt
VDSL
8x2.5G
32x10G
160x10G80x10G
Performance vs. PowerDSP Performance vs. Power
(Log Log scale)
C55x
C55x
C203
C549
C5421
C5441
2181
2164
2173
21065L
16210
16291628 1609
1620
56652 56602
56307
5600256812
SC140
SC140 FR500
FR300
1M/mW
1M/mW
5M/mW
5M/mW
10M/mW
10M/mW
50M/mW
50M/mW
10
100
1000
10000
10 100 1000
mWatts
MM
AC
/s
100 GMAC/sec
Reducing PowerBatteries are not improving dramatically
Typical high-end SOC’s dissipate 1W1-10mW required for handset
Attention must be paid at every step of the design cycle
Algorithms, OS, Software, Architecture, Microarchitecture, Logic Design, Circuits, ProcessAny one particular area does not yield 100x
A 3G convergence deviceA convergence device
cell phone + pda
Java ApplicationsDynamic appletsWeb browsingEmail/calendar/PIM
Network ProcessingFirewallWeb server
Operating SystemLinux / PocketPC / PalmOS for ApplicationsRTOS for Signal Processing
DSP High-speed connectivity
Multiple reconfigurable protocols
2.5 / 3G (GPRS, CDMA-2000, WCDMA)Bluetooth802.11
Constant connectionE911GPSMPEG VideoSpeech recognition (?)
Processor Classification
Processor
GeneralPurpose
DSP
FloatingPoint
FloatingPoint
32 bitIEEE Other 32/64 bit
IEEEOther
(80 bit)
Processor Classification
Processor
GeneralPurpose
DSP
FixedPoint
FloatingPoint
16 bit 20 bit 24 bit 32 bitIEEE Other
FloatingPointInteger
32 bit +subsets
32/64 bitIEEE
64 bit +subsets
Other(80 bit)
Processor Classification
Processor
GeneralPurpose
DSP
FixedPoint
FloatingPoint
16 bit 20 bit 24 bit 32 bitIEEE Other
FloatingPointInteger
32 bit +subsets
32/64 bitIEEE
64 bit +subsets
Other(80 bit)
Numeric Representations
mantissa exponent
RadixPointSign Radix
PointSign
1
Implied mantissa(always 1)
SignSign
-21 20 2-1 2-2 2-3 2-4 2-5-20 2-1 2-2 2-3 2-4 2-5 2-6 2-7 -27 26 25 24 23 22 21 20-23 22 21 20
1 0 1 0 1 1 0 0
-20 + 2-2 + 2-4 2-5+ =
-1 + .25 + .0625 + .03125 = -.65625
1 0 1 0 1 1 0 0
-27 + 25 + 23 22+ =
-128 + 32 + 8 + 4 = -84
0 1 1 0 1 0 0 1 0 10
22 + 20 = 520 + 2 + 2 =-1 -3
1 +.5 + .125 = 1.625
1.625 x 25 = 52.0Multiplication complicates fractional representations
Source: BDTI
DSP vs. General PurposeExecution Predictability
Required to guarantee real-time constraints
1 cycle MAC
0-overhead Loop Buffer
Complex InstructionsMultiple Operations Issued
Harvard Memory ArchitectureMultiple memory access
Specialized Addressing Modes
Operate on Vector Stream Data
Data-independent Execution
Fractional Arithmetic
Pipeline Non-interlockedShallow Pipeline (3-5 stage)
Delayed Branch
Fast But Non-predictableDynamic Instruction IssueNon-deterministic caches
Multicycle MAC
Branch Prediction
RISC Superscalar InstructionsMultiple Instructions Issued
Von Neumann ArchitectureSplit Cache has similar benefit
Typically Linear Addressing
Caches Assume Locality
Data-dependent ExecutionDependent upon operands
Integer Arithmetic
Pipeline Typically InterlockedDeep Pipeline (5+ stage)
Multicycle Branch
Modern DSP ArchitecturesFocus on compilability
RISC based with Control + DSP processing
Highly Parallel
Multiple instruction issue
Multiple operation issueMACALULoad/Store
Predominately VLIWSome use of SIMD
32-bit unified address space
MotivationGSM (Global System for Mobile Communications)
Leading wireless digital technology in the world
Extensive use of saturating dot products on vectors
for (j = 0; j < n; j++) sum = L_mac(sum, x[j], y[j]);
For GSM compliance, results produced must exactly match serial resultsSaturating arithmetic operations are not associative
severely limits parallelism
Goal: Develop techniques for performing parallel saturating dot products with bit exact results
Parallel Saturating MACs
SaturatingMAC
X3 Y3
SaturatingMAC
X4 Y4X1 Y1
SaturatingMAC
SaturatingMAC
X2 Y2
Pipeline Registers
5-Input Parallel SaturatingMultioperand Adder
P1 P2 P3 P4 P5 Our approach:Parallel saturating multiply operationsm-input saturating multioperand additionA k element dot product requires 1 + k/m cycles
Pi+1= < XiYi >Z5 = <<<<P1 + P2> + P3> + P4> + P5>
Saturating AdditionWith two’s complement addition, overflow only occurs if P1 and P2 have the same sign, and the sign of their sum is different.
P1 = 0.101 = 0.625 P1 = 1.011 = -0.625+ P2 = 0.111 = 0.875 + P2 = 1.001 = -0.875= T1 = 1.100 = -0.500 = T1 = 0.100 = 0.500
Z2 = 0.111 = 0.875 Z2 = 1.000 = -1.000Overflow can be detected as
o1 = sp1 sp2 st1 + sp1 sp2 st1
If overflow occurs, the saturated result isV2 = sp2.sp2 sp2 ... sp2 sp2
Saturating 2-MAC Results16-bit designs for sat. and no-sat. units were implemented using the Synopsys Module Compiler with a 0.25 um CMOS std. cell lib.
Delay/area estimates and increases due to adding saturation logic are shown in the following table
Area Comparison
0
5000
10000
15000
20000
25000
30000
35000
40000
1 2 3 4 5 6 7 8
Number of MACs
Are
a (e
quiv
alen
t gat
es)
SerialParallel
DSP Application Complexity
1000
10000
100000
1985 1995 2005
Line
s of
C C
ode
10x Complexity every 10 years
Compiler Productivity
6-9 Months!
DesignAlgorithms
Map toFixed Point
C
Write DSPSpecific C
Write DSPAssembly
Hand ScheduleOperations on DSP
Final Product
6-9 Months!
Compiler Productivity
NEW
Compile
6-9 Months!
DesignAlgorithms
Map toFixed Point
C
Write DSPSpecific C
Write DSPAssembly
Hand ScheduleOperations on DSP
Final ProductIf floating point implemented Final Product
6-9 Months!
Compilable Architecture
Compiler
Architecture
Optimize
Cost / PowerPerformance
3G
DSL
GSM
VoIP
Implementations
Algorithms
DSP Compilation ProblemMismatch between C & DSP
16-bit fixed point40-bit accumulators with mixed type arithmeticSaturation arithmetic vs. modulo semantics
Historically...DSPs have had compiler unfriendly architectures
very complex instructionsnon-orthogonal, specialized resourcesexposed pipelines
DSP compiled performanceTypical: 1/10 speed of handwritten assemblyAssembly code is required for performance
DSP Compilation SolutionsExtensive libraries
Often more than 1000 functionsResource consuming but high reuse
C language extensions (DSP-C)Type support (Q15)Memory disambiguation
Intrinsics
Handwritten assembly code
DSP IntrinsicsIntrinsics allow programmers to use instructions a compiler can not generate
Has appearance of a function call in C Replaced with assembly statements by compilerHighly architecture dependent
Often condense 10 assembly instructions into 1
Early attempts were blockingInlined asm statement
Non-blocking pioneered by LucentWritten in the compiler’s intermediate languageSemantics of side effects well definedAllowed for further optimizationArchitecturally neutral
DSP Compilation SolutionIntrinsics work well but…
Compiler writers become DSP assembly language programmersOnly work for a specific application
DSP Solution: Semantic AnalysisType inferenceno intrinsics: out-of-the box C compilernear-parity with assembly codenovel DSP optimizationsexisting optimizations adapted for DSPspower-driven optimizations
Java & 3GJava may be required for all 3G devices
NTT DoCoMo issued statement requiring use of Java
Java may execute applications but…Is Java an efficient Signal Processing Language?Is special hardware required?
Java PropertiesObject-Oriented Programming Language
Inheritance and Polymorphism Supported
Programmer Supplied Parallelism (Threads)
Dynamically LinkedResolved C++’s fragile class problem but imposes performance constraints on class accessEntire set of objects in system not required at compile time
Strongly Typed Statically determinable type state enables simple on-the-fly translation of bytecodes into efficient machine code [Gos95]
Garbage collected
Compiled to Platform Independent Virtual Machine
Methods of Executing JavaInterpretation
1-10% performance of compiled code
JIT Compilation5-10x better than interpretation
Flash Compilation10x better than JIT
Off-line compilers10x better than interpretation found in Toba
Native CompilersShould provide near parity
Direct Execution5-50x better than JIT
ExperimentsTraditional FFT
C algorithm from [Press92]Multiple Sizes
Rewrote in JavaDoes not use Java-specific capabilities
Comparisongcc 2.7.2Matlab built-inJava InterpretedJava JITToba off-line compiler
Other Experiments included
Tensor FFT optimized for MT/MP ApplicationsCoded in Matlab and Java
0.0
2.0
4.0
6.0
8.0
10.0
gcc -O3 1.0 1.0 1.0 1.0 1.0
Solaris 2.6 w/JIT 1.2 1.6 2.2 2.6 2.8
Toba 1.0b6 -O3 1.3 1.8 2.5 4.0 3.2
gcc unopt 1.7 2.5 3.6 4.2 2.5
Kaffe 0.9.2 2.1 3.3 5.0 5.8 6.3
Matlab built-in 6.7 4.9 4.9 2.0 2.0
Java 1.1.4 Interp 7.2 15.0 25.0 30.0 34.0
F4 F16 F64 F256 F1024
Rel
ativ
e Ti
me
Traditional FFT Rel. Results
Java FFT ResultsJava May Offer Sufficient FFT Performance
Small FFT’s Are Within 20% to 60% of C performanceLarger FFT’s Are 2-3x Less EfficientObject Oriented FFT’s Are Much Less Efficient (>>30x)Better Java Compiler Technology may decrease this gap
JIT Compilers Offer 6-12x speed-up on C-like JavaTraditional FFT speedup of 6-12x over Interpreted Java
Delft-Java EngineRISC-style Architecture
32-bit Instructions Multiple Register Files
Concurrent Multithreaded Organization
Multiple Hdwr Thread UnitsMultiple Instruction Issue Per Thread
Indirect Register Access
Supervisory InstructionsBranch Java View (bex)
Integer & Floating Point8, 16, 32, and 64-bit Signed & Unsigned IntegersIEEE-754 Floating Point
Multimedia InstructionsSIMD Parallelism
DSP Arithmetic ExtensionsSaturation LogicRounding Modes
32-bit Address SpaceBase + Offset + Displacement
Java Hardware SupportTransparent Extraction of Parallelism
Multiple concurrent thread units
Dynamic Java Instruction TranslationRegister file caches stack with indirect access
JVM Reserved Instruction Used For BEX
Link Translation Buffer For Dynamic LinkingAssociates a caller’s object reference and constant pool entry ID with a linked object invocation
Logical Controller For Non-Supported TranslationsThin interpretive layer and Java run-time
Java H/W ExecutionSingle-threaded Dynamic Translation
A form of hardware register allocationTransform stack bottlenecks into pipeline dependenciesPipeline dependencies are removed using superscalar techniques
3.2x speedup achieved over a pipelined stack model
Up to 60% of stack bottlenecks removed
For translated instruction streams, out-of-order execution realized a 50% performance improvement when compared with in-order execution
ConclusionsPower control is a major enabler of convergence devices
Special purpose low power hardware techniques requiredAttention to implementation details a must
DSP Applications are becoming more complex and will be written in C
Highly optimizing compilers will be required
Java may become the predominant programming platform for 3G wireless
Limited to applications code
Embedded/DSP applications have distinct requirements from general purpose applications