Introducing: GenTera’s IMAGINE 3
HANS DE VRIES
GenTera’s
IMAGINE 3 Building Blocks
PCI/AGP Bus interface
128 bit DDR-SDRAM Bus
Imagine 3 Core Processor: Multi-Stream (32) Scalar / Vector Processor, 80 Billion operations / second
Advanced High Quality 3D Graphics / Volume processing Pipelines: 220 Billion operations / second
Graphics Mask Generator
Motion Estimator: 100 Billion op/s
Data (Video) Input
Data (Video) Output
Dataflow Ring Input
Dataflow Ring Output
Bandwidths shown in the diagram: 2.0 Gigabyte/s, 2.0 Gigabyte/s, 160 Megabyte/s, 1.0 Gigabyte/s, 4.2 Gigabyte/s, 0.5 Gigabyte/s
GenTera’s
IMAGINE 3 Core Processor
HISC™ processor architecture
120 General Purpose registers (2x32 bit)
256 Vector registers (2x32 bit)
256x4 MAC Vector registers (2x32 bit)
128 Special Purpose control registers (2x32 bit)
1200 control table registers (2x32 bit)
80 Billion operations per second (320 operations per cycle), including:
  64 Multiply Accumulates per cycle with saturate
  40 Conditional operations per cycle
  24 internal addresses per cycle
10 Gigabyte per second streaming I/O (memory & processor I/O)
32 simultaneous concatenated vector streams (32 bit) (128 in byte mode)
Single cycle 2D and 3D addressing modes (1D, 2D and 3D memory management)
Tools: C and C++ compiler, Assembler, Linker, Debugger, Visual Simulator, Soft In-circuit Emulator
Libraries: Image Processing Library, 3D graphics Library, Multi Media Library, Machine Vision Library
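The throughput and per-cycle figures quoted for the core imply a clock rate. The short sketch below only does that arithmetic; the function name is made up for illustration, and the result is an inference from the quoted numbers rather than a figure stated for the core on this slide.

```python
# Figures quoted on the Core Processor slide.
CORE_OPS_PER_SECOND = 80e9
CORE_OPS_PER_CYCLE = 320
MACS_PER_CYCLE = 64


def implied_clock_hz(ops_per_second, ops_per_cycle):
    """Clock rate implied by a throughput and a per-cycle operation count."""
    return ops_per_second / ops_per_cycle


clock_hz = implied_clock_hz(CORE_OPS_PER_SECOND, CORE_OPS_PER_CYCLE)
print(clock_hz / 1e6, "MHz implied clock")
print(1e9 / clock_hz, "ns per cycle")
print(MACS_PER_CYCLE * clock_hz / 1e9, "billion multiply-accumulates per second")
```

The implied cycle time agrees with the 4 ns cycle quoted later for the 3D texture/volume hardware.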
GenTera’s
IMAGINE 3 HISC Processor Architecture
HISC: Hierarchical Instruction Set Computer
RISC LEVEL: provides C and C++ compatibility.
VLIW LEVEL: a moderate-length VLIW instruction word plus a fully programmable bus interconnect directly controlled by the instruction code.
EXTENDED VECTOR PROCESSING: numerous function-specific Control Registers add extended functionality that is activated by the group of extended operations (as opposed to the basic operations). This increases the effective instruction word for vector operations to 1000+ bits.
VARIABLE LENGTH VECTOR PROCESSING: enables up to 32 simultaneous and concatenated Vector Processing Streams. Word-based Vector Processing (32, 2x16, 4x8) is applied symmetrically throughout the entire architecture.
GenTera’s
IMAGINE 3 Core Processor
Examples of Basic Processor Stream performance (from external memory to external memory)
Standard GUI functions:
  Screen to Screen Copy: 2000 Mega pixels/s (8 bit pixels), 500 Mega pixels/s (32 bit pixels)
  3 operand ROPS: 1000 Mega pixels/s (8 bit pixels)
  Bitmap to Color expansion: 2000 Mega pixels/s (8 bit pixels)
Windows Direct Draw GUI functions:
  Pseudo to True Color: 500 Mega pixels/s (8 bit pseudo to 16 bit or 32 bit colors)
  True Color to Pseudo: 500 Mega pixels/s (32, 16 bit color to 8 bit pseudo color)
  Z buffer aware copy: 666 Mega pixels/s (8 bit pixels, 16 bit Z buffer), 500 Mega pixels/s (16 bit pixels, 16 bit Z buffer)
  Alpha Blended Copy: 250 Mega pixels/s (32 bit ARGB pixels)
GenTera’s
IMAGINE 3 Core Processor
Examples of Core Processor stream performance (2) (from external memory to external memory)
Multi Media Functions (numbers in result pixels/s):
  YUV to RGB conversion: 500 Mega pixels/s (32 bit color, 16 bit hi-color, 8 bit pseudo)
  DCT and IDCT (8x8 blocks): 167 Mega pixels/s (16 bit values, 32 bit calculations)
  DCT and IDCT (8x8 blocks): 667 Mega pixels/s (8 bit values, 16 bit calculations)
Photoshop-type Image Processing Functions (numbers in result pixels/s):
  3x3 kernel convolution: 2000 Mega pixels/s (8 bit pixels, 16 bit calculations)
  7x7 kernel convolution: 500 Mega pixels/s (8 bit pixels, 16 bit calculations)
  Bi-cubic Rotation: 1000 Mega pixels/s (8 bit pixels, 16 bit calculations)
  Bi-cubic Scaling: 1000 Mega pixels/s (8 bit pixels, 16 bit calculations)
3D graphics Geometry:
  (4x4) homogeneous transformations plus perspective divides for X, Y and Z for meshed triangles in 32 bit floating point (IEEE): 50 Million triangles/s
GenTera’s
IMAGINE 3 Core Processor
[Diagram: core processor data path]
Data Read Ports: A0 (REG A0, VIO 0), A1 (REG A1, VIO 1), B0 (REG B0, DIO 0), B1 (REG B1, DIO 1)
Data Processing Units: X0 (MAC X0, ALU X0), X1 (MAC X1, ALU X1), Y0 (MAC Y0, ALU Y0), Y1 (MAC Y1, ALU Y1)
Interconnect (100 % connectivity)
Data Write Ports: REG WR0, REG WR1, DIO WR, VIO WR
GenTera’s
IMAGINE 3 Core Processor
[Diagram: core processor units on the bus interconnect]
Units shown: SEQ, DIO, VIO 0, VIO 1, I3D0, I3D1, MES0, MES1, RING0, RING1, REG (A0 B0), REG (A1 B1), ALU X0, ALU Y0, ALU X1, ALU Y1, MAC X0, MAC Y0, MAC X1, MAC Y1, MSK0, MSK1, VAU 0, VAU 1, MTAB, EMI
Control Register Busses: Control reg bus 1 bits [63:32], Control reg bus 0 bits [31:0]
GenTera’s
IMAGINE 3 Instruction Word
Highly orthogonal VLIW instruction word
  Bits [127:64]: Da | Wr1 | B1 | A1 | Y1 | X1 (bit markers 127, 123, 112, 100, 88, 76, 64)
  Bits [63:0]:   Dd | Wr0 | B0 | A0 | Y0 | X0 (bit markers 63, 59, 48, 36, 24, 12, 0)
ND0 = 0: Data Processing Functions
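The bit markers on this slide suggest five 12 bit slots per 64 bit half, with 4 control bits on top. The sketch below splits a 128 bit word along those markers. The slot order (X lowest, then Y, A, B, Wr), the key names and the helper functions are assumptions read off the markers, not a documented encoding.

```python
FIELD_BITS = 12
# Assumed slot order from bit 0 upward within one 64 bit half.
SLOT_NAMES = ("X", "Y", "A", "B", "Wr")


def decode_half(half64, unit):
    """Split one 64 bit half into five 12 bit slots plus the top 4 bits."""
    fields = {}
    for position, name in enumerate(SLOT_NAMES):
        fields[f"{name}{unit}"] = (half64 >> (position * FIELD_BITS)) & 0xFFF
    # Bits [63:60] of the half; "top" is a placeholder name for illustration.
    fields[f"top{unit}"] = (half64 >> 60) & 0xF
    return fields


def decode_instruction(word128):
    """Decode both halves: unit 0 from bits [63:0], unit 1 from bits [127:64]."""
    lower = word128 & ((1 << 64) - 1)
    upper = word128 >> 64
    return {**decode_half(lower, 0), **decode_half(upper, 1)}


word = 0xABC | (0x123 << 12) | (0x5 << 60) | (0x777 << 64)
print(decode_instruction(word))
```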
GenTera’s
IMAGINE 3 Interconnect
[Diagram: interconnect in Scalar Processing Mode]
The Instruction Word provides 8-way interconnectivity in Scalar Processing Mode.
Data Processing Unit: Select path 1 and Select path 2 each choose from A0, A1, B0, B1, X0, X1, Y0, Y1.
Data Write Port: Select path chooses from A0, A1, B0, B1, X0, X1, Y0, Y1.
GenTera’s
IMAGINE 3 Interconnect
[Diagram: interconnect in Vector Processing Mode]
The Instruction Word provides 100% interconnectivity in Vector Processing Mode.
Select path 1 and Select path 2 of a Data Processing Unit, and the select path of a Data Write Port, each choose from 16 sources:
A0 REG, A0 MEM, B0 REG, B0 MEM, X0 ALU, X0 MAC, Y0 ALU, Y0 MAC, A1 REG, A1 MEM, B1 REG, B1 MEM, X1 ALU, X1 MAC, Y1 ALU, Y1 MAC
GenTera’s
IMAGINE 3 Instruction Word
Data processing instruction fields
Fields Y0 and X0 (bit markers 24, 20, 16, 12, 8, 4, 0). Each field takes one of three forms:
  0 0 | ALU | path 1 | path 2
  0 1 | Shift, Ufu | path 1 | path 2
  1 | MAC | path 1 | path 2
GenTera’s
IMAGINE 3 Instruction Word
Data read ports instruction fields
Fields B0 and A0 (bit markers 48, 44, 40, 36, 32, 28, 24).
[Diagram: encodings of the register port and the memory port. Entries shown: register; register size; control register size; 16 bit immediate [15:8] and [7:0] (Be31, Be20); 11 bit signed immediate; VIO function size; DIO read size.]
GenTera’s
IMAGINE 3 Instruction Word
DIO address / data and (control-) register write ports fields
[Diagram: field Wr0 with ND (bit markers 127, 123, 63, 62, 59, 58, 56, 52, 48). Entries shown: non data-processing function; DIO address select; DIO data select; DIO rd/wr; DIO address; register path; control register path; size; wr addr; rd addr; wr data.]
GenTera’s
IMAGINE 3 Parallel Conditional Processing
64 bit Uniform Status Register
  Bits:   [63:56]  [55:48]  [47:40]  [39:32]  [31:24]  [23:16]  [15:8]  [7:0]
  Units:  X1 Y1    X1 Y1    X1 Y1    X1 Y1    X0 Y0    X0 Y0    X0 Y0   X0 Y0
One status group for each of Byte 0 to Byte 7.
ALU Status (S0 C0 M0 Z0): Overflow, Carry, Minus, Zero (ALU, Shifts, Unary functions)
MAC Status (W0 L0 H0 I0): Wrong, Lower, Higher, Inside
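As a rough illustration of how software might pick apart such a status register, the sketch below splits a 64 bit value into eight bytes and treats each byte as two 4 bit flag groups. The nibble split, the bit order inside a nibble and both helper names are assumptions made for this example; the slide only gives the byte ranges and the flag letters.

```python
ALU_FLAGS = ("S", "C", "M", "Z")  # Overflow, Carry, Minus, Zero
MAC_FLAGS = ("W", "L", "H", "I")  # Wrong, Lower, Higher, Inside


def unpack_status(status64):
    """Return eight (high nibble, low nibble) pairs, bits [7:0] first."""
    lanes = []
    for byte_index in range(8):
        byte = (status64 >> (8 * byte_index)) & 0xFF
        lanes.append((byte >> 4, byte & 0xF))
    return lanes


def decode_flags(nibble, names):
    """Names of the flags set in a nibble; first name = most significant bit (assumed)."""
    return [name for bit, name in zip((8, 4, 2, 1), names) if nibble & bit]


lanes = unpack_status(0x91)
print(lanes[0])
print(decode_flags(lanes[0][0], ALU_FLAGS))
print(decode_flags(lanes[0][1], MAC_FLAGS))
```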
GenTera’s
IMAGINE 3 Parallel Conditional Processing
Status: Generation, Collection and Application
[Diagram: status for byte lanes 0 to 7 (X0 and Y0 for lanes 0 to 3, X1 and Y1 for lanes 4 to 7) is generated, collected and applied across the ALU/MAC units X0, Y0, X1, Y1, the MSK/VAU units V0, V1 and the vector registers A0 B0, A1 B1.]
GenTera’s
IMAGINE 3 Register File
256 vector registers: 2 x 32 bit wide, 4 x 16 bit wide, 8 x 8 bit wide
  up to 24 independent and conditional byte addresses
  up to 8 independent and conditional byte write enables
120 general registers: 2 x 32 bit / 4 x 16 bit / 8 x 8 bit
[Diagram: register file (general purpose registers, vector registers)]
Addresses: Vector Index generators for Write Port C, Read Port A and Read Port B (8 x Write Indices, 8 x Read A Indices, 8 x Read B Indices); General Register Addresses from the Instruction Code (2 x Write Address, 2 x Read A Address, 2 x Read B Address)
Data ports: Write Port C input bus select, Read Port A output bus register, Read Port B output bus register (Write Data, Read A Data, Read B Data: 2, 4, 8 x)
Internal bus matrix: A0, A1, B0, B1
GenTera’s
IMAGINE 3 Function Units
ALU: Arithmetic, Boolean, Shift / Rotate, Unary Functions; 4 x 8, 2 x 16, 1 x 32, 32 bit float
MULTIPLIER: (un)signed x (un)signed; binary point at end, middle or top; graphics formats (0.0..1.0 == 00..ff); 4 x 8, 2 x 16, 1 x 32, 32 bit float
MAC Vector Registers: 256 words x 64 bit
ACCUMULATOR
Variable Range Clamp
GenTera’s
IMAGINE 3 Multiplier / Accumulator
8 bit Matrix functions:
Quad Inproduct (16 multiplies & 12 adds per MAC)
Matrixvec (16 multiplies & 12 adds per MAC)
[Diagram: 4 x 4 arrays of 8 bit inputs with 16 bit products and sums. In one arrangement, 32 bit input data enters a 4 tap shift register (4 times for each byte); in the other, 32 bit input data is distributed to all four columns (4 times for 4 bytes).]
GenTera’s
IMAGINE 3 Multiplier / Accumulator
8 bit Matrix functions:
Open GL Blend Function (8 multiplies & 4 adds per MAC)
[Diagram: 32 bit input data enters a 4 tap shift register (4 times for each byte), shown twice; each 8 bit input feeds 16 bit products.]
Coefficients fixed or derived from the input operands:
   0 BLEND_CONSTANT
   1 BLEND_ZERO
   2 BLEND_ONE
   3 SRC_COLOR
   4 INV_SRC_COLOR
   5 SRC_ALPHA
   6 INV_SRC_ALPHA
   7 DST_ALPHA
   8 INV_DST_ALPHA
   9 DST_COLOR
  10 INV_DST_COLOR
  11 SRC_ALPHA_SATURATE
  12 BOTH_SRC_ALPHA (source), BOTH_SRC_ALPHA (dest)
  13 BOTH_INV_SRC_ALPHA (source), BOTH_INV_SRC_ALPHA (dest)
  14 MAX_INTENSITY (source), MAX_INTENSITY (dest)
  15 MIN_INTENSITY (source), MIN_INTENSITY (dest)
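To make the blend function concrete, here is a minimal software sketch of one common coefficient pair, SRC_ALPHA for the source and INV_SRC_ALPHA for the destination, on 8 bit channels in the 0.0..1.0 == 00..ff format. It costs two multiplies and one add per channel, so eight multiplies and four adds per ARGB pixel. The rounding rule, the saturation and all function names are assumptions for illustration; the hardware's exact arithmetic is not given on the slide.

```python
def mul8(a, b):
    """Multiply two 8 bit fractions (0x00..0xff == 0.0..1.0), rounding to nearest."""
    return (a * b + 127) // 255


def blend_channel(src, dst, src_factor, dst_factor):
    """src * src_factor + dst * dst_factor, saturated to 8 bits."""
    return min(255, mul8(src, src_factor) + mul8(dst, dst_factor))


def blend_src_alpha(src_argb, dst_argb):
    """Blend (A, R, G, B) tuples with SRC_ALPHA / INV_SRC_ALPHA coefficients."""
    alpha = src_argb[0]
    return tuple(
        blend_channel(s, d, alpha, 255 - alpha)
        for s, d in zip(src_argb, dst_argb)
    )


print(blend_src_alpha((255, 10, 20, 30), (0, 200, 200, 200)))   # opaque source
print(blend_src_alpha((0, 10, 20, 30), (255, 200, 200, 200)))   # transparent source
```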
GenTera’s
IMAGINE 3 Multiplier / Accumulator
16 bit Matrix functions:
Convolute (4 multiplies & 2 adds per Multiplier)
Transform (4 multiplies & 2 adds per Multiplier)
[Diagram: 2 x 2 arrays of 16 bit inputs with 32 bit products and sums. In one arrangement, 32 bit input data enters a 2 tap shift register (2 times for each 16 bit word); in the other, 32 bit input data is distributed to both columns (2 times for each 16 bit word).]
Mix:
  MH[63:32] = Coef10[31:0] . Mb[31:16] + Coef11[31:0] . Ma[31:16]
  ML[31:0]  = Coef00[31:0] . Mb[15:0]  + Coef01[31:0] . Ma[15:0]
Merge:
  MH[63:32] = Coef10[31:0] . Ma[31:16] + Coef11[31:0] . Ma[15:0]
  ML[31:0]  = Coef00[31:0] . Mb[31:16] + Coef01[31:0] . Mb[15:0]
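The Mix and Merge formulas above can be written out as a small reference model. This is a sketch under stated assumptions: operands are treated as unsigned, results are not truncated to 32 bits, and the function names are invented for the example.

```python
MASK16 = 0xFFFF


def halves(word32):
    """Return the (high, low) 16 bit halves of a 32 bit word."""
    return (word32 >> 16) & MASK16, word32 & MASK16


def mix(ma, mb, coef00, coef01, coef10, coef11):
    """Mix: each result combines the same half of Mb and Ma."""
    ma_hi, ma_lo = halves(ma)
    mb_hi, mb_lo = halves(mb)
    mh = coef10 * mb_hi + coef11 * ma_hi
    ml = coef00 * mb_lo + coef01 * ma_lo
    return mh, ml


def merge(ma, mb, coef00, coef01, coef10, coef11):
    """Merge: MH combines both halves of Ma, ML combines both halves of Mb."""
    ma_hi, ma_lo = halves(ma)
    mb_hi, mb_lo = halves(mb)
    mh = coef10 * ma_hi + coef11 * ma_lo
    ml = coef00 * mb_hi + coef01 * mb_lo
    return mh, ml


print(mix(0x00020003, 0x00040005, 1, 1, 1, 1))
print(merge(0x00020003, 0x00040005, 1, 1, 1, 1))
```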
GenTera’s
IMAGINE 3 Multiplier/Accumulator
Each of the 4 Multiplier/Accumulators handles all operations by utilizing the same hardware!
  8 x 8 extern, 8 x 16 intern
  16 x 16 bit extern, 16 x 32 bit intern, 32 bit accumulate
  32 x 32 bit extern, 32 x 32 bit intern, 32 x 32 bit floating point, 64 bit accumulate
Imagine 3 operations per cycle:
  64: 8x16 bit: quad in-product (4 comp.)
  64: 8x16 bit: 4x4 matrix x vector
  32: 8x16 bit: Open GL blending functions
  16: 16x16 bit: in-product, cross-product
  16: 16x16 bit: complex product
  16: 16x32 bit: FIR filter
  16: 16x32 bit: in-product, cross-product
  16: 16x32 bit: complex product
GenTera’s
IMAGINE 3 Vector processing
[Diagram: dataflow graph of the example over cycles 1 to 35, with nodes genad(A0), genad(A1), A0=rd4x8(ri), A1=rd4x8(ri), B0=input, B1=rd(RING_Data), X0=mult(A0,B0,nuu), Y0=subsat(X0,A1), X1=mult(Y0,B1,nus), DA=again, D0=word4x8(uI), X0=addsat(X1,D0), Y0=matxvec(X0), Y1=inproduct(X0), X1=addsat(Y0,Y1), outputV1]
ACTUAL ASSEMBLY CODE FOR THE EXAMPLE ABOVE:
repeat, graph (label_1);;;
label_1: genad(A0) => B0=input, A0=rd4x8(ri) => X0=mult(A,V,nuu) ===>
         genad(A1) => A1=rd4x8(ri) => Y0=subsat(X0,A1), B1=rd4x8(RING_Data) => X1=mult(Y0,B1,nus) ===>
         DA=Again ==> D0=word4x8(uI), X0=addsat(X1,D0) => Y0=matxvec(X0), Y1=inproduct(X0) =====>
         X1=addsat(Y0,Y1) => outputV1;
Variable length vector processing made simple.
GenTera’s
IMAGINE 3 10 Gigabyte Streaming I/O
The Imagine 3 core can stream data from memory or other processors at 10 GByte/sec (compared to 0.48 GByte/sec for the Imagine 1).
[Diagram: the IMAGINE 3 Internal Data Processing Core and its streaming connections]
VECTOR UNITS: simultaneous input and output to and from memory
DATA CACHE or 3D GRAPHICS / VOLUME pipelines: input and output
Dataflow Ring input
Dataflow Ring output
GenTera’s
IMAGINE 3 Non-aligned SIMD
SIMD processing made simple with non-aligned memory accesses (no complex, time-consuming shift-mask-merge operations needed).
[Diagram: a 32 bit word of four 8 bit elements read at an arbitrary byte offset across consecutive 32 bit memory words]
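For comparison, this is roughly what the shift-mask-merge sequence looks like when software has to assemble a non-aligned 32 bit word from two aligned memory words. The little-endian byte order and the function name are assumptions for the sketch; it models the work the slide says the hardware makes unnecessary, not the hardware itself.

```python
def read_unaligned_32(memory_words, byte_offset):
    """Read 32 bits at any byte offset from a list of aligned 32 bit words.

    Assumes little-endian byte order within each word.
    """
    index, shift = divmod(byte_offset, 4)
    low = memory_words[index]
    if shift == 0:
        return low  # already aligned, no merge needed
    high = memory_words[index + 1]
    # Shift the low word down, shift the high word up, merge, mask to 32 bits.
    return ((low >> (8 * shift)) | (high << (8 * (4 - shift)))) & 0xFFFFFFFF


words = [0x44332211, 0x88776655]
print(hex(read_unaligned_32(words, 0)))
print(hex(read_unaligned_32(words, 1)))
print(hex(read_unaligned_32(words, 3)))
```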
GenTera’s
IMAGINE 3 Non Aligned Vector Accesses
32 bit words
2 x 16 bit words
16 bit words
4 x 8 bit words
8 bit words
2 x 8 bit words
2 input and 2 output vectors simultaneously
GenTera’s
IMAGINE 3 Memory Vector Accesses
[Diagram: vector access path between the External Memory Interface and the Imagine 3 Processor Core]
Vector Access Units: up to 32 vectors in flight
Blocks shown on the input side: 2 kB Vector pre-fetch buffer, 2D restructuring Vector pipeline, Mask Unit (256 pixels / voxels), data/color input conversion
Blocks shown on the output side: data/color output conversion, Mask Unit (256 pixels / voxels), 2D restructuring Vector pipeline, 2.25 kB Vector write buffer
Vector I/O
GenTera’s
IMAGINE 3 1, 2 and 3D memory management
2D tiles, each a 1 MByte PAGE (axes X, Y):
  1024 x 1024, 8 bit pixel TILE
  512 x 1024, 16 bit pixel TILE
  256 x 1024, 32 bit pixel TILE
3D bricks (axes X, Y, Z):
  256 x 128 x 128, 8 bit voxel BRICK
  128 x 128 x 128, 16 bit voxel BRICK
  64 x 128 x 128, 32 bit voxel BRICK
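A quick check of the tile and brick sizes: each 2D tile works out to exactly one 1 MByte page, and each 3D brick to four times that. The sketch also includes a byte offset helper that assumes plain row-major order inside a tile; that ordering and the helper names are illustrative assumptions, since the slide does not describe the internal layout.

```python
PAGE_BYTES = 1 << 20  # 1 MByte page

# bits per element -> dimensions as listed on the slide
TILES = {8: (1024, 1024), 16: (512, 1024), 32: (256, 1024)}
BRICKS = {8: (256, 128, 128), 16: (128, 128, 128), 32: (64, 128, 128)}


def size_bytes(dims, bits):
    """Total bytes for a block with the given dimensions and element width."""
    total = bits // 8
    for extent in dims:
        total *= extent
    return total


def offset_in_tile(x, y, bits):
    """Byte offset of pixel (x, y), assuming row-major order (an assumption)."""
    width, height = TILES[bits]
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("pixel outside tile")
    return (y * width + x) * (bits // 8)


for bits in (8, 16, 32):
    print(bits, size_bytes(TILES[bits], bits), size_bytes(BRICKS[bits], bits))
```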
GenTera’s
IMAGINE 3 3D texture/volume Hardware
Very High Quality
220 Billion operations/sec: 2 x 440 operations per cycle (4 ns)
Texture Quality: bilinear, trilinear and QUAD interpolation.
Texture Types: 32 bit ARGB, 16 bit (4 types), 8, 4, 2 and 1 bit pseudo color, 16 bit and 32 bit greyscale (signed and unsigned), 2x16 bit complex.
Texture Size: 16,384 x 16,384 max (2D), 2048 x 2048 x 2048 max (3D).
Texture Dimension: 1, 2 and 3 dimensional textures.
Texture Clamping: Clamp and Wrap for all 3 co-ordinates.
Texture Border: 0 or 1 pixel texture borders, Border Color supported.
Texture MIP maps up to 16 levels: selection made for each individual pixel.
Perspective division for all 9 parameters: S, T, R, Alpha, Red, Green, Blue, Fog, Z
  Perspective Correct Texture Mapping
  Perspective Correct Texture Lighting
  Perspective Correct Linear and Exponential (2 types) Fog
  Perspective Correct Depth Buffering
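Perspective division per parameter usually means the textbook scheme: interpolate each parameter divided by w and 1/w linearly in screen space, then divide the two at every pixel. The sketch below shows that standard technique for a single parameter between two vertices; it is not taken from the Imagine 3 documentation, and the function name is made up.

```python
def perspective_interpolate(attr0, w0, attr1, w1, t):
    """Perspective-correct value at screen-space fraction t between two vertices.

    attr0, attr1: parameter values at the vertices (for example S, T or Red)
    w0, w1:       homogeneous w at the vertices (must be non-zero)
    """
    inv_w = (1 - t) / w0 + t / w1                      # linear in screen space
    attr_over_w = (1 - t) * attr0 / w0 + t * attr1 / w1
    return attr_over_w / inv_w                          # the per-pixel divide


# Halfway across the screen, the value is pulled toward the nearer vertex.
print(perspective_interpolate(0.0, 1.0, 1.0, 3.0, 0.5))
# With equal w the result reduces to plain linear interpolation.
print(perspective_interpolate(0.0, 2.0, 1.0, 2.0, 0.5))
```

Applying this to all nine listed parameters costs one reciprocal and nine multiplies per pixel if 1/w is shared, which is one common way to organize the work.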
GenTera’s
IMAGINE 3 3D graphics Pipelines
[Diagram: the dual 3D graphics pipelines, attached to the D BUS]
Blocks shown in each pipeline:
  3D graphics pipeline control unit
  Perspective MIP map processing pipeline
  Bresenham Edge Start Interpolators (Q, R, S, T, Z-1) (F, A, R, G, B)
  Vector Start Interpolators (Q, R, S, T, Z-1) (F, A, R, G, B)
  Pixel Value Interpolators (Q, R, S, T, Z-1) (F, A, R, G, B)
  Perspective 3D co-ordinate Generator: 5 stages
  Perspective 3D correct Lighting: 5 stages
  Perspective MIP Map Addresses Calculations: 2 stages
  Perspective Interpolation Coefficients
  Perspective Lighting & Fog Coefficients
  Memory Access Input Fifo / Port Select
  External Memory with MIP Map Textures: 4 - 6 stages
  Memory Access Re-order buffers
  Memory Access Internal Delay Line for Interpolation, Lighting & Fog Coefficients: 3 - 17 stages
  Memory Access Data Load unit
  Texel Interp. / Lighting control unit
  Texel Selection / Expansion
  Texel Color Look Up
  Texel Interpolation / Lighting coefficients generator
  Texel Interpolation / Lighting Multiply stage
  Texel Interpolation / Lighting Summation stage
GenTera’s
IMAGINE 3 3D texture/volume Hardware
3D graphics Pipeline + Core stream performance (from external memory to external memory)
Direct Draw functions (numbers in result pixels/s):
  Bilinear Image Scale: 333 Mega pixels/s (32 bit gray scale or 32 bit color pixels)
  Bilinear Image Rotate: 333 Mega pixels/s (32 bit gray scale or 32 bit color pixels)
  Bilinear Affine Transform: 333 Mega pixels/s (32 bit gray scale or 32 bit color pixels)
MPEG functions (numbers in result pixels/s):
  Bilinear Scaling plus kYUV to αRGB: 333 Mega pixels/s (32 bit αRGB pixels)
3D functions (numbers in result pixels/s):
  Z-buffered, Perspective Correct, Bilinear Interpolated Texture mapping with perspective correct lighting and exponential fog (Texture size up to 16k x 16k), MIP-Mapping: 300 Mega pixels/s (32 bit αRGB pixels, 16 bit hi-color, 8 bit pseudo, 16 bit Z values)
GenTera’s
IMAGINE 3 Fan Beam Back projection
The 3D Texture/Volume pipelines and the Multiplier / Accumulators in the Imagine 3 can handle eight 16 bit linear interpolated samples per cycle with 32 bit accuracy.
[Diagram: fan beam geometry, labelled Vector Direction and Back Projection Direction]
GenTera’s
IMAGINE 3 Cone beam reconstruction
Back projection in cone beam systems requires an inverse perspective mapping from the filtered images back to a 3D volume. The Imagine 3 performs this directly with its 3D volume pipelines.
GenTera’s
IMAGINE 3 De-blur filtering
FIR filter performance (16 bit input, 32 bit calculations)
  128 taps: 32 Mega-pixels / second
  256 taps: 16 Mega-pixels / second
  512 taps: 8 Mega-pixels / second
Filtered Backprojection for Medical Imaging: 324 x 512 to 256 x 256 (324 projections of 512 values, 256 x 256 result image)
  De-blur filtering: 10 ms (256 taps)
  Backprojection: 11 ms
  Reconstruction: 21 ms
Filtered Backprojection for Medical Imaging: 840 x 928 to 512 x 512 (840 projections of 928 values, 512 x 512 result image)
  De-blur filtering: 100 ms (512 taps)
  Backprojection: 108 ms
  Reconstruction: 208 ms
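The de-blur filtering times can be cross-checked against the FIR throughput figures: divide the number of projection samples by the filter rate. The sketch below does only that arithmetic, with a made-up helper name, and the results land close to the rounded 10 ms and 100 ms on the slide.

```python
def filter_time_ms(projections, values_per_projection, megapixels_per_second):
    """Time to FIR-filter every projection sample at the given rate."""
    samples = projections * values_per_projection
    return samples / (megapixels_per_second * 1e6) * 1e3


# 324 projections of 512 values through a 256 tap filter (16 Mega-pixels/s)
print(filter_time_ms(324, 512, 16))
# 840 projections of 928 values through a 512 tap filter (8 Mega-pixels/s)
print(filter_time_ms(840, 928, 8))
```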
GenTera’s
IMAGINE 3 De-blur filtering (FFT)
Complex input Fast Fourier Transform performance (vectorized)
  Points    32 bit Floating Point    32 bit Integer    16 bit Integer
  256       8 μs                     4 μs              2.0 μs
  512       18 μs                    9 μs              4.4 μs
  1024      40 μs                    20 μs             10 μs
  2048      88 μs                    44 μs             22 μs
  4096      192 μs                   96 μs             48 μs
  8192      436 μs                   218 μs            109 μs
  16384     896 μs                   448 μs            224 μs
Filtered Back-projection for Medical Imaging: 1200 x 960 to 512 x 512 (1200 projections of 960 values, 512 x 512 result image)
  FFT filtering: 106 ms (2048 point FP)
  Back-projection: 157 ms
  Reconstruction: 263 ms
GenTera’s
IMAGINE 3 Radar Display Processing
Cartesian to Polar conversion with bi-linear interpolation 32 bit colors:
250 Mega-pixels /second
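One way to read this slide: for every Cartesian display pixel, compute the corresponding polar coordinate (range, azimuth) and fetch the radar sample there with bilinear interpolation. The sketch below follows that reading in plain Python. The data layout (`polar[range][azimuth]`), the centered origin and both function names are assumptions for illustration.

```python
import math


def sample_polar(polar, radius, angle):
    """Bilinear sample of polar[range][azimuth]; azimuth wraps, range clamps."""
    n_r, n_a = len(polar), len(polar[0])
    r = min(max(radius, 0.0), n_r - 1.0)
    a = (angle / (2 * math.pi)) * n_a % n_a   # fractional azimuth index
    r0 = int(r)
    r1 = min(r0 + 1, n_r - 1)
    fr = r - r0
    a0 = int(a) % n_a
    a1 = (a0 + 1) % n_a
    fa = a - int(a)
    near = polar[r0][a0] * (1 - fa) + polar[r0][a1] * fa
    far = polar[r1][a0] * (1 - fa) + polar[r1][a1] * fa
    return near * (1 - fr) + far * fr


def to_cartesian(polar, size):
    """Build a size x size display image centered on the radar origin."""
    center = (size - 1) / 2
    scale = (len(polar) - 1) / center if center else 0.0
    image = []
    for y in range(size):
        row = []
        for x in range(size):
            dx, dy = x - center, y - center
            radius = math.hypot(dx, dy) * scale          # Cartesian -> polar
            angle = math.atan2(dy, dx) % (2 * math.pi)
            row.append(sample_polar(polar, radius, angle))
        image.append(row)
    return image


rings = [[float(r)] * 8 for r in range(4)]  # value equals range index
for row in to_cartesian(rings, 5):
    print([round(v, 2) for v in row])
```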
GenTera’s
IMAGINE 3 Motion Estimators
Motion Estimation Unit for MPEG1…MPEG4 video encoding
100 Billion operations / second
- software controllable
- arbitrary MxN kernel sizes up to 256 by 256
- arbitrary search space sizes up to 4096 by 4096 for HDTV and higher
- allows optimizing algorithms (reduced search space)
- forward and backward prediction
- vector processing co-operation with core for bi-cubic pixel interpolation / rotation
Performance:
Compare a 16x16 pixel block with any other 16x16 pixel block (half, quarter, 1/8th, 1/16th pixels with bi-cubic interpolation):
120 Million Block Compares / second
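A block compare in motion estimation is commonly a sum of absolute differences (SAD) between two pixel blocks, minimized over a search window. The slide does not name the metric, so SAD is an assumption here, as are the function names. The sketch works on integer pixel positions only and leaves out the sub-pixel interpolation mentioned above.

```python
def sad(frame_a, ax, ay, frame_b, bx, by, size=16):
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(frame_a[ay + dy][ax + dx] - frame_b[by + dy][bx + dx])
    return total


def best_match(current, cx, cy, reference, search_range, size=16):
    """Full search: return (cost, mx, my) of the best motion vector."""
    height, width = len(reference), len(reference[0])
    best = None
    for my in range(-search_range, search_range + 1):
        for mx in range(-search_range, search_range + 1):
            rx, ry = cx + mx, cy + my
            if 0 <= rx <= width - size and 0 <= ry <= height - size:
                cost = sad(current, cx, cy, reference, rx, ry, size)
                if best is None or cost < best[0]:
                    best = (cost, mx, my)
    return best


# Synthetic test frames: the current frame is the reference shifted by (2, 1).
reference = [[(7 * x + 13 * y) % 251 for x in range(32)] for y in range(32)]
current = [[reference[min(y + 1, 31)][min(x + 2, 31)] for x in range(32)]
           for y in range(32)]
print(best_match(current, 8, 8, reference, search_range=3))
```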
GenTera’s
IMAGINE 3 Graphics Mask Generators
Generates Transparent and Opaque Masks for 512 pixels; multiple units work in parallel:
Window Mask Generator: automatically clips pixels outside the View Port (scissoring)
Span line Mask Generator: for Concave Polygons and arbitrary Objects
Range Mask Generator: for Depth Buffer Tests, Stencil Buffer Tests, Alpha Test, Chroma Keying Tests et cetera
Complex Mask Generator: for Concave and Complex Polygons according to the odd/even or winding rules
Alpha Mask Generator: for objects with partially covered pixels
GenTera’s
IMAGINE 3 Graphics Mask Generators
[Diagram: mask generator registers applied to an example with an overlap triangle]
Registers shown: Spanline Address; Window X min / max; Window Y min / max; Spanline 0 to 3 Start / End; Spanline Delta Start; Spanline Delta End; Spanline Y min / max; Spanline Length (-1); Range mask 0 to 3; Complex mask 0 to 3
The Range Mask contains the result of the Depth buffer test (overlapping triangle).
The Complex Mask is used in this example to hold the Polygon Stipple pattern.
The Spanline registers define the outlines of the triangle.
The Window is defined by the Window registers.
GenTera’s
IMAGINE 3 Multi media I/O units
Video Output: (Α), R, G, B outputs with 330 MHz dot clock for 1800 x 1400 screen format at 90 Hz. 12 (16) bit video out for Studio Quality video processing. Interface to DVI-TFT transmitters for high resolution, high quality LCD displays.
Video Input: CCIR 656, 8 bit digital video input for NTSC, PAL, SECAM, HDTV and custom formats.
Audio Codec 97 Interface: standard from Intel, Creative Labs, Yamaha, Analog Devices and National Semiconductor. Supports analog speakers, microphone, headphone + headphone micro, telephony and modem signals, CD analog audio in, analog video sound in, PC beep in, et cetera.
Digital Audio: 4 stereo serial I/O ports (I2S type and S type emulation capabilities). Supports CD, DVD and Dolby AC3 input or output.
External Device Control: 8 bit classic μP interface bus and I2C type emulation capability.
MIDI interface: input and output for synthesizers and keyboards.
GenTera’s
IMAGINE 3 Real Time Support
MULTI MEDIA REAL TIME SUPPORT
Level 1 Events (1 microsecond response time requirement): Horizontal Sync interrupts, Video I/O interrupts, Register Virtualization interrupts.
Level 2 Events (2 - 100 microsecond response time requirement): Communication Fifo interrupts, Mailbox interrupts, I2S Fifo interrupts, AC97 Fifo interrupts, MIDI interrupt, I2C interrupt, Vertical Sync interrupts, Scheduler Clock Tick, et cetera.
Threads (100 microsecond - 10 millisecond response time requirement): Host Command Queues Manager, Audio Stream managers, Modem Stream managers, user-definable threads.
GenTera’s
IMAGINE 3 High-end Board
8 Processors: 3.2 Tera operations/s, 4 GigaByte memory
[Diagram: board with eight IMAGINE 3 processors]
GenTera’s
IMAGINE 3 High-end Board
8 Imagine 3 processors, 3200 Billion operations per second
32 GigaByte per second Memory Bandwidth
16 GigaByte per second Inter-Processor Bandwidth
- Perspective Volume Rendering: 1000 x 1000 x 1000 at 15 frames/second (based on 25% volume traversal)
- Cone Beam Reconstruction: 512 x 512 x 512 from 1000² x 128 in 4 seconds
- Real Time 3D ultra sound reconstruction and visualization
- Real Time HDTV MPEG 4 video encoding
- Advanced Radar Processing
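The board-level totals are consistent with simple multiplication of the per-processor figures quoted earlier in the deck. The sketch below shows that arithmetic; treating the 3200 Billion operations per second as the sum of core, graphics pipelines and motion estimator is an inference, not something the slide states.

```python
PROCESSORS = 8

# Per-processor figures from the Building Blocks slide, in operations per second.
CORE_OPS = 80 * 10**9
PIPELINE_OPS = 220 * 10**9
MOTION_ESTIMATOR_OPS = 100 * 10**9

# Per-processor Dataflow Ring bandwidth in Gigabyte per second.
RING_GB_PER_S = 2.0

per_processor_ops = CORE_OPS + PIPELINE_OPS + MOTION_ESTIMATOR_OPS
board_ops = PROCESSORS * per_processor_ops
board_ring_gb_per_s = PROCESSORS * RING_GB_PER_S

print(board_ops / 10**9, "Billion operations per second")
print(board_ring_gb_per_s, "Gigabyte per second inter-processor bandwidth")
```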
GenTera’s
IMAGINE 3 High Speed Dataflow Ring
[Diagram: IMAGINE 3 processors connected in Dataflow Rings of eight]
Up to 2 Gigabyte per second Dataflow Ring (SSTL-2)
Point-to-point with Broadcast options and auto configuration
GenTera’s
IMAGINE 3 High Speed System I/O
The Dataflow Ring also provides very high speed System I/O.
Entry-level systems can use the programmable Video Data I/O for general purpose I/O (160 MB/s in and 1 GB/s out per processor).
[Diagram: rings of eight IMAGINE 3 processors with an optional System I/O FPGA, e.g. Xilinx Virtex II]
Video In: 160 MB/s
Video Out: 1 GB/s
Data-flow input: up to 2.0 GB/s
Data-flow output: up to 2.0 GB/s
GenTera’s
IMAGINE 3 Pipeline Processing
The Dataflow Ring allows long vector processing pipelines over multiple processors. Here is an example with just 2 processors.
[Diagram: two processors chained through the Dataflow Ring. Blocks shown: Vector Read from memory, Bi-linear Interpolated Data from the Graphics pipeline, ALU, 256 entry vector register, MAC as 3D blend unit, MAC as FIR filter, Vector Write to memory, Dataflow Ring]
GenTera’s
IMAGINE 3 128 bit memory bus (reads)
4.2 Gigabyte / second Memory Bus: 128 bit PC2100
Read clients on the 128 bit bus:
  16 kbyte 1st level data cache
  16 kbyte 1st level instruction cache
  Dual 128 word x 128 bit Vector input fifos
  Dual 3D-graphics pipelines
  PCI/AGP Memory Read access
  Video Output 128 word x 128 bit fifo
GenTera’s
IMAGINE 3 128 bit memory bus (writes)
4.2 Gigabyte / second Memory Bus (128 bit PC2100)
Write clients on the 128 bit bus:
  16 kbyte 1st level data cache
  Dual 128 word x 128 bit Vector output fifos
  16 word x 128 bit write buffer
  PCI/AGP Memory Write access
8-fold address interleaved memory reads and writes. Out-of-order accesses with coherency checking.
GenTera’s
IMAGINE 3 END