Larrabee: A Many-Core x86 Architecture
for Visual Computing
Doug Carmean
Larrabee Chief Architect
Hot Chips 20
August 26, 2008
Agenda
Architecture Convergence
Larrabee Architecture
Graphics Pipeline
Architecture Convergence
[Figure: programmability (fixed function, partially programmable, fully programmable) plotted against throughput performance (multi-threading, multi-core, many-core), with CPU and GPU curves]
• CPU
– Evolving toward throughput computing
– Motivated by energy-efficient performance
• GPU
– Evolving toward general-purpose computing
– Motivated by higher quality graphics and data-parallel programming
Larrabee: CPU programmability with GPU throughput
Graphics Rendering Pipelines
[Figure: five rendering pipelines, each running from Input Data to Frame Buffer]
• Pre 1996, customized software rendering: Software Rendering
• Pre 2001: Transformation and Lighting, Primitive Setup, Rasterization, Pixel Processing, Frame Buffer Blend
• DX8-DX10: Vertex Shading, Geometry Shading (DX10), Primitive Setup, Rasterization, Pixel Shading, Frame Buffer Blend
• Larrabee: Vertex Shading, Geometry Shading, Primitive Setup, Rasterization, Pixel Shading, Frame Buffer Blend
• Alternative Larrabee, customized pipeline: Software Rendering
Larrabee \Lar*a*bee”\, n. [from Intel]
1: a general, programmer-friendly architecture combining data-level and thread-level parallelism, optimized for high-performance throughput applications
2: a high-performance visual computing device
Cores for Throughput Tasks
CPU design experiment: specify a throughput-optimized processor with the same area and power as a standard dual-core CPU.

| | Standard dual-core CPU | Throughput-optimized design |
|---|---|---|
| # CPU cores | 2 out-of-order | 10 in-order |
| Instructions per issue | 8 per clock | 2 per clock |
| VPU lanes per core | 4-wide SSE | 16-wide |
| L2 cache size | 4 MB | 4 MB |
| Single-stream | 4 per clock | 2 per clock |
| Vector throughput | 16 per clock | 160 per clock |

Peak vector throughput for given power and area. Ideal for graphics & other throughput applications.
Design experiment: not a real 10-core chip!
10 times the peak vector throughput per clock
Data in table from Seiler, L., Carmean, D., et al. 2008. Larrabee: A many-core x86 architecture for visual computing. SIGGRAPH ’08: ACM SIGGRAPH 2008 Papers, ACM Press, New York, NY
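The vector throughput row can be checked with simple arithmetic. The sketch below uses only numbers stated in the table; the variable names are illustrative, and the dual-core figure of 16 per clock is taken from the table as given rather than derived.

```python
# Figures from the design-experiment table; names are illustrative only.
throughput_cores = 10             # in-order cores in the design experiment
vpu_lanes_per_core = 16           # 16-wide VPU per core
dual_core_vector_per_clock = 16   # taken directly from the table, not derived

# Peak vector throughput = number of cores x VPU lanes per core.
throughput_vector_per_clock = throughput_cores * vpu_lanes_per_core

# Ratio of the two designs' peak vector throughput per clock.
ratio = throughput_vector_per_clock / dual_core_vector_per_clock

print(throughput_vector_per_clock)  # 160
print(ratio)                        # 10.0
```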
Larrabee Block Diagram
• IA cores communicate on a wide ring bus
– Fast access to memory and fixed function blocks
– Fast access for cache coherency
• L2 cache is partitioned among the cores
– Provides high aggregate bandwidth
– Allows data replication & sharing
[Figure: Larrabee block diagram showing multi-threaded wide SIMD cores (each with I$ and D$), the L2 cache, memory controllers, fixed function logic, texture logic, display interface, and system interface]
Larrabee IA Core
• Separate scalar and vector units with separate registers
• In-order IA scalar core
• Vector unit: 16 32-bit ops/clock
• Short execution pipelines
• Fast access from L1 cache
• Direct connection to subset of the L2 cache
• Prefetch instructions load L1 and L2 caches
[Figure: core block diagram showing instruction decode, thread dispatch, ALU and AGU blocks, vector processor interface, L1 instruction cache, L1 data cache, 256K L2 cache, and ring]
Vector Processing Unit
• Vector complete instruction set
– Scatter/gather for vector load/store
– Mask registers select lanes to write, which allows data-parallel flow control
– Can map a separate execution kernel to each vector lane
• Vector instructions support
– Fast, wide read from L1 cache
– Numeric type conversion and data replication while reading from memory
– Rearrange the lanes on register read
– Fused multiply add (three arguments)
– Int32, Float32 and Float64 data
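The features listed above can be illustrated with a scalar model of a 16-lane vector unit. This is a minimal sketch in plain Python, not Larrabee code: the function names `gather`, `scatter`, and `masked_fma` are hypothetical, and the memory contents are made up for the example. It shows how a compare produces a per-lane mask, and how that mask turns an `if` into a masked write instead of a branch.

```python
LANES = 16  # vector width stated on the slide

def gather(memory, indices):
    """Vector load from non-contiguous addresses: one element per lane."""
    return [memory[i] for i in indices]

def scatter(memory, indices, values, mask):
    """Vector store to non-contiguous addresses, only for lanes whose mask bit is set."""
    for i, value, enabled in zip(indices, values, mask):
        if enabled:
            memory[i] = value

def masked_fma(dest, a, b, c, mask):
    """Per-lane a*b + c (three arguments); lanes with a clear mask bit keep dest."""
    return [x * y + z if enabled else old
            for old, x, y, z, enabled in zip(dest, a, b, c, mask)]

# Data-parallel form of: if x > 0: y = 2*x + 1, otherwise y stays 0.
memory = list(range(-8, 24))                   # 32 hypothetical memory words
indices = [2 * lane for lane in range(LANES)]  # strided, non-contiguous access
x = gather(memory, indices)                    # -8, -6, ..., 22
mask = [value > 0 for value in x]              # the compare result acts as the mask
y = masked_fma([0] * LANES, x, [2] * LANES, [1] * LANES, mask)
scatter(memory, indices, y, mask)              # write back only the enabled lanes
```

Every lane executes the same instruction sequence; the mask alone decides which lanes commit a result, which is the sense in which masks provide data-parallel flow control.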
[Figure: vector unit datapath showing mask registers, vector register file, 16-wide vector ALU, numeric convert stages, store buffer, and L1 data cache]
Texture Sampler
• Fixed function texture sampler
– Typical texture operations, including decompression, anisotropic filtering
– Core communication via the L2 cache
– Supports virtual address translation using IA page formats
• Fixed function vs. software
– Texture filtering needs specialized data access to unaligned 2x2 blocks of pixels
– Filtering is optimized for 8-bit color
– Code would take 12x longer for filtering or 40x longer if texture decompression is required
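To see why filtering needs unaligned 2x2 accesses, consider plain bilinear filtering, sketched below in software. This is a generic textbook formulation and not the sampler's actual algorithm; the function name, the texel-unit coordinates, and the clamp-to-edge addressing are assumptions made for the example.

```python
import math

def bilinear_sample(texture, u, v):
    """Bilinear filter over a 2D list of texel values.

    u and v are continuous coordinates in texel units. The 2x2 block that is
    read starts at (floor(u), floor(v)), which can be any texel position, so
    the block is generally not aligned to an even boundary.
    """
    height, width = len(texture), len(texture[0])
    x0, y0 = math.floor(u), math.floor(v)
    fx, fy = u - x0, v - y0  # fractional position inside the 2x2 block

    def texel(x, y):
        # Clamp-to-edge addressing (an assumption for this sketch).
        x = min(max(x, 0), width - 1)
        y = min(max(y, 0), height - 1)
        return texture[y][x]

    t00, t10 = texel(x0, y0), texel(x0 + 1, y0)
    t01, t11 = texel(x0, y0 + 1), texel(x0 + 1, y0 + 1)
    top = t00 * (1 - fx) + t10 * fx
    bottom = t01 * (1 - fx) + t11 * fx
    return top * (1 - fy) + bottom * fy

texture = [[0.0, 10.0],
           [20.0, 30.0]]
center = bilinear_sample(texture, 0.5, 0.5)  # 15.0, the average of all four texels
```

Each sample touches four texels and performs three interpolations, before any decompression, which gives a rough feel for why a software path costs many more instructions than dedicated logic.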
[Figure: texture sampler block diagram showing ring, L2 texture cache, L1 texture cache, AGU, IA TLB, and decompress, filter, and blend stages, alongside vector unit blocks (mask registers, vector registers, replicate, reorder, numeric convert, L1 data cache)]
Larrabee’s Binning Renderer
• Goals of a Binning Renderer:
1. Parallelism
a) Provide independent work queues for cores/threads
b) Significantly reduce synchronization points
2. Bandwidth Efficiency
a) Utilize on-die caches
b) Reduce off-die bandwidth pressure
[Figure: binning pipeline. Front end: the command stream feeds "Process a Primitive Set" stages, which produce the bin set (stored in off-chip memory). Back end: "Render a Tile" stages, with the frame buffer in L2 cache]
Intel® Microarchitecture (Larrabee)
Larrabee’s Binning Renderer
Screen is tiled and tiles processed concurrently
Software buffers (pixel, depth) also tiled
Tremendous bandwidth to tiled buffers
[Figure: screen divided into tiles]
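The front-end/back-end split of a binning renderer can be sketched as follows. This is a simplified illustration, not Larrabee's renderer: the 64-pixel tile size, the bounding-box test, and the names `bin_primitives` and `render_tile` are all assumptions for the example.

```python
from collections import defaultdict

TILE = 64  # hypothetical tile edge in pixels

def bin_primitives(bboxes, tile=TILE):
    """Front end: add each primitive's id to the bin of every tile its box overlaps.

    bboxes holds inclusive pixel bounds (x0, y0, x1, y1), one per primitive.
    """
    bins = defaultdict(list)
    for prim_id, (x0, y0, x1, y1) in enumerate(bboxes):
        for ty in range(y0 // tile, y1 // tile + 1):
            for tx in range(x0 // tile, x1 // tile + 1):
                bins[(tx, ty)].append(prim_id)
    return bins

def render_tile(tile_xy, prim_ids):
    """Back end stand-in: a real renderer would rasterize and shade here.

    Each tile needs only its own bin, so tiles are independent work items.
    """
    return (tile_xy, len(prim_ids))

bboxes = [(0, 0, 10, 10), (60, 60, 70, 70), (130, 5, 140, 20)]
bins = bin_primitives(bboxes)
results = [render_tile(tile_xy, ids) for tile_xy, ids in sorted(bins.items())]
```

Because every tile reads only its own bin, the `render_tile` calls could be handed to separate cores or threads with no synchronization between them, and a tile's pixel and depth data can stay in on-die cache while it is rendered.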
Agenda
Bandwidth Implications
Server Style Interconnect
• Bi-directional ring: physically sits over cache
• Banked, logically unified ring cache
• Straightforward, elegant architecture
• Tremendous aggregate ring bandwidth
[Figure: four cores and four cache banks connected by a ring]
Server Style Interconnect: 3D Graphics
• 3D graphics tile buffers reside in ring cache
• Tiles don’t fit in a core’s L1$
• Tremendous BW to tiled buffers
• Tile buffer BW requirement => heavy ring traffic
[Figure: four cores and four cache banks connected by a ring]
Larrabee Ring
• Access ring on private L2$ miss
• Tile buffer accesses do not travel on ring
• However, memory coherence implies:
– L2$ miss requires snooping of all other L2$
– RFO transaction needs to traverse ring
[Figure: four core and private cache pairs on the ring]
Larrabee Ring with Central Tag Directory
• Central tag directory (td) reduces coherence traffic
– Copy of all L2$ tags
– Tracks MESI state
• However, hot spot at the tag directory
– Tag directory BW doesn’t scale with cores
[Figure: four core and cache pairs on the ring, with one central tag directory]
Larrabee Ring: Distributed Tag Directory
• Distributed tag directory (td)
– Address hash to determine specific tag directory
– Solves hot spot
• However, cache-cache transfers require global ordering
– Introduces serialization
[Figure: four core and cache pairs on the ring, each with its own tag directory (td)]
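The idea of hashing an address to pick a tag directory can be sketched as below. The slides do not disclose the actual hash, so the modulo-of-line-number function, the 64-byte line size, and the name `tag_directory_for` are assumptions for illustration only; the count of four directories follows the figure.

```python
from collections import Counter

LINE_BYTES = 64  # assumed cache line size, not stated on the slide
NUM_TDS = 4      # four tag directories, as drawn in the figure

def tag_directory_for(address, num_tds=NUM_TDS, line_bytes=LINE_BYTES):
    """Pick the tag directory responsible for an address.

    Hypothetical hash: the cache line number modulo the directory count.
    All bytes of one line map to the same directory.
    """
    line = address // line_bytes
    return line % num_tds

# Consecutive cache lines spread evenly, so no single directory is a hot spot.
load = Counter(tag_directory_for(line * LINE_BYTES) for line in range(1024))
```

With this kind of mapping each directory sees roughly an equal share of lookups, which is how distribution removes the hot spot of a single central directory.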
Larrabee Ring: Enhanced Coherency
• Change from MESI to MOESI
– Cache-cache ownership transfer on die
• However, contended locks still hop around ring
[Figure: four core and cache pairs on the ring, each with a tag directory (td) tracking MOESI state]
Larrabee Ring: Future Enhancement
• Lock cache holds lines containing contended locks
• Locked inc/dec/cmp use ALU in lock cache
• Big speedup (>10x) on contended locks
• Contended lock throughput @ 1 per clock
[Figure: four core and cache pairs on the ring, each with a MOESI tag directory (td) and a lock cache ("locks")]
Larrabee Ring: Scalability
• Large core count configurations need higher bandwidth
• Required flexibility for physical topology
• Xring: 3 separate rings with efficient ring crossing
[Figure: core and cache pairs with MOESI tag directories (td) on separate rings joined by ring crossings (xr)]
Bandwidth Results
• Binning mode
– Includes bin reads & writes
– Reads/writes each pixel once due to tiles in the L2 cache
• Immediate mode
– Assumes perfect HierZ cull
– Assumes 1MB each for compressed depth, color and stencil buffers
Summary
• Each Larrabee core is a complete IA core
– Context switching & pre-emptive multi-tasking
– Virtual memory and page swapping
– Fully coherent caches at all levels of the hierarchy
• Efficient inter-block communication
– Ring bus for full inter-processor communication
– Low latency high bandwidth L1 and L2 caches
– Fast synchronization between cores and caches
• Elegant use of fixed function logic
– No backend blender between cores and memory
– No rasterization logic between vertex and pixel stages
– Result: flexible load balancing & general functionality
Larrabee: a general throughput computing architecture
Questions?
Legal Disclaimer
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
• Intel may make changes to specifications and product descriptions at any time, without notice.
• All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
• Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
• Larrabee and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user
• Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
• Intel, Intel Inside and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
• *Other names and brands may be claimed as the property of others.
• Copyright © 2008 Intel Corporation.
Risk Factors
This presentation contains forward-looking statements that involve a number of risks and uncertainties. These statements do not reflect the potential impact of any mergers, acquisitions, divestitures, investments or other similar transactions that may be completed in the future. The information presented is accurate only as of today’s date and will not be updated. In addition to any factors discussed in the presentation, the important factors that could cause actual results to differ materially include the following: Demand could be different from Intel's expectations due to factors including changes in business and economic conditions, including conditions in the credit market that could affect consumer confidence; customer acceptance of Intel’s and competitors’ products; changes in customer order patterns, including order cancellations; and changes in the level of inventory at customers. Intel’s results could be affected by the timing of closing of acquisitions and divestitures. Intel operates in intensely competitive industries that are characterized by a high percentage of costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and difficult to forecast. Revenue and the gross margin percentage are affected by the timing of new Intel product introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors, including product offerings and introductions, marketing programs and pricing pressures and Intel’s response to such actions; Intel’s ability to respond quickly to technological developments and to incorporate new features into its products; and the availability of sufficient supply of components from suppliers to meet demand.
The gross margin percentage could vary significantly from expectations based on changes in revenue levels; product mix and pricing; capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; excess or obsolete inventory; manufacturing yields; changes in unit costs; impairments of long-lived assets, including manufacturing, assembly/test and intangible assets; and the timing and execution of the manufacturing ramp and associated costs, including start-up costs. Expenses, particularly certain marketing and compensation expenses, vary depending on the level of demand for Intel's products, the level of revenue and profits, and impairments of long-lived assets. Intel is in the midst of a structure and efficiency program that is resulting in several actions that could have an impact on expected expense levels and gross margin. Intel's results could be impacted by adverse economic, social, political and physical/infrastructure conditions in the countries in which Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Intel's results could be affected by adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust and other issues, such as the litigation and regulatory matters described in Intel's SEC reports. A detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, including the report on Form 10-Q for the quarter ended June 28, 2008.