Intel’s Larrabee

Intel’s Larrabee Vipin.p.nair S7-EC Roll no: 24 CEK

description

Larrabee is a new processor from Intel. It combines the features of both a CPU and a GPU.

Transcript of Intel’S Larrabee

Page 1: Intel’S Larrabee

Intel’s Larrabee Vipin.p.nair S7-EC Roll no: 24 CEK

Page 2: Intel’S Larrabee

Introduction

• Larrabee is a multicore general-purpose graphics processing unit (GPGPU) that combines the functions of a multicore CPU and a GPU.
• Larrabee is based on Intel’s x86 architecture.

Page 3: Intel’S Larrabee

Architectural convergence

Page 4: Intel’S Larrabee

Features

• Rasterization, depth testing and alpha blending run entirely in software; texture filtering keeps dedicated hardware
• Implements a binned renderer to increase parallelism
• Reduced memory bandwidth
• Parallel processing for image processing, physical simulation, medical and financial analysis
• GDDR5 RAM support
• Each core can execute 32 gigaflops with a 1 GHz clock, so a many-core chip reaches teraflops-level aggregate throughput
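The per-core figure above can be reproduced with simple arithmetic. The sketch below assumes the 32 gigaflops come from 16 vector lanes each completing one fused multiply-add (counted as two floating-point operations) per clock at 1 GHz. The core counts in the loop are hypothetical and only illustrate how the aggregate scales.

```python
# Peak single-precision throughput implied by the figures on the slides.
LANES_PER_CORE = 16   # 32-bit lanes in the vector unit (from the slides)
FLOPS_PER_LANE = 2    # assumption: one fused multiply-add = 2 operations
CLOCK_GHZ = 1.0       # clock rate quoted on the slide


def peak_gflops_per_core(lanes=LANES_PER_CORE,
                         flops_per_lane=FLOPS_PER_LANE,
                         clock_ghz=CLOCK_GHZ):
    """Peak gigaflops of one core: lanes x operations per lane x GHz."""
    return lanes * flops_per_lane * clock_ghz


def peak_gflops_chip(num_cores):
    """Peak gigaflops of a chip with num_cores identical cores."""
    return num_cores * peak_gflops_per_core()


if __name__ == "__main__":
    print("per core:", peak_gflops_per_core(), "GFLOPS")
    for cores in (8, 16, 32, 64):  # hypothetical core counts
        print(cores, "cores:", peak_gflops_chip(cores) / 1000.0, "TFLOPS")
```

Under these assumptions a 32-core part peaks at about 1 teraflops at 1 GHz; higher clocks or more cores raise the figure proportionally.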

Page 5: Intel’S Larrabee

Differences with CPU

• No out-of-order execution (cores execute instructions in order)
• Vector processing unit handles 16 single-precision floating-point numbers at a time
• Texture sampling units: trilinear/anisotropic filtering and texture decompression
• 1024-bit ring bus between cores
• Cache control instructions
• 4-way multithreading

Page 6: Intel’S Larrabee

Difference with GPU

• x86 instruction set with Larrabee-specific extensions
• Cache coherency across all its cores
• z-buffering, clipping, and blending without using graphics hardware

Page 7: Intel’S Larrabee

Larrabee – Block Diagram

Page 8: Intel’S Larrabee

Architecture

• Cores communicate on a 1024-bit wide ring bus
  – Fast access to memory, I/O interfaces and fixed-function blocks
  – Fast access for cache coherency
• L2 cache is partitioned among the cores
  – Provides high aggregate bandwidth
  – Allows data replication & sharing
• Optimized for highly parallel workloads using the vector processor

Page 9: Intel’S Larrabee

In-order CPU Core

• Separate scalar & vector units with separate registers
• Vector unit: 16 32-bit ops/clock
• In-order instruction execution
• Fast access from 64 KB L1 cache
• Direct connection to each core’s 256 KB subset of the L2 cache
• Prefetch instructions load L1 and L2 caches

Page 10: Intel’S Larrabee

Vector Unit

• Vector-complete instruction set
  – Scatter/gather for vector load/store
  – Mask registers select lanes to write, which allows data-parallel flow control
  – Masks also support data compaction
• Vector instructions support
  – Full speed when data is in L1 cache
  – Fused multiply-add (three arguments)
  – Int32, Float32 and Float64 data
  – Can read 8-bit unorm, 8-bit uint, 16-bit sint and 16-bit float data and convert it into 32-bit floats/integers
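The role of mask registers can be illustrated with a small simulation. This is a plain-Python sketch of the idea, not Larrabee code: the helper names (`compare_lt`, `masked_write`, `compact`) are hypothetical, and lists of 16 values stand in for vector registers. Both sides of a branch are computed for all lanes, and the mask decides which lanes are actually written.

```python
VECTOR_WIDTH = 16  # lanes per vector register, as on the slides


def compare_lt(a, b):
    """Per-lane comparison; the result plays the role of a mask register."""
    return [x < y for x, y in zip(a, b)]


def masked_write(dest, src, mask):
    """Write src into dest only in lanes where the mask is set."""
    return [s if m else d for d, s, m in zip(dest, src, mask)]


def compact(values, mask):
    """Data compaction: pack the lanes selected by the mask together."""
    return [v for v, m in zip(values, mask) if m]


# Scalar code being vectorized:  y = -x if x < 0 else 2 * x
x = [float(i - 8) for i in range(VECTOR_WIDTH)]    # -8.0 ... 7.0
mask = compare_lt(x, [0.0] * VECTOR_WIDTH)         # lanes where x < 0

y = [2.0 * v for v in x]                           # "else" side, all lanes
y = masked_write(y, [-v for v in x], mask)         # "then" side, masked lanes

negatives = compact(x, mask)                       # only the selected lanes

if __name__ == "__main__":
    print(y)
    print(negatives)
```

No lane ever takes a real branch, which is why masking lets a single instruction stream handle divergent per-element control flow.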

Page 11: Intel’S Larrabee

Fixed Function Logic

• Microcode takes the place of fixed-function logic for post-shader alpha blending, rasterization and interpolation.

• Includes fixed function texture filter logic

• Virtual memory for textures

Page 12: Intel’S Larrabee

Larrabee’s Binning Renderer

• Binning pipeline
  – Reduces synchronization
  – Front end processes vertex & geometry shading
  – Back end processes pixel shading, stencil testing, blending
  – Bin FIFO between them
• Multi-tasking by cores
  – Each orange box is a core
  – Cores run independently
  – Other cores can run other tasks, e.g. physics
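The front-end step of filling bins can be sketched as follows. This is a minimal illustration and not Larrabee's actual renderer: the tile size, the function name `bin_triangles`, and the use of a conservative bounding-box test are all assumptions. Each triangle index is appended to the bin of every screen tile its bounding box touches, so the back end can later process each tile independently.

```python
import math

TILE_SIZE = 64  # assumed tile edge in pixels


def bin_triangles(triangles, width, height, tile_size=TILE_SIZE):
    """Map (tile_x, tile_y) -> list of triangle indices overlapping that tile.

    Overlap is decided with the triangle's bounding box, which is
    conservative: a tile may receive a triangle that only comes close to it.
    """
    tiles_x = (width + tile_size - 1) // tile_size
    tiles_y = (height + tile_size - 1) // tile_size
    bins = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}

    for index, tri in enumerate(triangles):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # Clamp the bounding box to the screen.
        min_x = max(math.floor(min(xs)), 0)
        max_x = min(math.floor(max(xs)), width - 1)
        min_y = max(math.floor(min(ys)), 0)
        max_y = min(math.floor(max(ys)), height - 1)
        if min_x > max_x or min_y > max_y:
            continue  # entirely off screen
        for ty in range(min_y // tile_size, max_y // tile_size + 1):
            for tx in range(min_x // tile_size, max_x // tile_size + 1):
                bins[(tx, ty)].append(index)
    return bins


triangles = [
    [(10, 10), (50, 10), (10, 50)],        # inside tile (0, 0)
    [(60, 60), (70, 60), (60, 70)],        # straddles all four tiles
    [(100, 100), (120, 100), (100, 120)],  # inside tile (1, 1)
    [(-20, -20), (-5, -20), (-20, -5)],    # off screen
]
bins = bin_triangles(triangles, width=128, height=128)

if __name__ == "__main__":
    for tile, members in sorted(bins.items()):
        print(tile, members)
```

Because every tile owns its own list, back-end work on different tiles needs no synchronization with each other, which is the property the slide highlights.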

Page 13: Intel’S Larrabee

Back-end Rendering a Tile

• Orange boxes represent work on separate threads
• Three work threads do Z, pixel shader, and blending
• Setup thread reads from bins and does pre-processing
• Combines task-parallel, data-parallel, and sequential work
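The three back-end stages (Z, pixel shader, blending) can be sketched for a single tile in software. The version below runs sequentially on one thread, whereas the slide describes separate work threads; the fragment format, the pass-through `shade` function, and the "source over" blend are assumptions made for illustration.

```python
def shade(rgba):
    """Placeholder pixel shader: passes the fragment colour through."""
    return rgba


def render_tile(fragments, tile_w, tile_h, far=1.0):
    """Depth-test, shade and alpha-blend fragments into one tile.

    Each fragment is (x, y, z, (r, g, b, a)); smaller z is nearer.
    Returns (color, depth) as row-major nested lists.
    """
    depth = [[far] * tile_w for _ in range(tile_h)]
    color = [[(0.0, 0.0, 0.0)] * tile_w for _ in range(tile_h)]

    for x, y, z, rgba in fragments:
        # Stage 1: Z test, done in software instead of fixed-function logic.
        if z >= depth[y][x]:
            continue
        # Stage 2: pixel shader.
        r, g, b, a = shade(rgba)
        # Stage 3: "source over" alpha blend with the colour already stored.
        dr, dg, db = color[y][x]
        color[y][x] = (a * r + (1.0 - a) * dr,
                       a * g + (1.0 - a) * dg,
                       a * b + (1.0 - a) * db)
        depth[y][x] = z
    return color, depth


fragments = [
    (0, 0, 0.5, (1.0, 0.0, 0.0, 1.0)),  # opaque red
    (0, 0, 0.8, (0.0, 1.0, 0.0, 1.0)),  # green, behind red: rejected
    (0, 0, 0.2, (0.0, 0.0, 1.0, 0.5)),  # half-transparent blue in front
]
color, depth = render_tile(fragments, tile_w=2, tile_h=2)

if __name__ == "__main__":
    print(color[0][0], depth[0][0])
```

Keeping the whole tile's colour and depth in small local arrays mirrors the idea of working on a tile that fits in cache before writing it out to memory.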

Page 14: Intel’S Larrabee

Pipeline can be changed

• Parts can move between front end & back end
  – Vertex shading, tessellation, rasterization, etc.
  – Allows balancing computation vs. bandwidth
• New features
  – Transparency, shadowing, ray tracing, etc.
  – Each of these needs irregular data structures
  – Also helps to be able to “repack” the data

Page 15: Intel’S Larrabee

Transparency

Transparency with & without pre-resolve effects

Page 16: Intel’S Larrabee

Examples of using Tasks

• Applications
  – Scene traversal and culling
  – Procedural geometry synthesis
  – Physics contact group solve
  – Data parallel strand groups
  – Distribute across threads/cores using task system
  – Exploit core resources with SIMD
• Larrabee can submit work to itself!
  – Tasks can spawn other tasks
  – Exposed in Larrabee Native programming interface (C/C++ compiler)
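The idea that tasks can spawn other tasks can be sketched with a tiny work queue, using scene traversal and culling as the workload. This is an illustrative, single-threaded Python model and not the Larrabee Native interface: `TaskQueue`, `spawn`, `run`, and the scene layout are hypothetical names. A real task system would hand the queued items to many threads and cores.

```python
from collections import deque


class TaskQueue:
    """Minimal FIFO task system; running tasks may enqueue more tasks."""

    def __init__(self):
        self._pending = deque()

    def spawn(self, fn, *args):
        """Queue fn(queue, *args) for later execution."""
        self._pending.append((fn, args))

    def run(self):
        """Run until no tasks remain; return how many tasks executed."""
        executed = 0
        while self._pending:
            fn, args = self._pending.popleft()
            fn(self, *args)
            executed += 1
        return executed


def traverse(queue, node, visible):
    """Scene traversal with culling; each visible child becomes a new task."""
    name, in_view, children = node
    if not in_view:
        return  # culled together with its whole subtree
    visible.append(name)
    for child in children:
        queue.spawn(traverse, child, visible)


# Hypothetical scene: (name, in_view, children)
scene = ("root", True, [
    ("room", True, [("chair", True, []), ("lamp", False, [])]),
    ("garden", False, [("tree", True, [])]),
])

visible = []
queue = TaskQueue()
queue.spawn(traverse, scene, visible)
executed = queue.run()

if __name__ == "__main__":
    print(visible, executed)
```

The amount of work is discovered while running, which is why spawning from inside a task suits irregular workloads such as traversal better than a fixed up-front partition.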

Page 17: Intel’S Larrabee

Application scaling studies

Page 18: Intel’S Larrabee

Scalability Studies

• Based on memory Bandwidth & texture filtering speed

Page 19: Intel’S Larrabee

Performance Breakdowns

Page 20: Intel’S Larrabee

Binning & Bandwidth Studies

Bandwidth

• Immediate-mode rendering uses more bandwidth than binning:
  – 2.4 to 7 times more for F.E.A.R
  – 1.5 to 2.6 times more for Gears of War
  – 1.6 to 1.8 times more for Half Life 2 Episode 2

Page 21: Intel’S Larrabee

Overall performance

Page 22: Intel’S Larrabee

Conclusion

The Larrabee architecture opens a rich set of opportunities for both graphics rendering and throughput computing, and is an appropriate platform for the convergence of GPU and CPU.

Page 23: Intel’S Larrabee

Reference

• IEEE Digital Library: “Larrabee: A Many-Core x86 Architecture for Visual Computing”, Larry Seiler, Doug Carmean, Toni Juan (Intel Corporation), Jeremy Sugerman & Peter Hanrahan (Stanford University)
• IEEE Spectrum, January 2008
• ACM Transactions on Graphics, Article 18
• www.intel.com
• www.wikipedia.com

Page 24: Intel’S Larrabee

Questions

Page 25: Intel’S Larrabee

Thank You