Realizing a Power Efficient, Easy to Program Manycore: The...

38
Realizing a Power Efficient, Easy to Program Manycore: The Tile Processor Anant Agarwal Tilera, MIT

Transcript of Realizing a Power Efficient, Easy to Program Manycore: The...

Page 1: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

Realizing a Power Efficient, Easy to Program Manycore:

The Tile ProcessorAnant Agarwal

Tilera, MIT

Page 2: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

2 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Multicore is Everywhere

DesktopTVs

Cellphones Telepresence

Datacenters/Clouds

NetworkSwitchesNetworkSwitches

LaptopAutomobiles

Page 3: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

3 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Parallel Computing No Longer Province of Rocket Scientists

The computing world is ready for radical change

Time

#Cores

2007

1

2006

2

4

32

2014

Quadcore

2005

64cores

Dualcore

1000 cores

Intel

Sun IBM Cell

nCores

8-24cores

Larrabee

Page 4: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

4 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Tilera Announced the TILE-Gx100 in Oct `09

100 general-purpose coresRuns SMP LinuxStandard programming1.25GHz – 1.5GHzFull 64-bit processors32 MBytes total cache546 Gbps memory BW200 Tbps iMesh BW

80-120 Gbps packet I/O80 Gbps PCIe I/O

Wire-speed packet engine– 120Mpps

MiCA engines:– 40 Gbps crypto– 20 Gbps compressMemory ControllerMemory Controller Memory ControllerMemory Controller

Memory ControllerMemory Controller Memory ControllerMemory Controller

mPI

PEm

PIPE

MiscI/O

MiscI/O

MiCAMiCA

MiCAMiCA

3 PCIeInterfaces3 PCIe

Interfaces

NetworkI/O

Eight10Gb-or-

32 1Gb-or-

Two 40Gb

NetworkI/O

Eight10Gb-or-

32 1Gb-or-

Two 40Gb

FlexibleI/O

FlexibleI/O

Page 5: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

5 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

But first, let’s rewind to 1996

Page 6: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

6 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Research Vision to Commercial Product

2002

2007

100B transistors

2018

1B Transistors

in 2007

1996

The opportunity

CPUMem

1997

A blankslate

MIT Raw 16 cores Tile Processor

64 cores

Memory Controller (DDR3)

Memory Controller (DDR3) Memory Controller (DDR3)

Memory Controller (DDR3)

Memory Controller (DDR3)

Memory Controller (DDR3) Memory Controller (DDR3)

Memory Controller (DDR3)

mPI

PEm

PIPE

FlexibleI/O

FlexibleI/O

UART x2, USB x2,JTAG, I2C, SPI

UART x2, USB x2,JTAG, I2C, SPI

MiCAMiCA

MiCAMiCA

Ser

Des

Ser

Des

PCIe 2.08-lane

PCIe 2.08-lane

Ser

Des

Ser

Des

PCIe 2.04-lane

PCIe 2.04-lane

Ser

Des

Ser

Des

PCIe 2.08-lane

PCIe 2.08-lane

Inte

rlake

nIn

terla

ken

Inte

rlake

nIn

terla

ken

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

TileGx100100 cores

The future?

2010

Page 7: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

7 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

The Opportunity

20MIPS cpuin 1987

1996…

Few thousand gates

Page 8: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

8 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

The Opportunity

The billion transistor chip of 2007

Page 9: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

9 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

How to Fritter Away Opportunity

100 ported RegFil and RR

Caches

Control

More resolution buffers, control

Does not scale

Burns a lot of power

Page 10: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

10 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Key Multicore Challenges: The 3 P’s

Performance challenge– How to scale from 1 to 1000 cores -- the number of cores is

the new MegaHertz

Power efficiency challenge– Performance per watt is the new metric – power efficiency

trumps instruction-set (ISA) compatibility

Programming challenge– How to distinguish between a processor and a paper weight

Page 11: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

11 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Our Early Raw Proposal

CPUMem

Got parallelism?

Page 12: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

12 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

ASICs have high performance and low power• Custom-routed, short wires• Lots of ALUs, registers, memories – huge on-chip parallelism

memmem

mem

mem

mem

Take Inspiration from ASICs

But how to build a programmable chip?

Page 13: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

13 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Replace Long Wires with Routed Interconnect

Ctrl

[IEEE Computer ’97]

Page 14: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

14 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

16-Way ALU Clump Distributed ALUs

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

Bypass Net

RF

Page 15: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

15 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Distributed ALUs, Routed Bypass Network

ALU

ALU

ALU ALU

ALU

ALU

R

Scalar Operand Network (SON) [TPDS 2005]

Page 16: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

16 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

From a Large Centralized Cache…

Page 17: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

17 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

…to a Distributed Shared Cache

ALU

ALU ALU ALU

ALU

ALU

R

$

[ISCA 1999]

Page 18: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

18 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Distributed Everything + Routed InterconnectTiled Multicore

ALU

ALU

ALU ALU

ALU

ALU

R

$

Each tile is a processor, so programmable

Page 19: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

19 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

On-Chip Interconnect Routes Messages

ALU

ALU

ALU ALU

ALU

ALU

R

$

head

erpa

yload

For distributed cache access, off-chip misses, I/O, user-level messages

Page 20: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

20 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Tiled Multicore Captures ASIC Benefits and is ProgrammableScales to large numbers of coresModular – design and verify 1 tilePower efficient– Short wires plus locality opts –

CV2f– Chandrakasan effect, more cores at

lower freq and voltage – CV2f

ProcessorCore

= TileCore + Switch

S Current Bus Architecture

Page 21: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

21 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

“Raw” Die Photo

16 tiles, 425MHz, 18 Watts (vpenta)IBM 0.18 micron process

[ISCA 2004]

…2002

Page 22: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

22 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Research Vision to Commercial Product

2002

2007

100B transistors

100B transistors

2018

1B Transistors

in 2007

1996

The opportunity

CPUMem

1997

A blankslate

MIT Raw 16 cores Tile Processor

64 cores

Memory Controller (DDR3)

Memory Controller (DDR3) Memory Controller (DDR3)

Memory Controller (DDR3)

Memory Controller (DDR3)

Memory Controller (DDR3) Memory Controller (DDR3)

Memory Controller (DDR3)

mPI

PEm

PIPE

FlexibleI/O

FlexibleI/O

UART x2, USB x2,JTAG, I2C, SPI

UART x2, USB x2,JTAG, I2C, SPI

MiCAMiCA

MiCAMiCA

Ser

Des

Ser

Des

PCIe 2.08-lane

PCIe 2.08-lane

Ser

Des

Ser

Des

PCIe 2.04-lane

PCIe 2.04-lane

Ser

Des

Ser

Des

PCIe 2.08-lane

PCIe 2.08-lane

Inte

rlake

nIn

terla

ken

Inte

rlake

nIn

terla

ken

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

TILE-Gx100100 cores

The future?

2010

Page 23: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

23 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Cloud Demands higher performance density and low power

Wireless infrastructure marketDemands higher throughput, more services and low power

Networking market Demands higher performance, better security and services

Digital multimedia market Newer algorithms (e.g., H.264) for higher compression demand more performance

Why Do We Care?Markets Demanding More Performance at Lower Power

Cable & BroadcastCable & BroadcastVideo ConferencingVideo Conferencing

SwitchesSwitchesSecurity AppliancesSecurity Appliances

GGSNGGSN Base StationBase Station

Cloud server rackCloud server rack

Page 24: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

24 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

The Tile Processor is a System-on-a-Chip

Yes w/ DDCCache coherency

64Flexible I/O pins

27-34Typical power -9 device (W)

DDR2 bandwidth (peak Gbps)

PCIe interfaces

Ethernet bandwidth

I/O and Memory

Typical power -7 device (W)Typical power -5 device (W)

PowerClock speed (MHz)On chip bandwidth (Terabit/s)

Operations (16/32-bit BOPS)

On-chip cache (MB)# of cores

Performance

5.6

700, 866

TILEPro64

2 XAUI, 2GbE

N/A

200

2 x 4-lanes

17-21

64

221/166

38

TILEPro64 Block DiagramSe

rDes

SerD

es

GbE 0GbE 0

GbE 1GbE 1

FlexibleI/O

UARTJTAG

SPI, I2C

FlexibleI/O

UARTJTAG

SPI, I2C

DDR2 Controller 3DDR2 Controller 3

DDR2 Controller 0DDR2 Controller 0

DDR2 Controller 2DDR2 Controller 2

DDR2 Controller 1DDR2 Controller 1

PCIe 0PCIe 0

SerD

esSe

rDes

PCIe 1PCIe 1

XAUI 0XAUI 0

SerD

esSe

rDes

XAUI 1XAUI 1

SerD

esSe

rDes

TILEPro36 also available Tilera development systems

Page 25: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

25 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Remember the 3 P’s?

Performance challenge– How to scale from 1 to 1000 cores -- the number of cores is

the new MegaHertz

Power efficiency challenge– Performance per watt is the new metric – power efficiency

trumps instruction-set (ISA) compatibility

Programming challenge– How to distinguish between a processor and a toaster

Page 26: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

26 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Key Innovations

1. General purpose cores – Standard OS and

programming

3. Multicore Dynamic Distributed Cache– How to achieve cache coherence and run

standard software

2. iMesh™ Network – How to scale and be energy efficient

4. Multicore Hardwall™– How to virtualize multicore

5. Multicore Development Environment

– How to program

OS1/APP1

OS1/APP3OS2/APP2

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Page 27: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

27 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

1 - Full-Featured General Cores Enable Throughput Oriented Computing and Standard Languages

Processor– Each core is a complete computer– 3-way VLIW CPU– Designed for low power – 200mW per core– SIMD instructions: 32, 16, and 8-bit ops– Instructions for video (e.g., SAD) and networking– Protection and interrupts– Single core performance roughly the same as a

modern MIPS or ARM coreMemory

– L1 cache: 8KB I, 8KB D, 1 cycle latency– L2 cache: 64KB unified, 7 cycle latency– 32-bit virtual address space per process– 64-bit physical address space– Instruction and data TLBs– Cache integrated 2D DMA engine

Switch in each tile

Runs SMP LinuxRuns off-the-shelf open-source C/C++ programs

Core

Cache16K L1-I

64K L2

I-TLB

D-TLB2D

DMA8K L1-D

Register File

Three Execution Pipelines

TerabitSwitch

Page 28: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

28 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

MDN TDNUDN IDN

SWITCH

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

Core

Cache + MMU

TerabitSwitch

2- iMesh On-Chip Network

Distributed resources– 2D Mesh peer-to-peer tile networks– 5 independent networks – Each with 32-bit channels, full duplex– Tile-to-memory, tile-to-tile, and tile-to-IO data transfer– Packet switched, wormhole routed, point-to-point– Near-neighbour flow control, dimension-ordered routing

Performance and energy efficiency– ASIC-like one cycle hop latency– 2 Tbps bisection bandwidth– 32 Tbps interconnect bandwidth– Low power

6 independent networks– One static, four dynamic – IDN – System and I/O– MDN – Cache misses, DMA, other memory– TDN, VDN – Tile to tile memory access and coherence– UDN, STN – User-level streaming and scalar transfer

Achieves scalability and power efficiency

VDN STN

Page 29: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

29 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Meshes are Power Efficient

[Konstantakopoulos ’07]

More than 80% power savings over buses

Page 30: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

30 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

3 – Coherent On-Chip Cache System Enables Standard Programming

Distributed cache– Each tile has local L1 and L2 cache– Aggregate of L2 serves as a globally shared L3

Dynamic Distributed Cache (DDC™)– Hardware based cache coherence– Hardware tracks sharers, invalidates stale

copies– One or multiple coherency domains– Dedicated network to manage coherency

Coherent direct-to-cache I/O– Header/packet delivered directly to tile caches– Cache coherent delivery– Significant DRAM bandwidth and latency

reductionSe

rDes

SerD

es

GbE 0GbE 0

GbE 1GbE 1

FlexibleI/O

UARTJTAG

SPI, I2C

FlexibleI/O

UARTJTAG

SPI, I2C

DDR2 Controller 3DDR2 Controller 3

DDR2 Controller 0DDR2 Controller 0

DDR2 Controller 2DDR2 Controller 2

DDR2 Controller 1DDR2 Controller 1

PCIe 0PCIe 0

SerD

esSe

rDes

PCIe 1PCIe 1

XAUI 0XAUI 0

SerD

esSe

rDes

XAUI 1XAUI 1

SerD

esSe

rDes

CompletelyCoherentSystem

tile a tile h

tile b

Page 31: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

31 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

4 – Multicore Hardwall™ Technology for Virtualization and Protection in Cloud Environments

The virtualization and protection challengeMulticores need to run multiple OS’s and applications in cloud environmentsOS’s must be protected from each otherI/O and other shared resources must be virtualized

Multicore Hardwall technologyProtects applications and OS by prohibiting unwanted interactionsConfigurable to include one or many tiles in a protected areaSupported by Tilera hypervisor running on all the tiles

OS1/APP1

OS1/APP3

OS2/APP2

Page 32: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

32 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Configurable Fine Grain Protection

01 2

3

TLB Access

01 2

3

DMA Engine

01 2

3

“User” Network

01 2

3

I/O Network

Key:0 – User Code1 – OS2 – Hypervisor or PAL3 – Hypervisor Debugger

Key:0 – User Code1 – OS2 – Hypervisor or PAL3 – Hypervisor Debugger

Full Stack Linux with Hypervisor

Page 33: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

33 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

5 – Standard Tools and Software Stack

Multicore Development Environment

Standard application stack

Standard programmingSMP Linux 2.6.26ANSI C/C++pthreads

Integrated toolsSGI compilerStandard gdb gprofEclipse IDE

Innovative toolsMulticore debugMulticore profile

Standards-based tools

TILE hardware

Tile Tile Tile Tile …

Hypervisor

Operating System

Applications

Virtualization and high speed I/O drivers

Linux kernel and kernel drivers

Linux libraries

Applications libraries

Application layerOpen source appsStandard C/C++ libs

Operating System layer64-way SMP LinuxZero Overhead Linux Bare metal environment

Hypervisor layerVirtualizes hardwareI/O devices driversLoad balancer

Page 34: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

34 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Multiple Software Environments to Meet Diverse Needs of Embedded and Cloud Systems

Standard SMP Linux– Standard Linux environment with processes and threads– Ideal for applications and control plane code requiring operating system services– Open source applications work out of the box

Standard SMP Linux with Zero Overhead Linux (ZOL)– Zero Linux overhead (Eliminates OS interrupts, timer ticks, etc..) – Transparent to programmer - no software change required – For high performance data-plane applications not requiring OS services

Bare metal environment– Full control of the hardware on up to 64 tiles– No operating system or hypervisor layers– For embedded applications requiring fine grain control of memory, and IO

Hybrid environment – Using 2 or all three of the above models – Each environment can be run on one or more Tiles– Ideal for customers aggregating data plane and control plane code on one chip

Page 35: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

35 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Parallel Programming using Standard Models

64-way SMP Linux– Single system image across all tiles

Standard pthreads API– pthread_create()– Shared memory model by default– Synchronize using mutexes and locks

Standard Linux processes – fork(), exec()– Separate address space– Share memory: mmap(), mspaces– Communicate: Pipes / local sockets

Gentle slope programming optimizations using Linux extensions

– Control memory location and distribution– Control thread scheduling and location

New models and further optimizations using TMC library (Tile Multicore Components)

DPI

DPI

DPI

LoadBalancer Histogram

Packet stream

Deep packet inspection

PCIe 1MACPHY

PCIe 1MACPHY

PCIe 0MACPHY

PCIe 0MACPHY

SerdesSerdes

SerdesSerdes

Flexible IOFlexible IO

GbE 0GbE 0

GbE 1GbE 1Flexible IOFlexible IO

UART, HPIJTAG, I2C,

SPI

UART, HPIJTAG, I2C,

SPI

DDR2 Memory Controller 3DDR2 Memory Controller 3

DDR2 Memory Controller 0DDR2 Memory Controller 0

DDR2 Memory Controller 2DDR2 Memory Controller 2

DDR2 Memory Controller 1DDR2 Memory Controller 1

XAUIMAC

PHY 0

XAUIMAC

PHY 0SerdesSerdes

XAUIMAC

PHY 1

XAUIMAC

PHY 1SerdesSerdes

DPI

DPI

DPI

HST LB

Page 36: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

36 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Scaling Up: TileGx100 Announced in Oct `09

100 general-purpose coresRuns SMP LinuxStandard programming1.25GHz – 1.5GHzFull 64-bit processors32 MBytes total cache546 Gbps memory BW200 Tbps iMesh BW

80-120 Gbps packet I/O80 Gbps PCIe I/O

Wire-speed packet engine– 120Mpps

MiCA engines:– 40 Gbps crypto– 20 Gbps compressMemory ControllerMemory Controller Memory ControllerMemory Controller

Memory ControllerMemory Controller Memory ControllerMemory Controller

mPI

PEm

PIPE

MiscI/O

MiscI/O

MiCAMiCA

MiCAMiCA

3 PCIeInterfaces3 PCIe

Interfaces

NetworkI/O

Eight10Gb-or-

32 1Gb-or-

Two 40Gb

NetworkI/O

Eight10Gb-or-

32 1Gb-or-

Two 40Gb

FlexibleI/O

FlexibleI/O

Page 37: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

37 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Scaling Down: TILEGx16™ also Announced

16 Processor Cores1.0 &1.25 GHz speedsFull 64-bit processors5.2 MBytes total cache200 Gbps memory BW20 Tbps iMesh BW

24 Gbps total packet I/O – 2 ports 10GbE (XAUI)– 12 ports 1GbE (SGMII)

32 Gbps PCIe I/O

Wire-speed packet engine– 30Mpps

MiCA engine:– 10 Gbps crypto– 5 Gbps compress &

5 Gbps decompressMidrange 36 core part also announced

Memory Controller (DDR3)Memory Controller (DDR3)

mPI

PEm

PIPE

10 GbEXAUI

10 GbEXAUI

SerD

esSe

rDes

4x GbESGMII

10 GbEXAUI

10 GbEXAUI

SerD

esSe

rDes

4x GbESGMII

SerD

esSe

rDes4x GbE

SGMII

FlexibleI/O

FlexibleI/O

UART x2, USB x2,JTAG, I2C, SPI

UART x2, USB x2,JTAG, I2C, SPI

MiCAMiCA

SerD

esSe

rDes PCIe 2.0

4-lanePCIe 2.04-lane

SerD

esSe

rDes PCIe 2.0

4-lanePCIe 2.04-lane

Memory Controller (DDR3)Memory Controller (DDR3)

SerD

esSe

rDes PCIe 2.0

4-lanePCIe 2.04-lane

Page 38: Realizing a Power Efficient, Easy to Program Manycore: The ...web.stanford.edu/class/ee380/Abstracts/100203-slides.pdf · Runs SMP Linux Standard programming 1.25GHz – 1.5GHz Full

38 © 2009 Copyright Tilera Corporation. All Rights Reserved.© 2009 Copyright Tilera Corporation. All Rights Reserved.

Vision for the Future

The ‘core’ is the logic gate of the 21st century

pm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

s

Corollary of Moore’s Law: The number of cores will double every 18 months