High Performance Simulations of Electrochemical Models on the Cell Broadband Engine

James Geraci, MIT Lincoln Laboratory

HPEC Workshop, September 20, 2007

This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.



Outline

• Introduction

– Introduction

– Modeling battery physics

– LU decomposition

– CELL Broadband Engine

• Out of Core Algorithm

• In Core Algorithm

• Summary

Introduction: Real Time Battery State of Health Estimation

• The way in which a battery ages is a strong function of battery geometry.

• The rate of self discharge is also a function of battery geometry.

[Figure: lead acid battery]

Inside a Lead Acid Battery

[Figure panels: face view of lead dioxide plate; face view of lead plate]

• Lead acid batteries are made up of stacks of lead dioxide and lead electrode pairs.

Finite volume description of battery

[Figure: edge view of the lead dioxide plate, with finite volumes spanning the lead dioxide, sulfuric acid, and lead regions]

• The center volume's physics can be influenced by the physics of the volumes to the right, the left, above, and below.

• This 5 point stencil gives rise to a banded matrix.
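To make the link between the stencil and the band structure concrete, here is a minimal sketch. It assumes the finite volumes form an nx-by-ny grid numbered row by row; the function names and the grid size are illustrative and do not come from the slides.

```python
def stencil_pattern(nx, ny):
    """Return the (row, col) positions of non-zeros for a 5 point stencil
    on an nx-by-ny grid of finite volumes numbered row by row."""
    nonzeros = set()
    for y in range(ny):
        for x in range(nx):
            k = y * nx + x                    # index of the center volume
            nonzeros.add((k, k))              # the volume itself
            if x > 0:
                nonzeros.add((k, k - 1))      # left neighbor
            if x < nx - 1:
                nonzeros.add((k, k + 1))      # right neighbor
            if y > 0:
                nonzeros.add((k, k - nx))     # neighbor in the previous grid row
            if y < ny - 1:
                nonzeros.add((k, k + nx))     # neighbor in the next grid row
    return nonzeros


def half_bandwidth(nonzeros):
    """Largest distance of any non-zero entry from the diagonal."""
    return max(abs(r - c) for r, c in nonzeros)


pattern = stencil_pattern(nx=4, ny=3)
print(half_bandwidth(pattern))
```

Left and right neighbors land next to the diagonal, while the neighbors in adjacent grid rows land nx entries away, so the band width is set by the grid width under this numbering.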

Matlab spy plot of Banded Matrix

• Non-zero entries shown in blue

• Inner band around the diagonal

• Distant narrow outer band

• For the physics of a lead acid battery, the matrix is not symmetric (shown)

• For the physics of a lead acid battery, the matrix is poorly conditioned (condition number 10^8 to 10^12), so double precision is needed and GPGPU is ruled out
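The need for double precision can be illustrated with a common rule of thumb: the relative error of a linear solve is roughly the condition number times the unit roundoff. The sketch below applies that rule to the condition numbers quoted above. It is an estimate under that assumption, not a measurement from the talk, and the function name is hypothetical.

```python
import math

U_SINGLE = 2.0 ** -24   # unit roundoff, IEEE 754 single precision
U_DOUBLE = 2.0 ** -53   # unit roundoff, IEEE 754 double precision


def digits_expected(cond, unit_roundoff):
    """Rule-of-thumb count of correct decimal digits in a solve,
    assuming relative error is about cond * unit_roundoff."""
    return max(0.0, -math.log10(cond * unit_roundoff))


for cond in (1e8, 1e12):
    print(cond, digits_expected(cond, U_SINGLE), digits_expected(cond, U_DOUBLE))
```

Under this estimate single precision keeps no reliable digits across the quoted range, while double precision keeps roughly eight digits at 10^8 and roughly four at 10^12.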


LU Decomposition

• LU decomposition decomposes a matrix J into a lower triangular matrix L and an upper triangular matrix U

• L & U can be used to solve a system of linear equations Jx = r by forward elimination and back substitution

– Essentially Gaussian Elimination

• Often used on poorly conditioned systems where ‘iterative solvers’ can’t be used.

• Difficult to parallelize for small systems because of the fine grain nature of the parallelism involved.

• Banded LU is a special case of LU

– The matrix J has a special ‘banded’ data pattern.
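The factor-then-solve sequence described above can be sketched as follows. This is a plain dense Doolittle factorization without pivoting, written for illustration only; the function names and the 3x3 example system are assumptions, not code from the talk.

```python
def lu_decompose(J):
    """Doolittle LU without pivoting: J = L * U, with ones on L's diagonal."""
    n = len(J)
    L = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    U = [row[:] for row in J]
    for i in range(n):
        for r in range(i + 1, n):
            L[r][i] = U[r][i] / U[i][i]          # multiplier for row r
            for c in range(i, n):
                U[r][c] -= L[r][i] * U[i][c]     # eliminate below the pivot
    return L, U


def lu_solve(L, U, r):
    """Solve J x = r given J = L * U."""
    n = len(r)
    y = [0.0] * n
    for i in range(n):                           # forward elimination: L y = r
        y[i] = r[i] - sum(L[i][k] * y[k] for k in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):                 # back substitution: U x = y
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x


# A small non-symmetric banded system whose exact solution is [1, 2, 3].
J = [[4.0, 1.0, 0.0],
     [2.0, 5.0, 1.0],
     [0.0, 3.0, 6.0]]
r = [6.0, 15.0, 24.0]
L, U = lu_decompose(J)
x = lu_solve(L, U, r)
print(x)
```

Once L and U are known, each new right-hand side r costs only the two triangular sweeps.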


Cell Broadband Engine

•Cell Broadband Engine is a new heterogeneous multicore processor that features large internal and off chip bandwidth.

[Block diagram: eight SPEs (each labeled SPU, SXU, LS, MFC), the PPE (labeled PPU, PXU, L1, L2), and the MIC, all attached to the EIB]

• Diagram labels: EIB 96 B/cycle (3.2 Gcycles/sec * 96 B/cycle = 286.1 GB/sec); element links 16 B/cycle (47.7 GB/sec); memory interface 25 GB/sec

Bandwidth:

• 50 GB/sec/SPU

• 25 GB/sec off chip

SPE Performance:

• 14 Gflops double
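The bandwidth labels follow from bytes per cycle times clock rate. The sketch below redoes that arithmetic. It assumes the slide's "GB" are binary gigabytes (2^30 bytes), an assumption that happens to reproduce the quoted figures; the function name is hypothetical.

```python
CLOCK_HZ = 3.2e9        # 3.2 Gcycles/sec, from the slide
GIB = 2 ** 30           # assumption: the slide's "GB" are binary gigabytes


def bandwidth_gib_per_sec(bytes_per_cycle, clock_hz=CLOCK_HZ):
    """Peak bandwidth of a link that moves bytes_per_cycle every clock cycle."""
    return bytes_per_cycle * clock_hz / GIB


eib = bandwidth_gib_per_sec(96)     # the EIB as a whole
link = bandwidth_gib_per_sec(16)    # a single 16 B/cycle link
print(round(eib, 1), round(link, 1))
```

With these inputs the two results round to the 286.1 and 47.7 shown on the diagram.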

Outline

• Introduction

• Out of Core Algorithm

– Banded LU

– Performance

– Latency

– Synchronization

• In Core Algorithm

• Summary

Banded LU: Out of Core Algorithm Explained

[Figure: matrix J at step i, with the pivot J(i,i), the pivot row J(i, i:i+b/2), the column J(i+1:i+b/2, i), and the block J(i+1:i+b/2, i:i+b/2) marked; the active window is b/2 on each side]
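A minimal sketch of one way to express a banded LU step in code is given here. It works in place on a dense list-of-lists, touches only entries within h of the diagonal, and omits pivoting. The name h for the half-bandwidth, the function name, and the 4x4 example are assumptions for illustration, not the author's Cell implementation.

```python
def banded_lu_in_place(J, h):
    """Overwrite J with its LU factors (unit lower L below the diagonal,
    U on and above it), touching only entries within h of the diagonal.
    No pivoting. Returns the number of multiply-subtract updates done."""
    n = len(J)
    updates = 0
    for i in range(n):
        last = min(i + h, n - 1)                 # edge of the active window
        for r in range(i + 1, last + 1):
            J[r][i] /= J[i][i]                   # multiplier, stored where L lives
            for c in range(i + 1, last + 1):
                J[r][c] -= J[r][i] * J[i][c]     # update inside the window only
                updates += 1
    return updates


# Tridiagonal, non-symmetric example (half-bandwidth 1).
A = [[4.0, 1.0, 0.0, 0.0],
     [2.0, 5.0, 1.0, 0.0],
     [0.0, 3.0, 6.0, 1.0],
     [0.0, 0.0, 2.0, 7.0]]
F = [row[:] for row in A]
updates = banded_lu_in_place(F, h=1)
print(updates)
```

Each step updates at most an h-by-h window, so the work grows roughly as n * h^2 rather than n^3, and the same window is what has to be fetched and written back when the matrix lives in main memory.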

Banded LU: Out of Core Algorithm Analyzed

[Figure: matrix J at step i of N, with the b/2 by b/2 active window marked]

[Table, per step i: Compute: (b/2)^2 multiplies and (b/2)^2 subtracts, 2N(b/2)^2 in total over N steps. Memory Accesses: gets and puts each step, with the total number of doubles moved. Synchronizations: N]

• Compute/Memory Ratio = ½

• For a 16728x16728 matrix with band size of 420, almost 22 GB of data would have to be moved.
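The "almost 22 GB" figure can be sanity-checked with a few lines of arithmetic. The accounting below, b^2 doubles moved per step with b = 420, is an assumption chosen because it reproduces the slide's number when "GB" is read as 2^30 bytes; the transfer-time bound further assumes the 25 GB/sec off-chip rate is decimal.

```python
N = 16728                     # matrix dimension, from the slide
BAND = 420                    # band size, from the slide
BYTES_PER_DOUBLE = 8
GIB = 2 ** 30                 # assumption: "GB" means binary gigabytes
PEAK_BYTES_PER_SEC = 25e9     # assumption: off-chip 25 GB/sec taken as decimal

doubles_moved = N * BAND * BAND                 # assumed: BAND^2 doubles per step
bytes_moved = doubles_moved * BYTES_PER_DOUBLE
gib_moved = bytes_moved / GIB
min_transfer_seconds = bytes_moved / PEAK_BYTES_PER_SEC

print(round(gib_moved, 2), round(min_transfer_seconds, 2))
```

Under these assumptions the data volume comes out just under 22, and moving it at the peak off-chip rate would take a little under one second, regardless of how many SPEs share the work.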


Performance of Out of Core Algorithm

• The Out of Core Algorithm outperforms UMFpack on an Opteron 246 based workstation.

• No appreciable gain in performance past 4 SPEs

[Figure, two panels for matrix size 16728x16728 w/ band size of 420. Left: Gflops vs. number of SPEs (1 to 6), series Out of core and Linear speed up. Right: compute time in seconds vs. number of SPEs (1 to 6), series Out of core, Opteron 246: UMFpack, and Linear speed up]


Latency for DMA put

[Figure: DMA put latency vs. message size, with 16 bytes and 1024 bytes marked]

• Significant performance hit for memory accesses smaller than 16 bytes

• Bandwidth limited region starts at 8x128 bytes

SPE to main memory bandwidth

[Figure: achieved bandwidth vs. message size, with 16 bytes and 1024 bytes marked and the theoretical maximum bandwidth of 25 GB/sec shown]

• Theoretical maximum bandwidth can almost be achieved for larger message sizes

OCA Memory Access Size Dependence

• Out of Core performance is better when the memory access size is a multiple of 128 bytes
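One way to act on this observation is to pad each transfer up to the next multiple of 128 bytes. The helper below is a hypothetical illustration of that rounding for buffers of doubles; it is not taken from the talk, and whether padding pays off depends on the extra data it moves.

```python
DMA_ALIGN = 128          # bytes; the transfer granularity the slide favors
BYTES_PER_DOUBLE = 8


def padded_doubles(count, align=DMA_ALIGN):
    """Smallest number of doubles >= count whose byte size is a multiple of align."""
    nbytes = count * BYTES_PER_DOUBLE
    padded_bytes = -(-nbytes // align) * align   # ceiling division, then scale back
    return padded_bytes // BYTES_PER_DOUBLE


print(padded_doubles(420))
```

For example, a row of 420 doubles is 3360 bytes, which is not a multiple of 128; padding it to 432 doubles (3456 bytes, or 27 blocks of 128) costs under 3% extra data.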

PPE/SPE Synchronization: Mailboxes

PPE side (pseudocode):

    // barrier: wait until every SPE has reported in
    sum = 0;
    while (sum != NUM_SPE) {
        for (i = 0; i < NUM_SPE; i++) {
            if (NewMailBoxMessage) {   // SPE i has written its out mailbox
                readMailBox;
                sum++;                 // count this SPE as arrived
            }
        }
    }
    // Restart SPEs
    writeSPEinMboxes();

SPE side:

    // Notify PPE
    spu_writech(SPU_WrOutMbox, 1);
    // Wait for PPE
    spu_readch(SPU_RdInMbox);

[Diagram: the PPE and an SPE connected over the EIB by 16 B/cycle links, with the SPE's Out Mailbox and In Mailbox marked]

• Mailboxes are one common method of synchronization
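The notify, wait, restart pattern above can be tried out on an ordinary host with threads standing in for SPEs and queues standing in for mailboxes. This is only an analogy: it uses Python's standard library rather than the Cell SDK, it blocks on each queue instead of polling, and all names are hypothetical.

```python
import queue
import threading

NUM_SPE = 4
STEPS = 3


def spe_worker(spe_id, out_mbox, in_mbox, log):
    for step in range(STEPS):
        log.append((step, spe_id))   # stand-in for one step of real work
        out_mbox.put(1)              # notify PPE
        in_mbox.get()                # wait for PPE


def ppe_barrier(out_mboxes, in_mboxes):
    for mbox in out_mboxes:          # barrier: one message from every SPE
        mbox.get()
    for mbox in in_mboxes:           # restart SPEs
        mbox.put(1)


def run():
    log = []
    out_mboxes = [queue.Queue() for _ in range(NUM_SPE)]
    in_mboxes = [queue.Queue() for _ in range(NUM_SPE)]
    workers = [
        threading.Thread(target=spe_worker,
                         args=(i, out_mboxes[i], in_mboxes[i], log))
        for i in range(NUM_SPE)
    ]
    for w in workers:
        w.start()
    for _ in range(STEPS):
        ppe_barrier(out_mboxes, in_mboxes)
    for w in workers:
        w.join()
    return log


log = run()
print(len(log))
```

Because no worker is released until all have reported, every entry for one step is logged before any entry for the next, which is the ordering the banded LU iteration relies on.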

PPE/SPE Synchronization Mailboxes: Round Trip Times

• Mailboxes using IBM SDK 2.1 C-intrinsics: 6.24 µs

• Mailboxes using IBM SDK 2.1 C-intrinsics & pointers to MMIO registers: 3.65 µs

• Mailboxes by pointers alone: 0.35 µs (not reliable)

• Standard round trip latency, 16 byte message: 58.33 µs TCP (Gigabit), 8.08 µs Infiniband

• IBM SDK 2.1 C-intrinsics for mailboxes do not seem to have ideal performance.

Synchronization by hybrid of C-intrinsics and pointers to MMIO registers

• Synchronization with IBM SDK 2.1 C-intrinsics & pointers to MMIO registers yields fairly good performance for a moderate number of SPEs

[Figure: compute time in seconds vs. number of SPEs (1 to 6) for matrix size 16728x16728 w/ band size of 420; series Out of core and Linear speed up]

Synchronization exclusively by IBM SDK 2.1 mailbox C intrinsics

• Synchronization by IBM SDK 2.1 mailbox C intrinsics alone yields little gain for low SPE counts and a performance LOSS after only 4 SPEs!

[Figure: compute time in seconds vs. number of SPEs (2 to 6); series SPE synchronization by SDK C intrinsics]

Data from synchronization exclusively by pointers

[Figure: compute time in seconds vs. number of SPEs (2 to 6); legend reads "SPE synchronization by SDK C intrinsics"]

• Synchronization by pointers alone yields a nice speed up for all SPEs

• There seems to be a reliability issue with reading the mailbox status register via pointers

Outline

• Introduction

• Out of Core Algorithm

• In Core Algorithm

– Hide memory accesses

– Hide synchronizations

• Summary

Banded LU: In Core Algorithm

[Figure: matrix J at step i of N, with the active window marked]

[Table, per step i: Compute: (b/2)^2 multiplies and (b/2)^2 subtracts, 2N(b/2)^2 in total over N steps. Memory Accesses: a one-time start up access, then gets and puts each step, with the total number of doubles moved. Synchronizations: N]

• Compute/Memory Ratio grows in proportion to b

• For a 16728x16728 matrix with band size of 420, only 0.158 GB of data would have to be moved.
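The 0.158 GB figure can be checked the same way as the out of core estimate. The accounting below (a one-time load of b^2 + b doubles, then about 3b doubles per step) is an assumption that reproduces the slide's number when "GB" is read as 2^30 bytes; it should not be taken as the slide's exact formula.

```python
N = 16728                     # matrix dimension, from the slide
BAND = 420                    # band size, from the slide
BYTES_PER_DOUBLE = 8
GIB = 2 ** 30                 # assumption: "GB" means binary gigabytes

startup_doubles = BAND * BAND + BAND          # assumed one-time start up load
per_step_doubles = 3 * BAND                   # assumed traffic per step
in_core_doubles = startup_doubles + per_step_doubles * N
in_core_gib = in_core_doubles * BYTES_PER_DOUBLE / GIB

out_of_core_doubles = N * BAND * BAND         # accounting that matches "almost 22 GB"
reduction = out_of_core_doubles / in_core_doubles

print(round(in_core_gib, 3), round(reduction))
```

Under these assumptions, keeping the active window resident in the local store cuts main memory traffic by a factor of roughly 139, since per-step traffic grows with b instead of b^2.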

In Core Performance

• Achieves over 11% of theoretical performance

[Figure: in core performance, with the following reference levels marked]

• Max out of core: 0.583 Gflops

• Max in core: 1.23 Gflops

• Intel Xeon 5160 3.0 GHz: 0.377 Gflops

In Core Performance w/ linear speed up

• Performance improves as band size increases

IBM QS20 performance

• IBM QS20 max in core: 3.15 Gflops

• Achieves over 11% of theoretical performance


ICA Memory Access Size Dependence

•In core performance does not show a large dependence on memory access size


Summary / Future Work

• Parallel LU decomposition can benefit from the high bandwidth of the CBE

– Benefit depends greatly on synchronization scheme

• inCore LU offers performance advantages over out of core LU

– Limits on size of matrix bandwidth for inCore LU.

• Partial pivoting
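For reference, partial pivoting means swapping in the row with the largest pivot candidate before each elimination step. The sketch below shows the idea on a small dense matrix. It is a generic illustration with hypothetical names, not the planned Cell implementation; in a banded matrix such row swaps can widen the upper band of U.

```python
def lu_partial_pivot(J):
    """LU with row swaps. Returns (perm, F): F holds unit lower L below the
    diagonal and U on and above it, and L * U equals J with rows in perm order."""
    n = len(J)
    F = [row[:] for row in J]
    perm = list(range(n))
    for i in range(n):
        # pick the row at or below i with the largest entry in column i
        p = max(range(i, n), key=lambda r: abs(F[r][i]))
        if p != i:
            F[i], F[p] = F[p], F[i]
            perm[i], perm[p] = perm[p], perm[i]
        for r in range(i + 1, n):
            F[r][i] /= F[i][i]                   # multiplier
            for c in range(i + 1, n):
                F[r][c] -= F[r][i] * F[i][c]     # eliminate below the pivot
    return perm, F


# The zero in the top-left corner would break LU without pivoting.
A = [[0.0, 2.0, 1.0],
     [1.0, 1.0, 0.0],
     [4.0, 0.0, 3.0]]
perm, F = lu_partial_pivot(A)
print(perm)
```

The permutation has to be applied to the right-hand side r before the forward and back sweeps.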

Acknowledgements

• Out of Core Algorithm

– Sudarshan Raghunathan

– John Chu MIT

• In Core Algorithm

– Jeremy Kepner MIT LL

– Sharon Sacco MIT LL

– Rodric Rabbah IBM