Download - Sector Cache Optimizations for the K Computerperarnau/pdfs/aics13poster.pdf · Sector Cache Optimizations for the K Computer Swann Perarnau, Mitsuhisa Sato RIKEN AICS, University

Transcript
Page 1: Sector Cache Optimizations for the K Computerperarnau/pdfs/aics13poster.pdf · Sector Cache Optimizations for the K Computer Swann Perarnau, Mitsuhisa Sato RIKEN AICS, University

Sector Cache Optimizations for the K Computer

Swann Perarnau, Mitsuhisa Sato

RIKEN AICS, University of Tsukuba

Sector Cache on the SPARCVIIIfx

IOn-demand split of the shared L2 cache.IUser controlled mapping between accesses and sectors.→ Isolation of thrashing accesses.→ Select and keep useful data in cache.

Our Goals

Assess applicability of the sector cache to optimize HPC applications.I Study potential optimization strategies.IHelp users find good sector cache optimizations.IAim for as much automation as possible.

Issues

I Low-level compile-time API.IHard to predict impact on performance.I Little support from compiler and performance analysis tools.

Current Sector Cache API

double myarray[NSIZE];double otherarray[NSIZE];

void mywork(void){

int i;double sum = 0;

#pragma statement cache_sector_size 1 11#pragma statement cache_subsector_assign myarray

for(i = 2; i < NSIZE-2; i++){

// myarray in sector 1sum += myarray[i-2] + myarray[i-1] +

myarray[i] + myarray[i+1] +myarray[i+2] + otherarray[i];

}}

Results

I efficient cache optimization of classical HPC benchmarks.I framework for locality analysis and optimization of HPC applications.I on-going automation, already little user action required.

Framework Overview

Hotspotdetection

Structures/Scopesetup

DWARF reader

Binaryinstrumentation

Locality analysis

Codemodification

Locality Measurements

Reuse Distance: for a memory access, the number of uniquememory locations touched since the previous access to the samelocation.→ an architecture-independent measure closely related to the cachemisses triggered by an application.→ use it to measure consequences of isolating one structure by itselfwith the sector cache.

Reuse exemple

Access :

load 0x10load 0x20load 0x30load 0x10load 0x20load 0x10load 0x30

Distance :

∞∞∞2212

Den

sity

Distance

Validating the Framework: a toy example

1.75MB 3.5MB 7MB ∞

0

2 · 106

4 · 106

6 · 106

8 · 106

1 · 107

1.2 · 107

1.4 · 107M1M2M3

Reuse Distances of Each Structures

1 2 3 4 5 6 7 8 9 10 11

024681012141618202224

Sector Size

Cache

Miss

esReductio

n(%

)

M1M2M3

Selected by framework

Optimizations for all Sector Configurations

Optimizing the NAS Parallel Benchmarks

Benchmark Function Isolated Variables Sector Size Miss Reduction (%) Runtime Reduction (%)CG conj_grad p (1,11) 19 10

LU

ssor a,b,c,d

(2,10)

48 8blts ldz,ldy,ldx,d 75 10buts d,udx,udy,udz 18 3jacld a,b,c,d 64 14jacu a,b,c,d 57 6

Acknowledgements

Part of the results were obtained by early access to the K computer at the RIKENAICS. This work was supported by the JSPS Grant-in-Aid for JSPS Fellows Number24.02711.

RIKEN Advanced Institute for Computational Science Programming Environment Research Team [email protected]