The GLIMPSES Toolkit Rapid code prototyping for SPEs Jaswanth Sreeram, Santosh Pande.

Post on 31-Dec-2015


The GLIMPSES Toolkit: Rapid code prototyping for SPEs

Jaswanth Sreeram, Santosh Pande

Overview of Toolkit

• GLIMPSES Toolkit: GLobal Interprocedural Memory and ParalleliSm Estimator for SPUs
  – Profile instrumentation support
    • Profile parsers and interpreters
  – Analyzers for memory allocation & access behavior
  – Visualization Engine

GLIMPSES toolkit

• One of two publicly available tools for
  – Rapid Prototyping, Legacy Code Migration and Performance Tuning on Cell SPEs
  – Second one is asmvis
• Released on SourceForge in mid-July: http://glimpses.sourceforge.net
• OSI-certified open source license(s)
• Has received interest for adoption in academia and industry
  – Samsung Korea, Codecs and Media Computing Group
  – Sony Computer Entertainment America (SCEA)

GLIMPSES : Motivation

• Prototyping large codebases for porting to SPEs is challenging
  – Find a partition (set of functions)
  – Find a set of upward exposed references
  – DMA transfer them and lay them out (alignment)
  – After execution, store the results back
  – Make sure memory requirements do not exceed capacity

Motivation – contd.

• Challenges due to architectural attributes
  – Limited local store
  – High branch penalty
  – Suited for vectorizable code rather than scalar code
  – SPE/PPE interactions
• Provide programmer with tools to
  – Understand program behavior (esp. memory usage)
  – Quickly construct candidate partitions for SPEs
  – Evaluate/quantify partitions’ suitability for SPEs

GLIMPSES : Details

• Memory Estimation tools enable programmer to:
  – Estimate static & dynamic memory usage
    • Code, Stack, Heap
  – Understand program behavior
    • Detect program objects affecting dynamic memory behavior
    • Show the correlation between these program objects and memory usage
  – Rank program segments
    • Criteria: memory requirements, vectorizability, branching, etc.
  – Visualize results interactively

Features overview

• Dynamic Call Graph visualization – ability to select a call tree
• Memory Requirements
  – Dynamic
  – Analytical – ‘what if’ scenario calculator for memory capacity
• Memory Access Patterns
  – Locality (spatial, temporal, neighbor affinity)
• Ranking
  – Criteria based estimates
• Alias and safe pre-fetching information
  – Multiple alias analyses available

Overview

[Figure: GLIMPSES tool flow diagram. Labeled components: C/C++ program, LLVM compiler flow, Bytecode, Analysis & Instrumentation Passes, Instru. Bytecode, Link Runtime, Test Inputs, Execute, Profile Trace, Dyn. Memory Estimator, Analytical Memory Estimator, Partition Estimator, GraphML Trace, Visualization Engine]

Visualization

[Screenshot: visualizer window with two labeled regions, the Graph Visualization Area and the Results Display Panel]

Visualization …contd

• Zoom view
• Shows dynamic call chains for a program run (in this case the program is mpeg2decode)

Visualization …contd

[Screenshot: results panel, with the following items labeled]

• Function Characteristics
• Alias Analysis Algorithm used
• Type of Aliases displayed (“Must Alias”, “May Alias”, “No Alias”)
• Aliasing information for pairs of variables/memory regions

Analytical Memory Estimation

• Correlate dynamic memory usage with program objects
  – Dynamic memory usage depends on inputs, etc.
• Compiler Analysis
  – From each malloc, do a backward traversal to find instructions that influence the arguments to malloc
  – Construct an arithmetic expression for the amount of memory allocated, in terms of inputs or other program objects
  – Handles control flow constructs (if-then-else, loops, etc.)

Memory Behavior: Analytical Estimation

    if (cc == 0)
        size = Picture_Width * Picture_Height;
    else
        size = Chroma_Width * Chroma_Height;
    …..……

    for (….) {
        if (…..)
            malloc(size);
        if (…..)
            malloc(size);
    }

__Malloc_size__1 = Picture_Width*Picture_Height

__Malloc_size__2 = Picture_Width*Picture_Height

__Malloc_size__3 = Picture_Width*Picture_Height

__Malloc_size__4 = Picture_Width*Picture_Height

__Malloc_size__5 = Chroma_Width*Chroma_Height

__Malloc_size__6 = Chroma_Width*Chroma_Height

__Malloc_size__7 = Chroma_Width*Chroma_Height

__Malloc_size__8 = Chroma_Width*Chroma_Height
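The expressions above lend themselves to a "what if" calculation: plug in candidate input values and see whether the resulting heap demand would fit on an SPE. The following is a minimal sketch of that idea, not the toolkit's own code. The struct and function names are hypothetical, the count of four picture-sized and four chroma-sized allocations is read off the listing above, and 256 KB is the local store size of a Cell SPE. Code and stack usage are ignored, so the check is optimistic.

```c
#include <assert.h>
#include <stddef.h>

/* Local store capacity of one Cell SPE, in bytes (256 KB). */
#define SPE_LOCAL_STORE_BYTES (256u * 1024u)

/* Hypothetical container for the inputs the expressions depend on. */
struct picture_dims {
    size_t picture_width, picture_height;
    size_t chroma_width, chroma_height;
};

/* Evaluate the eight symbolic expressions from the listing:
 * four allocations of Picture_Width*Picture_Height and
 * four of Chroma_Width*Chroma_Height. Returns the total in bytes. */
size_t estimate_heap_bytes(const struct picture_dims *d)
{
    assert(d != NULL);
    size_t picture = d->picture_width * d->picture_height;
    size_t chroma = d->chroma_width * d->chroma_height;
    return 4 * picture + 4 * chroma;
}

/* "What if" check: returns 1 if the estimated heap alone fits in the
 * local store, 0 otherwise. */
int heap_fits_local_store(const struct picture_dims *d)
{
    return estimate_heap_bytes(d) <= SPE_LOCAL_STORE_BYTES;
}
```

Calling `heap_fits_local_store` with several candidate picture sizes gives a quick sense of which inputs a partition containing these allocations could handle before any porting work is done.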

Memory References

• Memory reference metrics
  – Temporal (frequency)
  – Spatial
  – Neighbor affinity
• Metrics measured per memory line
• Per-function metrics or per-partition metrics
• Visually represented via a color map
  – Pale Violet (low) -> Bright Red (high)
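As an illustration of the per-memory-line frequency metric, the sketch below buckets a trace of accessed addresses into fixed-size lines and scales each count to a heat level that a color map could consume. This is an assumed reconstruction for illustration only: the function names, the trace representation (a flat array of addresses) and the 0..255 heat scale are hypothetical and not taken from the toolkit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count references per memory line.
 * trace:     n accessed addresses
 * base:      start address of the region being profiled
 * line_size: bytes per memory line (for example 1024)
 * counts:    num_lines counters, zeroed by the caller
 * Addresses outside the region are skipped. */
void count_line_references(const uintptr_t *trace, size_t n,
                           uintptr_t base, size_t line_size,
                           unsigned long *counts, size_t num_lines)
{
    assert(line_size > 0);
    for (size_t i = 0; i < n; i++) {
        if (trace[i] < base)
            continue;
        size_t line = (size_t)((trace[i] - base) / line_size);
        if (line < num_lines)
            counts[line]++;
    }
}

/* Scale a count to a heat level: 0 is the low end of the color map,
 * 255 the high end. The scale itself is an assumption. */
unsigned heat_level(unsigned long count, unsigned long max_count)
{
    if (max_count == 0)
        return 0;
    return (unsigned)(((unsigned long long)count * 255u) / max_count);
}
```

Running the counter once per function or once per partition yields the two granularities listed above; only the set of trace entries fed in changes.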

Memory Ref. Frequency (mpeg2decode)

[Figure: memory reference map (per partition) with 1024B memory lines]

Mpeg2decode: Load recurrence

Neighbor Affinity

• Metric to describe how well memory layout is suited to caching
• Consider a slice S of length w of the whole memory access trace and two loads L1, L2 ∈ S
• If |L1_addr – L2_addr| < line size, then L1 and L2 exhibit neighbor affinity for slice size w
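The definition above translates directly into a pairwise test. The sketch below applies it to one slice of a load trace and counts the pairs that qualify. Counting pairs is an assumed way of aggregating the metric; the slides only define the per-pair condition. Function names and the array-of-addresses trace format are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the two load addresses are less than line_size bytes
 * apart, which is the neighbor affinity condition for a pair. */
int has_neighbor_affinity(uintptr_t a, uintptr_t b, size_t line_size)
{
    uintptr_t diff = (a > b) ? (a - b) : (b - a);
    return diff < line_size;
}

/* Count the pairs of loads inside the slice loads[start .. start+w-1]
 * that exhibit neighbor affinity. The slice is clipped to the end of
 * the trace. */
size_t count_affine_pairs(const uintptr_t *loads, size_t n,
                          size_t start, size_t w, size_t line_size)
{
    assert(start <= n);
    size_t end = (w < n - start) ? start + w : n;
    size_t pairs = 0;
    for (size_t i = start; i < end; i++)
        for (size_t j = i + 1; j < end; j++)
            if (has_neighbor_affinity(loads[i], loads[j], line_size))
                pairs++;
    return pairs;
}
```

The inner loops are quadratic in w, which is acceptable for small slices; a production analysis over traces of the size mentioned later in the deck would likely need something cheaper, such as sorting each slice by address first.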


Load Neighbor Affinity

Alias Analysis for libode

• Basic AA (least precise, fastest)
  – Aggressive local analysis
  – Non-context-sensitive
  – Non-flow-sensitive
• Total number of queries: 119520497
• “No Alias”: 35924925
• “May Alias”: 83492482
• “Must Alias”: 103090

Alias Analysis (contd)

• Globals Mod/Ref
  – Context-sensitive mod/ref and alias analysis for internal global variables
  – Very fast, very precise, limited scope
• Total number of queries: 119520497
• “No Alias”: 35944215
• “May Alias”: 83473192
• “Must Alias”: 103090

Alias Analysis (contd)

• Andersen’s AA algorithm
  – Subset-based, flow-insensitive, context-insensitive, and field-insensitive alias analysis
  – Very precise, but slow
• Total number of queries: 119520497
• “No Alias”: 79361105
• “May Alias”: 40057171
• “Must Alias”: 102221

Ranking (MPEG2Encode)

• Criteria based
  – Code Size (csize)
  – Stack Size (ssize)
  – Heap Size (hsize)
  – Branch density (br_density)
  – Autovectorizable loops (av_loops)
  – Is LS memory limit likely to be hit (ls_limit)

Rank = w1*csize + w2*ssize + w3*hsize + w4*br_density + w5/(1 + av_loops) + w6*ls_limit

(wi are the weights for each criterion)
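The rank formula can be written out as a small function. This is a sketch under stated assumptions: the struct layouts and the name `compute_rank` are hypothetical, the units of each criterion are left to the caller, and `ls_limit` is treated as a 0/1 flag. Note that `av_loops` appears in a denominator, so more autovectorizable loops reduce the rank.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-function (or per-segment) measurements. */
struct rank_criteria {
    double csize;       /* code size */
    double ssize;       /* stack size */
    double hsize;       /* heap size */
    double br_density;  /* branch density */
    unsigned av_loops;  /* number of autovectorizable loops */
    int ls_limit;       /* 1 if the LS memory limit is likely to be hit */
};

/* Weights w1..w6, one per criterion. */
struct rank_weights {
    double w1, w2, w3, w4, w5, w6;
};

/* Rank = w1*csize + w2*ssize + w3*hsize + w4*br_density
 *        + w5/(1 + av_loops) + w6*ls_limit */
double compute_rank(const struct rank_criteria *c,
                    const struct rank_weights *w)
{
    assert(c != NULL && w != NULL);
    return w->w1 * c->csize
         + w->w2 * c->ssize
         + w->w3 * c->hsize
         + w->w4 * c->br_density
         + w->w5 / (1.0 + c->av_loops)
         + w->w6 * c->ls_limit;
}
```

With all weights set to 1.0, a segment with csize 10, ssize 2, hsize 4, br_density 0.5, one autovectorizable loop and ls_limit set evaluates to 18.0; raising av_loops to 3 lowers it to 17.75.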

Partitioning

• Preprocessing: Propagate ranks upwards in the call graph

Rank(n) = Rank(n) + ∑ Rank(n→child[i])

• Input: Call graph consisting of nodes annotated with ranks

• Output: Graph partitions that are suitable for execution on the SPEs

• A partition P is deemed “suitable” if Rank(P→root) < Threshold
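The preprocessing and selection steps can be sketched as two tree walks. Several assumptions are made here that the slides do not state: the call graph is treated as a tree (no recursion, no function shared between callers), ranks are non-negative so a subtree never ranks lower than its parts, child ranks are accumulated bottom-up before being added to the parent, and only the largest suitable subtrees are reported. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 8

/* Hypothetical call graph node annotated with a rank. */
struct cg_node {
    double rank;                         /* own rank; accumulated after propagation */
    size_t num_children;
    struct cg_node *child[MAX_CHILDREN];
};

/* Post-order walk: Rank(n) = Rank(n) + sum of Rank(n->child[i]).
 * Returns the accumulated rank of n. */
double propagate_ranks(struct cg_node *n)
{
    assert(n != NULL && n->num_children <= MAX_CHILDREN);
    for (size_t i = 0; i < n->num_children; i++)
        n->rank += propagate_ranks(n->child[i]);
    return n->rank;
}

/* Top-down walk over propagated ranks. A subtree is reported as a
 * partition when Rank(root) < threshold; the walk does not descend
 * further into a reported subtree. Writes at most cap roots to out
 * and returns how many were written. */
size_t collect_partitions(struct cg_node *n, double threshold,
                          struct cg_node **out, size_t cap)
{
    assert(n != NULL);
    if (cap == 0)
        return 0;
    if (n->rank < threshold) {
        out[0] = n;
        return 1;
    }
    size_t found = 0;
    for (size_t i = 0; i < n->num_children; i++)
        found += collect_partitions(n->child[i], threshold,
                                    out + found, cap - found);
    return found;
}
```

Lowering the threshold makes the walk descend further before it finds a suitable root, so it tends to produce more, smaller partitions; raising it yields fewer, larger ones.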


Effect of threshold on partitions


mpeg2decode

GLIMPSES status

• Beta version available for download at: http://glimpses.sourceforge.net
• 300 MB source code package (includes visualizer)
• Lines of code (C/C++): 447,000
• Third party tools integrated: LLVM (Compiler), Prefuse (Visualization)
• Executable size: 422 MB (x86 binaries)
• Typical trace size: 900 MB (LIBODE)
• Man-hour effort: ~750
• Releases:
  – v.0.8: based on LLVM version 1.8 (July 7th)
  – v.1.0: based on LLVM version 2.0 (undergoing testing)
• Tested to work with large codebases:
  – LIBODE (115000 lines of code), mpeg2 (10000 lines of code), SPEC INT 2000, etc.

Ongoing and future work

• More Validation
  – Compare partitions produced with those generated by expert programmers
• An inter-procedural, flow-sensitive, context-sensitive alias analysis algorithm


Ongoing and future work

• Function data dependence graph
  – Encapsulates data flow between functions
  – Arguments, aliases, globals
  – Important factor in partitioning decisions
  – “affinity between pairs of functions”
