Automated Extraction of Skeleton Apps from Apps February 2012
Lawrence Livermore National Laboratory
Automated Extraction of Skeleton Apps from Apps
February 2012
Daniel Quinlan (LLNL), Matt Sottile (Galois), Aaron Tomb (Galois)
Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551
Operated by Lawrence Livermore National Security, LLC, for the U.S. Department of Energy, National Nuclear Security Administration under Contract DE-AC52-07NA27344
2
What is a Skeleton and why you want one
A skeleton is a reduced-size version of an application that focuses on one or more aspects of the behavior of the full original application. Examples include:
• MPI usage, message passing patterns
• memory traversal
• I/O demands
This is important for Exascale:
• Provides inputs to simulators for evaluation of expected Exascale architectures and features (e.g. SST/macro)
• Provides smaller applications for independent study
A skeleton program will not get the same answer as the original application
There is prior work in this area… I think we are the only ones with a distributed tool for this…
3
CoDesign Tool Flow
Automatic Generation of Skeletons for Rapid Analysis
This talk is about these arrows
4
We can generate many skeletons from an App
Many skeletons could be generated from a single application
The process can work on full applications or smaller compact applications
[Figure: a single app with many files is viewed through Aspect A, Aspect B, ..., Aspect X to produce Skeleton A, Skeleton B, ..., Skeleton X: many skeleton apps, each with possibly many files.]
5
An Automated or Semi-Automated Process
We treat this as a compiler research problem
We are building tools to automate the generation of skeletons, but some questions are difficult to resolve
• May require dynamic analysis to identify important values
• May require some user annotations to define some behavior
We start with the original application and transform it, modifying and removing code in an automated process; this is a source-to-source solution
6
We are using the ROSE Source-To-Source Compiler to support this work
Science & Technology: Computation Directorate
[Figure: ROSE-based Skeleton Generation Tool. Source code (Fortran/C/C++, OpenMP) enters the ROSE frontend, which builds the ROSE IR. Analyses, transformations, and optimizations (system dependence, sliced system dependence, control flow, control dependence) operate on the IR. The unparser emits the transformed source code.]
7
A Non-trivial problem to Automate
Different aspects are related (they are not actually orthogonal)
• Example: inter-message timings are a function of the computational work that an app does.
Static analysis is not always precise, and dynamic analysis is not always complete
The focus of our research work is using static analysis and formal methods to generate plausible, realistic skeletons.
8
Example of Automated Skeleton Code Generation: Before/After
Before:
do {
    if (rank < size - 1)
        MPI_Send( xlocal[maxn/size], maxn, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD );
    if (rank > 0)
        MPI_Recv( xlocal[0], maxn, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &status );
    if (rank > 0)
        MPI_Send( xlocal[1], maxn, MPI_DOUBLE, rank - 1, 1, MPI_COMM_WORLD );
    if (rank < size - 1)
        MPI_Recv( xlocal[maxn/size+1], maxn, MPI_DOUBLE, rank + 1, 1, MPI_COMM_WORLD, &status );
    itcnt ++;
    diffnorm = 0.0;
    for (i=i_first; i<=i_last; i++)
        for (j=1; j<maxn-1; j++) {
            xnew[i][j] = (xlocal[i][j+1] + xlocal[i][j-1] + xlocal[i+1][j] + xlocal[i-1][j]) / 4.0;
            diffnorm += (xnew[i][j] - xlocal[i][j]) * (xnew[i][j] - xlocal[i][j]);
        }
    for (i=i_first; i<=i_last; i++)
        for (j=1; j<maxn-1; j++)
            xlocal[i][j] = xnew[i][j];
    MPI_Allreduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
    gdiffnorm = sqrt( gdiffnorm );
    if (rank == 0)
        printf( "At iteration %d, diff is %e\n", itcnt, gdiffnorm );
} while (gdiffnorm > 1.0e-2 && itcnt < 100);

After:
do {
    if (rank < size - 1)
        MPI_Send( xlocal[maxn / size], maxn, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD );
    if (rank > 0)
        MPI_Recv( xlocal[0], maxn, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &status );
    if (rank > 0)
        MPI_Send( xlocal[1], maxn, MPI_DOUBLE, rank - 1, 1, MPI_COMM_WORLD );
    if (rank < size - 1)
        MPI_Recv( xlocal[maxn/size+1], maxn, MPI_DOUBLE, rank + 1, 1, MPI_COMM_WORLD, &status );
    itcnt ++;
    MPI_Allreduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
} while (gdiffnorm > 1.0e-2 && itcnt < 100);
9
Example of Automated Skeleton Code Generation: Larger example
Source-to-source transformation
Def-use analysis of variables leading to MPI calls
Future work will explore use of:
• System Dependence Graph (SDG)
• Data flow framework and defined concepts of dead-code elimination.
• Can be supplemented with dynamic information
• Can be applied to abstract other things than MPI use
Original Source Code: rank(int iteration)

void rank( int iteration )
{
    INT_TYPE    i, k;
    INT_TYPE    shift = MAX_KEY_LOG_2 - NUM_BUCKETS_LOG_2;
    INT_TYPE    key;
    INT_TYPE2   bucket_sum_accumulator, j, m;
    INT_TYPE    local_bucket_sum_accumulator;
    INT_TYPE    min_key_val, max_key_val;
    INT_TYPE    *key_buff_ptr;

    TIMER_START( T_RANK );

    /* Iteration alteration of keys */
    if( my_rank == 0 )
    {
        key_array[iteration] = iteration;
        key_array[iteration+MAX_ITERATIONS] = MAX_KEY - iteration;
    }

    /* Initialize */
    for( i=0; i<NUM_BUCKETS+TEST_ARRAY_SIZE; i++ )
    {
        bucket_size[i] = 0;
        bucket_size_totals[i] = 0;
        process_bucket_distrib_ptr1[i] = 0;
        process_bucket_distrib_ptr2[i] = 0;
    }

    /* Determine where the partial verify test keys are, load into */
    /* top of array bucket_size */
    for( i=0; i<TEST_ARRAY_SIZE; i++ )
        if( (test_index_array[i]/NUM_KEYS) == my_rank )
            bucket_size[NUM_BUCKETS+i] = key_array[test_index_array[i] % NUM_KEYS];

    /* Determine the number of keys in each bucket */
    for( i=0; i<NUM_KEYS; i++ )
        bucket_size[key_array[i] >> shift]++;

    /* Accumulative bucket sizes are the bucket pointers */
    bucket_ptrs[0] = 0;
    for( i=1; i< NUM_BUCKETS; i++ )
        bucket_ptrs[i] = bucket_ptrs[i-1] + bucket_size[i-1];

    /* Sort into appropriate bucket */
    for( i=0; i<NUM_KEYS; i++ )
    {
        key = key_array[i];
        key_buff1[bucket_ptrs[key >> shift]++] = key;
    }

    TIMER_STOP( T_RANK );
    TIMER_START( T_RCOMM );

    /* Get the bucket size totals for the entire problem. These
       will be used to determine the redistribution of keys */
    MPI_Allreduce( bucket_size, bucket_size_totals, NUM_BUCKETS+TEST_ARRAY_SIZE,
                   MP_KEY_TYPE, MPI_SUM, MPI_COMM_WORLD );

    TIMER_STOP( T_RCOMM );
    TIMER_START( T_RANK );

    /* Determine Redistribution of keys: accumulate the bucket size totals
       till this number surpasses NUM_KEYS (which is the average number of keys
       per processor). Then all keys in these buckets go to processor 0.
       Continue accumulating again until surpassing 2*NUM_KEYS. All keys
       in these buckets go to processor 1, etc. This algorithm guarantees
       that all processors have work ranking; no processors are left idle.
       The optimum number of buckets, however, does not result in as high
       a degree of load balancing (as even a distribution of keys as is
       possible) as is obtained from increasing the number of buckets, but
       more buckets results in more computation per processor so that the
       optimum number of buckets turns out to be 1024 for machines tested.
       Note that process_bucket_distrib_ptr1 [...]

    [... part of the listing is not legible in the slide ...]

       [...] NUM_BUCKETS, it is highly possible
       that the last few processors don't get any buckets. So, we
       need to set counts properly in this case to avoid any fallouts. */
    while( j < comm_size )
    {
        send_count[j] = 0;
        process_bucket_distrib_ptr1[j] = 1;
        j++;
    }

    TIMER_STOP( T_RANK );
    TIMER_START( T_RCOMM );

    /* This is the redistribution section: first find out how many keys
       each processor will send to every other processor: */
    MPI_Alltoall( send_count, 1, MPI_INT, recv_count, 1, MPI_INT, MPI_COMM_WORLD );

    /* Determine the receive array displacements for the buckets */
    recv_displ[0] = 0;
    for( i=1; i<comm_size; i++ )
        recv_displ[i] = recv_displ[i-1] + recv_count[i-1];

    /* Now send the keys to respective processors */
    MPI_Alltoallv( key_buff1, send_count, send_displ, MP_KEY_TYPE,
                   key_buff2, recv_count, recv_displ, MP_KEY_TYPE,
                   MPI_COMM_WORLD );

    TIMER_STOP( T_RCOMM );
    TIMER_START( T_RANK );

    /* The starting and ending bucket numbers on each processor are
       multiplied by the interval size of the buckets to obtain the
       smallest possible min and greatest possible max value of any
       key on each processor */
    min_key_val = process_bucket_distrib_ptr1[my_rank] << shift;
    max_key_val = ((process_bucket_distrib_ptr2[my_rank] + 1) << shift)-1;

    /* Clear the work array */
    for( i=0; i<max_key_val-min_key_val+1; i++ )
        key_buff1[i] = 0;

    /* Determine the total number of keys on all other
       processors holding keys of lesser value */
    m = 0;
    for( k=0; k<my_rank; k++ )
        for( i= process_bucket_distrib_ptr1[k]; i<=process_bucket_distrib_ptr2[k]; i++ )
            m += bucket_size_totals[i];  /* m has total # of lesser keys */

    /* Determine total number of keys on this processor */
    j = 0;
    for( i= process_bucket_distrib_ptr1[my_rank]; i<=process_bucket_distrib_ptr2[my_rank]; i++ )
        j += bucket_size_totals[i];  /* j has total # of local keys */

    /* Ranking of all keys occurs in this section: */
    /* shift it backwards so no subtractions are necessary in loop */
    key_buff_ptr = key_buff1 - min_key_val;

    /* In this section, the keys themselves are used as their
       own indexes to determine how many of each there are: their
       individual population */
    for( i=0; i<j; i++ )
        key_buff_ptr[key_buff2[i]]++;  /* Now they have individual key */
                                       /* population */

    /* To obtain ranks of each key, successively add the individual key
       population, not forgetting the total of lesser keys, m.

    [... part of the listing is not legible in the slide ...]

            failed = 0;

            switch( CLASS )
            {
            case 'S':
                if( i <= 2 ) {
                    if( key_rank != test_rank_array[i]+iteration ) failed = 1;
                    else passed_verification++;
                } else {
                    if( key_rank != test_rank_array[i]-iteration ) failed = 1;
                    else passed_verification++;
                }
                break;
            case 'W':
                if( i < 2 ) {
                    if( key_rank != test_rank_array[i]+(iteration-2) ) failed = 1;
                    else passed_verification++;
                } else {
                    if( key_rank != test_rank_array[i]-iteration ) failed = 1;
                    else passed_verification++;
                }
                break;
            case 'A':
                if( i <= 2 ) {
                    if( key_rank != test_rank_array[i]+(iteration-1) ) failed = 1;
                    else passed_verification++;
                } else {
                    if( key_rank != test_rank_array[i]-(iteration-1) ) failed = 1;
                    else passed_verification++;
                }
                break;
            case 'B':
                if( i == 1 || i == 2 || i == 4 ) {
                    if( key_rank != test_rank_array[i]+iteration ) failed = 1;
                    else passed_verification++;
                } else {
                    if( key_rank != test_rank_array[i]-iteration ) failed = 1;
                    else passed_verification++;
                }
                break;
            case 'C':
                if( i <= 2 ) {
                    if( key_rank != test_rank_array[i]+iteration ) failed = 1;
                    else passed_verification++;
                } else {
                    if( key_rank != test_rank_array[i]-iteration ) failed = 1;
                    else passed_verification++;
                }
                break;
            case 'D':
                if( i < 2 ) {
                    if( key_rank != test_rank_array[i]+iteration ) failed = 1;
                    else passed_verification++;
                } else {
                    if( key_rank != test_rank_array[i]-iteration ) failed = 1;
                    else passed_verification++;
                }
                break;
            }
            if( failed == 1 )
                printf( "Failed partial verification: "
                        "iteration %d, processor %d, test key %d\n",
                        iteration, my_rank, (int)i );
        }
    }

    TIMER_STOP( T_RANK );

    /* Make copies of rank info for use by full_verify: these variables
       in rank are local; making them global slows down the code, probably
       since they cannot be made register by compiler */
    if( iteration == MAX_ITERATIONS )
    {
        key_buff_ptr_global = key_buff_ptr;
        total_local_keys = j;
        total_lesser_keys = 0;  /* no longer set to 'm', see note above */
    }
}

Generated Skeleton Code: rank(int iteration)

void rank(int iteration)
{
    INT_TYPE i;
    INT_TYPE k;
    INT_TYPE shift = (23 - 10);
    INT_TYPE key;
    INT_TYPE2 bucket_sum_accumulator;
    INT_TYPE2 j;
    INT_TYPE2 m;
    INT_TYPE local_bucket_sum_accumulator;
    INT_TYPE min_key_val;
    INT_TYPE max_key_val;
    INT_TYPE *key_buff_ptr;
    /* Get the bucket size totals for the entire problem. These
       will be used to determine the redistribution of keys */
    MPI_Allreduce(bucket_size,bucket_size_totals,((1 << 10) + 5),MPI_INT,MPI_SUM,MPI_COMM_WORLD);
    /* This is the redistribution section: first find out how many keys
       each processor will send to every other processor: */
    MPI_Alltoall(send_count,1,MPI_INT,recv_count,1,MPI_INT,MPI_COMM_WORLD);
    /* Now send the keys to respective processors */
    MPI_Alltoallv(key_buff1,send_count,send_displ,MPI_INT,key_buff2,recv_count,recv_displ,MPI_INT,MPI_COMM_WORLD);
}
10
Static Analysis Drives Skeleton Generation
First prototype:
• Generate skeleton representing message passing via static analysis (using the use-def analysis in ROSE)
Basic concept, where MPI is the target aspect:
• Identify message passing (MPI) operations.
• Preserve MPI operations and the code that they depend on, removing superfluous code.
• Aim to remove large blocks of computational code, replacing them with simpler surrogate code, to produce a skeleton of the app that contains the essential message passing structure without the actual work.
Our research approach has been to explore four different forms of analysis to drive the skeleton generation:
1) Use-def analysis (to generate a form of program slice), works on the AST directly, not directly using the inter-procedural control flow graph (CFG)
2) Program slicing using ROSE’s System Dependence graph (SDG) which captures the def-use analysis and more on the inter-procedural control flow graph in ROSE
3) A new Data-Flow Framework in ROSE; another form of analysis using the interprocedural control flow graph in ROSE
4) Connections to Formal methods
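The first form of analysis can be illustrated with a minimal sketch of a use-def backward slice over straight-line code. This is not the ROSE API: the `Stmt` record, the bitmask encoding of variables, and the function `skeleton_slice` are hypothetical names introduced for illustration, and the sketch ignores control flow, loops, pointers, and aliasing. It also assumes that a target call lists in `uses` only the variables that determine its communication structure (ranks, counts, tags), not its payload buffers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy IR: one record per statement of straight-line code. */
typedef struct {
    int      def;        /* index of the variable written, or -1 for none */
    unsigned uses;       /* bitmask of variable indices read */
    bool     is_target;  /* true for calls into the target API (e.g. MPI) */
} Stmt;

/* Walks the statements backward and sets keep[i] = true for every target
   statement and for every statement that defines a variable still needed
   by a kept statement. Returns the number of statements kept. */
size_t skeleton_slice(const Stmt *stmts, size_t n, bool *keep)
{
    unsigned needed = 0;  /* variables whose definition is still required */
    size_t kept = 0;

    for (size_t i = n; i-- > 0; ) {
        const Stmt *s = &stmts[i];
        assert(s->def < 32);  /* variables must fit in the bitmask */

        bool defines_needed = s->def >= 0 && (needed & (1u << s->def)) != 0;
        keep[i] = s->is_target || defines_needed;
        if (!keep[i])
            continue;         /* superfluous for the target aspect: drop */

        if (s->def >= 0)
            needed &= ~(1u << s->def);  /* this definition satisfies the need */
        needed |= s->uses;              /* its operands are now needed */
        kept++;
    }
    return kept;
}
```

Applied to a statement list modeled on the Jacobi loop body shown earlier, a sketch like this keeps the sends, the reduction, and the definition of the rank, and drops the stencil update and the square root.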
11
Static Analysis: Program Slicing
int returnMe (int me) {
    return me;
}

int main (int argc, char ** argv) {
    int a = 1;
    int b;
    returnMe(a);
    b = returnMe(a);
    #pragma SliceTarget
    return b;
}
System (Inter-procedural) Dependence Analysis
A sequence of directed edges defines a slice
Can be used for model extraction
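The idea that a slice is the set of nodes reached by following dependence edges from a target can be sketched as a plain graph traversal. This is an illustration only, not ROSE's System Dependence Graph: `DepGraph`, `MAX_NODES`, and `slice_from` are hypothetical names, and the graph is a small adjacency matrix rather than a real inter-procedural representation.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_NODES 16

/* dep[a][b] is true when node a depends on node b (edge a -> b). */
typedef struct {
    int  n;
    bool dep[MAX_NODES][MAX_NODES];
} DepGraph;

/* Sets in_slice[v] for the target and for every node reachable from it
   along dependence edges; all other entries are cleared. */
void slice_from(const DepGraph *g, int target, bool in_slice[MAX_NODES])
{
    int stack[MAX_NODES];  /* each node is pushed at most once */
    int top = 0;

    assert(g->n <= MAX_NODES && target >= 0 && target < g->n);
    memset(in_slice, 0, MAX_NODES * sizeof in_slice[0]);
    in_slice[target] = true;
    stack[top++] = target;

    while (top > 0) {
        int v = stack[--top];
        for (int w = 0; w < g->n; w++) {
            if (g->dep[v][w] && !in_slice[w]) {
                in_slice[w] = true;
                stack[top++] = w;
            }
        }
    }
}
```

If the statements of the example above are numbered (an assumed numbering) 0 for `int a = 1`, 1 for the bare `returnMe(a)` call, 2 for `b = returnMe(a)`, 3 for `return b`, and 4 for the body of `returnMe`, slicing from node 3 reaches nodes 2, 0, and 4 but not node 1, whose result is never used.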
12
Data Flow as an alternative approach to Drive Skeleton Generation
Future work will explore the use of a new Data Flow Framework in ROSE to support analysis required to generate skeletons
• May be an easier way (for users) to specify aspects
• It is related to slicing in that it uses the same inter-procedural control flow graph internally
Each form of analysis (Use-def, SDG, and Data-Flow) is an orthogonal direction of work, and all share the common infrastructure we have built for skeleton generation.
The analysis and infrastructure are implemented using ROSE
13
A Generic API for Skeletonization
Generalized skeletonization target APIs
• Original work focused on skeletonizing relative to the MPI API.
• Current code extended to allow skeletons against any API (e.g., Visualization and Data Analysis, I/O and Storage, use of domain-specific abstractions, etc.)
• Important for building skeletons to probe different aspects of program behavior: I/O, message passing, threading, app-specific libraries
14
Annotation guided skeletonization
• Previous work focused on purely dependency-based slicing. This led to problems:
 - Removal of computational code could cause loops to cease to converge (iterate forever).
 - Branching patterns are no longer meaningful with computational code gone.
• Annotations let the user guide skeletonization to add semantics to the skeleton that are impossible or difficult to infer statically:
 - Loop iteration counts; branching probabilities; variable initialization values.
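One way a branching-probability annotation could be honored is sketched below. The slides do not say how the tool implements this, so everything here is an assumption: `BranchSurrogate` and `surrogate_taken` are hypothetical names, and the sketch replaces a data-dependent condition with a deterministic counter that takes the branch in the annotated fraction of evaluations.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical surrogate state for one annotated branch. */
typedef struct {
    unsigned percent_taken;  /* annotated probability, 0..100 */
    unsigned accum;          /* running accumulator, starts at 0 */
} BranchSurrogate;

/* Stands in for the original branch condition. Returns true on exactly
   percent_taken out of every 100 consecutive calls (starting from
   accum == 0), spreading the taken outcomes evenly instead of drawing
   random numbers, so skeleton runs are repeatable. */
bool surrogate_taken(BranchSurrogate *b)
{
    assert(b->percent_taken <= 100);
    b->accum += b->percent_taken;
    if (b->accum >= 100) {
        b->accum -= 100;
        return true;
    }
    return false;
}
```

A skeleton would then test `surrogate_taken(&state)` where the original code tested a value that the removed computation used to produce. A deterministic counter is a design choice for reproducibility; a pseudo-random draw would be an equally plausible alternative.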
15
Use of an Annotation Before/After
Before:
int main() {
    int x = 0;
    int i;
    // execute exactly 10 times
    #pragma skel loopIterate 10
    for (i = 0; x < 100 ; i++) {
        if (x % 2)
            x += 5;
    }
    return x;
}

After:
int main() {
    int x = 0;
    int i;
    // execute exactly 10 times
    #pragma skel loopIterate 10
    int k = 0;
    for (i = 0; k < 10; k++) {
        {
            if ((x % 2) != 0)
                x += 5;
        }
        rose_label__1: i++;
    }
    return x;
}
16
User Work Flow for Skeletonization
[Figure: user work flow. The original application program, combined with dynamic measurements of the program, yields an annotated application program. The skeleton extraction tool turns the annotated program into a skeleton program, and the user observes the behavior of the skeleton.]
Satisfactory behavior: keep skeleton
Unsatisfactory behavior: modify or add annotations to tune skeleton generator
 - Branch probabilities
 - Average loop iteration counts
 - Legitimate data values
17
Future work
SDG version of analysis for skeletonization
Using the new Data Flow framework in ROSE for skeletonization
Galois will be working on adding formal-methods-based analysis to the skeleton generator to analyze regions of code to remove.
• Floating point range analysis.
• Symbolic execution.
Formal methods will aim to answer questions to aid skeleton generation such as:
• What range of values do we expect a complex computation to produce?
 - Allows us to automatically select surrogate values for populating data structures
 - Know when specific values are critical
• Under specific input conditions, what code is reachable or not reachable?
 - Allows us to build skeletons for specific input circumstances, instead of generic skeletons
This is a connection to path feasibility analysis currently being developed in ROSE
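The range question can be made concrete with a small interval-arithmetic sketch. This is a generic technique chosen for illustration, not the formal-methods analysis planned for the tool: `Interval`, `iv_add`, `iv_scale`, and `stencil_range` are hypothetical names, and the sketch does not round outward, which a sound floating point range analysis would need to do.

```c
#include <assert.h>

/* Closed interval [lo, hi] bounding the values a variable may take. */
typedef struct { double lo, hi; } Interval;

/* Range of a + b when a and b lie in the given intervals. */
Interval iv_add(Interval a, Interval b)
{
    return (Interval){ a.lo + b.lo, a.hi + b.hi };
}

/* Range of c * a; a negative factor swaps the bounds. */
Interval iv_scale(Interval a, double c)
{
    assert(a.lo <= a.hi);
    return c >= 0 ? (Interval){ c * a.lo, c * a.hi }
                  : (Interval){ c * a.hi, c * a.lo };
}

/* Range of the four-point average (n + s + e + w) / 4.0 when every
   neighbor lies in `cell`, as in the Jacobi update shown earlier. */
Interval stencil_range(Interval cell)
{
    Interval sum = iv_add(iv_add(cell, cell), iv_add(cell, cell));
    return iv_scale(sum, 0.25);
}
```

For this stencil the result interval equals the input interval, so any value inside the initial data range is a legitimate surrogate for populating the grid once the update itself has been removed.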
18
ROSE Compiler Design
[Figure: ROSE compiler design. Front-end: general purpose languages used within DOE (Python, C & C++, Fortran (F77-F2003), UPC 1.1, OpenMP 3.0, CUDA), AST Builder API. Mid-end: high level IRs (AST), IR Extension API (ROSETTA), high level analysis & optimization framework, low level analysis & optimization, low level IR (LLVM), unparser. Back-end: existing LLVM analysis & optimization, LLVM backend code generation, Exascale vendor compiler infrastructures, Exascale vendor compilers, Exascale architecture.]