PAPI: Performance Application Programming Interface
Analyzing performance
Linux comes with many ways to tell how long a program runs:
• time(1)
• time(2)
• clock(3)
• times(2)
• gprof(1)
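As a minimal sketch of what the library-level timers in this list measure, the C functions below wrap clock() and times() around a call. The names cpu_seconds_clock, user_seconds_times, busy_work, and busy_work_sink are hypothetical and exist only for this illustration.

```c
#include <sys/times.h>
#include <time.h>
#include <unistd.h>

/* Result of the dummy workload; volatile so the loop is not optimized away. */
volatile double busy_work_sink;

/* Hypothetical workload: sum the first five million terms of the harmonic series. */
void busy_work(void) {
    double sum = 0.0;
    for (long i = 1; i <= 5000000; i++)
        sum += 1.0 / (double)i;
    busy_work_sink = sum;
}

/* CPU seconds consumed by fn(), measured with clock(). */
double cpu_seconds_clock(void (*fn)(void)) {
    clock_t start = clock();
    fn();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* User-mode CPU seconds consumed by fn(), measured with times(). */
double user_seconds_times(void (*fn)(void)) {
    struct tms before, after;
    long ticks_per_second = sysconf(_SC_CLK_TCK);
    times(&before);
    fn();
    times(&after);
    return (double)(after.tms_utime - before.tms_utime) / (double)ticks_per_second;
}
```

Calling cpu_seconds_clock(busy_work) returns the elapsed CPU time in seconds. Both functions report how long the call took, not why it took that long.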
Analyzing performance
• These are all relatively high-level tools
• They tell you what’s slow and how long it takes, but not why it’s slow.
Modern CPUs can track a lot more low-level data
• cycle count
• instruction count
• floating point instruction count
• pipeline stalls
• L1 cache hits/misses
• L2 cache hits/misses
• TLB misses
• hardware interrupts
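Raw counts like these are usually combined into ratios before they are interpreted. The sketch below shows two common ones, instructions per cycle and cache miss rate. The function names are hypothetical, and any numbers passed in are the caller's own, not measurements from this talk.

```c
/* Instructions retired per cycle; returns 0.0 if no cycles were counted. */
double instructions_per_cycle(long long instructions, long long cycles) {
    return cycles > 0 ? (double)instructions / (double)cycles : 0.0;
}

/* Fraction of cache accesses that missed, given separate hit and miss counts. */
double miss_rate(long long hits, long long misses) {
    long long accesses = hits + misses;
    return accesses > 0 ? (double)misses / (double)accesses : 0.0;
}
```

For example, 75 hits and 25 misses give a miss rate of 0.25.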
PAPI
• Runs on a number of hardware platforms and operating systems
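• Provides a consistent high-level interface (C and Fortran) to CPU performance data
• Requires the Perfctr kernel patch on Linux kernels of this era (2.6.x); later kernels provide perf_event, which newer PAPI releases use instead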
Using PAPI
• Install Perfctr kernel patch
• Install PAPI
• Add PAPI functions to your application
• Link to PAPI library with “-lpapi”
PAPI API
PAPI has over 40 functions in its API, but you only really need a few:
• PAPI_library_init()
• PAPI_num_counters()
• PAPI_query_event()
• PAPI_start_counters()
• PAPI_read_counters()
• PAPI_flops()
PAPI_library_init()
if (PAPI_VER_CURRENT != PAPI_library_init(PAPI_VER_CURRENT)) ehandler("PAPI_library_init error.");
• Initialize the PAPI library.
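The snippets in these slides call ehandler() without showing it. A minimal sketch of what such a helper might look like follows. It is an assumption, not the presenter's code: it prints the message to stderr and exits with a failure status.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical error handler: report the message on stderr and end the run. */
void ehandler(const char *msg) {
    fprintf(stderr, "error: %s\n", msg);
    exit(EXIT_FAILURE);
}
```

Exiting immediately keeps the examples short. A longer-running program would probably want to clean up first.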
PAPI_num_counters()
• Check how many counters this CPU can monitor
const int EVENT_MAX = PAPI_num_counters(); /* negative PAPI error code on failure */
if (EVENT_MAX < 0) ehandler("PAPI_num_counters error.");
PAPI_query_event()
• Check if the CPU can monitor the event you’re interested in
if (PAPI_OK != PAPI_query_event(PAPI_TOT_INS)) ehandler("Cannot count PAPI_TOT_INS.");
if (PAPI_OK != PAPI_query_event(PAPI_L1_DCM)) ehandler("Cannot count PAPI_L1_DCM.");
if (PAPI_OK != PAPI_query_event(PAPI_L2_DCM)) ehandler("Cannot count PAPI_L2_DCM.");
PAPI_start_counters()
size_t EVENT_COUNT = 3;
int events[] = { PAPI_TOT_INS, PAPI_L1_DCM, PAPI_L2_DCM };
if (PAPI_OK != PAPI_start_counters(events, EVENT_COUNT)) ehandler("Cannot start counters.");
PAPI_read_counters()
long long values[EVENT_COUNT];
/* Discard whatever was counted so far; PAPI_read_counters() resets the counters after each read. */
if (PAPI_OK != PAPI_read_counters(values, EVENT_COUNT)) ehandler("Problem reading counters 1.");
C = matrix_prod(n, n, n, n, A, B);
/* values[] now holds the counts for matrix_prod() alone. */
if (PAPI_OK != PAPI_read_counters(values, EVENT_COUNT)) ehandler("Problem reading counters 2.");
printf("%d %lld %lld %lld\n", n, values[0], values[1], values[2]);
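PAPI_flops()
float rtime;      /* real (wall-clock) time in seconds */
float ptime;      /* process time in seconds */
long long flpops; /* floating point operations counted */
float mflops;     /* Mflop/s rate */
/* The first call sets up and starts the counters. */
if (PAPI_OK != PAPI_flops(&rtime, &ptime, &flpops, &mflops)) ehandler("Problem reading flops 1");
C = matrix_prod(n, n, n, n, A, B);
/* The second call reports the time and operations since the first call. */
if (PAPI_OK != PAPI_flops(&rtime, &ptime, &flpops, &mflops)) ehandler("Problem reading flops 2");
printf("%d %lld %f\n", n, flpops, mflops);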
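A counted flop total is easier to trust when compared against a hand calculation. For the triple-loop matrix product benchmarked in this talk, each inner iteration performs one multiply and one add, so an n x n product should perform about 2n^3 floating point operations. The sketch below encodes that estimate along with the usual Mflop/s formula. Both function names are hypothetical, and hardware counters may count some instructions differently, so treat the result as a rough check only.

```c
/* Expected floating point operations for an (m1 x n1) by (n1 x n2) product:
 * one multiply and one add per inner-loop iteration. */
long long expected_matmul_flops(int m1, int n1, int n2) {
    return 2LL * m1 * n1 * n2;
}

/* Millions of floating point operations per second; 0.0 if no time elapsed. */
double mflops_from_counts(long long flops, double seconds) {
    return seconds > 0.0 ? (double)flops / seconds / 1.0e6 : 0.0;
}
```

For n = 10 the estimate is 2000 operations, which can be compared with the flpops value printed above.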
How am I doing on time?
Bonus slides!
Example: Benchmarking Matrix Multiplication
1. Matrix multiplication code from Numerical Recipes in C
• compiled with -O1, -O3, and -O3 -funroll-loops
2. ATLAS (ringer)
• full optimizations
float **matrix_prod(m1,n1,m2,n2,A,B)
float **A,**B;
int m1,n1,m2,n2;
/*
 * Matrix product. A is a m1 X n1 matrix with range [1..m1][1..n1]
 * and B is a m2 X n2 matrix with range [1..m2][1..n2]. n1 = m2.
 * C = A * B.
 */
{
  int i,j,k;
  float **C;

  C = zero_matrix(1,m1,1,n2);
  for (i=1;i<=m1;i++)
    for (j=1;j<=n2;j++)
      for (k=1;k<=n1;k++)
        C[i][j] = C[i][j] + A[i][k] * B[k][j];
  return C;
}
Numerical Recipes code
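The listing above relies on zero_matrix(), which is not shown, and on 1-based pointer-to-pointer matrices. For readers who want something that compiles on its own, here is a sketch of the same triple loop using a different layout: 0-based indices and flat row-major arrays. It is a stand-in for experimentation, not the code that was benchmarked, and the name matrix_prod_flat is hypothetical.

```c
#include <stdlib.h>

/* Multiply A (m x k) by B (k x n), both stored row-major in flat arrays.
 * Returns a newly allocated, zero-initialized m x n result, or NULL if
 * allocation fails. The caller frees the result. */
float *matrix_prod_flat(int m, int k, int n, const float *A, const float *B) {
    float *C = calloc((size_t)m * (size_t)n, sizeof *C);
    if (C == NULL)
        return NULL;
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            for (int p = 0; p < k; p++)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
    return C;
}
```

The loop order (i, j, p) matches the original, so the inner loop still strides down a column of B, which is the access pattern the cache-miss counters are meant to expose.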
Test Platform
CPU: 1.8 GHz Opteron
RAM: 2 GB
Kernel: 2.6.13
Distribution: Gentoo
L1 cache: 64 KB
L2 cache: 768 KB
Integer instruction pipeline: 12 stages
FP instruction pipeline: 17 stages
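The cache sizes matter because miss counts tend to jump once the matrices no longer fit. As a rough sketch, the functions below estimate the largest n for which three n x n float matrices (A, B, and C) fit in a given cache. This assumes the whole listed capacity is available for data and counts matrix elements only. The pointer arrays in the Numerical Recipes layout add overhead, so the real thresholds are somewhat lower. The function names are hypothetical.

```c
#include <stddef.h>

/* Bytes occupied by the elements of three n x n float matrices. */
size_t working_set_bytes(size_t n) {
    return 3 * n * n * sizeof(float);
}

/* Largest n whose three-matrix working set fits in cache_bytes. */
size_t largest_n_fitting(size_t cache_bytes) {
    size_t n = 0;
    while (working_set_bytes(n + 1) <= cache_bytes)
        n++;
    return n;
}
```

With 4-byte floats this gives n = 73 for a 64 KB cache and n = 256 for a 768 KB cache.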
More information
PAPI homepage
• http://icl.cs.utk.edu/papi/
ATLAS
• http://acts.nersc.gov/atlas/