Post on 14-Jan-2016
University of Maryland
Profile-Driven Selective Program Loading
Tugrul Ince
tugrul@cs.umd.edu
Jeff Hollingsworth
Department of Computer Science
University of Maryland, College Park, MD 20742
Motivation
Programs are getting larger!
– Many frameworks and libraries
Many supercomputers lack demand-paging
– Example: Cray XT and BlueGene series
– Available memory is scarce
Observation: Most programs do not use every available function!
– Frameworks and libraries are too general
– Code that handles errors or special cases
Why not remove functions that are not used in the common case?
Aim
Reduce memory footprint by selectively loading parts of shared libraries
Target Platforms and Applications
Unix/Linux systems that support ELF
– Modifies ELF program headers
Applications with many libraries
– Most current reasonable applications
Parallel programs running on multiple nodes
– MPI etc.
Platforms without demand-paging
– Cray XT and BlueGene series
Architecture Overview
Application is profiled.
It is rewritten with
– Modified Shared Libraries
– A Signal Handler
Application is executed as usual.
Profiler
Need a list of never-called functions in each shared library
– Profile the application several times
– May not be perfect
DynInst-based profiler
– Write small program (~ 70 LOC)
– Rewrite shared libraries
– Profile as many times as necessary
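The never-called list amounts to a set difference: every function defined in a library, minus the union of the functions observed across all profiling runs. A minimal sketch, assuming each run yields a set of called function names; `never_called` and the sample names are hypothetical and are not part of the DynInst-based profiler:

```python
def never_called(all_functions, profile_runs):
    """Return the library functions that no profiling run observed."""
    called = set()
    for run in profile_runs:
        called |= set(run)          # union over all runs
    return sorted(set(all_functions) - called)


# Hypothetical library contents and two profiling runs.
library_functions = ["init", "solve", "handle_error", "debug_dump"]
runs = [{"init", "solve"}, {"init", "solve", "debug_dump"}]

unused = never_called(library_functions, runs)
print(unused)
```

More runs can only shrink the list, which is why profiling "as many times as necessary" makes the result safer but never guarantees it is perfect.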
Rewriting
Do not load unused functions
– Modify ELF program headers
– Example: libpetsc.so

Program Headers (original .text entry):
  Type      Offset   VirtAddr   PhysAddr   FileSiz  MemSiz   Flg Align
  LOAD      0x000000 0x00000000 0x00000000 0x124584 0x124584 R E 0x1000

Program Headers (.text entry rewritten as two LOAD entries):
  Type      Offset   VirtAddr   PhysAddr   FileSiz  MemSiz   Flg Align
  LOAD      0x000000 0x00000000 0x00000000 0x090000 0x090000 R E 0x1000
  LOAD      0x112000 0x00112000 0x00112000 0x012584 0x012584 R E 0x1000

Remaining entries:
  Type      Offset   VirtAddr   PhysAddr   FileSiz  MemSiz   Flg Align
  LOAD      0x124584 0x00125584 0x00125584 0x013f8  0x0a434  RW  0x1000
  DYNAMIC   0x12459c 0x0012559c 0x0012559c 0x00130  0x00130  RW  0x4
  GNU_STACK 0x000000 0x00000000 0x00000000 0x00000  0x00000  RW  0x4

First Loadable Section: .text, .init, .fini, .plt
Second Loadable Section: .dynamic, .got, .got.plt, .data, .bss
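To see what splitting one LOAD entry into two buys, the page counts can be derived from the MemSiz values. The sketch below uses only the libpetsc.so header values shown above; the helper names `pages` and `hole` are hypothetical, and the figures it prints describe this one header example, not the measured results reported later.

```python
PAGE_SIZE = 0x1000  # matches the Align column of the LOAD entries


def pages(mem_size, page_size=PAGE_SIZE):
    """Number of pages needed to hold mem_size bytes (rounded up)."""
    return -(-mem_size // page_size)


def hole(first, second):
    """Address range left unmapped between two (vaddr, memsiz) segments."""
    first_end = first[0] + first[1]
    return first_end, second[0]


# (VirtAddr, MemSiz) pairs taken from the example program headers.
original = [(0x00000000, 0x124584)]
modified = [(0x00000000, 0x090000), (0x00112000, 0x012584)]

original_pages = sum(pages(size) for _, size in original)
modified_pages = sum(pages(size) for _, size in modified)
gap_start, gap_end = hole(*modified)

print(original_pages, modified_pages, hex(gap_start), hex(gap_end))
```

The second entry ends at the same address as the original one, so only the range between the two entries is left out of memory.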
Rewriting
Rewriter based on DynInst
Profile data is used to create lists of used and unused functions
Access / modify symbols
Defragment functions to maximize space savings
– Requires moving functions inside shared libraries
Function Defragmentation
[Figure: function layout inside a shared library; legend: Used, Unused]
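Defragmentation can be pictured as computing a new start address for every function so that the used ones sit together. A minimal sketch, assuming functions are given as (name, start, size) tuples and that used functions are packed first; the packing order, the absence of alignment padding, and the name `defragment` are all assumptions for illustration, not the rewriter's documented policy:

```python
def defragment(functions, used_names, base=0):
    """Pack used functions first, then unused ones.

    Returns a dict mapping name -> (old_start, new_start).
    """
    used = [f for f in functions if f[0] in used_names]
    unused = [f for f in functions if f[0] not in used_names]
    moves = {}
    cursor = base
    for name, start, size in used + unused:
        moves[name] = (start, cursor)
        cursor += size              # no alignment padding in this sketch
    return moves


# Hypothetical layout: a and c are used, b and d are not.
layout = [("a", 0x000, 0x100), ("b", 0x100, 0x200),
          ("c", 0x300, 0x080), ("d", 0x380, 0x040)]
moves = defragment(layout, {"a", "c"})
for name, (old, new) in moves.items():
    print(name, hex(old), "->", hex(new))
```

After packing, the used functions occupy one contiguous range, so the pages that hold only unused functions can be left out of the loaded segments.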
Challenges: Relative Calls
Common way of calling functions in PIC.
If either callee or caller is moved, their relative positioning changes.
Offsets in such relative call instructions need to be updated
[Figure: a call to foo with relative offset d before the move, and with updated offset d' after the move]
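The offset update can be made concrete for the 5-byte x86 `call rel32` encoding (opcode 0xE8 followed by a signed 32-bit displacement measured from the end of the instruction). The sketch below is an illustration under that assumption; the helper names and addresses are hypothetical and do not come from the rewriter itself.

```python
import struct

CALL_REL32 = 0xE8   # x86 near call with a 32-bit relative displacement
CALL_LEN = 5        # one opcode byte plus four displacement bytes


def call_target(code, pos, site_addr):
    """Decode the absolute target of the call stored at code[pos]."""
    assert code[pos] == CALL_REL32
    (disp,) = struct.unpack_from("<i", code, pos + 1)
    return site_addr + CALL_LEN + disp


def patch_call(code, pos, site_addr, target_addr):
    """Rewrite the displacement so the call at site_addr reaches target_addr."""
    disp = target_addr - (site_addr + CALL_LEN)
    struct.pack_into("<i", code, pos + 1, disp)


# Hypothetical call at 0x1000 to a callee at 0x2000.
code = bytearray([CALL_REL32, 0, 0, 0, 0])
patch_call(code, 0, 0x1000, 0x2000)
old_target = call_target(code, 0, 0x1000)

# The callee moves to 0x1800 and the call site moves to 0x0400.
patch_call(code, 0, 0x0400, 0x1800)
new_target = call_target(code, 0, 0x0400)
print(hex(old_target), hex(new_target))
```

Because the displacement depends on both addresses, it has to be recomputed whenever either the caller or the callee moves.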
Challenges: Symbols
Runtime linker uses symbols to resolve cross-library calls.
– Uses procedure linkage tables (plt)
If a function is moved, its associated symbol has to be updated.
[Figure: call foo@plt resolves through the foo@plt entry to the symbol foo; the symbol value is 0xdeadbeef before foo is moved and 0xbeefdead after]
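Updating a symbol means rewriting its value field in the symbol table. The sketch below assumes the 64-bit little-endian `Elf64_Sym` layout (st_name, st_info, st_other, st_shndx, st_value, st_size); a 32-bit library uses a different field order, and the function name `relocate_symbols` and the sample entry are hypothetical.

```python
import struct

# Elf64_Sym: st_name, st_info, st_other, st_shndx, st_value, st_size
ELF64_SYM = struct.Struct("<IBBHQQ")


def relocate_symbols(symtab, moves):
    """Rewrite st_value for every symbol whose old address appears in moves."""
    for pos in range(0, len(symtab), ELF64_SYM.size):
        name, info, other, shndx, value, size = ELF64_SYM.unpack_from(symtab, pos)
        if value in moves:
            ELF64_SYM.pack_into(symtab, pos, name, info, other, shndx,
                                moves[value], size)


# One hypothetical symbol entry for foo, using the addresses from the figure.
symtab = bytearray(ELF64_SYM.pack(1, 0x12, 0, 11, 0xDEADBEEF, 64))
relocate_symbols(symtab, {0xDEADBEEF: 0xBEEFDEAD})
new_value = ELF64_SYM.unpack_from(symtab, 0)[4]
print(hex(new_value))
```

Only the value changes; the name, type, and size fields stay as they were, so the runtime linker still finds foo by name and simply resolves it to the new address.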
Challenges: Jump Tables
Used to represent n-way branches at machine level
Targets are read from jump table
– Entries are offsets of targets from the GOT address
Becomes invalid if the function referenced in a jump table is moved
DynInst reads jump tables to generate CFGs
We update entries so that they can be used to point to new location of targets
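Since each entry is the target address minus the GOT address, an entry can be fixed by recovering the old target, looking up its new location, and subtracting the GOT address again. A minimal sketch, assuming signed 32-bit little-endian entries and a hypothetical `moves` map from old to new target addresses:

```python
import struct


def update_jump_table(table, got_addr, moves):
    """Rewrite GOT-relative jump table entries for targets that moved."""
    count = len(table) // 4
    entries = struct.unpack_from(f"<{count}i", table, 0)
    updated = []
    for entry in entries:
        target = got_addr + entry                    # recover the old target
        updated.append(moves.get(target, target) - got_addr)
    struct.pack_into(f"<{count}i", table, 0, *updated)


# Hypothetical table with two targets in a function that moved down by 0x800.
GOT = 0x5000
table = bytearray(struct.pack("<2i", 0x1000 - GOT, 0x1040 - GOT))
update_jump_table(table, GOT, {0x1000: 0x0800, 0x1040: 0x0840})
new_targets = [GOT + e for e in struct.unpack_from("<2i", table, 0)]
print([hex(t) for t in new_targets])
```

Targets that did not move keep their entries unchanged, because the lookup falls back to the old address.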
Unexpectedly Called Function
Execution is not always predictable
– Unexpected function calls
Rewrite original executable with a Signal Handler
Load the function upon an unexpected call
– Signal Handler picks up page faults (SIGSEGV)
– Loads requested page on-demand
– Execution resumes
User-level: No OS modifications
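The fault, load, resume sequence can be illustrated with a small simulation. This is not a real signal handler: an exception stands in for SIGSEGV, a dictionary stands in for mapped memory, and every class and method name here is hypothetical. It only models the control flow described above.

```python
PAGE_SIZE = 0x1000


class PageFault(Exception):
    """Stands in for the SIGSEGV raised when an unloaded page is touched."""

    def __init__(self, page):
        super().__init__(f"page {page:#x} not loaded")
        self.page = page


class SparseImage:
    """Models a library mapped with holes; missing pages come from the file."""

    def __init__(self, file_image, loaded_pages):
        self.file_image = file_image
        self.memory = {p: self._read_page(p) for p in loaded_pages}
        self.faults = 0

    def _read_page(self, page):
        start = page * PAGE_SIZE
        return self.file_image[start:start + PAGE_SIZE]

    def _fetch(self, addr):
        page, offset = divmod(addr, PAGE_SIZE)
        if page not in self.memory:
            raise PageFault(page)
        return self.memory[page][offset]

    def read(self, addr):
        try:
            return self._fetch(addr)
        except PageFault as fault:                   # the "signal handler"
            self.faults += 1
            self.memory[fault.page] = self._read_page(fault.page)
            return self._fetch(addr)                 # execution resumes


file_image = bytes(i % 256 for i in range(3 * PAGE_SIZE))
image = SparseImage(file_image, loaded_pages=[0])
first = image.read(5)                 # page 0 is already loaded
second = image.read(PAGE_SIZE + 7)    # page 1 is loaded on demand
third = image.read(PAGE_SIZE + 8)     # page 1 is now resident
print(first, second, third, image.faults)
```

Each missing page costs one fault the first time it is touched and nothing afterwards, which matches the idea that the handler only matters for unexpected calls.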
Experiments
Tested on
– PETSc ex5 in snes package
– PETSc ex2 in ksp package
– GS2
Compiled with debug flag and no optimization
Used Open MPI
Tested on 64-node cluster at UMD
– Dual-core x86 processors
– Unmodified Linux kernel
Space savings of about 82% on average
PETSc – snes (ex5)
Library Name   Text Pages (Original)   Text Pages (Modified)   Reduction %
petsc          260                     68                      73.85
petscdm        161                     19                      88.2
petscksp       335                     39                      88.36
petscmat       772                     40                      94.82
petscvec       204                     52                      74.51
petscsnes      20                      20                      0
mpi_cxx        10                      5                       50
mpi            142                     37                      73.94
open-pal       62                      34                      45.16
open-rte       55                      34                      38.18
m              28                      3                       89.29
X11            146                     7                       95.21
lapack         866                     2                       99.77
blas           80                      3                       96.25
stdc++         133                     12                      90.98
gcc_s          12                      2                       83.33
Xau            2                       2                       0
Xdcm           3                       3                       0
gfortran       123                     4                       96.75
dl             2                       2                       0
nsl            14                      2                       85.71
util           2                       2                       0
OVERALL        2021                    348                     82.78
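The Reduction % column is consistent with the share of original text pages that are no longer loaded. A small sketch of that arithmetic, checked against two rows of the table; the function name `reduction_percent` is hypothetical:

```python
def reduction_percent(original_pages, modified_pages):
    """Percentage of original text pages that are no longer loaded."""
    saved = original_pages - modified_pages
    return round(saved / original_pages * 100, 2)


petsc = reduction_percent(260, 68)
overall = reduction_percent(2021, 348)
print(petsc, overall)
```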
PETSc – ksp (ex2)
Library Name   Text Pages (Original)   Text Pages (Modified)   Reduction %
petsc          260                     72                      72.31
petscdm        161                     3                       98.14
petscksp       335                     49                      85.37
petscmat       772                     49                      93.65
petscvec       204                     54                      73.53
mpi_cxx        10                      5                       50
mpi            142                     47                      66.9
open-pal       62                      37                      40.32
open-rte       55                      36                      34.55
OVERALL        2001                    352                     82.41
GS2
Library Name   Text Pages (Original)   Text Pages (Modified)   Reduction %
MdsLib         21                      0                       100
MdsShr         21                      0                       100
TdiShr         220                     3                       98.64
TreeShr        38                      0                       100
fftw           70                      25                      64.29
rfftw          58                      8                       86.21
mpi_f77        13                      2                       84.62
mpi            142                     40                      71.83
open-pal       62                      36                      41.94
open-rte       55                      36                      34.55
OVERALL        700                     150                     78.57
Running Times
GS2 takes 5 seconds less on average
– (36m 38s vs. 36m 33s)
Overhead on PETSc examples
– ex2 runs for 2.7 secs, ex5 runs for 1.05 secs.
Results suggest no overhead for reasonably-long running programs
– Initial cost for signal handler registration
– Better instruction cache and TLB performance
Summary
Our tool reduces memory footprint of shared libraries
Rewrite shared libraries with holes
– Defragment functions to maximize space savings
On-demand page loading if a not-yet-loaded function is called
About 82% memory space savings for shared libraries
Might improve instruction cache and TLB performance