Efficient Execution of Compressed Programs · 2011-03-08 · pt go p e g eg2 e nc p t p l x...

Efficient Execution of Compressed Programs

Charles Lefurgy

http://www.eecs.umich.edu/compress

Advanced Computer Architecture LaboratoryElectrical Engineering and Computer Science Dept.

The University of Michigan, Ann Arbor

2

The problem

Embedded Microprocessor

ROMProgram

RAM

I/O

CPU

• Microprocessor die cost

– Low cost is critical for high-volume, low-margin embedded systems

– Control cost by reducing area and increasing yield

• Increasing amount of on-chip memory

– Memory is 40-80% of die area [ARM, MCore]

– In control-oriented embedded systems, much of this is program memory

• How can program memory be reduced without sacrificing performance?

3

Solution

Embedded Systems

Original Program

ROMProgram

RAM

I/O

CPU

Compressed Program

RO

MRAM

I/O

CPU

• Code compression– Reduce compiled code size– Compress at compile-time– Decompress at run-time

• Implementation– Hardware or software?– Code size?– Execution speed?

4

Research contributions• HW decompression [MICRO-32, MICRO-30, CASES-98]

– Dictionary compression method

– Analysis of IBM CodePack algorithm

– HW decompression increases performance

• SW decompression [HPCA-6, CASES-99]

– Near native code performance for media applications

– Software-managed cache

– Hybrid program optimization (new profiling method)

– Memoization optimization

5

Outline

Compression overview and metrics

HW decompression overview

SW decompression

• Compression algorithms

– Dictionary

– CodePack

• Hardware support

• Performance study

• Optimizations

– Hybrid programs

– Memoization

6

Why are programs compressible?

• Ijpeg benchmark (MIPS gcc 2.7.2.3 -O2)

– 49,566 static instructions– 13,491 unique instructions

– 1% of unique instructions cover 29% of static instructions

1

10

100

1,000

0 5,000 10,000 15,000

Unique instruction bit patterns

Number

810 (jalr $31,$2)

13,491

7

Evaluation metrics

• Size

• Decode efficiency– Measure program execution time

sizeoriginal

sizecompressedratio ncompressio =

0%

20%

40%

60%

80%

100%

Originalprogram

Compressedprogram

Compression ratio

8

• Many compression methods– Compression unit: instruction, cache line, or procedure

• Difficult to compare previous results– Studies use different instruction sets and benchmarks

• Many studies do not measure execution time• Wire code

– Small size: a goal for embedded systems– Problem: no random access, must decompress entire program at once

Who Instruction Set

Compression Ratio

Comment

Thumb ARM 70% 16-bit instruction subset of 32-bit ISA MIPS-16 MIPS 60% 16-bit instruction subset of 32-bit ISA CodePack PowerPC 60% Cache line compression Wolfe MIPS 73% Cache line, Huffman Lekatsas MIPS, x86 50%, 80% Cache line, stream division, Huffman Araujo MIPS 43% Cache line, op. factorization, HuffmanLiao TMS320C25 82% Procedure abstraction Kirovski SPARC 60% Procedure compression Ernst SPARC 20% Interpreted wire code

Previous results

Hardware decompression

10

CodePack

• Overview– IBM– PowerPC instruction set

– 60% compression ratio, ±10% performance [IBM]• performance gain due to prefetching

• Implementation– Binary executables are compressed after compilation

– Decompression during instruction cache miss• Instruction cache holds native code

• Decompress two cache lines at a time (16 insns)

– PowerPC core is unaware of compression

11

CodePack encoding

Encoding for upper 16 bits Encoding for lower 16 bits

32

64

128

256

Tag

Index

Escape

Raw bits

0 1

1 0 0

1 0 1

1 1 0

x x x x x x

xx x x x x x x

xx x x x x x x x

xx x x x x x x x x x x x x x x x1 1 1

0 0

x x x x x

x x x

0 1 x x x x

1 0 0

xx x x x x x x1 0 1

xx x x x x x x x1 1 0

xx x x x x x x x x x x x x x x x1 1 1

0 0

x x x x x

8

16

32

128

256

1 Encodes zero

0 311615

32-bit PowerPC instruction word

12

CodePack system

• CodePack is part of the memory system– After L1 instruction cache

• Dictionaries– Contain 16-bit upper and lower halves of instructions

• Index table– Maps instruction address to compressed code address

main memory

Processor

Dictionaries

DecompressorIndex table

Compressed code

Instruction cache CodePack

Instruction memory hierarchy

13

CodePack decompression

Decompress

Byte-aligned block address

L1 I-cache miss address

Fetch index

Fetch compressed instructions

Native Instruction

Low dictionary

Compression Block(16 instructions)

312625650

Index table(in main memory)

Compressed bytes(in main memory)

Hi tag Low tag Low indexHi index

High 16-bits Low 16-bits

High dictionary

1 compressed instruction

14

Instruction cache miss latency

• Native code uses critical word first• Compressed code must be fetched sequentially

Instruction cache missInstructions from main memory

Instruction cache missIndex from index cacheCodes from main memory

Instruction cache miss

Codes from main memoryDecompressor

Two Decompressors

a) Native code

b) Compressed code

c) Compressed code+ optimizations

t=0

L1 cache miss

Fetch index

Fetch instructions (first line)

Fetch instructions (remaining lines)

Decompression cycle

Critical instruction word

A

B

10 30201 cycle

Index from main memory

15

Comparison of optimizations

• Index cache provides largest benefit• Optimizations

– index cache: 64 lines, 4 indices/line, fully assoc.

– 2nd decoder

• Speedup over native code: 0.97 to 1.05• Speedup over CodePack: 1.17 to 1.25

0

0.2

0.4

0.6

0.8

1

1.2

cc1 go perl vortex

Speedup over native code CodePack

index cache2nd decoderboth optimizations

16

Hardware decompression conclusions

• Performance can be improved at modest cost– Remove decompression overhead

• index lookup

• dictionary lookup

– Better memory bus utilization

• Compression can speedup execution– Compressed code requires fewer main memory accesses

– CodePack includes simple prefetching

• Systems that benefit most from compression– Narrow memory bus

– Slow memory

Software decompression

18

Software decompression

• Previous work– Whole program compression [Tauton91]

• Saved disk space• No memory savings

– Procedure compression [Kirovski97]• Requires large decompression memory

• Fragmentation of decompression memory• Slow

• My work– Decompression unit: 1 or 2 cache-lines

– High performance focus

19

How does code compression work?• What is compressed?

– Individual instructions• When is decompression performed?

– During I-cache miss• How is decompression implemented?

– I-cache miss invokes exception handler• What is decompressed?

– 1 or 2 cache lines• Where are decompressed instructions stored?

– I-cache is the decompression buffer

Compressed code

Main memory (eDRAM)

compressed programIndex tableDictionary

Program data

Native codeProgram & decompressor

data

Decompressedinstructions

I-cache (SRAM)

D-cache (SRAM)

Processor

System-on-a-chip

Decompressorprogram

20

Dictionary compression algorithm

• Goal: fast decompression• Dictionary contains unique instructions• Replace program instructions with short index

lw r2,r3

lw r2,r3

lw r15,r3

lw r15,r3

lw r15,r3

32 bits

.text segment

Original program

5

5

30

30

30

16 bits

.text segment (contains indices)

Compressed program

lw r2,r3

lw r15,r3

32 bits

.dictionary segment

21

Decompression

• Algorithm1. I-cache miss invokes decompressor (exception handler)2. Fetch index3. Fetch dictionary word4. Place instruction in I-cache (special instruction)

• Write directly into I-cache• Decompressed instructions only exist in I-cache

Proc.

�

ó

ìö

D-cache

I-cache

Memory

Add r1,r2,r3

5

Dictionary

Indices...

22

Hardware support

• Decompression exception– Raise exception on I-cache miss– Exception not raised on native code section (allow hybrid programs)

– Similar to Informing Memory [Horowitz98]

• Store-instruction instruction – MAJC, MIPS R10K

23

Two software decompressors

• Dictionary– Faster– Less compression

• CodePack– A software version of IBM’s CodePack hardware

– Slower

– More compression

Dictionary CodePack

Codewords (indices) Fixed-length Variable-length

Decompress granularity 1 cache line 2 cache linesStatic instructions 43 174

Dynamic instructions 43 1042-1062

Decompression overhead 73-105 cycles 1235-1266 cycles

24

Compression ratio

•

– CodePack: 55% - 63%– Dictionary: 65% - 82%

size originalsize compressed

ratio ncompressio =

0%10%20%30%40%50%60%70%80%90%

100%

cc1

ghostsc

ript go

ijpeg

mpeg

2enc

pegwit

perl

vorte

x

Compression ratio

Dictionary

CodePack

25

Simulation environment

• SimpleScalar• Pipeline: 5 stage, in-order

• I-cache: 4KB, 32B lines, 2-way

• D-cache: 8KB, 16B lines, 2-way

• Memory: embedded DRAM– 10 cycle latency

– bus width = 1 cache line– 10x denser than SRAM caches

• Performance: slowdown = 1 / speedup (1 = native code)• Area results include:

– Main memory to hold program (compressed bytes, tables)– I-cache

– Memory for decompressor optimizations (memorization, native code)

26

Performance

• CodePack: very high overhead• Reduce overhead by reducing cache misses

02468

101214161820

I-cache size (KB)

Slowdown

CodePack

Dictionary

Native

CodePack 19.19 3.17 1.26

Dictionary 2.66 1.43 1.04

Native 1.00 1.00 1.00

4KB 16KB 64KB

Ghostscript

27

Cache miss

• Control slowdown by optimizing I-cache miss ratio• Small change in miss ratio large performance impact

0

5

10

15

20

25

30

35

40

0% 2% 4% 6% 8%

I-cache miss ratio

Slowdown relative to

native code

CodePack 4KBCodePack 16KBCodePack 64KBDictionary 4KBDictionary 16KBDictionary 64KB

28

Two optimizations• Hybrid programs (static)

– Both compressed and native code

• Memoization (dynamic)– Cache recent decompressions in main memory

• Both can be applied to any compression algorithm

nativeOriginal Program

compressedCompressed

compressedHybrid native

Memoization compressed memo

compressedCombined native memo

Program memory

29

Hybrid programs

• Selective compression– Only compress some procedures– Trade size for speed

– Avoid decompression overhead

• Profile methods– Count dynamic instructions

• Example: ARM/Thumb• Use when compressed code has more instructions

• Reduce number of executed instructions

– Count cache misses• Example: CodePack• Use when compressed code has longer cache miss latency• Reduce cache miss latency

30

Cache miss profiling

• Cache miss profile reduces overhead 50%• Loop-oriented benchmarks benefit most

– Profiles are different in loop regions

Pegwit (encryption)

1.00

1.02

1.04

1.06

1.08

1.10

1.12

60% 70% 80% 90% 100%

Compression ratio

Slowdown relative to

native code

CodePack: dynamic instructions

CodePack: cache miss

31

Code placement

adict b c d

Original code

b c d A B C

Whole compressiondecompress region (in L1 cache only)compressed code

Selective compression

A B C

A

B

B D

Dnative regioncompressed code

Memory

adict

cadict

cadict

decompress region

D

D

C

Decompress

Decompress

Different order!

Same order

32

CodePack vs. Dictionary

• More compression may have better performance– CodePack has smaller size than Dictionary compression– Even with some native code, CodePack is smaller

– CodePack is faster due to using more native code

Ghostscript

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

60% 70% 80% 90% 100%Compression ratio

Slowdown relative to native

code

CodePack: cache miss

Dictionary: cache miss

33

Memoization

• Reserve main memory for caching decompressed insns.– Use high density DRAM to store more than I-cache in less area– Manage as a cache: data and tags

• Algorithm– Decompressor checks memo table before decompressing– On hit, copy instructions into I-cache (no decompression)– On miss, decompress into I-cache and update memo table

decompress

address

cache line

No memoization With memoization

decompress

address

cacheline

tag native code

= ?

34

Memoization results

• 16KB memoization table• CodePack: large improvement • Dictionary is already fast: small improvement

0

1

2

3

4

5

cc1

ghostsc

ript go

ijpeg

mpeg

2enc

pegwit

perl

vorte

x

Dictionary

Memo

0

5

10

15

20

25

30

cc1

ghostsc

ript go

ijpeg

mpeg

2enc

pegwit

perl

vorte

x

Slo

wd

ow

n

CodePack

Memo

35

0

2

4

6

8

10

12

14

cc1

ghos

tscrip

t go

ijpeg

mpe

g2en

c

pegw

it

perl

vorte

x

Slo

wdo

wn

0KB memo4KB memo8KB memo16KB memo32KB memo (not hybrid)

Combined

• Use memoization on hybrid programs– Keep area constant. Partition memory for best solution.

• Combined solution is often the best

CodePack with 32KB decompression buffer

36

Combined

• Adding DRAM is better than larger SRAM cache

Ghostscript using CodePack

02468

101214161820

60% 70% 80% 90% 100%Compression Ratio

Slo

wdo

wn

0 KB Memo 4.5 KB Memo9 KB Memo 18 KB Memo36 KB Memo 72 KB Memo144 KB Memo 288 KB Memo4 KB I-cache 8 KB I-cache16 KB I-cache 32 KB I-cache64 KB I-cache Native

9 KB total

288 KB total

18 KB total

36 KB total72 KB total

144 KB total

37

Conclusions

• High-performance SW decompression possible– Dictionary faster than CodePack, but 5-25% compression ratio difference– Hardware support

• I-cache miss exception• Store-instruction instruction

• Tune performance– Cache size– Hybrid programs, memoization

• Hybrid programs– Use cache miss profile for loop-oriented benchmarks

• Memoization– No profile required, but has start-up latency– Effective on large working sets

38

Future work

• Compiler optimization for compression– Improve compression ratio without affecting performance

• Unify selective compression and code placement– Reduce I-cache miss to improve performance

• Energy consumption– Increasing run-time may use too much energy

– Improving bus utilization saves energy

• Dynamic code generation: Crusoe, Dynamo– Memory management, code stitching

• Compression for high-performance– Lower L2 miss rate

39

Web page

http://www.eecs.umich.edu/compress

Efficient Execution of Compressed Programs · 2011-03-08 · pt go p e g eg2 e nc p t p l x...

Documents

Transcript of Efficient Execution of Compressed Programs · 2011-03-08 · pt go p e g eg2 e nc p t p l x...