Code Compression in Embedded Systems
_____________________________________________________________________________________________
____________________________________________________________________________________________Page 2 of 39
Introduction

Embedded systems are becoming increasingly popular as consumers grow accustomed to carrying electronic
devices with them. For many people, handheld devices such as cellular phones, pagers, digital cameras, PalmPilots,
and PocketPCs have become like wristwatches; the owner simply would not go anywhere without them. It is no
surprise that some believe embedded systems will become the next major computer science research field,
following the current boom of interest in networks and the earlier interest in parallel processing.
Problem Statement

Embedded devices’ popularity conceals the constraints that designers face when developing for these systems.
Embedded systems that drive handheld devices are designed to be small so as to be comfortable to hold and carry.
These systems are targeted toward large volume sales, hence they are cost-sensitive. Cost is strongly associated
with size of the integrated circuits (ICs) used in a device [LiDK 1999]. Much of an embedded system’s IC space is
devoted to memory for program code and data [LiDK 1999]. Many embedded systems are further limited by the
need to operate off a limited power source, usually batteries, hence power consumption is a major concern when
designing their components. Larger storage, memory, and cache sizes, and faster CPUs all contribute to space, cost,
and power use. Because of these factors, memory and storage space both are limited in handheld devices, affecting
designers, developers and users.
RISC has become the dominant CPU design paradigm, replacing CISC. Storing CPU control signals within the
instruction, finely tuning pipelines, and fixing instruction lengths have all contributed to the performance of RISC
architectures. RISC does, however, increase instruction size, increasing demand for memory.
Use of High Level Languages (HLL) lowers development and maintenance cost and time. Statistics on
programming language use gathered by Dataquest show use of HLLs replacing assembly language in embedded
systems [LiDK 1999]. Handheld systems are more prone to use HLLs. The drawback of these languages is the
increased code size. Code size increases both due to code bloat and due to compilers that have traditionally been
optimized for speed, not size. As the gap between CPU and memory speed widens, the amount of slowing due to
cache misses is further enlarged.
A solution to many of these constraints is code compression. Compression can be used to reduce the size of
programs in storage and memory. In so doing, it can reduce the size of components needed, lowering power
requirements, size, and cost.
Benefits of Code Compression

Code compression has many benefits. The reduced program size can be used to reduce the size of storage necessary
on ICs. This can save production costs. Since memory consumes a significant amount of an embedded system’s
power and power consumption is proportional to IC area, battery life can be extended [YSOO 1997]. Smaller chips
also have lower capacitance, which lowers power consumption [LeHW 2000:2].
When a cache miss occurs, the data brought to the instruction cache (I-cache) is loaded from main memory
compressed, so less data needs to be transmitted on the bus between memory and I-cache. If the code is compressed
in the I-cache, then less data is sent out from the I-cache with each instruction fetch. Thus, fewer bit toggles are
necessary on bus lines, which reduces power consumption [ViZA 2000]. Furthermore, if code in the I-cache is
uncompressed, then less data is transmitted on the bus from memory to the I-cache, so the cache may be filled
sooner. If code in the I-cache is compressed, then it is able to hold many more instructions, reducing the cache miss
rate. Both of these cases cause the CPU to pause for less time due to cache misses. Since the CPU is the main
consumer of power, compression can result in significant power savings [ViZA 2000, LeHW 2000:3]. If memory
and/or cache are made smaller, their effective capacitance decreases, further reducing power consumption [LeHW
2000:2]. Because there are fewer bus transactions and each transaction is shorter, compression may also increase
performance (reduce program execution time).
Introduction to Data Compression

Compression schemes try to reduce the number of bits required to represent data. In the following section, key
concepts of data compression and common data compression techniques will be introduced.
Measuring Compression

One way to measure compression is the amount by which data is reduced in size, expressed as a percentage.
For example, consider a 49,402,412 byte sound clip that is compressed to a 4,481,024 byte MP3 file. The reduction
in size is 1 - 4,481,024 / 49,402,412 = 90.93%.
A more common measure of compression is the compression ratio, which is defined as the ratio of the compressed
data to the original data. It may be expressed as a decimal number or as a percentage. In our example, the
compression ratio is 4,481,024 / 49,402,412 = .0907 or 9.07 %. Such a compression ratio is not uncommon for MP3
files.
Entropy

Suppose a random experiment is taking place. Let P(A) be the probability that some event A will occur. The
information content of A is denoted i(A) and defined as:
i(A) = -log_b P(A)
Suppose that the set of independent events A_i represents the set of all possible outcomes of a random experiment,
and let P(A_i) be the probability that event A_i occurs. The average information associated with the experiment is:
H = Σ_i P(A_i) i(A_i) = -Σ_i P(A_i) log_b P(A_i)
This quantity is called entropy. Claude Shannon showed that for b = 2, entropy is the average number of bits
needed to encode the output of a source. He further showed that encoding the output of a source with an average
number of bits equal to the entropy of the source is the best any lossless compression scheme can achieve. As a
consequence, known compression schemes attempt to compress files to sizes close to the entropy, but cannot
compress programs smaller than it.
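The definition above can be sketched directly; the following is a minimal illustration (Python; the function name is an illustrative choice, not from the source):

```python
import math

def entropy(probs, b=2):
    """Average information H = -sum p_i * log_b(p_i), skipping zero-probability events."""
    return -sum(p * math.log(p, b) for p in probs if p > 0)

# A fair coin carries one bit of information per toss on average:
print(entropy([0.5, 0.5]))  # 1.0
```

For a uniform distribution over 2^k outcomes the entropy is exactly k bits, which is why no code can beat k bits per symbol for such a source.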
Codeword Classification

Codes can be classified as fixed, static, or adaptive based on the type of codewords (that is, output words)
that they generate. In fixed codes, for all input streams, each input symbol corresponds to the same codeword. In
static codes, the relationship between input symbols and codewords is selected by a model that matches a particular
input stream, but does not change during the processing of that stream. Additional information such as the model or
a dictionary must be transmitted with the encoding in order for the decoder to recover the original sequence.
Adaptive codes feature codewords that change while a given input stream is processed; selection and modification
of codewords is dictated by a model.
Dictionary Methods

Dictionary methods for performing compression create a table of frequently occurring symbols or sequences of
symbols. When the encoder encounters a symbol or sequence that is in its dictionary, it will place a special escape
character on the output followed by the index into the dictionary of that symbol. Regular symbols are simply
encoded as themselves. In applications where the table is large and has a high probability of containing an encoding
for each symbol, this scheme can also be reversed, with escape characters used only before an input symbol is
written directly to the output.
Construction of the dictionary can be done by hand, or by heuristic methods, especially those that examine the
typical source data stream. Fixed codes will have the same table for all sources, static codes will provide a new
table for each source, and adaptive codes will create the table while processing the source. Dictionary methods
using static codes are common in the compression of program code; the table is created by the encoder during compression
and is saved with the compressed output.
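The escape-character scheme described above can be sketched as follows (Python; a simplified symbol-level illustration — the escape marker value and function name are assumptions, and a real system would choose an escape byte that cannot occur in the data):

```python
ESC = "\x00"  # assumed escape marker

def dict_encode(symbols, table):
    """Emit ESC + dictionary index for symbols in the table; pass other symbols through."""
    out = []
    for s in symbols:
        if s in table:
            out.append(ESC + str(table[s]))  # escape, then index into the dictionary
        else:
            out.append(s)                    # regular symbols encode as themselves
    return out

print(dict_encode(["the", "cat", "the"], {"the": 0}))
```

With a static code, the `table` built for a particular input would be transmitted alongside the encoded stream so the decoder can invert the mapping.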
Markov Models

When information is available about the data sets that are to be compressed, models can be constructed that allow
more efficient compression algorithms. The simplest model is to assign probabilities to each letter of the alphabet.
Markov Models, named after mathematician Andrei Markov, combine the history of elements with probabilities. A
kth-order Markov model uses the history of the last k elements to appear on the input stream to predict the next
symbol. Markov Models are often used in fixed codes.
Each possible history is associated with a state, so each state has m outgoing transitions (one for each possible next input
symbol), and there are m^k possible states in total. Each transition from a state has a probability associated with it;
because the state is reached by the last k transitions taken, that probability is based upon the last k elements encoded.
As a simple example, consider the English language. If we knew that the last encoded character was the letter q,
there would be a high probability that the next character is u.
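A first-order (k = 1) model of this kind can be estimated from sample text; the sketch below (Python; names are illustrative) counts adjacent-character transitions and normalizes them into probabilities:

```python
from collections import Counter, defaultdict

def first_order_model(text):
    """Estimate P(next char | previous char) from adjacent pairs in the sample."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    # Normalize the counts for each history into a probability distribution
    return {p: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for p, cnt in counts.items()}

model = first_order_model("a quick queen quietly quit")
print(model["q"])  # in this sample, every q is followed by u
```

An encoder paired with such a model can give very short codewords to the likely transitions (like q followed by u) in each state.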
Static Huffman Codes

Huffman coding is a bottom-up method for organizing elements into a tree. The shape of the built tree determines
the codeword for each element.
Huffman coding begins by sorting the elements in decreasing order of probability (or frequency) of occurrence.
Each element is listed along with its probability. Each element becomes a leaf. While more than one value is in the
list, the following is performed:
• A new node is created, and the two smallest elements are removed from the list and made the node’s children. The probability of the new node is the sum of the probabilities of the two children.
• The new node is inserted as an element in the list of nodes, in a place corresponding to the new node’s probability. In the event of a tie, the new node is inserted before other elements of the same probability.
Thus bigger and bigger nodes are created with each iteration, until the final root node is created having the
probability 1. At that point, the only element in the list represents the root node.
As an example, consider the following alphabet and probabilities:
letter:      a    s    e    t    i    r    p    j
probability: .20  .18  .17  .15  .10  .08  .07  .05
Merging the smallest elements, we would form the list shown below; the initial tree is the single node X1 (.12) with children p (.07) and j (.05):

letter:      a    s    e    t    X1   i    r
probability: .20  .18  .17  .15  .12  .10  .08
The final result of continuing the algorithm would be the complete tree; its internal nodes carry the probabilities .12, .18, .27, .35, .38, and .62, with the root at 1.00.
The codeword for a letter can be determined by tracing a path from the root node to the leaf node representing that
letter. Each time a left branch is taken, a 0 is emitted. Each time a right branch is taken, a 1 is emitted. In the
example above, the codeword for p is 0110, and the codeword for i is 110.
Since the tree represents the dictionary of codewords, the tree or its equivalent as a dictionary must be transmitted
along with any compressed sequence. Huffman codes are provably optimal, as was described in [Fano 1961].
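The list-merging procedure above maps naturally onto a priority queue. The sketch below (Python; tie-breaking by insertion order is an implementation choice, so individual codewords may differ from those in the example even though the code lengths stay optimal) builds the tree for the example alphabet:

```python
import heapq

def huffman_codes(probs):
    """Build a Huffman tree bottom-up; read codewords off root-to-leaf paths."""
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)   # the two smallest become children
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, counter, (left, right)))
        counter += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node: 0 = left, 1 = right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

probs = {"a": .20, "s": .18, "e": .17, "t": .15, "i": .10, "r": .08, "p": .07, "j": .05}
codes = huffman_codes(probs)
avg = sum(probs[s] * len(codes[s]) for s in probs)
print(avg)  # average codeword length: 2.92 bits for this alphabet
```

The average length of 2.92 bits per symbol holds for any valid tie-breaking, even though the individual codewords may vary.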
Arithmetic Coding

Arithmetic coding can generate a unique identifier tag for an input sequence of a given length without generating
codewords for all sequences of that length. Instead, sequences are represented as real numbers on the interval 0 <= x <
1. To generate a tag, we order the letters of the alphabet and define a cumulative distribution function:
F_X(i) = Σ_{k=1}^{i} P(X = k)
where i is the index of the symbol being encoded. Thus, the value returned by the function for each symbol in the
alphabet is greater than or equal to the value returned for the previous alphabet symbol.
For instance, suppose the alphabet consists of the following characters with the probabilities given:
letter:      a    b    c    d
probability: 40%  30%  15%  15%
We could let the cumulative distribution function for each character be:

letter i:  a     b     c     d
F_X(i):    0.40  0.70  0.85  1.00
Arithmetic coding algorithms generally work in the following manner. The interval under consideration begins as
[0.0, 1.0). Let [l_i, h_i) represent the current lower and higher bounds of the interval. For each letter x_i read from
the input, the corresponding cumulative value F_X(x_i) is found, and the interval is shrunk to [l_{i+1}, h_{i+1}) where:

l_{i+1} = l_i + (h_i - l_i) * F_X(x_i - 1)
h_{i+1} = l_i + (h_i - l_i) * F_X(x_i)

Here F_X(x_i - 1) denotes the cumulative value of the symbol immediately preceding x_i in the alphabet ordering, or 0 if x_i is the first symbol.
Algorithms either terminate by sending both the lower and higher bounds of the interval corresponding to the
sequence to be encoded, or by sending any value within the interval (but preceding transmission by the number of
characters to be encoded).
As an example, consider the sequence “bcac”. The interval begins as [0.00, 1.00). F_X(a) is 0.40, F_X(b) is 0.70.
Encoding the letter b results in the range [0.00 + (1.00 - 0.00) * 0.40, 0.00 + (1.00 - 0.00) * 0.70) = [0.40, 0.70).
Encoding the letter c results in the range [0.40 + (0.70-0.40) * 0.70, 0.40 + (0.70-0.40) * 0.85) = [0.61, 0.655).
Encoding the letter a results in the interval [0.61, 0.628). Encoding the last letter c results in the interval [0.6226,
0.6253). The encoder can either transmit this range or can transmit that the number of characters encoded is 4
followed by any value from this range.
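The interval-narrowing steps above can be sketched directly (Python; function and parameter names are illustrative, and a practical coder would use rescaling to avoid the precision loss of raw floats):

```python
def arithmetic_interval(seq, cdf, order):
    """Narrow [lo, hi) once per symbol using the cumulative values F_X."""
    lo, hi = 0.0, 1.0
    for sym in seq:
        idx = order.index(sym)
        f_prev = cdf[order[idx - 1]] if idx > 0 else 0.0  # F_X of the preceding symbol
        lo, hi = lo + (hi - lo) * f_prev, lo + (hi - lo) * cdf[sym]
    return lo, hi

cdf = {"a": 0.40, "b": 0.70, "c": 0.85, "d": 1.00}
lo, hi = arithmetic_interval("bcac", cdf, ["a", "b", "c", "d"])
print(lo, hi)  # approximately 0.6226 0.6253, matching the worked example
```

Any single value inside the final interval, together with the sequence length, identifies “bcac” uniquely.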
An additional useful fact about arithmetic coding is that the average codeword length l_A for a string of length m is
bounded by:

H(X) <= l_A <= H(X) + 2/m
Therefore, arithmetic compression is nearly optimal for any distribution. The drawback is that coding delay can be
very long.
Lempel-Ziv 78

In Lempel-Ziv 78 (LZ78), a table is built as the encoder processes the input stream using a greedy matching scheme.
The algorithm operates as follows. The longest prefix of the input stream that exists in the dictionary is found;
label this prefix P. If no such prefix exists, P is set to null. Let p be the index of P in the dictionary, or the
special value ø if P is null. The input character c that follows P is then read, and the pair (p, c) is transmitted.
Finally, the string P•c is added to the dictionary, where • represents concatenation.
As an example, suppose we were to encode the string: “abbccccccbbdcc”. The following would be the output values
and the values added to the dictionary:
parse    add to dictionary    dictionary entry number
(ø, a)   a                    1
(ø, b)   b                    2
(2, c)   bc                   3
(ø, c)   c                    4
(4, c)   cc                   5
(5, b)   ccb                  6
(2, d)   bd                   7
(5, ø)
A popular implementation of LZ78 is the unix command compress. Implementations of LZ78 use some kind of
dictionary replacement policy, such as discarding the dictionary when performance degrades below some threshold.
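A minimal sketch of the LZ78 encoder described above (Python; ø is represented here as index 0 and as an empty string, and the function name is illustrative):

```python
def lz78_encode(text):
    """Greedy LZ78: emit (dictionary index of longest known prefix, next char)."""
    dictionary = {}   # string -> entry number, numbered from 1
    out = []
    w = ""            # the current prefix P
    for c in text:
        if w + c in dictionary:
            w += c                                 # keep extending the match
        else:
            out.append((dictionary.get(w, 0), c))  # 0 plays the role of ø
            dictionary[w + c] = len(dictionary) + 1
            w = ""
    if w:
        out.append((dictionary[w], ""))            # trailing match, no following char
    return out

print(lz78_encode("abbccccccbbdcc"))
```

On the example string this parses the input as a | b | bc | c | cc | ccb | bd | cc, each new phrase extending an earlier dictionary entry by one character.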
Lempel-Ziv 77

An even simpler compression model, Lempel-Ziv 77 (LZ77), maintains a sliding, fixed-size window over the stream
being compressed.
At each step, the algorithm either outputs a character or an integer pair. If the first character on the input stream
does not occur in the window, then the character itself is output. Otherwise, the longest matching prefix of the input
stream is found in the window. The offset of the prefix, measured backwards into the window, is output along with
the length of the prefix. As each character is matched from the input, it is shifted onto the end of the window, so
matched prefixes can actually extend past the current window contents.
As an example, consider the string: “ababcccccdabab”. The following would be the output and window at each step
of encoding for a window of maximum size 8:
output    window (after processing)
a         a
b         ab
(2, 2)    abab
c         ababc
(1, 4)    babccccc
d         abcccccd
(8, 2)    cccccdab
(2, 2)    cccdabab
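A sketch of this scheme (Python; names are illustrative, and a real implementation would bit-pack the tokens). Matching against the source text directly allows the self-overlapping matches mentioned above, and the decoder copies byte-by-byte for the same reason:

```python
def lz77_encode(text, wsize=8):
    """Emit literal characters, or (offset-back, length) pairs for repeated prefixes."""
    out, i = [], 0
    while i < len(text):
        window = text[max(0, i - wsize):i]
        if text[i] not in window:
            out.append(text[i])          # literal character
            i += 1
            continue
        best_off, best_len = 0, 0
        for off in range(1, len(window) + 1):
            length = 0
            # comparing against the source lets matches run past position i
            while i + length < len(text) and text[i - off + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        out.append((best_off, best_len))
        i += best_len
    return out

def lz77_decode(tokens):
    out = []
    for tok in tokens:
        if isinstance(tok, str):
            out.append(tok)
        else:
            off, length = tok
            for _ in range(length):      # copy one at a time so overlaps work
                out.append(out[-off])
    return "".join(out)

tokens = lz77_encode("ababcccccdabab")
print(tokens)  # a, b, (2, 2), c, (1, 4), d, (8, 2), (2, 2)
```

Note the (1, 4) token: the match begins one character back but runs four characters, reproducing the run of c's from itself.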
Move-to-Front

Move-to-front is designed to operate on data sets consisting of words. As a dictionary, both the encoder and decoder
maintain a list of words ordered by how recently each was seen. When a word M is encountered on the input stream and
M is in the dictionary, the encoder outputs the index of M and then moves M to the front of the dictionary. The
decoder likewise retrieves M from its list by the transmitted index, then moves M to the front of its list. When a
word M is encountered that is not in the dictionary, the encoder outputs as an index 1 + the current size of the list,
followed by M itself. When the decoder receives an index beyond the end of its list, it knows that the word M
follows. Both encoder and decoder finish by placing the word M at the front of the list.
As an example, consider the string “the first man went in first before the man first in line”. The following is a table
showing the word being considered, the encoder’s output, and the list after the output is sent.
current word   output       list after output
the            1, the       the
first          2, first     first the
man            3, man       man first the
went           4, went      went man first the
in             5, in        in went man first the
first          4            first in went man the
before         6, before    before first in went man the
the            6            the before first in went man
man            6            man the before first in went
first          4            first man the before in went
in             5            in first man the before went
line           7, line      line in first man the before went
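The encoder side of this scheme can be sketched as follows (Python; the output format — a bare index for known words, an (index, word) pair for new ones — is an illustrative choice):

```python
def mtf_encode(words):
    """Emit the 1-based index of known words, or (list size + 1, word) for new words."""
    lst, out = [], []
    for w in words:
        if w in lst:
            out.append(lst.index(w) + 1)   # transmit the word's current position
            lst.remove(w)
        else:
            out.append((len(lst) + 1, w))  # an index past the end signals a new word
        lst.insert(0, w)                   # move (or place) w at the front
    return out

sentence = "the first man went in first before the man first in line"
print(mtf_encode(sentence.split()))
```

This reproduces the outputs in the example above: recently repeated words get small indices, which a back-end coder can then encode in few bits.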
Special Requirements of Code Compression Systems

In order to be useful, the compression system itself must be small. This implies a simple compression algorithm so
as to use little processor time and take up little memory (or to use less power and fit on a smaller IC for hardware
compression). The system must run all existing programs correctly.
Control transfer instructions present a problem for compression systems. The target of a branch instruction in
uncompressed code will change when the code is compressed. Programs also frequently contain relative or
computed branch targets. Researchers have developed several solutions.
Most existing compression systems, such as the Lempel-Ziv algorithms used by compress, gzip, and pkzip, use a
beginning-to-end, variable-length encoding. Thus, they require that the entire program be decompressed from beginning
to end before it can be used. Because these compression schemes would not reduce memory use, researchers have
developed special compression systems capable of decompressing individual blocks or instructions of a program
independently, allowing random access to the program.
Classifications of Code Compression Systems

A survey of code compression systems must organize them by some means of classification. Several criteria are
described in this section.
An obvious classification criterion is the method of compression used. Many systems use some sort of dictionary
building technique. Many of these dictionary techniques operate by preprocessing the source code or executable
while scanning for frequent patterns, then filling dictionaries using heuristic selection criteria. Several specifically
use Huffman coding to build dictionaries. Others use fixed Markov models, arithmetic coding, or even original and
specialized tree building methods. This paper organizes code compression systems by the method of compression.
Classification may also be done based on the location where decompression takes place. As was mentioned above,
complete decompression of the program as it is loaded into memory is not practical for embedded systems.
Decompression might take place as the compressed code is loaded into the instruction cache, in the instruction
cache, or between the instruction cache and the CPU.
Most researchers never tested their compression ideas on real systems. As Wolfe and Chanin pointed out, running their
compression systems “on actual ebmedded (sic) applications would have been preferable, [but] accurate execution traces
from these types of programs were not available” [WoCh 1992]. Many researchers ran their compression systems
on benchmarks, such as I-Cheng Chen et al or Jin and Chen who used SPEC95 [ChBM 1997, JiCh 2000]. Some
created their own benchmarks, suggesting that popular programs were important to test, as was done by Liao who
compressed programs like compress, gzip, and jpeg [LiDK 1999]. Creation of benchmarks was especially common
for researchers limited to domains where common benchmarks didn’t exist, like the small memory Java domain
studied by Clausen et al who compressed standard Java libraries [CPCM 2000]. Benchmarks were also created
when certain programs were found to demonstrate noteworthy effects [LeHW 2000:3]. A few researchers only simulated
execution on hardware design tools, which may lead to questionable results, although simulation was often done out
of necessity, since the researchers were focused on quantities that are difficult to measure directly, like the energy
usage of each component of the system [CoMc 1999].
Compression schemes differ in the size of symbols being compressed. Early compression systems worked on each
byte, as do the unix commands compress and gzip [WoCh 1992]. Later on, it became common to use full
instructions as symbols such as in [YSOO 1997, LeHW 2000:3]. Dictionary building techniques such as [ChBM
1997] and [LBCM 1997] operated on sequences of instructions within basic blocks; some such as [LiDK 1999] even
operated on symbols spanning basic blocks. The unique Slim Binaries compression scheme compressed abstract
syntax tree branches [FrKi 1997]. The late 1990s saw a concept for RISC systems exploiting the observation that
various instructions share similar boundaries at which operands, immediate constants, or store addresses start.
Instructions were split up into groups of bits called streams, with the bits from each instruction in group 1 being
compressed separately from those in group 2, and so on [JiCh 2000]. [KoWo 1994] even compared several symbol sizes.
When code is compressed, the intended destination addresses of branches change. Because instructions are
condensed, relative branches may overshoot their target, and direct branches may point past the end of the program.
The first attempt to overcome the branch problem was to build a table translating addresses of blocks in
uncompressed code to addresses in compressed code, with the stipulation that all instructions in the block may have
to be decompressed to reach the exact target instruction [WoCh 1992]. Researchers using fixed sized codewords
were able to use similar translation tables without the need to decompress the entire block to reach the target [JiCh
2000]. Later researchers frequently patched branches whenever possible (although patching relative branches is
known to be NP complete) and modified the hardware of the system to accept branches the size of the smallest
codewords [LBCM 1997, Larin 1999]. Some researchers patched branches but did not change the rest of the system,
thus requiring branch targets to be padded for word alignment [BeMN 1999]. Other researchers even performed
modifications on branches and used them as part of the compression scheme [CoMc 1999].
Machine Instruction Methods

One of the simplest methods for shrinking the size of code is to shrink the size of instructions. Two commercial
proposals for such methods are ARM Thumb and MIPS16, both of which feature 16-bit instructions based on a 32-
bit instruction set. The shortened instructions supplement use of 32-bit instructions on the processor core, while
requiring support from the core [BeMN 2001]. In each system, the processor core expands the 16-bit instructions to
32 bits just after they are fetched from instruction memory [LBCM 1997]. Instructions chosen for the Thumb set
were selected due to their frequency of use, importance for generating small code, or lack of need for full 32 bits.
Instructions for the MIPS16 set were chosen by analyzing a number of applications to determine the most frequently
generated instructions [LBCM 1997]. In order to reduce the number of instruction bits to 16, the number of registers
that can be referenced was decreased to 8, and the size of immediate fields was shrunk [LBCM 1997]. Neither
shortened instruction set is capable of generating complete programs; special instructions are used to switch between
16-bit and 32-bit instruction modes [LBCM 1997].
Use of the supplemental shortened instructions does decrease code size. ARM programs compiled with Thumb
support have code sizes about 30% less than when compiled for 32-bit instructions alone [LBCM 1997]. Similarly,
code sizes in programs produced with MIPS16 support are about 40% smaller than for programs using only 32-bit
MIPS-III instructions [LBCM 1997].
Huffman Coding Methods

Huffman coding methods were used to assign variable-length codewords to frequently occurring bytes or parts of
instructions. Since codewords were variable length, a translation table was always needed to map uncompressed
branch addresses to compressed ones. For maximum decoding speed, the Huffman codes were made fixed and
decoders were implemented using PLAs in hardware. As for all methods, results are summarized in a table in the
conclusion.
Wolfe and Chanin

In the paper that started interest in code compression for embedded systems, Wolfe and Chanin proposed their
method of compressing RISC instructions in what was called CCRP, “Code Compressed RISC Processor” [WoCh
1992]1. Each cache line was compressed, and code was expanded by the cache refill engine as code was loaded into
the cache so as to avoid problems when branch targets change due to compression. The mapping created by the
compressor for translating branch addresses was called a Line Address Table, or LAT, and only mapped to the start
of blocks. Therefore, several instructions may have to be decompressed from the block before the target instruction
was finally found. The most recently used LAT entries were stored in a special cache called the Cache Lookaside
Buffer, or CLB. Each LAT entry had a 24-bit base address followed by 8 entries indicating the length of the next
compressed block. The LAT added 3.125% to the size of programs.
Compression was done on bytes, with more frequent bytes being assigned shorter variable-length codewords. Wolfe
and Chanin ran experiments to determine if having a fixed code system would degrade compression ratios compared
to having a static code system, and found little difference between the two. The researchers selected the fixed route
because it allowed the decompressor to be effectively hardwired using a PLA, for maximum speed. Instruction
bytes are encoded using a bounded Huffman technique in which codewords are selected to be from 1 bit long to not
more than 16 bits (the upper bound).
Wolfe and Chanin tested performance using the MIPS2000 architecture on C and FORTRAN programs.
Compression ratios ranged from 65% to 75%. Three different memory models were tried: EPROM, Burst EPROM,
and Static Column DRAM. The latter two models had the advantage that each subsequent word after the first word
only took one clock cycle to retrieve. For the latter two, the performance of the system using compression was even
greater than that of the original system.
Tests of different cache sizes found that increasing the size from 256 bytes to 512 made a difference, but increases
beyond 1024 bytes had a lessening effect. The size of the CLB was not a significant bottleneck on performance. One
of the most important tests correlated memory speed to miss rate. It was found that for high miss rates, if memory
was slow, then compression improved performance; alternately, if memory was relatively fast, then compression
reduced performance.
A follow-up study by Kozuch and Wolfe compared compression for 15 programs from the SPEC benchmark suite on
the VAX 11, MIPS R4000, SUN 68020, SPARC, IBM RS6000, and Motorola MPC603 architectures [KoWo 1994].
It was found that statically compiled programs varied in size considerably, from MIPS programs being about 2.7
times the size of those for a VAX, to MPC603 programs being almost 4.9 times as large. Executable size variance
was due to architecture, compiler trade-offs, and library sizes [KoWo 1994]. In order to determine whether less
dense code was more compressible, the zeroth and first order entropy of the programs on each architecture was
calculated. Due to their importance, results for code size and entropy are shown in Figure 1 and Figure 2 below:
Figure 1: Program Sizes for Each Architecture [KoWo 1994].
Figure 2: Average Entropy for Each Architecture [KoWo 1994].
The results demonstrate that the MIPS instruction set is much more compressible than other instruction sets using
zeroth-order compression, which can explain variation in results found by other researchers who tried similar
compression methods. Furthermore, first-order compression can achieve substantial improvement, and higher-order
techniques (like those used by gzip) are worth pursuing even though they are often too expensive to implement for
run-time decompression on embedded systems.

1 Information in this section is drawn from [WoCh 1992] unless otherwise noted.

Returning to their own work on CCRP, Kozuch and
Wolfe found that the ratio of zeroth-order entropy to their compression ratio was 94 to 95%, indicating that they
were achieving good rates even though their method required the overhead of a LAT [KoWo 1994].
Because it used Huffman coding, CCRP required a 2^n-entry decoder, where n is the size (in bits) of the symbols being
compressed. Trying different symbol sizes, it was found that larger symbol sizes yielded better compression ratios,
a finding later researchers took advantage of in their own compression systems. Again due to its importance,
their results are shown in Figure 3 below.
Figure 3: Entropy Versus Symbol Size for Each Architecture [KoWo 1994].
A final experiment compared two greedy algorithms for building dictionaries. The algorithm that directly optimized for
compression ratio achieved, on average, an approximately 10% better ratio than the one that tried to reduce the raw
number of bits in the sample program [KoWo 1994].
Larin and Conte
Huffman coding was also chosen by Larin and Conte due to its near-optimal results [LaCo 1999]2. Decompression
was performed along the critical path past the I-cache in order to maximize cache hit rates. The architecture chosen,
TEPIC, was a 40-bit VLIW with several points in each instruction at which the instruction could naturally be
divided. Therefore, Huffman coding was compared for 1-byte symbols, 4 streams, or whole 40-bit words. Streams
were formed by breaking each instruction up into 4 separate parts with the first 9 bits, second 12 bits, third 14 bits,
and the last 5 bits being placed in separate streams to be compressed separately and decompressed in parallel to form
each instruction. Since RISC architectures like TEPIC have distinct parts such as the opcode, source and destination
operands, and immediate constants, it is reasonable to assume that each part of the instruction may have its own
separate and optimal Huffman encoding. Yet another technique, tailored encoding, was tried, in which
unnecessary bits were removed from instructions.
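The four-stream split can be sketched as bit slicing. The field order within the 40-bit word is an assumption here, since only the stream widths (9, 12, 14, and 5 bits) are given above:

```python
# Stream widths for a TEPIC-style 40-bit instruction, as described above.
# The ordering of the fields within the word is an assumption.
WIDTHS = (9, 12, 14, 5)

def split_streams(word40):
    """Slice a 40-bit word into the four stream fields, most significant first."""
    fields = []
    shift = 40
    for w in WIDTHS:
        shift -= w
        fields.append((word40 >> shift) & ((1 << w) - 1))
    return fields

def join_streams(fields):
    """Reassemble the four decompressed stream fields into one instruction."""
    word = 0
    for w, f in zip(WIDTHS, fields):
        word = (word << w) | f
    return word

insn = 0xAB_CDEF_0123            # arbitrary 40-bit value
parts = split_streams(insn)      # four fields, compressed separately
```

In the scheme above, each of the four fields would feed its own Huffman coder, and the decoders would run in parallel before `join_streams` reforms the instruction.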
2 Information in this section is drawn from [LaCo 1999] unless otherwise noted.
Larin and Conte’s branch translation table held a mapping to the exact location of each unique branch target. Due to its finer
granularity, the table added 15.5% to the size of the program on average. Entries were also cached near the CPU.
The cache also stored the predicted PC following an instruction in order to help keep the CPU pipeline moving at
full speed, and stored a branch taken/not taken predictor value. Operations were collected into word aligned groups,
with an extra bit somewhat wastefully following each instruction and being set only for the last instruction in the
group.
Code compression ratios of stream and byte encodings produced little difference for this architecture, hovering
around 70% to 78%. Tailored encoding did better ranging from 60% to 67%. Finally, full instruction compression
was at least twice as good in all cases, typically achieving ratios of 23% to 30%. One should note that the
compression rate of full Huffman encoding is offset by the enormous table required to decode it.
Decoder complexity in terms of decoder size was also measured. Complexity for Huffman compression of full
instructions was normally several times that for stream compression, and both were about 100 times more complex
than for byte compression. The poor performance of stream compression was not explained by the authors. Bus
flips are the main cause of power consumption across the bus. On average, the number of flips for the base encoding
was several times larger than that for tailored compression, which was in turn several times larger than that of Huffman
compression.
Single Instruction Dictionary Methods
Each of the methods in this section operates using some form of dictionary building. Typically, the most frequent
instructions are stored in a dictionary. In the code, a tag and an index into the dictionary replace the instruction.
There is strong evidence that this will work. Just as Hennessy and Patterson found in 1990 that 90% of program
execution time is spent in 10% of the code, Lefurgy observed:
On average, less than 20% of the instructions... have bit pattern encodings which are used exactly
once in the program. In the go benchmark, for example, 1% of the most frequent instruction
words account for 30% of the program size, and 10% of the most frequent instruction words
account for 66% of the program size.
[HePa 1990, LBCM 1997]
Even shortening the 1% most frequent instructions in a program from 32 bits to 8 bits would reduce the code size by
22.5%.
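The 22.5% figure follows directly from the quoted go-benchmark numbers:

```python
# 1% of the most frequent instruction words account for 30% of program size
# (per the go benchmark quote above); shrinking those words from 32 bits to
# 8 bits saves 24 of every 32 bits within that 30% of the program.
fraction_of_program = 0.30
saving = fraction_of_program * (32 - 8) / 32
print(f"{saving:.1%}")   # 22.5%
```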
Yoshida, Song, Okuhata, Onoye, and Shirakawa
Yoshida et al [YSOO 1997]3 noted that compilers tend to generate many duplicate instructions. They developed a
compression algorithm with the goal of reducing power consumption. Their algorithm operates by searching
through the program of N instructions, each having m bits, to find a complete list of n distinct instructions. Since
programs only use part of the instruction set and compilers generate many duplicate instructions, this is not difficult.
3 Information in this section is drawn from [YSOO 1997] unless otherwise noted.
Then, a ceil(log2 n)-bit code number is assigned to each distinct instruction, and a table is created to translate each code
number back to a full instruction. This reduces the size of instructions from m bits to ceil(log2 n) bits. An optional subcode
compression extension further reduced instruction size by separately compressing codes for registers and flags.
Power dissipation of memory units depends on their physical size. Power use reduction was quantified by the
equation:

P = 1 - (N * ceil(log2 n) + k * n * m) / (N * m)

where k is the power dissipation ratio of the on-chip memory.
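Under this reading of the formula (the reconstruction of the equation itself is an assumption), the reduction can be evaluated for illustrative numbers:

```python
import math

def power_reduction(N, m, n, k):
    """Fractional power reduction under the reconstructed formula
    P = 1 - (N*ceil(log2 n) + k*n*m) / (N*m)."""
    code_bits = N * math.ceil(math.log2(n))   # compressed program
    table_bits = k * n * m                    # on-chip decode table, scaled by k
    return 1 - (code_bits + table_bits) / (N * m)

# Illustrative values, not from the paper: 10,000 32-bit instructions,
# 4096 distinct instructions, and an on-chip dissipation ratio k of 0.1.
p = power_reduction(N=10_000, m=32, n=4096, k=0.1)
```

Shrinking n (fewer distinct instructions) or k (cheaper on-chip memory) pushes P toward 1, matching the intuition that dictionary compression pays off most when the decode table is small and cheap.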
Using the ARM610 processor core, Yoshida et al ran their compression system on the dhrystone benchmark.
Opcode size of compressed programs typically decreased from 32 bits to 12 bits for a program compression ratio of
22.7% to 54.0% (or 4.48% to 21.52% with 3 bit subcode compression), and a power reduction of memory by
19.57% to 42.33% (16.46% to 24.55% with 3-bit subcode compression). In [YSOO 1996], it was found that placing
the decompression software in ROM required only 1.67 mm² of area using 0.6 µm technology. Battery life was
estimated to have been extended by a factor of 1.7 [YSOO 1996].
Jin and Chen
Jin and Chen [JiCh 2000]4 realized that the capability of code compression could be enhanced if decompression was
done past the cache. To perform compression, each 64-bit instruction is broken up into 4 sections of 16 bits. If each
of these sections could be found in a 256 entry table related to that section, then the instruction was compressed to
4*8 = 32 bits. A table with 96-bit entries, called a Line Address Table (LAT), stored a base address and then 64
separate bit flags indicating whether the 64 instructions after that address were compressed. The LAT was used to
resolve branches. The instruction cache was divided in 2 banks of 32 bits to avoid alignment problems with
uncompressed instructions. Looking up compressed instructions was done via 4 parallel dictionary accesses, which
combined with decompression added 2 stages to the fetch-execute cycle. This increased the branch penalty, which
was alleviated by adding a Branch Compensation Cache of decompressed branch targets.
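The per-section lookup can be sketched as follows. The exact table contents and miss handling are assumptions; the scheme shown is only the 16-bit-section-to-8-bit-index mapping described above:

```python
# Sketch of Jin and Chen's per-section compression (details assumed): each
# 64-bit instruction splits into four 16-bit sections; a section compresses
# to an 8-bit index only if it appears in that section's 256-entry table.
def compress(instr64, tables):
    sections = [(instr64 >> s) & 0xFFFF for s in (48, 32, 16, 0)]
    indices = []
    for table, sec in zip(tables, sections):
        if sec not in table:
            return None          # any miss leaves the instruction uncompressed
        indices.append(table[sec])
    word32 = 0
    for idx in indices:
        word32 = (word32 << 8) | idx
    return word32                # 4 * 8 = 32 bits

# Hypothetical single-entry tables, one per 16-bit section.
tables = [{0x1234: 0}, {0x5678: 1}, {0x9ABC: 2}, {0xDEF0: 3}]
hit = compress(0x1234_5678_9ABC_DEF0, tables)
miss = compress(0xFFFF_5678_9ABC_DEF0, tables)
```

Decompression runs the mapping in reverse with four parallel table reads, which is the source of the two extra pipeline stages mentioned above.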
Design was done on a 64-bit SimpleScalar-style instruction set, with results tried on SPEC95, gcc, and a few other
programs. Compression ratios were in the range of 70% to 80%. Even with a 1KB to 2KB I-cache, the branch
penalty due to cache misses and CPU idle times was significantly reduced, with miss ratios reduced by a few
percent, or up to 40% in certain cases. Using 0.15 µm technology, die area was reduced, but not as much as with
Wolfe and Chanin’s CCRP.
Benini, Macii, and Nannarelli
Perhaps the most extensive work was done by Benini, Macii, and Nannarelli in [BeMN 2001]5. Here, compression
was performed on instructions, but compression was also carefully done so as to minimize cache line reads. Cache
4 Information in this section is drawn from [JiCh 2000] unless otherwise noted.5 Information in this section is drawn from [BeMN 2001] unless otherwise noted.
Code Compression in Embedded Systems Miller_____________________________________________________________________________________________
____________________________________________________________________________________________Page 16 of 39
lines were 128 bits, and instructions were 32 bits. Like most dictionary systems, the code was profiled and the 256
most frequent instructions were brought to a dictionary, being replaced by 8-bit codes. Cache lines would be made
into compressed cache lines only if one or fewer uncompressed instructions could fit into them. Uncompressed
cache lines contained only the standard four uncompressed instructions; thus compression was guaranteed to not
increase code size. Compressed cache lines began with an identifying illegal opcode, a set of flags that indicated
whether the following single byte slots contained compressed or uncompressed instructions or were empty for
alignment purposes, and 12 slots. Compressed cache lines were guaranteed to contain at least 5 instructions and up
to 12, so compression ratios could not fall below 25%. To solve the branch problem, any instructions at branch
destinations were patched after compression, and were word aligned (required by architecture) by leaving slots
empty as necessary. Furthermore, by preventing instructions from crossing cache lines, only a single cache line
needed to be read for each instruction fetch, avoiding the expensive double-line accesses and double line misses.
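The packing decision can be sketched as a slot-counting check. The slot accounting below is an assumption based on the description above (12 one-byte slots, dictionary hits taking one slot, uncompressed instructions taking four word-aligned slots):

```python
# Sketch (assumed packing rules) of deciding whether a run of instructions
# fits in one compressed 128-bit line: 12 byte slots follow the escape
# opcode and flags; a dictionary hit takes one slot, an uncompressed 32-bit
# instruction takes four slots and must start on a word boundary.
def pack_line(instrs, dictionary, slots=12):
    used, packed = 0, []
    for ins in instrs:
        if ins in dictionary:
            packed.append(('C', dictionary[ins]))   # compressed: 1 slot
            used += 1
        else:
            pad = (-used) % 4        # empty slots to word-align the raw word
            used += pad + 4
            packed.append(('U', ins))
        if used > slots:
            return None              # does not fit; emit an ordinary line
    return packed

d = {'add': 0, 'ld': 1, 'st': 2}     # hypothetical dictionary
line = pack_line(['add', 'ld', 'add', 'st', 'ld', 'mul'], d)
```

Because a line is only committed when everything fits, compression can never expand the code, matching the guarantee described above.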
A diagram of Benini’s decompression engine is shown below.
Figure 4: Decompression Unit [BeMN 2001]
In the decompression unit, a MUX located between the data bus and the cache selected the next instruction to use:
either an uncompressed instruction drawn from the cache or an entry from the table of compressed
instructions, labeled IDT. The main controller performed cache tag checking, handled cache misses, and checked if
cache lines were compressed, setting the index sent to the IDT if necessary.
On a series of C program benchmarks, code size reduction averaged 28% and ranged from 11% to 61%. Cache hit
ratio improvement averaged 19%, ranging from 0% to 34%, with similar figures for energy use reduction being 30%
and 8% to 53%. Shown below are average results taken over a set of C program benchmarks for compression ratios,
hit rates, and energy use reductions:
Cache Size, Technology          Average Hit Rate    Average Energy
                                Increase (%)        Savings (%)
2K Cache and On-Chip SRAM       40.21               39.90
4K Cache and On-Chip SRAM       18.76               30.31
8K Cache and On-Chip SRAM        7.29                8.38
3K Cache and On-Chip FLASH      18.76               49.83
(Average code size reduction: 27.90%)

A noteworthy and unexpected result of their research was that increased cache sizes increased the cache hit ratio,
reducing the number of memory transactions over the bus. However, the bigger cache also increased the cost of cache
accesses, decreasing the relative contribution of the bus and memory to energy use and in turn decreasing the usefulness of
having compressed code.
Instruction Sequence Dictionary Methods
Like the dictionary-building methods described above, these methods place frequently used instructions in a
dictionary. Unlike the methods above, here sequences of instructions are aggregated into dictionary
entries, and the instructions in the code are replaced by a codeword indexing the dictionary entry, much like a macro or
function call. No branch problem exists. Building an optimal dictionary is NP-complete in the size of the text, so a
heuristic function is usually used [LBCM 1997].
Liao and Devadas
In [LiDK 1999], Liao and Devadas propose two methods of code compression, both of which are based on set-
covering and suffix-finding algorithms6. The first method they tried was based upon finding
common sequences of machine instructions in an executable. The common sequences were extracted to a dictionary
and replaced with simple, parameter-less, CALL instructions. Extracted sequences could contain conditional
branches, unlike many similar compression methods.
Finding subsequences was the heart of their compression algorithm. Their system worked on blocks in the control
flow graph, building basic blocks with unique successors, called extended blocks. Because these blocks have unique
successors, they have unique exit points, which may be made into return statements in the extracted mini-
subroutines. Conditional branches are allowed between the start of the block and the inserted return statement.
Because subsequences vary significantly between programs, the dictionary was static. The algorithm first divides
the program into basic blocks, then compares each block with every other block for common substrings, which has O(n²)
worst-case running time. Next, a heuristic covering algorithm is used to decide which potential dictionary entries
will be used. Finally, the actual replacement is performed.
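A toy version of the subsequence search conveys the idea. This is an assumed simplification, not Liao and Devadas's algorithm: every pair of basic blocks is compared, mirroring the O(n²) pairwise pass, and common instruction substrings become dictionary candidates:

```python
# Toy pairwise search for repeated instruction substrings across basic
# blocks; candidates of length >= 2 would feed the covering heuristic.
def common_substrings(blocks, min_len=2):
    candidates = set()
    for i, a in enumerate(blocks):
        for b in blocks[i + 1:]:
            for s in range(len(a)):
                for l in range(min_len, len(a) - s + 1):
                    seq = tuple(a[s:s + l])
                    for t in range(len(b) - l + 1):
                        if tuple(b[t:t + l]) == seq:
                            candidates.add(seq)
    return candidates

blocks = [['ld', 'add', 'st', 'ret'], ['ld', 'add', 'st', 'br']]
cands = common_substrings(blocks)
```

Each selected candidate would be extracted to a mini-subroutine and its occurrences replaced by parameter-less CALL instructions, as described above.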
The second method also finds common subsequences, but is more flexible due to augmenting the
machine’s hardware instruction set with a special instruction, CALD, that takes as parameters the procedure to call
and the number of instructions to execute. Thus, some calls to a common sequence may return earlier than other
calls. Method II slightly outperformed Method I.
6 Information in this section is drawn from [LiDK 1999] unless otherwise noted.
Because returns from the function were implicit by the number of instructions executed, if the mini-subroutine
contained branches, one side of the branch may need to have NOPs inserted to make the number of instructions
executed equal in both branches. The CALD instruction required a counter of instructions left to execute and a stack
of subroutines be added to the CPU.
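The NOP padding that CALD requires can be sketched directly. The representation of instruction paths is assumed; the point is that both sides of a branch must execute the same number of instructions, since the implicit return fires after a fixed count:

```python
# Sketch of CALD-style path balancing (details assumed): pad the shorter
# side of a branch inside a mini-subroutine with NOPs so that the
# instruction counter expires at the same point on either path.
def balance_paths(taken, not_taken):
    diff = len(taken) - len(not_taken)
    if diff > 0:
        not_taken = not_taken + ['nop'] * diff
    elif diff < 0:
        taken = taken + ['nop'] * (-diff)
    return taken, not_taken

t, nt = balance_paths(['add', 'st'], ['sub'])
```

The padding is pure overhead at run time, which is one cost of making returns implicit in the executed-instruction count.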
The compression methods were tested using TI’s TMS320C25 compiler. Method II outperformed Method I by
approximately 2-5% on the sample programs compressed. Compression ratios on the unix compress, gzip, and jpeg
programs were 0.923, 0.789, and 0.863 using Method I and 0.967, 0.928, and 0.931 using Method II. Compression times
ranged from 0.410 to 93.600 seconds on 811- and 9663-instruction programs for Method I, with similar results for
Method II.
A unique idea was taken from Hennessy and Patterson’s 90/10 locality rule that about 90% of the execution time is
spent in 10% of the code. Therefore, compressing only the 90% least used code was tried. This was found to
degrade compression ratios by 2-3%. When compared to execution speed of uncompressed programs, compressing
only the least-used code yielded a 1-2% versus 15-17% for complete compression.
Lefurgy et al
[LBCM 1997] describes an algorithm that builds a dictionary of sequences of instructions7. Once repeating
sequences were identified, a greedy heuristic function was used that selects for the largest immediate savings.
Unlike [LiDK 1999], sequences were limited to one basic block. Instruction sequences were immediately replaced
by their dictionary index. Codewords of size 16, 12, 8, and 4 bits were tried.
Relative branches were not compressed because distances between instructions could change after a compression
pass, requiring a readjustment process that had been shown to be NP-complete. However, indirect branches that take
their target from a register were compressed. Because codewords were smaller than the CPU’s smallest branch
alignment (e.g. 4 bit codewords are 8 times smaller than the normal 32-bit branch alignment), the control unit of the
processor was modified to treat branches as being aligned with codewords. Because uncompressed instructions
were larger than those compressed, this reduced the range of some branches, so jump tables were created to handle
jumps requiring them.
Experiments were run on the PowerPC instruction set, compiling with gcc 2.7.2. Of the plethora of results, one
interesting result was that dictionary entries equal to or larger than 8 instructions tended to decrease compression
ratio, due to the destruction of smaller, useful dictionary entries by the greedy algorithm. Even when dictionary
entries holding 4 instructions were used with dictionaries of 16, 32, or 64 entries, 8%, 12%, and 16%
(respectively) of program bytes were removed. When using variable-length codewords with shorter
codewords for more frequent encodings, code reduction of 30% to 50% was achieved, with compression ratios very
close to that of the unix command compress. Maximum number of dictionary entries had the greatest effect on code
reduction followed by allowing shorter codewords.
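Greedy selection in the spirit of Lefurgy et al can be sketched as follows. The scoring function and the single-pass selection are assumptions; a real implementation would recount occurrences after each replacement, since picks overlap:

```python
# Greedy dictionary building: repeatedly prefer the candidate sequence with
# the largest immediate savings, occurrences * (bits replaced - codeword
# bits). Sequences are counted within single basic blocks, as above.
from collections import Counter

def greedy_dictionary(blocks, max_entries, codeword_bits=16, instr_bits=32):
    counts = Counter()
    for blk in blocks:
        for s in range(len(blk)):
            for l in range(2, len(blk) - s + 1):
                counts[tuple(blk[s:s + l])] += 1

    def savings(item):
        seq, n = item
        return n * (len(seq) * instr_bits - codeword_bits)

    picked = []
    for seq, n in sorted(counts.items(), key=savings, reverse=True):
        if n > 1 and len(picked) < max_entries:
            picked.append(seq)
    return picked

blocks = [['ld', 'add', 'st'], ['ld', 'add', 'st'], ['ld', 'add', 'mul']]
d = greedy_dictionary(blocks, max_entries=2)
```

The savings-based score illustrates the failure mode reported above: one very long entry can outscore, and thereby destroy, several smaller entries that would together have saved more.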
7 Information in this section is drawn from [LBCM 1997] unless otherwise noted.
Chen, Bird, and Mudge
The algorithm described in [ChBM 1997] also determined instruction frequencies, this time based only on the
binary bit patterns of the fixed-length instructions, and only within each basic block8. A tiling method was used to
greedily select the sequences with the highest frequency, with sequences being placed in a dictionary and the
dictionary index replacing the instruction in the code.
Chen’s compression scheme differed in the location where decompression takes place: past the cache (which holds
compressed code) on the way to the CPU core. Along this path, compressed instructions were dereferenced from the dictionary before
being passed to the CPU.
Measurements were taken using a DEC-21064-based workstation on the SPEC CINT95 and CFP95 benchmarks.
Program size reduction was measured at 45% to 60%. The focus of the paper was to measure bytes fetched from the
cache, and the miss ratio as the instruction cache (I-cache) was varied from 2K to 32K.
Compression was found to reduce the number of bytes needed by the CPU, even when only 32 dictionary entries
were formed. Chen also found that a less optimistic but more realistic measure of performance than the percentage of bytes
needed for execution was the percentage of bus cycles used by the compressed program, which was larger
because padding to fill byte boundaries was performed on each fetch. On average, the number of bytes needed was
reduced to about 50% of that for uncompressed code, while the number of cycles needed was reduced to about 65%.
The results for integer benchmarks from the SPEC 95 benchmark suite are shown in Figure 5 below. A final finding
showed that increasing the cache size decreased the cache miss rate approximately proportional to the inverse square
root of the cache size, on average. In fact, the miss rate of programs compressed with only 128 dictionary entries
was less than that for an architecture with twice the cache size. Results for cache miss rates are compared in Figure
6 below for uncompressed code, 32 and 128 entry dictionaries, and cache line sizes of 4 and 8 instructions.
8 Information in this section is drawn from [ChBM 1997] unless otherwise noted.
Figure 5: Bytes Needed and Bus Cycles Used Relative to Uncompressed Programs [ChBM 1997]
Figure 6: Cache Miss Rates versus Cache Size [ChBM 1997]
Clausen et al
Clausen et al used a similar macro-building compression technique in [CPCM 2000]9. Their area of research was
an embedded version of Java for systems with as little as 6K of RAM -- so small that other compression
systems might not work. Here, Java classes had been stripped of much extraneous information unnecessary for the
embedded market.
The researchers extended the instruction set of the JVM: non-standard opcodes were interpreted as macro
instructions read from a dictionary. Generation of sequences was unique. First, each instruction formed a
group of length 1. Then groups were expanded by elongation or splitting. Bytecode switch, jump-subroutine, and
return statements were considered unfactorizable; groups were normally split around them. Macros were formed
by greedily selecting the group that provides the greatest code size savings, until no more unused instruction codes
or groups existed.
Experiments were performed by testing compression of standard libraries and demonstration programs that ship with
embedded Java systems. The average compression ratio for bytecode size was 79.7%, with a compression ratio
of 84.7% when counting dictionary size. This was found to only slightly trail gzip on average. Slowdown of executed
code was measured at between 19% and 27%.
Arithmetic Coding
In their many publications, Lekatsas, Henkel, and Wolf tried several compression schemes, one of which combined
arithmetic coding with Markov models [LeHW 2000:3]10. To justify this choice, the authors cited that
arithmetic coding performs better when symbol probabilities are skewed, and does not lose coding efficiency at symbol
boundaries.
9 Information in this section is drawn from [CPCM 2000] unless otherwise noted.
Code Compression in Embedded Systems Miller_____________________________________________________________________________________________
____________________________________________________________________________________________Page 21 of 39
Their algorithm separated instructions into four groups, each identified by a short prefix (shown
in parentheses): instructions with immediates (0), branches (11), “fast dictionary instructions” (100), and
uncompressed instructions (101). Group 1 instructions were compressed using the Markov model and arithmetic
coding; group 2 instructions were compressed by rewriting them in a form without the unnecessary bits common to
equal-length instructions in RISC architectures; group 3 instructions were looked up in a 256-entry table, with the
stipulation that such instructions have no immediate fields. Phase 1 of compression made a pass to build the
Markov model. Phase 2 compressed group 1 instructions. Phase 3 compressed branches only, compressing the code
further still. Phase 4 patched the branch offsets that had been marked in the two previous phases.
Compression ratios of the group 2 instructions were better than for group 1. Overall compression ratios were about
0.52 to 0.56, considering code only. Different groups of instructions were present at different frequency in the code
with group 1 instructions about twice as common as groups 2 or 3, and group 4 representing only 0.6% of the code.
The Markov Model could be described in terms of width and depth, as shown in the diagram below.
Figure 7: Example Markov Model [LeHW 2000:2]
The experimenters found that the depth should divide the instruction size evenly, so that the model returns to the
same layer after processing each instruction. The width of a model represents the amount of knowledge of the path
remembered by the model.
A unique facet of using a Markov model with arithmetic coding was the ability to assign codes after the frequencies
of instructions appearing after other instructions were understood. The authors noted that the energy used on a bus
line is given by:

E_bus-line = (1/2) * n * C_eff_line * V^2

where n is the number of bit toggles, C_eff_line is the effective capacitance of the bus line, and V is the potential
difference between high and low signals in volts.
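The relation can be made concrete with a toggle counter over a bus trace. The trace values and electrical parameters below are illustrative assumptions:

```python
# Count bit toggles between successive words on a bus, then apply the
# energy relation E = 0.5 * n * C_eff_line * V^2 quoted above.
def toggles(words, width=32):
    n = 0
    for prev, cur in zip(words, words[1:]):
        n += bin((prev ^ cur) & ((1 << width) - 1)).count('1')
    return n

def bus_energy(n_toggles, c_eff_line, v):
    return 0.5 * n_toggles * c_eff_line * v ** 2

# Hypothetical 3-word trace: 32 toggles, then 16 toggles.
trace = [0x0000_0000, 0xFFFF_FFFF, 0xFFFF_0000]
n = toggles(trace)
e = bus_energy(n, c_eff_line=5e-12, v=3.3)   # assumed 5 pF line, 3.3 V swing
```

This is why assigning low-Hamming-distance codewords to frequent instruction transitions, as Lekatsas et al did, directly reduces bus energy: it lowers n without touching C or V.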
10 Information in this section is drawn from [LeHW 2000:3] unless otherwise noted.
Lekatsas et al’s system assigned encoding bits so as to reduce bit toggling for more frequent codewords. At any
given state in the Markov Model, the same bit would be assigned to the most probable next transition as was the bit
along the most probable last transition used to reach the state. Reducing bit toggling for codewords reduced power
use on buses over which compressed instructions passed, with no loss to decoding performance.
Since instruction size was reduced, the amount of data transferred to the CPU for each normally 32-bit instruction
fetch could be reduced. Two methods were tried. The first was to leave the unused bits of compressed instructions
unchanged because bit toggles use energy. The second method, found to be significantly more effective at power
reduction, was to pack as much as possible of the next instruction into leftover space. Number of toggles on buses
when performing decompression before and after the I-cache was examined, and decompression after the I-cache
was found to have significant benefit.
Energy savings were found to be between 16% and 54%. If extra performance was traded to save more energy by
slowing down the clock frequency, savings ranged from 16% to 82%. Energy savings were greatest in the CPU
which benefited from reduced idle time due to cache misses, as well as through reduced bit toggles on the buses, in
the cache, and elsewhere. Energy consumption of the hardware decompression unit was negligible compared to that
used and saved in the rest of the system.
In [LeHW 2000:2], arithmetic coding was compared to Huffman coding, and found to be superior in the metrics of
compression ratio, performance increase, and energy savings, as are shown in the table below for four separate
programs:
          Compression Ratio     Execution Time Increase     Energy Savings
          Huffman  Arithmetic   Huffman  Arithmetic         Huffman  Arithmetic
i3d       0.66     0.53         50.29    56.46              33.08    43.27
mpeg      0.66     0.53         48.35    47.01              28.40    30.96
smo       0.65     0.55          2.42     2.85              12.00    19.01
trick     0.65     0.54         53.28    62.02              43.91    53.81
Yet another study compared arithmetic coding to a dictionary method using a 256-entry dictionary that was
filled with sequences of one to three instructions by a greedy selection method similar to those described in the
dictionary sections above [LeWo 1998]. On the MIPS architecture, the dictionary method was found to achieve
compression ratios 4-6% better than the arithmetic method, while much less difference was found on the PowerPC
architecture [LeWo 1998]. No data on power or performance changes was reported. This study provided the
impetus to combine dictionary methods with arithmetic coding.
Procedural Abstraction and Cross Jumping
Cooper and McIntosh tried a compiler-oriented method of code compression [CoMc 1999]11. Their algorithm began
by building a suffix tree, a structure that stores information about repetition within a string by representing the
string as a tree with repeated substrings as interior nodes. Construction can be done in O(n) time for a string of
length n. The suffix tree was used to identify repeated sequences, which were then analyzed for hazards that would
prevent code compression, and the suffixes were split around such hazards. Next, two transformations were applied.
The first, procedural abstraction, created new procedures out of frequently repeated sequences of code and replaced
the sequences with procedure calls. The second, cross jumping or “tail merging”, merged the ends of sequences that
end by jumping to the same location, by having one sequence be jumped into by sequences with similar endings.
Two abstraction techniques were used to increase the number of repeated sequences. In branch abstraction,
branches were recoded in a PC-relative form whenever possible, to identify more similar sequences and sequences
that spanned multiple blocks. The other technique, register abstraction, renamed registers in sequences in terms of
uses and definitions within the enclosing basic block. Register renaming could then be applied to make sequences
identical to each other, and live-range recoloring was used to reassign registers in blocks. A final optional
improvement was based on Hennessy and Patterson’s 90/10 principle: profiling data was used to locate the most
frequently run code, which was left uncompressed.
A compiler that generates a language similar to RISC assembly was modified, and the following results were
obtained as averages over a series of programs for experiments with pattern matching only, pattern matching with
branch abstraction, and pattern matching with register abstraction:

                                    pattern matching   + branch        + register
                                    only               abstraction     abstraction
percent decrease in static
instruction count                   0.67               0.72            4.88
percent increase in dynamic
instruction count                   1.07               2.53            6.47
Register abstraction techniques achieved an average of nearly 5% code space reduction, with a high of 14.8%. The
average for relative branch compression was only about 1%. Standard optimization methods reduced code size by
18.56%. Compression reduced code size by an average of 7.91% without optimization, and made less of a
difference with optimization, yielding a combined 22.48% reduction. Finally, it is important to note that register
abstraction slowed down the compiler considerably due to its need to construct and manipulate interference graphs.
Slim Binaries
No paper about program compression written at UCI would be complete without mentioning Slim Binaries by
Michael Franz and Thomas Kistler [FrKi 1997]12. Unlike most compression systems covered here, slim binaries
require decompressing the entire program before it runs. They are still covered here due to their remarkable compression
ratio and usefulness for embedded systems. Fat binaries are executables that contain machine code for several
and its usefulness for embedded systems. Fat binaries are executables that contain machine code for several
architectures in order to allow users of new architectures the best performance without alienating users of older
systems at the expense of developers and compiler manufacturers who support older software. In contrast, slim
binaries contained only an “architecture neutral intermediate program representation” in a compressed form. The
authors would like us to believe that the reduced I/O load time of the compressed program will make up for the need
11 Information in this section is drawn from [CoMc 1999] unless otherwise noted.
to generate machine code at load time. Other features of slim binaries are support for modern software engineering
paradigms of creating modules with import and export interfaces, dynamic binding, and embedding applications in
documents.
The compression scheme of slim binaries was adaptive (using an ever-growing vocabulary) and predictive, adding
items to the vocabulary before they are encountered in the source. It operated on abstract syntax trees rather than
machine code. The vocabulary began with a few basic primitive operations such as assignment, addition,
subtraction, and procedure call. The algorithm began by parsing the source code into an abstract syntax tree (AST)
and building a symbol table. Next, the tree was traversed, with the encoder building the evolving vocabulary as it
processed. For example, the AST for the procedure call P(i + 1) would be encoded using the operation symbols
procedure call and addition, and the data symbols procedure P, variable i, and constant 1. The vocabulary would be
updated to include a new symbol for P(i + 1). Additional vocabulary entries would be created using prediction
heuristics, creating such symbols as i-plus-something and something-plus-one, and, predicting symmetric use in the
future, i-minus-something and something-minus-one. This would make encoding of a later statement such as i + j
easier. The adaptive system allows whole branches of the AST to be processed at a time.
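The adaptive, predictive vocabulary described above can be sketched in miniature. The code below is an illustrative toy, not Franz and Kistler's algorithm: the tuple-based AST encoding, the WILD placeholder, and the one-child-wildcard prediction heuristic are assumptions made here, and symmetric-operator prediction (e.g. i-minus-something) is omitted for brevity.

```python
WILD = "?"  # placeholder used for predicted, partially-wildcarded symbols

class Encoder:
    def __init__(self, primitives):
        # The vocabulary starts with a few primitive operations.
        self.vocab = {symbol: i for i, symbol in enumerate(primitives)}

    def predictions(self, tree):
        # The subtree itself, plus variants with one child wildcarded,
        # e.g. ("+", "i", 1) also yields i-plus-something and something-plus-one.
        op, children = tree[0], tree[1:]
        yield tree
        for i in range(len(children)):
            yield (op,) + children[:i] + (WILD,) + children[i + 1:]

    def encode(self, tree):
        if not isinstance(tree, tuple):          # leaf: a data symbol
            if tree not in self.vocab:
                self.vocab[tree] = len(self.vocab)
            return [self.vocab[tree]]
        if tree in self.vocab:                   # whole subtree already known
            return [self.vocab[tree]]
        out = [self.vocab[tree[0]]]              # operator, then children
        for child in tree[1:]:
            out += self.encode(child)
        for symbol in self.predictions(tree):    # grow the vocabulary for next time
            if symbol not in self.vocab:
                self.vocab[symbol] = len(self.vocab)
        return out

enc = Encoder(["call", "+", "-", ":="])
first = enc.encode(("call", "P", ("+", "i", 1)))   # P(i + 1): five symbols
second = enc.encode(("call", "P", ("+", "i", 1)))  # repeat: one symbol
print(len(first), len(second))  # → 5 1
```

The repeated call collapses to a single vocabulary symbol, which is the source of the scheme's compression.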
For an example package of network applications, slim binaries achieved compression ratios of 0.36 and 0.32 for the
i386 and PowerPC versions, respectively. This was 25% better than the compression achieved by the popular
Lempel-Ziv variant LZSS on the native binaries, which is remarkable because LZSS uses the kind of higher order
compression that Lefurgy et al described as impractical for embedded systems. Slim binaries can outperform high
order compression of binary executables because those systems are limited by the entropy of the compiler-created,
machine-oriented executable. Slim binaries are limited by the entropy of the abstract syntax tree, which probably
more closely describes the semantic meaning of a program than machine instructions like load and store operations.
On-the-fly code generation was found to be largely dependent on processor power, which continues to increase many
times faster than disk I/O speeds. Shown below in Figure 8 is a table of compression results, and in Figure 9 a graph
of load times of slim binaries compared to natively compiled executables.
Figure 8: Compression Comparison for Slim Binaries [FrKi 1997]
12 Information in this section is drawn from [FrKi 1997] unless otherwise noted.
Figure 9: Load Times for Network Package in Slim Binaries and Native Code [FrKi 1997]
Summary of Compression System Performance

The table below lists a summary of the performance of the compression systems described herein. Most authors
provided results either for overall program compression ratio or for code compression ratio; the latter would
presumably appear better because it does not include the extra tables or models that must be part of the whole
compressed program. Performance refers to the inverse of the execution time, unless otherwise noted.
Author              | Compression Method     | Program Compression Ratio (%) | Code Compression Ratio (%) | Energy Requirement Change (%) | Performance Change (%)
--------------------|------------------------|-------------------------------|----------------------------|-------------------------------|---------------------------
Wolfe and Chanin    | Huffman                | 65 to 75                      |                            |                               | -13 to +38
Larin and Conte     | Huffman, stream        | 70 to 78                      |                            |                               | -4 to +93
Larin and Conte     | Tailored               | 60 to 67                      |                            |                               | +2 to +188
Larin and Conte     | Huffman, whole op      |                               |                            |                               | +11 to +225
Yoshida et al       | Instruction dictionary |                               | 22.7 to 54.0               | -58 to -80 (memory)           |
Jin and Chen        | Instruction dictionary |                               | 70 to 76                   |                               | +0 to +40 (cache hit rate)
Benini et al        | Instruction dictionary |                               | 72                         | -30.31                        | +18.76 (cache hit rate)
Liao and Devadas    | Sequence dictionary    | 85.1 to 96.5                  |                            |                               | +15 to +17
Lefurgy et al       | Sequence dictionary    |                               | 53 to 70                   |                               |
Chen et al          | Sequence dictionary    |                               | 40 to 55                   |                               | +40 to +90 (bus cycles)
Clausen et al       | Sequence dictionary    | 84.7                          | 79.7                       |                               | +19 to +27
Lekatsas et al      | Arithmetic             | 52 to 56                      |                            | -19.01 to -53.81              | +2.85 to +62.02
Lekatsas et al      | Huffman                | 65 to 66                      |                            | -12.00 to -43.91              | +2.42 to +53.28
Lekatsas et al      | Sequence dictionary    | 43 to 58                      |                            |                               |
Cooper and McIntosh | Compiler               |                               | 93 to 96                   |                               |
Franz and Kistler   | Slim binaries          | 32 to 36                      |                            |                               |
Misconceptions about Code Compression

Researchers and computer scientists alike hold several misconceptions about the efficacy of code compression.
Several of these misconceptions, and the corresponding facts, are described below.
• “Performance will decrease because decompression must be done on the fly.” In actuality, code
compression is often done by fast table lookups near the CPU, decreasing the amount of data moved on the
bus from memory to the I-cache. In fact, systems that used post-I-cache compression increased the
effective cache size and decreased the cache miss rate as well as the amount of data that needed to be
brought across the bus from I-cache to CPU, which increased performance [LeHW 2000:3].
• “CISC systems tend to be more dense by design, and therefore will not benefit from code compression as
much as RISC architectures do” [LeHW 2000:2]. Several researchers made this comment, but it was
proven wrong as far back as 1994: Kozuch and Wolfe found that the zeroth order and first order entropy of
the VAX-11 was approximately the same as that of 4 RISC systems, with only the MIPS having a
significantly lower zeroth order entropy [KoWo 1994].
• “Some kind of table is needed to translate uncompressed instruction addresses to compressed instruction
addresses in order to resolve branch targets.” For researchers using dictionaries of instruction sequences
that were called like macros, this concern did not even apply. Many researchers came up with methods of
patching branch targets and offsets during a later pass or, as in Yoshida’s work, compressed all
instructions to the same width, making branch calculations trivial [YSOO 1997].
• In Lekatsas’ work, “appending a 3-bit preamble to uncompressed instructions may cause run-time
decompression inefficiency” [BeMN 2001]. While adding the preamble adds to the size of uncompressed
instructions, in actuality only 0.6% of instructions are left uncompressed [LeHW 2000:3]. Furthermore,
the published results indicate that variable length arithmetic encoding (which includes its own 256-entry
instruction dictionary) outperforms instruction dictionary methods alone [LeHW 2000:3, LeWo 1998].
• “Variable length codewords (such as those used in ... [WoCh 1992]) are expensive to decode” [LBCM
1997]. Lekatsas et al used Markov models to their advantage in developing a system that was fast to
decompress, achieved an excellent compression ratio, and lowered power requirements by choosing
encodings that minimize bit changes [LeHW 2000:3].
• “Combining 2 or more compression strategies does not yield better compression” [LaCo 1999]. Actually,
Lekatsas et al combined a 256-entry sequence dictionary with arithmetic coding to achieve some of the
best compression ratios of the papers surveyed [LeHW 2000:3]; using a 256-entry dictionary alone did not
produce results as good [LeWo 1998, LeHW 2000:3]. In fact, gcc could make use of a 7927-entry
instruction sequence dictionary, more than most instruction sequence dictionaries provide, so adding a
second compression scheme to any one of them would be likely to reduce code size further [LBCM 1997].
• “Higher order compression techniques (Lempel-Ziv, etc.) would be too expensive for implementation in
embedded systems,” and this limits code compressibility in embedded systems [KoWo 1994]. As Michael
Franz discovered, it is much more effective to compress an intermediate representation of the program than
to compress the executable [FrKi 1997].
• “My results are typical.” Many researchers presented their compression results based on real
implementations performing compression on a particular architecture. While it is important to have results
from an implementation, Kozuch and Wolfe demonstrated that compression ratios vary by as much as 10%
between architectures [KoWo 1994]. The good results produced by a compression system on one
architecture may be specific to that architecture.
• “Code compression ratio is an indicator of overall compression performance.” For exactly half of the
compression systems surveyed, the authors reported only code compression ratios. Often these systems
required auxiliary information to be added to programs in order for them to work, such as a dictionary
mapping indexes to instructions or instruction sequences. In our own implementation below, we
demonstrate that code compression ratio is not a usable indicator of overall compression ratio: an
impressive-looking code compression ratio of 0.5000 can hide an overall program size increase.
Implementation

In order to verify the results of other researchers, a code compression system was developed. This system is based
on the work of Yoshida et al, extending their system to use multiple dictionaries.
As in Yoshida’s work, the compression system first reads a program of N instructions and makes a list of the n
unique instructions. In addition, it determines the frequency at which each instruction occurs, then sorts the list of n
unique
instructions in order of frequency. Instructions are then placed in dictionaries of varying sizes, with the most
frequent instructions being placed in the smallest dictionaries. Dictionaries always had a number of entries that was
a power of 2.
In the compressed code, each original machine instruction is replaced by a prefix that identifies the dictionary in
which the instruction appears, followed by the instruction’s index into that dictionary. The compression system
was tried with 1, 2, 3, and 4 dictionaries. For 3 dictionaries, the two smaller dictionaries were assumed to have a 2
bit prefix while the third was assumed to have a 1 bit prefix.
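The size arithmetic behind this scheme can be sketched as follows. This is not the dcompress.py or dcalculate.py source (the real programs appear in Appendix B) but a re-derivation under the assumption of 32-bit instruction words; the function name, example data, and prefix-length lists are invented for illustration. Dictionary storage is deliberately excluded, so the result is a code size, not an overall program size.

```python
# Sketch of the multi-dictionary size calculation; assumes 32-bit instructions.
from collections import Counter
from math import log2

def compressed_code_bits(instructions, dict_sizes, prefix_bits):
    """Bits in the compressed code stream (dictionary storage excluded).

    dict_sizes  -- power-of-2 entry counts, smallest dictionary first
                   (it receives the most frequent instructions)
    prefix_bits -- prefix length per dictionary, a prefix-free code,
                   e.g. [2, 2, 1] for the three-dictionary case above
    """
    freq = Counter(instructions)
    ranked = [inst for inst, _ in freq.most_common()]  # most frequent first
    bits, start = 0, 0
    for size, prefix in zip(dict_sizes, prefix_bits):
        index_bits = int(log2(size))
        for inst in ranked[start:start + size]:        # entries of this dictionary
            bits += freq[inst] * (prefix + index_bits)
        start += size
    return bits

# Toy program: 1000 32-bit words, 20 unique, one word dominating.
program = [i % 20 for i in range(100)] + [0] * 900
original_bits = 32 * len(program)
one = compressed_code_bits(program, [32], [0]) / original_bits      # 1 dictionary
two = compressed_code_bits(program, [4, 32], [1, 1]) / original_bits  # 2 dictionaries
print(one, two)  # → 0.15625 0.10125
```

Even on this toy input, splitting the most frequent instructions into a small dictionary with a short index improves the code compression ratio, which mirrors the multi-dictionary results reported below.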
Two programs were created. The first, dcompress.py, was a compressor based on three dictionaries. When run on
machine instructions, the compressor produced a file containing the dictionaries used for compression
and a file containing the compressed code. A corresponding decompression system would need to be implemented
in hardware, and would load the dictionaries into special decompression tables. The compressed code would be
stored back into the executable file.
The second program, dcalculate.py, calculated the size of the compressed code and code compression ratios that
would be achieved for various dictionary sizes if the algorithm above were applied. dcalculate.py was useful for
examining how the number of dictionaries affected compression.
Source code for both programs is included in appendix B: source code. Programs were written in the Python
programming language.
It is important to note that having multiple dictionary sizes creates varying length codewords, which in turn
necessitates a method for translating uncompressed addresses to compressed addresses. Wolfe and Chanin used a
translation table that mapped to blocks and added about 3.5% to the program size [WoCh 1992]. Larin and Conte
also used a table, but theirs mapped uncompressed target addresses to compressed target addresses, adding 15.5% to
the size of the program [LaCo 1999]. A complete implementation of the dictionary compression scheme described
here would also need such a table, but one was not implemented.
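A block-granular translation table of the kind Wolfe and Chanin describe can be sketched as follows. This is an assumption-laden illustration, not their implementation: the 32-bit instruction width, the 32-byte line size, and the Python list representation are all invented here.

```python
def line_address_table(lengths_bits, line_bytes=32, inst_bytes=4):
    """One compressed-bit-offset entry per uncompressed cache line.

    lengths_bits -- compressed codeword length of each instruction, in bits,
                    in program order. Keeping one entry per line rather than
                    per instruction is what keeps the table overhead small.
    """
    per_line = line_bytes // inst_bytes
    table, bit = [], 0
    for i, length in enumerate(lengths_bits):
        if i % per_line == 0:   # instruction starts a new uncompressed line
            table.append(bit)
        bit += length
    return table

# 16 instructions whose codewords alternate between 8 and 14 bits:
print(line_address_table([8, 14] * 8))  # → [0, 88]
```

A branch to an uncompressed address is resolved by indexing the table with the line number and then decoding forward from that bit offset, which is why block granularity trades a little decode work for a much smaller table.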
Generating Benchmark Data

Ten programs on the Sun SPARC architecture under the operating system SunOS 5.7 were selected as benchmarks
based on their popularity across unix platforms and their variety in terms of uncompressed executable size. SunOS
5.7 saves unix programs in the ELF binary format. For each program, the program “objcopy” was run with the -j
switch to extract the code segments of programs, called the “.text” segments [terran.txt]. The command “readelf”
was run with the “-l” switch on each result to determine the size of the ELF headers in objcopy’s output by
subtracting the number given for the VirtualAddress from the number given for Entry Point [readelf.html]. Finally,
the ELF headers were removed from the files along with any data that came after the number given by readelf for
the FileSize of the .text segment [terran.txt]. The resulting data was treated as being just the machine instructions of
the programs’ code segments.
The table below shows the programs used as benchmarks, the size of the executables, the size of code, the number of
unique instructions in the program, and the percentage of instructions that were unique (appeared for the first time)
in the programs. The version of the programs is listed if it was available; otherwise, the version of the operating
system, SunOS 5.7, is listed.
Program Name | Version   | File Size (bytes) | Code Size (bytes) | Unique Machine Instructions | % Unique Instructions
-------------|-----------|-------------------|-------------------|-----------------------------|----------------------
echo         | SunOS 5.7 | 5680              | 1564              | 246                         | 62.92%
compress     | SunOS 5.7 | 16024             | 8464              | 1288                        | 60.87%
patch        | SunOS 5.7 | 38328             | 21092             | 2114                        | 40.09%
ftp          | SunOS 5.7 | 73840             | 48904             | 4503                        | 36.83%
gzip         | 1.2.4     | 124736            | 90236             | 2532                        | 11.22%
vi           | SunOS 5.7 | 208204            | 155208            | 12705                       | 32.74%
gcc          | 2.8.1     | 417792            | 177724            | 2737                        | 6.16%
vim          | 5.3       | 684500            | 539736            | 30877                       | 22.88%
ssh2         | 2.4.0     | 4927672           | 507376            | 30997                       | 24.44%
emacs        | 20.4      | 8520132           | 1092588           | 45186                       | 16.54%
SUM          |           | 15016908          | 2642892           | 133185                      | 20.16%
Two very important results are seen in this table. First, [YSOO 1997] indicated that when a table of unique
instructions was created, the authors needed only 12 bits to address the table; hence, only 2^12 = 4096 or fewer
unique instructions existed in their benchmarks. In contrast, up to 45186 unique instructions were found in the
executables in our benchmarks, so for the SPARC binaries examined here, dictionaries needed indexes of up to 16
bits. Since compression requires replacing each instruction with a prefix and an index, this means either that
compression ratios for the architecture and benchmark set tried here would not look as good as Yoshida’s results, or
that Yoshida tested only rather small programs.
The second important result is that the percentage of unique instructions appearing in programs varies significantly.
One particular program, gcc, had the most frequent instruction appearing 21280 times, while the next three most
frequent instructions appeared 2734, 1050, and 698 times. The four most frequent instructions in vi, on the other
hand, appeared 1907, 1029, 464, and 452 times.
Results

Appendix A: Result Data includes tables listing the compressed code size, code compression ratio, and overall
program compression ratio for each of the benchmark programs tested. Figure 10 below shows the average code
compression ratio for various dictionary configurations. Figure 11 below is a similar graph of average overall
compression ratios (note that neither includes a means to translate uncompressed branch target addresses to
compressed branch addresses).
[Bar chart: mean code compression ratio (y-axis: compression ratio, 0.0000 to 0.6000) for each dictionary
configuration tried (x-axis: dictionary size), from a single 65536-entry dictionary to four-dictionary combinations
such as 16x128x4096x65536.]
Figure 10: Average Code Compression Ratio
[Bar chart: mean overall compression ratio (y-axis: compression ratio, 0.7800 to 0.9200) for the same dictionary
configurations (x-axis: dictionary size).]
Figure 11: Average Overall Compression Ratio
A few observations and results are worth mentioning. First, code compression ratio is not a very good indicator of
overall compression ratio. Many authors in the papers surveyed reported only code compression ratios even when
their compression schemes seemed to require a large dictionary and/or a table for resolving branch addresses. Such
figures belie the actual compression ratio that the system could produce. Second, having multiple dictionaries
improves code compression. In fact, with four tables of sizes 16, 128, 4096, and 65536, the average code
compression ratio is 30% better than using a single table. Third, having multiple dictionaries improves overall
compression ratios. The same set of four tables produces an average overall compression ratio 8.23% better than a
single table scheme. Fourth, the gains made by adding more dictionaries tends to decrease substantially after the
second table is added. Moving from a single dictionary to two dictionaries with 256 and 65536 entries, the code
compression ratio improves by 6.18%. Moving from said two dictionaries to three dictionaries with 32, 512, and
65536 entries only improves the overall compression ratio by 1.15%. Moving from three dictionaries to the highest
compressing system tried, four dictionaries with 16, 128, 4096, and 65536 entries, only improves the code
compression ratio by 1.04%.
The code compression ratio and overall compression ratio of four selected programs are shown below in figures 12
and 13. A few novel observations may be made. First, code compression ratios tend to be small (better) for very
small programs and programs that have a low percentage of unique instructions. One would tend to expect this
because the majority of compressed instructions would refer to smaller dictionaries that could be addressed with
fewer bits. The trends in figure 13 reiterate why code compression ratios are not useful for reporting results.
Overall compression ratios are very high (worse) for small programs. Due to the overhead of storing a dictionary,
compressed programs may even be larger than uncompressed programs. Overall compression ratios are lower for
medium to large programs that have a lower percentage of unique instructions.
[Bar chart: code compression ratios (y-axis: compression ratio, 0.0000 to 0.6000) for echo, vi, gcc, and emacs
across the dictionary configurations tried (x-axis: table size).]
Figure 12: Code Compression Ratios for Selected Programs
[Bar chart: overall compression ratios (y-axis: compression ratio, 0.0000 to 1.2000) for echo, vi, gcc, and emacs
across the dictionary configurations tried (x-axis: table size).]
Figure 13: Overall Compression Ratios for Selected Programs
In [LBCM 1997], it was reported that the 1% most frequent instructions account for 30% of the size of machine
code, and the 10% most frequent instructions account for 66% of the size of code. The program dcalculate.py was
used to examine how many instructions occurred at each frequency. For the ten benchmark programs, the
percentage of the code size accounted for by the 1% and 10% most frequent unique instructions was calculated, as
well as by the 1% and 10% most frequent total instructions. Percentages for the most frequent unique instructions
were determined by dividing the number of unique instructions by 100 and 10, respectively; percentages for the
most frequent total instructions were determined by dividing the total number of instructions by 100 and 10,
respectively. Results are shown in the table below.
Program Name | File Size | Code Size | Unique Machine Instructions | One Percent Unique Instructions | Ten Percent Unique Instructions | One Percent Total Instructions | Ten Percent Total Instructions
-------------|-----------|-----------|-----------------------------|---------------------------------|---------------------------------|--------------------------------|-------------------------------
echo         | 5680      | 1564      | 246                         | 6.65%                           | 39.64%                          | 13.04%                         | 46.80%
compress     | 16024     | 8464      | 1288                        | 13.66%                          | 38.89%                          | 17.82%                         | 46.74%
patch        | 38328     | 21092     | 2114                        | 21.58%                          | 53.31%                          | 32.68%                         | 68.52%
ftp          | 73840     | 48904     | 4503                        | 25.96%                          | 56.87%                          | 37.54%                         | 72.65%
gzip         | 124736    | 90236     | 2532                        | 40.45%                          | 79.64%                          | 78.09%                         | 98.78%
vi           | 208204    | 155208    | 12705                       | 33.96%                          | 61.33%                          | 46.43%                         | 77.26%
gcc          | 417792    | 177724    | 2737                        | 72.07%                          | 90.06%                          | 92.88%                         | 100.00%
vim          | 684500    | 539736    | 30877                       | 43.48%                          | 70.93%                          | 60.96%                         | 87.12%
ssh2         | 4927672   | 507376    | 30997                       | 38.14%                          | 68.04%                          | 55.42%                         | 85.56%
emacs        | 8520132   | 1092588   | 45186                       | 48.53%                          | 77.68%                          | 72.13%                         | 93.46%
MEAN         |           |           |                             | 34.45%                          | 63.64%                          | 50.70%                         | 77.69%
Our results tend to agree with those reported in [LBCM 1997] for programs larger than 50K. We found on average
that the 1% most frequent unique instructions account for approximately 34.45% of the code size, and the 10% most
frequent unique instructions account for approximately 63.64% of the size of the code.
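This coverage measurement can be sketched as follows. Because SPARC instructions are fixed-width, the share of code size covered by the most frequent unique instructions equals their share of instruction occurrences; the function and toy data below are illustrative, not the dcalculate.py source.

```python
from collections import Counter

def coverage(instructions, percent):
    """Fraction of all instruction slots covered by the `percent`% most
    frequent unique instructions (top-n computed by integer division of
    the unique-instruction count, as in the text)."""
    freq = Counter(instructions)
    top_n = max(1, len(freq) * percent // 100)
    covered = sum(count for _, count in freq.most_common(top_n))
    return covered / len(instructions)

# Skewed toy program: one word fills a third of the slots, 50 words are rare.
program = [0] * 50 + list(range(1, 51)) * 2
print(round(coverage(program, 10), 2))  # → 0.39
```

Even this small skewed example shows the top 10% of unique words covering far more than 10% of the slots, which is the effect the table above quantifies.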
The distribution of instructions is shown below for the program ftp in a list of pairs. The first element in each pair
reports the frequency at which one or more unique instructions occurred. The second element reports the number of
unique instructions to occur that many times in the program. The program ftp had 12226 total instructions of which
4503 were unique.
(332, 1), (268, 1), (250, 1), (198, 1), (161, 1), (110, 1), (103, 1), (96, 1), (95, 1), (83, 1), (81, 1),
(74, 2), (73, 2), (67, 1), (63, 1), (48, 2), (46, 1), (44, 2), (42, 1), (41, 1), (39, 1), (38, 1), (37, 2), (35,
2), (34, 1), (32, 2), (31, 1), (30, 1), (29, 4), (28, 2), (27, 5), (26, 5), (25, 4), (24, 1), (23, 2), (22, 4),
(21, 8), (20, 5), (19, 2), (18, 7), (17, 5), (16, 5), (15, 13), (14, 14), (13, 17), (12, 17), (11, 18), (10,
21), (9, 23), (8, 23), (7, 42), (6, 52), (5, 82), (4, 132), (3, 187), (2, 552), (1, 3216)
(frequency, number unique instructions) for ftp
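Pairs like those above can be produced from a list of raw instruction words with two nested frequency counts; the sketch below is illustrative rather than the actual tool used here.

```python
from collections import Counter

def distribution(instructions):
    """Return (frequency, number of unique instructions occurring at that
    frequency) pairs, most frequent first."""
    per_instruction = Counter(instructions)            # word -> its count
    per_frequency = Counter(per_instruction.values())  # count -> # of words
    return sorted(per_frequency.items(), reverse=True)

print(distribution([7, 7, 7, 3, 3, 9, 1, 1, 1]))  # → [(3, 2), (2, 1), (1, 1)]
```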
The distribution of instructions for ftp reiterates the uneven rate at which machine instructions appear in a typical
program.
Conclusion

Code compression can be employed to shrink the storage and memory requirements of programs. Additionally,
power requirements may lessen and performance may increase through increased effective cache sizes, fewer and
smaller bus transactions, and optimized bit patterns sent across the bus. These combine to produce embedded
systems that cost and weigh less, are smaller, and run faster. Code compression remains an active area of research
in 2002.
Two programs for code compression were developed and tested on 10 benchmark programs. Several original
findings emerged. On the SPARC architecture, as many as 45186 unique instructions appear per program, not
4096 as reported for the architecture used in [YSOO 1997]. The percentage of unique instructions varies per
program, with 60% being typical for small executables around 10K in size, 20% to 30% for 100K executables, and
20% for multi-megabyte executables. Code compression ratios are not a good indicator of overall compression
ratios. Both average code compression ratios and average overall compression ratios improve when dictionaries of
several sizes are used. While two dictionaries produce much better results than one, the benefits of more than two
dictionaries are not substantial.
Code compression ratios are lower (thus, better) for smaller programs and programs with a lower percentage of
unique instructions; however, overall compression ratios follow a different trend. Overall compression ratios are
better for medium to large programs with a lower percentage of unique instructions. Finally, the frequency at which
instructions appear in the programs is highly skewed, with the 1% most frequent unique instructions accounting for
about 34% of the code size, and the 10% most frequent unique instructions accounting for about 64% of the code
size.
References

• [BeMN 2001] Cached-code compression for energy minimization in embedded processors. Luca Benini,
Alberto Macii, Alberto Nannarelli. Proceedings of the 2001 International Symposium on Low Power
Electronics and Design. August 2001.
• [ChBM 1997] The Impact of Instruction Compression on I-cache Performance. I-Cheng K. Chen, Peter L.
Bird, Trevor Mudge. Technical Report CSE-TR-330-97, University of Michigan, 1997.
• [CPCM 2000] Java Bytecode Compression for Low-End Embedded Systems. Lars Raeder Clausen, Ulrik
Pagh Schultz, Charles Consel, Gilles Muller. ACM Transactions on Programming Languages and Systems.
May 2000.
• [CoMc 1999] Enhanced Code Compression for Embedded RISC Processors. Keith D. Cooper and Nathaniel
McIntosh. Proceedings of the 1999 International Conference on Programming Language Design and
Implementation.
• [Doug] The Compression Cache: Using On-Line Compression to Extend Physical Memory. Fred Douglis.
Matsushita Information Technology Laboratory.
• [Fano 1961] Transmission of Information. R.M. Fano. Cambridge, MA: MIT Press, 1961.
• [FrKi 1997] Slim Binaries. Michael Franz and Thomas Kistler. Communications of the ACM. December
1997.
• [EEFL 1997] Code Compression. Jens Ernst, William Evans, Christopher W. Fraser, Steven Lucco, Todd A.
Proebsting. Proceedings of the 1997 ACM SIGPLAN Conference on Programming Language Design and
Implementation. May 1997.
• [HePa 1990] Computer Architecture: A Quantitative Approach. J.L. Hennessy and D.A. Patterson. Morgan
Kaufmann Publishers, 1990.
• [JiCh 2000] Instruction Cache Compression for Embedded Systems. Yujia Jin and Rong Chen. 2000.
• [KoWo 1994] Compression of Embedded System Programs. Michael Kozuch and Andrew Wolfe.
Proceedings of the IEEE International Conference on Computer Design (ICCD). 1994.
• [LaCo 1999] Compiler-Driven Cached Code Compression Schemes for Embedded ILP Processors. Sergei Y.
Larin and Thomas M. Conte. Proceedings of the 32nd Annual International Symposium on
Microarchitecture, 1999.
• [LBCM 1997] Improving Code Density Using Compression Techniques. Charles Lefurgy, Peter Bird,
I-Cheng Chen, Trevor Mudge. Proceedings of the 30th Annual International Symposium on
Microarchitecture. December 1997.
• [LeHW 2000] Arithmetic Coding for Low Power Embedded System Design. Haris Lekatsas, Jörg Henkel,
Wayne Wolf. Princeton University, NEC USA, 2000.
• [LeHW 2000:2] Code Compression as a Variable in Hardware/Software Co-Design. Haris Lekatsas, Jörg
Henkel, Wayne Wolf. International Workshop on Hardware/Software Co-Design, 2000.
• [LeHW 2000:3] Code Compression for Low Power Embedded System Design. Haris Lekatsas, Jörg Henkel,
Wayne Wolf. Department of Electrical Engineering, Princeton University.
• [LeHW 2001] Design and simulation of a pipelined decompression architecture for embedded systems.
Haris Lekatsas, Jörg Henkel, Wayne Wolf. Proceedings of the International Symposium on Systems
Synthesis. September 2001.
• [LeWo 1998] Code Compression for Embedded Systems. Haris Lekatsas and Wayne Wolf. Department of
Electrical Engineering, Princeton University, 1998.
• [LiDK 1999] A text-compression-based method for code size minimization in embedded systems. Stan Liao,
Srinivas Devadas, Kurt Keutzer. ACM Transactions on Design Automation of Electronic Systems. January
1999.
• [LiWu 2000] A Low-Power CAM Design for LZ Data Compression. Kun-Jin Lin and Cheng-Wen Wu.
IEEE Transactions on Computers. October 2000.
• [readelf.html] readelf. http://wildsau.idv.uni-linz.ac.at/~k3032e4/virus-writing-HOWTO/_html/readelf.html.
• [Sayo 2000] Introduction to Data Compression, 2nd edition. Khalid Sayood. Morgan Kaufmann Publishers,
2000.
• [terran.txt] http://hcunix.org/hcunix/terran.txt.
• [ViZA 2000] Dynamic Zero Compression for Cache Energy Reduction. Luis Villa, Michael Zhang, Krste
Asanovic. 33rd International Symposium on Microarchitecture, December 2000.
• [WoCh 1992] Executing Compressed Programs on an Embedded RISC Architecture. Andrew Wolfe and
Alex Chanin. Proceedings of the 25th Annual International Symposium on Microarchitecture, pp. 81-91.
December 1992.
• [YSOO 1997] An Object Code Compression Approach to Embedded Processors. Yukihiro Yoshida, Bae-Yu
Song, Hiroyuki Okuhata, Takao Onoye, Isao Shirakawa. Proceedings of the 1997 International Symposium
on Low Power Electronics and Design. 1997.
• [YSOO 1996] Low-Power Consumption Architecture for Embedded Processor. Yukihiro Yoshida, Bae-Yu
Song, Hiroyuki Okuhata, Takao Onoye, Isao Shirakawa. 2nd International Conference on ASIC. 1996.
Appendix A: Result Data

The following 7 pages contain the complete result data found for the benchmark programs.
Appendix B: Source Code

The following 11 pages contain the source code for the programs dcount.py and dcompress.py and the module
compr_inst.py that were used to collect data and perform compression.
Appendix C: Typical Program Output

The following 4 pages show the output for the program dcompress.py when run on the machine code for the
program vim.