Binary-Level Lightweight Data Integration to Develop Program Understanding Tools for Embedded Software in C

K. Gondow (Titech, Japan), T. Suzuki (Elmic System Inc, Japan), H. Kawashima (JAIST, Japan)

2004/12/2 APSEC@BUSAN



Overview

Problems:
  Imprecision in C tools.
  High development cost of C tools.

Our solution: binary-level lightweight data integration.
  As a testbed, DWARF2 is used for developing:
    dxref, rxref: cross-referencers
    bscg: a call-graph extractor


Imprecision in C tools (1/3)

e.g., GNU GLOBAL cannot distinguish a variable 'foo' from a label 'foo'. Users must select one from a candidate list. This is because GNU GLOBAL only partially analyzes source code in order to run very fast.

test.c:

int main (void)
{
    int foo;
    foo: goto foo;
}

Candidate list shown after clicking 'foo':

foo 3 test.c int foo;
foo 4 test.c foo: goto foo;


Imprecision in C tools (2/3)

e.g., Murphy's study: "An Empirical Study of Static Call Graph Extractors", by Murphy, et al., ICSE, 1996.

It reports that "call graphs extracted by several broadly distributed tools vary significantly enough to surprise many experienced software engineers."


Imprecision in C tools (3/3)

(Figure: comparison of the calls reported by cflow and Field, split into cflow∩Field, cflow-Field, and Field-cflow.)

Quantitative results from mosaic, quoted from Murphy's paper.


Why imprecision? (1/2)

Reason #1: many tools only partially parse source code, resulting in incomplete analysis.
  e.g., GNU GLOBAL, cxref, LXR, cscope, cflow...

At first glance, full parsing seems to solve this problem, but...


Why imprecision? (2/2)

Reason #2: C source code is difficult to fully analyze because of:
  Compiler-specific extensions.
    e.g., asm for inline assembly code
  Ambiguous behaviors in the C standards.
    undefined, unspecified, implementation-defined.
    e.g., padding in a structure.


Compiler-specific extensions

Essential in C and embedded software.
  e.g., asm is used to obtain H/W error code.
  e.g., long long is used in C89's <stdio.h>
They make it hard to analyze source code.
  Different compilers have different semantics.

void page_fault_handler (uint32_t error)
{
    uint32_t cr2;
    /* IA-32 control register #2 */
    asm volatile ("movl %%cr2,%0" : "=r"(cr2));
    ...
}


Ambiguous behaviors in C (1/2)

Intentional and essential to keep C compilers fast and simple.

e.g., padding in a structure is an implementation-defined behavior. This makes pointer-analysis hard.

"Pointer analysis for programs with structures and casts", by Suan Hsi Yong, et al, PLDI'99.


Ambiguous behaviors in C (2/2)

Different padding on different platforms.

To obtain precise dataflow, tools need to know the padding values of the compiler.

But it is hard...

struct S { char c; int *ip; } *p;
struct T { char c; int i; } t;
t.i = 0x1234;
p = (struct S *)&t;
printf ("%p\n", p->ip);

(Figure: memory layouts of struct S and struct T, with field c followed by padding and then ip or i, on Solaris8 (32bit) and Solaris8 (64bit). The figure marks whether p->ip depends on t.i or not.)
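The layout difference described on this slide can be observed by asking the compiler itself through offsetof and sizeof. The following is a minimal sketch under that assumption, not the authors' method; the helper names ip_overlaps_i and print_layout are hypothetical and introduced only for illustration.

```c
#include <stddef.h>
#include <stdio.h>

struct S { char c; int *ip; };
struct T { char c; int i; };

/* Hypothetical helper: returns 1 if the bytes of S.ip overlap the bytes
   of T.i when a struct T object is viewed through a struct S pointer,
   and 0 otherwise. The answer depends on the compiler's padding. */
int ip_overlaps_i (void)
{
    size_t ip_lo = offsetof (struct S, ip);
    size_t ip_hi = ip_lo + sizeof (int *);
    size_t i_lo = offsetof (struct T, i);
    size_t i_hi = i_lo + sizeof (int);
    return ip_lo < i_hi && i_lo < ip_hi;
}

/* Hypothetical helper: prints the offsets chosen by the compiler in use. */
void print_layout (void)
{
    printf ("offsetof(struct S, ip) = %zu\n", offsetof (struct S, ip));
    printf ("offsetof(struct T, i)  = %zu\n", offsetof (struct T, i));
    printf ("S.ip %s T.i\n",
            ip_overlaps_i () ? "overlaps" : "does not overlap");
}
```

Calling print_layout () from a small main on each target platform shows the padding that compiler applies. Where int and int * have the same size and alignment the two fields are expected to overlap; where pointers are wider they may not.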


Possible solutions

To modify compilers (e.g. GCC) to emit their analyzed internal data.
  Seemingly high development cost.
  Many compilers to be modified.

To use binary information in executables emitted by compilers.
  Relatively easy, although it lacks some information, e.g., statements.


Our solution and result

Our solution: uses DWARF2 debugging information as binary information.

Preliminary experiment: good result for our cross-referencers and call-graph extractor.
  Better precision, although:
    some false negatives increased.
    quantitative results are not yet obtained.


Demonstration

Using DWARF2, we implemented:
  two cross-referencers:
    dxref: only uses DWARF2. Sample output: dxref
    rxref: hybrid of dxref and GNU GLOBAL. Sample output: dxref
  a static call-graph extractor:
    bscg: uses DWARF2 and a disassembler. Sample outputs: fact, dxref, bash, bash


DWARF2-XML

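(Figure: C code is compiled into an ELF/DWARF2 binary containing text, data, symbol info., relocation info., and debug info. The debug information is extracted into the common format DWARF2-XML (data integration), which is then used by dxref and rxref (cross-referencers) and by bscg (call graph extractor).)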


How bscg works

(1) extract call instructions by disassembling text.
(2) convert addresses to symbols using DWARF2.
(3) trim call graphs according to options.
(4) output graph topology in DOT of Graphviz.

Example:
  after (1): 1234: call 5678
  after (2): main: call fact
  after (4): digraph G { main -> fact; fact -> fact; }
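Steps (2) and (4) above can be sketched in a few lines of C. This is an illustration under stated assumptions, not bscg's actual implementation: the records func_range and call_site and the functions addr_to_symbol and emit_dot are hypothetical, and the address ranges are assumed to have been read already from DW_AT_low_pc and DW_AT_high_pc.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record for one function's address range. */
struct func_range {
    unsigned long low_pc;
    unsigned long high_pc;   /* treated here as an exclusive upper bound */
    const char *name;
};

/* Hypothetical record for one disassembled call instruction. */
struct call_site {
    unsigned long addr;      /* address of the call instruction */
    unsigned long target;    /* address being called */
};

/* Step (2): map an address to the enclosing function name, or NULL. */
const char *addr_to_symbol (const struct func_range *funcs, size_t nfuncs,
                            unsigned long addr)
{
    for (size_t i = 0; i < nfuncs; i++)
        if (funcs[i].low_pc <= addr && addr < funcs[i].high_pc)
            return funcs[i].name;
    return NULL;
}

/* Step (4): write the call graph in DOT; returns the number of edges. */
int emit_dot (FILE *out, const struct func_range *funcs, size_t nfuncs,
              const struct call_site *calls, size_t ncalls)
{
    int edges = 0;
    fprintf (out, "digraph G {\n");
    for (size_t i = 0; i < ncalls; i++) {
        const char *caller = addr_to_symbol (funcs, nfuncs, calls[i].addr);
        const char *callee = addr_to_symbol (funcs, nfuncs, calls[i].target);
        if (caller == NULL || callee == NULL)
            continue;        /* unresolved address: skip the edge */
        fprintf (out, "  %s -> %s;\n", caller, callee);
        edges++;
    }
    fprintf (out, "}\n");
    return edges;
}
```

A linear search keeps the sketch short; a real tool handling programs the size of bash would more likely sort the ranges and use binary search.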


Advantages of bscg

Advantages of binary-level DI (explained later).
  e.g., high applicability and few false positives.
Can identify inlined functions.
Can extract a call from asm ("call fact");
Can exclude:
  library functions: e.g., printf
  system calls: e.g., open, fork
  functions in runtime systems: e.g., _start, _fini
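The exclusion of library functions, system calls, and runtime functions listed above could be approximated with a simple name filter applied to each edge. How bscg actually decides what to exclude is not stated in these slides; the explicit name list and the helpers is_excluded and keep_edge below are assumptions made purely for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if name appears in the exclusion
   list, and 0 otherwise. */
int is_excluded (const char *name, const char *const *exclude, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp (name, exclude[i]) == 0)
            return 1;
    return 0;
}

/* Hypothetical helper: keeps a call-graph edge only if neither the
   caller nor the callee is excluded. */
int keep_edge (const char *caller, const char *callee,
               const char *const *exclude, size_t n)
{
    return !is_excluded (caller, exclude, n)
        && !is_excluded (callee, exclude, n);
}
```

With an exclusion list such as printf, open, fork, _start, and _fini, the edge from main to fact is kept while the edge from main to printf is dropped.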


Disadvantages of bscg

No support for macro calls, signals, function pointers, optimization.
  gprof-callgraph.pl can handle function pointers, since it uses dynamic information.
  Source-level ones (e.g., cflow) don't suffer from the optimization problem.


So, is bscg good?

Yes! (not the best, of course) Not easy to compare.


What is binary-level DI?

Provides common formats by extracting information from binary code.

(Figure: source-level DI analyzes source code (*.c); binary-level DI analyzes binary code (a.out), which is obtained by compiling the source. Both provide common formats, e.g., DWARF2-XML, to tools.)


Why binary-level DI?

Many advantages:
  High applicability.
  Few false positives.
  More true positives for low-level info.
  Low development cost.

Can improve C tools' precision.


What is lightweight DI?

Allows several common formats.
  To be practical! Hard to perfectly integrate.

(Figure labels: light-weight DI, heavy-weight DI, DWARF2-XML.)


Summary

Imprecision in C tools.

Our solution: binary-level lightweight data integration.
  As a testbed, DWARF2 is used for developing:
    dxref, rxref: cross-referencers
    bscg: call-graph extractor


Future works

Apply our technique to other tools:
  e.g., memory profilers, slicers, test coverage tools, ...
Develop new binary formats suitable for lower CASE tools.
  tool-information carrying code.
  cf. proof-carrying code, model-carrying code, schedule-carrying code.


Taxonomy of cross referencers.

Source-level
  Partial-parsing: GNU GLOBAL, LXR, ...
  Full-parsing: Sapid, ACML
Binary-level
  Symbol tables: Visual Studio .NET(?)
  Debug info.: dxref
Hybrid: rxref


What is DWARF2?

A binary format for debugging information.

Primary target languages: C, C++, Fortran, Modula2, Pascal.

Includes: types, nested blocks, line numbers, function/object names, addresses, stack frame information, ...


DWARF2-XML

Our common format in XML for DWARF2.

A testbed of binary-level lightweight DI.

Makes it easier to process DWARF2. cf. libdwarf

About 15 times larger than DWARF2.


DWARF2-XML example

C source: { int i; ... }

<section name=".debug_info">
  <tag name="DW_TAG_lexical_block" offset="id:27">
    <attribute name="DW_AT_low_pc" value="67328"/>
    <attribute name="DW_AT_high_pc" value="67356"/>
    ...
    <tag name="DW_TAG_variable" offset="id:27">
      <attribute name="DW_AT_name" value="i"/>
      <attribute name="DW_AT_type" value_ref="id:161">
      <attribute name="DW_AT_location">
        <description>DW_OP_fbreg: -24</description></></></></>
  ...
  <tag name="DW_TAG_base_type" offset="id:161">
    <attribute name="DW_AT_name" value="int"/>
    <attribute name="DW_AT_byte_size" value="4"/>
    <attribute name="DW_AT_encoding" value="5">
      <description>signed</description></></></>

Callouts on the slide: address range (DW_AT_low_pc, DW_AT_high_pc), variable name (DW_AT_name), offset to base ptr. (DW_OP_fbreg), ID/IDREF link (value_ref to offset).
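The ID/IDREF link in the example above (value_ref="id:161" pointing at offset="id:161") is what lets a tool go from a variable to its type. A minimal sketch of that lookup follows. It assumes the XML has already been loaded into a flat array; the struct die and the functions find_by_id and type_name_of are hypothetical simplifications and not part of DWARF2-XML or its tools.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified in-memory view of one <tag> element. */
struct die {
    const char *id;        /* value of the offset attribute, e.g. "id:161" */
    const char *tag;       /* e.g. "DW_TAG_variable" */
    const char *name;      /* DW_AT_name, or NULL if absent */
    const char *type_ref;  /* value_ref of DW_AT_type, or NULL if absent */
};

/* Finds the entry whose id equals the given reference, or NULL. */
const struct die *find_by_id (const struct die *dies, size_t n,
                              const char *id)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp (dies[i].id, id) == 0)
            return &dies[i];
    return NULL;
}

/* Follows the DW_AT_type link of the named variable and returns the
   name of the referenced type, or NULL if it cannot be resolved. */
const char *type_name_of (const struct die *dies, size_t n,
                          const char *var_name)
{
    for (size_t i = 0; i < n; i++) {
        const struct die *d = &dies[i];
        if (strcmp (d->tag, "DW_TAG_variable") != 0)
            continue;
        if (d->name == NULL || strcmp (d->name, var_name) != 0)
            continue;
        if (d->type_ref == NULL)
            return NULL;
        const struct die *t = find_by_id (dies, n, d->type_ref);
        return t != NULL ? t->name : NULL;
    }
    return NULL;
}
```

For the two entries shown on the slide, looking up the variable i follows the link to the DW_TAG_base_type entry and yields its name, int.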


DWARF2-XML file sizes

About 15 times larger than DWARF2.
  Size increase is almost cancelled by gzip.
Consumes much memory when using DOM.
  e.g., we cannot build DOM tree for gdb in our environment.
  Tradeoff between memory consumption and low development cost.

program      source   a.out    .debug_*   DWARF2-XML   DWARF2-XML compressed by gzip
x_debug.c    27KB     77KB     50KB       1.1MB        58KB
readelf+.c   315KB    575KB    137KB      2.1MB        128KB
bash         1.2MB    2.9MB    705KB      16.3MB       815KB
gdb          12MB     21.5MB   14.4MB     276MB        14MB

gdb's LOC is about 400,000.


Execution speed

bscg is slower than the others, but acceptable for practical use: 12000 lines in 8.8 sec.

However, it is too slow in the case of bash-2.03.

bscg has a scalability problem due to the heavy overhead of the DOM library.


Why XML?

Highly readable, portable, interoperable.
  Plain text and self-descriptive.
Powerful enough to describe complex structures and relations in programs.
  Nested tags and ID/IDREF links.
  DTD for checking XML documents.
  Flexibility to process semi-structured documents.
Easy to query/display/modify.
  XML parsers, DOM/SAX, XPath.
  XPath's description is much smaller than boring tree traversal code.


Drawbacks in API integration

Insufficient abstraction.
  Many and various data structures and access patterns make it hard to encapsulate them well into a fixed API.
  e.g., poor API in libdwarf to traverse a wide range of data tree (only dwarf_siblingof and dwarf_child are provided).
High cost to implement API in many languages.
High cost to learn how to use API.
  e.g., libdwarf


false/true positive/negative

false positives: tool's incorrect output.

true positives: tool's correct output.

false negatives: tool's incorrect silence. The tool should have produced output, but did not.

true negatives: tool's correct silence. The tool should not have produced output, and did not.
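The four definitions above reduce to two yes/no questions: did the tool report the fact, and does the fact actually hold? A minimal sketch of that classification follows; the enum outcome and the function classify are hypothetical names introduced for illustration.

```c
enum outcome {
    TRUE_POSITIVE,
    FALSE_POSITIVE,
    FALSE_NEGATIVE,
    TRUE_NEGATIVE
};

/* reported: nonzero if the tool produced the output (e.g. a call edge).
   actual:   nonzero if that output is correct for the program. */
enum outcome classify (int reported, int actual)
{
    if (reported)
        return actual ? TRUE_POSITIVE : FALSE_POSITIVE;
    return actual ? FALSE_NEGATIVE : TRUE_NEGATIVE;
}
```

For a call-graph extractor, reporting an edge that does not exist is classified as FALSE_POSITIVE, and staying silent about an edge that does exist is classified as FALSE_NEGATIVE.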


bscg's graph trimming options


Why lightweight DI?

To be practical! Hard to perfectly integrate.

Supported by the fact that most technologies gave up on perfect integration/definition.
  e.g., undefined behaviors in C.
  e.g., GNU BFD gives an API integrating different binary formats.
    Useful, but not perfect: cannot convert ELF/DWARF2 into Windows PE.


Why is function pointer analysis difficult in C?

Pointer arithmetic and casting.
  e.g., (int (*)())(base + offset)
Dynamic library.
  e.g., handle = dlopen (libname, RTLD_LAZY); func = dlsym (handle, funcname); func ();
Inline assembly code.
  e.g., asm ("call foo");


CASE tools development cost

Generally very high.
  individual parsers & analyzers.
  internal data is less interoperable and portable.
IBM Eclipse: $40,000,000 (?)


E.g., function pointer

Cflow: apply calls f (false positive)
gprof-callgraph.pl: apply calls add5 (true positive)
Other tools (bscg): apply calls ? (false negative)

int add5 (int x) { return x + 5; }
int apply (int (*f)(int), int x) { return f (x); }
int main (void) { return apply (add5, 10); }


Our homepage

http://www.sde.cs.titech.ac.jp/~gondow/dwarf2-xml/
  DTD for DWARF2-XML
  Source code of readelf+, dxref, rxref, bscg
  Some sample outputs