Mining Specifications (lots of) code specifications of correctness

Post on 12-Jan-2016


Mining Specifications

Glenn Ammons (Univ. of Wisconsin), Ras Bodík (Univ. of Wisconsin), Jim Larus (Microsoft Research)

[Figure: (lots of) code → specifications of correctness]

2

Motivation: why specifications?

Verification tools
• find bugs early
• make guarantees
• scale with programs
• need specifications

[Figure: programs and specifications feed a verifier, which reports bugs]

3

Language-usage specifications

•array accesses
•memory allocation
•type safety
•...

Easy to write, big payoff

[Figure: programs and language-usage specifications feed a verifier, which reports bugs]

4

Library-usage specifications

•cut-and-paste (X11)
•network server (socket API)
•device drivers (kernel API)
•...

Harder to write, smaller payoff

[Figure: programs and library-usage specifications feed a verifier, which reports bugs]

5

Program specifications

•symbol table well-formed
•IR well-formed
•...

Hardest to write, smallest payoff

[Figure: one program and its program specifications feed a verifier, which reports bugs]

6

Solution: specification mining

Specification mining gleans specifications from artifacts of program development:

• From programs (static)?
• From executions of test cases (dynamic)?
• From other artifacts?

7

Mining from traces

Advantages:
• No infeasible paths
• Pointer/alias analysis is easy
• Few bugs, as program passes its tests
• Common behavior is correct behavior

...
socket(domain = 2, type = 1, proto = 0, return = 7)
accept(so = 7, addr = 0x40, addr_len = 0x50, return = 8)
write(so = 8, buf = 0x100, len = 23, return = 23)
read(so = 8, buf = 0x100, len = 12, return = 12)
close(so = 8, return = 0)
close(so = 7, return = 0)
...

8

Output: a specification

[Figure: specification automaton from start to end, with transitions socket(return = X), accept(so = X, return = Y), read(so = Y), write(so = Y), close(so = Y), close(so = X)]

Specification says what programs should do:
•Temporal dependences (accept follows socket)
•Data dependences (accept input is socket output)

9

How we mine specifications

[Figure: mining pipeline. Traces → extract scenarios → scenarios (dependence graphs over socket, accept, read, write, close) → standardize → strings (e.g. ACEGB) → PFSA learner → PFSA with weighted transitions → postprocess → specification]

10

Outline of the talk

• The specification mining problem
• Our specification mining system
  • Annotating traces with dependences
  • Extracting and standardizing scenarios
  • Probabilistic learning and postprocessing
• Experimental results
• Related work

11

An impossible problem

Find a Turing machine that generates C, given T.

[Figure: sets I (all traces), C (all correct traces), T (training traces)]

Unsolvable:
• No restrictions on C
• No connection between C and T
• Simple variants are also undecidable [Gold67]

12

A simpler problem

Find a PFSA that generates an approximation of P.

Tractable, plus
• Scenarios are small
• Noise handled
• Finite-state
• Weights useful for postprocessing

[Figure: P, a probability distribution over all scenarios; probability axis from 0 to 1; regions labeled "Correct scenarios" and "Noise"]

15

Outline of the talk

• The specification mining problem
• Our specification mining system
  • Annotating traces with dependences
  • Extracting and standardizing scenarios
  • Probabilistic learning and postprocessing
• Verifying traces
• Experimental results
• Related work

16

Dependence annotation

[Figure: traces → dependence annotator → annotated traces]

Definers:
• socket.return
• accept.return
• close.so

Users:
• accept.so
• read.so
• write.so
• close.so

socket(domain = 2, type = 1, proto = 0, return = 7)
accept(so = 7, addr = 0x40, addr_len = 0x50, return = 8)
write(so = 8, buf = 0x100, len = 23, return = 23)
close(so = 8, return = 0)
close(so = 7, return = 0)
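The annotation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Event` type, the `annotate` function, and the rule that dependences are found by matching equal values are all hypothetical, not the authors' implementation. The definer and user sets are the ones listed on the slide.

```python
from typing import NamedTuple

class Event(NamedTuple):
    name: str    # call name, e.g. "socket"
    attrs: dict  # attribute name -> concrete value

# Definer and user attributes, as listed on the slide.
DEFINERS = {("socket", "return"), ("accept", "return"), ("close", "so")}
USERS = {("accept", "so"), ("read", "so"), ("write", "so"), ("close", "so")}

def annotate(trace):
    """Return dependence edges (kind, from_index, to_index, value).

    Assumption: two attributes are related when they carry the same value.
    """
    last_def = {}  # value -> index of the most recent event that defined it
    edges = []
    for i, ev in enumerate(trace):
        # Uses are resolved first: an event reads before it redefines.
        for attr, val in ev.attrs.items():
            if (ev.name, attr) in USERS and val in last_def:
                edges.append(("def-use", last_def[val], i, val))
        for attr, val in ev.attrs.items():
            if (ev.name, attr) in DEFINERS:
                if val in last_def:
                    edges.append(("def-def", last_def[val], i, val))
                last_def[val] = i
    return edges

TRACE = [
    Event("socket", {"domain": 2, "type": 1, "proto": 0, "return": 7}),
    Event("accept", {"so": 7, "addr": 0x40, "addr_len": 0x50, "return": 8}),
    Event("write", {"so": 8, "buf": 0x100, "len": 23, "return": 23}),
    Event("close", {"so": 8, "return": 0}),
    Event("close", {"so": 7, "return": 0}),
]

for edge in annotate(TRACE):
    print(edge)
```

On the slide's trace this links accept to socket through the value 7, and write and close(so = 8) to accept through the value 8.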

19


20

Extracting scenarios

[Figure: annotated traces and seeds → scenario extractor → abstract scenarios]

socket(domain = 2, type = 1, proto = 0, return = 7)
accept(so = 7, addr = 0x40, addr_len = 0x50, return = 8)
write(so = 8, buf = 0x100, len = 23, return = 23)
close(so = 8, return = 0)
close(so = 7, return = 0)

23

Simplifying scenarios

socket(domain = 2, type = 1, proto = 0, return = 7) [seed]
accept(so = 7, addr = 0x40, addr_len = 0x50, return = 8)
write(so = 8, buf = 0x100, len = 23, return = 23)
close(so = 8, return = 0)
close(so = 7, return = 0)

Drops attributes not used in dependences:

socket(return = 7) [seed]
accept(so = 7, return = 8)
write(so = 8)
close(so = 8)
close(so = 7)
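Simplification amounts to a filter over each call's attributes. The sketch below is illustrative only: `simplify` and `DEPENDENCE_ATTRS` are hypothetical names, and the attribute set is assumed to be the union of the definers and users from the dependence annotation slides.

```python
# Attributes that take part in dependences (assumed: definers plus users).
DEPENDENCE_ATTRS = {
    ("socket", "return"), ("accept", "return"), ("accept", "so"),
    ("read", "so"), ("write", "so"), ("close", "so"),
}

def simplify(scenario, dependence_attrs=DEPENDENCE_ATTRS):
    """Keep only the attributes that appear in some dependence."""
    return [
        (name, {a: v for a, v in attrs.items() if (name, a) in dependence_attrs})
        for name, attrs in scenario
    ]

SCENARIO = [
    ("socket", {"domain": 2, "type": 1, "proto": 0, "return": 7}),
    ("accept", {"so": 7, "addr": 0x40, "addr_len": 0x50, "return": 8}),
    ("write", {"so": 8, "buf": 0x100, "len": 23, "return": 23}),
    ("close", {"so": 8, "return": 0}),
    ("close", {"so": 7, "return": 0}),
]

for name, attrs in simplify(SCENARIO):
    print(name, attrs)
```

The result matches the simplified scenario on the slide: buffer addresses, lengths, and the return values of write and close disappear.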

25

Standardizing scenarios

[Figure: standardization maps equivalent simplified scenarios to a single abstract scenario]

Two transformations:
•Naming: foo(val = 7) → foo(val = X)
•Reordering: foo(); bar(); → bar(); foo();

Finds the least standardized scenario, in lexicographic order

26

Standardizing scenarios

Simplified scenario, with use-def and def-def dependences:

socket(return = 7) [seed]
accept(so = 7, return = 8)
write(so = 8)
read(so = 8)
close(so = 8)
close(so = 7)

Reorder:

socket(return = 7) [seed]
accept(so = 7, return = 8)
read(so = 8)
write(so = 8)
close(so = 8)
close(so = 7)

Name:

socket(return = X) [seed]
accept(so = X, return = Y)
read(so = Y)
write(so = Y)
close(so = Y)
close(so = X)

Each interaction is a letter to the PFSA learner.

30


31

PFSA learning

[Figure: abstract scenarios → automaton learner → specification]

Algorithm due to Raman et al.:
1. Build a weighted retrieval tree
2. Merge similar states

[Figure: a weighted retrieval tree over the letters A to G with edge weights 100, 99 and 1; merging similar states combines the two G transitions (weights 99 and 1) into one G transition of weight 100]
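Step 1 of the learner, the weighted retrieval tree, is a prefix tree whose edges count how often they are traversed. The sketch below shows only that step; the state-merging step is omitted. The names `Node`, `build_tree`, and `probability` are hypothetical, and the 99-to-1 training set is made up to echo the weights in the figure.

```python
from collections import defaultdict

class Node:
    def __init__(self):
        self.children = {}              # letter -> child Node
        self.counts = defaultdict(int)  # letter -> times the edge was taken
        self.ends = 0                   # strings that end at this node

def build_tree(strings):
    """Build a weighted retrieval tree (a prefix tree with edge counts)."""
    root = Node()
    for s in strings:
        node = root
        for letter in s:
            node.counts[letter] += 1
            node = node.children.setdefault(letter, Node())
        node.ends += 1
    return root

def probability(root, s):
    """Probability the tree assigns to s, reading counts as frequencies."""
    node, p = root, 1.0
    for letter in s:
        if letter not in node.children:
            return 0.0
        total = sum(node.counts.values()) + node.ends
        p *= node.counts[letter] / total
        node = node.children[letter]
    total = sum(node.counts.values()) + node.ends
    return p * node.ends / total

# Made-up training strings: 99 common scenarios and 1 rare one.
tree = build_tree(["ABC"] * 99 + ["ABD"])
print(tree.counts["A"])          # weight of the first edge
print(probability(tree, "ABC"))  # frequency of the common scenario
print(probability(tree, "ABD"))  # frequency of the rare scenario
```

Reading the counts as relative frequencies is what makes the tree a (very specific) PFSA; merging similar states then generalizes it.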

35

Postprocessing: coring

1. Remove infrequent transitions
2. Convert PFSA to NFA

[Figure: the PFSA before coring, with edge weights 100, 99 and 1, and the resulting unweighted NFA over the letters A to G]
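The two coring steps can be sketched as a filter followed by dropping the weights. This assumes the simplest possible rule, a fixed weight threshold; the slides do not say how "infrequent" is decided, so `min_weight`, the `core` function, and the example automaton are all hypothetical.

```python
def core(transitions, min_weight):
    """Drop transitions below min_weight, then drop the weights.

    transitions: iterable of (source, letter, target, weight).
    Returns the NFA as a set of (source, letter, target).
    """
    return {(src, letter, dst)
            for (src, letter, dst, weight) in transitions
            if weight >= min_weight}

# Made-up PFSA: a common path A B G and a rare path A C G.
PFSA = [
    (0, "A", 1, 100),
    (1, "B", 2, 99),
    (1, "C", 3, 1),
    (2, "G", 4, 99),
    (3, "G", 4, 1),
]

print(sorted(core(PFSA, min_weight=5)))
```

With a threshold of 5, only the common path survives, which is the "hot core" of this toy automaton.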

37


38

Where to find bugs?

• in programs (static verification)?

• or in traces (dynamic verification)?

39

How we verify specifications

[Figure: verification pipeline. Traces → extract scenarios → scenarios (dependence graphs over socket, accept, read, write, close) → standardize → strings (e.g. ACEGB) → check automaton membership]

40

Verifying traces

OK (both sockets closed):

...
socket(return = 7)
accept(so = 7, return = 8)
write(so = 8)
read(so = 8)
close(so = 8)
close(so = 7)
...

Bug! (socket 7 not closed):

...
socket(return = 7)
accept(so = 7, return = 8)
write(so = 8)
read(so = 8)
close(so = 8)
...

[Figure: specification automaton with transitions socket(return = X) [seed], accept(so = X, return = Y), read(so = Y), write(so = Y), close(fd = Y), close(fd = X)]
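Checking a standardized scenario against the specification is NFA membership. The sketch below simulates an NFA on a sequence of letters. The automaton's shape (a read/write loop between accept and close) is one plausible reading of the figure, not a transcription of it, and `accepts`, `SPEC`, and the state numbers are hypothetical.

```python
def accepts(transitions, start, finals, word):
    """Simulate an NFA given as a set of (source, symbol, target) triples."""
    states = {start}
    for symbol in word:
        states = {dst for (src, sym, dst) in transitions
                  if src in states and sym == symbol}
        if not states:
            return False  # no transition for this symbol: reject early
    return bool(states & finals)

# Assumed shape of the socket specification.
SPEC = {
    (0, "socket(X)", 1),
    (1, "accept(X,Y)", 2),
    (2, "read(Y)", 2),
    (2, "write(Y)", 2),
    (2, "close(Y)", 3),
    (3, "close(X)", 4),
}

ok = ["socket(X)", "accept(X,Y)", "write(Y)", "read(Y)", "close(Y)", "close(X)"]
bug = ok[:-1]  # the listening socket X is never closed

print(accepts(SPEC, 0, {4}, ok))   # scenario satisfies the specification
print(accepts(SPEC, 0, {4}, bug))  # scenario is rejected
```

The second scenario is rejected because it stops before close(X), which mirrors the "socket 7 not closed" bug on the slide.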

41

Experimental results

Attempted to mine and verify two published X11 rules

Challenge: small, buggy training sets (16 programs)

42

Learning by trial and error

Start with a rule learned from one trusted trace. Then:

1. Randomly select an unused trace.
2. Trace obeys rule?
   • yes: select another trace
   • no: ask the expert
3. Expert: is trace buggy?
   • yes: report bug
   • no (rule too specific): add trace to training set; learn a new rule
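The flowchart can be written as a short loop. In this sketch the learner, the checker, and the expert are passed in as functions; `mine_by_trial_and_error` and its parameter names are hypothetical, and the toy "rule" in the usage example (a set of accepted traces) is only there to make the loop runnable.

```python
import random

def mine_by_trial_and_error(trusted, unused, learn, obeys, is_buggy, rng=None):
    """Grow a training set, asking the expert only about rejected traces.

    Returns (final rule, reported bugs, number of expert inspections).
    """
    rng = rng or random.Random(0)
    training = [trusted]
    rule = learn(training)
    pool = list(unused)
    rng.shuffle(pool)                # "randomly select an unused trace"
    bugs, inspected = [], 0
    for trace in pool:
        if obeys(rule, trace):
            continue                 # rule already explains this trace
        inspected += 1               # expert must look at the trace
        if is_buggy(trace):
            bugs.append(trace)       # report bug
        else:
            training.append(trace)   # rule was too specific
            rule = learn(training)
    return rule, bugs, inspected

# Toy stand-ins: a rule is just the set of traces seen in training.
rule, bugs, inspected = mine_by_trial_and_error(
    trusted="ab",
    unused=["ab", "ac", "bug1", "ac"],
    learn=lambda training: set(training),
    obeys=lambda rule, trace: trace in rule,
    is_buggy=lambda trace: trace.startswith("bug"),
)
print(rule, bugs, inspected)
```

Traces the current rule already accepts never reach the expert, which is where the savings in inspection effort come from.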

43

Results

1. A timestamp-passing rule
• 4 traces did not need inspection
• learned the rule! (compact: 7 states)
• bugs in 2 out of 16 programs (ups, e93)
• English specification was incomplete (3 traces)
• expert and corer agreed on 81% of the hot core

2. SetOwner(x) must be followed by GetSelection(x)
• failed to learn the rule (very small learning set), but
• bugs in 2 out of 5 programs (xemacs, ups)

44


45

Related work

Arithmetic pre/post conditions
• Daikon [Ernst et al], Houdini [Flanagan and Leino]
• properties orthogonal to ours
• eventually, we may need to include and learn some arithmetic relationships

Temporal relationships over calls
• intrusion detection: [Ghosh et al], [Wagner and Dean]
• software processes: [Cook and Wolf]
• error checking: [Engler et al SOSP 2001]
  • lexical and syntactic pattern matching
  • user must write templates (e.g., <a> always follows <b>)

• design patterns: [Reiss and Renieris]

46

Conclusion

• Introduced specification mining, a new approach for learning correctness specifications

• Refined the problem into a problem of probabilistic learning from traces

• Developed and demonstrated a practical specifications miner

47

End of talk

48

How we mine specifications

[Figure: program → tracer → instrumented program; instrumented program and test inputs → run → traces → dependence annotator → annotated traces]

...
socket(domain = 2, type = 1, proto = 0, return:T0 = 7)
[SETUP socket:T0 7]
accept(so:T0 = 7, addr = 0x40, addr_len = 0x50, return:T0 = 8)
[USE socket:T0 8]
close(so:T0 = 8, return = 0)
close(so:T0 = 7, return = 0)
...

49

How we mine specifications

Program

int s = socket(AF_INET, SOCK_STREAM, 0);
[DO SETUP]
while (cond1) {
    int ns = accept(s, &addr, &len);
    while (cond2) {
        [USE NS]
        if (cond3) return;
    }
    close(ns);
}
close(s);

[Figure: program → tracer → instrumented program]

51

How we mine specifications

[Figure: program → tracer → instrumented program; instrumented program and test inputs → run → traces → dependence annotator → annotated traces]

...
socket(domain = 2, type = 1, proto = 0, return = 7)
[SETUP socket 7]
accept(so = 7, addr = 0x40, addr_len = 0x50, return = 8)
[USE socket 8]
close(so = 8, return = 0)
close(so = 7, return = 0)
...

53

How we mine specifications

[Figure: the pipeline extended with the scenario extractor: annotated traces and scenario seeds → scenario extractor → abstract scenarios]

socket(return = X) [seed]
[SETUP socket X]
accept(so = X, return = Y)
[USE socket Y]
close(so = Y)
close(so = X)

54

How we mine specifications

[Figure: the full pipeline: program → tracer → instrumented program; with test inputs → run → traces → dependence annotator → annotated traces; with scenario seeds → scenario extractor → abstract scenarios → automaton learner → specification]

[Figure: specification automaton with transitions socket(return = X) [seed], [SETUP X], accept(so = X, return = Y), [USE Y], close(fd = Y), close(fd = X)]

55

Reducing the problem

The problem: find an automaton that generates C, given T.

[Figure: sets I (all traces), C (all correct traces), T (training traces)]

Issues:
•What if C is not r.e.?
•Checkers and learners need finite specs.

57

Reducing the problem

The problem: find an automaton that generates C, given T. Assume that C is regular.

Issue:
•What if the program is not regular?

[Figure: sets I (all traces), C (all correct traces, regular), T (training traces); previous step: Unrestricted]

58

Reducing the problem

The problem: find an automaton that generates CS, given TS. Assume that the size of scenarios is bounded.

Issue:
•No connection between CS and TS!

[Figure: sets IS (all scenarios, bounded size), CS (all correct scenarios, regular), TS (training scenarios); previous steps: Unrestricted, Regular]

59

Reducing the problem

The problem: find an automaton that generates CS, given TS. Assume that TS presents each element of CS at least once.

Issue:
•Undecidable [Gold67]

[Figure: sets IS (all scenarios, bounded size), CS (all correct scenarios, regular), TS = c0, c1, ...; previous steps: Unrestricted, Regular, Scenarios]

60

Reducing the problem

The problem: find a PFSA that generates P', where P and P' are close (by some distance metric). Assume P is generated by a PFSA.

[Figure: P, a probability distribution over IS (all scenarios), generated by a PFSA; probability axis from 0 to 1; TS = c0, c1, ...; previous steps: Unrestricted, Regular, Scenarios, Complete presentation]

61

Digression: postprocessing

• PFSA = NFA with weights
• Specification = NFA
• Convert PFSA to specification:
  1. Find hot core (that is, drop noise)
     • drop infrequent scenarios
     • drop infrequent parts of scenarios
  2. Drop weights

62

Preparing input traces

Typed attributes start with the type variables T0 to T5:

socket(domain = 2, type = 1, proto = 0, return:T0 = 7)
[SETUP socket:T1 7]
accept(so:T2 = 7, addr = 0x40, addr_len = 0x50, return:T3 = 8)
[USE socket:T4 8]
close(so:T5 = 8, return = 0)
close(so:T5 = 7, return = 0)

Successive builds of the slide unify the type variables one step at a time (T1 with T0, T2 with T0, T5 with T0, T4 with T3, then T3 with T0), ending with a single type:

socket(domain = 2, type = 1, proto = 0, return:T0 = 7)
[SETUP socket:T0 7]
accept(so:T0 = 7, addr = 0x40, addr_len = 0x50, return:T0 = 8)
[USE socket:T0 8]
close(so:T0 = 8, return = 0)
close(so:T0 = 7, return = 0)

68

Extracting scenarios

[Figure: annotated traces and seeds → scenario extractor → abstract scenarios]

socket(domain = 2, type = 1, proto = 0, return:T0 = 7)
[SETUP socket:T0 7]
accept(so:T0 = 7, addr = 0x40, addr_len = 0x50, return:T0 = 8)
[USE socket:T0 8]
close(so:T0 = 8, return = 0)
close(so:T0 = 7, return = 0)

71

Simplifying scenarios

socket(domain = 2, type = 1, proto = 0, return:T0 = 7) [seed]
[SETUP socket:T0 7]
accept(so:T0 = 7, addr = 0x40, addr_len = 0x50, return:T0 = 8)
[USE socket:T0 8]
close(so:T0 = 8, return = 0)
close(so:T0 = 7, return = 0)

Drop untyped attributes:

socket(return:T0 = 7) [seed]
[SETUP socket:T0 7]
accept(so:T0 = 7, return:T0 = 8)
[USE socket:T0 8]
close(so:T0 = 8)
close(so:T0 = 7)

73

Standardizing scenarios

Standardization puts equivalent scenarios into a canonical abstract form.

[Figure: standardization maps equivalent simplified scenarios to a single abstract scenario]

A search using two transformations:
•Naming: foo(val = 7) → foo(val = X)
•Reordering: foo(); bar(); → bar(); foo();

74

Standardizing scenarios

socket(return:T0 = 7) [seed]
[SETUP socket:T0 7]
accept(so:T0 = 7, return:T0 = 8)
[USE Y]
close(so:T0 = 8)
close(so:T0 = 7)

With write and read calls in the scenario:

socket(return:T0 = 7) [seed]
[SETUP socket:T0 7]
accept(so:T0 = 7, return:T0 = 8)
write(so:T0 = 8)
read(so:T0 = 8)
close(so:T0 = 8)
close(so:T0 = 7)

Reorder:

socket(return:T0 = 7) [seed]
[SETUP socket:T0 7]
accept(so:T0 = 7, return:T0 = 8)
read(so:T0 = 8)
write(so:T0 = 8)
close(so:T0 = 8)
close(so:T0 = 7)

Name:

socket(return:T0 = X) [seed]
[SETUP socket:T0 X]
accept(so:T0 = X, return:T0 = Y)
read(so:T0 = Y)
write(so:T0 = Y)
close(so:T0 = Y)
close(so:T0 = X)

Each interaction is a letter to the PFSA learner.

79

Coring

Coring removes PFSA transitions that occur infrequently and converts the PFSA into an NFA.

[Figure: abstract scenarios → automaton learner → specification; the cored automaton has transitions socket(return = X) [seed], [SETUP X], accept(so = X, return = Y), [USE Y], close(fd = Y), close(fd = X)]

80

Verification

The question is refined in three steps:
• Do all traces of a program P satisfy a specification A?
• Does a trace T satisfy A?
• Does a scenario S satisfy A?

Definition: T satisfies A if every seed in T is surrounded by a scenario that satisfies A.

[Figure: concrete scenarios satisfying A → simplification → simplified scenarios satisfying A → standardization → abstract scenarios satisfying A = language of A; is S among them?]

83

Experiments

• What we wanted to find out
  • Hypothesis 1: the process will find bugs and reduce the number of traces that the expert must inspect.
  • Hypothesis 2: the miner's final specification will match the English rule.
  • Hypothesis 3: the corer and the human will agree on the hot core.
• Gathered traces from 16 programs:
  • 5 programs in the X11 distribution and
  • 11 contributed programs

84

Testing vs. verification

testing: [Figure: inputs → program → is the output correct?]

verification: [Figure: program and property → checker → does property hold?]

sample properties (e.g. for X11, sockets):
• allocated memory is freed.
• locks are released.
• …

85

Testing vs. verification

Completeness ("coverage"):
• verification (if sound) guarantees that the program contains no bugs of a well-specified class.

          testing   verification
aspects   all       some
control   some      all
data      some      all

[Label in figure: our focus]

86

Verification: recent successes

Recent successes:
• specification languages: temporal logics, automata, …
• abstractors: SLAM, FeaVer
• checkers: model checking, theorem proving, type systems

What's still missing? Specifications.

[Figure: program (L1) → abstractor → abstract program → checker → property holds?; the property is a formal specification of correctness (L2)]

87

So who formulates specifications?

Programmers? Probably not.

Why they won't:
• too busy: yet another language to learn?
• specifications aren't cool.
• specification languages are hard: LTL, anyone?

Why they shouldn't:
• may misunderstand usage rules.
• may not know all usage rules.

Mining Specifications:
• Convenient and easy: anyone can do it
• As in data mining, discovers surprising rules.

88

Advantages of mining

Exploits the massive programmers' effort reflected in the code.
• Programmers resolved many problems:
  • incomplete system requirements.
  • incomplete API documentation.
  • implementation-dependent API rules.
• Want redundancy? (without redundant programming)
  • ask multiple programmers (and vote).

Exploits the testers' effort in devising test inputs.

89

Our output: a specification

[Figure: specification automaton over x = socket(), bind(x), listen(x), y = accept(x), read(y), write(y), close(y), close(x)]

90

How do we mine?

Underlying premise:

Even bad software is debugged enough to show hints of correct behavior.

Maxim: Common usage is the correct usage.

91

Mining = machine learning

Reduce the problem to the well-known problem of learning regular languages.

Obstacles:
1. source code is too detailed and hard to analyze
2. what is "common" behavior?

Solutions:
1. learn from dynamic behavior
2. learn probabilistically

That is: learn probabilistic FSMs from traces.

92

Input: trace(s)

7 = socket(2, 1, 0);
bind(7, 0x400120, 16);
listen(7, 5);
8 = accept(7, 0x400200, 0x400240);
read(8, 0x400320, 255);
write(8, 0x400320, 12);
read(8, 0x400320, 255);
write(8, 0x400320, 7);
close(8);
10 = accept(7, 0x400200, 0x400240);
read(10, 0x400320, 255);
write(10, 0x400320, 13);
close(10);
close(7);
……

[Figure: several such traces yield the specification automaton over x = socket(), bind(x), listen(x), y = accept(x), read(y), write(y), close(y), close(x)]

93

The mining algorithm

[Figure: dynamic execution (traces) → trace abstraction → usage scenarios (strings) → off-the-shelf RegExp learner → generalized scenarios (probabilistic FSA) → user: extract heavy core (and approve) → specification (NFA) → dynamic checker, which takes a dynamic execution to be checked (trace) and reports OK/bug]

94

Trace abstraction: 4 challenges

• Traces interleave useful and useless events.
  • sockets created by accept are independent, …
• Specifications must include both temporal and value-flow constraints.
• Only some of the API calls' arguments impose "true" dependences.
  • accept does not alter the state of the bound socket, …
• Specifications may impose only a partial order.
  • filling in fields of a structure before a call, …

95

Finding dependences

7 = socket(2, 1, 0);
bind(7, 0x400120, 16);
listen(7, 5);
8 = accept(7, 0x400200, 0x400240);
read(8, 0x400320, 255);
write(8, 0x400320, 12);
read(8, 0x400320, 255);
write(8, 0x400320, 7);
close(8);
10 = accept(7, 0x400200, 0x400240);
read(10, 0x400320, 255);
write(10, 0x400320, 13);
close(10);
close(7);
……

Some args and return values are handles to data structures. Calls may
•write through the handle
•read through the handle
•read and write

Def-use dependences connect writers to readers

Trace abstraction

Raw trace:
h(3, 5) c(10) a(4, 5) d(4, 7) b(0, 5) f(10) h(8, 11) e(7) f(50) d(15, 1) c(7) a(9, 11) b(6, 7) d(9, 14) f(20) e(7) …

Trace with dependence-free arguments blanked:
h(_, 5) c(10) a(4, 5) d(4, 7) b(_, 5) f(10) h(_, 11) e(7) f(_) d(_, _) c(7) a(9, 11) b(_, 11) d(9, _) e(_) f(_) …

Extracted scenarios (dependence graphs over the calls h, a, b, d, e) and their strings:
h(_, X) a(Y, X) b(_, X) d(Y, Z) e(Z)
h(_, X) a(Y, X) b(_, X) d(Y, Z)

97

The output PFSA

[Figure: output PFSA with the path h(_, X), a(Y, X), b(_, X), d(Y, Z), e(Z) carrying edge weights 2, 2, 2, 1, 1, plus a second d(Y, Z) edge of weight 1]

98

Renaming and reordering the chop

outline of the algorithm

input: a chop (a dag of data dependences)
output: the canonical chop

1. reorder: list all possible chop schedules
   • trick: only list those with calls in lexicographic order
2. rename: abstract arguments in each schedule
3. select lexicographically least schedule

lexicographic order:
a(…) b(…) < b(…) b(…)
a(X) b(…) < a(Y) b(…)
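The three steps can be sketched by brute force: enumerate every schedule that respects the dependences, rename values in order of first appearance, and keep the least result. This is an illustration, not the authors' algorithm: it skips the "only list schedules in lexicographic order" trick, uses generated names V0, V1, ... instead of X, Y, Z, and `canonical` and its argument format are hypothetical.

```python
from itertools import permutations

def canonical(calls, deps):
    """Return the lexicographically least renamed schedule of a chop.

    calls: list of (name, tuple of concrete argument values)
    deps:  set of (i, j) pairs meaning call i must precede call j
    """
    best = None
    for order in permutations(range(len(calls))):
        position = {call: pos for pos, call in enumerate(order)}
        if any(position[i] > position[j] for i, j in deps):
            continue  # this schedule violates a dependence
        names = {}    # concrete value -> abstract name, by first appearance
        schedule = []
        for call in order:
            name, args = calls[call]
            abstract = tuple(names.setdefault(v, "V%d" % len(names))
                             for v in args)
            schedule.append((name, abstract))
        if best is None or schedule < best:
            best = schedule
    return best

# write and read both depend on accept but not on each other.
CALLS = [("socket", (7,)), ("accept", (7, 8)), ("write", (8,)), ("read", (8,))]
DEPS = {(0, 1), (1, 2), (1, 3)}

for call in canonical(CALLS, DEPS):
    print(call)
```

Because write and read are unordered in the dependence graph, both schedules are legal, and the lexicographic rule picks the one with read first. Running the same scenario with different concrete socket numbers gives the same canonical form.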

99

Checking: the meaning of the spec

[Figure: specification automaton a(x) → b(x) → seed(x) → c(x)]

means: whenever seed(x) is executed, it must be preceded by a(x), b(x) and followed by c(x).

does not mean: a(x) must be followed by b(x), seed(x), c(x) (because a is not a seed).

100

Dynamic checking

• Used in our experiments

• checker mirrors the learner:

specification(NFA)

dynamic checker

for each seed in the trace
    extract a chop
    if some substring from chop in NFA
        seed verified!
    else
        extract a larger chop (up to a bound)
        fail if no chop verifies

[Figure: the specification (NFA) and the dynamic execution to be checked (trace) feed the dynamic checker, which reports OK/bug]

101

Static checking

Conversion to a “checkable” specification:

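[Figure: the specification a(x) → b(x) → seed(x) → c(x) is converted into a checking automaton with outcomes "OK" and "bug!"; its edge labels include seed(x), ^b(x), ^seed(x), and ^c(x) | end]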

102

Related work

Arithmetic pre/post conditions
• Daikon, Houdini
• properties orthogonal to ours
• eventually, we may need to include and learn some arithmetic relationships

Temporal relationships over calls
• intrusion detection: [Ghosh et al], [Wagner and Dean]
• software processes: [Cook and Wolf]
• error checking: [Engler et al SOSP 2001]
  • lexical and syntactic pattern matching
  • user must write templates (e.g., <a> always follows <b>)

106

Summary

• Semi-automatically formulating well-formed, non-trivial specifications is an important part of the verification tool chain.
• Contributions:
  • introduced specifications mining
  • phrased it as probabilistic learning from dynamic traces
  • decomposed it into a sequence of subproblems (using an off-the-shelf learner)
  • developed dynamic checker
  • found bugs

107

The supply/demand pyramids

[Figure: two pyramids. Skill (supply): LTL, C, C++, Java, Visual Basic, javascript/html/XML. Effort (demand) in s/w development: requirements, analysis, verification and testing]