Static Analysis of String Encoders and Decoders

Presented By:

Loris D’Antoni

Joint work with:

Margus Veanes

2

Motivation

• String encoders and decoders are
  – Ubiquitous: they transform Unicode text files on the Internet into in-memory representations of text
  – Hard to write: they use unintuitive logic in order to be efficient
  – Hard to verify: the state space is big and alphabets are very big (2^16 elements). Previous techniques blow up even for small decoders.

3

A simple example: BASE64 encoder

3 bytes → 4 Base64 characters

• Decoder similar (every 4 characters encode 3 bytes)
• Uses bit manipulations to be efficient
• How do we model it and prove it correct?

Text content     M       a       n
Bytes            77      97      110
Bit pattern      010011  010110  000101  101110
Index            19      22      5       46
Base64 encoded   T       W       F       u
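The table can be reproduced with a few shifts and masks. The sketch below is illustrative only: the names `ALPHABET` and `encode_triple` are assumptions introduced here, and the alphabet is the standard one from RFC 4648 rather than anything taken from the slides.

```python
import base64

# Standard Base64 alphabet (RFC 4648); an assumption, the slide does not spell it out.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def encode_triple(b1: int, b2: int, b3: int) -> str:
    """Encode exactly three bytes as four Base64 characters."""
    indices = [
        b1 >> 2,                        # top 6 bits of byte 1
        ((b1 & 3) << 4) | (b2 >> 4),    # low 2 bits of byte 1, top 4 bits of byte 2
        ((b2 & 0xF) << 2) | (b3 >> 6),  # low 4 bits of byte 2, top 2 bits of byte 3
        b3 & 0x3F,                      # low 6 bits of byte 3
    ]
    return "".join(ALPHABET[i] for i in indices)


encoded = encode_triple(*b"Man")  # bytes 77, 97, 110
print(encoded)                    # prints "TWFu"
```

The four index expressions are the same ones that reappear later in the deck as transducer outputs.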

4

What Properties do we check?

Encoder, Decoder denoted by E, D
• E o D = I
• D o E = I
• dom(E) = bytes
• dom(D) = Base64 bytes

• We need
  – Equivalence checking
  – Function composition (our model should be closed under composition)
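For contrast with the static approach, the round-trip property can be tested dynamically on short inputs. This is a sketch under assumptions: Python's standard `base64` module stands in for E and D, and `roundtrip_holds` is a hypothetical helper. It is a test over a finite sample, not a proof over all strings, which is exactly the gap equivalence checking is meant to close.

```python
import base64
from itertools import product


def encode(data: bytes) -> bytes:
    """Stand-in for the encoder: the standard library's Base64 encoder."""
    return base64.b64encode(data)


def decode(text: bytes) -> bytes:
    """Stand-in for the decoder; validate=True rejects non-alphabet characters."""
    return base64.b64decode(text, validate=True)


def roundtrip_holds(max_len: int = 2) -> bool:
    """Check decode(encode(x)) == x for every byte string of length <= max_len."""
    for n in range(max_len + 1):
        for combo in product(range(256), repeat=n):
            data = bytes(combo)
            if decode(encode(data)) != data:
                return False
    return True


print(roundtrip_holds())  # prints True
```

Even this small bound already enumerates 65,793 inputs; lengthening it by one byte multiplies the work by 256.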

5

Bek code

program base64encode(input){
  return iter(x in input)[q:=0;r:=0;]{
    case (x>0xFF): raise InvalidCharacter;
    case (q==0): yield (base64(x>>2)); q:=1; r:=(x&3)<<4;
    case (q==1): yield (base64(r|(x>>4))); q:=2; r:=(x&0xF)<<2;
    case (q==2): yield (base64((r|(x>>6))), base64(x&0x3F)); q:=0; r:=0;
    end case (q==1): yield (base64(r),'=','=');
    end case (q==2): yield (base64(r),'=');
  };
}

How do we analyze this code?
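To make the control flow concrete, here is a line-by-line Python transliteration of the Bek program. It is a sketch, not generated Bek output: the helper `base64_char` and the use of the RFC 4648 alphabet for Bek's `base64(...)` are assumptions, and `InvalidCharacter` is modelled as a `ValueError`.

```python
# Assumed meaning of Bek's base64(i): the i-th character of the standard alphabet.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def base64_char(i: int) -> str:
    return ALPHABET[i]


def base64encode(text: str) -> str:
    out = []
    q, r = 0, 0                      # q: position within the 3-byte group, r: leftover bits
    for ch in text:
        x = ord(ch)
        if x > 0xFF:
            raise ValueError("InvalidCharacter")
        if q == 0:
            out.append(base64_char(x >> 2))
            q, r = 1, (x & 3) << 4
        elif q == 1:
            out.append(base64_char(r | (x >> 4)))
            q, r = 2, (x & 0xF) << 2
        else:                        # q == 2
            out.append(base64_char(r | (x >> 6)))
            out.append(base64_char(x & 0x3F))
            q, r = 0, 0
    # "end case" branches: flush the leftover bits and pad.
    if q == 1:
        out += [base64_char(r), "=", "="]
    elif q == 2:
        out += [base64_char(r), "="]
    return "".join(out)


print(base64encode("Man"), base64encode("Ma"), base64encode("M"))
```

The pair (q, r) is what the later slides model as a control state plus a register.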

6

Trust me! It is tricky!

[12/12/12 11:35:49 PM] Margus Veanes: I think it is doable, smth that is like ([A-Z2-7]{4}... )*
[12/12/12 11:35:57 PM] Loris D'Antoni: ok ill try
[12/12/12 11:36:22 PM] Margus Veanes: then you can try to see the difference compared to the domain of the decoder
[12/12/12 11:37:42 PM] Loris D'Antoni: it seems that also on this counterex it doesn't work
[12/12/12 11:37:43 PM] Loris D'Antoni: DP2A====
[12/12/12 11:37:50 PM] Loris D'Antoni: which maybe it's a bad one in this sense
[12/12/12 11:37:52 PM] Loris D'Antoni: ill check now
[12/12/12 11:40:45 PM] Margus Veanes: actually the domain of the decoder looks wrong, it allows 8 and 9
[12/12/12 11:40:46 PM] Margus Veanes: http://www.rise4fun.com/Bek/Cy3
[12/12/12 11:40:58 PM] Loris D'Antoni: yeh i fixed that in my version

…COUPLE OF HACKS LATER…

[12/13/12 12:24:02 AM] Loris D'Antoni: ok, found bug and fixed it, now proved them correct. Will work on others tomorrow. Was very silly but hard to spot
[12/13/12 12:24:35 AM] Margus Veanes: ... this is why the analysis we can do is useful :-)
[12/13/12 12:24:45 AM] Loris D'Antoni: yeh i was mapping
[12/13/12 12:24:46 AM] Loris D'Antoni: 26..31 ==> 2..7
[12/13/12 12:24:58 AM] Loris D'Antoni: instead of 26..31 ==> '2'..'7'
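The bug at the end of the chat is easy to picture in code. The sketch below is a hypothetical reconstruction, not the authors' program: it assumes the chat concerns a Base32 encoder (the pattern [A-Z2-7] suggests so), and both function names are made up for illustration.

```python
def base32_char_buggy(i: int) -> str:
    """Maps 26..31 to the code points 2..7 (control characters): the bug."""
    if i < 26:
        return chr(ord("A") + i)
    return chr(i - 24)


def base32_char_fixed(i: int) -> str:
    """Maps 26..31 to the digit characters '2'..'7', as intended."""
    if i < 26:
        return chr(ord("A") + i)
    return chr(ord("2") + (i - 26))


print(repr(base32_char_buggy(26)), repr(base32_char_fixed(26)))
```

Both versions agree on 26 of the 32 indices, so a handful of manual tests can easily miss the difference. A domain or round-trip check over all inputs cannot.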

Brief DEMO

7

Attempt 1: Finite Transducers

[Figure: fragment of a finite transducer with states M and Ma; edges M / [], a / [], n / [TWFu]; other states and edges elided]

• Finite set of states
• Each transition reads an input symbol and outputs a sequence of symbols
• Mapping from strings into strings
• Blue state (final), for which the mapping is defined

2^8 edges out of every state and 2^16 states

Decidable equivalence and closure under composition

8

Attempt 2: Symbolic Finite Transducers [POPL12]

[Figure: fragment of a symbolic finite transducer with states M and Ma; edges λx. x==‘M’ / [λx. x>>2] and λx. x==‘a’ / [λx. x>>4,…]; other states and edges elided]

• Guards are predicates over any decidable theory instead of single characters
• Output is a function of the input
• In this case uses theory of bit-vectors
• Better reflects implementation operations
• Analysis is still decidable (equivalence, composition)
• We did not improve much: still state explosion

Supports symbolic updates such as bit-vectors

9

Attempt 3: Symbolic Transducers [POPL12]

[Figure: symbolic transducer with states 0, 1, 2 and a register r]
  0 → 1: True / [x>>2], r := (x&3)<<4
  1 → 2: True / [r|(x>>4)], r := (x&0xF)<<2
  2 → 0: True / [r|(x>>6), x&0x3F], r := 0

• Register can store values and is updated in transitions
• Inputs and outputs can inspect and use register value
• Logic is the same as for implementation!!
• No state explosion!!
• Closed under sequential composition
• Analysis (equivalence) is undecidable in general…
• We need a way to eliminate the registers
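The three-state transducer with one register can be written down as plain data and interpreted. This is a minimal sketch under assumptions: the `Transition` tuple layout and the `run` function are invented for illustration, the source and target states follow the q-updates in the Bek program, and the final padding rules are left out because the slide does not show them.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# state -> (guard, output function, register update, next state); layout is an assumption.
Transition = Tuple[
    Callable[[int, int], bool],
    Callable[[int, int], List[int]],
    Callable[[int, int], int],
    int,
]

ST: Dict[int, Transition] = {
    0: (lambda x, r: True, lambda x, r: [x >> 2], lambda x, r: (x & 3) << 4, 1),
    1: (lambda x, r: True, lambda x, r: [r | (x >> 4)], lambda x, r: (x & 0xF) << 2, 2),
    2: (lambda x, r: True, lambda x, r: [r | (x >> 6), x & 0x3F], lambda x, r: 0, 0),
}


def run(st: Dict[int, Transition], inputs: Sequence[int]) -> Tuple[List[int], int, int]:
    """Interpret the transducer; returns (outputs, final state, final register)."""
    state, reg = 0, 0
    out: List[int] = []
    for x in inputs:
        guard, output, update, nxt = st[state]
        if not guard(x, reg):
            raise ValueError(f"no transition from state {state} on input {x}")
        out.extend(output(x, reg))       # output sees the old register value
        reg = update(x, reg)
        state = nxt
    return out, state, reg


print(run(ST, b"Man"))  # prints ([19, 22, 5, 46], 0, 0)
```

The outputs are 6-bit indices rather than characters, matching the slide, which omits the final `base64(...)` lookup.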

10

Register Elimination: the naïve way

ST (states 0, 1, 2):
  0 → 1: x / [x>>2], r := (x&3)<<4
  1 → 2: x / [r|(x>>4)], r := (x&0xF)<<2
  2 → 0: x / [r|(x>>6), x&0x3F], r := 0

SFT obtained by enumeration (fragment with states M and Ma; other states and edges elided):
  M / [M>>2]
  a / [((M&3)<<4)|(a>>4)]
  n / [((a&0xF)<<2)|(n>>6), n&0x3F]

Via enumeration: state explosion, but automatic

Can do analysis, but very slow…

Doesn’t work if the alphabet is infinite: a waste of symbolic analysis

We need a better model

11

A simple example: BASE64

3 bytes → 4 Base64 characters

• Decoder similar (every 4 characters encode 3 bytes)
• Uses bit manipulations to be efficient
• How do we model it and prove it correct?

Text content     M       a       n
Bytes            77      97      110
Bit pattern      010011  010110  000101  101110
Index            19      22      5       46
Base64 encoded   T       W       F       u

12

Extended Symbolic Finite Transducers

State 0, transition:
  [x1,x2,x3] / [x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F]

• No state explosion
• Analysis can be done for several interesting cases (in particular for encoders)
• But, how do we pass from STs to ESFTs?

Read sequences of symbols

Output is a function of all the 3 symbols
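A single ESFT transition that reads three symbols at once can be sketched as an ordinary function over a 3-symbol window. The names `esft_rule` and `run_esft` are assumptions, and the sketch only covers inputs whose length is a multiple of 3, since the slide shows no rule for leftover symbols.

```python
from typing import List, Sequence


def esft_rule(x1: int, x2: int, x3: int) -> List[int]:
    """The single transition on the slide: three inputs in, four outputs out, no register."""
    return [
        x1 >> 2,
        ((x1 & 3) << 4) | (x2 >> 4),
        ((x2 & 0xF) << 2) | (x3 >> 6),
        x3 & 0x3F,
    ]


def run_esft(inputs: Sequence[int]) -> List[int]:
    """Apply the rule to consecutive 3-symbol windows of the input."""
    if len(inputs) % 3 != 0:
        raise ValueError("this sketch only handles lengths that are a multiple of 3")
    out: List[int] = []
    for i in range(0, len(inputs), 3):
        out.extend(esft_rule(*inputs[i:i + 3]))
    return out


print(run_esft(b"ManMan"))  # prints [19, 22, 5, 46, 19, 22, 5, 46]
```

Because the rule has no register and only one control state, every output is a closed-form expression over the window it reads.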

13

Register Elimination: the good way 1/2

ST (states 0, 1, 2):
  0 → 1: x / [x>>2], r := (x&3)<<4
  1 → 2: x / [r|(x>>4)], r := (x&0xF)<<2
  2 → 0: x / [r|(x>>6), x&0x3F], r := 0

Toward an ESFT, with the last two transitions merged into one that reads two symbols:
  x / [x>>2], r := (x&3)<<4
  [x1,x2] / [r|(x1>>4), ((x1&0xF)<<2)|(x2>>6), x2&0x3F], r := 0

14

Register Elimination: the good way 2/2

Before the final step:
  x / [x>>2], r := (x&3)<<4
  [x1,x2] / [r|(x1>>4), ((x1&0xF)<<2)|(x2>>6), x2&0x3F], r := 0

Resulting ESFT (state 0):
  [x1,x2,x3] / [x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F]

Fast and supports infinite alphabets

Not always possible, but works for encoders/decoders
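The two merging steps can be sanity-checked by running the original transducer step by step and comparing it with the merged rules. This is a sketch under assumptions: all function names are hypothetical, the two-symbol rule is checked over every value the register can hold in state 1, and the three-symbol rule is checked on a strided sample. The actual algorithm does this substitution symbolically.

```python
from itertools import product
from typing import List, Tuple


def st_step(state: int, r: int, x: int) -> Tuple[List[int], int, int]:
    """One step of the three-state ST; returns (outputs, next state, next register)."""
    if state == 0:
        return [x >> 2], 1, (x & 3) << 4
    if state == 1:
        return [r | (x >> 4)], 2, (x & 0xF) << 2
    return [r | (x >> 6), x & 0x3F], 0, 0


def merged_two(r: int, x1: int, x2: int) -> List[int]:
    """Transitions 1 -> 2 and 2 -> 0 merged into one two-symbol transition."""
    return [r | (x1 >> 4), ((x1 & 0xF) << 2) | (x2 >> 6), x2 & 0x3F]


def merged_three(x1: int, x2: int, x3: int) -> List[int]:
    """All three transitions merged; the register has been substituted away."""
    return [x1 >> 2, ((x1 & 3) << 4) | (x2 >> 4), ((x2 & 0xF) << 2) | (x3 >> 6), x3 & 0x3F]


def two_step_agrees() -> bool:
    reachable_r = {(x & 3) << 4 for x in range(256)}   # values r can hold in state 1
    for r, x1, x2 in product(reachable_r, range(256), range(256)):
        out1, s, r1 = st_step(1, r, x1)
        out2, s, r2 = st_step(s, r1, x2)
        if out1 + out2 != merged_two(r, x1, x2) or (s, r2) != (0, 0):
            return False
    return True


def three_step_agrees(stride: int = 5) -> bool:
    sample = range(0, 256, stride)                     # strided sample, not exhaustive
    for x1, x2, x3 in product(sample, repeat=3):
        out0, s, r = st_step(0, 0, x1)
        if out0 + merged_two(r, x2, x3) != merged_three(x1, x2, x3):
            return False
    return True


print(two_step_agrees(), three_step_agrees())
```

Merging works here because the register is reset to 0 on the way back to state 0, so after three symbols no information is carried forward.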

15

Composition of ESFTs

ESFT E, ESFT D (ESFTs are not closed under composition in general)
  → ST E’, ST D’ (use of registers to remember values)
  → ST E’oD’ (uses ST closure under composition)
  → ESFT EoD (register elimination)

16

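Equivalence Semi-Decision Procedure

First we check equivalence on the domain intersection (hard), then we check domain equivalence (easier in this case).

[Figure: two transducers and their product; state labels 0 and 1, product state 0,1]
  First transducer: (λ(x1,x2).True) / [x1,x2]
  Second transducer: λ(x).True / [x]
  Product: λ(x1,x2).True / ([x1,x2],[x1,x2])

We build a product transducer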

17

Unicode Case Study

• We analyzed UTF8 to UTF16 encoder (E) and decoder (D)

Test                             Running time
Dom(E) = UTF16                   47 ms
Dom(EoD) = UTF16                 109 ms
Dom(D) = UTF8                    156 ms
Dom(DoE) = UTF8                  320 ms
EoD = Identity (naive)           82,000 ms
DoE = Identity (naive)           134,000 ms
EoD = Identity (new algorithm)   123 ms
DoE = Identity (new algorithm)   215 ms

Complete analysis in less than a second

18

Result Summary

• ESFTs: a new transducer model for representing encoders and decoders

• A new register elimination algorithm from STs to ESFTs, independent of the input alphabet

• Correctness analysis of real programs: Unicode, Base64 encoders and decoders

• Automatic JavaScript code generation of the verified code

• Check it out: http://rise4fun.com/Bek/

• Transducers are cool!!

19

Future Work

• Understand the theory of ESFTs (coming soon)
  – Composition closure, equivalence…

• Extend the model to tree transformations
  – Widely used in NLP

• Analyze more complex scenarios
  – List manipulating programs

20

Thank you

Loris D’Antoni

Questions?